156 Commits

Author SHA1 Message Date
Kevin Hu
ed5f81b02e
Fix: abnormal cell mergeing. (#6991)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-04-14 11:00:11 +08:00
gsmini
53c653b099
fix RAGFlowPdfParser AttributeError: 'PdfReader' object has no attribute 'close' err (#6859)
i use PdfParser in local(refer to this case:
https://github.com/infiniflow/ragflow/blob/main/rag/app/paper.py) like
this:
```
import re
import openpyxl

from ragflow.api.db import ParserType
from ragflow.rag.nlp import rag_tokenizer, tokenize, tokenize_table, add_positions, bullets_category, \
    title_frequency, \
    tokenize_chunks
from ragflow.rag.utils import num_tokens_from_string
from ragflow.deepdoc.parser import PdfParser, ExcelParser, DocxParser,PlainParser


def logger(prog=None, msg=""):
    print(msg)


class Pdf(PdfParser):
    def __init__(self):
        self.model_speciess = ParserType.MANUAL.value
        super().__init__()

    def __call__(self, filename, binary=None, from_page=0,
                 to_page=100000, zoomin=3, callback=None):
        from timeit import default_timer as timer
        start = timer()
        callback(msg="OCR is running...")

        self.__images__(
            filename if not binary else binary,
            zoomin,
            from_page,
            to_page,
            callback
        )
        callback(msg="OCR finished.")
        print("OCR:", timer() - start)
   
        self._layouts_rec(zoomin)
        callback(0.65, "Layout analysis finished.")
        print("layouts:", timer() - start)

        self._table_transformer_job(zoomin)
        callback(0.67, "Table analysis finished.")


        self._text_merge()
        tbls = self._extract_table_figure(True, zoomin, True, True)
        self._concat_downward()  
        self._filter_forpages()   
        callback(0.68, "Text merging finished")

        # clean mess
        for b in self.boxes:
            b["text"] = re.sub(r"([\t  ]|\u3000){2,}", " ", b["text"].strip())

        return [(b["text"], b.get("layout_no", ""), self.get_position(b, zoomin))
                for i, b in enumerate(self.boxes)], tbls


```

show err like this:
```
  File "xxxxx/third_party/ragflow/deepdoc/parser/pdf_parser.py", line 1039, in __images__
    self.pdf.close()
AttributeError: 'PdfReader' object has no attribute 'close'
```

i found ragflow source code use
`pdfplumber.open`(https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py#L1007C28-L1007C43)

and replace` self.pdf `with ` pdf2_read` (from pypdf import PdfReader as
pdf2_read)in line 1024
(https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py#L1024)
```
self.pdf = pdf2_read
```


---
and I found that `pdfplumber` can be used in this way:
```
file_path="xxx.pdf"
res = pdfplumber.open(file_path)
res.close()
```

but `pypdf.PdfReader` source code do not has `close` func, source code
use like this

```
 with open(stream, "rb") as fh:
         stream = BytesIO(fh.read())
          self._stream_opened = True
```
> https://github.com/py-pdf/pypdf/blob/main/pypdf/_reader.py#L156

so I moved the `self.pdf.close` function call and fixed this problem
hoping to help the project😊
2025-04-14 09:40:13 +08:00
Kevin Hu
3bb1e012e6
Fix: assistant deleteion issue. (#6906)
### What problem does this PR solve?

#6875

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-04-09 20:29:40 +08:00
Kevin Hu
2caf15b24c
Refa: trival. (#6802)
### What problem does this PR solve?


### Type of change


- [x] Refactoring
2025-04-03 19:01:24 +08:00
donblack01
0b48a2e0d1
Fix: When Excel is a formula, the parsed result is a formula, but cannot be correctly parsed as a value type (#6613)
### What problem does this PR solve?

Fix: When Excel is a formula, the parsed result is a formula, but cannot
be correctly parsed as a value type

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: tangyu <1@1.com>
2025-03-28 09:33:49 +08:00
Stephen Hu
d77380f024
Feat: support pic base bullet for PPT (#6406)
### What problem does this PR solve?

support pic base bullet for PPT

modify one mistake in document

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-03-24 09:31:31 +08:00
Yongteng Lei
9611185eb4
Feat: add VLM-boosted DocX parser (#6307)
### What problem does this PR solve?

Add VLM-boosted DocX parser

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-03-20 11:24:44 +08:00
Yongteng Lei
1d6760dd84
Feat: add VLM-boosted PDF parser (#6278)
### What problem does this PR solve?

Add VLM-boosted PDF parser if VLM is set.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-03-20 09:39:32 +08:00
Yongteng Lei
5cf610af40
Feat: add vision LLM PDF parser (#6173)
### What problem does this PR solve?

Add vision LLM PDF parser

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-03-18 14:52:20 +08:00
Stephen Hu
b0b4b7ba33
Feat: Improve Recognizer.py performance (#6185)
### What problem does this PR solve?

For the create_inputs method based on np operation to replace for loop

### Type of change

- [x] Performance Improvement
2025-03-18 09:39:49 +08:00
Stephen Hu
79482ff672
Refa: Improve ppt_parser better handle list (#6162)
### What problem does this PR solve?
This pull request (PR) incorporates codes for parsing PPTX files, aiming
to more precisely depict text in list formats (hint list by .).

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2025-03-17 17:02:39 +08:00
Kevin Hu
3a99c2b5f4
Refa: PARALLEL_DEVICES is a static parameter. (#6168)
### What problem does this PR solve?


### Type of change

- [x] Refactoring
2025-03-17 16:49:54 +08:00
Kevin Hu
bfa8d342b3
Fix: retrieval debug mode issue. (#6150)
### What problem does this PR solve?

#6139

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-17 13:07:13 +08:00
Debug Doctor
3e19044dee
Feat: add OCR's muti-gpus and parallel processing support (#5972)
### What problem does this PR solve?

Add OCR's muti-gpus and parallel processing support

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

@yuzhichang I've tried to resolve the comments in #5697. OCR jobs can
now be done on both CPU and GPU. ( By the way, I've encountered a
“Generate embedding error” issue #5954 that might be due to my outdated
GPUs? idk. ) Please review it and give me suggestions.

GPU:

![gpu_ocr](https://github.com/user-attachments/assets/0ee2ecfb-a665-4e50-8bc7-15941b9cd80e)

![smi](https://github.com/user-attachments/assets/a2312f8c-cf24-443d-bf89-bec50503546d)

CPU:

![cpu_ocr](https://github.com/user-attachments/assets/1ba6bb0b-94df-41ea-be79-790096da4bf1)
2025-03-17 11:58:40 +08:00
Yongteng Lei
4ff609b6a8
Fix: optimize OCR garbage identification to reduce unnecessary filtering (#6027)
### What problem does this PR solve?

Optimize OCR garbage identification to reduce unnecessary filtering.
#5713

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-13 18:48:32 +08:00
Yongteng Lei
7cd37c37cd
Feat: add CSV file parsing support (#5989)
### What problem does this PR solve?

Add CSV file parsing support #4552, #5849, #5870

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-03-12 19:20:50 +08:00
donblack01
b1a46d5adc Fix:when start with source code not in docker env report 'UnicodeDec… (#5802)
### What problem does this PR solve?

fix:when start with  source code not in docker env report
"UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 5:
illegal multibyte sequence" in windows

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: tangyu <1@1.com>
2025-03-10 11:22:06 +08:00
liwenju0
5b0e38060a
Feat:Optimize the table extraction logic in the Markdown parser: (#5663)
Enhance the recognition of both borderless and bordered Markdown tables.
Add support for extracting HTML tables, including various scenarios with
nested HTML tags. Improve performance by using conditional checks to
reduce unnecessary regular expression matching.

### What problem does this PR solve?

Optimize the table extraction logic in the Markdown parser:
Enhance the recognition of both borderless and bordered Markdown tables.
Add support for extracting HTML tables, including various scenarios with
nested HTML tags.
Improve performance by using conditional checks to reduce unnecessary
regular expression matching.

### Type of change

- [x] Performance Improvement

Co-authored-by: wenju.li <wenju.li@deepctr.cn>
2025-03-07 17:02:35 +08:00
Kevin Hu
8fb8374dfc
Fix: delimiter issue. (#5720)
### What problem does this PR solve?

#5704

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-06 17:51:22 +08:00
yihong
4326873af6
refactor: no need to inherit in python3 clean the code (#5659)
### What problem does this PR solve?

As title

### Type of change


- [x] Refactoring

Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-03-05 18:03:53 +08:00
非法操作
ca04ae9540
Minor: improve doc and rm unused file (#5634)
### What problem does this PR solve?

The `ocr.res` file is already included in the model directory
`rag/res/deepdoc`, but it doesn't seem to be utilized here.

### Type of change

- [x] Documentation Update
2025-03-05 12:59:54 +08:00
hy89
b0c21b00d9
Refactor: Optimize error handling and support parsing of XLS(EXCEL97—2003) files. (#5633)
Optimize error handling and support parsing of XLS(EXCEL97—2003) files.
2025-03-05 11:55:27 +08:00
Zhichang Yu
c813c1ff4c
Made task_executor async to speedup parsing (#5530)
### What problem does this PR solve?

Made task_executor async to speedup parsing

### Type of change

- [x] Performance Improvement
2025-03-03 18:59:49 +08:00
yihong
8a2542157f
Fix: possible memory leaks close #5277 (#5500)
### What problem does this PR solve?

close #5277 by make sure the file close

### Type of change

- [x] Performance Improvement

---------

Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-03-03 10:26:45 +08:00
yihong
37aacb3960
Refa: drop useless fasttext (#5470)
### What problem does this PR solve?

This patch drop useless fastext which is seems useless in the code base 
and its very kind of hard install
should close #4498


### Type of change

- [x] Refactoring

Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-02-28 14:30:56 +08:00
Yongteng Lei
83d0949498
Fix: fix special delimiter parsing issue (#5448)
### What problem does this PR solve?

Fix special delimiter parsing issue #5382 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-27 18:33:55 +08:00
Zhichang Yu
db42d0e0ae
Optimize ocr (#5297)
### What problem does this PR solve?

Introduced OCR.recognize_batch

### Type of change

- [x] Performance Improvement
2025-02-24 16:21:55 +08:00
Zhichang Yu
0151d42156
Reuse loaded modules if possible (#5231)
### What problem does this PR solve?

Reuse loaded modules if possible

### Type of change

- [x] Refactoring
2025-02-21 17:21:01 +08:00
Zhichang Yu
c326f14fed
Optimized Recognizer.sort_X_firstly and Recognizer.sort_Y_firstly (#5182)
### What problem does this PR solve?

Optimized Recognizer.sort_X_firstly and Recognizer.sort_Y_firstly

### Type of change

- [x] Performance Improvement
2025-02-20 15:41:12 +08:00
Kevin Hu
b08bb56f6c
Display thinking for deepseek r1 (#4904)
### What problem does this PR solve?
#4903
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-02-12 15:43:13 +08:00
Mathias Panzenböck
6b389e01b5
Remove use of eval() from operators.py (#4888)
Use `np.float32()` instead.

### What problem does this PR solve?

Using `eval()` can lead to code injections.

I think `eval()` is only used to parse a floating point number here.
This change preserves the correct behavior if the string `"None"` is
supplied. But if that behavior isn't intended then this part could be
just deleted instead, since `np.float32()` is parsing strings anyway:

```Python
        if isinstance(scale, str):
            scale = eval(scale)
```

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-12 12:53:42 +08:00
SkyfireWXY
8fcca1b958
fix: big xls file error (#4859)
### What problem does this PR solve?

if *.xls file is too large, .eg >50M, I get error.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-12 12:39:25 +08:00
Zhichang Yu
3411d0a2ce
Added cuda_is_available (#4725)
### What problem does this PR solve?

Added cuda_is_available

### Type of change

- [x] Refactoring
2025-02-05 18:01:23 +08:00
Zhichang Yu
e1526846da
Fixed GPU detection on CPU only environment (#4711)
### What problem does this PR solve?

Fixed GPU detection on CPU only environment. Close #4692

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-05 12:02:43 +08:00
Kevin Hu
6f30397bb5
Infinity adapt to graphrag. (#4663)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-27 18:35:18 +08:00
Kevin Hu
1bff6b7333
Fix t_ocr.py for PNG image. (#4625)
### What problem does this PR solve?
#4586

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-24 11:47:27 +08:00
Zhichang Yu
4230402fbb
deepdoc use GPU if possible (#4618)
### What problem does this PR solve?

deepdoc use GPU if possible

### Type of change

- [x] Refactoring
2025-01-24 09:48:02 +08:00
Mathias Panzenböck
1a367664f1
Remove usage of eval() from postprocess.py (#4571)
Remove usage of `eval()` from postprocess.py

### What problem does this PR solve?

The use of `eval()` is a potential security risk. While the use of
`eval()` is guarded and thus not a security risk normally, `assert`s
aren't run if `-O` or `-OO` is passed to the interpreter, and as such
then the guard would not apply. In any case there is no reason to use
`eval()` here at all.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Other (please describe):

Potential security fix if somehow the passed `modul_name` could be user
controlled.
2025-01-22 19:37:24 +08:00
Jin Hai
3894de895b
Update comments (#4569)
### What problem does this PR solve?

Add license statement.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-01-21 20:52:28 +08:00
Mathias Panzenböck
75e1981e13
Remove use of eval() from recognizer.py (#4480)
`eval(op_type)` -> `getattr(operators, op_type)`

### What problem does this PR solve?

Using `eval()` can lead to code injections and is entirely unnecessary
here.

### Type of change

- [x] Other (please describe):

Best practice code improvement, preventing the possibility of code
injection.
2025-01-20 09:52:47 +08:00
Mathias Panzenböck
4f9f9405b8
Remove use of eval() from ocr.py (#4481)
`eval(op_name)` -> `getattr(operators, op_name)`

### What problem does this PR solve?

Using `eval()` can lead to code injections and is entirely unnecessary
here.

### Type of change

- [x] Other (please describe):

Best practice code improvement, preventing the possibility of code
injection.
2025-01-20 09:52:30 +08:00
Kevin Hu
c852a6dfbf
Accelerate titles' embeddings. (#4492)
### What problem does this PR solve?


### Type of change

- [x] Performance Improvement
2025-01-15 15:20:29 +08:00
Kevin Hu
e478586a8e
Refactor. (#4487)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2025-01-15 14:06:46 +08:00
Zhi-Qiang You
b7ce4e7e62
fix:t_recognizer TypeError: 'super' object is not callable (#4404)
### What problem does this PR solve?

[Bug]: layout recognizer failed for wrong boxes class type #4230
(https://github.com/infiniflow/ragflow/issues/4230)

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: youzhiqiang <zhiqiang.you@aminer.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-01-08 10:59:35 +08:00
Kevin Hu
2e40c2a6f6
Fix t_recognizer issue. (#4387)
### What problem does this PR solve?

#4230

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-07 13:17:46 +08:00
Kevin Hu
983ec0666c
Fix param error. (#4355)
### What problem does this PR solve?

#4230

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-06 13:54:17 +08:00
Kevin Hu
59a78408be
Fix t_recognizer.py after model updating. (#4330)
### What problem does this PR solve?

#4230

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-02 17:00:11 +08:00
Kevin Hu
76cd23eecf
Catch the exception while parsing pptx. (#4202)
### What problem does this PR solve?
#4189

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-24 10:49:28 +08:00
Kevin Hu
2cbe064080
Add Llama3.3 (#4174)
### What problem does this PR solve?

#4168

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-12-23 11:18:01 +08:00
ly0303521
101b8ff813
fix chunk method "Table" losing content when the Excel file has multi… (#4123)
…ple sheets

### What problem does this PR solve?
discussed in https://github.com/infiniflow/ragflow/pull/4102
- In excel_parser.py, `total` means the total number of rows in Excel,
but it return in the first iterate, that lead to the wrong `to_page`
- In table.py, it when Excel file has multiple sheets, it will be
divided into multiple parts, every part size is 3000, `data` may be
empty, because it has recorded in the last iterate.
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-19 17:30:26 +08:00