129 Commits

Author SHA1 Message Date
giiiiiithub
6ba5a4348a
set PARALLEL_DEVICES default value= 0 (#7935)
### What problem does this PR solve?


it would be fail if PARALLEL_DEVICES = None in OCR class , because it
pass 0 to TextDetector and TextRecognizer init method.

and It would be simpler to set 0 as the default value for
PARALLEL_DEVICES.

### Type of change

- [x] Refactoring
2025-05-29 13:32:16 +08:00
Emmanuel Ferdman
d4a123d6dd
Fix: resolve regex library warnings (#7782)
### What problem does this PR solve?
This small PR resolves the regex library warnings showing in Python3.11:
```python
DeprecationWarning: 'count' is passed as positional argument
```

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-05-22 10:06:28 +08:00
Yongteng Lei
b908c33464
Fix: uncaptured image data with position information (#7683)
### What problem does this PR solve?

Fixed uncaptured figure data with position information. #7466, #7681

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-05-19 19:33:28 +08:00
liuzhenghua
ea5e8caa69
feat: Enable antialiasing for PDF image extraction to improve OCR accuracy (#7562)
### What problem does this PR solve?

When the PDF uses vector fonts, the rendered text in the captured page
image often has missing strokes, leading to numerous OCR errors and
incorrect characters. Similar issues also occur in the extracted chart
images.

**Before**

![0089e1f76205b5b3](https://github.com/user-attachments/assets/a84f8cd7-48ae-4da4-81ca-fc0bd93320f1)

**After**

![03053149e919773a](https://github.com/user-attachments/assets/45fa5ebb-a2de-42b1-9535-1ea087877eb2)

You can use the following document for testing.

[Casio说明书.pdf](https://github.com/user-attachments/files/20119690/Casio.pdf)


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>
2025-05-12 09:50:21 +08:00
Kevin Hu
a14865e6bb
Fix: empty query issue. (#7551)
### What problem does this PR solve?

#5214

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-05-09 12:20:19 +08:00
Kevin Hu
9d3dd13fef
Refa: text order be robuster. (#7525)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2025-05-08 12:58:10 +08:00
Stephen Hu
953b3e1b3f
Fix: Sometimes VisionFigureParser.figures may is tuple (#7477)
### What problem does this PR solve?
https://github.com/infiniflow/ragflow/issues/7466
I think due to some times we can not get position 

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2025-05-06 17:38:22 +08:00
liuzhenghua
2f768b96e8
perf: optimze figure parser (#7392)
### What problem does this PR solve?

When parsing documents containing images, the current code uses a
single-threaded approach to call the VL model, resulting in extremely
slow parsing speed (e.g., parsing a Word document with dozens of images
takes over 20 minutes).

By switching to a multithreaded approach to call the VL model, the
parsing speed can be improved to an acceptable level.

### Type of change

- [x] Performance Improvement

---------

Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>
2025-05-06 14:39:45 +08:00
zhudongwork
10432a1be7
Refa: Optimize pptx shape extraction to reduce content loss (#6703)
### What problem does this PR solve?

When parsing pptx files, some shapes do not contain the `shape_type`
attribute, which causes the original code to throw an exception during
extraction, leading to failure in content extraction. This optimization
introduces handling logic for such anomalous shapes, providing a safer
and more robust processing mechanism.

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [x] Performance Improvement
- [ ] Other (please describe):
2025-04-22 10:16:24 +08:00
gsmini
53c653b099
fix RAGFlowPdfParser AttributeError: 'PdfReader' object has no attribute 'close' err (#6859)
i use PdfParser in local(refer to this case:
https://github.com/infiniflow/ragflow/blob/main/rag/app/paper.py) like
this:
```
import re
import openpyxl

from ragflow.api.db import ParserType
from ragflow.rag.nlp import rag_tokenizer, tokenize, tokenize_table, add_positions, bullets_category, \
    title_frequency, \
    tokenize_chunks
from ragflow.rag.utils import num_tokens_from_string
from ragflow.deepdoc.parser import PdfParser, ExcelParser, DocxParser,PlainParser


def logger(prog=None, msg=""):
    print(msg)


class Pdf(PdfParser):
    def __init__(self):
        self.model_speciess = ParserType.MANUAL.value
        super().__init__()

    def __call__(self, filename, binary=None, from_page=0,
                 to_page=100000, zoomin=3, callback=None):
        from timeit import default_timer as timer
        start = timer()
        callback(msg="OCR is running...")

        self.__images__(
            filename if not binary else binary,
            zoomin,
            from_page,
            to_page,
            callback
        )
        callback(msg="OCR finished.")
        print("OCR:", timer() - start)
   
        self._layouts_rec(zoomin)
        callback(0.65, "Layout analysis finished.")
        print("layouts:", timer() - start)

        self._table_transformer_job(zoomin)
        callback(0.67, "Table analysis finished.")


        self._text_merge()
        tbls = self._extract_table_figure(True, zoomin, True, True)
        self._concat_downward()  
        self._filter_forpages()   
        callback(0.68, "Text merging finished")

        # clean mess
        for b in self.boxes:
            b["text"] = re.sub(r"([\t  ]|\u3000){2,}", " ", b["text"].strip())

        return [(b["text"], b.get("layout_no", ""), self.get_position(b, zoomin))
                for i, b in enumerate(self.boxes)], tbls


```

show err like this:
```
  File "xxxxx/third_party/ragflow/deepdoc/parser/pdf_parser.py", line 1039, in __images__
    self.pdf.close()
AttributeError: 'PdfReader' object has no attribute 'close'
```

i found ragflow source code use
`pdfplumber.open`(https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py#L1007C28-L1007C43)

and replace` self.pdf `with ` pdf2_read` (from pypdf import PdfReader as
pdf2_read)in line 1024
(https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py#L1024)
```
self.pdf = pdf2_read
```


---
and I found that `pdfplumber` can be used in this way:
```
file_path="xxx.pdf"
res = pdfplumber.open(file_path)
res.close()
```

but `pypdf.PdfReader` source code do not has `close` func, source code
use like this

```
 with open(stream, "rb") as fh:
         stream = BytesIO(fh.read())
          self._stream_opened = True
```
> https://github.com/py-pdf/pypdf/blob/main/pypdf/_reader.py#L156

so I moved the `self.pdf.close` function call and fixed this problem
hoping to help the project😊
2025-04-14 09:40:13 +08:00
donblack01
0b48a2e0d1
Fix: When Excel is a formula, the parsed result is a formula, but cannot be correctly parsed as a value type (#6613)
### What problem does this PR solve?

Fix: When Excel is a formula, the parsed result is a formula, but cannot
be correctly parsed as a value type

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: tangyu <1@1.com>
2025-03-28 09:33:49 +08:00
Stephen Hu
d77380f024
Feat: support pic base bullet for PPT (#6406)
### What problem does this PR solve?

support pic base bullet for PPT

modify one mistake in document

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-03-24 09:31:31 +08:00
Yongteng Lei
9611185eb4
Feat: add VLM-boosted DocX parser (#6307)
### What problem does this PR solve?

Add VLM-boosted DocX parser

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-03-20 11:24:44 +08:00
Yongteng Lei
1d6760dd84
Feat: add VLM-boosted PDF parser (#6278)
### What problem does this PR solve?

Add VLM-boosted PDF parser if VLM is set.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-03-20 09:39:32 +08:00
Yongteng Lei
5cf610af40
Feat: add vision LLM PDF parser (#6173)
### What problem does this PR solve?

Add vision LLM PDF parser

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-03-18 14:52:20 +08:00
Stephen Hu
79482ff672
Refa: Improve ppt_parser better handle list (#6162)
### What problem does this PR solve?
This pull request (PR) incorporates codes for parsing PPTX files, aiming
to more precisely depict text in list formats (hint list by .).

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2025-03-17 17:02:39 +08:00
Kevin Hu
3a99c2b5f4
Refa: PARALLEL_DEVICES is a static parameter. (#6168)
### What problem does this PR solve?


### Type of change

- [x] Refactoring
2025-03-17 16:49:54 +08:00
Kevin Hu
bfa8d342b3
Fix: retrieval debug mode issue. (#6150)
### What problem does this PR solve?

#6139

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-17 13:07:13 +08:00
Debug Doctor
3e19044dee
Feat: add OCR's muti-gpus and parallel processing support (#5972)
### What problem does this PR solve?

Add OCR's muti-gpus and parallel processing support

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

@yuzhichang I've tried to resolve the comments in #5697. OCR jobs can
now be done on both CPU and GPU. ( By the way, I've encountered a
“Generate embedding error” issue #5954 that might be due to my outdated
GPUs? idk. ) Please review it and give me suggestions.

GPU:

![gpu_ocr](https://github.com/user-attachments/assets/0ee2ecfb-a665-4e50-8bc7-15941b9cd80e)

![smi](https://github.com/user-attachments/assets/a2312f8c-cf24-443d-bf89-bec50503546d)

CPU:

![cpu_ocr](https://github.com/user-attachments/assets/1ba6bb0b-94df-41ea-be79-790096da4bf1)
2025-03-17 11:58:40 +08:00
Yongteng Lei
7cd37c37cd
Feat: add CSV file parsing support (#5989)
### What problem does this PR solve?

Add CSV file parsing support #4552, #5849, #5870

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-03-12 19:20:50 +08:00
donblack01
b1a46d5adc Fix:when start with source code not in docker env report 'UnicodeDec… (#5802)
### What problem does this PR solve?

fix:when start with  source code not in docker env report
"UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 5:
illegal multibyte sequence" in windows

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: tangyu <1@1.com>
2025-03-10 11:22:06 +08:00
liwenju0
5b0e38060a
Feat:Optimize the table extraction logic in the Markdown parser: (#5663)
Enhance the recognition of both borderless and bordered Markdown tables.
Add support for extracting HTML tables, including various scenarios with
nested HTML tags. Improve performance by using conditional checks to
reduce unnecessary regular expression matching.

### What problem does this PR solve?

Optimize the table extraction logic in the Markdown parser:
Enhance the recognition of both borderless and bordered Markdown tables.
Add support for extracting HTML tables, including various scenarios with
nested HTML tags.
Improve performance by using conditional checks to reduce unnecessary
regular expression matching.

### Type of change

- [x] Performance Improvement

Co-authored-by: wenju.li <wenju.li@deepctr.cn>
2025-03-07 17:02:35 +08:00
Kevin Hu
8fb8374dfc
Fix: delimiter issue. (#5720)
### What problem does this PR solve?

#5704

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-06 17:51:22 +08:00
yihong
4326873af6
refactor: no need to inherit in python3 clean the code (#5659)
### What problem does this PR solve?

As title

### Type of change


- [x] Refactoring

Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-03-05 18:03:53 +08:00
非法操作
ca04ae9540
Minor: improve doc and rm unused file (#5634)
### What problem does this PR solve?

The `ocr.res` file is already included in the model directory
`rag/res/deepdoc`, but it doesn't seem to be utilized here.

### Type of change

- [x] Documentation Update
2025-03-05 12:59:54 +08:00
hy89
b0c21b00d9
Refactor: Optimize error handling and support parsing of XLS(EXCEL97—2003) files. (#5633)
Optimize error handling and support parsing of XLS(EXCEL97—2003) files.
2025-03-05 11:55:27 +08:00
Zhichang Yu
c813c1ff4c
Made task_executor async to speedup parsing (#5530)
### What problem does this PR solve?

Made task_executor async to speedup parsing

### Type of change

- [x] Performance Improvement
2025-03-03 18:59:49 +08:00
yihong
8a2542157f
Fix: possible memory leaks close #5277 (#5500)
### What problem does this PR solve?

close #5277 by make sure the file close

### Type of change

- [x] Performance Improvement

---------

Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-03-03 10:26:45 +08:00
Yongteng Lei
83d0949498
Fix: fix special delimiter parsing issue (#5448)
### What problem does this PR solve?

Fix special delimiter parsing issue #5382 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-27 18:33:55 +08:00
Zhichang Yu
db42d0e0ae
Optimize ocr (#5297)
### What problem does this PR solve?

Introduced OCR.recognize_batch

### Type of change

- [x] Performance Improvement
2025-02-24 16:21:55 +08:00
Zhichang Yu
c326f14fed
Optimized Recognizer.sort_X_firstly and Recognizer.sort_Y_firstly (#5182)
### What problem does this PR solve?

Optimized Recognizer.sort_X_firstly and Recognizer.sort_Y_firstly

### Type of change

- [x] Performance Improvement
2025-02-20 15:41:12 +08:00
SkyfireWXY
8fcca1b958
fix: big xls file error (#4859)
### What problem does this PR solve?

if *.xls file is too large, .eg >50M, I get error.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-12 12:39:25 +08:00
Kevin Hu
6f30397bb5
Infinity adapt to graphrag. (#4663)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-27 18:35:18 +08:00
Jin Hai
3894de895b
Update comments (#4569)
### What problem does this PR solve?

Add license statement.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-01-21 20:52:28 +08:00
Kevin Hu
e478586a8e
Refactor. (#4487)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2025-01-15 14:06:46 +08:00
Kevin Hu
76cd23eecf
Catch the exception while parsing pptx. (#4202)
### What problem does this PR solve?
#4189

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-24 10:49:28 +08:00
ly0303521
101b8ff813
fix chunk method "Table" losing content when the Excel file has multi… (#4123)
…ple sheets

### What problem does this PR solve?
discussed in https://github.com/infiniflow/ragflow/pull/4102
- In excel_parser.py, `total` means the total number of rows in Excel,
but it return in the first iterate, that lead to the wrong `to_page`
- In table.py, it when Excel file has multiple sheets, it will be
divided into multiple parts, every part size is 3000, `data` may be
empty, because it has recorded in the last iterate.
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-19 17:30:26 +08:00
Jin Hai
275b5d14f2
Fix json file parse (#4004)
### What problem does this PR solve?

Fix json file parsing

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: jinhai <haijin.chn@gmail.com>
2024-12-12 20:34:46 +08:00
Zhichang Yu
1254ecf445
Added static check at PR CI (#3921)
### What problem does this PR solve?

Added static check at PR CI

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
2024-12-08 21:23:51 +08:00
Zhichang Yu
0d68a6cd1b
Fix errors detected by Ruff (#3918)
### What problem does this PR solve?

Fix errors detected by Ruff

### Type of change

- [x] Refactoring
2024-12-08 14:21:12 +08:00
Jin Hai
821fdf02b4
Fix parsing JSON file error (#3829)
### What problem does this PR solve?

Close issue: #3828

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Signed-off-by: jinhai <haijin.chn@gmail.com>
2024-12-03 19:02:03 +08:00
Yuhao Tsui
7b6a5ffaff
Fix: page_chars attribute does not exist in some formats of PDF (#3796)
### What problem does this PR solve?

In #3335 someone suggested to upgrade pdfplumber==0.11.1, but that
didn't solve it.
It's actually the special formatting in some of the pdfs that triggers
the problem.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-03 11:08:06 +08:00
Kevin Hu
7058ac0041
Fix out of boundary. (#3786)
### What problem does this PR solve?

#3769
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-02 11:38:53 +08:00
Zhichang Yu
bc701d7b4c
Edit chunk shall update instead of insert it (#3709)
### What problem does this PR solve?

Edit chunk shall update instead of insert it. Close #3679 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-28 13:00:38 +08:00
Zhichang Yu
2249d5d413
Always open text file for write with UTF-8 (#3688)
### What problem does this PR solve?

Always open text file for write with UTF-8. Close #932 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-27 16:24:16 +08:00
Zhichang Yu
cad341e794 Added kb_id filter to knn. Fix #3458 (#3513)
### What problem does this PR solve?

Added kb_id filter to knn. Fix #3458

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-20 20:53:30 +08:00
Zhichang Yu
4413683898
Introduced beartype (#3460)
### What problem does this PR solve?

Introduced [beartype](https://github.com/beartype/beartype) for runtime
type-checking.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-11-18 17:38:17 +08:00
Jin Hai
1e90a1bf36
Move settings initialization after module init phase (#3438)
### What problem does this PR solve?

1. Module init won't connect database any more.
2. Config in settings need to be used with settings.CONFIG_NAME

### Type of change

- [x] Refactoring

Signed-off-by: jinhai <haijin.chn@gmail.com>
2024-11-15 17:30:56 +08:00
Zhichang Yu
30f6421760
Use consistent log file names, introduced initLogger (#3403)
### What problem does this PR solve?

Use consistent log file names, introduced initLogger

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2024-11-14 17:13:48 +08:00
Kevin Hu
4caf932808
fix bug about fetching knowledge graph (#3394)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-14 12:29:15 +08:00