### What problem does this PR solve?
it would be fail if PARALLEL_DEVICES = None in OCR class , because it
pass 0 to TextDetector and TextRecognizer init method.
and It would be simpler to set 0 as the default value for
PARALLEL_DEVICES.
### Type of change
- [x] Refactoring
### What problem does this PR solve?
When the PDF uses vector fonts, the rendered text in the captured page
image often has missing strokes, leading to numerous OCR errors and
incorrect characters. Similar issues also occur in the extracted chart
images.
**Before**

**After**

You can use the following document for testing.
[Casio说明书.pdf](https://github.com/user-attachments/files/20119690/Casio.pdf)
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>
i use PdfParser in local(refer to this case:
https://github.com/infiniflow/ragflow/blob/main/rag/app/paper.py) like
this:
```
import re
import openpyxl
from ragflow.api.db import ParserType
from ragflow.rag.nlp import rag_tokenizer, tokenize, tokenize_table, add_positions, bullets_category, \
title_frequency, \
tokenize_chunks
from ragflow.rag.utils import num_tokens_from_string
from ragflow.deepdoc.parser import PdfParser, ExcelParser, DocxParser,PlainParser
def logger(prog=None, msg=""):
print(msg)
class Pdf(PdfParser):
def __init__(self):
self.model_speciess = ParserType.MANUAL.value
super().__init__()
def __call__(self, filename, binary=None, from_page=0,
to_page=100000, zoomin=3, callback=None):
from timeit import default_timer as timer
start = timer()
callback(msg="OCR is running...")
self.__images__(
filename if not binary else binary,
zoomin,
from_page,
to_page,
callback
)
callback(msg="OCR finished.")
print("OCR:", timer() - start)
self._layouts_rec(zoomin)
callback(0.65, "Layout analysis finished.")
print("layouts:", timer() - start)
self._table_transformer_job(zoomin)
callback(0.67, "Table analysis finished.")
self._text_merge()
tbls = self._extract_table_figure(True, zoomin, True, True)
self._concat_downward()
self._filter_forpages()
callback(0.68, "Text merging finished")
# clean mess
for b in self.boxes:
b["text"] = re.sub(r"([\t ]|\u3000){2,}", " ", b["text"].strip())
return [(b["text"], b.get("layout_no", ""), self.get_position(b, zoomin))
for i, b in enumerate(self.boxes)], tbls
```
show err like this:
```
File "xxxxx/third_party/ragflow/deepdoc/parser/pdf_parser.py", line 1039, in __images__
self.pdf.close()
AttributeError: 'PdfReader' object has no attribute 'close'
```
i found ragflow source code use
`pdfplumber.open`(https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py#L1007C28-L1007C43)
and replace` self.pdf `with ` pdf2_read` (from pypdf import PdfReader as
pdf2_read)in line 1024
(https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py#L1024)
```
self.pdf = pdf2_read
```
---
and I found that `pdfplumber` can be used in this way:
```
file_path="xxx.pdf"
res = pdfplumber.open(file_path)
res.close()
```
but `pypdf.PdfReader` source code do not has `close` func, source code
use like this
```
with open(stream, "rb") as fh:
stream = BytesIO(fh.read())
self._stream_opened = True
```
> https://github.com/py-pdf/pypdf/blob/main/pypdf/_reader.py#L156
so I moved the `self.pdf.close` function call and fixed this problem
hoping to help the project😊
### What problem does this PR solve?
Add VLM-boosted PDF parser if VLM is set.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Add vision LLM PDF parser
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
The `ocr.res` file is already included in the model directory
`rag/res/deepdoc`, but it doesn't seem to be utilized here.
### Type of change
- [x] Documentation Update
### What problem does this PR solve?
close#5277 by make sure the file close
### Type of change
- [x] Performance Improvement
---------
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
### What problem does this PR solve?
Optimized Recognizer.sort_X_firstly and Recognizer.sort_Y_firstly
### Type of change
- [x] Performance Improvement
### What problem does this PR solve?
Added static check at PR CI
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
### What problem does this PR solve?
In #3335 someone suggested to upgrade pdfplumber==0.11.1, but that
didn't solve it.
It's actually the special formatting in some of the pdfs that triggers
the problem.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Edit chunk shall update instead of insert it. Close#3679
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
1. Module init won't connect database any more.
2. Config in settings need to be used with settings.CONFIG_NAME
### Type of change
- [x] Refactoring
Signed-off-by: jinhai <haijin.chn@gmail.com>
### What problem does this PR solve?
Use consistent log file names, introduced initLogger
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### What problem does this PR solve?
when use slim image, remove ```torch``` denpendency.
### Type of change
- [✓] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: chongchuanbing <chongchuanbing@gmail.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
Fix: renrank_model and pdf_parser bugs | Update: session API
#2575#2559
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
---------
Co-authored-by: liuhua <10215101452@stu.ecun.edu.cn>
### What problem does this PR solve?
#2295
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### What problem does this PR solve?
#1704
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
_fix: When parsing the bold content in PDF, the result is duplicated._
the detail: [When using OCR to recognize Chinese titles, the structure
appears to be
duplicated](https://github.com/infiniflow/ragflow/issues/1718)
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
pypdf and PyPDF2 possible Infinite Loop when a comment isn't followed by
a character #59
### Type of change
- [x] Refactoring
### What problem does this PR solve?
After checking the language of the pdf, the line will hardcode the
language into Chinese
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fix occasional errors in pdf table recognition
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
- Update readme
- Add license
### Type of change
- [x] Documentation Update
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
When pdfparser call `_naive_vertical_merge` method,there is a "not
concating feature " value by computing difference between `b` and `b_`'s
layoutno ,but actually is `b` and `b`. I think it's a bug, so fix it.
Please check again.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)