i use PdfParser in local(refer to this case:
https://github.com/infiniflow/ragflow/blob/main/rag/app/paper.py) like
this:
```
import re
import openpyxl
from ragflow.api.db import ParserType
from ragflow.rag.nlp import rag_tokenizer, tokenize, tokenize_table, add_positions, bullets_category, \
title_frequency, \
tokenize_chunks
from ragflow.rag.utils import num_tokens_from_string
from ragflow.deepdoc.parser import PdfParser, ExcelParser, DocxParser,PlainParser
def logger(prog=None, msg=""):
print(msg)
class Pdf(PdfParser):
def __init__(self):
self.model_speciess = ParserType.MANUAL.value
super().__init__()
def __call__(self, filename, binary=None, from_page=0,
to_page=100000, zoomin=3, callback=None):
from timeit import default_timer as timer
start = timer()
callback(msg="OCR is running...")
self.__images__(
filename if not binary else binary,
zoomin,
from_page,
to_page,
callback
)
callback(msg="OCR finished.")
print("OCR:", timer() - start)
self._layouts_rec(zoomin)
callback(0.65, "Layout analysis finished.")
print("layouts:", timer() - start)
self._table_transformer_job(zoomin)
callback(0.67, "Table analysis finished.")
self._text_merge()
tbls = self._extract_table_figure(True, zoomin, True, True)
self._concat_downward()
self._filter_forpages()
callback(0.68, "Text merging finished")
# clean mess
for b in self.boxes:
b["text"] = re.sub(r"([\t ]|\u3000){2,}", " ", b["text"].strip())
return [(b["text"], b.get("layout_no", ""), self.get_position(b, zoomin))
for i, b in enumerate(self.boxes)], tbls
```
show err like this:
```
File "xxxxx/third_party/ragflow/deepdoc/parser/pdf_parser.py", line 1039, in __images__
self.pdf.close()
AttributeError: 'PdfReader' object has no attribute 'close'
```
i found ragflow source code use
`pdfplumber.open`(https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py#L1007C28-L1007C43)
and replace` self.pdf `with ` pdf2_read` (from pypdf import PdfReader as
pdf2_read)in line 1024
(https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py#L1024)
```
self.pdf = pdf2_read
```
---
and I found that `pdfplumber` can be used in this way:
```
file_path="xxx.pdf"
res = pdfplumber.open(file_path)
res.close()
```
but `pypdf.PdfReader` source code do not has `close` func, source code
use like this
```
with open(stream, "rb") as fh:
stream = BytesIO(fh.read())
self._stream_opened = True
```
> https://github.com/py-pdf/pypdf/blob/main/pypdf/_reader.py#L156
so I moved the `self.pdf.close` function call and fixed this problem
hoping to help the project😊
### What problem does this PR solve?
Fix: When Excel is a formula, the parsed result is a formula, but cannot
be correctly parsed as a value type
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Co-authored-by: tangyu <1@1.com>
### What problem does this PR solve?
support pic base bullet for PPT
modify one mistake in document
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Add VLM-boosted PDF parser if VLM is set.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Add vision LLM PDF parser
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
For the create_inputs method based on np operation to replace for loop
### Type of change
- [x] Performance Improvement
### What problem does this PR solve?
This pull request (PR) incorporates codes for parsing PPTX files, aiming
to more precisely depict text in list formats (hint list by .).
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### What problem does this PR solve?
Optimize OCR garbage identification to reduce unnecessary filtering.
#5713
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Add CSV file parsing support #4552, #5849, #5870
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
fix:when start with source code not in docker env report
"UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 5:
illegal multibyte sequence" in windows
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Co-authored-by: tangyu <1@1.com>
Enhance the recognition of both borderless and bordered Markdown tables.
Add support for extracting HTML tables, including various scenarios with
nested HTML tags. Improve performance by using conditional checks to
reduce unnecessary regular expression matching.
### What problem does this PR solve?
Optimize the table extraction logic in the Markdown parser:
Enhance the recognition of both borderless and bordered Markdown tables.
Add support for extracting HTML tables, including various scenarios with
nested HTML tags.
Improve performance by using conditional checks to reduce unnecessary
regular expression matching.
### Type of change
- [x] Performance Improvement
Co-authored-by: wenju.li <wenju.li@deepctr.cn>
### What problem does this PR solve?
The `ocr.res` file is already included in the model directory
`rag/res/deepdoc`, but it doesn't seem to be utilized here.
### Type of change
- [x] Documentation Update
### What problem does this PR solve?
close#5277 by make sure the file close
### Type of change
- [x] Performance Improvement
---------
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
### What problem does this PR solve?
This patch drop useless fastext which is seems useless in the code base
and its very kind of hard install
should close#4498
### Type of change
- [x] Refactoring
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
### What problem does this PR solve?
Fix special delimiter parsing issue #5382
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Optimized Recognizer.sort_X_firstly and Recognizer.sort_Y_firstly
### Type of change
- [x] Performance Improvement
Use `np.float32()` instead.
### What problem does this PR solve?
Using `eval()` can lead to code injections.
I think `eval()` is only used to parse a floating point number here.
This change preserves the correct behavior if the string `"None"` is
supplied. But if that behavior isn't intended then this part could be
just deleted instead, since `np.float32()` is parsing strings anyway:
```Python
if isinstance(scale, str):
scale = eval(scale)
```
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
if *.xls file is too large, .eg >50M, I get error.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fixed GPU detection on CPU only environment. Close#4692
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Remove usage of `eval()` from postprocess.py
### What problem does this PR solve?
The use of `eval()` is a potential security risk. While the use of
`eval()` is guarded and thus not a security risk normally, `assert`s
aren't run if `-O` or `-OO` is passed to the interpreter, and as such
then the guard would not apply. In any case there is no reason to use
`eval()` here at all.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Other (please describe):
Potential security fix if somehow the passed `modul_name` could be user
controlled.
`eval(op_type)` -> `getattr(operators, op_type)`
### What problem does this PR solve?
Using `eval()` can lead to code injections and is entirely unnecessary
here.
### Type of change
- [x] Other (please describe):
Best practice code improvement, preventing the possibility of code
injection.
`eval(op_name)` -> `getattr(operators, op_name)`
### What problem does this PR solve?
Using `eval()` can lead to code injections and is entirely unnecessary
here.
### Type of change
- [x] Other (please describe):
Best practice code improvement, preventing the possibility of code
injection.
### What problem does this PR solve?
[Bug]: layout recognizer failed for wrong boxes class type #4230
(https://github.com/infiniflow/ragflow/issues/4230)
### Type of change
- [✅ ] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: youzhiqiang <zhiqiang.you@aminer.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
…ple sheets
### What problem does this PR solve?
discussed in https://github.com/infiniflow/ragflow/pull/4102
- In excel_parser.py, `total` means the total number of rows in Excel,
but it return in the first iterate, that lead to the wrong `to_page`
- In table.py, it when Excel file has multiple sheets, it will be
divided into multiple parts, every part size is 3000, `data` may be
empty, because it has recorded in the last iterate.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)