### What problem does this PR solve?
Add vision LLM PDF parser
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
Rewrite the `create_inputs` method to use NumPy operations in place of a for loop.
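
As a rough illustration of this kind of change, the sketch below replaces a row-by-row concatenation loop with a single array construction. The `im_shape` and `scale_factor` keys and both helper names are assumptions for the example, not taken from the actual `create_inputs` code.

```python
import numpy as np


def batch_meta_loop(im_infos):
    """Loop version: grow each array one row at a time."""
    shapes = np.zeros((0, 2), dtype=np.float32)
    scales = np.zeros((0, 2), dtype=np.float32)
    for info in im_infos:
        shapes = np.concatenate(
            [shapes, np.asarray([info["im_shape"]], dtype=np.float32)])
        scales = np.concatenate(
            [scales, np.asarray([info["scale_factor"]], dtype=np.float32)])
    return shapes, scales


def batch_meta_numpy(im_infos):
    """NumPy version: build each (N, 2) array in one call."""
    shapes = np.array([info["im_shape"] for info in im_infos], dtype=np.float32)
    scales = np.array([info["scale_factor"] for info in im_infos], dtype=np.float32)
    return shapes, scales
```

Both helpers return the same arrays for a non-empty input. The second avoids reallocating the batch on every iteration.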
### Type of change
- [x] Performance Improvement
### What problem does this PR solve?
This pull request (PR) adds code for parsing PPTX files, aiming to
represent text in list formats more precisely (list items are hinted by `.`).
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### What problem does this PR solve?
Optimize OCR garbage identification to reduce unnecessary filtering.
#5713
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Add CSV file parsing support #4552, #5849, #5870
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Fix: when starting from source code outside the Docker environment on
Windows, the following error is reported:
"UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 5:
illegal multibyte sequence"
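
The description does not say which file triggered the error or how it was fixed. A common cause of this message is opening a UTF-8 text file without an explicit encoding, so that Windows falls back to the locale codec (GBK here). A minimal sketch under that assumption, with a hypothetical helper name:

```python
import os
import tempfile


def read_text_utf8(path):
    # Pass the encoding explicitly so the result does not depend on the
    # platform's locale default (for example GBK on some Windows setups).
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Usage: round-trip a throwaway file that contains non-ASCII text.
fd, tmp_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(tmp_path, "w", encoding="utf-8") as out:
    out.write("caf\u00e9 \u4e2d\u6587")
text = read_text_utf8(tmp_path)
os.remove(tmp_path)
```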
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Co-authored-by: tangyu <1@1.com>
### What problem does this PR solve?
Optimize the table extraction logic in the Markdown parser:
- Enhance the recognition of both borderless and bordered Markdown tables.
- Add support for extracting HTML tables, including various scenarios with
nested HTML tags.
- Improve performance by using conditional checks to reduce unnecessary
regular expression matching.
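
The "conditional check before regex" idea can be sketched as follows. The patterns and the `extract_tables` name are illustrative assumptions, not the parser's actual code. The sketch covers only bordered Markdown tables and non-nested `<table>` elements.

```python
import re

# Two or more consecutive lines that start and end with a pipe.
BORDERED_TABLE = re.compile(r"(?:^[ \t]*\|.*\|[ \t]*\n?){2,}", re.MULTILINE)
# A <table> element, matched lazily up to the first closing tag.
HTML_TABLE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)


def extract_tables(text):
    """Return table-like spans, skipping each regex when it cannot match."""
    tables = []
    if "|" in text:  # cheap substring test before the regex
        tables += [m.group(0).strip() for m in BORDERED_TABLE.finditer(text)]
    if "<table" in text.lower():
        tables += [m.group(0) for m in HTML_TABLE.finditer(text)]
    return tables
```

Text that contains neither a pipe nor a `<table` tag never reaches the regex engine, which is where the saving comes from.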
### Type of change
- [x] Performance Improvement
Co-authored-by: wenju.li <wenju.li@deepctr.cn>
### What problem does this PR solve?
The `ocr.res` file is already included in the model directory
`rag/res/deepdoc`, but it doesn't seem to be utilized here.
### Type of change
- [x] Documentation Update
### What problem does this PR solve?
Close #5277 by making sure the file is closed.
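
The description does not say which file handle was leaking. The usual way to guarantee closing in Python is a `with` block, sketched here with a hypothetical helper:

```python
import os
import tempfile


def read_binary(path):
    # The context manager closes the handle even if read() raises.
    with open(path, "rb") as f:
        data = f.read()
    return data, f.closed


# Usage: write a throwaway file, read it back, confirm the handle is closed.
fd, tmp_path = tempfile.mkstemp()
os.close(fd)
with open(tmp_path, "wb") as out:
    out.write(b"demo")
data, closed = read_binary(tmp_path)
os.remove(tmp_path)
```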
### Type of change
- [x] Performance Improvement
---------
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
### What problem does this PR solve?
This patch drops fasttext, which appears to be unused in the code base
and is rather hard to install.
Should close #4498.
### Type of change
- [x] Refactoring
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
### What problem does this PR solve?
Fix special delimiter parsing issue #5382
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Optimized Recognizer.sort_X_firstly and Recognizer.sort_Y_firstly
### Type of change
- [x] Performance Improvement
Use `np.float32()` instead.
### What problem does this PR solve?
Using `eval()` can lead to code injections.
I think `eval()` is only used to parse a floating point number here.
This change preserves the correct behavior if the string `"None"` is
supplied. But if that behavior isn't intended then this part could be
just deleted instead, since `np.float32()` is parsing strings anyway:
```Python
if isinstance(scale, str):
    scale = eval(scale)
```
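
A minimal sketch of the replacement described above, assuming the only string inputs are plain numeric literals or `"None"`. The `parse_scale` name and the `1/255` fallback are assumptions for illustration.

```python
import numpy as np


def parse_scale(scale, default=1.0 / 255.0):
    """Hypothetical helper: convert a config value to np.float32 without eval()."""
    if isinstance(scale, str) and scale == "None":
        scale = None  # keep the old behaviour for the literal string "None"
    # np.float32() parses numeric strings such as "0.5" directly and raises
    # ValueError for anything that is not a number.
    return np.float32(default if scale is None else scale)
```

One difference from `eval()`: an expression string such as `"1./255."` is not a numeric literal, so `np.float32()` rejects it with `ValueError` instead of evaluating it.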
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
If an *.xls file is too large (e.g. >50M), an error occurs.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fix GPU detection in CPU-only environments. Close #4692
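
The description does not show how detection was fixed. One defensive pattern, sketched here on the assumption that PyTorch is the detection backend, treats any import or driver failure as "no GPU":

```python
def cuda_is_available():
    """Return True only if torch imports and reports at least one CUDA device."""
    try:
        import torch  # optional dependency; may be absent on CPU-only installs
        return bool(torch.cuda.is_available() and torch.cuda.device_count() > 0)
    except Exception:
        # Missing package, missing driver, or a broken CUDA runtime.
        return False
```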
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Remove usage of `eval()` from postprocess.py
### What problem does this PR solve?
The use of `eval()` is a potential security risk. While the use of
`eval()` is guarded and thus not a security risk normally, `assert`s
aren't run if `-O` or `-OO` is passed to the interpreter, and as such
then the guard would not apply. In any case there is no reason to use
`eval()` here at all.
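
To illustrate the point about `assert`, the sketch below looks the class up in an explicit registry and raises a real exception, so the guard still applies under `-O`. All names are hypothetical stand-ins, not the contents of postprocess.py.

```python
class DoubleStep:
    def __init__(self, factor=2):
        self.factor = factor

    def __call__(self, x):
        return x * self.factor


class NegateStep:
    def __call__(self, x):
        return -x


# Explicit allow-list of constructible classes, keyed by class name.
_REGISTRY = {cls.__name__: cls for cls in (DoubleStep, NegateStep)}


def build_step(name, **kwargs):
    # Unlike `assert`, this check is not stripped by `python -O`.
    if name not in _REGISTRY:
        raise ValueError(f"unsupported step: {name!r}")
    return _REGISTRY[name](**kwargs)
```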
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Other (please describe):
Potential security fix if somehow the passed `modul_name` could be user
controlled.
`eval(op_type)` -> `getattr(operators, op_type)`
### What problem does this PR solve?
Using `eval()` can lead to code injections and is entirely unnecessary
here.
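
A minimal sketch of the `getattr` lookup. The `operators` namespace below is a stand-in built for the example, not the real module.

```python
from types import SimpleNamespace


def _resize(img):
    return ("resized", img)


def _normalize(img):
    return ("normalized", img)


# Stand-in for the real `operators` module (assumption for this example).
operators = SimpleNamespace(Resize=_resize, Normalize=_normalize)


def resolve_operator(op_name):
    # getattr only looks a name up; it never evaluates the string as code.
    op = getattr(operators, op_name, None)
    if op is None:
        raise ValueError(f"unknown operator: {op_name!r}")
    return op
```

A string such as `"__import__('os')"` is simply not an attribute name, so the lookup fails and nothing is executed.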
### Type of change
- [x] Other (please describe):
Best practice code improvement, preventing the possibility of code
injection.
`eval(op_name)` -> `getattr(operators, op_name)`
### What problem does this PR solve?
Using `eval()` can lead to code injections and is entirely unnecessary
here.
### Type of change
- [x] Other (please describe):
Best practice code improvement, preventing the possibility of code
injection.
### What problem does this PR solve?
[Bug]: layout recognizer failed for wrong boxes class type #4230
(https://github.com/infiniflow/ragflow/issues/4230)
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: youzhiqiang <zhiqiang.you@aminer.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
…ple sheets
### What problem does this PR solve?
Discussed in https://github.com/infiniflow/ragflow/pull/4102
- In excel_parser.py, `total` is meant to be the total number of rows in
the Excel file, but it is returned in the first iteration, which leads to
a wrong `to_page`.
- In table.py, when an Excel file has multiple sheets, it is divided into
multiple parts of 3000 rows each. `data` may be empty because its rows
were already recorded in the previous iteration.
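
The two points above can be sketched with plain dictionaries standing in for a workbook. The function names are hypothetical, and the 3000-row part size is taken from the description.

```python
def count_total_rows(sheets):
    """Sum rows over every sheet instead of returning after the first one."""
    return sum(len(rows) for rows in sheets.values())


def iter_parts(sheets, part_size=3000):
    """Yield (sheet_name, rows) parts; empty sheets yield nothing."""
    for name, rows in sheets.items():
        for start in range(0, len(rows), part_size):
            yield name, rows[start:start + part_size]
```

Because the range is driven by each sheet's own length, no empty part is produced for a caller to handle.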
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fix JSON file parsing
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Signed-off-by: jinhai <haijin.chn@gmail.com>
### What problem does this PR solve?
Add a static check to the PR CI
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
### What problem does this PR solve?
Close issue: #3828
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Signed-off-by: jinhai <haijin.chn@gmail.com>
### What problem does this PR solve?
In #3335 someone suggested upgrading to pdfplumber==0.11.1, but that
didn't solve it.
The problem is actually triggered by special formatting in some of the
PDFs.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)