mirror of
https://git.mirrors.martin98.com/https://github.com/infiniflow/ragflow.git
synced 2025-08-10 19:18:57 +08:00
feat: Enable antialiasing for PDF image extraction to improve OCR accuracy (#7562)
### What problem does this PR solve?

When the PDF uses vector fonts, the rendered text in the captured page image often has missing strokes, leading to numerous OCR errors and incorrect characters. Similar issues also occur in the extracted chart images.

**Before** and **After**: (comparison screenshots)

You can use the following document for testing: [Casio说明书.pdf](https://github.com/user-attachments/files/20119690/Casio.pdf)

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>
parent 473aa28422
commit ea5e8caa69
@@ -1015,7 +1015,7 @@ class RAGFlowPdfParser:
             with sys.modules[LOCK_KEY_pdfplumber]:
                 with (pdfplumber.open(fnm) if isinstance(fnm, str) else pdfplumber.open(BytesIO(fnm))) as pdf:
                     self.pdf = pdf
-                    self.page_images = [p.to_image(resolution=72 * zoomin).annotated for i, p in
+                    self.page_images = [p.to_image(resolution=72 * zoomin, antialias=True).annotated for i, p in
                                         enumerate(self.pdf.pages[page_from:page_to])]
 
                     try:
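To try the change outside the parser, the rendering step can be isolated as in the rough sketch below. This is not the project's code: the helper names `page_resolution`, `render_page_images`, and `load_page_images` are hypothetical, and the defaults for `zoomin`, `page_from`, and `page_to` are assumptions chosen for illustration. The only pdfplumber calls used are `pdfplumber.open`, `Page.to_image(resolution=..., antialias=...)`, and the `.annotated` attribute, all of which appear in the diff.

```python
from io import BytesIO


def page_resolution(zoomin):
    """PDF user space is 72 units per inch, so zoomin scales the render DPI."""
    return 72 * zoomin


def render_page_images(pdf, zoomin=3, page_from=0, page_to=299, antialias=True):
    """Render pdf.pages[page_from:page_to] to images.

    `pdf` is any object exposing a pdfplumber-style `.pages` list.
    The default values here are illustrative assumptions.
    """
    resolution = page_resolution(zoomin)
    return [
        page.to_image(resolution=resolution, antialias=antialias).annotated
        for page in pdf.pages[page_from:page_to]
    ]


def load_page_images(fnm, **kwargs):
    """Open a file path or raw PDF bytes with pdfplumber and render the pages."""
    # Third-party dependency, imported lazily so the helpers above
    # can be imported and tested without pdfplumber installed.
    import pdfplumber

    source = fnm if isinstance(fnm, str) else BytesIO(fnm)
    with pdfplumber.open(source) as pdf:
        return render_page_images(pdf, **kwargs)
```

Calling `load_page_images(path, antialias=False)` and `load_page_images(path, antialias=True)` on the test document and saving the first image from each call should give a side-by-side view of the stroke difference described above.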