From 7e75b9d77846591227356ce56c54344acd72f628 Mon Sep 17 00:00:00 2001
From: Vitaliy Groshev
Date: Sat, 14 Sep 2024 08:14:39 +0300
Subject: [PATCH] fix parsing spaces in russian language PDFs (#1987) (#2427)

### What problem does this PR solve?

[#1987](https://github.com/infiniflow/ragflow/issues/1987)

When scanning PDF files character by character, the parser dropped a space whenever the last character of the text accumulated so far did not match the regex. Text from [Russian documents](https://github.com/user-attachments/files/16659706/dogovor_oferta.pdf) needs spaces, but it does not match the regex because it uses a different alphabet. As a result, these PDFs were parsed incorrectly and were almost unusable as a source.

Fixed that by adding the Russian alphabet to the regex. There might be problems with other languages that use different alphabets. I additionally tested a [PDF in Spanish](https://www.scusd.edu/sites/main/files/file-attachments/howtohelpyourchildsucceedinschoolspanish.pdf?1338307816), and the old `[a-zA-Z...]` regex parses it correctly with spaces.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
---
 deepdoc/parser/pdf_parser.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/deepdoc/parser/pdf_parser.py b/deepdoc/parser/pdf_parser.py
index 0afc67ab1..5723ad618 100644
--- a/deepdoc/parser/pdf_parser.py
+++ b/deepdoc/parser/pdf_parser.py
@@ -299,7 +299,7 @@ class RAGFlowPdfParser:
                     self.lefted_chars.append(c)
                     continue
                 if c["text"] == " " and bxs[ii]["text"]:
-                    if re.match(r"[0-9a-zA-Z,.?;:!%%]", bxs[ii]["text"][-1]):
+                    if re.match(r"[0-9a-zA-Zа-яА-Я,.?;:!%%]", bxs[ii]["text"][-1]):
                         bxs[ii]["text"] += " "
                 else:
                     bxs[ii]["text"] += c["text"]
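
The effect of the change can be reproduced in isolation. The following is a minimal sketch, not the parser's actual code: the two character classes are copied from the diff, while `keeps_space_after` is a hypothetical helper that mirrors only the `re.match(..., bxs[ii]["text"][-1])` check. It also illustrates the caveat about other alphabets: the range `а-яА-Я` does not cover `ё`/`Ё`, which sit outside that contiguous Unicode range.

```python
import re

# Character classes copied verbatim from the diff (old and new).
OLD_CLASS = re.compile(r"[0-9a-zA-Z,.?;:!%%]")
NEW_CLASS = re.compile(r"[0-9a-zA-Zа-яА-Я,.?;:!%%]")


def keeps_space_after(text: str, pattern: re.Pattern) -> bool:
    """Hypothetical helper: True if a space following `text` would be kept.

    Mirrors the check in the diff: only the last character of the
    accumulated text is tested against the character class.
    """
    return bool(text) and pattern.match(text[-1]) is not None


print(keeps_space_after("contract", OLD_CLASS))  # Latin text, old regex
print(keeps_space_after("договор", OLD_CLASS))   # Russian text, old regex
print(keeps_space_after("договор", NEW_CLASS))   # Russian text, new regex
print(keeps_space_after("ещё", NEW_CLASS))       # word ending in "ё"
```

With the old class the space after a Russian word is dropped, and with the new class it is kept. A word ending in `ё` is still rejected by the new class, so a follow-up might extend it to `а-яА-ЯёЁ` or switch to a broader test such as `str.isalnum()`; both are suggestions, not something this patch does.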