ragflow

mirror of https://git.mirrors.martin98.com/https://github.com/infiniflow/ragflow.git synced 2025-04-23 22:50:17 +08:00

Author	SHA1	Message	Date
donblack01	0b48a2e0d1	Fix: When an Excel cell is a formula, the parsed result is the formula, which cannot be correctly parsed as a value type (#6613 ) ### What problem does this PR solve? Fix: When an Excel cell is a formula, the parsed result is the formula, which cannot be correctly parsed as a value type ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: tangyu <1@1.com>	2025-03-28 09:33:49 +08:00
Stephen Hu	d77380f024	Feat: support picture-based bullets for PPT (#6406 ) ### What problem does this PR solve? Support picture-based bullets for PPT and fix one mistake in the document. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-03-24 09:31:31 +08:00
Yongteng Lei	9611185eb4	Feat: add VLM-boosted DocX parser (#6307 ) ### What problem does this PR solve? Add VLM-boosted DocX parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-03-20 11:24:44 +08:00
Yongteng Lei	1d6760dd84	Feat: add VLM-boosted PDF parser (#6278 ) ### What problem does this PR solve? Add VLM-boosted PDF parser if VLM is set. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-03-20 09:39:32 +08:00
Yongteng Lei	5cf610af40	Feat: add vision LLM PDF parser (#6173 ) ### What problem does this PR solve? Add vision LLM PDF parser ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-03-18 14:52:20 +08:00
Stephen Hu	79482ff672	Refa: Improve ppt_parser to better handle lists (#6162 ) ### What problem does this PR solve? This pull request (PR) adds code for parsing PPTX files so that text in list formats is represented more precisely (list items are hinted by `.`). ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring	2025-03-17 17:02:39 +08:00
Kevin Hu	3a99c2b5f4	Refa: PARALLEL_DEVICES is a static parameter. (#6168 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2025-03-17 16:49:54 +08:00
Kevin Hu	bfa8d342b3	Fix: retrieval debug mode issue. (#6150 ) ### What problem does this PR solve? #6139 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-03-17 13:07:13 +08:00
Debug Doctor	3e19044dee	Feat: add OCR's multi-GPU and parallel processing support (#5972 ) ### What problem does this PR solve? Add OCR's multi-GPU and parallel processing support ### Type of change - [x] New Feature (non-breaking change which adds functionality) @yuzhichang I've tried to resolve the comments in #5697. OCR jobs can now be done on both CPU and GPU. ( By the way, I've encountered a “Generate embedding error” issue #5954 that might be due to my outdated GPUs? idk. ) Please review it and give me suggestions. GPU: ![gpu_ocr](https://github.com/user-attachments/assets/0ee2ecfb-a665-4e50-8bc7-15941b9cd80e) ![smi](https://github.com/user-attachments/assets/a2312f8c-cf24-443d-bf89-bec50503546d) CPU: ![cpu_ocr](https://github.com/user-attachments/assets/1ba6bb0b-94df-41ea-be79-790096da4bf1)	2025-03-17 11:58:40 +08:00
Yongteng Lei	7cd37c37cd	Feat: add CSV file parsing support (#5989 ) ### What problem does this PR solve? Add CSV file parsing support #4552, #5849, #5870 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-03-12 19:20:50 +08:00
donblack01	b1a46d5adc	Fix:when start with source code not in docker env report 'UnicodeDec… (#5802 ) ### What problem does this PR solve? Fix: when starting from source code outside the Docker environment on Windows, the error "UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 5: illegal multibyte sequence" is reported ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: tangyu <1@1.com>	2025-03-10 11:22:06 +08:00
liwenju0	5b0e38060a	Feat: Optimize the table extraction logic in the Markdown parser: (#5663 ) Enhance the recognition of both borderless and bordered Markdown tables. Add support for extracting HTML tables, including various scenarios with nested HTML tags. Improve performance by using conditional checks to reduce unnecessary regular expression matching. ### What problem does this PR solve? Optimize the table extraction logic in the Markdown parser: Enhance the recognition of both borderless and bordered Markdown tables. Add support for extracting HTML tables, including various scenarios with nested HTML tags. Improve performance by using conditional checks to reduce unnecessary regular expression matching. ### Type of change - [x] Performance Improvement Co-authored-by: wenju.li <wenju.li@deepctr.cn>	2025-03-07 17:02:35 +08:00
Kevin Hu	8fb8374dfc	Fix: delimiter issue. (#5720 ) ### What problem does this PR solve? #5704 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-03-06 17:51:22 +08:00
yihong	4326873af6	refactor: no need to inherit in python3 clean the code (#5659 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring Signed-off-by: yihong0618 <zouzou0208@gmail.com>	2025-03-05 18:03:53 +08:00
非法操作	ca04ae9540	Minor: improve doc and rm unused file (#5634 ) ### What problem does this PR solve? The `ocr.res` file is already included in the model directory `rag/res/deepdoc`, but it doesn't seem to be utilized here. ### Type of change - [x] Documentation Update	2025-03-05 12:59:54 +08:00
hy89	b0c21b00d9	Refactor: Optimize error handling and support parsing of XLS (Excel 97-2003) files. (#5633 ) Optimize error handling and support parsing of XLS (Excel 97-2003) files.	2025-03-05 11:55:27 +08:00
Zhichang Yu	c813c1ff4c	Made task_executor async to speed up parsing (#5530 ) ### What problem does this PR solve? Made task_executor async to speed up parsing ### Type of change - [x] Performance Improvement	2025-03-03 18:59:49 +08:00
yihong	8a2542157f	Fix: possible memory leaks close #5277 (#5500 ) ### What problem does this PR solve? Close #5277 by making sure the file is closed ### Type of change - [x] Performance Improvement --------- Signed-off-by: yihong0618 <zouzou0208@gmail.com>	2025-03-03 10:26:45 +08:00
Yongteng Lei	83d0949498	Fix: fix special delimiter parsing issue (#5448 ) ### What problem does this PR solve? Fix special delimiter parsing issue #5382 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-02-27 18:33:55 +08:00
Zhichang Yu	db42d0e0ae	Optimize ocr (#5297 ) ### What problem does this PR solve? Introduced OCR.recognize_batch ### Type of change - [x] Performance Improvement	2025-02-24 16:21:55 +08:00
Zhichang Yu	c326f14fed	Optimized Recognizer.sort_X_firstly and Recognizer.sort_Y_firstly (#5182 ) ### What problem does this PR solve? Optimized Recognizer.sort_X_firstly and Recognizer.sort_Y_firstly ### Type of change - [x] Performance Improvement	2025-02-20 15:41:12 +08:00
SkyfireWXY	8fcca1b958	fix: big xls file error (#4859 ) ### What problem does this PR solve? If an *.xls file is too large, e.g. >50M, an error occurs. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-02-12 12:39:25 +08:00
Kevin Hu	6f30397bb5	Infinity adapt to graphrag. (#4663 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-01-27 18:35:18 +08:00
Jin Hai	3894de895b	Update comments (#4569 ) ### What problem does this PR solve? Add license statement. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-01-21 20:52:28 +08:00
Kevin Hu	e478586a8e	Refactor. (#4487 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2025-01-15 14:06:46 +08:00
Kevin Hu	76cd23eecf	Catch the exception while parsing pptx. (#4202 ) ### What problem does this PR solve? #4189 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-12-24 10:49:28 +08:00
ly0303521	101b8ff813	fix chunk method "Table" losing content when the Excel file has multiple sheets (#4123 ) ### What problem does this PR solve? Discussed in https://github.com/infiniflow/ragflow/pull/4102 - In excel_parser.py, `total` means the total number of rows in the Excel file, but it is returned in the first iteration, which leads to a wrong `to_page` - In table.py, when the Excel file has multiple sheets, it is divided into multiple parts, each of size 3000; `data` may be empty because it was already recorded in the previous iteration. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-12-19 17:30:26 +08:00
Jin Hai	275b5d14f2	Fix json file parse (#4004 ) ### What problem does this PR solve? Fix json file parsing ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: jinhai <haijin.chn@gmail.com>	2024-12-12 20:34:46 +08:00
Zhichang Yu	1254ecf445	Added static check at PR CI (#3921 ) ### What problem does this PR solve? Added static check at PR CI ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring	2024-12-08 21:23:51 +08:00
Zhichang Yu	0d68a6cd1b	Fix errors detected by Ruff (#3918 ) ### What problem does this PR solve? Fix errors detected by Ruff ### Type of change - [x] Refactoring	2024-12-08 14:21:12 +08:00
Jin Hai	821fdf02b4	Fix parsing JSON file error (#3829 ) ### What problem does this PR solve? Close issue: #3828 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Signed-off-by: jinhai <haijin.chn@gmail.com>	2024-12-03 19:02:03 +08:00
Yuhao Tsui	7b6a5ffaff	Fix: page_chars attribute does not exist in some formats of PDF (#3796 ) ### What problem does this PR solve? In #3335 someone suggested upgrading to pdfplumber==0.11.1, but that didn't solve it. The problem is actually triggered by the special formatting in some of the PDFs. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-12-03 11:08:06 +08:00
Kevin Hu	7058ac0041	Fix out of boundary. (#3786 ) ### What problem does this PR solve? #3769 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-12-02 11:38:53 +08:00
Zhichang Yu	bc701d7b4c	Editing a chunk shall update it instead of inserting it (#3709 ) ### What problem does this PR solve? Editing a chunk shall update it instead of inserting it. Close #3679 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-11-28 13:00:38 +08:00
Zhichang Yu	2249d5d413	Always open text file for write with UTF-8 (#3688 ) ### What problem does this PR solve? Always open text file for write with UTF-8. Close #932 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-11-27 16:24:16 +08:00
Zhichang Yu	cad341e794	Added kb_id filter to knn. Fix #3458 (#3513 ) ### What problem does this PR solve? Added kb_id filter to knn. Fix #3458 - [x] Bug Fix (non-breaking change which fixes an issue)	2024-11-20 20:53:30 +08:00
Zhichang Yu	4413683898	Introduced beartype (#3460 ) ### What problem does this PR solve? Introduced [beartype](https://github.com/beartype/beartype) for runtime type-checking. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-11-18 17:38:17 +08:00
Jin Hai	1e90a1bf36	Move settings initialization after module init phase (#3438 ) ### What problem does this PR solve? 1. Module init no longer connects to the database. 2. Config values in settings need to be accessed as settings.CONFIG_NAME ### Type of change - [x] Refactoring Signed-off-by: jinhai <haijin.chn@gmail.com>	2024-11-15 17:30:56 +08:00
Zhichang Yu	30f6421760	Use consistent log file names, introduced initLogger (#3403 ) ### What problem does this PR solve? Use consistent log file names, introduced initLogger ### Type of change - [x] Refactoring	2024-11-14 17:13:48 +08:00
Kevin Hu	4caf932808	fix bug about fetching knowledge graph (#3394 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-11-14 12:29:15 +08:00
Zhichang Yu	a2a5631da4	Rework logging (#3358 ) Unified all log files into one. ### What problem does this PR solve? Unified all log files into one. ### Type of change - [x] Refactoring	2024-11-12 17:35:13 +08:00
kuschzzp	9c6cc20356	Fix:#3230 When parsing a docx file using the Book parsing method, to_page is always -1, resulting in a block count of 0 even if parsing is successful (#3249 ) ### What problem does this PR solve? When parsing a docx file using the Book parsing method, to_page is always -1, resulting in a block count of 0 even if parsing is successful Fix:#3230 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2024-11-08 09:21:42 +08:00
Kevin Hu	2d1fbefdb5	search between multiple indices for team function (#3079 ) ### What problem does this PR solve? #2834 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-10-29 13:19:01 +08:00
Kevin Hu	bfc07fe4f9	bigger resolution for OCR (#2919 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2024-10-21 16:25:42 +08:00
chongchuanbing	66172cef3e	fix: torch dependency start error (#2777 ) ### What problem does this PR solve? When using the slim image, remove the `torch` dependency. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: chongchuanbing <chongchuanbing@gmail.com> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2024-10-10 10:06:03 +08:00
Kevin Hu	daa65199e8	trivial (#2650 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-09-29 13:20:02 +08:00
Kevin Hu	fc867cb959	rename get_txt to get_text (#2649 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-09-29 12:47:09 +08:00
yqkcn	aea553c3a8	Add get_txt function (#2639 ) ### What problem does this PR solve? Add get_txt function to reduce duplicate code ### Type of change - [x] Refactoring --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2024-09-29 10:29:56 +08:00
liuhua	b68d349bd6	Fix: renrank_model and pdf_parser bugs | Update: session API (#2601 ) ### What problem does this PR solve? Fix: renrank_model and pdf_parser bugs | Update: session API #2575 #2559 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Co-authored-by: liuhua <10215101452@stu.ecun.edu.cn>	2024-09-26 16:05:25 +08:00
Kevin Hu	7bb28ca2bd	add lighten control (#2567 ) ### What problem does this PR solve? #2295 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-09-24 19:22:01 +08:00

119 Commits