86 Commits

Author SHA1 Message Date
Zhichang Yu
bc701d7b4c
Edit chunk shall update instead of insert it (#3709)
### What problem does this PR solve?

Edit chunk shall update instead of insert it. Close #3679 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-28 13:00:38 +08:00
Zhichang Yu
2249d5d413
Always open text file for write with UTF-8 (#3688)
### What problem does this PR solve?

Always open text file for write with UTF-8. Close #932 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-27 16:24:16 +08:00
Zhichang Yu
cad341e794 Added kb_id filter to knn. Fix #3458 (#3513)
### What problem does this PR solve?

Added kb_id filter to knn. Fix #3458

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-20 20:53:30 +08:00
Zhichang Yu
4413683898
Introduced beartype (#3460)
### What problem does this PR solve?

Introduced [beartype](https://github.com/beartype/beartype) for runtime
type-checking.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-11-18 17:38:17 +08:00
Jin Hai
1e90a1bf36
Move settings initialization after module init phase (#3438)
### What problem does this PR solve?

1. Module init won't connect database any more.
2. Config in settings need to be used with settings.CONFIG_NAME

### Type of change

- [x] Refactoring

Signed-off-by: jinhai <haijin.chn@gmail.com>
2024-11-15 17:30:56 +08:00
Zhichang Yu
30f6421760
Use consistent log file names, introduced initLogger (#3403)
### What problem does this PR solve?

Use consistent log file names, introduced initLogger

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2024-11-14 17:13:48 +08:00
Kevin Hu
4caf932808
fix bug about fetching knowledge graph (#3394)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-14 12:29:15 +08:00
Zhichang Yu
a2a5631da4
Rework logging (#3358)
Unified all log files into one.

### What problem does this PR solve?

Unified all log files into one.

### Type of change

- [x] Refactoring
2024-11-12 17:35:13 +08:00
kuschzzp
9c6cc20356
Fix:#3230 When parsing a docx file using the Book parsing method, to_page is always -1, resulting in a block count of 0 even if parsing is successful (#3249)
### What problem does this PR solve?

When parsing a docx file using the Book parsing method, to_page is
always -1, resulting in a block count of 0 even if parsing is successful

Fix:#3230

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2024-11-08 09:21:42 +08:00
Kevin Hu
2d1fbefdb5
search between multiple indiices for team function (#3079)
### What problem does this PR solve?

#2834 
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-10-29 13:19:01 +08:00
Kevin Hu
bfc07fe4f9
bigger resolution for OCR (#2919)
### What problem does this PR solve?



### Type of change

- [x] Performance Improvement
2024-10-21 16:25:42 +08:00
chongchuanbing
66172cef3e
fix: torch dependency start error (#2777)
### What problem does this PR solve?

when use slim image, remove ```torch``` denpendency.

### Type of change

- [✓] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: chongchuanbing <chongchuanbing@gmail.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2024-10-10 10:06:03 +08:00
Kevin Hu
daa65199e8
trival (#2650)
### What problem does this PR solve?

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-09-29 13:20:02 +08:00
Kevin Hu
fc867cb959
rename get_txt to get_text (#2649)
### What problem does this PR solve?



### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-09-29 12:47:09 +08:00
yqkcn
aea553c3a8
Add get_txt function (#2639)
### What problem does this PR solve?

Add get_txt function to reduce duplicate code

### Type of change

- [x] Refactoring

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2024-09-29 10:29:56 +08:00
liuhua
b68d349bd6
Fix: renrank_model and pdf_parser bugs | Update: session API (#2601)
### What problem does this PR solve?

Fix: renrank_model and pdf_parser bugs | Update: session API
#2575
#2559
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring

---------

Co-authored-by: liuhua <10215101452@stu.ecun.edu.cn>
2024-09-26 16:05:25 +08:00
Kevin Hu
7bb28ca2bd
add lighten control (#2567)
### What problem does this PR solve?

#2295

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2024-09-24 19:22:01 +08:00
Vitaliy Groshev
7e75b9d778
fix parsing spaces in russian language PDFs (#1987) (#2427)
### What problem does this PR solve?

[#1987](https://github.com/infiniflow/ragflow/issues/1987)

When scanning PDF files character by character, the parser excluded
spaces if the string did not match regex. Text from [Russian
documents](https://github.com/user-attachments/files/16659706/dogovor_oferta.pdf)
needs spaces, but it does not match the regex because it uses different
alphabet. That's why PDFs were parsed incorrectly and were almost
unusable as source. Fixed that by adding Russian alphabet to regex.

There might be problems with other languages that use different
alphabets. I additionally tested [PDF in
Spanish](https://www.scusd.edu/sites/main/files/file-attachments/howtohelpyourchildsucceedinschoolspanish.pdf?1338307816)
and old [a-zA-Z...] regex parses it correctly with spaces.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-09-14 13:14:39 +08:00
Kevin Hu
a0b7c78dca
optimize text parser (#2144)
### What problem does this PR solve?


### Type of change

- [x] Performance Improvement
2024-08-28 18:11:19 +08:00
Jin Hai
6b3a40be5c
Format file format from Windows/dos to Unix (#1949)
### What problem does this PR solve?

Related source file is in Windows/DOS format, they are format to Unix
format.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-08-15 09:17:36 +08:00
Kevin Hu
77f0fb03e3
fix parameter error (#1925)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-08-13 11:42:38 +08:00
Kevin Hu
cafdee536f
add sql to naive parser (#1908)
### What problem does this PR solve?


### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2024-08-12 15:29:33 +08:00
黄腾
ede733e130
add support for eml file parser (#1768)
### What problem does this PR solve?

add support for eml file parser
#1363

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Zhedong Cen <cenzhedong2@126.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2024-08-06 16:42:14 +08:00
H
0cb588f7bf
Fix docx parser line bug (#1715)
### What problem does this PR solve?
#1704 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2024-07-29 10:06:02 +08:00
Jason Lee
ebdd71ce68
fix: When parsing the bold content in PDF, the result is duplicated. (#1729)
### What problem does this PR solve?

_fix: When parsing the bold content in PDF, the result is duplicated._

the detail: [When using OCR to recognize Chinese titles, the structure
appears to be
duplicated](https://github.com/infiniflow/ragflow/issues/1718)

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-29 09:43:05 +08:00
H
b24abee364
Fix pdfparser content confusion (#1700)
### What problem does this PR solve?

#1407 #1656 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-25 11:40:23 +08:00
Kevin Hu
100b3165d8
pypdf2 to pypdf (#1684)
### What problem does this PR solve?

pypdf and PyPDF2 possible Infinite Loop when a comment isn't followed by
a character #59

### Type of change

- [x] Refactoring
2024-07-24 12:38:48 +08:00
Kevin Hu
95821f6fb6
fix bug of ragflowdocxpparser (#1642)
### What problem does this PR solve?

#1627

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-23 09:25:32 +08:00
Kevin Hu
2b5812d0a9
fix generate error (#1590)
### What problem does this PR solve?

#1550 #1210 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-18 14:33:30 +08:00
Kevin Hu
d29fd52e14
fix bug about divided by zero (#1482)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-12 12:59:56 +08:00
Yuhao Tsui
7f4c63d102
fix: Delete hardcode (#1464)
### What problem does this PR solve?

After checking the language of the pdf, the line will hardcode the
language into Chinese

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-11 15:41:31 +08:00
H
2290c2a2f0
fix pdf_paser char content confusion (#1462)
### What problem does this PR solve?

#1407 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-11 14:37:55 +08:00
H
dbb8f7b77b
fix pdf_parser content confusion (#1458)
### What problem does this PR solve?

#1407 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-11 12:36:55 +08:00
Zhedong Cen
a95c1d45f0
Support table for markdown file in general parser (#1278)
### What problem does this PR solve?

Support extracting table for markdown file in general parser

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-06-27 14:38:35 +08:00
Zhedong Cen
45853505bb
Fix occasional errors in pdf table recognition (#1277)
### What problem does this PR solve?

Fix occasional errors in pdf table recognition

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-06-27 14:37:58 +08:00
Wang Baoling
18f4a6b35c
feat: support json file (#1217)
### What problem does this PR solve?

feat: support json file.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: KevinHuSh <kevinhu.sh@gmail.com>
2024-06-21 10:42:29 +08:00
KevinHuSh
e35f7610e7
fix too long query exception (#1195)
### What problem does this PR solve?

#1161 
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-06-18 09:50:59 +08:00
KevinHuSh
4454ba7a1e
add self-rag (#1070)
### What problem does this PR solve?

#1069 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-06-06 11:13:39 +08:00
Jin Hai
cdea1d0a85
Update readme and add license (#1018)
### What problem does this PR solve?

- Update readme
- Add license

### Type of change

- [x] Documentation Update

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-06-01 16:24:10 +08:00
KevinHuSh
843720f958
fix bug in pdf parser (#986)
### What problem does this PR solve?

#963 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-30 11:47:36 +08:00
KevinHuSh
0171082cc5
fix create dialog bug (#982)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-30 09:25:05 +08:00
Zhedong Cen
8dd45459be
Add support for HTML file (#973)
### What problem does this PR solve?

Add support for HTML file

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-05-30 09:12:55 +08:00
KevinHuSh
7eee193956
fix #917 #915 (#946)
### What problem does this PR solve?

#917 
#915

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-28 11:13:02 +08:00
xinzhuang
3bbdf3b770
fixbug for computing 'not concating feature' (#896)
### What problem does this PR solve?

When pdfparser call `_naive_vertical_merge` method,there is a "not
concating feature " value by computing difference between `b` and `b_`'s
layoutno ,but actually is `b` and `b`. I think it's a bug, so fix it.
Please check again.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-23 14:29:42 +08:00
KevinHuSh
a12fcf9156
fix minio helth bug (#850)
### What problem does this PR solve?

#643 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-20 19:35:30 +08:00
GYH
c27c02ea67
Split Excel file into different chunks (#847)
### What problem does this PR solve?


Split Excel into different chunk
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-05-20 18:35:15 +08:00
KevinHuSh
99be226c7c
fix coordinate error (#686)
### What problem does this PR solve?

#683 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-08 20:00:14 +08:00
KevinHuSh
7013d7f620
refine text decode (#657)
### What problem does this PR solve?
#651 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-07 12:25:47 +08:00
KevinHuSh
cab274f560
remove PyMuPDF (#618)
### What problem does this PR solve?
#613 

### Type of change


- [x] Other (please describe):
2024-04-30 12:38:09 +08:00
KevinHuSh
8c07992b6c
refine code (#595)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-04-28 19:13:33 +08:00