49 Commits

Author SHA1 Message Date
Kevin Hu
7bb28ca2bd
add lighten control (#2567)
### What problem does this PR solve?

#2295

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2024-09-24 19:22:01 +08:00
Vitaliy Groshev
7e75b9d778
fix parsing spaces in Russian-language PDFs (#1987) (#2427)
### What problem does this PR solve?

[#1987](https://github.com/infiniflow/ragflow/issues/1987)

When scanning PDF files character by character, the parser dropped
spaces if the string did not match a regex. Text from [Russian
documents](https://github.com/user-attachments/files/16659706/dogovor_oferta.pdf)
needs spaces, but it does not match the regex because it uses a different
alphabet. As a result, such PDFs were parsed incorrectly and were almost
unusable as a source. Fixed by adding the Russian alphabet to the regex.

There might be problems with other languages that use different
alphabets. I additionally tested a [PDF in
Spanish](https://www.scusd.edu/sites/main/files/file-attachments/howtohelpyourchildsucceedinschoolspanish.pdf?1338307816),
and the old `[a-zA-Z...]` regex parses it correctly, with spaces.
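
The behavior described above can be illustrated with a minimal sketch. The two patterns and the `join_chars` helper are hypothetical and written only for this example; the parser's actual regex and merging logic may differ. The sketch assumes a space is kept only when the character before it matches the pattern.

```python
import re

# Hypothetical patterns for illustration only; not the parser's real regex.
LATIN_ONLY = re.compile(r"[0-9a-zA-Z,.:;!%]")
# The ranges а-я / А-Я do not cover ё / Ё, so those are listed explicitly.
LATIN_AND_CYRILLIC = re.compile(r"[0-9a-zA-Zа-яА-ЯёЁ,.:;!%]")


def join_chars(chars, keep_space_after):
    """Join characters one by one, keeping a space only when the
    previously emitted character matches `keep_space_after`."""
    out = []
    for ch in chars:
        if ch == " " and not (out and keep_space_after.match(out[-1])):
            continue  # space is dropped, as in the reported bug
        out.append(ch)
    return "".join(out)


russian = list("договор оферты")
print(join_chars(russian, LATIN_ONLY))          # space is lost
print(join_chars(russian, LATIN_AND_CYRILLIC))  # space is preserved
```

Under this assumption, any script outside the character class would lose its spaces in the same way, which matches the concern about other alphabets.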

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-09-14 13:14:39 +08:00
H
0cb588f7bf
Fix docx parser line bug (#1715)
### What problem does this PR solve?
#1704 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2024-07-29 10:06:02 +08:00
Jason Lee
ebdd71ce68
fix: When parsing the bold content in PDF, the result is duplicated. (#1729)
### What problem does this PR solve?

_fix: When parsing the bold content in PDF, the result is duplicated._

Details: [When using OCR to recognize Chinese titles, the structure
appears to be
duplicated](https://github.com/infiniflow/ragflow/issues/1718)

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-29 09:43:05 +08:00
H
b24abee364
Fix pdfparser content confusion (#1700)
### What problem does this PR solve?

#1407 #1656 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-25 11:40:23 +08:00
Kevin Hu
100b3165d8
pypdf2 to pypdf (#1684)
### What problem does this PR solve?

pypdf and PyPDF2 possible Infinite Loop when a comment isn't followed by
a character #59

### Type of change

- [x] Refactoring
2024-07-24 12:38:48 +08:00
Kevin Hu
d29fd52e14
fix divide-by-zero bug (#1482)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-12 12:59:56 +08:00
Yuhao Tsui
7f4c63d102
fix: Delete hardcode (#1464)
### What problem does this PR solve?

After the language of the PDF is detected, a later line hardcodes the
language to Chinese. This PR removes that hardcoded value.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-11 15:41:31 +08:00
H
2290c2a2f0
fix pdf_parser char content confusion (#1462)
### What problem does this PR solve?

#1407 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-11 14:37:55 +08:00
H
dbb8f7b77b
fix pdf_parser content confusion (#1458)
### What problem does this PR solve?

#1407 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-11 12:36:55 +08:00
Zhedong Cen
45853505bb
Fix occasional errors in pdf table recognition (#1277)
### What problem does this PR solve?

Fix occasional errors in pdf table recognition

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-06-27 14:37:58 +08:00
KevinHuSh
4454ba7a1e
add self-rag (#1070)
### What problem does this PR solve?

#1069 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-06-06 11:13:39 +08:00
Jin Hai
cdea1d0a85
Update readme and add license (#1018)
### What problem does this PR solve?

- Update readme
- Add license

### Type of change

- [x] Documentation Update

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-06-01 16:24:10 +08:00
KevinHuSh
843720f958
fix bug in pdf parser (#986)
### What problem does this PR solve?

#963 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-30 11:47:36 +08:00
KevinHuSh
7eee193956
fix #917 #915 (#946)
### What problem does this PR solve?

#917 
#915

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-28 11:13:02 +08:00
xinzhuang
3bbdf3b770
fix bug in computing the 'not concating feature' (#896)
### What problem does this PR solve?

When the PDF parser calls the `_naive_vertical_merge` method, one "not
concating feature" value is meant to be computed from the difference
between the `layoutno` of `b` and `b_`, but the code actually compares
`b` with `b`. I think this is a bug, so this PR fixes it. Please
double-check.
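
A minimal sketch of the kind of mistake described above. The box dictionaries and both helper functions are hypothetical; only the `layoutno` key and the names `b` and `b_` come from the description, and the real feature in `_naive_vertical_merge` may be computed differently.

```python
def layout_differs_buggy(b, b_):
    """Compares b with itself, so the feature can never fire."""
    return b.get("layoutno", 0) != b.get("layoutno", 0)


def layout_differs_fixed(b, b_):
    """Compares the current box b with the neighbouring box b_."""
    return b.get("layoutno", 0) != b_.get("layoutno", 0)


# Two adjacent text boxes that sit in different layout regions.
b = {"text": "first block", "layoutno": 1}
b_ = {"text": "second block", "layoutno": 2}

print(layout_differs_buggy(b, b_))  # False: the difference is never seen
print(layout_differs_fixed(b, b_))  # True: the boxes should not be merged
```

With the self-comparison, the check is always false, so a layout change between two boxes never counts against merging them.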

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-23 14:29:42 +08:00
KevinHuSh
99be226c7c
fix coordinate error (#686)
### What problem does this PR solve?

#683 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-08 20:00:14 +08:00
KevinHuSh
cab274f560
remove PyMuPDF (#618)
### What problem does this PR solve?
#613 

### Type of change


- [x] Other (please describe):
2024-04-30 12:38:09 +08:00
KevinHuSh
8c07992b6c
refine code (#595)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-04-28 19:13:33 +08:00
KevinHuSh
d589b0f568
fix exception in pdf parser (#584)
### What problem does this PR solve?
#451 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-28 14:23:53 +08:00
KevinHuSh
9d60a84958
refactor code (#583)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-04-28 13:19:54 +08:00
KevinHuSh
66f8d35632
Refactor (#537)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-04-25 14:14:28 +08:00
KevinHuSh
0dfc8ddc0f
enlarge docker memory usage (#501)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-04-23 14:41:10 +08:00
KevinHuSh
962c66714e
fix divide by zero bug (#447)
### What problem does this PR solve?

#445 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-19 11:26:38 +08:00
加帆
39f1feaccb
Bug fix pdf parse index out of range (#440)
### What problem does this PR solve?

Fix a bug that occurs when parsing some PDF files (#436)

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-19 08:44:51 +08:00
KevinHuSh
0499a3f621
rm page number exception for pdf parser (#424)
### What problem does this PR solve?

#423 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-18 12:09:56 +08:00
KevinHuSh
453c29170f
make sure the models will not be loaded twice (#422)
### What problem does this PR solve?

#381 
### Type of change

- [x] Refactoring
2024-04-18 09:37:23 +08:00
KevinHuSh
a5384446e3
let's load model from local (#163) 2024-03-28 16:10:47 +08:00
KevinHuSh
fd7fcb5baf
apply PEP 8 formatting (#155) 2024-03-27 11:33:46 +08:00
KevinHuSh
979b3a5b4b
support snapshot download from local (#153)
* support snapshot download from local

* let snapshot download from local
2024-03-27 09:53:42 +08:00
KevinHuSh
da21320b88
fix plainPdf bugs (#152) 2024-03-26 15:11:07 +08:00
KevinHuSh
71fe314955
refine page ranges (#147) 2024-03-25 13:11:57 +08:00
KevinHuSh
f6aee7f230
add use layout or not option (#145)
* add use layout or not option

* trivial
2024-03-22 19:21:09 +08:00
KevinHuSh
6c6b144de2
refine manual parser (#140) 2024-03-21 18:17:32 +08:00
KevinHuSh
6999598101
refine for English corpus (#135) 2024-03-20 16:56:16 +08:00
KevinHuSh
9a843667b3
fix github account login issue (#132) 2024-03-19 15:31:47 +08:00
KevinHuSh
9da671b951
refine manual parser (#131) 2024-03-19 12:26:04 +08:00
KevinHuSh
675a9f8d9a
add dockerfile for CUDA environment. Refine table search strategy (#123) 2024-03-14 19:45:29 +08:00
KevinHuSh
8f86ab9f7f
refine pdf parser, add time zone to userinfo (#112) 2024-03-08 11:24:24 +08:00
KevinHuSh
602038ac49
fix task canceling bug (#98) 2024-03-05 16:33:47 +08:00
KevinHuSh
8a57f2afd5
change callback strategy, add timezone to docker (#96) 2024-03-05 12:08:41 +08:00
KevinHuSh
7bfaf0df29
fix position extraction bug (#93)
* fix position extraction bug

* remove delimiter for naive parser
2024-03-04 17:08:35 +08:00
KevinHuSh
685b4d8a95
fix table desc bugs, add positions to chunks (#91) 2024-03-04 14:42:26 +08:00
KevinHuSh
8a726fb04b
solve task execution issues (#90) 2024-03-01 19:48:01 +08:00
KevinHuSh
3d4315c42a
resolve the issue of naive parser (#87) 2024-02-29 18:53:02 +08:00
KevinHuSh
0429107e80
fix user login issue (#85) 2024-02-29 14:03:07 +08:00
KevinHuSh
4568a4b2cb
refine admin initialization (#75) 2024-02-27 14:57:34 +08:00
KevinHuSh
d32322c081
rename vision, add layout and tsr recognizer (#70)
* rename vision, add layout and tsr recognizer

* trivial fixing
2024-02-22 19:11:37 +08:00
KevinHuSh
cacd36c5e1
use onnx models, new deepdoc (#68) 2024-02-21 16:32:38 +08:00