KevinHuSh
7013d7f620
refine text decode ( #657 )
...
### What problem does this PR solve?
#651
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-07 12:25:47 +08:00
KevinHuSh
8c07992b6c
refine code ( #595 )
...
### What problem does this PR solve?
### Type of change
- [x] Refactoring
2024-04-28 19:13:33 +08:00
Jin Hai
f1c98aad6b
Update version info ( #564 )
...
### What problem does this PR solve?
_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._
### Type of change
- [x] Documentation Update
- [x] Refactoring
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-04-26 20:07:26 +08:00
chrysanthemum-boy
72384b191d
Add .doc file parser. ( #497 )
...
### What problem does this PR solve?
Add `.doc` file parser, using tika.
```
pip install tika
```
```
from tika import parser
from io import BytesIO

def extract_text_from_doc_bytes(doc_bytes):
    # Wrap the raw bytes in a file-like object before handing them to tika.
    file_like_object = BytesIO(doc_bytes)
    parsed = parser.from_buffer(file_like_object)
    return parsed["content"]
```
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: chrysanthemum-boy <fannc@qq.com>
2024-04-23 15:31:43 +08:00
KevinHuSh
0dfc8ddc0f
enlarge docker memory usage ( #501 )
...
### What problem does this PR solve?
### Type of change
- [x] Refactoring
2024-04-23 14:41:10 +08:00
KevinHuSh
a38e163035
remove doc from supported processing types ( #488 )
...
### What problem does this PR solve?
#474
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-22 15:46:09 +08:00
KevinHuSh
ed6081845a
Fit a lot of encodings for text file. ( #458 )
...
### What problem does this PR solve?
#384
### Type of change
- [x] Performance Improvement
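The idea of fitting several encodings for a text file can be sketched as below. This is only an illustration and not the code from this PR: the function name `decode_text` and the candidate encoding list are assumptions chosen for the example.

```python
def decode_text(raw: bytes, encodings=("utf-8", "gb18030", "latin-1")) -> str:
    """Try each candidate encoding in order and return the first clean decode.

    Hypothetical helper: the name and the default encoding list are
    assumptions for illustration, not taken from the project.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # No candidate matched: fall back to UTF-8 and drop undecodable bytes.
    return raw.decode("utf-8", errors="ignore")
```

Strict encodings should come first; `latin-1` accepts any byte sequence, so it only makes sense as the last candidate.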
2024-04-19 18:02:53 +08:00
KevinHuSh
f6c7204002
refine log format ( #312 )
...
### What problem does this PR solve?
Issue link:#264
### Type of change
- [x] Documentation Update
- [x] Refactoring
2024-04-11 10:13:43 +08:00
KevinHuSh
a0a480b708
continue adding layout model for 'laws' ( #292 )
...
### What problem does this PR solve?
Issue link:#289
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
2024-04-10 14:06:36 +08:00
KevinHuSh
243de6ac90
add a new model for 'Laws' ( #290 )
...
### What problem does this PR solve?
Issue link:#289
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
2024-04-10 11:59:00 +08:00
KevinHuSh
fd7fcb5baf
apply PEP 8 formatting ( #155 )
2024-03-27 11:33:46 +08:00
KevinHuSh
da21320b88
fix plainPdf bugs ( #152 )
2024-03-26 15:11:07 +08:00
KevinHuSh
f6aee7f230
add use layout or not option ( #145 )
...
* add use layout or not option
* trivial
2024-03-22 19:21:09 +08:00
KevinHuSh
d7c362f237
adjust hierarchical_merge strategy ( #100 )
2024-03-06 09:09:16 +08:00
KevinHuSh
602038ac49
fix task canceling bug ( #98 )
2024-03-05 16:33:47 +08:00
KevinHuSh
8a57f2afd5
change callback strategy, add timezone to docker ( #96 )
2024-03-05 12:08:41 +08:00
KevinHuSh
685b4d8a95
fix table desc bugs, add positions to chunks ( #91 )
2024-03-04 14:42:26 +08:00
KevinHuSh
7fd1eca582
init README of deepdoc, add picture processor. ( #71 )
...
* init README of deepdoc, add picture processor.
* add resume parsing
2024-02-23 18:28:12 +08:00
KevinHuSh
cacd36c5e1
use onnx models, new deepdoc ( #68 )
2024-02-21 16:32:38 +08:00
KevinHuSh
a8294f2168
Refine resume parts and fix bugs in retrieval using sql ( #66 )
2024-02-19 19:22:17 +08:00
KevinHuSh
407b2523b6
remove unused code, separate layout detection out as a new api. Add new rag method 'table' ( #55 )
2024-02-05 18:08:17 +08:00
KevinHuSh
51482f3e2a
Some document API refined. ( #53 )
...
Add naive chunking method to RAG
2024-02-02 19:21:37 +08:00
KevinHuSh
e6acaf6738
Add Q&A and Book, fix task running bugs ( #50 )
2024-02-01 18:53:56 +08:00
KevinHuSh
6224edcd1b
Add task module, and pipeline the task and every parser ( #49 )
2024-01-31 19:57:45 +08:00
KevinHuSh
96a1a44cb6
add paper & manual parser ( #46 )
2024-01-30 18:28:09 +08:00
KevinHuSh
072f9dd5bc
Add app to rag module: presentation & laws ( #43 )
2024-01-25 18:57:39 +08:00