### What problem does this PR solve?
_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._
### Type of change
- [x] Refactoring
---------
Signed-off-by: jinhai <haijin.chn@gmail.com>
### What problem does this PR solve?
_Choosing Laws Chunk Method results in an error when parsing a document.
The error is caused by a missing import in the `laws.py` file._
```
Traceback (most recent call last):
File "/ragflow/rag/svr/task_executor.py", line 445, in handle_task
do_handle_task(task)
File "/ragflow/rag/svr/task_executor.py", line 384, in do_handle_task
cks = build(r)
^^^^^^^^
File "/ragflow/rag/svr/task_executor.py", line 196, in build
cks = chunker.chunk(row["name"], binary=binary, from_page=row["from_page"],
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ragflow/rag/app/laws.py", line 161, in chunk
for txt, poss in pdf_parser(filename if not binary else binary,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ragflow/rag/app/laws.py", line 124, in __call__
logging.debug("layouts:".format(
^^^^^^^
NameError: name 'logging' is not defined. Did you forget to import 'logging'
```
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
Co-authored-by: Michal Masrna <m.marna1@gmail.com>
### What problem does this PR solve?
Use consistent log file names, introduced initLogger
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### What problem does this PR solve?
Add get_txt function to reduce duplicate code
### Type of change
- [x] Refactoring
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
Related source file is in Windows/DOS format, they are format to Unix
format.
### Type of change
- [x] Refactoring
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Optimize docx handle method in laws parser
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._
### Type of change
- [x] Documentation Update
- [x] Refactoring
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Add `.doc` file parser, using tika.
```
pip install tika
```
```
from tika import parser
from io import BytesIO
def extract_text_from_doc_bytes(doc_bytes):
file_like_object = BytesIO(doc_bytes)
parsed = parser.from_buffer(file_like_object)
return parsed["content"]
```
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: chrysanthemum-boy <fannc@qq.com>