136 Commits

Author SHA1 Message Date
Kevin Hu
c28bc41a96
Fix docx table issue. (#5117)
### What problem does this PR solve?

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-19 12:40:06 +08:00
Kevin Hu
c24137bd11
Fix too long integer for Table. (#4651)
### What problem does this PR solve?

#4594

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-26 12:54:58 +08:00
Kevin Hu
9d717f0b6e
Fix csv reader exception. (#4628)
### What problem does this PR solve?

#4552
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-24 14:47:19 +08:00
Kevin Hu
13f04b7cca
Fix pdf applying Q&A issue. (#4599)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-23 12:30:46 +08:00
Kevin Hu
dd0ebbea35
Light GraphRAG (#4585)
### What problem does this PR solve?

#4543

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-01-22 19:43:14 +08:00
Jin Hai
3894de895b
Update comments (#4569)
### What problem does this PR solve?

Add license statement.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-01-21 20:52:28 +08:00
Kevin Hu
f556f0239c
Fix dify retrieval issue. (#4473)
### What problem does this PR solve?

#4464
#4469 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-14 13:16:05 +08:00
Kevin Hu
e098fcf6ad
Fix csv for TAG. (#4454)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-13 12:03:18 +08:00
Kevin Hu
c5da3cdd97
Tagging (#4426)
### What problem does this PR solve?

#4367

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-01-09 17:07:21 +08:00
Yingfeng
50f209204e
Synchronize with enterprise version (#4325)
### Type of change

- [x] Refactoring
2025-01-02 13:44:44 +08:00
Kevin Hu
8fb18f37f6
Code refactor. (#4291)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-12-30 18:38:51 +08:00
TeslaZY
dd13a5d05c
Fix some bugs in text2sql.(#4279)(#4281) (#4280)
### What problem does this PR solve?
- Fix incorrect results when parsing CSV files of the QA knowledge base in
the text2sql scenario. CSV files are now processed with the csv library, and
CSV parsing is decoupled from TXT parsing.
- Most LLMs return results in markdown format (```sql query ```). Fix the
execution error caused by LLM output that wraps the SQL in markdown.

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-30 10:32:19 +08:00
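A minimal sketch of the two text2sql fixes described in the entry above. It assumes nothing about the actual RAGFlow code: `parse_qa_csv` and `strip_sql_fence` are hypothetical names introduced only for illustration. The first reads question/answer rows with Python's `csv` module so that quoted commas survive; the second unwraps a SQL statement from a markdown code fence before it is executed.

```python
import csv
import io
import re

# Built at runtime so this example contains no literal fence markers.
FENCE = "`" * 3

# Matches an optional "sql" language tag and captures the fenced body.
_FENCE_RE = re.compile(FENCE + r"(?:sql)?\s*(.*?)" + FENCE, re.IGNORECASE | re.DOTALL)


def parse_qa_csv(text: str) -> list[tuple[str, str]]:
    """Hypothetical helper: return (question, answer) pairs from CSV text."""
    pairs = []
    for row in csv.reader(io.StringIO(text)):
        # Skip rows that do not carry both a question and an answer.
        if len(row) >= 2 and row[0].strip() and row[1].strip():
            pairs.append((row[0].strip(), row[1].strip()))
    return pairs


def strip_sql_fence(reply: str) -> str:
    """Hypothetical helper: drop a markdown fence around an LLM's SQL reply."""
    match = _FENCE_RE.search(reply)
    return (match.group(1) if match else reply).strip()
```

Splitting a line on commas by hand would cut a quoted field such as `"How many users, in total?"` in two, which is the kind of error a dedicated CSV parser avoids.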
ly0303521
101b8ff813
fix chunk method "Table" losing content when the Excel file has multiple sheets (#4123)

### What problem does this PR solve?
Discussed in https://github.com/infiniflow/ragflow/pull/4102
- In excel_parser.py, `total` should be the total number of rows in the
Excel file, but it was returned during the first iteration, which led to a
wrong `to_page`.
- In table.py, when the Excel file has multiple sheets, it is divided into
multiple parts of 3000 rows each. `data` may then be empty, because its rows
were already recorded in the previous iteration.
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-19 17:30:26 +08:00
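The first point in the entry above can be sketched as follows. This is an illustration under stated assumptions, not the code in excel_parser.py: a workbook is modeled as a plain dict mapping sheet names to lists of rows, and `total_rows` and `batch_ranges` are hypothetical names.

```python
BATCH_SIZE = 3000  # part size mentioned in the PR description


def total_rows(sheets: dict[str, list]) -> int:
    """Hypothetical helper: count rows across every sheet of a workbook."""
    total = 0
    for rows in sheets.values():
        total += len(rows)
    # Return only after the loop; returning inside it would count one sheet.
    return total


def batch_ranges(total: int, size: int = BATCH_SIZE) -> list[tuple[int, int]]:
    """Split [0, total) into half-open (start, end) row ranges."""
    return [(start, min(start + size, total)) for start in range(0, total, size)]
```

With two sheets of 2000 and 2500 rows, a total taken from the first sheet alone would produce a single range and leave the second sheet's rows uncovered.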
liuhua
1d65299791
Fix rerank_model bug in chat and markdown bug (#4061)
### What problem does this PR solve?

Fix rerank_model bug in chat and markdown bug
#4000
#3992
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: liuhua <10215101452@stu.ecun.edu.cn>
2024-12-17 16:03:37 +08:00
Zhichang Yu
03f00c9e6f
Rename page_num_list, top_list, position_list (#3940)
### What problem does this PR solve?

Rename page_num_list, top_list, position_list to page_num_int, top_int,
position_int

### Type of change

- [x] Refactoring
2024-12-10 16:32:58 +08:00
Kevin Hu
927873bfa6
Fix syn error. (#3953)
### What problem does this PR solve?

Close #3696
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-10 10:54:54 +08:00
Zhichang Yu
0d68a6cd1b
Fix errors detected by Ruff (#3918)
### What problem does this PR solve?

Fix errors detected by Ruff

### Type of change

- [x] Refactoring
2024-12-08 14:21:12 +08:00
Jin Hai
821fdf02b4
Fix parsing JSON file error (#3829)
### What problem does this PR solve?

Close issue: #3828

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Signed-off-by: jinhai <haijin.chn@gmail.com>
2024-12-03 19:02:03 +08:00
Jin Hai
08c1a5e1e8
Refactor parse progress (#3781)
### What problem does this PR solve?

Refactor parse file progress

### Type of change

- [x] Refactoring

Signed-off-by: jinhai <haijin.chn@gmail.com>
2024-12-01 22:28:00 +08:00
Jin Hai
e079656473
Update progress info and start welcome info (#3768)
### What problem does this PR solve?

### Type of change

- [x] Refactoring

---------

Signed-off-by: jinhai <haijin.chn@gmail.com>
2024-11-30 18:48:06 +08:00
kuschzzp
e678819f70
Fix RGBA error (#3707)
### What problem does this PR solve?

**The image passed to cv_mdl.describe() was not converted to RGB**

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-28 13:09:02 +08:00
Zhichang Yu
bc701d7b4c
Edit chunk shall update instead of insert it (#3709)
### What problem does this PR solve?

Editing a chunk should update it instead of inserting a new one. Close #3679

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-28 13:00:38 +08:00
Kevin Hu
609236f5c1
Let 'One' applicable for tables in docx (#3619)
### What problem does this PR solve?

#3598

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Performance Improvement
2024-11-25 09:57:54 +08:00
Zhichang Yu
482c1b59c8
Check tika.parser return result (#3564)
### What problem does this PR solve?

Check tika.parser return result. Close #3229

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
2024-11-22 11:05:06 +08:00
Michal Masrna
c4f2464935
fix: laws.py added missing import logging (#3501)
### What problem does this PR solve?

_Choosing Laws Chunk Method results in an error when parsing a document.
The error is caused by a missing import in the `laws.py` file._

```
Traceback (most recent call last):
  File "/ragflow/rag/svr/task_executor.py", line 445, in handle_task
    do_handle_task(task)
  File "/ragflow/rag/svr/task_executor.py", line 384, in do_handle_task
    cks = build(r)
          ^^^^^^^^
  File "/ragflow/rag/svr/task_executor.py", line 196, in build
    cks = chunker.chunk(row["name"], binary=binary, from_page=row["from_page"],
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ragflow/rag/app/laws.py", line 161, in chunk
    for txt, poss in pdf_parser(filename if not binary else binary,
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ragflow/rag/app/laws.py", line 124, in __call__
    logging.debug("layouts:".format(
    ^^^^^^^
NameError: name 'logging' is not defined. Did you forget to import 'logging'

```

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

Co-authored-by: Michal Masrna <m.marna1@gmail.com>
2024-11-20 20:52:05 +08:00
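A minimal sketch of the kind of fix described in the entry above, not the actual contents of `laws.py`; `log_layouts` is a hypothetical name. The visible part of the traceback line uses a format string without a `{}` placeholder, and in Python such a call returns the string unchanged, so the sketch adds one. The real call is truncated in the traceback, so this is only an observation about what is shown.

```python
import logging  # module-level import; its absence caused the NameError above

logger = logging.getLogger(__name__)


def log_layouts(layouts: list) -> str:
    """Hypothetical helper: log the layouts and return the logged message."""
    # "{}" is needed for the value to appear in the message.
    message = "layouts: {}".format(layouts)
    logger.debug(message)
    return message
```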
Zhichang Yu
30f6421760
Use consistent log file names, introduced initLogger (#3403)
### What problem does this PR solve?

Use consistent log file names, introduced initLogger

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2024-11-14 17:13:48 +08:00
Kevin Hu
83c6b1f308
set DLA active for KG (#3386)
### What problem does this PR solve?

### Type of change


- [x] Refactoring
2024-11-13 16:59:19 +08:00
Zhichang Yu
a2a5631da4
Rework logging (#3358)
### What problem does this PR solve?

Unified all log files into one.

### Type of change

- [x] Refactoring
2024-11-12 17:35:13 +08:00
Zhichang Yu
f4c52371ab
Integration with Infinity (#2894)
### What problem does this PR solve?

Integration with Infinity

- Replaced ELASTICSEARCH with dataStoreConn
- Renamed deleteByQuery to delete
- Renamed bulk to upsertBulk
- getHighlight, getAggregation
- Fix KGSearch.search
- Moved Dealer.sql_retrieval to es_conn.py


### Type of change

- [x] Refactoring
2024-11-12 14:59:41 +08:00
Kevin Hu
f86826b7a0
refactor error message of qwen (#3074)
### What problem does this PR solve?
#3055

### Type of change
- [x] Refactoring
2024-10-29 10:08:08 +08:00
Kevin Hu
1fce6caf80
make titles in markdown not be split from the following content (#2971)
### What problem does this PR solve?

#2970 
### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2024-10-22 15:25:23 +08:00
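One way to keep a markdown title together with the content that follows it, sketched under assumptions: this is not the RAGFlow implementation, `sections_with_titles` is a hypothetical name, and only ATX headings (`#` to `######`) are recognized. A new section starts at a heading only when the current section already holds body text, so consecutive headings stay attached to the first paragraph below them.

```python
import re

# ATX heading: one to six "#" characters followed by whitespace.
HEADING_RE = re.compile(r"^#{1,6}\s")


def _has_body(lines: list[str]) -> bool:
    """True if any line is non-blank and is not itself a heading."""
    return any(line.strip() and not HEADING_RE.match(line) for line in lines)


def sections_with_titles(markdown: str) -> list[str]:
    """Hypothetical helper: split markdown so each title keeps its content."""
    sections = []
    current = []
    for line in markdown.splitlines():
        # Close the current section only if it already contains body text.
        if HEADING_RE.match(line) and _has_body(current):
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if any(line.strip() for line in current):
        sections.append("\n".join(current).strip())
    return sections
```

The sketch does not special-case fenced code, so a `# comment` line inside a code block would be treated as a heading.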
Kevin Hu
b540d41cdc
let presentation do raptor (#2838)
### What problem does this PR solve?

#2837

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-10-15 10:11:09 +08:00
lidp
20e63f8ec4
Fix docx images (#2756)
### What problem does this PR solve?

#2755 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-10-09 19:37:32 +08:00
yqkcn
570ad420a8
remove unused import (#2679)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2024-09-30 16:59:39 +08:00
Kevin Hu
fc867cb959
rename get_txt to get_text (#2649)
### What problem does this PR solve?



### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-09-29 12:47:09 +08:00
yqkcn
aea553c3a8
Add get_txt function (#2639)
### What problem does this PR solve?

Add get_txt function to reduce duplicate code

### Type of change

- [x] Refactoring

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2024-09-29 10:29:56 +08:00
yqkcn
34abcf7704
style: fix typo and format code (#2618)
### What problem does this PR solve?

- Fix typo
- Remove unused import
- Format code

### Type of change

- [x] Other (please describe): typo and format
2024-09-27 13:17:25 +08:00
Kevin Hu
78856703c4
make excel parsing configurable (#2517)
### What problem does this PR solve?

#2516

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-09-20 15:33:38 +08:00
Kevin Hu
01acc3fd5a
fix duplicated llm name between different suppliers (#2477)
### What problem does this PR solve?

#2465

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-09-18 16:09:22 +08:00
黄腾
e4765ebe0c
add support for markdown file in one parse way (#2052)
### What problem does this PR solve?

#2021 add support for markdown files in the 'One' parsing method

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

Co-authored-by: Zhedong Cen <cenzhedong2@126.com>
2024-08-22 15:32:35 +08:00
Jin Hai
6b3a40be5c
Format file format from Windows/dos to Unix (#1949)
### What problem does this PR solve?

The related source files were in Windows/DOS format; they are converted to
Unix format.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-08-15 09:17:36 +08:00
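The conversion named in the entry above amounts to replacing CRLF line endings with LF. A minimal sketch, assuming the files are processed as raw bytes; `dos_to_unix` and `convert_file` are hypothetical names, and the PR itself does not say which tool was used.

```python
from pathlib import Path


def dos_to_unix(data: bytes) -> bytes:
    """Replace Windows/DOS line endings (CRLF) with Unix ones (LF)."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: Path) -> bool:
    """Rewrite the file in place; return True if anything changed."""
    original = path.read_bytes()
    converted = dos_to_unix(original)
    if converted != original:
        path.write_bytes(converted)
        return True
    return False
```

Working on bytes rather than decoded text avoids any newline translation by the I/O layer while the file is read or written.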
Kevin Hu
d73a75506e
fix mind map bug (#1934)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-08-13 19:42:28 +08:00
Kevin Hu
cafdee536f
add sql to naive parser (#1908)
### What problem does this PR solve?


### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2024-08-12 15:29:33 +08:00
Kung Quang
19ded65c66
Fix a "TypeError: expected string or buffer" bug in docx files extracted using Knowledge Graph. #1859 (#1865)
### What problem does this PR solve?

Fix a "TypeError: expected string or buffer" bug in docx files extracted
using Knowledge Graph. #1859
```
Traceback (most recent call last):
  File "//Users/XXX/ragflow/rag/svr/task_executor.py", line 149, in build
    cks = chunker.chunk(row["name"], binary=binary, from_page=row["from_page"],
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/XXX/ragflow/rag/app/knowledge_graph.py", line 18, in chunk
    chunks = build_knowlege_graph_chunks(tenant_id, sections, callback,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/XXX/ragflow/graphrag/index.py", line 87, in build_knowlege_graph_chunks
    tkn_cnt = num_tokens_from_string(chunks[i])
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/XXX/github/ragflow/rag/utils/__init__.py", line 79, in num_tokens_from_string
    num_tokens = len(encoder.encode(string))
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/XXX/tiktoken/core.py", line 116, in encode
    if match := _special_token_regex(disallowed_special).search(text):
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected string or buffer
```
The value passed in has type `Dict`, but the correct type is `Str`.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2024-08-08 12:03:01 +08:00
黄腾
ede733e130
add support for eml file parser (#1768)
### What problem does this PR solve?

add support for eml file parser
#1363

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Zhedong Cen <cenzhedong2@126.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2024-08-06 16:42:14 +08:00
Kevin Hu
fe797bcc66
be better chunks before graphrag (#1811)
### What problem does this PR solve?

#1594

### Type of change

- [x] Refactoring
2024-08-05 16:21:52 +08:00
Kevin Hu
a5c03ccd4c
refine mindmap prompt (#1808)
### What problem does this PR solve?



### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-08-05 15:33:44 +08:00
H
d2213141e0
Fix graphrag callback (#1806)
### What problem does this PR solve?

#1800 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-08-05 14:44:54 +08:00
Kevin Hu
152072f900
Add graphrag (#1793)
### What problem does this PR solve?

#1594

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-08-02 18:51:14 +08:00
Yuhao Tsui
a973b9e01f
Fix: Embedding err when docx contains unsupported images (#1720)
### What problem does this PR solve?

Fix the embedding failure that occurs when a docx document contains
unsupported images.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2024-07-29 19:38:47 +08:00