183 Commits

Author SHA1 Message Date
Kevin Hu
93f5df716f
Fix: order chunks from docx by positions. (#7979)
### What problem does this PR solve?

#7934

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-05-30 17:20:53 +08:00
Yongteng Lei
bd4678bca6
Fix: Unnecessary truncation in markdown parser (#7972)
### What problem does this PR solve?

Fix unnecessary truncation in markdown parser. So that markdown can work
perfectly like
[this](https://github.com/infiniflow/ragflow/issues/7824#issuecomment-2921312576)
in #7824, supporting multiple special delimiters.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-05-30 15:04:21 +08:00
Yongteng Lei
46963ab1ca
Fix: add advanced delimiter detection for naive merge (#7941)
### What problem does this PR solve?

Add advanced delimiter detection for naive merge. #7824

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2025-05-29 16:17:22 +08:00
Yongteng Lei
0c562f0a9f
Refa: change citation mark as [ID:n] (#7923)
### What problem does this PR solve?

Change citation mark as [ID:n], it's easier for LLMs to follow the
instruction :) #7904

### Type of change

- [x] Refactoring
2025-05-29 10:03:51 +08:00
Sol
0d7cfce6e1
Update rag/nlp/query.py (#7816)
### What problem does this PR solve?
Fix tokenizer resulting in low recall

![37743d3a495f734aa69f1e173fa77457](https://github.com/user-attachments/assets/1394757e-8fcb-4f87-96af-a92716144884)

![4aba633a17f34269a4e17e84fafb34c4](https://github.com/user-attachments/assets/a1828e32-3e17-4394-a633-ba3f09bd506d)

![image](https://github.com/user-attachments/assets/61308f32-2a4f-44d5-a034-d65bbec554ef)



### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-05-23 17:13:37 +08:00
Stephen Hu
db4371c745
Fix: Improve First Chunk Size (#7806)
### What problem does this PR solve?

https://github.com/infiniflow/ragflow/issues/7790

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-05-23 14:30:19 +08:00
Emmanuel Ferdman
d4a123d6dd
Fix: resolve regex library warnings (#7782)
### What problem does this PR solve?
This small PR resolves the regex library warnings showing in Python3.11:
```python
DeprecationWarning: 'count' is passed as positional argument
```

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-05-22 10:06:28 +08:00
Kevin Hu
321a280031
Feat: add image preview to retrieval test. (#7610)
### What problem does this PR solve?

#7608

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-05-13 14:30:36 +08:00
Stephen Hu
573d46a4ef
FIX:ZeroDivisionError when using large page_size in client.retrieve() (#7595)
### What problem does this PR solve?

Close #7592

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-05-13 10:46:31 +08:00
Kevin Hu
a14865e6bb
Fix: empty query issue. (#7551)
### What problem does this PR solve?

#5214

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-05-09 12:20:19 +08:00
Kevin Hu
c7310f7fb2
Refa: similarity calculations. (#7381)
### What problem does this PR solve?


### Type of change

- [x] Refactoring
2025-04-28 19:17:11 +08:00
Stephen Hu
1662c7eda3
Feat: Markdown add image (#7124)
### What problem does this PR solve?

https://github.com/infiniflow/ragflow/issues/6984

1. Markdown parser supports get pictures
2. For Native, when handling Markdown, it will handle images
3. improve merge and 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-04-25 18:35:28 +08:00
Yongteng Lei
67dee2d74e
Fix: fix retrieval tesing wrong pagination (#7174)
### What problem does this PR solve?

Fix retrieval testing wrong pagination. #7171 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-04-22 15:16:04 +08:00
alulala
d9266ed65a
Fix: incorrect total chunks count in retrieval function after similarity filtering (#6741) (#6932)
### Related Issue:
https://github.com/infiniflow/ragflow/issues/6741

### Environment:
Using nightly version
Commit version:
[[6051abb](6051abb4a3)]

### Bug Description:
The retrieval function in rag/nlp/search.py returns the original total
chunks number
even after chunks are filtered by similarity_threshold. This creates
inconsistency
between the actual returned chunks and the reported total.

### Changes Made:
Added code to count how many search results actually meet or exceed the
configured similarity threshold
Positioned the calculation after the doc_ids conditional logic to ensure
special cases are handled correctly
Updated the ranks["total"] value to store this filtered count instead of
using the raw search result count
Using NumPy leverages optimized C-level batch operations to optimize
speed
2025-04-11 12:31:36 +08:00
kaiyuan Zhang
ead5f7aba9
Fix infinite recursion in RagTokenizer when processing repetitive characters (#6109)
### What problem does this PR solve?
fix #6085 
RagTokenizer's dfs_() function falls into infinite recursion when
processing text with repetitive Chinese characters (e.g.,
"一一一一一十一十一十一..." or "一一一一一一十十十十十十十二十二十二..."), causing memory leaks.
### Type of change
Implemented three optimizations to the dfs_() function:
1.Added memoization with _memo dictionary to cache computed results
2.Added recursion depth limiting with _depth parameter (max 10 levels)
3.Implemented special handling for repetitive character sequences
- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-04-01 13:59:52 +08:00
Kevin Hu
0758c04941
Refa: token similarity calculations. (#6614)
### What problem does this PR solve?

#6507

### Type of change

- [x] Performance Improvement
2025-03-28 09:33:08 +08:00
Kevin Hu
cc8029a732
Fix: uploading in chat box issue. (#6547)
### What problem does this PR solve?

#6228

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-26 15:37:48 +08:00
Kevin Hu
ee5aa51d43
Fix: point in tag issue. (#6436)
### What problem does this PR solve?

#6414

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-24 10:45:29 +08:00
Kevin Hu
a087d13ccb
Feat: text file support position retaining. (#6231)
### What problem does this PR solve?

#5832

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-03-18 16:55:11 +08:00
Kevin Hu
6e8d0e3177
Fix: rank feat issue. (#6225)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-18 16:07:29 +08:00
Kevin Hu
1333d3c02a
Fix: float transfer exception. (#6197)
### What problem does this PR solve?

#6177

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-18 11:13:44 +08:00
Kevin Hu
fabc5e9259
Refa: fix re-rank scope. (#6152)
### What problem does this PR solve?

#6140

### Type of change


- [x] Refactoring
2025-03-17 13:26:29 +08:00
Kevin Hu
e5a8b23684
Fix: empty tag field issue. (#6103)
### What problem does this PR solve?

#6102

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-14 17:35:57 +08:00
Kevin Hu
485bc7d7d6
Fix: limit the depth of DFS (#6101)
### What problem does this PR solve?

#6085

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-14 17:10:38 +08:00
Kevin Hu
e05cdc2f9c
Fix: encode detect error. (#6006)
### What problem does this PR solve?

#5967

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-13 10:47:58 +08:00
Kevin Hu
15736c57c3
Fix: empty query issue. (#5830)
### What problem does this PR solve?

#5214

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-10 13:56:56 +08:00
Kevin Hu
c190086707
Fix: bad case for tokenizer. (#5543)
### What problem does this PR solve?

#5492

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-03 15:36:16 +08:00
Kevin Hu
4f40f685d9
Code refactor (#5371)
### What problem does this PR solve?

#5173

### Type of change

- [x] Refactoring
2025-02-26 15:40:52 +08:00
Kevin Hu
53b9e7b52f
Add tavily as web searh tool. (#5349)
### What problem does this PR solve?

#5198

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-02-26 10:21:04 +08:00
Kevin Hu
daddfc9e1b
Remove dup gb2312, solve currupt error. (#5326)
### What problem does this PR solve?

#5252 
#5325

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-25 12:22:37 +08:00
Kevin Hu
3444cb15e3
Refine search query. (#5235)
### What problem does this PR solve?

#5173
#5214

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-21 18:32:32 +08:00
Kevin Hu
cdb3e6434a
Fix empty question issue. (#5225)
### What problem does this PR solve?

#5241

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-21 15:47:39 +08:00
Kevin Hu
7b3d700d5f
Apply agentic searching. (#5196)
### What problem does this PR solve?

#5173

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-02-20 17:41:01 +08:00
Kevin Hu
e6c024f8bf
Fix too many clause while searching. (#5119)
### What problem does this PR solve?

#5100

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-19 13:18:39 +08:00
ubbg
29a59ed7e2
Fix: Use self.dataStore.indexExist in all_tags method of Dealer (#5108)
### What problem does this PR solve?

This PR fixes an AttributeError in the all_tags method of the Dealer
class. Previously, the method incorrectly called
self.docStoreConn.indexExist instead of self.dataStore.indexExist. Since
self.docStoreConn was never set (and self.dataStore is already
initialized in init), this resulted in an error when attempting to check
if the index exists. This change ensures that the proper connector is
used for the index existence check, thereby resolving the issue._

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-19 11:50:57 +08:00
Kevin Hu
9ff825f39d
Ignore exceptions when no index ahead. (#5047)
### What problem does this PR solve?

### Type of change

- [x] Refactoring
2025-02-18 09:09:22 +08:00
Mathias Panzenböck
9bcccadebd
Remove use of eval() from search.py (#4887)
Use `json.loads()` instead.

### What problem does this PR solve?

Using `eval()` can lead to code injections. I think this loads a JSON
field, right? If yes, why is this done via `eval()` and not
`json.loads()`?

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-12 13:15:38 +08:00
Kevin Hu
f374dd38b6
Fix divided by zero issue. (#4784)
### What problem does this PR solve?

#4779

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-08 10:36:26 +08:00
Kevin Hu
448fa1c4d4
Robust for abnormal response from LLMs. (#4747)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-06 17:34:53 +08:00
Kevin Hu
6f2c3a3c3c
Fix too long query exception. (#4729)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-06 10:11:52 +08:00
Kevin Hu
4011c8f68c
Fix potential error. (#4650)
### What problem does this PR solve?
#4622

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-26 12:38:32 +08:00
Kevin Hu
86892959a0
Rebuild graph when it's out of time. (#4607)
### What problem does this PR solve?

#4543

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
2025-01-23 17:26:20 +08:00
Kevin Hu
dd0ebbea35
Light GraphRAG (#4585)
### What problem does this PR solve?

#4543

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-01-22 19:43:14 +08:00
Kevin Hu
c5da3cdd97
Tagging (#4426)
### What problem does this PR solve?

#4367

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-01-09 17:07:21 +08:00
Kevin Hu
d9a4e4cc3b
Fix page size error. (#4401)
### What problem does this PR solve?

#4400

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-07 19:06:31 +08:00
Kevin Hu
f948c0d9f1
Clean query. (#4259)
### What problem does this PR solve?

#4239

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-27 14:25:03 +08:00
Kevin Hu
7e063283ba
Removing invisible chars before tokenization. (#4233)
### What problem does this PR solve?

#4223

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-26 11:48:16 +08:00
Bo Liu
321e9f3719
fix: stop rerank by model when search result is empty (#4203)
### What problem does this PR solve?


stop rerank by model when search result is empty, otherwise rerank may
raise an error (qwen).

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: 刘博 <liubo@ynby.cn>
2024-12-24 14:33:46 +08:00
Kevin Hu
c373dba0bc
Fix raptor bug. (#4192)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-23 18:59:48 +08:00
Kevin Hu
31d67c850e
Fetch chunk by batches. (#4177)
### What problem does this PR solve?

#4173

### Type of change

- [x] Performance Improvement
2024-12-23 12:12:15 +08:00