Fix a "TypeError: expected string or buffer bug" in docx files extracted using Knowledge Graph.#1859 (#1865)

AI/ragflow

mirror of https://git.mirrors.martin98.com/https://github.com/infiniflow/ragflow.git synced 2025-08-12 06:38:59 +08:00

### What problem does this PR solve?

Fix a "TypeError: expected string or buffer bug" in docx files extracted
using Knowledge Graph. #1859
```
Traceback (most recent call last):
  File "//Users/XXX/ragflow/rag/svr/task_executor.py", line 149, in build
    cks = chunker.chunk(row["name"], binary=binary, from_page=row["from_page"],
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/XXX/ragflow/rag/app/knowledge_graph.py", line 18, in chunk
    chunks = build_knowlege_graph_chunks(tenant_id, sections, callback,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/XXX/ragflow/graphrag/index.py", line 87, in build_knowlege_graph_chunks
    tkn_cnt = num_tokens_from_string(chunks[i])
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/XXX/github/ragflow/rag/utils/__init__.py", line 79, in num_tokens_from_string
    num_tokens = len(encoder.encode(string))
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/XXX/tiktoken/core.py", line 116, in encode
    if match := _special_token_regex(disallowed_special).search(text):
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected string or buffer
```
This type is `Dict`
<img width="1689" alt="Pasted Graphic 3"
src="https://github.com/user-attachments/assets/e5ba5c45-df1d-4697-98c9-14365c839f20">
The correct type should be ` Str`
<img width="1725" alt="Pasted Graphic 2"
src="https://github.com/user-attachments/assets/e54d5e60-4ce4-4180-b394-24e485013534">

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

This commit is contained in:

Kung Quang

2024-08-08 12:03:01 +08:00

committed by

GitHub

parent ad6def4178

commit 19ded65c66

No known key found for this signature in database

GPG Key ID: B5690EEEBB952194

1 changed files with 3 additions and 0 deletions

									
										3

rag/app/naive.py
									
											View File
											
				@ -205,6 +205,9 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,

				                "chunk_token_num", 128)), parser_config.get(

				                "delimiter", "\n!?。；！？"))

				        if kwargs.get("section_only", False):

				            return chunks

				        res.extend(tokenize_chunks_docx(chunks, doc, eng, images))

				        cron_logger.info("naive_merge({}): {}".format(filename, timer() - st))

				        return res

Fix a "TypeError: expected string or buffer bug" in docx files extracted using Knowledge Graph.#1859 (#1865)

3 rag/app/naive.py Unescape Escape View File

3

rag/app/naive.py

View File