0321 chunkmethods (#6520)

### What problem does this PR solve?

#6061 

### Type of change


- [x] Documentation Update
writinwaters 2025-03-26 09:03:18 +08:00 committed by GitHub
parent bf483fdf02
commit d17970ebd0
14 changed files with 73 additions and 61 deletions

View File

@ -125,10 +125,12 @@ TIMEZONE='Asia/Shanghai'
# Uncomment the following line if your operating system is MacOS:
# MACOS=1
# The maximum file size for each uploaded file, in bytes.
# To change the 1GB file size limit, uncomment the following line and make your changes accordingly.
# The maximum file size limit (in bytes) for each upload to your knowledge base or File Management.
# To change the 1GB file size limit, uncomment the line below and update as needed.
# MAX_CONTENT_LENGTH=1073741824
# After the change, ensure you update `client_max_body_size` in nginx/nginx.conf correspondingly.
# After updating, ensure `client_max_body_size` in nginx/nginx.conf is updated accordingly.
# Note that neither `MAX_CONTENT_LENGTH` nor `client_max_body_size` sets the maximum size for files uploaded to an agent.
# See https://ragflow.io/docs/dev/begin_component for details.
# The log level for the RAGFlow's owned packages and imported packages.
# Available level:
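
For illustration, below is a minimal sketch of raising the limit to 2 GB in both places. It assumes a standard `ragflow/docker` checkout and GNU sed; the exact existing lines in `.env` and `nginx/nginx.conf` may differ, so check the patterns before running.

```bash
# Sketch only: raise the upload limit to 2 GB (2147483648 bytes) and keep nginx in sync.
cd ragflow/docker   # assumed location of the Docker Compose setup

# Uncomment/set MAX_CONTENT_LENGTH in .env (value is in bytes).
sed -i 's/^# *MAX_CONTENT_LENGTH=.*/MAX_CONTENT_LENGTH=2147483648/' .env

# Update the matching nginx limit (human-readable size).
sed -i 's/client_max_body_size[^;]*;/client_max_body_size 2048m;/' nginx/nginx.conf

# Recreate the containers so both settings take effect.
docker compose up -d --force-recreate
```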

View File

@ -50,6 +50,10 @@ If your agent's **Begin** component takes a variable, you *cannot* embed it into
- **boolean**: Requires the user to toggle between on and off.
- **Optional**: A toggle indicating whether the variable is optional.
:::danger IMPORTANT
If you set the key type as **file**, ensure the token count of the uploaded file does not exceed your model provider's maximum token limit; otherwise, the plain text in your file will be truncated and incomplete.
:::
## Examples
As mentioned earlier, the **Begin** component is indispensable for an agent. Still, you can take a look at our three-step interpreter agent template, where the **Begin** component takes two global variables:
@ -64,7 +68,7 @@ As mentioned earlier, the **Begin** component is indispensable for an agent. Sti
### Is the uploaded file in a knowledge base?
No. Files uploaded to an agent as input are not stored in a knowledge base and will not be chunked using RAGFlow's built-in chunk methods. However, RAGFlow's built-in OSR, DLR, and TSR models will still be applied to process the document.
No. Files uploaded to an agent as input are not stored in a knowledge base and hence will not be processed using RAGFlow's built-in OCR, DLR, or TSR models, or chunked using RAGFlow's built-in chunk methods.
### How to upload a webpage or file from a URL?
@ -74,5 +78,8 @@ If you set the type of a variable as **file**, your users will be able to upload
### File size limit for an uploaded file
The maximum file size for each uploaded file is determined by the variable `MAX_CONTENT_LENGTH` in `/docker/.env`. It defaults to 128 MB. If you change the default file size, ensure you also update the value of `client_max_body_size` in `/docker/nginx/nginx.conf` accordingly.
There is no *specific* file size limit for a file uploaded to an agent. However, note that model providers typically have a default or explicit maximum token setting, which can range from 8192 to 128k. The plain text of the uploaded file is passed in as the key's value, but if the file's token count exceeds this limit, the string will be truncated and incomplete.
:::tip NOTE
The variables `MAX_CONTENT_LENGTH` in `/docker/.env` and `client_max_body_size` in `/docker/nginx/nginx.conf` set the file size limit for each upload to a knowledge base or **File Management**. These settings DO NOT apply in this scenario.
:::
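
If you are unsure whether a file will fit within your provider's token limit, a rough pre-check can help. The sketch below is an approximation only: it assumes roughly four characters per token for English plain text (a common rule of thumb), not your provider's actual tokenizer, and `report.txt` is a placeholder file name.

```bash
# Rough token estimate for a plain-text file before uploading it to an agent.
FILE=report.txt                 # placeholder; replace with your file
chars=$(wc -c < "$FILE")        # byte count; close enough to character count for ASCII text
echo "approx. $((chars / 4)) tokens (provider limits commonly range from 8192 to 128k)"
```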

View File

@ -56,7 +56,11 @@ Click the dropdown menu of **Model** to show the model configuration window.
Typically, you use the system prompt to describe the task for the LLM, specify how it should respond, and outline other miscellaneous requirements. We do not plan to elaborate on this topic, as it can be as extensive as prompt engineering. However, please be aware that the system prompt is often used in conjunction with keys (variables), which serve as various data inputs for the LLM.
Keys in a system prompt should be enclosed in curly braces. Below is a prompt excerpt of a **Generate** component from the **Interpreter** template (component ID: **Reflect**):
:::danger IMPORTANT
A **Generate** component relies on keys (variables) to specify its data inputs. Its immediate upstream component is *not* necessarily its data input, and the arrows in the workflow indicate *only* the processing sequence. Keys in a **Generate** component are used in conjunction with the system prompt to specify data inputs for the LLM. Use a forward slash `/` or the **(x)** button to show the keys to use.
:::
Below is a prompt excerpt of a **Generate** component from the **Interpreter** template (component ID: **Reflect**):
```text
Your task is to read a source text and a translation to {target_lang}, and give constructive suggestions to improve the translation. The source text and initial translation, delimited by XML tags <SOURCE_TEXT></SOURCE_TEXT> and <TRANSLATION></TRANSLATION>, are as follows:
@ -76,11 +80,6 @@ When writing suggestions, pay attention to whether there are ways to improve the
Where `{source_text}` and `{target_lang}` are global variables defined by the **Begin** component, while `{translation_1}` is the output of another **Generate** component with the component ID **Translate directly**.
:::danger IMPORTANT
A **Generate** component relies on keys (variables) to specify its data inputs. Its immediate upstream component is *not* necessarily its data input, and the arrows in the workflow indicate *only* the processing sequence. Keys in a **Generate** component are used in conjunction with the system prompt to specify data inputs for the LLM. Use a forward slash `/` to show the keys to use.
:::
### Cite
This toggle sets whether to cite the original text as reference.

View File

@ -68,6 +68,10 @@ In a knowledge graph, a community is a cluster of entities linked by relationshi
_A **Knowledge graph** entry appears under **Configuration** once a knowledge graph is created._
3. Click **Knowledge graph** to view the details of the generated graph.
4. To use the created knowledge graph, do either of the following:
- In your **Chat Configuration** dialogue, click the **Assistant Setting** tab to add the corresponding knowledge base(s) and click the **Prompt Engine** tab to switch on the **Use knowledge graph** toggle.
- If you are using an agent, click the **Retrieval** agent component to specify the knowledge base(s) and switch on the **Use knowledge graph** toggle.
## Frequently asked questions

View File

@ -9,6 +9,22 @@ A complete reference for RAGFlow's RESTful API. Before proceeding, please ensure
---
## ERROR CODES
---
| Code | Message | Description |
|------|-----------------------|----------------------------|
| 400 | Bad Request | Invalid request parameters |
| 401 | Unauthorized | Unauthorized access |
| 403 | Forbidden | Access denied |
| 404 | Not Found | Resource not found |
| 500 | Internal Server Error | Server internal error |
| 1001 | Invalid Chunk ID | Invalid Chunk ID |
| 1002 | Chunk Update Failed | Chunk update failed |
---
## OpenAI-Compatible API
---
@ -531,24 +547,6 @@ Failure:
---
## Error Codes
---
| Code | Message | Description |
| ---- | --------------------- | -------------------------- |
| 400 | Bad Request | Invalid request parameters |
| 401 | Unauthorized | Unauthorized access |
| 403 | Forbidden | Access denied |
| 404 | Not Found | Resource not found |
| 500 | Internal Server Error | Server internal error |
| 1001 | Invalid Chunk ID | Invalid Chunk ID |
| 1002 | Chunk Update Failed | Chunk update failed |
---
---
## FILE MANAGEMENT WITHIN DATASET
---
@ -1771,7 +1769,7 @@ Lists chat assistants.
#### Request
- Method: GET
- URL: `/api/v1/chats?page={page}&page_size={page_size}&orderby={orderby}&desc={desc}&name={dataset_name}&id={dataset_id}`
- URL: `/api/v1/chats?page={page}&page_size={page_size}&orderby={orderby}&desc={desc}&name={chat_name}&id={chat_id}`
- Headers:
- `'Authorization: Bearer <YOUR_API_KEY>'`
@ -1779,7 +1777,7 @@ Lists chat assistants.
```bash
curl --request GET \
--url http://{address}/api/v1/chats?page={page}&page_size={page_size}&orderby={orderby}&desc={desc}&name={dataset_name}&id={dataset_id} \
--url http://{address}/api/v1/chats?page={page}&page_size={page_size}&orderby={orderby}&desc={desc}&name={chat_name}&id={chat_id} \
--header 'Authorization: Bearer <YOUR_API_KEY>'
```
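
As a usage sketch with placeholder values (the address and filter values are illustrative; the URL is quoted so the shell does not treat `&` as a command separator):

```bash
# List the first 10 chat assistants whose name matches "support_bot".
curl --request GET \
  --url 'http://127.0.0.1/api/v1/chats?page=1&page_size=10&name=support_bot' \
  --header 'Authorization: Bearer <YOUR_API_KEY>'
```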

View File

@ -18,6 +18,22 @@ pip install ragflow-sdk
---
## ERROR CODES
---
| Code | Message | Description |
|------|----------------------|-----------------------------|
| 400 | Bad Request | Invalid request parameters |
| 401 | Unauthorized | Unauthorized access |
| 403 | Forbidden | Access denied |
| 404 | Not Found | Resource not found |
| 500 | Internal Server Error | Server internal error |
| 1001 | Invalid Chunk ID | Invalid Chunk ID |
| 1002 | Chunk Update Failed | Chunk update failed |
---
## OpenAI-Compatible API
---
@ -317,23 +333,6 @@ dataset = rag_object.list_datasets(name="kb_name")
dataset.update({"embedding_model":"BAAI/bge-zh-v1.5", "chunk_method":"manual"})
```
---
## Error Codes
---
| Code | Message | Description |
|------|---------|-------------|
| 400 | Bad Request | Invalid request parameters |
| 401 | Unauthorized | Unauthorized access |
| 403 | Forbidden | Access denied |
| 404 | Not Found | Resource not found |
| 500 | Internal Server Error | Server internal error |
| 1001 | Invalid Chunk ID | Invalid Chunk ID |
| 1002 | Chunk Update Failed | Chunk update failed |
---
## FILE MANAGEMENT WITHIN DATASET

View File

@ -334,7 +334,7 @@ export default {
useRaptorTip:
'Rekursive Abstrakte Verarbeitung für Baumorganisierten Abruf, weitere Informationen unter https://huggingface.co/papers/2401.18059.',
prompt: 'Prompt',
promptTip: 'LLM-Prompt für die Zusammenfassung.',
promptTip: 'Verwenden Sie den Systemprompt, um die Aufgabe für das LLM zu beschreiben, festzulegen, wie es antworten soll, und andere verschiedene Anforderungen zu skizzieren. Der Systemprompt wird oft in Verbindung mit Schlüsseln (Variablen) verwendet, die als verschiedene Dateninputs für das LLM dienen. Verwenden Sie einen Schrägstrich `/` oder die (x)-Schaltfläche, um die zu verwendenden Schlüssel anzuzeigen.',
promptMessage: 'Prompt ist erforderlich',
promptText: `Bitte fassen Sie die folgenden Absätze zusammen. Seien Sie vorsichtig mit den Zahlen, erfinden Sie keine Dinge. Absätze wie folgt:
{cluster_content}
@ -372,6 +372,7 @@ export default {
<li>Sie müssen Tag-Sets in bestimmten Formaten hochladen, bevor Sie die Auto-Tag-Funktion ausführen.</li>
<li>Die Auto-Schlüsselwort-Funktion ist vom LLM abhängig und verbraucht eine erhebliche Anzahl an Tokens.</li>
</ul>
<p>Siehe https://ragflow.io/docs/dev/use_tag_sets für Details.</p>
`,
topnTags: 'Top-N Tags',
tags: 'Tags',

View File

@ -326,7 +326,7 @@ export default {
useRaptorTip:
'Recursive Abstractive Processing for Tree-Organized Retrieval, see https://huggingface.co/papers/2401.18059 for more information.',
prompt: 'Prompt',
promptTip: 'LLM prompt used for summarization.',
promptTip: 'Use the system prompt to describe the task for the LLM, specify how it should respond, and outline other miscellaneous requirements. The system prompt is often used in conjunction with keys (variables), which serve as various data inputs for the LLM. Use a forward slash `/` or the (x) button to show the keys to use.',
promptMessage: 'Prompt is required',
promptText: `Please summarize the following paragraphs. Be careful with the numbers, do not make things up. Paragraphs as following:
{cluster_content}
@ -353,9 +353,9 @@ The above is the content you need to summarize.`,
tagTable: 'Table',
tagSet: 'Tag sets',
tagSetTip: `
<p> Select one or multiple tag knowledge bases to auto-tag chunks in your knowledge base. </p>
<p> Select one or multiple tag knowledge bases to auto-tag chunks in your knowledge base. See https://ragflow.io/docs/dev/use_tag_sets for details.</p>
<p>The user query will also be auto-tagged.</p>
This auto-tag feature enhances retrieval by adding another layer of domain-specific knowledge to the existing dataset.
This auto-tagging feature enhances retrieval by adding another layer of domain-specific knowledge to the existing dataset.
<p>Difference between auto-tag and auto-keyword:</p>
<ul>
<li>A tag knowledge base is a user-defined close set, whereas keywords extracted by the LLM can be regarded as an open set.</li>

View File

@ -291,7 +291,7 @@ export default {
useRaptorTip:
'Pemrosesan Abstraktif Rekursif untuk Pengambilan Terorganisasi Pohon, silakan merujuk ke https://huggingface.co/papers/2401.18059',
prompt: 'Prompt',
promptTip: 'Prompt LLM yang digunakan untuk merangkas.',
promptTip: 'Gunakan prompt sistem untuk menjelaskan tugas untuk LLM, tentukan bagaimana harus merespons, dan menguraikan persyaratan lainnya. Prompt sistem sering digunakan bersama dengan kunci (variabel), yang berfungsi sebagai berbagai input data untuk LLM. Gunakan garis miring `/` atau tombol (x) untuk menampilkan kunci yang digunakan.',
promptMessage: 'Prompt diperlukan',
promptText: `Silakan rangkum paragraf berikut. Berhati-hatilah dengan angka, jangan membuat hal-hal yang tidak ada. Paragraf sebagai berikut:
{cluster_content}

View File

@ -285,7 +285,7 @@ export default {
useRaptorTip:
'ツリー構造化検索のための再帰的抽象処理RAPTORについては、詳細はhttps://huggingface.co/papers/2401.18059をご覧ください',
prompt: 'プロンプト',
promptTip: '要約に使用されるLLMプロンプト。',
promptTip: 'LLMのタスクを説明し、どのように応答すべきかを指定し、他のさまざまな要件を概説するためにシステムプロンプトを使用します。システムプロンプトは、LLMのさまざまなデータ入力として機能するキー変数と共に使用されることがよくあります。使用するキーを表示するには、スラッシュ `/` または (x) ボタンを使用します。',
promptMessage: 'プロンプトは必須です',
promptText: `以下の段落を要約してください。数字には注意し、事実を捏造しないでください。段落は以下の通りです:
{cluster_content}

View File

@ -262,7 +262,7 @@ export default {
useRaptorTip:
'Processamento Abstrativo Recursivo para Recuperação Organizada em Árvore. Veja mais em https://huggingface.co/papers/2401.18059.',
prompt: 'Prompt',
promptTip: 'Prompt usado pelo LLM para sumarização.',
promptTip: 'Use o prompt do sistema para descrever a tarefa para o LLM, especificar como ele deve responder e esboçar outros requisitos diversos. O prompt do sistema é frequentemente usado em conjunto com chaves (variáveis), que servem como várias entradas de dados para o LLM. Use uma barra `/` ou o botão (x) para mostrar as chaves a serem usadas.',
promptMessage: 'O prompt é obrigatório',
promptText: `Por favor, resuma os seguintes parágrafos. Tenha cuidado com os números, não invente informações. Os parágrafos são os seguintes:
{cluster_content}
@ -297,7 +297,8 @@ export default {
<li>As etiquetas são um conjunto fechado definido pelo usuário, enquanto palavras-chave são um conjunto aberto.</li>
<li>É necessário enviar conjuntos de etiquetas com exemplos antes de usá-los.</li>
<li>Palavras-chave são geradas pelo LLM, o que é caro e demorado.</li>
</ul>`,
</ul>
<p>Consulte https://ragflow.io/docs/dev/use_tag_sets para obter detalhes.</p>`,
topnTags: 'Top-N Etiquetas',
tags: 'Etiquetas',
addTag: 'Adicionar etiqueta',

View File

@ -296,7 +296,7 @@ export default {
useRaptorTip:
'Recursive Abstractive Processing for Tree-Organized Retrieval, xem https://huggingface.co/papers/2401.18059 để biết thêm thông tin',
prompt: 'Nhắc nhở',
promptTip: 'Nhắc nhở LLM được sử dụng để tóm tắt.',
promptTip: 'Sử dụng lời nhắc hệ thống để mô tả nhiệm vụ cho LLM, chỉ định cách nó nên phản hồi và phác thảo các yêu cầu khác nhau. Lời nhắc hệ thống thường được sử dụng kết hợp với các khóa (biến), đóng vai trò là các đầu vào dữ liệu khác nhau cho LLM. Sử dụng dấu gạch chéo `/` hoặc nút (x) để hiển thị các khóa cần sử dụng.',
promptMessage: 'Nhắc nhở là bắt buộc',
promptText: `Vui lòng tóm tắt các đoạn văn sau. Cẩn thận với các số, đừng bịa ra. Các đoạn văn như sau:
{cluster_content}
@ -329,7 +329,7 @@ export default {
searchTags: 'Thẻ tìm kiếm',
tagTable: 'Bảng',
tagSet: 'Thư viện',
tagSetTip: `<p>Việc chọn các cơ sở kiến thức 'Tag' giúp gắn thẻ cho từng đoạn.</p> <p>Truy vấn đến các đoạn đó cũng sẽ kèm theo thẻ.</p> Quy trình này sẽ cải thiện độ chính xác của việc truy xuất bằng cách thêm nhiều thông tin hơn vào bộ dữ liệu, đặc biệt là khi có một tập hợp lớn các đoạn. <p>Sự khác biệt giữa thẻ và từ khóa:</p> <ul> <li>Thẻ là một tập hợp khép kín được người dùng định nghĩa và thao tác trong khi từ khóa là một tập hợp mở.</li> <li>Bạn cần tải lên các tập hợp thẻ với các mẫu trước khi sử dụng.</li> <li>Từ khóa được tạo bởi LLM, tốn kém và mất thời gian.</li> </ul>`,
tagSetTip: `<p>Việc chọn các cơ sở kiến thức 'Tag' giúp gắn thẻ cho từng đoạn.</p> <p>Truy vấn đến các đoạn đó cũng sẽ kèm theo thẻ.</p> Quy trình này sẽ cải thiện độ chính xác của việc truy xuất bằng cách thêm nhiều thông tin hơn vào bộ dữ liệu, đặc biệt là khi có một tập hợp lớn các đoạn. <p>Sự khác biệt giữa thẻ và từ khóa:</p> <ul> <li>Thẻ là một tập hợp khép kín được người dùng định nghĩa và thao tác trong khi từ khóa là một tập hợp mở.</li> <li>Bạn cần tải lên các tập hợp thẻ với các mẫu trước khi sử dụng.</li> <li>Từ khóa được tạo bởi LLM, tốn kém và mất thời gian.</li> </ul><p>Xem https://ragflow.io/docs/dev/use_tag_sets để biết thêm chi tiết.</p>`,
topnTags: 'Thẻ Top-N',
tags: 'Thẻ',
addTag: 'Thêm thẻ',

View File

@ -327,7 +327,7 @@ export default {
maxClusterMessage: '最大聚類數是必填項',
randomSeed: '隨機種子',
randomSeedMessage: '隨機種子是必填項',
promptTip: 'LLM提示用於總結。',
promptTip: '系統提示為大型模型提供任務描述、規定回覆方式,以及設定其他各種要求。系統提示通常與 key（變數）合用，透過變數設定大型模型的輸入資料。你可以透過斜線或 (x) 按鈕顯示可用的 key。',
maxTokenTip: '用於匯總的最大token數。',
thresholdTip: '閾值越大,聚類越少。',
maxClusterTip: '最大聚類數。',
@ -352,6 +352,7 @@ export default {
<li>執行自動標籤功能之前，必須先上傳特定格式的標籤集。</li>
<li>自動關鍵詞功能依賴 LLM，會消耗大量 Token。</li>
</ul>
<p>詳見 https://ragflow.io/docs/dev/use_tag_sets。</p>
`,
tags: '標籤',
addTag: '增加標籤',

View File

@ -344,7 +344,7 @@ export default {
maxClusterMessage: '最大聚类数是必填项',
randomSeed: '随机种子',
randomSeedMessage: '随机种子是必填项',
promptTip: 'LLM提示用于总结。',
promptTip: '系统提示为大模型提供任务描述、规定回复方式,以及设置其他各种要求。系统提示通常与 key (变量)合用,通过变量设置大模型的输入数据。你可以通过斜杠或者 (x) 按钮显示可用的 key。',
maxTokenTip: '用于汇总的最大token数。',
thresholdTip: '阈值越大,聚类越少。',
maxClusterTip: '最大聚类数。',
@ -360,7 +360,7 @@ export default {
tagSet: '标签集',
topnTags: 'Top-N 标签',
tagSetTip: `
<p> </p>
<p> </p>
<p> </p>
<p></p>
<p></p>