diff --git a/docs/guides/configure_knowledge_base.md b/docs/guides/configure_knowledge_base.md
index edfebb468..d1f627e99 100644
--- a/docs/guides/configure_knowledge_base.md
+++ b/docs/guides/configure_knowledge_base.md
@@ -107,8 +107,8 @@ RAGFlow features visibility and explainability, allowing you to view the chunkin
 ![update chunk](https://github.com/infiniflow/ragflow/assets/93570324/1d84b408-4e9f-46fd-9413-8c1059bf9c76)
 
-:::caution NOTE
-You can add keywords to a file chunk to increase its relevance. This action increases its keyword weight and can improve its position in search list.
+:::caution NOTE
+You can add keywords to a file chunk to increase its ranking for queries containing those keywords. This action increases its keyword weight and can improve its position in the search list.
 :::
 
 4. In Retrieval testing, ask a quick question in **Test text** to double check if your configurations work:
diff --git a/docs/quickstart.mdx b/docs/quickstart.mdx
index 21cad0ef0..31fe2fcd5 100644
--- a/docs/quickstart.mdx
+++ b/docs/quickstart.mdx
@@ -307,7 +307,7 @@ RAGFlow features visibility and explainability, allowing you to view the chunkin
 ![update chunk](https://github.com/infiniflow/ragflow/assets/93570324/1d84b408-4e9f-46fd-9413-8c1059bf9c76)
 
 :::caution NOTE
-You can add keywords to a file chunk to increase its relevance. This action increases its keyword weight and can improve its position in search list.
+You can add keywords to a file chunk to improve its ranking for queries containing those keywords. This action increases its keyword weight and can improve its position in the search list.
 :::
 
 4. In Retrieval testing, ask a quick question in **Test text** to double check if your configurations work:
diff --git a/web/src/locales/en.ts b/web/src/locales/en.ts
index adfd893a4..a47e37104 100644
--- a/web/src/locales/en.ts
+++ b/web/src/locales/en.ts
@@ -158,9 +158,9 @@ export default {
   html4excel: 'Excel to HTML',
   html4excelTip: `Excel will be parsed into HTML table or not. If it's FALSE, every row in Excel will be formed as a chunk.`,
   autoKeywords: 'Auto-keyword',
-  autoKeywordsTip: `Extract N keywords for each chunk to improve their ranking for queries containing those keywords. You can check or update the added keywords for a chunk from the chunk list. Be aware that extra tokens will be consumed by the LLM specified in 'System model settings'.`,
+  autoKeywordsTip: `Extract N keywords for each chunk to increase their ranking for queries containing those keywords. You can check or update the added keywords for a chunk from the chunk list. Be aware that extra tokens will be consumed by the LLM specified in 'System model settings'.`,
   autoQuestions: 'Auto-question',
-  autoQuestionsTip: `Extract N questions for each chunk to improve their ranking for queries containing those questions. You can check or update the added questions for a chunk from the chunk list. This feature will not disrupt the chunking process if an error occurs, except that it may add an empty result to the original chunk. Be aware that extra tokens will be consumed by the LLM specified in 'System model settings'.`,
+  autoQuestionsTip: `Extract N questions for each chunk to increase their ranking for queries containing those questions. You can check or update the added questions for a chunk from the chunk list. This feature will not disrupt the chunking process if an error occurs, except that it may add an empty result to the original chunk. Be aware that extra tokens will be consumed by the LLM specified in 'System model settings'.`,
 },
 knowledgeConfiguration: {
   titleDescription:
@@ -210,13 +210,13 @@ export default {
 We assume that the manual has a hierarchical section structure, using the lowest section titles as basic unit for chunking documents. Therefore, figures and tables in the same section will not be separated, which may result in larger chunk sizes.
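The 'manual' description in the hunk above says the lowest section titles serve as the basic chunking unit, so figures and tables stay with their section. A minimal sketch of that idea, assuming numbered headings such as `1.2 Title` (the function and regex are illustrative, not RAGFlow's code):

```python
# Illustrative sketch, not RAGFlow's implementation: chunk a manual at its
# numbered section titles so a section's figures/tables stay in one chunk.
import re

def chunk_by_sections(lines: list[str]) -> list[str]:
    """Start a new chunk at each numbered heading, e.g. '2.3.1 Title'."""
    chunks, current = [], []
    for line in lines:
        # A heading such as "1.1 Setup" opens a new chunk.
        if re.match(r"^\d+(\.\d+)*\s+\S", line):
            if current:
                chunks.append("\n".join(current))
            current = [line]
        else:
            current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = ["1 Overview", "intro text", "1.1 Setup", "step one", "table row"]
print(chunk_by_sections(doc))  # two chunks, split at "1.1 Setup"
```

A real implementation would additionally track heading depth so only the lowest-level titles split the text; this sketch splits at every heading for brevity.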

 `,
     naive: `
 Supported file formats are DOCX, EXCEL, PPT, IMAGE, PDF, TXT, MD, JSON, EML, HTML.
 
-This method chunks files using the 'naive' way:
+This method chunks files using a 'naive' method:
 
 • Use vision detection model to split the texts into smaller segments.
 • Then, combine adjacent segments until the token count exceeds the threshold specified by 'Chunk token number', at which point a chunk is created.
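The merging step described in those two bullets can be sketched as follows (illustrative only; the whitespace token count stands in for the real tokenizer):

```python
# Sketch of 'naive' chunking, not RAGFlow's actual code: combine adjacent
# segments until the running token count exceeds 'Chunk token number',
# at which point a chunk is emitted.

def merge_segments(segments: list[str], chunk_token_num: int = 128) -> list[str]:
    chunks, buf, tokens = [], [], 0
    for seg in segments:
        buf.append(seg)
        tokens += len(seg.split())   # crude stand-in for a tokenizer
        if tokens >= chunk_token_num:  # threshold reached: close the chunk
            chunks.append(" ".join(buf))
            buf, tokens = [], 0
    if buf:                            # flush the trailing partial chunk
        chunks.append(" ".join(buf))
    return chunks

segs = ["alpha beta", "gamma delta", "epsilon"]
print(merge_segments(segs, chunk_token_num=4))
# → ['alpha beta gamma delta', 'epsilon']
```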
 `,
     paper: `
 Only PDF file is supported.
 
 Papers will be split by section, such as abstract, 1.1, 1.2.
 
-This approach enables the LLM to summarize the paper more effectively and provide more comprehensive, understandable responses.
+This approach enables the LLM to summarize the paper more effectively and to provide more comprehensive, understandable responses. However, it also increases the context for AI conversations and adds to the computational cost for the LLM. So during a conversation, consider reducing the value of 'topN'.
 `,
     presentation: `
 Supported file formats are PDF, PPTX.
 
 Every page in the slides is treated as a chunk, with its thumbnail image stored.
@@ -261,25 +261,23 @@ export default {

 • Every row in table will be treated as a chunk.
 `,
     picture: `
-Image files are supported. Video is coming soon.
-If the picture has text in it, OCR is applied to extract the text as its text description.
+Image files are supported, with video support coming soon.
+This method employs an OCR model to extract texts from images.
-If the text extracted by OCR is not enough, visual LLM is used to get the descriptions.
+If the text extracted by the OCR model is deemed insufficient, a specified visual LLM will be used to provide a description of the image.
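The OCR-then-visual-LLM fallback described in those added lines amounts to a simple dispatch; a sketch, with both model calls as hypothetical stand-ins:

```python
# Illustrative sketch, not RAGFlow's code: try OCR first, and fall back to a
# visual LLM description when the OCR output is deemed insufficient.

def ocr_extract(image: bytes) -> str:
    return "serial no. 1234"  # placeholder for a real OCR model call

def vlm_describe(image: bytes) -> str:
    return "a photo of a device label"  # placeholder for a visual LLM call

def picture_to_text(image: bytes, min_chars: int = 32) -> str:
    text = ocr_extract(image)
    # The 'min_chars' cutoff is an assumed sufficiency heuristic.
    return text if len(text) >= min_chars else vlm_describe(image)

print(picture_to_text(b""))  # OCR text is too short, so the VLM answers
```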

 `,
     one: `
 Supported file formats are DOCX, EXCEL, PDF, TXT.
 
-For a document, it will be treated as an entire chunk, no split at all.
+This method treats each document in its entirety as a chunk.
-If you want to summarize something that needs all the context of an article and the selected LLM's context length covers the document length, you can try this method.
+Suitable when you need the LLM to summarize the entire document, provided it can handle that much context.
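The proviso in that added line is a simple feasibility check: the whole document, plus room for the prompt and answer, must fit in the model's context window. A sketch (the `reserve` margin is an assumption, not a RAGFlow parameter):

```python
# Illustrative check for the 'one' method: is the whole document small
# enough for the selected LLM's context window?

def fits_in_context(doc_tokens: int, model_context: int, reserve: int = 1024) -> bool:
    # 'reserve' keeps hypothetical headroom for the prompt and the answer.
    return doc_tokens + reserve <= model_context

print(fits_in_context(6000, 8192))     # True: the document fits
print(fits_in_context(200_000, 8192))  # False: pick another chunk method
```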

 `,
     knowledgeGraph: `
 Supported file formats are DOCX, EXCEL, PPT, IMAGE, PDF, TXT, MD, JSON, EML
 
-After files being chunked, it uses chunks to extract knowledge graph and mind map of the entire document. This method apply the naive ways to chunk files:
-Successive text will be sliced into pieces each of which is around 512 token number.
-
-Next, chunks will be transmited to LLM to extract nodes and relationships of a knowledge graph, and a mind map.
-
-Mind the entiry type you need to specify.
+
+This approach chunks files using the 'naive'/'General' method. It splits a document into segments and then combines adjacent segments until the token count exceeds the threshold specified by 'Chunk token number', at which point a chunk is created.
+
+The chunks are then fed to the LLM to extract nodes and relationships for a knowledge graph and a mind map.
+
+Ensure that you set the Entity types.
 `,
     useRaptor: 'Use RAPTOR to enhance retrieval',
     useRaptorTip: 'Recursive Abstractive Processing for Tree-Organized Retrieval, see https://huggingface.co/papers/2401.18059 for more information',
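The knowledgeGraph flow described above — naive chunks fed to an LLM, whose extracted nodes and relationships are merged into one graph — can be sketched as follows. The triple extractor is a hypothetical stand-in for the LLM call, which in the real system would be prompted with the chunk text and the configured Entity types:

```python
# Illustrative pipeline for the knowledgeGraph method, not RAGFlow's
# implementation: merge per-chunk (head, relation, tail) triples into a graph.

def extract_triples(chunk: str) -> list[tuple[str, str, str]]:
    # Placeholder for the LLM extraction step.
    return [("RAGFlow", "uses", "knowledge graph")] if "RAGFlow" in chunk else []

def build_graph(chunks: list[str]) -> dict:
    nodes, edges = set(), []
    for chunk in chunks:
        for head, rel, tail in extract_triples(chunk):
            nodes.update([head, tail])   # deduplicate entities across chunks
            edges.append((head, rel, tail))
    return {"nodes": sorted(nodes), "edges": edges}

print(build_graph(["RAGFlow chunk text", "unrelated chunk"]))
```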