Docs: From v0.13.0 onwards, markdown chunking is added to the General chunking method. (#7883)

### What problem does this PR solve?

### Type of change

- [x] Documentation Update
writinwaters 2025-05-27 16:33:14 +08:00 committed by GitHub
parent 590070e47d
commit 13528ec328
13 changed files with 14 additions and 14 deletions

@@ -41,7 +41,7 @@ RAGFlow offers multiple chunking template to facilitate chunking files of differ
| **Template** | Description | File format |
|--------------|-----------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|
-| General | Files are consecutively chunked based on a preset chunk token number. | DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML |
+| General | Files are consecutively chunked based on a preset chunk token number. | MD, MDX, DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML |
| Q&A | | XLSX, XLS (Excel 97-2003), CSV/TXT |
| Resume | Enterprise edition only. You can also try it out on demo.ragflow.io. | DOCX, PDF, TXT |
| Manual | | PDF |
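
For readers unfamiliar with token-budget chunking, the General template's description above can be sketched roughly as follows. This is an illustration only, not RAGFlow's implementation: the function name `chunk_by_token_budget`, the whitespace-based token counter, and the rule of closing a chunk once the budget is exceeded are all assumptions made for the example.

```python
def chunk_by_token_budget(segments, max_tokens, count_tokens=None):
    """Merge consecutive text segments into chunks.

    A chunk is closed as soon as its running token count exceeds
    max_tokens. count_tokens is a hypothetical stand-in for a real
    tokenizer; by default it simply counts whitespace-separated words.
    """
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())

    chunks = []
    current = []          # segments collected for the chunk being built
    current_tokens = 0    # running token count of `current`

    for segment in segments:
        current.append(segment)
        current_tokens += count_tokens(segment)
        if current_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0

    # Keep whatever is left over as a final, possibly smaller chunk.
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Calling `chunk_by_token_budget(["a b", "c d", "e"], 3)` groups the first two segments into one chunk and leaves the last segment as its own chunk.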

@@ -9,7 +9,7 @@ Convert complex Excel spreadsheets into HTML tables.
---
-When using the General chunking method, you can enable the **Excel to HTML** toggle to convert spreadsheet files into HTML tables. If it is disabled, spreadsheet tables will be represented as key-value pairs. For complex tables that cannot be simply represented this way, you must enable this feature.
+When using the **General** chunking method, you can enable the **Excel to HTML** toggle to convert spreadsheet files into HTML tables. If it is disabled, spreadsheet tables will be represented as key-value pairs. For complex tables that cannot be simply represented this way, you must enable this feature.
:::caution WARNING
The feature is disabled by default. If your knowledge base contains spreadsheets with complex tables and you do not enable this feature, RAGFlow will not throw an error but your tables are likely to be garbled.
@@ -22,7 +22,7 @@ Works with complex tables that cannot be represented as key-value pairs. Example
## Considerations
- The Excel2HTML feature applies only to spreadsheet files (XLSX or XLS (Excel 97-2003)).
-- This feature is associated with the General chunking method. In other words, it is available *only when* you select the General chunking method.
+- This feature is associated with the **General** chunking method. In other words, it is available *only when* you select the **General** chunking method.
- When this feature is enabled, spreadsheet tables with more than 12 rows will be split into chunks of 12 rows each.
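
The 12-row splitting rule in the last consideration can be pictured with a small sketch. It is not RAGFlow's code: the helper names `split_rows` and `rows_to_html`, and the choice to repeat the header row in every chunk, are assumptions made purely for illustration.

```python
from html import escape


def split_rows(rows, rows_per_chunk=12):
    """Split a list of table rows into consecutive groups of at most rows_per_chunk."""
    return [rows[i:i + rows_per_chunk] for i in range(0, len(rows), rows_per_chunk)]


def rows_to_html(header, rows):
    """Render one group of rows as a minimal HTML table (header repeated per chunk)."""
    head = "".join(f"<th>{escape(str(cell))}</th>" for cell in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


# Usage: a 30-row sheet becomes three HTML chunks of 12, 12, and 6 rows.
sheet = [[n, n * n] for n in range(30)]
html_chunks = [rows_to_html(["n", "n squared"], group) for group in split_rows(sheet)]
```
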
## Procedure

@@ -47,7 +47,7 @@ The RAPTOR feature is disabled by default. To enable it, manually switch on the
### Prompt
-The following prompt will be applied recursively for cluster summarization, with `{cluster_content}` serving as an internal parameter. We recommend that you keep it as-is for now. The design will be updated in due course.
+The following prompt will be applied *recursively* for cluster summarization, with `{cluster_content}` serving as an internal parameter. We recommend that you keep it as-is for now. The design will be updated in due course.
```
Please summarize the following paragraphs... Paragraphs as following:

@@ -5,7 +5,7 @@ slug: /use_tag_sets
# Use tag set
-Use a tag set to tag chunks in your datasets.
+Use a tag set to auto-tag chunks in your datasets.
---

@@ -287,7 +287,7 @@ To add and configure an LLM:
## Create your first knowledge base
-You are allowed to upload files to a knowledge base in RAGFlow and parse them into datasets. A knowledge base is virtually a collection of datasets. Question answering in RAGFlow can be based on a particular knowledge base or multiple knowledge bases. File formats that RAGFlow supports include documents (PDF, DOC, DOCX, TXT, MD), tables (CSV, XLSX, XLS), pictures (JPEG, JPG, PNG, TIF, GIF), and slides (PPT, PPTX).
+You are allowed to upload files to a knowledge base in RAGFlow and parse them into datasets. A knowledge base is virtually a collection of datasets. Question answering in RAGFlow can be based on a particular knowledge base or multiple knowledge bases. File formats that RAGFlow supports include documents (PDF, DOC, DOCX, TXT, MD, MDX), tables (CSV, XLSX, XLS), pictures (JPEG, JPG, PNG, TIF, GIF), and slides (PPT, PPTX).
To create your first knowledge base:

@@ -255,7 +255,7 @@ export default {
manual: `<p>Nur <b>PDF</b> wird unterstützt.</p><p>
Wir gehen davon aus, dass das Handbuch eine hierarchische Abschnittsstruktur aufweist und verwenden die Titel der untersten Abschnitte als Grundeinheit für die Aufteilung der Dokumente. Daher werden Abbildungen und Tabellen im selben Abschnitt nicht getrennt, was zu größeren Chunk-Größen führen kann.
</p>`,
-naive: `<p>Unterstützte Dateiformate sind <b>DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML</b>.</p>
+naive: `<p>Unterstützte Dateiformate sind <b>MD, MDX, DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML</b>.</p>
<p>Diese Methode teilt Dateien mit einer 'naiven' Methode auf: </p>
<p>
<li>Verwenden eines Erkennungsmodells, um die Texte in kleinere Segmente aufzuteilen.</li>

@@ -250,7 +250,7 @@ export default {
manual: `<p>Only <b>PDF</b> is supported.</p><p>
We assume that the manual has a hierarchical section structure, using the lowest section titles as basic unit for chunking documents. Therefore, figures and tables in the same section will not be separated, which may result in larger chunk sizes.
</p>`,
-naive: `<p>Supported file formats are <b>DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML</b>.</p>
+naive: `<p>Supported file formats are <b>MD, MDX, DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML</b>.</p>
<p>This method chunks files using a 'naive' method: </p>
<p>
<li>Use vision detection model to split the texts into smaller segments.</li>

@@ -211,7 +211,7 @@ export default {
Kami mengasumsikan manual memiliki struktur bagian hierarkis. Kami menggunakan judul bagian terendah sebagai poros untuk memotong dokumen.
Jadi, gambar dan tabel dalam bagian yang sama tidak akan dipisahkan, dan ukuran potongan mungkin besar.
</p>`,
-naive: `<p>Format file yang didukung adalah <b>DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML</b>.</p>
+naive: `<p>Format file yang didukung adalah <b>MD, MDX, DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML</b>.</p>
<p>Metode ini menerapkan cara naif untuk memotong file: </p>
<p>
<li>Teks berturut-turut akan dipotong menjadi potongan menggunakan model deteksi visual.</li>

@@ -215,7 +215,7 @@ export default {
manual: `<p>対応するのは<b>PDF</b>のみです。</p><p>
</p>`,
-naive: `<p>対応ファイル形式は<b>DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML</b>です。</p>
+naive: `<p>対応ファイル形式は<b>MD, MDX, DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML</b>です。</p>
<p>'ナイーブ'</p>
<p>
<li>使</li>

@@ -246,7 +246,7 @@ export default {
Os fragmentos terão granularidade compatível com 'ARTIGO', garantindo que todo o texto de nível superior seja incluído no fragmento.</p>`,
manual: `<p>Apenas <b>PDF</b> é suportado.</p><p>
Assumimos que o manual tem uma estrutura hierárquica de seções, usando os títulos das seções inferiores como unidade básica para fragmentação. Assim, figuras e tabelas na mesma seção não serão separadas, o que pode resultar em fragmentos maiores.</p>`,
-naive: `<p>Os formatos de arquivo suportados são <b>DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML</b>.</p>
+naive: `<p>Os formatos de arquivo suportados são <b>MD, MDX, DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML</b>.</p>
<p>Este método fragmenta arquivos de maneira 'simples':</p>
<p>
<li>Usa um modelo de detecção visual para dividir os textos em segmentos menores.</li>

@@ -231,7 +231,7 @@ export default {
<p>
<li>Sử dụng hình nhận dạng thị giác để chia các văn bản thành các phân đoạn nhỏ hơn.</li>
<li>Sau đó, kết hợp các phân đoạn liền kề cho đến khi số lượng token vượt quá ngưỡng được chỉ định bởi 'Số token khối', tại thời điểm đó, một khối được tạo.</li></p>
-<p>Các định dạng tệp được hỗ trợ <b>DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML</b>.</p>`,
+<p>Các định dạng tệp được hỗ trợ <b>MD, MDX, DOCX, XLSX, XLS (Excel 97-2003), PPT, PDF, TXT, JPEG, JPG, PNG, TIF, GIF, CSV, JSON, EML, HTML</b>.</p>`,
paper: `<p>Chỉ hỗ trợ tệp <b>PDF</b>.</p><p>
Bài báo sẽ được chia theo các phần, chẳng hạn như <i>tóm tắt, 1.1, 1.2</i>. </p><p>
Cách tiếp cận này cho phép LLM tóm tắt bài báo hiệu quả hơn cung cấp các phản hồi toàn diện, dễ hiểu hơn.

@@ -246,7 +246,7 @@ export default {
使
</p>`,
-naive: `<p>支持的文件格式為<b>DOCX、XLSX、XLS (Excel 97-2003)、PPT、PDF、TXT、JPEG、JPG、PNG、TIF、GIF、CSV、JSON、EML、HTML</b>。</p>
+naive: `<p>支持的文件格式為<b>MD、MDX、DOCX、XLSX、XLS (Excel 97-2003)、PPT、PDF、TXT、JPEG、JPG、PNG、TIF、GIF、CSV、JSON、EML、HTML</b>。</p>
<p></p>
<p>
<li>使</li>

@@ -247,7 +247,7 @@ export default {
使
</p>`,
-naive: `<p>支持的文件格式为<b>DOCX、XLSX、XLS (Excel 97-2003)、PPT、PDF、TXT、JPEG、JPG、PNG、TIF、GIF、CSV、JSON、EML、HTML</b>。</p>
+naive: `<p>支持的文件格式为<b>MD、MDX、DOCX、XLSX、XLS (Excel 97-2003)、PPT、PDF、TXT、JPEG、JPG、PNG、TIF、GIF、CSV、JSON、EML、HTML</b>。</p>
<p></p>
<p>
<li>使</li>