update readme

Hongji Zhu 2025-01-15 23:07:36 +08:00
parent 6b5482cdfa
commit 0668ed399a

README.md (114 changed lines)

@@ -128,7 +128,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://git
 <td>3.4</td>
 </tr>
 <tr>
-<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
+<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
 <td>-</td>
 <td>-</td>
 <td>64.4</td>
@@ -299,7 +299,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://git
 <td>3.5</td>
 </tr>
 <tr>
-<td nowrap="nowrap" align="left">InternVL-2.5-8B</td>
+<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
 <td>8B</td>
 <td>706</td>
 <td>68.3</td>
@@ -378,8 +378,8 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://git
 <tr>
 <th align="left">Model</th>
 <th>Size</th>
-<th>BLINK-val</th>
-<th>Mantis-Eval</th>
+<th>BLINK val</th>
+<th>Mantis Eval</th>
 <th>MIRB</th>
 <th>Video-MME (wo / w subs)</th>
 </tr>
@@ -391,7 +391,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://git
 <tr>
 <td nowrap="nowrap" align="left">GPT-4o-20240513</td>
 <td>-</td>
-<td><strong>68</strong></td>
+<td><strong>68.0</strong></td>
 <td>-</td>
 <td>-</td>
 <td><strong>71.9/77.2</strong></td>
@@ -416,7 +416,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://git
 <td>-</td>
 </tr>
 <tr>
-<td nowrap="nowrap" align="left">LLaVA-One-Vision-72B</td>
+<td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
 <td>72B</td>
 <td>55.4</td>
 <td><strong>77.6</strong></td>
@@ -440,7 +440,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://git
 <td>63.3/69.0</td>
 </tr>
 <tr>
-<td nowrap="nowrap" align="left">InternVL-2.5-8B</td>
+<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
 <td>8B</td>
 <td>54.8</td>
 <td>67.7</td>
@@ -450,7 +450,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://git
 <tr>
 <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
 <td>8B</td>
-<td>53</td>
+<td>53.0</td>
 <td>69.1</td>
 <td>53.8</td>
 <td>60.9/63.6</td>
@@ -485,7 +485,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://git
 <th>Size</th>
 <th colspan="3">ASR (zh)</th>
 <th colspan="3">ASR (en)</th>
-<th colspan="2">ASR</th>
+<th colspan="2">AST</th>
 <th>Emotion</th>
 </tr>
 <tr>
@@ -528,7 +528,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://git
 <td>33.2*</td>
 </tr>
 <tr>
-<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
+<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
 <td>-</td>
 <td>4.5*</td>
 <td>5.9*</td>
@@ -544,7 +544,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://git
 <td colspan="11" align="left"><strong>Open-Source</strong></td>
 </tr>
 <tr>
-<td nowrap="nowrap" align="left">Qwen2-Audio</td>
+<td nowrap="nowrap" align="left">Qwen2-Audio-7B</td>
 <td>8B</td>
 <td>-</td>
 <td>7.5</td>
@@ -557,7 +557,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://git
 <td><strong>55.3</strong></td>
 </tr>
 <tr>
-<td nowrap="nowrap" align="left">Qwen2-Audio-Instruction</td>
+<td nowrap="nowrap" align="left">Qwen2-Audio-7B-Instruct</td>
 <td>8B</td>
 <td>2.6*</td>
 <td>6.9*</td>
@@ -581,7 +581,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://git
 <td>-</td>
 <td>-</td>
 </tr>
-<tr style="background-color: #e6f2ff;">
+<tr>
 <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
 <td>8B</td>
 <td><strong>1.6</strong></td>
@@ -702,7 +702,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://git
 <td>3.4</td>
 <td>10.0</td>
 </tr>
-<tr style="background-color: #e6f2ff;">
+<tr>
 <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
 <td>8B</td>
 <td><u>61.0</u></td>
@@ -727,7 +727,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://git
 <thead>
 <tr>
 <th align="left">Task</th>
-<th colspan="2">TTS</th>
+<th colspan="2">Voice cloning</th>
 </tr>
 <tr>
 <th align="left">Metric</th>
@@ -756,7 +756,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://git
 <td>63</td>
 <td>46</td>
 </tr>
-<tr style="background-color: #e6f2ff;">
+<tr>
 <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
 <td>57</td>
 <td>47</td>
@@ -796,7 +796,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://git
 <td><strong>70.3</strong></td>
 </tr>
 <tr>
-<td nowrap="nowrap" align="left">GPT-4o</td>
+<td nowrap="nowrap" align="left">GPT-4o-202408</td>
 <td>-</td>
 <td>74.5</td>
 <td>51.0</td>
@@ -886,7 +886,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://git
 <td>33.4</td>
 <td>57.7</td>
 </tr>
-<tr style="background-color: #e6f2ff;">
+<tr>
 <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
 <td>8B</td>
 <td><strong>79.9</strong></td>
@@ -902,7 +902,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://git
 ### Typical examples <!-- omit in toc -->
-The following examples were recorded with MiniCPM-o 2.6 deployed on an iPad Pro.
+The following examples were recorded with MiniCPM-o 2.6 deployed on an iPad Pro and via the web demo.
 <div style="display: flex; flex-direction: column; align-items: center;">
@@ -961,7 +961,7 @@ model.tts.float()
 ### Omni mode
 We provide two inference modes: chat and streaming.
-#### chat inference
+#### Chat inference
 ```python
 import math
 import numpy as np
@@ -995,11 +995,12 @@ def get_video_chunk_content(video_path, flatten=True):
     return contents
 video_path="/path/to/video"
-sys_msg = model.get_sys_prompt(mode='omni', language='en')
-# if use voice clone prompt, please set ref_audio
-# ref_audio_path = '/path/to/ref_audio'
-# ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
-# sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')
+ref_audio_path = 'assets/demo.wav'
+ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
+sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')
+# or use the default prompt
+# sys_msg = model.get_sys_prompt(mode='omni', language='en')
 contents = get_video_chunk_content(video_path)
 msg = {"role":"user", "content": contents}
@@ -1025,7 +1026,7 @@ res = model.chat(
 )
 print(res)
 ```
-#### streaming inference
+#### Streaming inference
 ```python
 # a new conversation needs a session reset first; this clears the KV cache
 model.reset_session()
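# --- hedged sketch of the streaming loop elided between these hunks ---
# streaming_prefill / streaming_generate follow the repo's streaming API,
# but the names and signatures here are assumptions, not part of this commit
session_id = '123'
model.streaming_prefill(session_id=session_id, msgs=[sys_msg], tokenizer=tokenizer)
for content in get_video_chunk_content(video_path, flatten=False):
    msgs = [{"role": "user", "content": content}]
    model.streaming_prefill(session_id=session_id, msgs=msgs, tokenizer=tokenizer)
res = model.streaming_generate(
    session_id=session_id,
    tokenizer=tokenizer,
    temperature=0.5,
    generate_audio=True,
)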
@@ -1083,6 +1084,7 @@ else:
 ### Audio-Only mode
 #### Mimick
+The `Mimick` task reflects a model's end-to-end speech modeling capability: the model takes audio input, outputs an ASR transcription, and then reconstructs the original audio with high similarity. The higher the similarity between the reconstructed and the original audio, the stronger the model's foundational end-to-end speech modeling capability.
 ```python
 mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
 audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
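# --- sketch of the elided remainder of the Mimick example ---
# parameters are assumed to match the other audio examples in this README
msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='output.wav',
)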
@@ -1104,27 +1106,25 @@ res = model.chat(
 <details> <summary>Click to view the Python code for enabling MiniCPM-o 2.6 to interact with you in a specified voice.</summary>
 ```python
-ref_audio, _ = librosa.load('./assert/voice_01.wav', sr=16000, mono=True) # load the reference audio
+ref_audio, _ = librosa.load('assets/demo.wav', sr=16000, mono=True) # load the reference audio
-# Audio RolePlay: # With this mode, model will role-play the character based on the audio prompt.
-sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')
-user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
+# Choose the mode you want to use
+# Audio RolePlay: with this mode, the model role-plays the character from the audio prompt. (more human-like conversation but unstable)
+# sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')
+# user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
-# Audio Assistant: # With this mode, model will speak with the voice in ref_audio as a AI assistant.
-# sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')
-# user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # Try to ask something!
+# Audio Assistant: with this mode, the model speaks with the voice in ref_audio as an AI assistant. (stable and more suitable for general conversation)
+sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')
+user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # Try to ask something by recording it in 'xxx.wav'!
 ```
 ```python
 msgs = [sys_prompt, user_question]
 # round one
 res = model.chat(
-    image=None,
     msgs=msgs,
-    context=None,
     tokenizer=tokenizer,
     sampling=True,
     max_new_tokens=128,
-    stream=False,
     stream_input=True,
     use_tts_template=True,
     generate_audio=True,
     temperature=0.3,
@@ -1136,14 +1136,10 @@ history = msgs.append({'role': 'assistant', 'content': res})
 user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
 msgs = history.append(user_question)
 res = model.chat(
-    image=None,
     msgs=msgs,
-    context=None,
     tokenizer=tokenizer,
     sampling=True,
     max_new_tokens=128,
-    stream=False,
-    stream_input=True,
     use_tts_template=True,
     generate_audio=True,
     temperature=0.3,
@@ -1169,20 +1165,16 @@ General Audio:
 Audio Caption: Summarize the main content of the audio.
 Sound Scene Tagging: Utilize one keyword to convey the audio's content or the associated scene.
 '''
-task_prompt = "\n"
+task_prompt = "" # choose one of the task prompts above
 audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
 msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]
 res = model.chat(
-    image=None,
     msgs=msgs,
-    context=None,
     tokenizer=tokenizer,
     sampling=True,
     max_new_tokens=128,
-    stream=False,
-    stream_input=True,
     use_tts_template=True,
     generate_audio=True,
     temperature=0.3,
@@ -1198,28 +1190,24 @@ Speech Generation Task Prompt:
 # 在新闻中,一个年轻男性兴致勃勃地说:"祝福亲爱的祖国母亲美丽富强!"他用低音调和低音量,慢慢地说出了这句话。
 # Delighting in a surprised tone, an adult male with low pitch and low volume comments: "One even gave my little dog a biscuit." This dialogue takes place at a leisurely pace, delivering a sense of excitement and surprise in the context.
-Voice Cloning or Voice Creation: With this mode, model will act like a TTS model.
+Voice Cloning or Voice Conversion: with this mode, the model acts like a TTS model.
 '''
 # Human Instruction-to-Speech:
-task_prompt = '' #Try to make some Human Instruction-to-Speech prompt
-msgs = [{'role': 'user', 'content': [task_prompt]}] # you can try to use the same audio question
+task_prompt = '' # try writing a Human Instruction-to-Speech prompt (Voice Creation)
+msgs = [{'role': 'user', 'content': [task_prompt]}] # you can also try asking the same audio question
-# Voice Cloning mode: With this mode, model will act like a TTS model.
+# Voice Cloning mode:
 # sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
 # text_prompt = f"Please read the text below."
 # user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]} # using same voice in sys_prompt to read the text. (Voice Cloning)
-# user_question = {'role': 'user', 'content': [text_prompt, librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # using same voice in sys_prompt to read 'xxx.wav'. (Voice Creation)
+# user_question = {'role': 'user', 'content': [text_prompt, librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # using the same voice in sys_prompt to read 'xxx.wav'. (Voice Conversion)
-# msgs = [sys_prompt, user_question]
+msgs = [sys_prompt, user_question]
 res = model.chat(
-    image=None,
     msgs=msgs,
-    context=None,
     tokenizer=tokenizer,
     sampling=True,
     max_new_tokens=128,
-    stream=False,
-    stream_input=True,
     use_tts_template=True,
     generate_audio=True,
     temperature=0.3,
@@ -1235,7 +1223,7 @@ res = model.chat(
 `MiniCPM-o-2_6` has the same inference methods as `MiniCPM-V-2_6`.
-#### chat with single image
+#### Chat with single image
 ```python
 # test.py
 image = Image.open('xx.jpg').convert('RGB')
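# --- sketch of the elided remainder of the single-image example ---
# the question text is a placeholder; the call shape follows this commit's
# simplified chat API (image/context kwargs removed)
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': [image, question]}]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
)
print(res)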
@@ -1344,30 +1332,32 @@ print(answer)
 Please look at [GitHub](https://github.com/OpenBMB/MiniCPM-o) for more details about usage.
-## llama.cpp inference <a id="llamacpp"></a>
-Coming soon.
+## llama.cpp <a id="llamacpp"></a>
+MiniCPM-o 2.6 (vision-only mode) can run with llama.cpp. See our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-omni) and the [readme](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) for more details.
 ## Int4 quantized version
-The int4 quantized version has lower GPU memory usage (7 GB): [MiniCPM-o-2_6-int4](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-int4).
+The int4 quantized version has lower GPU memory usage (9 GB): [MiniCPM-o-2_6-int4](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-int4).
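For reference, loading the int4 build can follow the same `AutoModel` pattern this README uses for the fp16 model; a minimal sketch, with the Hugging Face repo id `openbmb/MiniCPM-o-2_6-int4` assumed rather than taken from this commit:

```python
# minimal sketch (repo id assumed); the quantization config ships with the checkpoint
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6-int4', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6-int4', trust_remote_code=True)
model.eval()
```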
 ## License
 #### Model License
 * The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
 * The usage of MiniCPM-o and MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
-* The models and weights of MiniCPM are completely free for academic research. after filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, are also available for free commercial use.
+* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-o 2.6 weights are also available for free commercial use.
 #### Statement
 * As an LMM, MiniCPM-o 2.6 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-o 2.6 does not represent the views and positions of the model developers.
 * We will not be liable for any problems arising from the use of the MiniCPM-V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.
-## Other Multimodal Projects from Our Team
+## Key Techniques and Other Multimodal Projects
+👏 Welcome to explore the key techniques of MiniCPM-o 2.6 and other multimodal projects from our team:
 [VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
 ## Citation
 If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!
 ```bib