mirror of https://www.modelscope.cn/OpenBMB/MiniCPM-o-2_6.git (synced 2025-04-18 07:39:33 +08:00)

Commit df106b958d "update readme" (parent 9dbc4e22a4): README.md, 216 lines changed.
The following examples were recorded with MiniCPM-o 2.6 deployed on an iPad Pro and with the web demo.

<div align="center">
<a href="https://youtu.be/JFJg9KZ_iZk"><img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/o-2dot6-demo-video-preview.png" width="70%"></a>
</div>

<br>

<div style="display: flex; flex-direction: column; align-items: center;">
<img src="./assets/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
<!-- further demo images elided in this diff -->
</div>
Model initialization:

```python
# ... (model loading elided in this diff) ...
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

# In addition to vision-only mode, the TTS processor and vocos also need to be initialized
model.init_tts()
```

If you are using an older version of PyTorch, you might encounter the error `"weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16'`. In that case, convert the TTS module to float32:

```python
model.tts.float()
```
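If you prefer to downcast only when necessary, here is a defensive sketch (the version cutoff is a guess, not from the README; verify against your build):

```python
import torch

# keep TTS in bfloat16 on newer PyTorch; fall back to float32 on older builds
major, minor = (int(v) for v in torch.__version__.split('.')[:2])
if (major, minor) < (2, 1):  # hypothetical cutoff for the missing bfloat16 kernel
    model.tts.float()
```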
### Omni mode

We provide two inference modes: chat and streaming.

#### Chat inference

```python
def get_video_chunk_content(video_path, flatten=True):
    # ... (function body elided in this diff) ...
    return contents

video_path = "assets/Skiing.mp4"
# if using the voice clone prompt, set ref_audio
ref_audio_path = 'assets/demo.wav'
ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)

# ... (system prompt and message construction elided in this diff) ...

res = model.chat(
    # ... (remaining chat parameters elided in this diff) ...
    return_dict=True
)
print(res)

## You will get the answer: The person in the picture is skiing down a snowy slope.
# import IPython
# IPython.display.Audio('output.wav')
```
#### Streaming inference
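The body of the streaming example falls between the hunks of this diff. As a placeholder, here is a minimal sketch of the streaming API (`reset_session`, `streaming_prefill`, `streaming_generate`) as used in the MiniCPM-o repo; treat the result fields (`audio_wav`, `sampling_rate`, `text`) as assumptions to verify against the full README:

```python
import numpy as np
import soundfile as sf

model.reset_session()  # reset the KV cache before a new streaming conversation

sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')
contents = get_video_chunk_content(video_path, flatten=False)
session_id = '123'

# 1. prefill the system prompt
model.streaming_prefill(session_id=session_id, msgs=[sys_msg], tokenizer=tokenizer)

# 2. prefill each 1-second video/audio chunk as it arrives
for content in contents:
    model.streaming_prefill(
        session_id=session_id,
        msgs=[{'role': 'user', 'content': content}],
        tokenizer=tokenizer,
    )

# 3. generate text and audio incrementally
res = model.streaming_generate(
    session_id=session_id,
    tokenizer=tokenizer,
    temperature=0.5,
    generate_audio=True,
)

audios, text, out_sr = [], '', 16000
for r in res:  # each r carries a text piece and an audio segment
    audios.append(r.audio_wav)
    text += r.text
    out_sr = r.sampling_rate
sf.write('output.wav', np.concatenate(audios), samplerate=out_sr)
print(text)
```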
### Speech and Audio Mode

Model initialization:

```python
import torch
import librosa
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2; eager is not supported
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

model.init_tts()
model.tts.float()
```

<hr/>

#### Mimick

The `Mimick` task reflects the model's end-to-end speech modeling capability: the model takes an audio input, outputs an ASR transcription, and then reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original, the stronger the model's foundational capability in end-to-end speech modeling.
```python
mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
audio_input, _ = librosa.load('./assets/input_examples/Trump_WEF_2018_10s.mp3', sr=16000, mono=True) # load the audio to be mimicked

# You can also try `./assets/input_examples/cxk_original.wav`,
# `./assets/input_examples/fast-pace.wav`,
# `./assets/input_examples/chi-english-1.wav`, and
# `./assets/input_examples/exciting-emotion.wav`
# to exercise different speech-centric features.

msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    temperature=0.3,
    generate_audio=True,
    output_audio_path='output_mimick.wav', # save the TTS result to output_audio_path
)
```
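The README does not define a similarity metric; as a rough, purely illustrative sanity check, you can compare mean MFCC vectors of the original and reconstructed clips:

```python
import numpy as np
import librosa

def rough_similarity(path_a, path_b, sr=16000):
    """Cosine similarity of mean MFCC vectors: a crude spectral-closeness proxy."""
    a, _ = librosa.load(path_a, sr=sr, mono=True)
    b, _ = librosa.load(path_b, sr=sr, mono=True)
    ma = librosa.feature.mfcc(y=a, sr=sr, n_mfcc=20).mean(axis=1)
    mb = librosa.feature.mfcc(y=b, sr=sr, n_mfcc=20).mean(axis=1)
    return float(np.dot(ma, mb) / (np.linalg.norm(ma) * np.linalg.norm(mb)))

print(rough_similarity('./assets/input_examples/Trump_WEF_2018_10s.mp3', 'output_mimick.wav'))
```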

<hr/>

#### General Speech Conversation with Configurable Voices

A general usage scenario of `MiniCPM-o-2.6` is role-playing a specific character based on an audio prompt. The model mimics the character's voice to some extent and acts like the character in text, including language style, which makes it sound **more natural and human-like**. Self-defined audio prompts can be used to customize the character's voice in an end-to-end manner.

```python
ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')

# round one
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # try asking something by recording it in 'xxx.wav'
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_roleplay_round_1.wav',
)

# round two
msgs.append({'role': 'assistant', 'content': res}) # extend the history in place (list.append returns None)
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs.append(user_question)
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_roleplay_round_2.wav',
)
print(res)
```
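Each round only appends one user turn and one assistant turn to `msgs`, so a small wrapper keeps the bookkeeping in one place (a convenience sketch, not part of the model's API; it reuses the exact parameters of the calls above):

```python
def chat_round(model, tokenizer, msgs, user_audio, out_path):
    """Append a user audio turn, run one chat round, and record the reply in msgs."""
    msgs.append({'role': 'user', 'content': [user_audio]})
    res = model.chat(
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=True,
        max_new_tokens=128,
        use_tts_template=True,
        generate_audio=True,
        temperature=0.3,
        output_audio_path=out_path,
    )
    msgs.append({'role': 'assistant', 'content': res})
    return res

# e.g. a third round:
# audio, _ = librosa.load('xxx.wav', sr=16000, mono=True)
# chat_round(model, tokenizer, msgs, audio, 'result_roleplay_round_3.wav')
```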

<hr/>

#### Speech Conversation as an AI Assistant

An enhanced feature of `MiniCPM-o-2.6` is acting as an AI assistant, though with a limited choice of voices. In this mode the model is **less human-like and more like a voice assistant**, and it follows instructions more reliably. For demos, we suggest `assistant_default_female_voice` or `assistant_male_voice`; other voices may work, but not as stably as the defaults.

```python
ref_audio, _ = librosa.load('./assets/input_examples/assistant_default_female_voice.wav', sr=16000, mono=True) # or use `./assets/input_examples/assistant_male_voice.wav`
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # load the user's audio question

# round one
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_assistant_round_1.wav',
)

# round two
msgs.append({'role': 'assistant', 'content': res}) # extend the history in place (list.append returns None)
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs.append(user_question)
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_assistant_round_2.wav',
)
print(res)
```

<hr/>

#### Instruction-to-Speech

`MiniCPM-o-2.6` can also do Instruction-to-Speech, aka **Voice Creation**: describe a voice in detail, and the model generates a voice that matches the description. For more sample instructions, refer to https://voxinstruct.github.io/VoxInstruct/.

```python
instruction = 'Speak like a male charming superstar, radiating confidence and style in every word.'

msgs = [{'role': 'user', 'content': [instruction]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_creation.wav',
)
print(res)
```

More sample instructions in this style:

- 在新闻中,一个年轻男性兴致勃勃地说:“祝福亲爱的祖国母亲美丽富强!”他用低音调和低音量,慢慢地说出了这句话。
- Delighting in a surprised tone, an adult male with low pitch and low volume comments: "One even gave my little dog a biscuit." This dialogue takes place at a leisurely pace, delivering a sense of excitement and surprise in the context.

<hr/>

#### Voice Cloning

`MiniCPM-o-2.6` can also do zero-shot text-to-speech, aka **Voice Cloning**. In this mode, the model acts like a TTS model: it reads the given text with the voice from the reference audio.

```python
ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
text_prompt = "Please read the text below."
user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]} # read the given text with the voice in sys_prompt (Voice Cloning)
# user_question = {'role': 'user', 'content': [text_prompt, librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # read the content of 'xxx.wav' with the voice in sys_prompt (Voice Conversion)
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_cloning.wav',
)
```
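Since this mode behaves like a TTS engine, reading several strings with the same cloned voice is just a loop over `model.chat`, reusing `sys_prompt` and `text_prompt` from above (a sketch; the example sentences are placeholders):

```python
lines = [
    "MiniCPM-o 2.6 is an end-to-end multimodal model.",  # placeholder text
    "It can see, hear, and speak.",                      # placeholder text
]
for i, line in enumerate(lines):
    msgs = [sys_prompt, {'role': 'user', 'content': [text_prompt, line]}]
    model.chat(
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=True,
        max_new_tokens=128,
        use_tts_template=True,
        generate_audio=True,
        temperature=0.3,
        output_audio_path=f'result_voice_cloning_{i}.wav',
    )
```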

<hr/>
#### Addressing Various Audio Understanding Tasks

`MiniCPM-o-2.6` can also address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.

For audio-to-text tasks, you can use the following prompts:

- ASR with ZH (same as AST en2zh): `请仔细听这段音频片段,并将其内容逐字记录。`
- ASR with EN (same as AST zh2en): `Please listen to the audio snippet carefully and transcribe the content.`
- Speaker Analysis: `Based on the speaker's content, speculate on their gender, condition, age range, and health status.`
- General Audio Caption: `Summarize the main content of the audio.`
- General Sound Scene Tagging: `Utilize one keyword to convey the audio's content or the associated scene.`

```python
task_prompt = "Please listen to the audio snippet carefully and transcribe the content." + "\n" # can be changed to any of the prompts above
audio_input, _ = librosa.load('./assets/input_examples/audio_understanding.mp3', sr=16000, mono=True) # load the audio to be analyzed

msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_audio_understanding.wav',
)
print(res)
```
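To run every prompt from the list above on the same clip, a simple loop works (a sketch; it assumes that omitting audio generation yields a plain text result, which the README does not show):

```python
prompts = [
    "请仔细听这段音频片段,并将其内容逐字记录。",
    "Please listen to the audio snippet carefully and transcribe the content.",
    "Based on the speaker's content, speculate on their gender, condition, age range, and health status.",
    "Summarize the main content of the audio.",
    "Utilize one keyword to convey the audio's content or the associated scene.",
]
audio_input, _ = librosa.load('./assets/input_examples/audio_understanding.mp3', sr=16000, mono=True)

for task_prompt in prompts:
    msgs = [{'role': 'user', 'content': [task_prompt + "\n", audio_input]}]
    res = model.chat(
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=True,
        max_new_tokens=128,
        use_tts_template=True,
        temperature=0.3,
        generate_audio=False,  # assumption: skip TTS for text-only tasks
    )
    print(task_prompt.strip(), '->', res)
```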
### Vision-Only mode