update readme

Hongji Zhu 2025-02-07 16:13:26 +08:00
parent 9dbc4e22a4
commit df106b958d

README.md

@@ -904,6 +904,12 @@ MiniCPM-o 2.6 can be easily used in a variety of ways: (1) [llama.cpp](https://git
The following examples were recorded with MiniCPM-o 2.6 deployed on an iPad Pro and in the web demo.
<div align="center">
<a href="https://youtu.be/JFJg9KZ_iZk"><img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/o-2dot6-demo-video-preview.png", width=70%></a>
</div>
<br>
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="./assets/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
@@ -956,10 +962,15 @@ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_
# In addition to vision-only mode, the TTS processor and vocos also need to be initialized
model.init_tts()
```
If you are using an older version of PyTorch, you might encounter the error `"weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16'`. In that case, convert the TTS module to float32:
```python
model.tts.float()
```
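If you only want to downcast when necessary, a small guard like the sketch below works; the `"2.1"` version threshold is an assumption about when BFloat16 `weight_norm` support landed, so adjust it to your environment rather than treating it as authoritative.
```python
# Hypothetical guard: only cast the TTS module to float32 on older PyTorch builds.
# The "2.1" cutoff is an assumption; verify against the PyTorch version you run.
from packaging import version
import torch

if version.parse(torch.__version__.split('+')[0]) < version.parse("2.1"):
    model.tts.float()
```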
### Omni mode
We provide two inference modes: chat and streaming
#### Chat inference
```python
@@ -994,7 +1005,7 @@ def get_video_chunk_content(video_path, flatten=True):
return contents
video_path="/path/to/video"
video_path="assets/Skiing.mp4"
# if use voice clone prompt, please set ref_audio
ref_audio_path = 'assets/demo.wav'
ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
@@ -1025,6 +1036,11 @@ res = model.chat(
return_dict=True
)
print(res)
## You will get the answer: The person in the picture is skiing down a snowy slope.
# import IPython
# IPython.display.Audio('output.wav')
```
#### Streaming inference
```python
@@ -1082,14 +1098,42 @@ else:
```
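The body of the streaming example is elided by this hunk. As a rough sketch of the flow it implements (the `streaming_prefill` / `streaming_generate` methods and the `audio_wav` / `sampling_rate` / `text` result fields are assumed from the project's other examples and should be checked against the full README before use):
```python
# Minimal sketch of streaming inference; method names and result fields are
# assumed from the project's examples, not confirmed by the lines shown here.
import numpy as np
import librosa
import soundfile as sf

session_id = '123'
ref_audio, _ = librosa.load('assets/demo.wav', sr=16000, mono=True)
sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')

# 1. prefill the system prompt, then each video/audio chunk in order
model.streaming_prefill(session_id=session_id, msgs=[sys_msg], tokenizer=tokenizer)
for content in get_video_chunk_content(video_path, flatten=False):
    model.streaming_prefill(session_id=session_id,
                            msgs=[{'role': 'user', 'content': content}],
                            tokenizer=tokenizer)

# 2. generate text and audio incrementally
res = model.streaming_generate(session_id=session_id, tokenizer=tokenizer,
                               temperature=0.5, generate_audio=True)

audios, text = [], ''
for r in res:
    audios.append(r.audio_wav)
    text += r.text
    sampling_rate = r.sampling_rate
sf.write('output_streaming.wav', np.concatenate(audios), samplerate=sampling_rate)
print(text)
```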
### Speech and Audio Mode
Model initialization
```python
import torch
import librosa
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
model.init_tts()
model.tts.float()
```
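As an aside (an assumption, not something this section states): some MiniCPM-o examples pass `init_vision` / `init_audio` / `init_tts` flags to `from_pretrained` to skip modules they do not need. If the checkpoint's remote code accepts them, an audio-focused load might look like the sketch below; verify the flag names on the model card before relying on them.
```python
# Sketch only: these init_* flags are assumed from other MiniCPM-o examples and
# are not shown in this section; check the model card before using them.
model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,
    attn_implementation='sdpa',
    torch_dtype=torch.bfloat16,
    init_vision=False,  # assumed flag: skip the vision tower for audio-only use
    init_audio=True,    # assumed flag
    init_tts=True,      # assumed flag
)
```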
<hr/>
#### Mimick
The `Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio input, outputs an ASR transcription, and then reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original, the stronger the model's foundational capability in end-to-end speech modeling.
```python
mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
audio_input, _ = librosa.load('./assets/input_examples/Trump_WEF_2018_10s.mp3', sr=16000, mono=True) # load the audio to be mimicked
# can also try `./assets/input_examples/cxk_original.wav`,
# `./assets/input_examples/fast-pace.wav`,
# `./assets/input_examples/chi-english-1.wav`
# `./assets/input_examples/exciting-emotion.wav`
# for different aspects of speech-centric features.
msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]
res = model.chat(
msgs=msgs,
tokenizer=tokenizer,
@@ -1098,28 +1142,24 @@ res = model.chat(
use_tts_template=True,
temperature=0.3,
generate_audio=True,
output_audio_path='output_mimick.wav', # save the tts result to output_audio_path
)
```
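To get a quick, informal sense of how close the reconstruction is, you can compare average mel-spectral envelopes. This is only a crude sanity check under the assumption that `output_mimick.wav` was written to the working directory, not an evaluation metric used by the project.
```python
# Crude sanity check (not an official metric): cosine similarity between the
# average log-mel envelopes of the original clip and the mimicked output.
import numpy as np
import librosa

def mean_log_mel(path, sr=16000):
    wav, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80)
    return np.log(mel + 1e-6).mean(axis=1)  # average log-mel energy per band

a = mean_log_mel('./assets/input_examples/Trump_WEF_2018_10s.mp3')
b = mean_log_mel('output_mimick.wav')
cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f'mel-envelope cosine similarity: {cos:.3f}')  # closer to 1.0 is better
```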
<hr/>
#### General Speech Conversation with Configurable Voices
A common usage scenario of `MiniCPM-o-2.6` is role-playing a specific character based on an audio prompt: the model mimics the character's voice to some extent and also acts like the character in text, including language style. In this mode, `MiniCPM-o-2.6` sounds **more natural and human-like**. Self-defined audio prompts can be used to customize the character's voice in an end-to-end manner.
```python
ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')
# round one
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs = [sys_prompt, user_question]
res = model.chat(
msgs=msgs,
tokenizer=tokenizer,
@@ -1128,7 +1168,7 @@ res = model.chat(
use_tts_template=True,
generate_audio=True,
temperature=0.3,
output_audio_path='result_roleplay_round_1.wav',
)
# round two
@@ -1143,32 +1183,62 @@ res = model.chat(
use_tts_template=True,
generate_audio=True,
temperature=0.3,
output_audio_path='result_roleplay_round_2.wav',
)
print(res)
```
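For longer conversations, a small wrapper (hypothetical, not part of the repository) keeps the history bookkeeping in one place: append the user's audio turn, call `model.chat`, then append the assistant reply so the next round sees the full history. The file names are placeholders.
```python
# Hypothetical helper for multi-turn chat; the audio file names are placeholders.
import librosa

def chat_round(msgs, user_audio_path, output_audio_path):
    audio, _ = librosa.load(user_audio_path, sr=16000, mono=True)
    msgs.append({'role': 'user', 'content': [audio]})    # add the new user turn
    res = model.chat(
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=True,
        max_new_tokens=128,
        use_tts_template=True,
        generate_audio=True,
        temperature=0.3,
        output_audio_path=output_audio_path,
    )
    msgs.append({'role': 'assistant', 'content': res})   # keep the reply in history
    return res

# msgs = [sys_prompt]
# chat_round(msgs, 'round_1_question.wav', 'result_roleplay_round_1.wav')
# chat_round(msgs, 'round_2_question.wav', 'result_roleplay_round_2.wav')
```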
<hr/>
#### Speech Conversation as an AI Assistant
An enhanced feature of `MiniCPM-o-2.6` is acting as an AI assistant, with a limited choice of voices. In this mode, `MiniCPM-o-2.6` is **less human-like and more like a voice assistant**, but it follows instructions more reliably. For demos, we suggest `assistant_default_female_voice` or `assistant_male_voice`; other voices may work, but are not as stable as the defaults.
```python
ref_audio, _ = librosa.load('./assets/input_examples/assistant_default_female_voice.wav', sr=16000, mono=True) # or use `./assets/input_examples/assistant_male_voice.wav`
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # load the user's audio question
# round one
msgs = [sys_prompt, user_question]
res = model.chat(
msgs=msgs,
tokenizer=tokenizer,
sampling=True,
max_new_tokens=128,
use_tts_template=True,
generate_audio=True,
temperature=0.3,
output_audio_path='result_assistant_round_1.wav',
)
# round two
msgs.append({'role': 'assistant', 'content': res})  # append() mutates msgs in place and returns None
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs.append(user_question)
res = model.chat(
msgs=msgs,
tokenizer=tokenizer,
sampling=True,
max_new_tokens=128,
use_tts_template=True,
generate_audio=True,
temperature=0.3,
output_audio_path='result_assistant_round_2.wav',
)
print(res)
```
<hr/>
#### Instruction-to-Speech
`MiniCPM-o-2.6` can also do Instruction-to-Speech, aka **Voice Creation**. You can describe a voice in detail, and the model will generate a voice that matches the description. For more Instruction-to-Speech sample instructions, you can refer to https://voxinstruct.github.io/VoxInstruct/.
```python
instruction = 'Speak like a male charming superstar, radiating confidence and style in every word.'
msgs = [{'role': 'user', 'content': [instruction]}]
res = model.chat(
msgs=msgs,
@@ -1178,30 +1248,56 @@ res = model.chat(
use_tts_template=True,
generate_audio=True,
temperature=0.3,
output_audio_path='result_voice_creation.wav',
)
print(res)
```
<hr/>
#### Voice Cloning
`MiniCPM-o-2.6` can also do zero-shot text-to-speech, aka **Voice Cloning**. In this mode, the model acts like a TTS model.
```python
ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
text_prompt = f"Please read the text below."
user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]}
msgs = [sys_prompt, user_question]
res = model.chat(
msgs=msgs,
tokenizer=tokenizer,
sampling=True,
max_new_tokens=128,
use_tts_template=True,
generate_audio=True,
temperature=0.3,
output_audio_path='result_voice_cloning.wav',
)
# Voice Conversion variant: pass reference speech instead of plain text, and the model
# will read the content of 'xxx.wav' in the sys_prompt voice.
# user_question = {'role': 'user', 'content': [text_prompt, librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
# msgs = [sys_prompt, user_question]
```
<hr/>
#### Addressing Various Audio Understanding Tasks
`MiniCPM-o-2.6` can also be used to address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.
For audio-to-text tasks, you can use the following prompts:
- ASR with ZH (same as AST en2zh): `请仔细听这段音频片段,并将其内容逐字记录。`
- ASR with EN (same as AST zh2en): `Please listen to the audio snippet carefully and transcribe the content.`
- Speaker Analysis: `Based on the speaker's content, speculate on their gender, condition, age range, and health status.`
- General Audio Caption: `Summarize the main content of the audio.`
- General Sound Scene Tagging: `Utilize one keyword to convey the audio's content or the associated scene.`
```python
task_prompt = "Please listen to the audio snippet carefully and transcribe the content." + "\n" # can change to other prompts.
audio_input, _ = librosa.load('./assets/input_examples/audio_understanding.mp3', sr=16000, mono=True) # load the audio to be captioned
msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]
res = model.chat(
msgs=msgs,
@@ -1211,13 +1307,11 @@ res = model.chat(
use_tts_template=True,
generate_audio=True,
temperature=0.3,
output_audio_path='result_audio_understanding.wav',
)
print(res)
```
### Vision-Only mode