|
|
--- |
|
|
license: apache-2.0 |
|
|
pipeline_tag: any-to-any |
|
|
--- |
|
|
# Ming-UniAudio |
|
|
|
|
|
<p align="center">📑 <a href="https://mdn.alipayobjects.com/cto_asrtts/uri/file/as/TR-Ming-UniAudio.pdf">Technical Report</a>|📖<a href="https://xqacmer.github.io/Ming-Unitok-Audio.github.io/">Project Page</a> |🤗 <a href="https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B">Hugging Face</a>| 🤖 <a href="https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B">ModelScope</a> |
|
|
|
|
|
|
|
|
|
|
|
## Introduction |
|
|
|
|
|
Ming-UniAudio is a novel framework that unifies speech understanding, generation, and editing. Its core is a unified continuous speech tokenizer that integrates semantic and acoustic features within an end-to-end model. On top of this tokenizer, we developed a speech language model that strikes a balance between generation and understanding capabilities. Leveraging this foundational model, which exhibits robust performance in both domains, we further trained a dedicated speech editing model built upon [Ming-Lite-Omni](https://github.com/inclusionAI/Ming). Crucially, Ming-UniAudio is the first to enable universal, free-form speech editing guided solely by natural language instructions, handling complex semantic and acoustic modifications without manual region specification.
|
|
|
|
|
- 🔥 First unified continuous speech tokenizer for both understanding and generation tasks: [MingTok-Audio](https://github.com/inclusionAI/MingTok-Audio) |
|
|
- 🔥 First Speech LLM with a unified continuous tokenizer for both understanding and generation: [Ming-UniAudio](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B)
|
|
- 🔥 First universal free-form speech editing model for various semantic and acoustic editing tasks without any temporal region specification: [Ming-UniAudio-Edit](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B-Edit)
|
|
- 🔥 First benchmark for free-form speech editing: [Ming-Freeform-Audio-Edit-Benchmark](https://huggingface.co/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark) |
|
|
|
|
|
<p align="center"> |
|
|
<img src="./figures/uniaudio.png" width="600"/> |
|
|
</p>
|
|
|
|
|
## 📌 Updates |
|
|
|
|
|
* [2025.09.30] 🔥 We release [Ming-UniAudio](https://xqacmer.github.io/Ming-Unitok-Audio.github.io/) with significant improvements across speech understanding, generation, and free-form editing tasks. |
|
|
|
|
|
|
|
|
## Key Features |
|
|
Compared to other audio-assisted LLMs, Ming-UniAudio offers the following key features:
|
|
- **Unified Continuous Speech Tokenizer**: Ming-UniAudio proposes [MingTok-Audio](https://github.com/inclusionAI/MingTok-Audio), a unified continuous speech tokenizer based on a VAE framework with a causal Transformer architecture. It is the first continuous speech tokenizer to effectively integrate semantic and acoustic features, and its hierarchical feature representations enable a closed-loop system with LLMs, making it suitable for both understanding and generation tasks (a conceptual sketch of this loop follows the list).
|
|
|
|
|
- **Unified Speech Language Model for Generation and Understanding**: We pretrain an end-to-end unified speech language model with a single LLM backbone for both understanding and generation tasks, enhanced with a Diffusion Head to ensure high-fidelity speech synthesis. |
|
|
- **Instruction-Guided Free-Form Speech Editing**: We introduce the first instruction-guided, free-form speech editing framework that supports comprehensive semantic and acoustic edits without requiring explicit edit regions, along with [Ming-Freeform-Audio-Edit](https://huggingface.co/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark), the first open-source evaluation set for such tasks. |
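The closed-loop idea can be pictured as follows. This is a purely illustrative sketch with made-up module names and toy dimensions, not the actual MingTok-Audio or Ming-UniAudio API: audio is encoded into continuous latent patches that the LLM consumes for understanding, and latents predicted by the LLM (e.g. via the diffusion head) are decoded back into waveforms for generation.

```python
# Illustrative only: the real MingTok-Audio / Ming-UniAudio classes and signatures
# live in their respective repos; the module below is a toy placeholder.
import torch
import torch.nn as nn

class ToyContinuousTokenizer(nn.Module):
    """Stand-in for a VAE-style continuous speech tokenizer (encoder + decoder)."""
    def __init__(self, patch_size=320, latent_dim=64):
        super().__init__()
        self.patch_size = patch_size
        self.encoder = nn.Linear(patch_size, latent_dim)   # waveform patch -> continuous latent
        self.decoder = nn.Linear(latent_dim, patch_size)   # continuous latent -> waveform patch

    def encode(self, wav):                                  # wav: [B, T]
        patches = wav.unfold(1, self.patch_size, self.patch_size)  # [B, N, patch_size]
        return self.encoder(patches)                        # [B, N, latent_dim]

    def decode(self, latents):                              # latents: [B, N, latent_dim]
        return self.decoder(latents).flatten(1)             # [B, N * patch_size]

# Closed loop: continuous latents feed the LLM for understanding, and latents
# predicted by the LLM (e.g. via a diffusion head) are decoded back to audio.
tokenizer = ToyContinuousTokenizer()
wav = torch.randn(1, 16000)                                 # 1 s of dummy 16 kHz audio
latents = tokenizer.encode(wav)                             # LM input for understanding
reconstructed = tokenizer.decode(latents)                   # LM output path for generation
print(latents.shape, reconstructed.shape)
```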
|
|
|
|
|
|
|
|
<!-- <p align="center"> |
|
|
<img src="./figures/uniaudio-tokenizer.pdf" width="600"/> |
|
|
<p> --> |
|
|
|
|
|
## Evaluation |
|
|
In various benchmark tests, Ming-UniAudio demonstrates highly competitive results compared to industry-leading models of similar scale. |
|
|
|
|
|
|
|
|
|
|
|
### Speech Understanding |
|
|
|
|
|
<table> |
|
|
<caption>ASR performance comparison (error rate, %; lower is better) on various audio benchmark datasets. The best results are in <strong>bold</strong>.</caption>
|
|
<thead> |
|
|
<tr> |
|
|
<th rowspan="2"><strong>Datasets</strong></th> |
|
|
<th rowspan="2"><strong>Model</strong></th> |
|
|
<th colspan="7"><strong>Performance</strong></th> |
|
|
</tr> |
|
|
<tr> |
|
|
<th><strong>aishell2-ios</strong></th> |
|
|
<th><strong>LS-clean</strong></th> |
|
|
<th><strong>Hunan</strong></th> |
|
|
<th><strong>Minnan</strong></th> |
|
|
<th><strong>Guangyue</strong></th> |
|
|
<th><strong>Chuanyu</strong></th> |
|
|
<th><strong>Shanghai</strong></th> |
|
|
</tr> |
|
|
</thead> |
|
|
<tbody> |
|
|
<tr> |
|
|
<td rowspan="4"><strong>Understanding ASR</strong></td> |
|
|
<td>Kimi-Audio</td> |
|
|
<td><strong>2.56</strong></td> |
|
|
<td><strong>1.28</strong></td> |
|
|
<td>31.93</td> |
|
|
<td>80.28</td> |
|
|
<td>41.49</td> |
|
|
<td>6.69</td> |
|
|
<td>60.64</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>Qwen2.5 Omni</td> |
|
|
<td>2.75</td> |
|
|
<td>1.80</td> |
|
|
<td>29.31</td> |
|
|
<td>53.43</td> |
|
|
<td>10.39</td> |
|
|
<td>7.61</td> |
|
|
<td>32.05</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>Qwen2 Audio</td> |
|
|
<td>2.92</td> |
|
|
<td>1.60</td> |
|
|
<td>25.88</td> |
|
|
<td>123.78</td> |
|
|
<td>7.59</td> |
|
|
<td>7.77</td> |
|
|
<td>31.73</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td><strong>Ming-UniAudio-16B-A3B (ours)</strong></td>
|
|
<td>2.84</td> |
|
|
<td>1.62</td> |
|
|
<td><strong>9.80</strong></td> |
|
|
<td><strong>16.50</strong></td> |
|
|
<td><strong>5.51</strong></td> |
|
|
<td><strong>5.46</strong></td> |
|
|
<td><strong>14.65</strong></td> |
|
|
</tr> |
|
|
</tbody> |
|
|
</table> |
|
|
|
|
|
|
|
|
### Speech Generation |
|
|
|
|
|
<table align="center"> |
|
|
<caption>Speech generation performance comparison on various audio benchmark datasets (lower WER and higher SIM are better). The best results are in <strong>bold</strong>.</caption>
|
|
<thead> |
|
|
<tr> |
|
|
<th align="left"><b>Datasets</b></th> |
|
|
<th align="left"><b>Model</b></th> |
|
|
<th colspan="4" align="center"><b>Performance</b></th> |
|
|
</tr> |
|
|
<tr> |
|
|
<th></th> |
|
|
<th></th> |
|
|
<th align="center"><b>Seed-zh WER(%)</b></th> |
|
|
<th align="center"><b>Seed-zh SIM</b></th> |
|
|
<th align="center"><b>Seed-en WER(%)</b></th> |
|
|
<th align="center"><b>Seed-en SIM</b></th> |
|
|
</tr> |
|
|
</thead> |
|
|
<tbody> |
|
|
<tr> |
|
|
<td rowspan="5" align="left" style="vertical-align: middle;"><b>Generation</b></td> |
|
|
<td align="left">Seed-TTS</td> |
|
|
<td align="center">1.12</td> |
|
|
<td align="center"><b>0.80</b></td> |
|
|
<td align="center">2.25</td> |
|
|
<td align="center"><b>0.76</b></td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td align="left">MiMo-Audio</td> |
|
|
<td align="center">1.96</td> |
|
|
<td align="center">-</td> |
|
|
<td align="center">5.37</td> |
|
|
<td align="center">-</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td align="left">Qwen3-Omni-30B-A3B-Instruct</td> |
|
|
<td align="center">1.07</td> |
|
|
<td align="center">-</td> |
|
|
<td align="center"><b>1.39</b></td> |
|
|
<td align="center">-</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td align="left">Ming-Omni-Lite</td> |
|
|
<td align="center">1.69</td> |
|
|
<td align="center">0.68</td> |
|
|
<td align="center">4.31</td> |
|
|
<td align="center">0.51</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td align="left"><strong>Ming-UniAudio-16B-A3B(ours)</strong></td> |
|
|
<td align="center"><b>0.95</b></td> |
|
|
<td align="center">0.70</td> |
|
|
<td align="center">1.85</td> |
|
|
<td align="center">0.58</td> |
|
|
</tr> |
|
|
</tbody> |
|
|
</table> |
|
|
|
|
|
|
|
|
## Model & Benchmark Downloads |
|
|
|
|
|
You can download our latest models and benchmark from both Hugging Face and ModelScope.
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
| **Type** | **Model** | **Input modality** | **Output modality** | **Download** |
|
|
|:-----------------------|:-----------------------|:----------------------:| :---------------: |:------------------------------------------------------------------------------------------------------------------------------------------------------------:| |
|
|
| Tokenizer | MingTok-Audio | audio | audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/MingTok-Audio) <br>[🤖 ModelScope](https://modelscope.cn/models/inclusionAI/MingTok-Audio) |
|
|
| SpeechLLM | Ming-UniAudio-16B-A3B | audio | audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B) <br>[🤖 ModelScope](https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B) |
|
|
| SpeechLLM | Ming-UniAudio-16B-A3B-Edit | text, audio | text, audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B-Edit) <br>[🤖 ModelScope](https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B-Edit) |
|
|
| Benchmark | Ming-Freeform-Audio-Edit | - | - | [🤗 HuggingFace](https://huggingface.co/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark) <br>[🤖 ModelScope](https://modelscope.cn/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark) <br>[Eval tools](https://github.com/inclusionAI/Ming-Freeform-Audio-Edit) |
|
|
</div> |
|
|
If you are in mainland China, we strongly recommend downloading our model from 🤖 <a href="https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B">ModelScope</a>.
|
|
|
|
|
```shell
|
|
pip install modelscope |
|
|
modelscope download --model inclusionAI/Ming-UniAudio-16B-A3B --local_dir inclusionAI/Ming-UniAudio-16B-A3B --revision master |
|
|
``` |
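Alternatively, you can fetch the weights from Hugging Face. A minimal sketch using `huggingface_hub` (install it first with `pip install huggingface_hub`; the local directory below mirrors the layout used elsewhere in this guide):

```python
from huggingface_hub import snapshot_download

# Download the full model repository into a local directory (resumable on interruption).
snapshot_download(
    repo_id="inclusionAI/Ming-UniAudio-16B-A3B",
    local_dir="inclusionAI/Ming-UniAudio-16B-A3B",
)
```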
|
|
|
|
|
Note: This download process will take several minutes to several hours, depending on your network conditions. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Use Cases |
|
|
|
|
|
Additional demonstration cases are available on our project [page](https://xqacmer.github.io/Ming-Unitok-Audio.github.io/). |
|
|
|
|
|
|
|
|
## Environment Preparation |
|
|
|
|
|
|
|
|
### Installation with pip |
|
|
```shell |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
|
|
|
### Installation with docker |
|
|
|
|
|
You can also initialize the environment by building the docker image. First clone this repository: |
|
|
```shell |
|
|
git clone --depth 1 https://github.com/inclusionAI/Ming-UniAudio |
|
|
cd Ming-UniAudio |
|
|
``` |
|
|
Then build the docker image with the provided Dockerfile in `docker/docker-py310-cu121`. This step might take a while: |
|
|
```shell |
|
|
docker build -t ming:py310-cu121 docker/docker-py310-cu121 |
|
|
``` |
|
|
Finally, start the container with the current repo directory mounted:
|
|
```shell |
|
|
docker run -it --gpus all -v "$(pwd)":/workspace/Ming-UniAudio ming:py310-cu121 /bin/bash
|
|
``` |
|
|
You can then run the model via the Python interface inside the container. Either download the Hugging Face model into the repo directory (`.../Ming-UniAudio/`) beforehand, or mount the downloaded model path when starting the container.
|
|
|
|
|
|
|
|
## Example Usage |
|
|
|
|
|
We provide a step-by-step running example: |
|
|
|
|
|
Step 1 - Download the source code |
|
|
```shell
|
|
git clone https://github.com/inclusionAI/Ming-UniAudio |
|
|
cd Ming-UniAudio |
|
|
``` |
|
|
Step 2 - Download the Ming-UniAudio model weights and create a soft link to the source code directory |
|
|
|
|
|
Download our model following the `Model & Benchmark Downloads` section above.
|
|
|
|
|
```shell |
|
|
mkdir inclusionAI |
|
|
ln -s /path/to/inclusionAI/Ming-UniAudio-16B-A3B inclusionAI/Ming-UniAudio-16B-A3B |
|
|
``` |
|
|
|
|
|
Step 3 - Enter the code directory and run the Ming-UniAudio model, for example via the demo notebook:
|
|
```shell |
|
|
jupyter notebook cookbooks/demo.ipynb |
|
|
``` |
|
|
|
|
|
We also provide a simple example of how to use this repo. For detailed usage, please refer to [demo.ipynb](https://github.com/inclusionAI/Ming-UniAudio/blob/main/cookbooks/demo.ipynb).
|
|
|
|
|
```python |
|
|
import warnings |
|
|
import torch |
|
|
from transformers import AutoProcessor |
|
|
|
|
|
from modeling_bailingmm import BailingMMNativeForConditionalGeneration |
|
|
|
|
|
import random |
|
|
import numpy as np |
|
|
from loguru import logger |
|
|
|
|
|
def seed_everything(seed=1895): |
|
|
random.seed(seed) |
|
|
np.random.seed(seed) |
|
|
torch.manual_seed(seed) |
|
|
torch.cuda.manual_seed(seed) |
|
|
torch.cuda.manual_seed_all(seed) |
|
|
torch.backends.cudnn.deterministic = True |
|
|
torch.backends.cudnn.benchmark = False |
|
|
|
|
|
seed_everything() |
|
|
warnings.filterwarnings("ignore") |
|
|
|
|
|
class MingAudio: |
|
|
def __init__(self, model_path, device="cuda:0"): |
|
|
self.device = device |
|
|
self.model = BailingMMNativeForConditionalGeneration.from_pretrained( |
|
|
model_path, |
|
|
torch_dtype=torch.bfloat16, |
|
|
low_cpu_mem_usage=True, |
|
|
).eval().to(torch.bfloat16).to(self.device) |
|
|
        # Load the processor from the current working directory (run from the Ming-UniAudio repo root).
        self.processor = AutoProcessor.from_pretrained(".", trust_remote_code=True)
|
|
self.tokenizer = self.processor.tokenizer |
|
|
self.sample_rate = self.processor.audio_processor.sample_rate |
|
|
self.patch_size = self.processor.audio_processor.patch_size |
|
|
|
|
|
def speech_understanding(self, messages): |
|
|
text = self.processor.apply_chat_template(messages, add_generation_prompt=True) |
|
|
image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages) |
|
|
|
|
|
inputs = self.processor( |
|
|
text=[text], |
|
|
images=image_inputs, |
|
|
videos=video_inputs, |
|
|
audios=audio_inputs, |
|
|
return_tensors="pt", |
|
|
).to(self.device) |
|
|
|
|
|
for k in inputs.keys(): |
|
|
if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats": |
|
|
inputs[k] = inputs[k].to(dtype=torch.bfloat16) |
|
|
logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}") |
|
|
|
|
|
generated_ids = self.model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=512, |
|
|
eos_token_id=self.processor.gen_terminator, |
|
|
) |
|
|
generated_ids_trimmed = [ |
|
|
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) |
|
|
] |
|
|
output_text = self.processor.batch_decode( |
|
|
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False |
|
|
)[0] |
|
|
|
|
|
return output_text |
|
|
|
|
|
def speech_generation( |
|
|
self, |
|
|
text, |
|
|
prompt_wav_path, |
|
|
prompt_text, |
|
|
lang='zh', |
|
|
output_wav_path='out.wav' |
|
|
): |
|
|
waveform = self.model.generate_tts( |
|
|
text=text, |
|
|
prompt_wav_path=prompt_wav_path, |
|
|
prompt_text=prompt_text, |
|
|
patch_size=self.patch_size, |
|
|
tokenizer=self.tokenizer, |
|
|
lang=lang, |
|
|
output_wav_path=output_wav_path, |
|
|
sample_rate=self.sample_rate, |
|
|
device=self.device |
|
|
) |
|
|
|
|
|
return waveform |
|
|
|
|
|
def speech_edit( |
|
|
self, |
|
|
messages, |
|
|
output_wav_path='out.wav' |
|
|
): |
|
|
text = self.processor.apply_chat_template(messages, add_generation_prompt=True) |
|
|
image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages) |
|
|
|
|
|
inputs = self.processor( |
|
|
text=[text], |
|
|
images=image_inputs, |
|
|
videos=video_inputs, |
|
|
audios=audio_inputs, |
|
|
return_tensors="pt", |
|
|
).to(self.device) |
|
|
|
|
|
ans = torch.tensor([self.tokenizer.encode('<answer>')]).to(inputs['input_ids'].device) |
|
|
inputs['input_ids'] = torch.cat([inputs['input_ids'], ans], dim=1) |
|
|
attention_mask = inputs['attention_mask'] |
|
|
inputs['attention_mask'] = torch.cat((attention_mask, attention_mask[:, :1]), dim=-1) |
|
|
for k in inputs.keys(): |
|
|
if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats": |
|
|
inputs[k] = inputs[k].to(dtype=torch.bfloat16) |
|
|
logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}") |
|
|
|
|
|
edited_speech, edited_text = self.model.generate_edit( |
|
|
**inputs, |
|
|
tokenizer=self.tokenizer, |
|
|
output_wav_path=output_wav_path |
|
|
) |
|
|
return edited_speech, edited_text |
|
|
|
|
|
if __name__ == "__main__": |
|
|
model = MingAudio("inclusionAI/Ming-UniAudio-16B-A3B") |
|
|
|
|
|
# ASR |
|
|
messages = [ |
|
|
{ |
|
|
"role": "HUMAN", |
|
|
"content": [ |
|
|
{ |
|
|
"type": "text", |
|
|
"text": "Please recognize the language of this speech and transcribe it. Format: oral.", |
|
|
}, |
|
|
|
|
|
{"type": "audio", "audio": "data/wavs/BAC009S0915W0292.wav"}, |
|
|
], |
|
|
}, |
|
|
] |
|
|
|
|
|
response = model.speech_understanding(messages=messages) |
|
|
logger.info(f"Generated Response: {response}") |
|
|
|
|
|
# TTS |
|
|
model.speech_generation( |
|
|
text='我们的愿景是构建未来服务业的数字化基础设施,为世界带来更多微小而美好的改变。', |
|
|
prompt_wav_path='data/wavs/10002287-00000094.wav', |
|
|
prompt_text='在此奉劝大家别乱打美白针。', |
|
|
) |
|
|
``` |
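For free-form editing, the `speech_edit` method defined above takes a natural-language instruction together with the source audio. Below is a minimal sketch building on the `MingAudio` class from the example; it assumes the Ming-UniAudio-16B-A3B-Edit weights have been downloaded, and the instruction wording and output path are placeholders rather than an official recipe:

```python
# Speech editing (illustrative): reuse the MingAudio wrapper with the Edit checkpoint.
edit_model = MingAudio("inclusionAI/Ming-UniAudio-16B-A3B-Edit")
edit_messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "Please delete the filler words from this speech."},
            {"type": "audio", "audio": "data/wavs/BAC009S0915W0292.wav"},
        ],
    },
]
edited_speech, edited_text = edit_model.speech_edit(
    messages=edit_messages,
    output_wav_path="edited.wav",
)
logger.info(f"Edited Text: {edited_text}")
```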
|
|
|
|
|
Note: The examples were tested on NVIDIA H800-80GB/H20-96G GPUs with CUDA 12.4.
|
|
|
|
|
|
|
|
## Citation |
|
|
|
|
|
If you find our work helpful, please consider citing it.