---
license: apache-2.0
pipeline_tag: any-to-any
---
# Ming-UniAudio
<p align="center">📑 <a href="https://mdn.alipayobjects.com/cto_asrtts/uri/file/as/TR-Ming-UniAudio.pdf">Technical Report</a>|📖<a href="https://xqacmer.github.io/Ming-Unitok-Audio.github.io/">Project Page</a> |🤗 <a href="https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B">Hugging Face</a>| 🤖 <a href="https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B">ModelScope</a>
## Introduction
Ming-UniAudio is a novel framework that unifies speech understanding, generation, and editing. At its core is a unified continuous speech tokenizer that integrates semantic and acoustic features within an end-to-end model. Building on this tokenizer, we developed a speech language model that balances generation and understanding capabilities. Leveraging this foundational model, which exhibits robust performance in both domains, we further trained a dedicated speech editing model built upon [Ming-Lite-Omni](https://github.com/inclusionAI/Ming). Crucially, Ming-UniAudio is the first to enable universal, free-form speech editing guided solely by natural language instructions, handling complex semantic and acoustic modifications without manual region specification.
- 🔥 First unified continuous speech tokenizer for both understanding and generation tasks: [MingTok-Audio](https://github.com/inclusionAI/MingTok-Audio)
- 🔥 First Speech LLM with a unified continuous tokenizer for both understanding and generation: [Ming-UniAudio](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B)
- 🔥 First universal free-form speech editing model for various semantic and acoustic editing tasks without any temporal region specification: [Ming-UniAudio-Edit](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B-Edit)
- 🔥 First benchmark for free-form speech editing: [Ming-Freeform-Audio-Edit-Benchmark](https://huggingface.co/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark)
<p align="center">
<img src="./figures/uniaudio.png" width="600"/>
<p>
## 📌 Updates
* [2025.09.30] 🔥 We release [Ming-UniAudio](https://xqacmer.github.io/Ming-Unitok-Audio.github.io/) with significant improvements across speech understanding, generation, and free-form editing tasks.
## Key Features
Compared to other audio-assisted LLMs, Ming-UniAudio features the following key optimizations:
- **Unified Continuous Speech Tokenizer**: Ming-UniAudio proposes [MingTok-Audio](https://github.com/inclusionAI/MingTok-Audio), a unified continuous speech tokenizer based on a VAE framework with a causal Transformer architecture. It is the first continuous speech tokenizer to effectively integrate semantic and acoustic features, and its hierarchical feature representations enable a closed-loop system with LLMs, making it suitable for both understanding and generation tasks (see the illustrative sketch after this list).
- **Unified Speech Language Model for Generation and Understanding**: We pretrain an end-to-end unified speech language model with a single LLM backbone for both understanding and generation tasks, enhanced with a Diffusion Head to ensure high-fidelity speech synthesis.
- **Instruction-Guided Free-Form Speech Editing**: We introduce the first instruction-guided, free-form speech editing framework that supports comprehensive semantic and acoustic edits without requiring explicit edit regions, along with [Ming-Freeform-Audio-Edit](https://huggingface.co/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark), the first open-source evaluation set for such tasks.
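The closed loop above can be pictured as: the continuous tokenizer encodes speech into latent features, the LLM backbone consumes and produces those features, and a diffusion head plus the tokenizer decoder render them back into a waveform. Below is a purely illustrative sketch of that flow; every name in it (`closed_loop_sketch`, `tokenizer.encode`, `speech_llm.generate`, `speech_llm.diffusion_head`, `tokenizer.decode`) is a hypothetical placeholder, not the actual Ming-UniAudio or MingTok-Audio API.

```python
# Purely illustrative sketch of the tokenizer <-> LLM closed loop.
# All names below are hypothetical placeholders, not the real API.
import torch


def closed_loop_sketch(tokenizer, speech_llm, waveform: torch.Tensor) -> torch.Tensor:
    # 1) Continuous tokenizer: waveform -> continuous latents that carry both
    #    semantic and acoustic information (hierarchical features).
    latents = tokenizer.encode(waveform)                             # hypothetical call

    # 2) Unified LLM backbone: understanding and generation operate on the
    #    same continuous representation.
    generated_latents = speech_llm.generate(latents)                 # hypothetical call

    # 3) Diffusion head refines the latents, then the tokenizer decoder
    #    reconstructs a high-fidelity waveform, closing the loop.
    refined_latents = speech_llm.diffusion_head(generated_latents)   # hypothetical call
    return tokenizer.decode(refined_latents)                         # hypothetical call
```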
## Evaluation
In various benchmark tests, Ming-UniAudio demonstrates highly competitive results compared to industry-leading models of similar scale.
### Speech Understanding
<table>
<caption>ASR performance comparison on various audio benchmark datasets. The best results are in <strong>bold</strong>.</caption>
<thead>
<tr>
<th rowspan="2"><strong>Datasets</strong></th>
<th rowspan="2"><strong>Model</strong></th>
<th colspan="7"><strong>Performance</strong></th>
</tr>
<tr>
<th><strong>aishell2-ios</strong></th>
<th><strong>LS-clean</strong></th>
<th><strong>Hunan</strong></th>
<th><strong>Minnan</strong></th>
<th><strong>Guangyue</strong></th>
<th><strong>Chuanyu</strong></th>
<th><strong>Shanghai</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="4"><strong>Understanding ASR</strong></td>
<td>Kimi-Audio</td>
<td><strong>2.56</strong></td>
<td><strong>1.28</strong></td>
<td>31.93</td>
<td>80.28</td>
<td>41.49</td>
<td>6.69</td>
<td>60.64</td>
</tr>
<tr>
<td>Qwen2.5 Omni</td>
<td>2.75</td>
<td>1.80</td>
<td>29.31</td>
<td>53.43</td>
<td>10.39</td>
<td>7.61</td>
<td>32.05</td>
</tr>
<tr>
<td>Qwen2 Audio</td>
<td>2.92</td>
<td>1.60</td>
<td>25.88</td>
<td>123.78</td>
<td>7.59</td>
<td>7.77</td>
<td>31.73</td>
</tr>
<tr>
<td><strong>Ming-UniAudio-16B-A3B (ours)</strong></td>
<td>2.84</td>
<td>1.62</td>
<td><strong>9.80</strong></td>
<td><strong>16.50</strong></td>
<td><strong>5.51</strong></td>
<td><strong>5.46</strong></td>
<td><strong>14.65</strong></td>
</tr>
</tbody>
</table>
### Speech Generation
<table align="center">
<caption>Performance comparison on various audio benchmark datasets. The best results are in <strong>bold</strong>.</caption>
<thead>
<tr>
<th align="left"><b>Datasets</b></th>
<th align="left"><b>Model</b></th>
<th colspan="4" align="center"><b>Performance</b></th>
</tr>
<tr>
<th></th>
<th></th>
<th align="center"><b>Seed-zh WER(%)</b></th>
<th align="center"><b>Seed-zh SIM</b></th>
<th align="center"><b>Seed-en WER(%)</b></th>
<th align="center"><b>Seed-en SIM</b></th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="5" align="left" style="vertical-align: middle;"><b>Generation</b></td>
<td align="left">Seed-TTS</td>
<td align="center">1.12</td>
<td align="center"><b>0.80</b></td>
<td align="center">2.25</td>
<td align="center"><b>0.76</b></td>
</tr>
<tr>
<td align="left">MiMo-Audio</td>
<td align="center">1.96</td>
<td align="center">-</td>
<td align="center">5.37</td>
<td align="center">-</td>
</tr>
<tr>
<td align="left">Qwen3-Omni-30B-A3B-Instruct</td>
<td align="center">1.07</td>
<td align="center">-</td>
<td align="center"><b>1.39</b></td>
<td align="center">-</td>
</tr>
<tr>
<td align="left">Ming-Omni-Lite</td>
<td align="center">1.69</td>
<td align="center">0.68</td>
<td align="center">4.31</td>
<td align="center">0.51</td>
</tr>
<tr>
<td align="left"><strong>Ming-UniAudio-16B-A3B(ours)</strong></td>
<td align="center"><b>0.95</b></td>
<td align="center">0.70</td>
<td align="center">1.85</td>
<td align="center">0.58</td>
</tr>
</tbody>
</table>
## Model & Benchmark Downloads
You can download our latest models and benchmark from both Hugging Face and ModelScope.
<div align="center">
|**Type**| **Model** | **Input modality** | **Output modality** | **Download** |
|:-----------------------|:-----------------------|:----------------------:| :---------------: |:------------------------------------------------------------------------------------------------------------------------------------------------------------:|
|Tokenizer| MingTok-Audio | audio | audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/MingTok-Audio) <br>[🤖 ModelScope](https://modelscope.cn/models/inclusionAI/MingTok-Audio) |
|SpeechLLM| Ming-UniAudio-16B-A3B | audio | audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B) <br>[🤖 ModelScope](https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B) |
|SpeechLLM| Ming-UniAudio-16B-A3B-Edit | text, audio | text, audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B-Edit) <br>[🤖 ModelScope](https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B-Edit) |
|Benchmark| Ming-Freeform-Audio-Edit | - | - | [🤗 HuggingFace](https://huggingface.co/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark) <br>[🤖 ModelScope](https://modelscope.cn/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark) <br>[Eval tools](https://github.com/inclusionAI/Ming-Freeform-Audio-Edit)|
</div>
If you're in mainland China, we strongly recommend downloading our model from 🤖 <a href="https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B">ModelScope</a>.
```shell
pip install modelscope
modelscope download --model inclusionAI/Ming-UniAudio-16B-A3B --local_dir inclusionAI/Ming-UniAudio-16B-A3B --revision master
```
Note: This download process will take several minutes to several hours, depending on your network conditions.
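If you prefer to download from Hugging Face, here is a minimal sketch using the `huggingface_hub` Python package (assuming it is installed, e.g. via `pip install huggingface_hub`):
```python
# Minimal sketch: fetch the model snapshot from Hugging Face.
# Assumes the huggingface_hub package is installed (pip install huggingface_hub).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="inclusionAI/Ming-UniAudio-16B-A3B",
    local_dir="inclusionAI/Ming-UniAudio-16B-A3B",
)
```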
## Use Cases
Additional demonstration cases are available on our project [page](https://xqacmer.github.io/Ming-Unitok-Audio.github.io/).
## Environment Preparation
### Installation with pip
```shell
pip install -r requirements.txt
```
### Installation with Docker
You can also initialize the environment by building the docker image. First clone this repository:
```shell
git clone --depth 1 https://github.com/inclusionAI/Ming-UniAudio
cd Ming-UniAudio
```
Then build the docker image with the provided Dockerfile in `docker/docker-py310-cu121`. This step might take a while:
```shell
docker build -t ming:py310-cu121 docker/docker-py310-cu121
```
Finally, start the container with the current repo directory mounted:
```shell
docker run -it --gpus all -v "$(pwd)":/workspace/Ming-UniAudio ming:py310-cu121 /bin/bash
```
You can run the model with the Python interface. You may download the Hugging Face model into the repo directory first (`.../Ming-UniAudio/`) or mount the downloaded model path when starting the container.
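For a quick sanity check inside the container (assuming PyTorch is provided by the Docker image or installed from `requirements.txt`), you can verify that the mounted GPUs are visible:
```python
# Quick sanity check inside the container: confirm PyTorch sees the GPUs.
# Assumes PyTorch is provided by the Docker image or requirements.txt.
import torch

print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```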
## Example Usage
We provide a step-by-step running example:
Step 1 - Download the source code
```shell
git clone https://github.com/inclusionAI/Ming-UniAudio
cd Ming-UniAudio
```
Step 2 - Download the Ming-UniAudio model weights and create a soft link to them in the source code directory
Download our model following the `Model & Benchmark Downloads` section above, then create the link:
```shell
mkdir inclusionAI
ln -s /path/to/inclusionAI/Ming-UniAudio-16B-A3B inclusionAI/Ming-UniAudio-16B-A3B
```
Step 3 - Enter the code directory and run the Ming-UniAudio model, for example via the demo notebook:
```shell
jupyter notebook cookbooks/demo.ipynb
```
We also provide a simple example of how to use this repo. For detailed usage, please refer to [demo.ipynb](https://github.com/inclusionAI/Ming-UniAudio/blob/main/cookbooks/demo.ipynb).
```python
import warnings
import torch
from transformers import AutoProcessor
from modeling_bailingmm import BailingMMNativeForConditionalGeneration
import random
import numpy as np
from loguru import logger
def seed_everything(seed=1895):
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
seed_everything()
warnings.filterwarnings("ignore")
class MingAudio:
def __init__(self, model_path, device="cuda:0"):
self.device = device
self.model = BailingMMNativeForConditionalGeneration.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
).eval().to(torch.bfloat16).to(self.device)
self.processor = AutoProcessor.from_pretrained(".", trust_remote_code=True)
self.tokenizer = self.processor.tokenizer
self.sample_rate = self.processor.audio_processor.sample_rate
self.patch_size = self.processor.audio_processor.patch_size
def speech_understanding(self, messages):
text = self.processor.apply_chat_template(messages, add_generation_prompt=True)
image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages)
inputs = self.processor(
text=[text],
images=image_inputs,
videos=video_inputs,
audios=audio_inputs,
return_tensors="pt",
).to(self.device)
for k in inputs.keys():
if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
inputs[k] = inputs[k].to(dtype=torch.bfloat16)
logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}")
generated_ids = self.model.generate(
**inputs,
max_new_tokens=512,
eos_token_id=self.processor.gen_terminator,
)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = self.processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
return output_text
def speech_generation(
self,
text,
prompt_wav_path,
prompt_text,
lang='zh',
output_wav_path='out.wav'
):
waveform = self.model.generate_tts(
text=text,
prompt_wav_path=prompt_wav_path,
prompt_text=prompt_text,
patch_size=self.patch_size,
tokenizer=self.tokenizer,
lang=lang,
output_wav_path=output_wav_path,
sample_rate=self.sample_rate,
device=self.device
)
return waveform
def speech_edit(
self,
messages,
output_wav_path='out.wav'
):
text = self.processor.apply_chat_template(messages, add_generation_prompt=True)
image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages)
inputs = self.processor(
text=[text],
images=image_inputs,
videos=video_inputs,
audios=audio_inputs,
return_tensors="pt",
).to(self.device)
ans = torch.tensor([self.tokenizer.encode('<answer>')]).to(inputs['input_ids'].device)
inputs['input_ids'] = torch.cat([inputs['input_ids'], ans], dim=1)
attention_mask = inputs['attention_mask']
inputs['attention_mask'] = torch.cat((attention_mask, attention_mask[:, :1]), dim=-1)
for k in inputs.keys():
if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
inputs[k] = inputs[k].to(dtype=torch.bfloat16)
logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}")
edited_speech, edited_text = self.model.generate_edit(
**inputs,
tokenizer=self.tokenizer,
output_wav_path=output_wav_path
)
return edited_speech, edited_text
if __name__ == "__main__":
model = MingAudio("inclusionAI/Ming-UniAudio-16B-A3B")
# ASR
messages = [
{
"role": "HUMAN",
"content": [
{
"type": "text",
"text": "Please recognize the language of this speech and transcribe it. Format: oral.",
},
{"type": "audio", "audio": "data/wavs/BAC009S0915W0292.wav"},
],
},
]
response = model.speech_understanding(messages=messages)
logger.info(f"Generated Response: {response}")
# TTS
model.speech_generation(
text='我们的愿景是构建未来服务业的数字化基础设施,为世界带来更多微小而美好的改变。',
prompt_wav_path='data/wavs/10002287-00000094.wav',
prompt_text='在此奉劝大家别乱打美白针。',
)
```
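The `MingAudio` class above also exposes a `speech_edit` method. The following continuation is a hypothetical sketch: it assumes the dedicated edit checkpoint (`inclusionAI/Ming-UniAudio-16B-A3B-Edit`) and that edit requests use the same chat-message format as the ASR example, with a natural-language instruction plus the source audio. The instruction text and message layout here are illustrative assumptions; please refer to the cookbooks for the exact prompt format.
```python
# Hypothetical continuation of the example above (requires the MingAudio class).
# The instruction text and message layout are illustrative assumptions; see the
# official cookbooks for the exact prompt format used by the edit model.
edit_model = MingAudio("inclusionAI/Ming-UniAudio-16B-A3B-Edit")  # dedicated edit checkpoint
edit_messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "Please remove the filler words from this speech."},  # assumed instruction
            {"type": "audio", "audio": "data/wavs/BAC009S0915W0292.wav"},
        ],
    },
]
edited_speech, edited_text = edit_model.speech_edit(
    messages=edit_messages,
    output_wav_path="edited.wav",
)
logger.info(f"Edited Text: {edited_text}")
```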
Note: We tested the examples on NVIDIA H800-80GB/H20-96G GPUs with CUDA 12.4.
## Citation
If you find our work helpful, please consider citing us.