|
|
--- |
|
|
license: apache-2.0 |
|
|
pipeline_tag: any-to-any |
|
|
--- |
|
|
# Ming-UniAudio |
|
|
|
|
|
<p align="center">📑 <a href="https://mdn.alipayobjects.com/cto_asrtts/uri/file/as/TR-Ming-UniAudio.pdf">Technical Report</a>|📖<a href="https://xqacmer.github.io/Ming-Unitok-Audio.github.io/">Project Page</a> |🤗 <a href="https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B">Hugging Face</a>| 🤖 <a href="https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B">ModelScope</a> |
|
|
|
|
|
|
|
|
|
|
|
## Introduction |
|
|
|
|
|
Ming-UniAudio is a novel framework that unifies speech understanding, generation, and editing. Its core is a unified continuous speech tokenizer that integrates semantic and acoustic features within an end-to-end model. On top of this tokenizer, we developed a speech language model that strikes a balance between generation and understanding capabilities. Leveraging this foundational model, which exhibits robust performance in both domains, we further trained a dedicated speech editing model built upon [Ming-Lite-Omni](https://github.com/inclusionAI/Ming). Crucially, Ming-UniAudio is the first to enable universal, free-form speech editing guided solely by natural language instructions, handling complex semantic and acoustic modifications without manual region specification.
|
|
|
|
|
- 🔥 First unified continuous speech tokenizer for both understanding and generation tasks: [MingTok-Audio](https://github.com/inclusionAI/MingTok-Audio) |
|
|
- 🔥 First Speech LLM with a unified continuous tokenizer for both understanding and generation: [Ming-UniAudio](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B)
|
|
- 🔥 First universal free-form speech editing model for various semantic and acoustic editing tasks without any temporal region specification: [Ming-UniAudio-Edit](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B-Edit)
|
|
- 🔥 First benchmark for free-form speech editing: [Ming-Freeform-Audio-Edit-Benchmark](https://huggingface.co/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark) |
|
|
|
|
|
<p align="center"> |
|
|
<img src="./figures/uniaudio.png" width="600"/> |
|
|
</p>
|
|
|
|
|
## 📌 Updates |
|
|
|
|
|
* [2025.09.30] 🔥 We release [Ming-UniAudio](https://xqacmer.github.io/Ming-Unitok-Audio.github.io/) with significant improvements across speech understanding, generation, and free-form editing tasks. |
|
|
|
|
|
|
|
|
## Key Features |
|
|
Compared to other audio-assisted LLMs, Ming-UniAudio offers the following key features:
|
|
- **Unified Continuous Speech Tokenizer**: Ming-UniAudio proposes [MingTok-Audio](https://github.com/inclusionAI/MingTok-Audio), a unified continuous speech tokenizer based on a VAE framework with a causal Transformer architecture. It is the first continuous speech tokenizer to effectively integrate semantic and acoustic features, and its hierarchical feature representations enable a closed-loop system with LLMs, making it suitable for both understanding and generation tasks (a conceptual sketch of this loop follows the list).
|
|
|
|
|
- **Unified Speech Language Model for Generation and Understanding**: We pretrain an end-to-end unified speech language model with a single LLM backbone for both understanding and generation tasks, enhanced with a Diffusion Head to ensure high-fidelity speech synthesis. |
|
|
- **Instruction-Guided Free-Form Speech Editing**: We introduce the first instruction-guided, free-form speech editing framework that supports comprehensive semantic and acoustic edits without requiring explicit edit regions, along with [Ming-Freeform-Audio-Edit](https://huggingface.co/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark), the first open-source evaluation set for such tasks. |
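The closed-loop idea can be pictured as follows. This is a purely illustrative sketch with made-up module names and toy dimensions, not the actual MingTok-Audio or Ming-UniAudio API: audio is encoded into continuous latent patches that the LLM consumes for understanding, and latents predicted by the LLM (e.g. via the diffusion head) are decoded back into waveforms for generation.

```python
# Illustrative only: the real MingTok-Audio / Ming-UniAudio classes and signatures
# live in their respective repos; the module below is a toy placeholder.
import torch
import torch.nn as nn

class ToyContinuousTokenizer(nn.Module):
    """Stand-in for a VAE-style continuous speech tokenizer (encoder + decoder)."""
    def __init__(self, patch_size=320, latent_dim=64):
        super().__init__()
        self.patch_size = patch_size
        self.encoder = nn.Linear(patch_size, latent_dim)   # waveform patch -> continuous latent
        self.decoder = nn.Linear(latent_dim, patch_size)   # continuous latent -> waveform patch

    def encode(self, wav):                                  # wav: [B, T]
        patches = wav.unfold(1, self.patch_size, self.patch_size)  # [B, N, patch_size]
        return self.encoder(patches)                        # [B, N, latent_dim]

    def decode(self, latents):                              # latents: [B, N, latent_dim]
        return self.decoder(latents).flatten(1)             # [B, N * patch_size]

# Closed loop: continuous latents feed the LLM for understanding, and latents
# predicted by the LLM (e.g. via a diffusion head) are decoded back to audio.
tokenizer = ToyContinuousTokenizer()
wav = torch.randn(1, 16000)                                 # 1 s of dummy 16 kHz audio
latents = tokenizer.encode(wav)                             # LM input for understanding
reconstructed = tokenizer.decode(latents)                   # LM output path for generation
print(latents.shape, reconstructed.shape)
```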
|
|
|
|
|
|
|
|
<!-- <p align="center"> |
|
|
<img src="./figures/uniaudio-tokenizer.pdf" width="600"/> |
|
|
<p> --> |
|
|
|
|
|
## Evaluation |
|
|
In various benchmark tests, Ming-UniAudio demonstrates highly competitive results compared to industry-leading models of similar scale. |
|
|
|
|
|
|
|
|
|
|
|
### Speech Understanding |
|
|
|
|
|
<table> |
|
|
<caption>ASR performance comparison (error rate, %; lower is better) on various audio benchmark datasets. The best results are in <strong>bold</strong>.</caption>
|
|
<thead> |
|
|
<tr> |
|
|
<th rowspan="2"><strong>Datasets</strong></th> |
|
|
<th rowspan="2"><strong>Model</strong></th> |
|
|
<th colspan="7"><strong>Performance</strong></th> |
|
|
</tr> |
|
|
<tr> |
|
|
<th><strong>aishell2-ios</strong></th> |
|
|
<th><strong>LS-clean</strong></th> |
|
|
<th><strong>Hunan</strong></th> |
|
|
<th><strong>Minnan</strong></th> |
|
|
<th><strong>Guangyue</strong></th> |
|
|
<th><strong>Chuanyu</strong></th> |
|
|
<th><strong>Shanghai</strong></th> |
|
|
</tr> |
|
|
</thead> |
|
|
<tbody> |
|
|
<tr> |
|
|
<td rowspan="4"><strong>Understanding ASR</strong></td> |
|
|
<td>Kimi-Audio</td> |
|
|
<td><strong>2.56</strong></td> |
|
|
<td><strong>1.28</strong></td> |
|
|
<td>31.93</td> |
|
|
<td>80.28</td> |
|
|
<td>41.49</td> |
|
|
<td>6.69</td> |
|
|
<td>60.64</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>Qwen2.5 Omni</td> |
|
|
<td>2.75</td> |
|
|
<td>1.80</td> |
|
|
<td>29.31</td> |
|
|
<td>53.43</td> |
|
|
<td>10.39</td> |
|
|
<td>7.61</td> |
|
|
<td>32.05</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>Qwen2 Audio</td> |
|
|
<td>2.92</td> |
|
|
<td>1.60</td> |
|
|
<td>25.88</td> |
|
|
<td>123.78</td> |
|
|
<td>7.59</td> |
|
|
<td>7.77</td> |
|
|
<td>31.73</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td><strong>Ming-UniAudio-16B-A3B (ours)</strong></td>
|
|
<td>2.84</td> |
|
|
<td>1.62</td> |
|
|
<td><strong>9.80</strong></td> |
|
|
<td><strong>16.50</strong></td> |
|
|
<td><strong>5.51</strong></td> |
|
|
<td><strong>5.46</strong></td> |
|
|
<td><strong>14.65</strong></td> |
|
|
</tr> |
|
|
</tbody> |
|
|
</table> |
|
|
|
|
|
|
|
|
### Speech Generation |
|
|
|
|
|
<table align="center"> |
|
|
<caption>Speech generation performance comparison on various audio benchmark datasets (lower WER and higher SIM are better). The best results are in <strong>bold</strong>.</caption>
|
|
<thead> |
|
|
<tr> |
|
|
<th align="left"><b>Datasets</b></th> |
|
|
<th align="left"><b>Model</b></th> |
|
|
<th colspan="4" align="center"><b>Performance</b></th> |
|
|
</tr> |
|
|
<tr> |
|
|
<th></th> |
|
|
<th></th> |
|
|
<th align="center"><b>Seed-zh WER(%)</b></th> |
|
|
<th align="center"><b>Seed-zh SIM</b></th> |
|
|
<th align="center"><b>Seed-en WER(%)</b></th> |
|
|
<th align="center"><b>Seed-en SIM</b></th> |
|
|
</tr> |
|
|
</thead> |
|
|
<tbody> |
|
|
<tr> |
|
|
<td rowspan="5" align="left" style="vertical-align: middle;"><b>Generation</b></td> |
|
|
<td align="left">Seed-TTS</td> |
|
|
<td align="center">1.12</td> |
|
|
<td align="center"><b>0.80</b></td> |
|
|
<td align="center">2.25</td> |
|
|
<td align="center"><b>0.76</b></td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td align="left">MiMo-Audio</td> |
|
|
<td align="center">1.96</td> |
|
|
<td align="center">-</td> |
|
|
<td align="center">5.37</td> |
|
|
<td align="center">-</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td align="left">Qwen3-Omni-30B-A3B-Instruct</td> |
|
|
<td align="center">1.07</td> |
|
|
<td align="center">-</td> |
|
|
<td align="center"><b>1.39</b></td> |
|
|
<td align="center">-</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td align="left">Ming-Omni-Lite</td> |
|
|
<td align="center">1.69</td> |
|
|
<td align="center">0.68</td> |
|
|
<td align="center">4.31</td> |
|
|
<td align="center">0.51</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td align="left"><strong>Ming-UniAudio-16B-A3B(ours)</strong></td> |
|
|
<td align="center"><b>0.95</b></td> |
|
|
<td align="center">0.70</td> |
|
|
<td align="center">1.85</td> |
|
|
<td align="center">0.58</td> |
|
|
</tr> |
|
|
</tbody> |
|
|
</table> |
|
|
|
|
|
|
|
|
## Model & Benchmark Downloads |
|
|
|
|
|
You can download our latest models and benchmark from both Hugging Face and ModelScope.
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
| **Type** | **Model** | **Input modality** | **Output modality** | **Download** |
|
|
|:-----------------------|:-----------------------|:----------------------:| :---------------: |:------------------------------------------------------------------------------------------------------------------------------------------------------------:| |
|
|
| Tokenizer | MingTok-Audio | audio | audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/MingTok-Audio) <br>[🤖 ModelScope](https://modelscope.cn/models/inclusionAI/MingTok-Audio) |
|
|
| SpeechLLM | Ming-UniAudio-16B-A3B | audio | audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B) <br>[🤖 ModelScope](https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B) |
|
|
| SpeechLLM | Ming-UniAudio-16B-A3B-Edit | text, audio | text, audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B-Edit) <br>[🤖 ModelScope](https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B-Edit) |
|
|
| Benchmark | Ming-Freeform-Audio-Edit | - | - | [🤗 HuggingFace](https://huggingface.co/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark) <br>[🤖 ModelScope](https://modelscope.cn/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark) <br>[Eval tools](https://github.com/inclusionAI/Ming-Freeform-Audio-Edit) |
|
|
</div> |
|
|
If you are in mainland China, we strongly recommend downloading our model from 🤖 <a href="https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B">ModelScope</a>.
|
|
|
|
|
```shell
|
|
pip install modelscope |
|
|
modelscope download --model inclusionAI/Ming-UniAudio-16B-A3B --local_dir inclusionAI/Ming-UniAudio-16B-A3B --revision master |
|
|
``` |
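Alternatively, you can fetch the weights from Hugging Face. A minimal sketch using `huggingface_hub` (install it first with `pip install huggingface_hub`; the local directory below mirrors the layout used elsewhere in this guide):

```python
from huggingface_hub import snapshot_download

# Download the full model repository into a local directory (resumable on interruption).
snapshot_download(
    repo_id="inclusionAI/Ming-UniAudio-16B-A3B",
    local_dir="inclusionAI/Ming-UniAudio-16B-A3B",
)
```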
|
|
|
|
|
Note: This download process will take several minutes to several hours, depending on your network conditions. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Use Cases |
|
|
|
|
|
Additional demonstration cases are available on our project [page](https://xqacmer.github.io/Ming-Unitok-Audio.github.io/). |
|
|
|
|
|
|
|
|
## Environment Preparation |
|
|
|
|
|
|
|
|
### Installation with pip |
|
|
```shell |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
|
|
|
### Installation with docker |
|
|
|
|
|
You can also initialize the environment by building the docker image. First clone this repository: |
|
|
```shell |
|
|
git clone --depth 1 https://github.com/inclusionAI/Ming-UniAudio |
|
|
cd Ming-UniAudio |
|
|
``` |
|
|
Then build the docker image with the provided Dockerfile in `docker/docker-py310-cu121`. This step might take a while: |
|
|
```shell |
|
|
docker build -t ming:py310-cu121 docker/docker-py310-cu121 |
|
|
``` |
|
|
Finally, start the container with the current repo directory mounted:
|
|
```shell |
|
|
docker run -it --gpus all -v "$(pwd)":/workspace/Ming-UniAudio ming:py310-cu121 /bin/bash
|
|
``` |
|
|
You can then run the model via the Python interface inside the container. Either download the Hugging Face model into the repo directory (`.../Ming-UniAudio/`) beforehand, or mount the downloaded model path when starting the container.
|
|
|
|
|
|
|
|
## Example Usage |
|
|
|
|
|
We provide a step-by-step running example: |
|
|
|
|
|
Step 1 - Download the source code |
|
|
```shell
|
|
git clone https://github.com/inclusionAI/Ming-UniAudio |
|
|
cd Ming-UniAudio |
|
|
``` |
|
|
Step 2 - Download the Ming-UniAudio model weights and create a soft link to the source code directory |
|
|
|
|
|
Download our model following the `Model & Benchmark Downloads` section above.
|
|
|
|
|
```shell |
|
|
mkdir inclusionAI |
|
|
ln -s /path/to/inclusionAI/Ming-UniAudio-16B-A3B inclusionAI/Ming-UniAudio-16B-A3B |
|
|
``` |
|
|
|
|
|
Step 3 - Enter the code directory and run the Ming-UniAudio model, for example via the demo notebook:
|
|
```shell |
|
|
jupyter notebook cookbooks/demo.ipynb |
|
|
``` |
|
|
|
|
|
We also provide a simple example of how to use this repo. For detailed usage, please refer to [demo.ipynb](https://github.com/inclusionAI/Ming-UniAudio/blob/main/cookbooks/demo.ipynb).
|
|
|
|
|
```python |
|
|
import warnings |
|
|
import torch |
|
|
from transformers import AutoProcessor |
|
|
|
|
|
from modeling_bailingmm import BailingMMNativeForConditionalGeneration |
|
|
|
|
|
import random |
|
|
import numpy as np |
|
|
from loguru import logger |
|
|
|
|
|
def seed_everything(seed=1895): |
|
|
random.seed(seed) |
|
|
np.random.seed(seed) |
|
|
torch.manual_seed(seed) |
|
|
torch.cuda.manual_seed(seed) |
|
|
torch.cuda.manual_seed_all(seed) |
|
|
torch.backends.cudnn.deterministic = True |
|
|
torch.backends.cudnn.benchmark = False |
|
|
|
|
|
seed_everything() |
|
|
warnings.filterwarnings("ignore") |
|
|
|
|
|
class MingAudio: |
|
|
def __init__(self, model_path, device="cuda:0"): |
|
|
self.device = device |
|
|
self.model = BailingMMNativeForConditionalGeneration.from_pretrained( |
|
|
model_path, |
|
|
torch_dtype=torch.bfloat16, |
|
|
low_cpu_mem_usage=True, |
|
|
).eval().to(torch.bfloat16).to(self.device) |
|
|
        # Load the processor from the current working directory (run from the Ming-UniAudio repo root).
        self.processor = AutoProcessor.from_pretrained(".", trust_remote_code=True)
|
|
self.tokenizer = self.processor.tokenizer |
|
|
self.sample_rate = self.processor.audio_processor.sample_rate |
|
|
self.patch_size = self.processor.audio_processor.patch_size |
|
|
|
|
|
def speech_understanding(self, messages): |
|
|
text = self.processor.apply_chat_template(messages, add_generation_prompt=True) |
|
|
image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages) |
|
|
|
|
|
inputs = self.processor( |
|
|
text=[text], |
|
|
images=image_inputs, |
|
|
videos=video_inputs, |
|
|
audios=audio_inputs, |
|
|
return_tensors="pt", |
|
|
).to(self.device) |
|
|
|
|
|
for k in inputs.keys(): |
|
|
if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats": |
|
|
inputs[k] = inputs[k].to(dtype=torch.bfloat16) |
|
|
logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}") |
|
|
|
|
|
generated_ids = self.model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=512, |
|
|
eos_token_id=self.processor.gen_terminator, |
|
|
) |
|
|
generated_ids_trimmed = [ |
|
|
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) |
|
|
] |
|
|
output_text = self.processor.batch_decode( |
|
|
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False |
|
|
)[0] |
|
|
|
|
|
return output_text |
|
|
|
|
|
def speech_generation( |
|
|
self, |
|
|
text, |
|
|
prompt_wav_path, |
|
|
prompt_text, |
|
|
lang='zh', |
|
|
output_wav_path='out.wav' |
|
|
): |
|
|
waveform = self.model.generate_tts( |
|
|
text=text, |
|
|
prompt_wav_path=prompt_wav_path, |
|
|
prompt_text=prompt_text, |
|
|
patch_size=self.patch_size, |
|
|
tokenizer=self.tokenizer, |
|
|
lang=lang, |
|
|
output_wav_path=output_wav_path, |
|
|
sample_rate=self.sample_rate, |
|
|
device=self.device |
|
|
) |
|
|
|
|
|
return waveform |
|
|
|
|
|
def speech_edit( |
|
|
self, |
|
|
messages, |
|
|
output_wav_path='out.wav' |
|
|
): |
|
|
text = self.processor.apply_chat_template(messages, add_generation_prompt=True) |
|
|
image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages) |
|
|
|
|
|
inputs = self.processor( |
|
|
text=[text], |
|
|
images=image_inputs, |
|
|
videos=video_inputs, |
|
|
audios=audio_inputs, |
|
|
return_tensors="pt", |
|
|
).to(self.device) |
|
|
|
|
|
ans = torch.tensor([self.tokenizer.encode('<answer>')]).to(inputs['input_ids'].device) |
|
|
inputs['input_ids'] = torch.cat([inputs['input_ids'], ans], dim=1) |
|
|
attention_mask = inputs['attention_mask'] |
|
|
inputs['attention_mask'] = torch.cat((attention_mask, attention_mask[:, :1]), dim=-1) |
|
|
for k in inputs.keys(): |
|
|
if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats": |
|
|
inputs[k] = inputs[k].to(dtype=torch.bfloat16) |
|
|
logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}") |
|
|
|
|
|
edited_speech, edited_text = self.model.generate_edit( |
|
|
**inputs, |
|
|
tokenizer=self.tokenizer, |
|
|
output_wav_path=output_wav_path |
|
|
) |
|
|
return edited_speech, edited_text |
|
|
|
|
|
if __name__ == "__main__": |
|
|
model = MingAudio("inclusionAI/Ming-UniAudio-16B-A3B") |
|
|
|
|
|
# ASR |
|
|
messages = [ |
|
|
{ |
|
|
"role": "HUMAN", |
|
|
"content": [ |
|
|
{ |
|
|
"type": "text", |
|
|
"text": "Please recognize the language of this speech and transcribe it. Format: oral.", |
|
|
}, |
|
|
|
|
|
{"type": "audio", "audio": "data/wavs/BAC009S0915W0292.wav"}, |
|
|
], |
|
|
}, |
|
|
] |
|
|
|
|
|
response = model.speech_understanding(messages=messages) |
|
|
logger.info(f"Generated Response: {response}") |
|
|
|
|
|
# TTS |
|
|
model.speech_generation( |
|
|
text='我们的愿景是构建未来服务业的数字化基础设施,为世界带来更多微小而美好的改变。', |
|
|
prompt_wav_path='data/wavs/10002287-00000094.wav', |
|
|
prompt_text='在此奉劝大家别乱打美白针。', |
|
|
) |
|
|
``` |
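For free-form editing, the `speech_edit` method defined above takes a natural-language instruction together with the source audio. Below is a minimal sketch building on the `MingAudio` class from the example; it assumes the Ming-UniAudio-16B-A3B-Edit weights have been downloaded, and the instruction wording and output path are placeholders rather than an official recipe:

```python
# Speech editing (illustrative): reuse the MingAudio wrapper with the Edit checkpoint.
edit_model = MingAudio("inclusionAI/Ming-UniAudio-16B-A3B-Edit")
edit_messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "Please delete the filler words from this speech."},
            {"type": "audio", "audio": "data/wavs/BAC009S0915W0292.wav"},
        ],
    },
]
edited_speech, edited_text = edit_model.speech_edit(
    messages=edit_messages,
    output_wav_path="edited.wav",
)
logger.info(f"Edited Text: {edited_text}")
```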
|
|
|
|
|
Note: The examples were tested on NVIDIA H800-80GB/H20-96G GPUs with CUDA 12.4.
|
|
|
|
|
|
|
|
## Citation |
|
|
|
|
|
If you find our work helpful, please consider citing it.