update

Files changed:
- README.md +1402 -0
- configuration_minicpm.py +1 -0
- modeling_minicpmo.py +424 -121
- modeling_navit_siglip.py +1 -0
- processing_minicpmo.py +6 -7
- utils.py +51 -2

README.md (ADDED)

@@ -0,0 +1,1402 @@
---
pipeline_tag: image-text-to-text
datasets:
- openbmb/RLAIF-V-Dataset
library_name: transformers
language:
- multilingual
tags:
- minicpm-o
- omni
- vision
- ocr
- multi-image
- video
- custom_code
---

<h1>A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone</h1>

[GitHub](https://github.com/OpenBMB/MiniCPM-V) | [Online Demo](https://minicpm-omni-webdemo-us.modelbest.cn)


## MiniCPM-o 2.6

**MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for realtime speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:

- 🔥 **Leading Visual Capability.**
  MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.

- 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual realtime speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and speech-to-text translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, voice cloning, role play, etc.

- 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support realtime speech interaction**. It **outperforms GPT-4o-realtime and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.

- 💪 **Strong OCR Capability and Others.**
Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
  Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.


- 🚀 **Superior Efficiency.**
  In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad.

- 💫 **Easy Usage.**
MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](XXX) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demos on the [CN](https://minicpm-omni-webdemo.modelbest.cn/) and [US](https://minicpm-omni-webdemo-us.modelbest.cn/) servers. A minimal loading sketch follows this list.
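Below is that sketch: an unofficial, minimal example of loading the model with Hugging Face Transformers. The `chat()` call and its arguments are assumptions based on the custom code shipped in this repository (`modeling_minicpmo.py`, `processing_minicpmo.py`); check those files for the exact interface and options.

```python
# Minimal, unofficial sketch: load MiniCPM-o 2.6 with Hugging Face Transformers.
# The chat() call below is an assumption based on this repo's custom code
# (modeling_minicpmo.py); consult that file for the exact signature.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-o-2_6"
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,        # required: the model ships custom modeling code
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "Describe this image."]}]

# Assumed custom interface exposed by the remote code.
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```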

**Model Architecture.**

- **End-to-end Omni-modal Architecture.** Different modality encoders/decoders are connected and trained in an end-to-end fashion to fully exploit rich multimodal knowledge.
- **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoders/decoders into online ones for streaming inputs/outputs. (2) We devise a time-division multiplexing (TDM) mechanism for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential information within small periodic time slices (see the toy sketch after this list).
- **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including the traditional text system prompt and a new audio system prompt that determines the assistant voice. This enables flexible voice configuration at inference time, and also facilitates voice cloning and description-based voice creation.
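To make the TDM idea concrete, here is a toy, illustrative sketch (not the model's actual implementation): parallel video and audio chunk streams are merged into one sequential stream, grouped by fixed periodic time slices. The slice length and the video-before-audio ordering inside a slice are assumptions made purely for illustration.

```python
# Toy illustration of time-division multiplexing (TDM) for omni-modal streams.
# NOT the model's actual implementation: slice length and intra-slice ordering
# are assumptions chosen only to illustrate the idea.
from dataclasses import dataclass

@dataclass
class Chunk:
    modality: str   # "video" or "audio"
    t: float        # timestamp in seconds

def tdm_interleave(video, audio, slice_len=1.0):
    """Merge parallel streams into one sequential stream, grouped by periodic time slices."""
    chunks = sorted(video + audio, key=lambda c: c.t)
    if not chunks:
        return []
    horizon = max(c.t for c in chunks)
    sequence, t0 = [], 0.0
    while t0 <= horizon:
        in_slice = [c for c in chunks if t0 <= c.t < t0 + slice_len]
        # Within a slice, place video chunks before audio chunks (arbitrary choice).
        sequence.extend(sorted(in_slice, key=lambda c: (c.modality != "video", c.t)))
        t0 += slice_len
    return sequence

# Example: 2 fps video frames and 0.4 s audio chunks over ~2 seconds.
video = [Chunk("video", k / 2) for k in range(4)]
audio = [Chunk("audio", 0.4 * k) for k in range(5)]
for c in tdm_interleave(video, audio):
    print(c.modality, c.t)
```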

<div align="center">
<img src="https://github.com/yiranyyu/MiniCPM-V-private/blob/main/assets/minicpm-o-26-framework.png" width=80%>
</div>

### Evaluation  <!-- omit in toc -->

<div align="center">
    <img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/radar.png" width=66% />
</div>

<details>
<summary>Click to view visual understanding results.</summary>

**Image Understanding**

<div align="center">
<table style="margin: 0px auto;">
    <thead>
        <tr>
            <th align="left">Model</th>
            <th>Size</th>
            <th>Token Density<sup>+</sup></th>
            <th>OpenCompass</th>
            <th>OCRBench</th>
            <th>MathVista mini</th>
            <th>ChartQA</th>
            <th>MMVet</th>
            <th>MMStar</th>
            <th>MME</th>
            <th>MMB1.1 test</th>
            <th>AI2D</th>
            <th>MMMU val</th>
            <th>HallusionBench</th>
            <th>TextVQA val</th>
            <th>DocVQA test</th>
            <th>MathVerse mini</th>
            <th>MathVision</th>
            <th>MMHal Score</th>
        </tr>
    </thead>
    <tbody align="center">
        <tr>
            <td colspan="19" align="left"><strong>Proprietary</strong></td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">GPT-4o-20240513</td>
            <td>-</td>
            <td>1088</td>
            <td><u>69.9</u></td>
            <td>736</td>
            <td>61.3</td>
            <td>85.7</td>
            <td><strong>69.1</strong></td>
            <td>63.9</td>
            <td>2328.7</td>
            <td>82.2</td>
            <td>84.6</td>
            <td><strong>69.2</strong></td>
            <td><strong>55.0</strong></td>
            <td>-</td>
            <td>92.8</td>
            <td><strong>50.2</strong></td>
            <td><strong>30.4</strong></td>
            <td><u>3.6</u></td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">Claude3.5-Sonnet</td>
            <td>-</td>
            <td>750</td>
            <td>67.9</td>
            <td>788</td>
            <td>61.6</td>
            <td><strong>90.8</strong></td>
            <td>66.0</td>
            <td>62.2</td>
            <td>1920.0</td>
            <td>78.5</td>
            <td>80.2</td>
            <td><u>65.9</u></td>
            <td>49.9</td>
            <td>-</td>
            <td><strong>95.2</strong></td>
            <td>-</td>
            <td>-</td>
            <td>3.4</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
            <td>-</td>
            <td>-</td>
            <td>64.4</td>
            <td>754</td>
            <td>57.7</td>
            <td>81.3</td>
            <td>64.0</td>
            <td>59.1</td>
            <td>2110.6</td>
            <td>73.9</td>
            <td>79.1</td>
            <td>60.6</td>
            <td>45.6</td>
            <td>73.5</td>
            <td>86.5</td>
            <td>-</td>
            <td>19.2</td>
            <td>-</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">GPT-4o-mini-20240718</td>
            <td>-</td>
            <td>1088</td>
            <td>64.1</td>
            <td>785</td>
            <td>52.4</td>
            <td>-</td>
            <td>66.9</td>
            <td>54.8</td>
            <td>2003.4</td>
            <td>76.0</td>
            <td>77.8</td>
            <td>60.0</td>
            <td>46.1</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
            <td>3.3</td>
        </tr>
        <tr>
            <td colspan="19" align="left"><strong>Open Source</strong></td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">Cambrian-34B</td>
            <td>34B</td>
            <td><u>1820</u></td>
            <td>58.3</td>
            <td>591</td>
            <td>50.3</td>
            <td>75.6</td>
            <td>53.2</td>
            <td>54.2</td>
            <td>2049.9</td>
            <td>77.8</td>
            <td>79.5</td>
            <td>50.4</td>
            <td>41.6</td>
            <td>76.7</td>
            <td>75.5</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">GLM-4V-9B</td>
            <td>13B</td>
            <td>784</td>
            <td>59.1</td>
            <td>776</td>
            <td>51.1</td>
            <td>-</td>
            <td>58.0</td>
            <td>54.8</td>
            <td>2018.8</td>
            <td>67.9</td>
            <td>71.2</td>
            <td>46.9</td>
            <td>45.0</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">Pixtral-12B</td>
            <td>12B</td>
            <td>256</td>
            <td>61.0</td>
            <td>685</td>
            <td>56.9</td>
            <td>81.8</td>
            <td>58.5</td>
            <td>54.5</td>
            <td>-</td>
            <td>72.7</td>
            <td>79.0</td>
            <td>51.1</td>
            <td>47.0</td>
            <td>75.7</td>
            <td>90.7</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">DeepSeek-VL2-27B (4B)</td>
            <td>27B</td>
            <td>672</td>
            <td>66.4</td>
            <td>809</td>
            <td>63.9</td>
            <td>86.0</td>
            <td>60.0</td>
            <td>61.9</td>
            <td>2253.0</td>
            <td>81.2</td>
            <td>83.8</td>
            <td>54.0</td>
            <td>45.3</td>
            <td><u>84.2</u></td>
            <td>93.3</td>
            <td>-</td>
            <td>-</td>
            <td>3.0</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
            <td>8B</td>
            <td>784</td>
            <td>67.1</td>
            <td><u>866</u></td>
            <td>58.2</td>
            <td>83.0</td>
            <td>62.0</td>
            <td>60.7</td>
            <td>2326.0</td>
            <td>81.8</td>
            <td>83.0</td>
            <td>54.1</td>
            <td>50.6</td>
            <td><strong>84.3</strong></td>
            <td><u>94.5</u></td>
            <td>31.9</td>
            <td>16.3</td>
            <td>3.2</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
            <td>72B</td>
            <td>182</td>
            <td>68.1</td>
            <td>741</td>
            <td>67.5</td>
            <td>83.7</td>
            <td>60.6</td>
            <td><strong>65.8</strong></td>
            <td>2261.0</td>
            <td><strong>85.0</strong></td>
            <td><u>85.6</u></td>
            <td>56.8</td>
            <td>49.0</td>
            <td>80.5</td>
            <td>91.3</td>
            <td>39.1</td>
            <td>-</td>
            <td>3.5</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">InternVL-2.5-8B</td>
            <td>8B</td>
            <td>706</td>
            <td>68.3</td>
            <td>822</td>
            <td><u>64.4</u></td>
            <td>84.8</td>
            <td>62.8</td>
            <td>62.8</td>
            <td>2344.0</td>
            <td><u>83.6</u></td>
            <td>84.5</td>
            <td>56.0</td>
            <td>50.1</td>
            <td>79.1</td>
            <td>93.0</td>
            <td>39.5</td>
            <td>19.7</td>
            <td>3.4</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
            <td>8B</td>
            <td><strong>2822</strong></td>
            <td>65.2</td>
            <td>852*</td>
            <td>60.6</td>
            <td>79.4</td>
            <td>60.0</td>
            <td>57.5</td>
            <td><u>2348.4*</u></td>
            <td>78.0</td>
            <td>82.1</td>
            <td>49.8*</td>
            <td>48.1*</td>
            <td>80.1</td>
            <td>90.8</td>
            <td>25.7</td>
            <td>18.3</td>
            <td>3.6</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
            <td>8B</td>
            <td><strong>2822</strong></td>
            <td><strong>70.2</strong></td>
            <td><strong>897*</strong></td>
            <td><strong>71.9*</strong></td>
            <td><u>86.9*</u></td>
            <td><u>67.5</u></td>
            <td><u>64.0</u></td>
            <td><strong>2372.0*</strong></td>
            <td>80.5</td>
            <td><strong>85.8</strong></td>
            <td>50.4*</td>
            <td><u>51.9</u></td>
            <td>82.0</td>
            <td>93.5</td>
            <td><u>41.4*</u></td>
            <td><u>23.1*</u></td>
            <td><strong>3.8</strong></td>
        </tr>
    </tbody>
</table>
</div>
* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set.


<sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.

Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
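As a worked example of this definition, using only the numbers quoted in this card (a 1344x1344 maximum image size and 640 visual tokens), the token density reported for MiniCPM-o 2.6 follows directly:

```python
# Token density = pixels at maximum resolution / number of visual tokens.
max_w, max_h = 1344, 1344        # ~1.8M-pixel maximum image size quoted above
num_visual_tokens = 640          # visual tokens produced for such an image
token_density = (max_w * max_h) / num_visual_tokens
print(round(token_density))      # 2822, matching the Token Density column above
```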


**Multi-image and Video Understanding**

<div align="center">

<table style="margin: 0px auto;">
    <thead>
        <tr>
            <th align="left">Model</th>
            <th>Size</th>
            <th>BLINK-val</th>
            <th>Mantis-Eval</th>
            <th>MIRB</th>
            <th>Video-MME (wo / w subs)</th>
        </tr>
    </thead>
    <tbody align="center">
        <tr>
            <td colspan="6" align="left"><strong>Proprietary</strong></td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">GPT-4o-20240513</td>
            <td>-</td>
            <td><strong>68</strong></td>
            <td>-</td>
            <td>-</td>
            <td><strong>71.9/77.2</strong></td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">GPT4V</td>
            <td>-</td>
            <td>54.6</td>
            <td>62.7</td>
            <td>53.1</td>
            <td>59.9/63.3</td>
        </tr>
        <tr>
            <td colspan="6" align="left"><strong>Open-source</strong></td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave 14B</td>
            <td>14B</td>
            <td>52.6</td>
            <td>66.4</td>
            <td>30.2</td>
            <td>-</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">LLaVA-One-Vision-72B</td>
            <td>72B</td>
            <td>55.4</td>
            <td><strong>77.6</strong></td>
            <td>-</td>
            <td><u>66.2/69.5</u></td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">MANTIS 8B</td>
            <td>8B</td>
            <td>49.1</td>
            <td>59.5</td>
            <td>34.8</td>
            <td>-</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
            <td>8B</td>
            <td>53.2</td>
            <td>69.6*</td>
            <td><strong>67.6*</strong></td>
            <td>63.3/69.0</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">InternVL-2.5-8B</td>
            <td>8B</td>
            <td>54.8</td>
            <td>67.7</td>
            <td>52.5</td>
            <td>64.2/66.9</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
            <td>8B</td>
            <td>53</td>
            <td>69.1</td>
            <td>53.8</td>
            <td>60.9/63.6</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
            <td>8B</td>
            <td><u>56.7</u></td>
            <td><u>71.9</u></td>
            <td><u>58.6</u></td>
            <td>63.9/67.9</td>
        </tr>
    </tbody>
</table>

</div>
* We evaluate officially released checkpoints by ourselves.

</details>


<details>
<summary>Click to view audio understanding and speech conversation results.</summary>

**Audio Understanding**

<div align="center">
<table style="margin: 0px auto;">
    <thead>
        <tr>
            <th align="left">Task</th>
            <th>Size</th>
            <th colspan="3">ASR (zh)</th>
            <th colspan="3">ASR (en)</th>
            <th colspan="2">AST</th>
            <th>Emotion</th>
        </tr>
        <tr>
            <th align="left">Metric</th>
            <td></td>
            <th colspan="3">CER↓</th>
            <th colspan="3">WER↓</th>
            <th colspan="2">BLEU↑</th>
            <th>ACC↑</th>
        </tr>
        <tr>
            <th align="left">Dataset</th>
            <td></td>
            <th>AISHELL-1</th>
            <th>Fleurs zh</th>
            <th>WenetSpeech test-net</th>
            <th>LibriSpeech test-clean</th>
            <th>GigaSpeech</th>
            <th>TED-LIUM</th>
            <th>CoVoST en2zh</th>
            <th>CoVoST zh2en</th>
            <th>MELD emotion</th>
        </tr>
    </thead>
    <tbody align="center">
        <tr>
            <td colspan="11" align="left"><strong>Proprietary</strong></td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
            <td>-</td>
            <td>7.3*</td>
            <td><u>5.4*</u></td>
            <td>28.9*</td>
            <td>2.6*</td>
            <td>12.9*</td>
            <td>4.8*</td>
            <td>37.1*</td>
            <td>15.7*</td>
            <td>33.2*</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
            <td>-</td>
            <td>4.5*</td>
            <td>5.9*</td>
            <td>14.3*</td>
            <td>2.9*</td>
            <td>10.6*</td>
            <td><strong>3.0*</strong></td>
            <td><u>47.3*</u></td>
            <td>22.6*</td>
            <td>48.4*</td>
        </tr>
        <tr>
            <td colspan="11" align="left"><strong>Open-Source</strong></td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">Qwen2-Audio</td>
            <td>8B</td>
            <td>-</td>
            <td>7.5</td>
            <td>-</td>
            <td><strong>1.6</strong></td>
            <td>-</td>
            <td>-</td>
            <td>45.2</td>
            <td><u>24.4</u></td>
            <td><strong>55.3</strong></td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">Qwen2-Audio-Instruction</td>
            <td>8B</td>
            <td>2.6*</td>
            <td>6.9*</td>
            <td><u>10.3*</u></td>
            <td>3.1*</td>
            <td><u>9.7</u>*</td>
            <td>5.9*</td>
            <td>39.5*</td>
            <td>22.9*</td>
            <td>17.4*</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">GLM-4-Voice-Base</td>
            <td>9B</td>
            <td><u>2.5</u></td>
            <td>-</td>
            <td>-</td>
            <td>2.8</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
        </tr>
        <tr style="background-color: #e6f2ff;">
            <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
            <td>8B</td>
            <td><strong>1.6</strong></td>
            <td><strong>4.4</strong></td>
            <td><strong>6.9</strong></td>
            <td><u>1.7</u></td>
            <td><strong>8.7</strong></td>
            <td><strong>3.0</strong></td>
            <td><strong>48.2</strong></td>
            <td><strong>27.2</strong></td>
            <td><u>52.4</u></td>
        </tr>
    </tbody>
</table>
</div>
* We evaluate officially released checkpoints by ourselves.<br><br>

**Speech Generation**

<div align="center">
<table style="margin: 0px auto;">
    <thead>
        <tr>
            <th align="left">Task</th>
            <th>Size</th>
            <th colspan="9">SpeechQA</th>
        </tr>
        <tr>
            <th align="left">Metric</th>
            <th></th>
| 627 | 
            +
                        <th colspan="3">ACC↑</th>
         | 
| 628 | 
            +
                        <th>G-Eval (10 point)↑</th>
         | 
| 629 | 
            +
                        <th>Semantic ELO score↑</th>
         | 
| 630 | 
            +
                        <th>Acoustic ELO score↑</th>
         | 
| 631 | 
            +
                        <th>Overall ELO score↑</th>
         | 
| 632 | 
            +
                        <th>UTMOS↑</th>
         | 
| 633 | 
            +
                        <th>ASR-WER↓</th>
         | 
| 634 | 
            +
                    </tr>
         | 
| 635 | 
            +
                    <tr>
         | 
| 636 | 
            +
                        <th align="left">Dataset</th>
         | 
| 637 | 
            +
                        <th></th>
         | 
| 638 | 
            +
                        <th>Speech Llama Q.</th>
         | 
| 639 | 
            +
                        <th>Speech Web Q.</th>
         | 
| 640 | 
            +
                        <th>Speech Trivia QA</th>
         | 
| 641 | 
            +
                        <th>Speech AlpacaEval</th>
         | 
| 642 | 
            +
                        <th colspan="5">AudioArena</th>
         | 
| 643 | 
            +
                    </tr>
         | 
| 644 | 
            +
                </thead>
         | 
| 645 | 
            +
                <tbody align="center">
         | 
| 646 | 
            +
                    <tr>
         | 
| 647 | 
            +
                        <td colspan="11" align="left"><strong>Proprietary</strong></td>
         | 
| 648 | 
            +
                    </tr>
         | 
| 649 | 
            +
                    <tr>
         | 
| 650 | 
            +
                        <td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
         | 
| 651 | 
            +
                        <td></td>
         | 
| 652 | 
            +
                        <td><strong>71.7</strong></td>
         | 
| 653 | 
            +
                        <td><strong>51.6</strong></td>
         | 
| 654 | 
            +
                        <td><strong>69.7</strong></td>
         | 
| 655 | 
            +
                        <td><strong>7.4</strong></td>
         | 
| 656 | 
            +
                        <td><strong>1157</strong></td>
         | 
| 657 | 
            +
                        <td><strong>1203</strong></td>
         | 
| 658 | 
            +
                        <td><strong>1200</strong></td>
         | 
| 659 | 
            +
                        <td><strong>4.2</strong></td>
         | 
| 660 | 
            +
                        <td><strong>2.3</strong></td>
         | 
| 661 | 
            +
                    </tr>
         | 
| 662 | 
            +
                    <tr>
         | 
| 663 | 
            +
                        <td colspan="11" align="left"><strong>Open-Source</strong></td>
         | 
| 664 | 
            +
                    </tr>
         | 
| 665 | 
            +
                    <tr>
         | 
| 666 | 
            +
                        <td nowrap="nowrap" align="left">GLM-4-Voice</td>
         | 
| 667 | 
            +
                        <td>9B</td>
         | 
| 668 | 
            +
                        <td>50.0</td>
         | 
| 669 | 
            +
                        <td>32.0</td>
         | 
| 670 | 
            +
                        <td>36.4</td>
         | 
| 671 | 
            +
                        <td><u>5.1</u></td>
         | 
| 672 | 
            +
                        <td>999</td>
         | 
| 673 | 
            +
                        <td>1147</td>
         | 
| 674 | 
            +
                        <td>1035</td>
         | 
| 675 | 
            +
                        <td><u>4.1</u></td>
         | 
| 676 | 
            +
                        <td><u>11.7</u></td>
         | 
| 677 | 
            +
                    </tr>
         | 
| 678 | 
            +
                    <tr>
         | 
| 679 | 
            +
                        <td nowrap="nowrap" align="left">Llama-Omni</td>
         | 
| 680 | 
            +
                        <td>8B</td>
         | 
| 681 | 
            +
                        <td>45.3</td>
         | 
| 682 | 
            +
                        <td>22.9</td>
         | 
| 683 | 
            +
                        <td>10.7</td>
         | 
| 684 | 
            +
                        <td>3.9</td>
         | 
| 685 | 
            +
                        <td>960</td>
         | 
| 686 | 
            +
                        <td>878</td>
         | 
| 687 | 
            +
                        <td>897</td>
         | 
| 688 | 
            +
                        <td>3.2</td>
         | 
| 689 | 
            +
                        <td>24.3</td>
         | 
| 690 | 
            +
                    </tr>
         | 
| 691 | 
            +
                    <tr>
         | 
| 692 | 
            +
                        <td nowrap="nowrap" align="left">Moshi</td>
         | 
| 693 | 
            +
                        <td>7B</td>
         | 
| 694 | 
            +
                        <td>43.7</td>
         | 
| 695 | 
            +
                        <td>23.8</td>
         | 
| 696 | 
            +
                        <td>16.7</td>
         | 
| 697 | 
            +
                        <td>2.4</td>
         | 
| 698 | 
            +
                        <td>871</td>
         | 
| 699 | 
            +
                        <td>808</td>
         | 
| 700 | 
            +
                        <td>875</td>
         | 
| 701 | 
            +
                        <td>2.8</td>
         | 
| 702 | 
            +
                        <td>8.2</td>
         | 
| 703 | 
            +
                    </tr>
         | 
| 704 | 
            +
                    <tr>
         | 
| 705 | 
            +
                        <td nowrap="nowrap" align="left">Mini-Omni</td>
         | 
| 706 | 
            +
                        <td>1B</td>
         | 
| 707 | 
            +
                        <td>22.0</td>
         | 
| 708 | 
            +
                        <td>12.8</td>
         | 
| 709 | 
            +
                        <td>6.9</td>
         | 
| 710 | 
            +
                        <td>2.5</td>
         | 
| 711 | 
            +
                        <td>926</td>
         | 
| 712 | 
            +
                        <td>803</td>
         | 
| 713 | 
            +
                        <td>865</td>
         | 
| 714 | 
            +
                        <td>3.4</td>
         | 
| 715 | 
            +
                        <td>10.0</td>
         | 
| 716 | 
            +
                    </tr>
         | 
| 717 | 
            +
                    <tr style="background-color: #e6f2ff;">
         | 
| 718 | 
            +
                        <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
         | 
| 719 | 
            +
                        <td>8B</td>
         | 
| 720 | 
            +
                        <td><u>61.0</u></td>
         | 
| 721 | 
            +
                        <td><u>40.0</u></td>
         | 
| 722 | 
            +
                        <td><u>40.2</u></td>
         | 
| 723 | 
            +
                        <td><u>5.1</u></td>
         | 
| 724 | 
            +
                        <td><u>1088</u></td>
         | 
| 725 | 
            +
                        <td><u>1163</u></td>
         | 
| 726 | 
            +
                        <td><u>1131</u></td>
         | 
| 727 | 
            +
                        <td><strong>4.2</strong></td>
         | 
| 728 | 
            +
                        <td>9.8</td>
         | 
| 729 | 
            +
                    </tr>
         | 
| 730 | 
            +
                </tbody>
         | 
| 731 | 
            +
            </table>
         | 
| 732 | 
            +
            </div>
         | 
| 733 | 
            +
All results are from AudioEvals; evaluation methods and further details are available in <a href="https://github.com/OpenBMB/UltraEval-Audio" target="_blank">AudioEvals</a>.<br><br>
         | 
| 734 | 
            +
             | 
| 735 | 
            +
            **Voice Cloning**
         | 
| 736 | 
            +
             | 
| 737 | 
            +
            <div align="center">
         | 
| 738 | 
            +
            <table style="margin: 0px auto;">
         | 
| 739 | 
            +
                <thead>
         | 
| 740 | 
            +
                    <tr>
         | 
| 741 | 
            +
                        <th align="left">Task</th>
         | 
| 742 | 
            +
                        <th colspan="2">Voice cloning</th>
         | 
| 743 | 
            +
                    </tr>
         | 
| 744 | 
            +
                    <tr>
         | 
| 745 | 
            +
                        <th align="left">Metric</th>
         | 
| 746 | 
            +
                        <th>SIMO↑</th>
         | 
| 747 | 
            +
                        <th>SIMO↑</th>
         | 
| 748 | 
            +
                    </tr>
         | 
| 749 | 
            +
                    <tr>
         | 
| 750 | 
            +
                        <th align="left">Dataset</th>
         | 
| 751 | 
            +
                        <th>Seed-TTS test-zh</th>
         | 
| 752 | 
            +
                        <th>Seed-TTS test-en</th>
         | 
| 753 | 
            +
                    </tr>
         | 
| 754 | 
            +
                </thead>
         | 
| 755 | 
            +
                <tbody align="center">
         | 
| 756 | 
            +
                    <tr>
         | 
| 757 | 
            +
                        <td nowrap="nowrap" align="left">F5-TTS</td>
         | 
| 758 | 
            +
                        <td><strong>76</strong></td>
         | 
| 759 | 
            +
                        <td><strong>67</strong></td>
         | 
| 760 | 
            +
                    </tr>
         | 
| 761 | 
            +
                    <tr>
         | 
| 762 | 
            +
                        <td nowrap="nowrap" align="left">CosyVoice</td>
         | 
| 763 | 
            +
                        <td><u>75</u></td>
         | 
| 764 | 
            +
                        <td><u>64</u></td>
         | 
| 765 | 
            +
                    </tr>
         | 
| 766 | 
            +
                    <tr>
         | 
| 767 | 
            +
                        <td nowrap="nowrap" align="left">FireRedTTS</td>
         | 
| 768 | 
            +
                        <td>63</td>
         | 
| 769 | 
            +
                        <td>46</td>
         | 
| 770 | 
            +
                    </tr>
         | 
| 771 | 
            +
                    <tr style="background-color: #e6f2ff;">
         | 
| 772 | 
            +
                        <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
         | 
| 773 | 
            +
                        <td>57</td>
         | 
| 774 | 
            +
                        <td>47</td>
         | 
| 775 | 
            +
                    </tr>
         | 
| 776 | 
            +
                </tbody>
         | 
| 777 | 
            +
            </table>
         | 
| 778 | 
            +
            </div>
         | 
| 779 | 
            +
Note: the Mimick task takes audio input and outputs both an ASR transcription and a voice imitation (TTS).
         | 
| 780 | 
            +
             | 
| 781 | 
            +
            </details>
         | 
| 782 | 
            +
             | 
| 783 | 
            +
            <details>
         | 
| 784 | 
            +
            <summary>Click to view multimodal live streaming results.</summary>
         | 
| 785 | 
            +
              
         | 
| 786 | 
            +
            **Multimodal Live Streaming**: results on StreamingBench
         | 
| 787 | 
            +
             | 
| 788 | 
            +
            <table style="margin: 0px auto;">
         | 
| 789 | 
            +
                <thead>
         | 
| 790 | 
            +
                    <tr>
         | 
| 791 | 
            +
                        <th align="left">Model</th>
         | 
| 792 | 
            +
                        <th>Size</th>
         | 
| 793 | 
            +
                        <th>Real-Time Video Understanding</th>
         | 
| 794 | 
            +
                        <th>Omni-Source Understanding</th>
         | 
| 795 | 
            +
                        <th>Contextual Understanding</th>
         | 
| 796 | 
            +
                        <th>Overall</th>
         | 
| 797 | 
            +
                    </tr>
         | 
| 798 | 
            +
                </thead>
         | 
| 799 | 
            +
                <tbody align="center">
         | 
| 800 | 
            +
                    <tr>
         | 
| 801 | 
            +
                        <td colspan="7" align="left"><strong>Proprietary</strong></td>
         | 
| 802 | 
            +
                    </tr>
         | 
| 803 | 
            +
                    <tr>
         | 
| 804 | 
            +
                        <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
         | 
| 805 | 
            +
                        <td>-</td>
         | 
| 806 | 
            +
                        <td><u>77.4</u></td>
         | 
| 807 | 
            +
                        <td><strong>67.8</strong></td>
         | 
| 808 | 
            +
                        <td><strong>51.1</strong></td>
         | 
| 809 | 
            +
                        <td><strong>70.3</strong></td>
         | 
| 810 | 
            +
                    </tr>
         | 
| 811 | 
            +
                    <tr>
         | 
| 812 | 
            +
                        <td nowrap="nowrap" align="left">GPT-4o</td>
         | 
| 813 | 
            +
                        <td>-</td>
         | 
| 814 | 
            +
                        <td>74.5</td>
         | 
| 815 | 
            +
                        <td>51.0</td>
         | 
| 816 | 
            +
                        <td><u>48.0</u></td>
         | 
| 817 | 
            +
                        <td>64.1</td>
         | 
| 818 | 
            +
                    </tr>
         | 
| 819 | 
            +
                    <tr>
         | 
| 820 | 
            +
                        <td nowrap="nowrap" align="left">Claude-3.5-Sonnet</td>
         | 
| 821 | 
            +
                        <td>-</td>
         | 
| 822 | 
            +
                        <td>74.0</td>
         | 
| 823 | 
            +
                        <td>41.4</td>
         | 
| 824 | 
            +
                        <td>37.8</td>
         | 
| 825 | 
            +
                        <td>59.7</td>
         | 
| 826 | 
            +
                    </tr>
         | 
| 827 | 
            +
                    <tr>
         | 
| 828 | 
            +
                        <td colspan="9" align="left"><strong>Open-source</strong></td>
         | 
| 829 | 
            +
                    </tr>
         | 
| 830 | 
            +
                    <tr>
         | 
| 831 | 
            +
                        <td nowrap="nowrap" align="left">VILA-1.5</td>
         | 
| 832 | 
            +
                        <td>8B</td>
         | 
| 833 | 
            +
                        <td>61.5</td>
         | 
| 834 | 
            +
                        <td>37.5</td>
         | 
| 835 | 
            +
                        <td>26.7</td>
         | 
| 836 | 
            +
                        <td>49.5</td>
         | 
| 837 | 
            +
                    </tr>
         | 
| 838 | 
            +
                    <tr>
         | 
| 839 | 
            +
                        <td nowrap="nowrap" align="left">LongVA</td>
         | 
| 840 | 
            +
                        <td>7B</td>
         | 
| 841 | 
            +
                        <td>63.1</td>
         | 
| 842 | 
            +
                        <td>35.9</td>
         | 
| 843 | 
            +
                        <td>30.2</td>
         | 
| 844 | 
            +
                        <td>50.7</td>
         | 
| 845 | 
            +
                    </tr>
         | 
| 846 | 
            +
                    <tr>
         | 
| 847 | 
            +
                        <td nowrap="nowrap" align="left">LLaVA-Next-Video-34B</td>
         | 
| 848 | 
            +
                        <td>34B</td>
         | 
| 849 | 
            +
                        <td>69.8</td>
         | 
| 850 | 
            +
                        <td>41.7</td>
         | 
| 851 | 
            +
                        <td>34.3</td>
         | 
| 852 | 
            +
                        <td>56.7</td>
         | 
| 853 | 
            +
                    </tr>
         | 
| 854 | 
            +
                    <tr>
         | 
| 855 | 
            +
                        <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
         | 
| 856 | 
            +
                        <td>8B</td>
         | 
| 857 | 
            +
                        <td>71.2</td>
         | 
| 858 | 
            +
                        <td>40.7</td>
         | 
| 859 | 
            +
                        <td>33.1</td>
         | 
| 860 | 
            +
                        <td>57.0</td>
         | 
| 861 | 
            +
                    </tr>
         | 
| 862 | 
            +
                    <tr>
         | 
| 863 | 
            +
                        <td nowrap="nowrap" align="left">InternVL2-8B</td>
         | 
| 864 | 
            +
                        <td>8B</td>
         | 
| 865 | 
            +
                        <td>70.1</td>
         | 
| 866 | 
            +
                        <td>42.7</td>
         | 
| 867 | 
            +
                        <td>34.1</td>
         | 
| 868 | 
            +
                        <td>57.0</td>
         | 
| 869 | 
            +
                    </tr>
         | 
| 870 | 
            +
                    <tr>
         | 
| 871 | 
            +
                        <td nowrap="nowrap" align="left">VITA-1.5</td>
         | 
| 872 | 
            +
                        <td>8B</td>
         | 
| 873 | 
            +
                        <td>70.9</td>
         | 
| 874 | 
            +
                        <td>40.8</td>
         | 
| 875 | 
            +
                        <td>35.8</td>
         | 
| 876 | 
            +
                        <td>57.4</td>
         | 
| 877 | 
            +
                    </tr>
         | 
| 878 | 
            +
                    <tr>
         | 
| 879 | 
            +
                        <td nowrap="nowrap" align="left">LLaVA-OneVision-7B</td>
         | 
| 880 | 
            +
                        <td>8B</td>
         | 
| 881 | 
            +
                        <td>74.3</td>
         | 
| 882 | 
            +
                        <td>40.8</td>
         | 
| 883 | 
            +
                        <td>31.0</td>
         | 
| 884 | 
            +
                        <td>58.4</td>
         | 
| 885 | 
            +
                    </tr>
         | 
| 886 | 
            +
                    <tr>
         | 
| 887 | 
            +
                        <td nowrap="nowrap" align="left">InternLM-XC2.5-OL-7B</td>
         | 
| 888 | 
            +
                        <td>8B</td>
         | 
| 889 | 
            +
                        <td>75.4</td>
         | 
| 890 | 
            +
                        <td>46.2</td>
         | 
| 891 | 
            +
                        <td>33.6</td>
         | 
| 892 | 
            +
                        <td>60.8</td>
         | 
| 893 | 
            +
                    </tr>
         | 
| 894 | 
            +
                    <tr>
         | 
| 895 | 
            +
                        <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
         | 
| 896 | 
            +
                        <td>8B</td>
         | 
| 897 | 
            +
                        <td>72.4</td>
         | 
| 898 | 
            +
                        <td>40.2</td>
         | 
| 899 | 
            +
                        <td>33.4</td>
         | 
| 900 | 
            +
                        <td>57.7</td>
         | 
| 901 | 
            +
                    </tr>
         | 
| 902 | 
            +
                    <tr style="background-color: #e6f2ff;">
         | 
| 903 | 
            +
                        <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
         | 
| 904 | 
            +
                        <td>8B</td>
         | 
| 905 | 
            +
                        <td><strong>79.9</strong></td>
         | 
| 906 | 
            +
                        <td><u>53.4</u></td>
         | 
| 907 | 
            +
                        <td>38.5</td>
         | 
| 908 | 
            +
                        <td><u>66.0</u></td>
         | 
| 909 | 
            +
                    </tr>
         | 
| 910 | 
            +
                </tbody>
         | 
| 911 | 
            +
            </table>
         | 
| 912 | 
            +
             | 
| 913 | 
            +
            </details>
         | 
| 914 | 
            +
             | 
| 915 | 
            +
             | 
| 916 | 
            +
            ### Examples <!-- omit in toc -->
         | 
| 917 | 
            +
             | 
| 918 | 
            +
We deploy MiniCPM-o 2.6 on end devices. The demo video is a raw screen recording on an iPad Pro without editing.
         | 
| 919 | 
            +
             | 
| 920 | 
            +
             | 
| 921 | 
            +
            <div style="display: flex; flex-direction: column; align-items: center;">
         | 
| 922 | 
            +
              <img src="https://github.com/yiranyyu/MiniCPM-V-private/blob/main/assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
         | 
| 923 | 
            +
              <img src="https://github.com/yiranyyu/MiniCPM-V-private/blob/main/assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
         | 
| 924 | 
            +
              <img src="https://github.com/yiranyyu/MiniCPM-V-private/blob/main/assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
         | 
| 925 | 
            +
            </div>
         | 
| 926 | 
            +
             | 
| 927 | 
            +
             | 
| 928 | 
            +
             | 
| 929 | 
            +
             | 
| 930 | 
            +
            ## Online Demo
         | 
| 931 | 
            +
Click here to try the online demo of **MiniCPM-o 2.6** on the [CN](https://minicpm-omni-webdemo.modelbest.cn/) and [US](https://minicpm-omni-webdemo-us.modelbest.cn) servers.
         | 
| 932 | 
            +
             | 
| 933 | 
            +
             | 
| 934 | 
            +
            ## Usage
         | 
| 935 | 
            +
Inference using Hugging Face Transformers on NVIDIA GPUs. Requirements are tested on Python 3.10:
         | 
| 936 | 
            +
            ```
         | 
| 937 | 
            +
            Pillow==10.1.0
         | 
| 938 | 
            +
            torch==2.2.0
         | 
| 939 | 
            +
            torchaudio==2.2.0
         | 
| 940 | 
            +
            torchvision==0.17.0
         | 
| 941 | 
            +
            transformers==4.44.2
         | 
| 942 | 
            +
            librosa==0.9.0
         | 
| 943 | 
            +
            soundfile==0.12.1
         | 
| 944 | 
            +
            vector-quantize-pytorch==1.18.5
         | 
| 945 | 
            +
            vocos==0.1.0
         | 
| 946 | 
            +
            decord
         | 
| 947 | 
            +
            moviepy
         | 
| 948 | 
            +
            ```
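Before running the examples, it can help to confirm that the environment matches the pinned versions above and that a CUDA GPU is visible. The snippet below is an optional sanity check, not part of the official setup.

```python
# Optional sanity check: confirm the pinned dependencies import and a CUDA GPU is visible.
import torch
import transformers

print("torch:", torch.__version__)                # expected: 2.2.0
print("transformers:", transformers.__version__)  # expected: 4.44.2
assert torch.cuda.is_available(), "the examples below assume an NVIDIA GPU"
```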
         | 
| 949 | 
            +
             | 
| 950 | 
            +
             | 
| 951 | 
            +
            ### Model initialization
         | 
| 952 | 
            +
            ```python
         | 
| 953 | 
            +
             | 
| 954 | 
            +
            import torch
         | 
| 955 | 
            +
            from PIL import Image
         | 
| 956 | 
            +
            from transformers import AutoModel, AutoTokenizer
         | 
| 957 | 
            +
             | 
| 958 | 
            +
# load the omni model; by default, init_vision/init_audio/init_tts are all True
         | 
| 959 | 
            +
# to load the vision-only model, set init_audio=False and init_tts=False
         | 
| 960 | 
            +
# to load the audio-only model, set init_vision=False
         | 
| 961 | 
            +
            model = AutoModel.from_pretrained(
         | 
| 962 | 
            +
                'openbmb/MiniCPM-o-2_6',
         | 
| 963 | 
            +
                trust_remote_code=True,
         | 
| 964 | 
            +
                attn_implementation='sdpa', # sdpa or flash_attention_2
         | 
| 965 | 
            +
                torch_dtype=torch.bfloat16,
         | 
| 966 | 
            +
                init_vision=True,
         | 
| 967 | 
            +
                init_audio=True,
         | 
| 968 | 
            +
                init_tts=True
         | 
| 969 | 
            +
            )
         | 
| 970 | 
            +
             | 
| 971 | 
            +
             | 
| 972 | 
            +
            model = model.eval().cuda()
         | 
| 973 | 
            +
            tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
         | 
| 974 | 
            +
             | 
| 975 | 
            +
# Unless running in vision-only mode, the TTS processor and the vocos vocoder also need to be initialized
         | 
| 976 | 
            +
            model.init_tts()
         | 
| 977 | 
            +
            model.tts.float()
         | 
| 978 | 
            +
            ```
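If you only need the vision capabilities, the comments above suggest disabling the audio and TTS modules at load time. The sketch below assumes the same `AutoModel.from_pretrained` arguments; in this configuration `init_tts()` and `tts.float()` are not needed.

```python
# A minimal sketch (assumption: same loading arguments as above) for the vision-only variant.
vision_only_model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,
    attn_implementation='sdpa',
    torch_dtype=torch.bfloat16,
    init_vision=True,
    init_audio=False,  # skip the audio encoder
    init_tts=False,    # skip the TTS head
).eval().cuda()
```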
         | 
| 979 | 
            +
            ### Omni mode
         | 
| 980 | 
            +
We provide two inference modes: chat and streaming.
         | 
| 981 | 
            +
             | 
| 982 | 
            +
#### Chat inference
         | 
| 983 | 
            +
            ```python
         | 
| 984 | 
            +
            import math
         | 
| 985 | 
            +
            import numpy as np
         | 
| 986 | 
            +
            from PIL import Image
         | 
| 987 | 
            +
            from moviepy.editor import VideoFileClip
         | 
| 988 | 
            +
            import tempfile
         | 
| 989 | 
            +
            import librosa
         | 
| 990 | 
            +
            import soundfile as sf
         | 
| 991 | 
            +
             | 
| 992 | 
            +
            def get_video_chunk_content(video_path, flatten=True):
         | 
| 993 | 
            +
                video = VideoFileClip(video_path)
         | 
| 994 | 
            +
                print('video_duration:', video.duration)
         | 
| 995 | 
            +
                
         | 
| 996 | 
            +
                with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
         | 
| 997 | 
            +
                    temp_audio_file_path = temp_audio_file.name
         | 
| 998 | 
            +
                    video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
         | 
| 999 | 
            +
                    audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
         | 
| 1000 | 
            +
                num_units = math.ceil(video.duration)
         | 
| 1001 | 
            +
                
         | 
| 1002 | 
            +
                # 1 frame + 1s audio chunk
         | 
| 1003 | 
            +
                contents= []
         | 
| 1004 | 
            +
                for i in range(num_units):
         | 
| 1005 | 
            +
                    frame = video.get_frame(i+1)
         | 
| 1006 | 
            +
                    image = Image.fromarray((frame).astype(np.uint8))
         | 
| 1007 | 
            +
                    audio = audio_np[sr*i:sr*(i+1)]
         | 
| 1008 | 
            +
                    if flatten:
         | 
| 1009 | 
            +
                        contents.extend(["<unit>", image, audio])
         | 
| 1010 | 
            +
                    else:
         | 
| 1011 | 
            +
                        contents.append(["<unit>", image, audio])
         | 
| 1012 | 
            +
                
         | 
| 1013 | 
            +
                return contents
         | 
| 1014 | 
            +
             | 
| 1015 | 
            +
            video_path="/path/to/video"
         | 
| 1016 | 
            +
            sys_msg = model.get_sys_prompt(mode='omni', language='en')
         | 
| 1017 | 
            +
# to use a voice-clone prompt, set ref_audio
         | 
| 1018 | 
            +
            # ref_audio_path = '/path/to/ref_audio'
         | 
| 1019 | 
            +
            # ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
         | 
| 1020 | 
            +
            # sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')
         | 
| 1021 | 
            +
             | 
| 1022 | 
            +
            contents = get_video_chunk_content(video_path)
         | 
| 1023 | 
            +
            msg = {"role":"user", "content": contents}
         | 
| 1024 | 
            +
            msgs = [sys_msg, msg]
         | 
| 1025 | 
            +
             | 
| 1026 | 
            +
# set generate_audio=True and output_audio_path to save the TTS result
         | 
| 1027 | 
            +
            generate_audio = True
         | 
| 1028 | 
            +
            output_audio_path = 'output.wav'
         | 
| 1029 | 
            +
             | 
| 1030 | 
            +
            res = model.chat(
         | 
| 1031 | 
            +
                msgs=msgs,
         | 
| 1032 | 
            +
                tokenizer=tokenizer,
         | 
| 1033 | 
            +
                sampling=True,
         | 
| 1034 | 
            +
                temperature=0.5,
         | 
| 1035 | 
            +
                max_new_tokens=4096,
         | 
| 1036 | 
            +
    omni_input=True, # set omni_input=True for omni inference
         | 
| 1037 | 
            +
                use_tts_template=True,
         | 
| 1038 | 
            +
                generate_audio=generate_audio,
         | 
| 1039 | 
            +
                output_audio_path=output_audio_path,
         | 
| 1040 | 
            +
                max_slice_nums=1,
         | 
| 1041 | 
            +
                use_image_id=False,
         | 
| 1042 | 
            +
                return_dict=True
         | 
| 1043 | 
            +
            )
         | 
| 1044 | 
            +
            print(res)
         | 
| 1045 | 
            +
            ```
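To ask a specific question about the clip rather than relying on the system prompt alone, a text prompt can be appended to the same interleaved contents list. This is a sketch that reuses the chat arguments above; the question string is a placeholder.

```python
# Sketch: append a text question after the interleaved <unit>/frame/audio chunks.
contents = get_video_chunk_content(video_path)
contents.append("Please describe what happens in this clip.")  # placeholder question
msgs = [sys_msg, {"role": "user", "content": contents}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.5,
    max_new_tokens=4096,
    omni_input=True,
    use_tts_template=True,
    generate_audio=True,
    output_audio_path='output_question.wav',
    max_slice_nums=1,
    use_image_id=False,
    return_dict=True
)
print(res)
```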
         | 
| 1046 | 
            +
#### Streaming inference
         | 
| 1047 | 
            +
            ```python
         | 
| 1048 | 
            +
# a new conversation needs to reset the session first; this clears the KV cache
         | 
| 1049 | 
            +
            model.reset_session()
         | 
| 1050 | 
            +
             | 
| 1051 | 
            +
            contents = get_video_chunk_content(video_path, flatten=False)
         | 
| 1052 | 
            +
            session_id = '123'
         | 
| 1053 | 
            +
            generate_audio = True
         | 
| 1054 | 
            +
             | 
| 1055 | 
            +
            # 1. prefill system prompt
         | 
| 1056 | 
            +
            res = model.streaming_prefill(
         | 
| 1057 | 
            +
                session_id=session_id,
         | 
| 1058 | 
            +
                msgs=[sys_msg], 
         | 
| 1059 | 
            +
                tokenizer=tokenizer
         | 
| 1060 | 
            +
            )
         | 
| 1061 | 
            +
             | 
| 1062 | 
            +
            # 2. prefill video/audio chunks
         | 
| 1063 | 
            +
            for content in contents:
         | 
| 1064 | 
            +
                msgs = [{"role":"user", "content": content}]
         | 
| 1065 | 
            +
                res = model.streaming_prefill(
         | 
| 1066 | 
            +
                    session_id=session_id,
         | 
| 1067 | 
            +
                    msgs=msgs, 
         | 
| 1068 | 
            +
                    tokenizer=tokenizer
         | 
| 1069 | 
            +
                )
         | 
| 1070 | 
            +
             | 
| 1071 | 
            +
            # 3. generate
         | 
| 1072 | 
            +
            res = model.streaming_generate(
         | 
| 1073 | 
            +
                session_id=session_id,
         | 
| 1074 | 
            +
                tokenizer=tokenizer,
         | 
| 1075 | 
            +
                temperature=0.5,
         | 
| 1076 | 
            +
                generate_audio=generate_audio
         | 
| 1077 | 
            +
            )
         | 
| 1078 | 
            +
             | 
| 1079 | 
            +
            audios = []
         | 
| 1080 | 
            +
            text = ""
         | 
| 1081 | 
            +
             | 
| 1082 | 
            +
            if generate_audio:
         | 
| 1083 | 
            +
                for r in res:
         | 
| 1084 | 
            +
                    audio_wav = r.audio_wav
         | 
| 1085 | 
            +
                    sampling_rate = r.sampling_rate
         | 
| 1086 | 
            +
                    txt = r.text
         | 
| 1087 | 
            +
             | 
| 1088 | 
            +
                    audios.append(audio_wav)
         | 
| 1089 | 
            +
                    text += txt
         | 
| 1090 | 
            +
                    
         | 
| 1091 | 
            +
                res = np.concatenate(audios)
         | 
| 1092 | 
            +
                sf.write("output.wav", res, samplerate=sampling_rate)
         | 
| 1093 | 
            +
                print("text:", text)
         | 
| 1094 | 
            +
                print("audio saved to output.wav")
         | 
| 1095 | 
            +
            else:
         | 
| 1096 | 
            +
                for r in res:
         | 
| 1097 | 
            +
                    text += r['text']
         | 
| 1098 | 
            +
                print("text:", text)
         | 
| 1099 | 
            +
             | 
| 1100 | 
            +
            ```
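Instead of collecting every chunk and concatenating at the end, the generated audio can also be written to disk incrementally. The sketch below assumes the same per-chunk fields (`audio_wav`, `sampling_rate`, `text`) yielded by `streaming_generate` above; a new generation turn would normally follow further `streaming_prefill` calls.

```python
# Sketch: write streamed audio chunk by chunk instead of concatenating in memory.
import soundfile as sf

res = model.streaming_generate(
    session_id=session_id,
    tokenizer=tokenizer,
    temperature=0.5,
    generate_audio=True
)

writer = None
text = ""
for r in res:
    if writer is None:
        # open the output file once the sampling rate is known from the first chunk
        writer = sf.SoundFile("output.wav", mode="w", samplerate=r.sampling_rate, channels=1)
    writer.write(r.audio_wav)
    text += r.text
if writer is not None:
    writer.close()
print("text:", text)
```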
         | 
| 1101 | 
            +
             | 
| 1102 | 
            +
            ### Audio-Only mode
         | 
| 1103 | 
            +
            #### Mimick
         | 
| 1104 | 
            +
            ```python
         | 
| 1105 | 
            +
            mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
         | 
| 1106 | 
            +
            audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
         | 
| 1107 | 
            +
msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]
         | 
| 1108 | 
            +
             | 
| 1109 | 
            +
            res = model.chat(
         | 
| 1110 | 
            +
                msgs=msgs,
         | 
| 1111 | 
            +
                tokenizer=tokenizer,
         | 
| 1112 | 
            +
                sampling=True,
         | 
| 1113 | 
            +
                max_new_tokens=128,
         | 
| 1114 | 
            +
                use_tts_template=True,
         | 
| 1115 | 
            +
                temperature=0.3,
         | 
| 1116 | 
            +
                generate_audio=True,
         | 
| 1117 | 
            +
                output_audio_path='output.wav', # save the tts result to output_audio_path
         | 
| 1118 | 
            +
            )
         | 
| 1119 | 
            +
            ```
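The same call pattern extends to batches of clips. The sketch below loops the Mimick task over several input files; the file names are placeholders.

```python
# Sketch: run the Mimick task over several clips, saving one output file per input.
import librosa

for i, wav_path in enumerate(['clip_a.wav', 'clip_b.wav']):  # placeholder inputs
    audio_input, _ = librosa.load(wav_path, sr=16000, mono=True)
    msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]
    res = model.chat(
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=True,
        max_new_tokens=128,
        use_tts_template=True,
        temperature=0.3,
        generate_audio=True,
        output_audio_path=f'mimick_{i}.wav',
    )
    print(wav_path, '->', res)
```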
         | 
| 1120 | 
            +
             | 
| 1121 | 
            +
            #### General Speech Conversation with Configurable Voices
         | 
| 1122 | 
            +
            <details> <summary>Click to view the Python code for enabling MiniCPM-o 2.6 to interact with you in a specified voice.</summary>
         | 
| 1123 | 
            +
             | 
| 1124 | 
            +
            ```python
         | 
| 1125 | 
            +
            ref_audio, _ = librosa.load('./assert/voice_01.wav', sr=16000, mono=True) # load the reference audio
         | 
| 1126 | 
            +
             | 
| 1127 | 
            +
# Audio RolePlay: in this mode, the model role-plays a character based on the audio prompt.
         | 
| 1128 | 
            +
            sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')
         | 
| 1129 | 
            +
            user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
         | 
| 1130 | 
            +
             | 
| 1131 | 
            +
# Audio Assistant: in this mode, the model speaks with the voice in ref_audio as an AI assistant.
         | 
| 1132 | 
            +
            # sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en') 
         | 
| 1133 | 
            +
            # user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # Try to ask something!
         | 
| 1134 | 
            +
            ```
         | 
| 1135 | 
            +
            ```python
         | 
| 1136 | 
            +
            msgs = [sys_prompt, user_question]
         | 
| 1137 | 
            +
            res = model.chat(
         | 
| 1138 | 
            +
                image=None,
         | 
| 1139 | 
            +
                msgs=msgs,
         | 
| 1140 | 
            +
                context=None,
         | 
| 1141 | 
            +
                tokenizer=tokenizer,
         | 
| 1142 | 
            +
                sampling=True,
         | 
| 1143 | 
            +
                max_new_tokens=128,
         | 
| 1144 | 
            +
                stream=False,
         | 
| 1145 | 
            +
                stream_input=True,
         | 
| 1146 | 
            +
                use_tts_template=True,
         | 
| 1147 | 
            +
                generate_audio=True,
         | 
| 1148 | 
            +
                temperature=0.3,
         | 
| 1149 | 
            +
                output_audio_path='result.wav',
         | 
| 1150 | 
            +
            )
         | 
| 1151 | 
            +
             | 
| 1152 | 
            +
            # round two
         | 
| 1153 | 
            +
msgs.append({'role': 'assistant', 'content': res})  # list.append returns None, so keep extending msgs in place
         | 
| 1154 | 
            +
            user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
         | 
| 1155 | 
            +
msgs.append(user_question)
         | 
| 1156 | 
            +
            res = model.chat(
         | 
| 1157 | 
            +
                image=None,
         | 
| 1158 | 
            +
                msgs=msgs,
         | 
| 1159 | 
            +
                context=None,
         | 
| 1160 | 
            +
                tokenizer=tokenizer,
         | 
| 1161 | 
            +
                sampling=True,
         | 
| 1162 | 
            +
                max_new_tokens=128,
         | 
| 1163 | 
            +
                stream=False,
         | 
| 1164 | 
            +
                stream_input=True,
         | 
| 1165 | 
            +
                use_tts_template=True,
         | 
| 1166 | 
            +
                generate_audio=True,
         | 
| 1167 | 
            +
                temperature=0.3,
         | 
| 1168 | 
            +
                output_audio_path='result_round_2.wav',
         | 
| 1169 | 
            +
            )
         | 
| 1170 | 
            +
            print(res)
         | 
| 1171 | 
            +
            ```
         | 
| 1172 | 
            +
             | 
| 1173 | 
            +
            </details>
         | 
| 1174 | 
            +
             | 
| 1175 | 
            +
            #### Addressing various audio tasks
         | 
| 1176 | 
            +
            <details>
         | 
| 1177 | 
            +
<summary> Click to show Python code running MiniCPM-o 2.6 on a specific audio QA task. </summary>
         | 
| 1178 | 
            +
             | 
| 1179 | 
            +
            ```python
         | 
| 1180 | 
            +
            '''
         | 
| 1181 | 
            +
            Audio Understanding Task Prompt:
         | 
| 1182 | 
            +
            Speech:
         | 
| 1183 | 
            +
                ASR with ZH(same as AST en2zh): 请仔细听这段音频片段,并将其内容逐字记录。
         | 
| 1184 | 
            +
                ASR with EN(same as AST zh2en): Please listen to the audio snippet carefully and transcribe the content.
         | 
| 1185 | 
            +
                Speaker Analysis: Based on the speaker's content, speculate on their gender, condition, age range, and health status.
         | 
| 1186 | 
            +
            General Audio:
         | 
| 1187 | 
            +
                Audio Caption: Summarize the main content of the audio.
         | 
| 1188 | 
            +
                Sound Scene Tagging: Utilize one keyword to convey the audio's content or the associated scene.
         | 
| 1189 | 
            +
            '''
         | 
| 1190 | 
            +
            task_prompt = "\n"
         | 
| 1191 | 
            +
            audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
         | 
| 1192 | 
            +
             | 
| 1193 | 
            +
            msgs = [{'role': 'user', 'content': [task_prompt,audio_input]}]
         | 
| 1194 | 
            +
             | 
| 1195 | 
            +
            res = model.chat(
         | 
| 1196 | 
            +
                image=None,
         | 
| 1197 | 
            +
                msgs=msgs,
         | 
| 1198 | 
            +
                context=None,
         | 
| 1199 | 
            +
                tokenizer=tokenizer,
         | 
| 1200 | 
            +
                sampling=True,
         | 
| 1201 | 
            +
                max_new_tokens=128,
         | 
| 1202 | 
            +
                stream=False,
         | 
| 1203 | 
            +
                stream_input=True,
         | 
| 1204 | 
            +
                use_tts_template=True,
         | 
| 1205 | 
            +
                generate_audio=True,
         | 
| 1206 | 
            +
                temperature=0.3,
         | 
| 1207 | 
            +
                output_audio_path='result.wav',
         | 
| 1208 | 
            +
            )
         | 
| 1209 | 
            +
            print(res)
         | 
| 1210 | 
            +
            ```
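For example, plugging the English ASR prompt listed above into the same call (a sketch; `xxx.wav` is a placeholder input file):

```python
# Sketch: English ASR using the prompt listed in the docstring above.
import librosa

task_prompt = "Please listen to the audio snippet carefully and transcribe the content." + "\n"
audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='asr_result.wav',
)
print(res)
```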
         | 
| 1211 | 
            +
            ```python
         | 
| 1212 | 
            +
            '''
         | 
| 1213 | 
            +
            Speech Generation Task Prompt:
         | 
| 1214 | 
            +
                Human Instruction-to-Speech: see https://voxinstruct.github.io/VoxInstruct/
         | 
| 1215 | 
            +
                Example:
         | 
| 1216 | 
            +
                    # 在新闻中,一个年轻男性兴致勃勃地说:“祝福亲爱的祖国母亲美丽富强!”他用低音调和低音量,慢慢地说出了这句话。
         | 
| 1217 | 
            +
                    # Delighting in a surprised tone, an adult male with low pitch and low volume comments:"One even gave my little dog a biscuit" This dialogue takes place at a leisurely pace, delivering a sense of excitement and surprise in the context. 
         | 
| 1218 | 
            +
             | 
| 1219 | 
            +
    Voice Cloning or Voice Creation: in this mode, the model acts like a TTS model.
         | 
| 1220 | 
            +
            '''
         | 
| 1221 | 
            +
            # Human Instruction-to-Speech:
         | 
| 1222 | 
            +
task_prompt = ''  # write your own Human Instruction-to-Speech prompt here (see the examples above)
         | 
| 1223 | 
            +
            msgs = [{'role': 'user', 'content': [task_prompt]}] # you can try to use the same audio question
         | 
| 1224 | 
            +
             | 
| 1225 | 
            +
# Voice Cloning mode: in this mode, the model acts like a TTS model.
         | 
| 1226 | 
            +
            # sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
         | 
| 1227 | 
            +
            # text_prompt = f"Please read the text below."
         | 
| 1228 | 
            +
            # user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]} # using same voice in sys_prompt to read the text. (Voice Cloning)
         | 
| 1229 | 
            +
            # user_question = {'role': 'user', 'content': [text_prompt, librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # using same voice in sys_prompt to read 'xxx.wav'. (Voice Creation)
         | 
| 1230 | 
            +
             | 
| 1231 | 
            +
# msgs = [sys_prompt, user_question]  # for Voice Cloning / Voice Creation, uncomment this line and the sys_prompt/user_question lines above
         | 
| 1232 | 
            +
            res = model.chat(
         | 
| 1233 | 
            +
                image=None,
         | 
| 1234 | 
            +
                msgs=msgs,
         | 
| 1235 | 
            +
                context=None,
         | 
| 1236 | 
            +
                tokenizer=tokenizer,
         | 
| 1237 | 
            +
                sampling=True,
         | 
| 1238 | 
            +
                max_new_tokens=128,
         | 
| 1239 | 
            +
                stream=False,
         | 
| 1240 | 
            +
                stream_input=True,
         | 
| 1241 | 
            +
                use_tts_template=True,
         | 
| 1242 | 
            +
                generate_audio=True,
         | 
| 1243 | 
            +
                temperature=0.3,
         | 
| 1244 | 
            +
                output_audio_path='result.wav',
         | 
| 1245 | 
            +
            )
         | 
| 1246 | 
            +
             | 
| 1247 | 
            +
             | 
| 1248 | 
            +
            ```
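For convenience, here is the voice-cloning variant written out in full. It is a sketch that consolidates the commented-out lines above; the reference clip path and the text to read are placeholders.

```python
# Sketch: voice cloning as TTS, consolidating the commented-out lines above.
import librosa

ref_audio, _ = librosa.load('ref_voice.wav', sr=16000, mono=True)  # placeholder reference clip
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
text_prompt = "Please read the text below."
user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]}

res = model.chat(
    msgs=[sys_prompt, user_question],
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='cloned_voice.wav',
)
```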
         | 
| 1249 | 
            +
             | 
| 1250 | 
            +
            </details>
         | 
| 1251 | 
            +
             | 
| 1252 | 
            +
            ### Vision-Only mode
         | 
| 1253 | 
            +
             | 
| 1254 | 
            +
`MiniCPM-o-2_6` has the same inference methods as `MiniCPM-V-2_6`.
         | 
| 1255 | 
            +
             | 
| 1256 | 
            +
#### Chat with a single image
         | 
| 1257 | 
            +
            ```python
         | 
| 1258 | 
            +
            # test.py
         | 
| 1259 | 
            +
            image = Image.open('xx.jpg').convert('RGB')
         | 
| 1260 | 
            +
            question = 'What is in the image?'
         | 
| 1261 | 
            +
            msgs = [{'role': 'user', 'content': [image, question]}]
         | 
| 1262 | 
            +
            res = model.chat(
         | 
| 1263 | 
            +
                image=None,
         | 
| 1264 | 
            +
                msgs=msgs,
         | 
| 1265 | 
            +
                tokenizer=tokenizer
         | 
| 1266 | 
            +
            )
         | 
| 1267 | 
            +
            print(res)
         | 
| 1268 | 
            +
             | 
| 1269 | 
            +
## to use streaming output, make sure sampling=True and stream=True
         | 
| 1270 | 
            +
## model.chat will then return a generator
         | 
| 1271 | 
            +
            res = model.chat(
         | 
| 1272 | 
            +
                msgs=msgs,
         | 
| 1273 | 
            +
                tokenizer=tokenizer,
         | 
| 1274 | 
            +
                sampling=True,
         | 
| 1275 | 
            +
                stream=True
         | 
| 1276 | 
            +
            )
         | 
| 1277 | 
            +
            generated_text = ""
         | 
| 1278 | 
            +
            for new_text in res:
         | 
| 1279 | 
            +
                generated_text += new_text
         | 
| 1280 | 
            +
                print(new_text, flush=True, end='')
         | 
| 1281 | 
            +
            ```
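For multi-turn image chat, append the assistant's reply and the next question to the same `msgs` list. This is a sketch that reuses the streamed answer accumulated above; the follow-up question is a placeholder.

```python
# Sketch: a second, text-only turn about the same image.
msgs.append({'role': 'assistant', 'content': [generated_text]})
msgs.append({'role': 'user', 'content': ['What colors stand out the most?']})  # placeholder follow-up

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)
```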
         | 
| 1282 | 
            +
             | 
| 1283 | 
            +
            #### Chat with multiple images
         | 
| 1284 | 
            +
            <details>
         | 
| 1285 | 
            +
<summary> Click to show Python code running MiniCPM-o 2.6 with multiple image inputs. </summary>
         | 
| 1286 | 
            +
              
         | 
| 1287 | 
            +
            ```python
         | 
| 1288 | 
            +
            image1 = Image.open('image1.jpg').convert('RGB')
         | 
| 1289 | 
            +
            image2 = Image.open('image2.jpg').convert('RGB')
         | 
| 1290 | 
            +
            question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
         | 
| 1291 | 
            +
            msgs = [{'role': 'user', 'content': [image1, image2, question]}]
         | 
| 1292 | 
            +
            answer = model.chat(
         | 
| 1293 | 
            +
                msgs=msgs,
         | 
| 1294 | 
            +
                tokenizer=tokenizer
         | 
| 1295 | 
            +
            )
         | 
| 1296 | 
            +
            print(answer)
         | 
| 1297 | 
            +
            ```
         | 
| 1298 | 
            +
            </details>
         | 
| 1299 | 
            +
             | 
| 1300 | 
            +
            #### In-context few-shot learning
         | 
| 1301 | 
            +
            <details>
         | 
| 1302 | 
            +
            <summary> Click to view Python code running MiniCPM-o 2.6 with few-shot input. </summary>
         | 
| 1303 | 
            +
             | 
| 1304 | 
            +
            ```python
         | 
| 1305 | 
            +
            question = "production date" 
         | 
| 1306 | 
            +
            image1 = Image.open('example1.jpg').convert('RGB')
         | 
| 1307 | 
            +
            answer1 = "2023.08.04"
         | 
| 1308 | 
            +
            image2 = Image.open('example2.jpg').convert('RGB')
         | 
| 1309 | 
            +
            answer2 = "2007.04.24"
         | 
| 1310 | 
            +
            image_test = Image.open('test.jpg').convert('RGB')
         | 
| 1311 | 
            +
            msgs = [
         | 
| 1312 | 
            +
                {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
         | 
| 1313 | 
            +
                {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
         | 
| 1314 | 
            +
                {'role': 'user', 'content': [image_test, question]}
         | 
| 1315 | 
            +
            ]
         | 
| 1316 | 
            +
            answer = model.chat(
         | 
| 1317 | 
            +
                msgs=msgs,
         | 
| 1318 | 
            +
                tokenizer=tokenizer
         | 
| 1319 | 
            +
            )
         | 
| 1320 | 
            +
            print(answer)
         | 
| 1321 | 
            +
            ```
         | 
| 1322 | 
            +
            </details>
         | 
| 1323 | 
            +
             | 
| 1324 | 
            +
            #### Chat with video
         | 
| 1325 | 
            +
            <details>
         | 
| 1326 | 
            +
            <summary> Click to view Python code running MiniCPM-o 2.6 with video input. </summary>
         | 
| 1327 | 
            +
             | 
| 1328 | 
            +
            ```python
         | 
| 1329 | 
            +
            from decord import VideoReader, cpu  # pip install decord

            MAX_NUM_FRAMES = 64  # if CUDA OOM, set a smaller number
         | 
| 1330 | 
            +
            def encode_video(video_path):
         | 
| 1331 | 
            +
                def uniform_sample(l, n):
         | 
| 1332 | 
            +
                    gap = len(l) / n
         | 
| 1333 | 
            +
                    idxs = [int(i * gap + gap / 2) for i in range(n)]
         | 
| 1334 | 
            +
                    return [l[i] for i in idxs]
         | 
| 1335 | 
            +
                vr = VideoReader(video_path, ctx=cpu(0))
         | 
| 1336 | 
            +
                sample_fps = round(vr.get_avg_fps() / 1)  # sample roughly 1 frame per second
         | 
| 1337 | 
            +
                frame_idx = [i for i in range(0, len(vr), sample_fps)]
         | 
| 1338 | 
            +
                if len(frame_idx) > MAX_NUM_FRAMES:
         | 
| 1339 | 
            +
                    frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
         | 
| 1340 | 
            +
                frames = vr.get_batch(frame_idx).asnumpy()
         | 
| 1341 | 
            +
                frames = [Image.fromarray(v.astype('uint8')) for v in frames]
         | 
| 1342 | 
            +
                print('num frames:', len(frames))
         | 
| 1343 | 
            +
                return frames
         | 
| 1344 | 
            +
            video_path ="video_test.mp4"
         | 
| 1345 | 
            +
            frames = encode_video(video_path)
         | 
| 1346 | 
            +
            question = "Describe the video"
         | 
| 1347 | 
            +
            msgs = [
         | 
| 1348 | 
            +
                {'role': 'user', 'content': frames + [question]}, 
         | 
| 1349 | 
            +
            ]
         | 
| 1350 | 
            +
            # Set decode params for video
         | 
| 1351 | 
            +
            params={}
         | 
| 1352 | 
            +
            params["use_image_id"] = False
         | 
| 1353 | 
            +
            params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution >  448*448
         | 
| 1354 | 
            +
            answer = model.chat(
         | 
| 1355 | 
            +
                msgs=msgs,
         | 
| 1356 | 
            +
                tokenizer=tokenizer,
         | 
| 1357 | 
            +
                **params
         | 
| 1358 | 
            +
            )
         | 
| 1359 | 
            +
            print(answer)
         | 
| 1360 | 
            +
            ```
         | 
| 1361 | 
            +
            </details>
         | 
| 1362 | 
            +
             | 
| 1363 | 
            +
            Please refer to [GitHub](https://github.com/OpenBMB/MiniCPM-V) for more details about usage.
         | 
| 1364 | 
            +
             | 
| 1365 | 
            +
             | 
| 1366 | 
            +
            ## Inference with llama.cpp<a id="llamacpp"></a>
         | 
| 1367 | 
            +
            MiniCPM-o 2.6 can run with llama.cpp. See our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-v2.5/examples/minicpmv) for more details.
         | 
| 1368 | 
            +
             | 
| 1369 | 
            +
             | 
| 1370 | 
            +
            ## Int4 quantized version
         | 
| 1371 | 
            +
            Download the int4 quantized version for lower GPU memory (7GB) usage: [MiniCPM-o-2_6-int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4).
         | 
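            A minimal loading sketch for the int4 checkpoint (assumptions: it is used through the same `model.chat` interface shown above, and any extra quantization dependencies listed on the int4 model card may be required):

            ```python
            from transformers import AutoModel, AutoTokenizer

            # Hypothetical minimal setup; check the MiniCPM-o-2_6-int4 model card for exact requirements.
            model_id = 'openbmb/MiniCPM-o-2_6-int4'
            model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
            tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
            model.eval()

            # The chat API is the same as for the full-precision model, e.g.:
            # answer = model.chat(msgs=msgs, tokenizer=tokenizer)
            ```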
| 1372 | 
            +
             | 
| 1373 | 
            +
             | 
| 1374 | 
            +
            ## License
         | 
| 1375 | 
            +
            #### Model License
         | 
| 1376 | 
            +
            * The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License. 
         | 
| 1377 | 
            +
            * The usage of MiniCPM-o and MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
         | 
| 1378 | 
            +
            * The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-o 2.6 weights are also available for free commercial use.
         | 
| 1379 | 
            +
             | 
| 1380 | 
            +
             | 
| 1381 | 
            +
            #### Statement
         | 
| 1382 | 
            +
            * As an LMM, MiniCPM-o 2.6 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-o 2.6 does not represent the views and positions of the model developers.
         | 
| 1383 | 
            +
            * We will not be liable for any problems arising from the use of the MiniCPM-o and MiniCPM-V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination, or abuse of the model.
         | 
| 1384 | 
            +
             | 
| 1385 | 
            +
            ## Key Techniques and Other Multimodal Projects
         | 
| 1386 | 
            +
             | 
| 1387 | 
            +
            👏 Welcome to explore the key techniques behind MiniCPM-o 2.6 and other multimodal projects from our team:
         | 
| 1388 | 
            +
             | 
| 1389 | 
            +
            [VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD)  | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
         | 
| 1390 | 
            +
             | 
| 1391 | 
            +
            ## Citation
         | 
| 1392 | 
            +
             | 
| 1393 | 
            +
            If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!
         | 
| 1394 | 
            +
             | 
| 1395 | 
            +
            ```bib
         | 
| 1396 | 
            +
            @article{yao2024minicpm,
         | 
| 1397 | 
            +
              title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
         | 
| 1398 | 
            +
              author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
         | 
| 1399 | 
            +
              journal={arXiv preprint arXiv:2408.01800},
         | 
| 1400 | 
            +
              year={2024}
         | 
| 1401 | 
            +
            }
         | 
| 1402 | 
            +
            ```
         | 
    	
        configuration_minicpm.py
    CHANGED
    
    | @@ -190,6 +190,7 @@ class MiniCPMOConfig(Qwen2Config): | |
| 190 | 
             
                    elif isinstance(vision_config, SiglipVisionConfig):
         | 
| 191 | 
             
                        self.vision_config = vision_config
         | 
| 192 |  | 
|  | |
| 193 | 
             
                    if audio_config is None:
         | 
| 194 | 
             
                        self.audio_config = WhisperConfig()
         | 
| 195 | 
             
                    elif isinstance(audio_config, dict):
         | 
|  | |
| 190 | 
             
                    elif isinstance(vision_config, SiglipVisionConfig):
         | 
| 191 | 
             
                        self.vision_config = vision_config
         | 
| 192 |  | 
| 193 | 
            +
                    # same as openai/whisper-medium, with use_cache added
         | 
| 194 | 
             
                    if audio_config is None:
         | 
| 195 | 
             
                        self.audio_config = WhisperConfig()
         | 
| 196 | 
             
                    elif isinstance(audio_config, dict):
         | 
    	
        modeling_minicpmo.py
    CHANGED
    
    | @@ -121,19 +121,21 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 121 |  | 
| 122 | 
             
                    self.processor = AutoProcessor.from_pretrained(self.config._name_or_path, trust_remote_code=True)
         | 
| 123 |  | 
| 124 | 
            -
                    self.terminators = [ | 
| 125 |  | 
| 126 | 
             
                    self.default_tts_chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n<|spk_bos|><|spk|><|spk_eos|><|tts_bos|>' }}{% endif %}"
         | 
| 127 | 
             
                    self.force_no_stop = False
         | 
| 128 |  | 
| 129 | 
             
                    # for stream api
         | 
|  | |
|  | |
|  | |
| 130 | 
             
                    self.session_id = None
         | 
| 131 | 
             
                    self.new_user_msg = True
         | 
| 132 | 
             
                    self.llm_generated = False
         | 
| 133 | 
             
                    self.llm_generate_completed = False
         | 
| 134 | 
             
                    self.llm_past_key_values = None
         | 
| 135 | 
             
                    self.audio_past_key_values = None  # apm kv cache
         | 
| 136 | 
            -
                    self.speak_score = [0.0]
         | 
| 137 |  | 
| 138 | 
             
                def init_tts(
         | 
| 139 | 
             
                    self,
         | 
| @@ -401,6 +403,21 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 401 | 
             
                    return vllm_embedding, vision_hidden_states
         | 
| 402 |  | 
| 403 | 
             
                def get_audio_embedding_streaming(self, data):
         | 
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
| 404 | 
             
                    wavforms = data.get("audio_features", [])  # (bs, 80, frames) or [], multi audios need filled in advance
         | 
| 405 | 
             
                    audio_feature_lens_raw = data.get("audio_feature_lens", [])  # list, [[x1, x2], [y1], [z1]]
         | 
| 406 |  | 
| @@ -447,15 +464,24 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 447 | 
             
                        return []
         | 
| 448 |  | 
| 449 | 
             
                def get_audio_embedding(self, data, chunk_length=-1):
         | 
| 450 | 
            -
                    """
         | 
| 451 | 
            -
                     | 
|  | |
|  | |
|  | |
|  | |
|  | |
| 452 | 
             
                    Args:
         | 
| 453 | 
            -
                        data:
         | 
| 454 | 
            -
             | 
| 455 | 
            -
             | 
|  | |
|  | |
|  | |
| 456 | 
             
                    Returns:
         | 
| 457 | 
            -
                        audio embeddings
         | 
| 458 | 
             
                    """
         | 
|  | |
| 459 | 
             
                    wavforms = data.get("audio_features", [])  # (bs, 80, frames) or [], multi audios need filled in advance
         | 
| 460 | 
             
                    audio_feature_lens_raw = data.get("audio_feature_lens", [])  # list, [[x1, x2], [y1], [z1]]
         | 
| 461 |  | 
| @@ -520,7 +546,6 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 520 |  | 
| 521 | 
             
                def get_omni_embedding(self, data, input_embeddings, chunk_length=-1, stream_input=False):
         | 
| 522 | 
             
                    """
         | 
| 523 | 
            -
             | 
| 524 | 
             
                    Args:
         | 
| 525 | 
             
                        data:
         | 
| 526 | 
             
                        input_embeddings:
         | 
| @@ -576,14 +601,21 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 576 |  | 
| 577 | 
             
                def forward(self, data, **kwargs):
         | 
| 578 | 
             
                    vllm_embedding, vision_hidden_states = self.get_vllm_embedding(data)
         | 
| 579 | 
            -
             | 
| 580 | 
            -
             | 
| 581 | 
            -
             | 
|  | |
|  | |
| 582 |  | 
| 583 | 
             
                    position_ids = data["position_ids"]
         | 
| 584 | 
             
                    if position_ids.dtype != torch.int64:
         | 
| 585 | 
             
                        position_ids = position_ids.long()
         | 
| 586 |  | 
|  | |
|  | |
|  | |
|  | |
|  | |
| 587 | 
             
                    return self.llm(input_ids=None, position_ids=position_ids, inputs_embeds=vllm_embedding, **kwargs)
         | 
| 588 |  | 
| 589 | 
             
                def _decode(self, inputs_embeds, tokenizer, attention_mask, **kwargs):
         | 
| @@ -627,6 +659,93 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 627 | 
             
                        result_text.append(tokenizer.decode(result))
         | 
| 628 | 
             
                    return result_text
         | 
| 629 |  | 
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
| 630 | 
             
                def generate(
         | 
| 631 | 
             
                    self,
         | 
| 632 | 
             
                    input_ids=None,
         | 
| @@ -697,7 +816,7 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 697 | 
             
                    omni_input=False,
         | 
| 698 | 
             
                    max_slice_nums=None,
         | 
| 699 | 
             
                    use_image_id=None,
         | 
| 700 | 
            -
                     | 
| 701 | 
             
                    generate_audio=False,
         | 
| 702 | 
             
                    return_spk_embed=False,
         | 
| 703 | 
             
                    return_dict=False,
         | 
| @@ -721,7 +840,7 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 721 | 
             
                        omni_input: determine whether it is omni mode
         | 
| 722 | 
             
                        max_slice_nums: control the maximum number of image slices
         | 
| 723 | 
             
                        use_image_id: for video understanding or omni understanding, use_image_id should be False
         | 
| 724 | 
            -
                         | 
| 725 | 
             
                        generate_audio: whether to generate audio output, only used when return_dict=True
         | 
| 726 | 
             
                        return_spk_embed: whether to return spk embedding, only used when return_dict=True
         | 
| 727 | 
             
                        return_dict: whether to return dict
         | 
| @@ -798,12 +917,12 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 798 | 
             
                            for c in content:
         | 
| 799 | 
             
                                if isinstance(c, Image.Image):
         | 
| 800 | 
             
                                    images.append(c)
         | 
| 801 | 
            -
                                    cur_msgs.append("<image>./</image>")
         | 
| 802 | 
             
                                elif isinstance(c, np.ndarray):  # audio
         | 
| 803 | 
             
                                    audios.append(c)
         | 
| 804 | 
             
                                    audio_parts.append(i)
         | 
| 805 | 
            -
                                    cur_msgs.append("<audio>./</audio>")
         | 
| 806 | 
            -
                                     | 
| 807 | 
             
                                elif isinstance(c, str):
         | 
| 808 | 
             
                                    cur_msgs.append(c)
         | 
| 809 | 
             
                            if omni_input:
         | 
| @@ -816,7 +935,7 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 816 | 
             
                                copy_msgs,
         | 
| 817 | 
             
                                tokenize=False,
         | 
| 818 | 
             
                                add_generation_prompt=True,
         | 
| 819 | 
            -
                                chat_template=self.default_tts_chat_template if  | 
| 820 | 
             
                            )
         | 
| 821 | 
             
                        )
         | 
| 822 | 
             
                        input_images_list.append(images)
         | 
| @@ -886,13 +1005,18 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 886 | 
             
                        else:
         | 
| 887 | 
             
                            answer = res[0]
         | 
| 888 |  | 
| 889 | 
            -
                            if  | 
| 890 | 
             
                                mel_spec = self._generate_mel_spec(inputs, outputs, answer)
         | 
| 891 | 
             
                                wav_numpy, sr = self.decode_mel_to_audio(mel_spec, output_audio_path)
         | 
| 892 |  | 
| 893 | 
             
                        if return_spk_embed:
         | 
| 894 | 
             
                            spk_embeds = self._get_last_spk_embeds(inputs, outputs)
         | 
| 895 |  | 
|  | |
|  | |
|  | |
|  | |
|  | |
| 896 | 
             
                        if return_dict:
         | 
| 897 | 
             
                            return OmniOutput(text=answer, spk_embeds=spk_embeds, audio_wav=wav_numpy, sampling_rate=sr)
         | 
| 898 | 
             
                        else:
         | 
| @@ -904,6 +1028,7 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 904 | 
             
                    session_id,
         | 
| 905 | 
             
                    msgs,
         | 
| 906 | 
             
                    tokenizer,
         | 
|  | |
| 907 | 
             
                    max_slice_nums=None,
         | 
| 908 | 
             
                    ls_temperature=1.0,
         | 
| 909 | 
             
                    **kwargs,
         | 
| @@ -933,26 +1058,27 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 933 | 
             
                    for j, c in enumerate(content):
         | 
| 934 | 
             
                        if isinstance(c, Image.Image):
         | 
| 935 | 
             
                            images.append(c)
         | 
| 936 | 
            -
                            cur_msgs.append("<image>./</image>")
         | 
| 937 | 
             
                        elif isinstance(c, np.ndarray):  # audio
         | 
| 938 | 
             
                            audios.append(c)
         | 
| 939 | 
            -
                            cur_msgs.append("<audio>./</audio>")
         | 
| 940 | 
             
                        elif isinstance(c, str):
         | 
| 941 | 
             
                            cur_msgs.append(c)
         | 
| 942 | 
             
                        else:
         | 
| 943 | 
             
                            logger.error("Invalid content type:", c)
         | 
| 944 |  | 
|  | |
| 945 | 
             
                    if not self.is_first and self.new_user_msg and msg["role"] == "user":  # new user add im_start
         | 
| 946 | 
             
                        if self.llm_generated:
         | 
| 947 | 
             
                            if self.llm_generate_completed:
         | 
| 948 | 
            -
                                msg["content"] = "<|im_end|>\n<|im_start|>user\n" +  | 
| 949 | 
             
                            else:  # break llm gen, add tts_eos
         | 
| 950 | 
            -
                                msg["content"] = "<|tts_eos|><|im_end|>\n<|im_start|>user\n" +  | 
| 951 | 
             
                        else:
         | 
| 952 | 
            -
                            msg["content"] = "<|im_start|>user\n" +  | 
| 953 | 
             
                        self.new_user_msg = False
         | 
| 954 | 
             
                    else:
         | 
| 955 | 
            -
                        msg["content"] =  | 
| 956 |  | 
| 957 | 
             
                    if msg["role"] in ["system", "assistant"]:
         | 
| 958 | 
             
                        self.new_user_msg = True
         | 
| @@ -960,11 +1086,9 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 960 |  | 
| 961 | 
             
                    if self.is_first:
         | 
| 962 | 
             
                        # init pask_key_values
         | 
| 963 | 
            -
                        logger. | 
|  | |
| 964 | 
             
                        self.session_id = session_id
         | 
| 965 | 
            -
                        self.llm_past_key_values = None  # llm kv cache
         | 
| 966 | 
            -
                        self.new_user_msg = True
         | 
| 967 | 
            -
                        self.audio_past_key_values = None  # apm kv cache
         | 
| 968 |  | 
| 969 | 
             
                        prompt = tokenizer.apply_chat_template(
         | 
| 970 | 
             
                            copy_msgs, tokenize=False, add_generation_prompt=False, chat_template=self.default_tts_chat_template
         | 
| @@ -1015,14 +1139,7 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 1015 | 
             
                        return_dict=True,
         | 
| 1016 | 
             
                    )
         | 
| 1017 | 
             
                    self.llm_past_key_values = outputs["past_key_values"]
         | 
| 1018 | 
            -
             | 
| 1019 | 
            -
                    listen_id = tokenizer.convert_tokens_to_ids("<|listen|>")
         | 
| 1020 | 
            -
                    speak_id = tokenizer.convert_tokens_to_ids("<|speak|>")
         | 
| 1021 | 
            -
                    listen_speak_score = torch.Tensor([outputs["logits"][0, -1, listen_id], outputs["logits"][0, -1, speak_id]])
         | 
| 1022 | 
            -
                    listen_speak_score = F.softmax(listen_speak_score / ls_temperature, dim=0).numpy()
         | 
| 1023 | 
            -
                    self.speak_score = [float(listen_speak_score[1])]
         | 
| 1024 | 
            -
             | 
| 1025 | 
            -
                    return self.speak_score
         | 
| 1026 |  | 
| 1027 | 
             
                @torch.inference_mode()
         | 
| 1028 | 
             
                def streaming_generate(
         | 
| @@ -1032,7 +1149,7 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 1032 | 
             
                    max_new_tokens=512,
         | 
| 1033 | 
             
                    min_new_tokens=0,
         | 
| 1034 | 
             
                    sampling=True,
         | 
| 1035 | 
            -
                     | 
| 1036 | 
             
                    enable_regenerate=False,
         | 
| 1037 | 
             
                    **kwargs,
         | 
| 1038 | 
             
                ):
         | 
| @@ -1079,7 +1196,7 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 1079 | 
             
                    generation_config["max_new_tokens"] = max_new_tokens
         | 
| 1080 | 
             
                    streamer = self.llm_generate_chunk(input_ids, attention_mask, tokenizer, terminators, generation_config)
         | 
| 1081 |  | 
| 1082 | 
            -
                    if  | 
| 1083 | 
             
                        result = self._generate_mel_spec_audio_streaming(
         | 
| 1084 | 
             
                            spk_bounds, streamer, output_chunk_size=25, enable_regenerate=enable_regenerate
         | 
| 1085 | 
             
                        )
         | 
| @@ -1323,6 +1440,10 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 1323 | 
             
                    return mel_spec
         | 
| 1324 |  | 
| 1325 | 
             
                def _linear_overlap_add2_wav(self, frames: List[torch.Tensor], overlap: int):
         | 
|  | |
|  | |
|  | |
|  | |
| 1326 | 
             
                    assert len(frames) == 2
         | 
| 1327 | 
             
                    device = frames[0].device
         | 
| 1328 | 
             
                    dtype = frames[0].dtype
         | 
| @@ -1569,7 +1690,8 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 1569 | 
             
                                    prev_wav = wav_np[len(prev_wav) :]
         | 
| 1570 | 
             
                                    cur_text = gen_text_raw[prev_text_len:]
         | 
| 1571 | 
             
                                    prev_text_len = len(gen_text_raw)
         | 
| 1572 | 
            -
                                    yield wav_y, sr | 
|  | |
| 1573 | 
             
                                else:
         | 
| 1574 | 
             
                                    prev_wav = wav_np
         | 
| 1575 | 
             
                            else:
         | 
| @@ -1580,7 +1702,8 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 1580 | 
             
                                    )  # tts_hop256*2
         | 
| 1581 | 
             
                                    cur_text = gen_text_raw[prev_text_len:]
         | 
| 1582 | 
             
                                    prev_text_len = len(gen_text_raw)
         | 
| 1583 | 
            -
                                    yield wav_np, sr | 
|  | |
| 1584 | 
             
                                else:
         | 
| 1585 | 
             
                                    prev_wav = wav_np
         | 
| 1586 |  | 
| @@ -1678,7 +1801,7 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 1678 | 
             
                                    prev_wav = wav_np[len(prev_wav) :]
         | 
| 1679 | 
             
                                    cur_text = gen_text_raw[prev_text_len:]
         | 
| 1680 | 
             
                                    prev_text_len = len(gen_text_raw)
         | 
| 1681 | 
            -
                                    yield wav_y, sr | 
| 1682 | 
             
                                else:
         | 
| 1683 | 
             
                                    prev_wav = wav_np
         | 
| 1684 | 
             
                            else:
         | 
| @@ -1689,7 +1812,7 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 1689 | 
             
                                    )  # tts_hop256*2
         | 
| 1690 | 
             
                                    cur_text = gen_text_raw[prev_text_len:]
         | 
| 1691 | 
             
                                    prev_text_len = len(gen_text_raw)
         | 
| 1692 | 
            -
                                    yield wav_np, sr | 
| 1693 | 
             
                                else:
         | 
| 1694 | 
             
                                    prev_wav = wav_np
         | 
| 1695 |  | 
| @@ -1703,7 +1826,7 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 1703 |  | 
| 1704 | 
             
                    if prev_wav is not None:
         | 
| 1705 | 
             
                        cur_text = gen_text_raw[prev_text_len:]
         | 
| 1706 | 
            -
                        yield prev_wav, sr | 
| 1707 |  | 
| 1708 | 
             
                    if new_segment_gen and not stop:
         | 
| 1709 | 
             
                        logger.debug(
         | 
| @@ -1737,6 +1860,7 @@ class MiniCPMO(MiniCPMOPreTrainedModel): | |
| 1737 | 
             
                    return wav_numpy, sr
         | 
| 1738 |  | 
| 1739 |  | 
|  | |
| 1740 | 
             
            class MiniCPMWhisperEncoderLayer(nn.Module):
         | 
| 1741 | 
             
                def __init__(self, config: WhisperConfig, layer_idx: int = None):
         | 
| 1742 | 
             
                    super().__init__()
         | 
| @@ -1765,6 +1889,24 @@ class MiniCPMWhisperEncoderLayer(nn.Module): | |
| 1765 | 
             
                    past_key_values: Optional[EncoderDecoderCache] = None,
         | 
| 1766 | 
             
                    use_cache: Optional[bool] = False,
         | 
| 1767 | 
             
                ) -> torch.Tensor:
         | 
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
| 1768 | 
             
                    residual = hidden_states
         | 
| 1769 | 
             
                    hidden_states = self.self_attn_layer_norm(hidden_states)
         | 
| 1770 | 
             
                    hidden_states, attn_weights, past_key_values = self.self_attn(
         | 
| @@ -1802,6 +1944,7 @@ class MiniCPMWhisperEncoderLayer(nn.Module): | |
| 1802 | 
             
                    return outputs
         | 
| 1803 |  | 
| 1804 |  | 
|  | |
| 1805 | 
             
            class MiniCPMWhisperEncoder(WhisperEncoder):
         | 
| 1806 |  | 
| 1807 | 
             
                def __init__(self, config: WhisperConfig):
         | 
| @@ -1821,6 +1964,107 @@ class MiniCPMWhisperEncoder(WhisperEncoder): | |
| 1821 | 
             
                    past_key_values: Optional[EncoderDecoderCache] = None,
         | 
| 1822 | 
             
                    use_cache: Optional[bool] = None,
         | 
| 1823 | 
             
                ):
         | 
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
| 1824 | 
             
                    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
         | 
| 1825 | 
             
                    output_hidden_states = (
         | 
| 1826 | 
             
                        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
         | 
| @@ -1935,7 +2179,7 @@ class MiniCPMWhisperEncoder(WhisperEncoder): | |
| 1935 | 
             
                    )
         | 
| 1936 |  | 
| 1937 |  | 
| 1938 | 
            -
            # dvae | 
| 1939 | 
             
            class ConvNeXtBlock(nn.Module):
         | 
| 1940 | 
             
                def __init__(
         | 
| 1941 | 
             
                    self,
         | 
| @@ -1989,6 +2233,7 @@ class ConvNeXtBlock(nn.Module): | |
| 1989 | 
             
                    return x
         | 
| 1990 |  | 
| 1991 |  | 
|  | |
| 1992 | 
             
            class GFSQ(nn.Module):
         | 
| 1993 | 
             
                def __init__(
         | 
| 1994 | 
             
                    self,
         | 
| @@ -2031,6 +2276,7 @@ class GFSQ(nn.Module): | |
| 2031 | 
             
                    return ind.transpose_(1, 2) if self.transpose else ind
         | 
| 2032 |  | 
| 2033 |  | 
|  | |
| 2034 | 
             
            class DVAEDecoder(nn.Module):
         | 
| 2035 | 
             
                def __init__(
         | 
| 2036 | 
             
                    self,
         | 
| @@ -2075,6 +2321,7 @@ class DVAEDecoder(nn.Module): | |
| 2075 | 
             
                    return x
         | 
| 2076 |  | 
| 2077 |  | 
|  | |
| 2078 | 
             
            class DVAE(nn.Module):
         | 
| 2079 | 
             
                def __init__(
         | 
| 2080 | 
             
                    self,
         | 
| @@ -2153,7 +2400,6 @@ class DVAE(nn.Module): | |
| 2153 | 
             
                    return torch.mul(dec_out, self.coef, out=dec_out)
         | 
| 2154 |  | 
| 2155 |  | 
| 2156 | 
            -
            # tts module
         | 
| 2157 | 
             
            def apply_spk_emb(
         | 
| 2158 | 
             
                input_ids: torch.Tensor = None,
         | 
| 2159 | 
             
                spk_emb: torch.Tensor = None,
         | 
| @@ -2162,7 +2408,7 @@ def apply_spk_emb( | |
| 2162 | 
             
                num_spk_embs: int = 1,
         | 
| 2163 | 
             
            ):
         | 
| 2164 | 
             
                """
         | 
| 2165 | 
            -
                Replace consecutive speaker embedding placeholders in input_embeds with pre-prepared speaker embeddings. This is an in-place replacement, no new tensor is created, so no value is returned.
         | 
| 2166 |  | 
| 2167 | 
             
                Args:
         | 
| 2168 | 
             
                    input_ids (torch.Tensor): Input ID tensor, shape [batch_size, seq_len_max]
         | 
| @@ -2201,7 +2447,7 @@ def make_streaming_chunk_mask_generation( | |
| 2201 | 
             
                use_spk_emb: bool = True,
         | 
| 2202 | 
             
            ) -> torch.Tensor:
         | 
| 2203 | 
             
                """
         | 
| 2204 | 
            -
                 | 
| 2205 |  | 
| 2206 | 
             
                This function creates a mask that allows the model to attend to a specific chunk of text
         | 
| 2207 | 
             
                tokens when generating each chunk of audio tokens, enabling streaming TTS generation.
         | 
| @@ -2258,6 +2504,7 @@ def make_streaming_chunk_mask_generation( | |
| 2258 | 
             
                return causal_mask
         | 
| 2259 |  | 
| 2260 |  | 
|  | |
| 2261 | 
             
            class CustomRepetitionPenaltyLogitsProcessorRepeat:
         | 
| 2262 | 
             
                def __init__(self, penalty: float, max_input_ids: int, past_window: int):
         | 
| 2263 | 
             
                    if not isinstance(penalty, float) or not (penalty > 0):
         | 
| @@ -2316,6 +2563,97 @@ class MultiModalProjector(nn.Module): | |
| 2316 |  | 
| 2317 |  | 
| 2318 | 
             
            class ConditionalChatTTS(PreTrainedModel):
         | 
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
| 2319 | 
             
                config_class = ConditionalChatTTSConfig
         | 
| 2320 |  | 
| 2321 | 
             
                def __init__(self, config: ConditionalChatTTSConfig):
         | 
| @@ -2373,19 +2711,16 @@ class ConditionalChatTTS(PreTrainedModel): | |
| 2373 | 
             
                    self.model = model
         | 
| 2374 |  | 
| 2375 | 
             
                @torch.inference_mode()
         | 
| 2376 | 
            -
                def  | 
| 2377 | 
             
                    self,
         | 
| 2378 | 
             
                    input_ids: torch.Tensor,
         | 
| 2379 | 
             
                    lm_spk_emb_last_hidden_states: Optional[torch.Tensor] = None,
         | 
| 2380 | 
            -
                    lm_last_hidden_states: Optional[torch.Tensor] = None,
         | 
| 2381 | 
             
                ):
         | 
| 2382 | 
            -
                    """ | 
| 2383 | 
            -
                    encode input_ids to embeddings, then merge lm_spk_emb_last_hidden_states, and lm_last_hidden_states.
         | 
| 2384 |  | 
| 2385 | 
             
                    Args:
         | 
| 2386 | 
             
                        input_ids (torch.Tensor): Input token IDs.
         | 
| 2387 | 
             
                        lm_spk_emb_last_hidden_states (Optional[torch.Tensor], optional): Last hidden states of speaker embeddings from the language model. Defaults to None.
         | 
| 2388 | 
            -
                        lm_last_hidden_states (Optional[torch.Tensor], optional): Last hidden states from the language model. Defaults to None.
         | 
| 2389 |  | 
| 2390 | 
             
                    Raises:
         | 
| 2391 | 
             
                        NotImplementedError: If speaker embedding is not used and language model hidden states are not implemented.
         | 
| @@ -2415,8 +2750,6 @@ class ConditionalChatTTS(PreTrainedModel): | |
| 2415 | 
             
                                num_spk_embs=self.num_spk_embs,
         | 
| 2416 | 
             
                            )
         | 
| 2417 | 
             
                    else:
         | 
| 2418 | 
            -
                        assert lm_last_hidden_states is not None
         | 
| 2419 | 
            -
                        # TODO: Add projected language model hidden states to tts embedding space
         | 
| 2420 | 
             
                        raise NotImplementedError
         | 
| 2421 |  | 
| 2422 | 
             
                    return inputs_embeds
         | 
| @@ -2428,10 +2761,9 @@ class ConditionalChatTTS(PreTrainedModel): | |
| 2428 | 
             
                    position_ids: torch.LongTensor,
         | 
| 2429 | 
             
                    past_key_values: List[Tuple[torch.Tensor, torch.Tensor]],
         | 
| 2430 | 
             
                    lm_spk_emb_last_hidden_states: Optional[torch.Tensor] = None,
         | 
| 2431 | 
            -
                    lm_last_hidden_states: Optional[torch.Tensor] = None,
         | 
| 2432 | 
             
                ):
         | 
| 2433 | 
             
                    """Prefill a chunk of new text tokens in streaming setting.
         | 
| 2434 | 
            -
                    Specifically speaking, update `past_key_values` using new text tokens.
         | 
| 2435 |  | 
| 2436 | 
             
                    Args:
         | 
| 2437 | 
             
                        input_ids (Tensor): Tensor of shape [batch_size, seq_len]
         | 
| @@ -2445,11 +2777,10 @@ class ConditionalChatTTS(PreTrainedModel): | |
| 2445 | 
             
                    assert input_ids.shape[0] == 1
         | 
| 2446 | 
             
                    assert past_key_values is not None
         | 
| 2447 |  | 
| 2448 | 
            -
                    # Merge text and embeddings | 
| 2449 | 
            -
                    inputs_embeds = self. | 
| 2450 | 
             
                        input_ids=input_ids,
         | 
| 2451 | 
             
                        lm_spk_emb_last_hidden_states=lm_spk_emb_last_hidden_states,
         | 
| 2452 | 
            -
                        lm_last_hidden_states=lm_last_hidden_states,
         | 
| 2453 | 
             
                    )
         | 
| 2454 |  | 
| 2455 | 
             
                    # Clone KV Cache
         | 
| @@ -2476,7 +2807,7 @@ class ConditionalChatTTS(PreTrainedModel): | |
| 2476 | 
             
                    # Get model updated KV Cache
         | 
| 2477 | 
             
                    past_key_values_for_prefill_updated = outputs_prefill.past_key_values
         | 
| 2478 |  | 
| 2479 | 
            -
                    # Update generated KV Cache to input past_key_values
         | 
| 2480 | 
             
                    for layer_idx in range(len(past_key_values)):
         | 
| 2481 | 
             
                        # Update keys
         | 
| 2482 | 
             
                        past_key_values[layer_idx][0][:, :, position_ids[:, 0] : position_ids[:, -1] + 1, :] = (
         | 
| @@ -2504,7 +2835,9 @@ class ConditionalChatTTS(PreTrainedModel): | |
| 2504 | 
             
                    streaming_tts_text_mask=None,
         | 
| 2505 | 
             
                    add_audio_bos: bool = True,
         | 
| 2506 | 
             
                ):
         | 
| 2507 | 
            -
                    """
         | 
|  | |
|  | |
| 2508 | 
             
                    Args:
         | 
| 2509 | 
             
                        input_ids (torch.Tensor): (1, seq_len, num_vq) Audio input token ids.
         | 
| 2510 | 
             
                        past_key_values (List[Tuple[torch.Tensor, torch.Tensor]]): Past key values for attention mechanism.
         | 
| @@ -2534,7 +2867,7 @@ class ConditionalChatTTS(PreTrainedModel): | |
| 2534 | 
             
                        streaming_tts_text_mask=streaming_tts_text_mask,
         | 
| 2535 | 
             
                        streaming_reserved_length=self.streaming_text_reserved_len,
         | 
| 2536 | 
             
                        streaming_text_chunk_size=self.streaming_text_chunk_size,
         | 
| 2537 | 
            -
                    )  # [1, 1, 1, | 
| 2538 |  | 
| 2539 | 
             
                    # Model forward
         | 
| 2540 | 
             
                    outputs: BaseModelOutputWithPast = self.model(
         | 
| @@ -2564,57 +2897,12 @@ class ConditionalChatTTS(PreTrainedModel): | |
| 2564 | 
             
                    logits_processors: List[CustomRepetitionPenaltyLogitsProcessorRepeat] = [],
         | 
| 2565 | 
             
                    show_tqdm=False,
         | 
| 2566 | 
             
                ):
         | 
| 2567 | 
            -
                    """Generate audio codes in streaming setting.
         | 
| 2568 | 
             
                    Specifically speaking, generate audio codes when not all text tokens are prefilled.
         | 
| 2569 |  | 
| 2570 | 
            -
                     | 
| 2571 | 
            -
                        Always pass an non-empty `past_key_values` to the function. The function does not do `prefill` by itself. It relies on `prefill_text` method to provide a valid `past_key_values`.
         | 
| 2572 |  | 
| 2573 | 
            -
             | 
| 2574 | 
            -
                        ```python
         | 
| 2575 | 
            -
                        initial_kv_cache_length = 1 + self.num_spk_embs + self.streaming_text_reserved_len
         | 
| 2576 | 
            -
                        dtype = model.emb_text.weight.dtype
         | 
| 2577 | 
            -
                        device = model.emb_text.weight.device
         | 
| 2578 | 
            -
                        past_key_values = [
         | 
| 2579 | 
            -
                            (
         | 
| 2580 | 
            -
                                torch.zeros(1, model.config.num_attention_heads, initial_kv_cache_length, model.config.hidden_size // model.config.num_attention_heads, dtype=dtype, device=device),
         | 
| 2581 | 
            -
                                torch.zeros(1, model.config.num_attention_heads, initial_kv_cache_length, model.config.hidden_size // model.config.num_attention_heads, dtype=dtype, device=device)
         | 
| 2582 | 
            -
                            )
         | 
| 2583 | 
            -
                            for _ in range(model.config.num_hidden_layers)
         | 
| 2584 | 
            -
                        ]
         | 
| 2585 | 
            -
             | 
| 2586 | 
            -
                        2. Prefill some text tokens using `prefill_text` method.
         | 
| 2587 | 
            -
                        ```python
         | 
| 2588 | 
            -
                        outputs = llm.generate(**kwargs)
         | 
| 2589 | 
            -
                        lm_spk_emb_last_hidden_states or lm_last_hidden_states = extract(outputs.last_hidden_states)
         | 
| 2590 | 
            -
                        input_ids = tts_tokenizer.encode(llm_tokenizer.decode(llm_tokens))
         | 
| 2591 | 
            -
                        position_ids = torch.arange(begin, end, dtype=torch.long, device=device)
         | 
| 2592 | 
            -
                        past_key_values = self.prefill_text(
         | 
| 2593 | 
            -
                            input_ids=input_ids,
         | 
| 2594 | 
            -
                            position_ids=position_ids,
         | 
| 2595 | 
            -
                            past_key_values=past_key_values,
         | 
| 2596 | 
            -
                            lm_spk_emb_last_hidden_states=lm_spk_emb_last_hidden_states,
         | 
| 2597 | 
            -
                            lm_last_hidden_states=lm_last_hidden_states,
         | 
| 2598 | 
            -
                        )
         | 
| 2599 | 
            -
                        ```
         | 
| 2600 | 
            -
             | 
| 2601 | 
            -
                        3. Generate audio codes using `generate` method.
         | 
| 2602 | 
            -
                        ```python
         | 
| 2603 | 
            -
                        # initialize input_ids, this should be only done `once`
         | 
| 2604 | 
            -
                        condition_length = 1 + model.num_spk_embs * model.use_speaker_embedding + model.streaming_text_reserved_len + 1
         | 
| 2605 | 
            -
                        input_ids = torch.zeros(batch_size=1, condition_length, self.num_vq)
         | 
| 2606 | 
            -
             | 
| 2607 | 
            -
                        outputs = self.generate(
         | 
| 2608 | 
            -
                            input_ids=input_ids,
         | 
| 2609 | 
            -
                            past_key_values=past_key_values,
         | 
| 2610 | 
            -
                        )
         | 
| 2611 | 
            -
             | 
| 2612 | 
            -
                        # update past_key_values and input_ids
         | 
| 2613 | 
            -
                        past_key_values = outputs.past_key_values
         | 
| 2614 | 
            -
                        input_ids = outputs.input_ids
         | 
| 2615 | 
            -
                        ```
         | 
| 2616 | 
            -
             | 
| 2617 | 
            -
                        4. Repeat step 2 and 3.
         | 
| 2618 |  | 
| 2619 | 
             
                    Args:
         | 
| 2620 | 
             
                        input_ids (torch.Tensor): Input token ids.
         | 
| @@ -2626,8 +2914,7 @@ class ConditionalChatTTS(PreTrainedModel): | |
| 2626 | 
             
                        logits_warpers (List[LogitsWarper], optional): List of logits warpers. Defaults to [].
         | 
| 2627 | 
             
                        logits_processors (List[CustomRepetitionPenaltyLogitsProcessorRepeat], optional): List of logits processors. Defaults to [].
         | 
| 2628 | 
             
                        show_tqdm (bool, optional): Whether to show progress bar. Defaults to True.
         | 
| 2629 | 
            -
             | 
| 2630 | 
            -
                        NotImplementedError: _description_
         | 
| 2631 | 
             
                    Returns:
         | 
| 2632 | 
             
                        GenerationOutputs: Generation outputs.
         | 
| 2633 | 
             
                    """
         | 
| @@ -2655,7 +2942,7 @@ class ConditionalChatTTS(PreTrainedModel): | |
| 2655 | 
             
                        device=input_ids.device,
         | 
| 2656 | 
             
                    )
         | 
| 2657 |  | 
| 2658 | 
            -
                    # Copy existing input_ids to input_ids_buf
         | 
| 2659 | 
             
                    input_ids_buf.narrow(1, 0, progress).copy_(input_ids)
         | 
| 2660 |  | 
| 2661 | 
             
                    del input_ids
         | 
| @@ -2674,19 +2961,22 @@ class ConditionalChatTTS(PreTrainedModel): | |
| 2674 | 
             
                    for i in range(max_new_token):
         | 
| 2675 | 
             
                        # Prepare generation inputs
         | 
| 2676 | 
             
                        audio_bos = False
         | 
| 2677 | 
            -
             | 
|  | |
| 2678 | 
             
                        if progress == condition_length:
         | 
| 2679 | 
             
                            audio_bos = True
         | 
| 2680 |  | 
|  | |
|  | |
|  | |
|  | |
| 2681 | 
             
                        if audio_bos:
         | 
| 2682 | 
            -
                            # Generate the first token, activate the model with `self.audio_bos_token_id`, the model will predict a new audio token.
         | 
| 2683 | 
            -
                            assert progress == (past_key_values[0][0].shape[2] + 1)
         | 
| 2684 | 
             
                            narrowed_input_ids = torch.tensor([[self.audio_bos_token_id]], dtype=torch.long, device=self.device)
         | 
| 2685 | 
             
                            inputs_embeds = self.emb_text(narrowed_input_ids)
         | 
| 2686 | 
             
                            del narrowed_input_ids
         | 
| 2687 | 
             
                        else:
         | 
| 2688 | 
            -
                            # Generate the following audio tokens, it is applicable to all other cases, including second and the following calling of `generate | 
| 2689 | 
            -
                            assert progress == (past_key_values[0][0].shape[2] + 1)
         | 
| 2690 | 
             
                            narrowed_input_ids = input_ids.narrow(dim=1, start=input_ids.shape[1] - 1, length=1)
         | 
| 2691 | 
             
                            code_emb = [self.emb_code[i](narrowed_input_ids[:, :, i]) for i in range(self.num_vq)]
         | 
| 2692 | 
             
                            inputs_embeds = torch.stack(code_emb, 3).sum(3)
         | 
| @@ -2696,6 +2986,8 @@ class ConditionalChatTTS(PreTrainedModel): | |
| 2696 | 
             
                        ).unsqueeze(0)
         | 
| 2697 |  | 
| 2698 | 
             
                        cache_position = position_ids.clone()
         | 
|  | |
|  | |
| 2699 | 
             
                        causal_mask = make_streaming_chunk_mask_generation(
         | 
| 2700 | 
             
                            inputs_embeds=inputs_embeds,
         | 
| 2701 | 
             
                            past_seen_tokens=past_key_values[0][0].shape[2],
         | 
| @@ -2787,7 +3079,7 @@ class ConditionalChatTTS(PreTrainedModel): | |
| 2787 | 
             
                        finish.logical_or_(finish_or)
         | 
| 2788 |  | 
| 2789 | 
             
                        del finish_or
         | 
| 2790 | 
            -
                        #  | 
| 2791 | 
             
                        input_ids_buf.narrow(1, progress, 1).copy_(idx_next.unsqueeze_(1))
         | 
| 2792 |  | 
| 2793 | 
             
                        if i == 0 and finish.any():
         | 
| @@ -2831,8 +3123,18 @@ class ConditionalChatTTS(PreTrainedModel): | |
| 2831 | 
             
                def decode_to_mel_specs(
         | 
| 2832 | 
             
                    self,
         | 
| 2833 | 
             
                    result_list: List[torch.Tensor],
         | 
| 2834 | 
            -
                    use_decoder: bool = False,
         | 
| 2835 | 
             
                ):
         | 
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
| 2836 | 
             
                    decoder = self.dvae
         | 
| 2837 | 
             
                    max_x_len = -1
         | 
| 2838 | 
             
                    if len(result_list) == 0:
         | 
| @@ -2855,6 +3157,7 @@ class ConditionalChatTTS(PreTrainedModel): | |
| 2855 | 
             
                    return mel_specs
         | 
| 2856 |  | 
| 2857 |  | 
|  | |
| 2858 | 
             
            def gen_logits(
         | 
| 2859 | 
             
                num_code: int,
         | 
| 2860 | 
             
                top_P=0.7,
         | 
|  | |
| 121 |  | 
| 122 | 
             
                    self.processor = AutoProcessor.from_pretrained(self.config._name_or_path, trust_remote_code=True)
         | 
| 123 |  | 
| 124 | 
            +
                    self.terminators = ["<|im_end|>", "<|endoftext|>"]
         | 
| 125 |  | 
| 126 | 
             
                    self.default_tts_chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n<|spk_bos|><|spk|><|spk_eos|><|tts_bos|>' }}{% endif %}"
         | 
| 127 | 
             
                    self.force_no_stop = False
         | 
| 128 |  | 
| 129 | 
             
                    # for stream api
         | 
| 130 | 
            +
                    self.reset_session()
         | 
| 131 | 
            +
             | 
| 132 | 
            +
                def reset_session(self):
         | 
| 133 | 
             
                    self.session_id = None
         | 
| 134 | 
             
                    self.new_user_msg = True
         | 
| 135 | 
             
                    self.llm_generated = False
         | 
| 136 | 
             
                    self.llm_generate_completed = False
         | 
| 137 | 
             
                    self.llm_past_key_values = None
         | 
| 138 | 
             
                    self.audio_past_key_values = None  # apm kv cache
         | 
|  | |
| 139 |  | 
| 140 | 
             
                def init_tts(
         | 
| 141 | 
             
                    self,
         | 
|  | |
| 403 | 
             
                    return vllm_embedding, vision_hidden_states
         | 
| 404 |  | 
| 405 | 
             
                def get_audio_embedding_streaming(self, data):
         | 
| 406 | 
            +
                    r"""
         | 
| 407 | 
            +
                    Extract audio embeddings in a streaming manner using cached key-value pairs.
         | 
| 408 | 
            +
             | 
| 409 | 
            +
                    This method processes incoming audio features incrementally and stores/updates `past_key_values`
         | 
| 410 | 
            +
                    for faster inference on subsequent audio frames. It only supports batch_size=1 and is intended
         | 
| 411 | 
            +
                    for streaming scenarios.
         | 
| 412 | 
            +
             | 
| 413 | 
            +
                    Args:
         | 
| 414 | 
            +
                        data (dict):
         | 
| 415 | 
            +
                            - **"audio_features"** (`torch.FloatTensor`): Input mel-spectrograms of shape `(batch_size, 80, frames)`.
         | 
| 416 | 
            +
                            - **"audio_feature_lens"** (List[List[int]]): Lengths of each audio segment for each item in the batch.
         | 
| 417 | 
            +
             | 
| 418 | 
            +
                    Returns:
         | 
| 419 | 
            +
                        List[List[torch.Tensor]]: audio embeddings
         | 
| 420 | 
            +
                    """
         | 
| 421 | 
             
                    wavforms = data.get("audio_features", [])  # (bs, 80, frames) or [], multi audios need filled in advance
         | 
| 422 | 
             
                    audio_feature_lens_raw = data.get("audio_feature_lens", [])  # list, [[x1, x2], [y1], [z1]]
         | 
| 423 |  | 
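Since the streaming path above only documents its input dict, here is a minimal caller-side sketch of feeding mel chunks incrementally; the chunk size and the way the mel tensor is produced are assumptions for illustration, while the method name and the `data` keys come from this diff:

```python
import torch

def stream_audio_embeddings(model, mel: torch.Tensor, chunk_frames: int = 100):
    """Feed a mel spectrogram of shape (1, 80, frames) to the model chunk by chunk.

    The APM KV cache is kept on the model (`model.audio_past_key_values`), so each
    call only has to encode the newly arrived frames. Streaming supports batch_size=1.
    """
    embeds = []
    for start in range(0, mel.shape[-1], chunk_frames):
        chunk = mel[:, :, start : start + chunk_frames]
        data = {
            "audio_features": chunk,                    # (1, 80, frames_in_chunk)
            "audio_feature_lens": [[chunk.shape[-1]]],  # one segment for the single batch item
        }
        embeds.append(model.get_audio_embedding_streaming(data))
    return embeds
```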
|  | |
| 464 | 
             
                        return []
         | 
| 465 |  | 
| 466 | 
             
                def get_audio_embedding(self, data, chunk_length=-1):
         | 
| 467 | 
            +
                    r"""
         | 
| 468 | 
            +
                    Extract full audio embeddings with optional chunk-based attention.
         | 
| 469 | 
            +
             | 
| 470 | 
            +
                    This method computes embeddings for all audio frames at once, either using full attention (when
         | 
| 471 | 
            +
                    `chunk_length` is -1) or chunk-based attention (when `chunk_length` is a positive number). It does
         | 
| 472 | 
            +
                    not use key-value caching and is suitable for non-streaming inference.
         | 
| 473 | 
            +
             | 
| 474 | 
             
                    Args:
         | 
| 475 | 
            +
                        data (dict):
         | 
| 476 | 
            +
                            - **"audio_features"** (`torch.FloatTensor`): Input mel-spectrograms of shape `(batch_size, 80, frames)`.
         | 
| 477 | 
            +
                            - **"audio_feature_lens"** (List[List[int]]): Lengths of each audio segment for each item in the batch.
         | 
| 478 | 
            +
                        chunk_length (int, optional): Determines whether to use full attention (-1) or chunk-based
         | 
| 479 | 
            +
                            attention (>0) during embedding computation.
         | 
| 480 | 
            +
             | 
| 481 | 
             
                    Returns:
         | 
| 482 | 
            +
                        List[List[torch.Tensor]]: audio embeddings
         | 
| 483 | 
             
                    """
         | 
| 484 | 
            +
             | 
| 485 | 
             
                    wavforms = data.get("audio_features", [])  # (bs, 80, frames) or [], multi audios need filled in advance
         | 
| 486 | 
             
                    audio_feature_lens_raw = data.get("audio_feature_lens", [])  # list, [[x1, x2], [y1], [z1]]
         | 
| 487 |  | 
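For the non-streaming path, the whole mel spectrogram is encoded in one call and `chunk_length` selects full versus chunk-based attention. A small sketch under the same assumptions as the streaming example above:

```python
import torch

def embed_audio(model, mel: torch.Tensor, chunk_length: int = -1):
    """Encode a full mel spectrogram of shape (1, 80, frames) without KV caching.

    chunk_length == -1 uses full attention; a positive value enables chunk-based
    attention (forward() passes config.audio_chunk_length here).
    """
    data = {
        "audio_features": mel,
        "audio_feature_lens": [[mel.shape[-1]]],
    }
    return model.get_audio_embedding(data, chunk_length=chunk_length)
```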
|  | |
| 546 |  | 
| 547 | 
             
                def get_omni_embedding(self, data, input_embeddings, chunk_length=-1, stream_input=False):
         | 
| 548 | 
             
                    """
         | 
|  | |
| 549 | 
             
                    Args:
         | 
| 550 | 
             
                        data:
         | 
| 551 | 
             
                        input_embeddings:
         | 
|  | |
| 601 |  | 
| 602 | 
             
                def forward(self, data, **kwargs):
         | 
| 603 | 
             
                    vllm_embedding, vision_hidden_states = self.get_vllm_embedding(data)
         | 
| 604 | 
            +
             | 
| 605 | 
            +
                    if self.config.init_audio:
         | 
| 606 | 
            +
                        vllm_embedding = self.get_omni_embedding(
         | 
| 607 | 
            +
                            data, input_embeddings=vllm_embedding, chunk_length=self.config.audio_chunk_length
         | 
| 608 | 
            +
                        )
         | 
| 609 |  | 
| 610 | 
             
                    position_ids = data["position_ids"]
         | 
| 611 | 
             
                    if position_ids.dtype != torch.int64:
         | 
| 612 | 
             
                        position_ids = position_ids.long()
         | 
| 613 |  | 
| 614 | 
            +
                    # compatible with llama factory
         | 
| 615 | 
            +
                    for key in ["input_ids", "inputs_embeds", "position_ids"]:
         | 
| 616 | 
            +
                        if key in kwargs:
         | 
| 617 | 
            +
                            del kwargs[key]
         | 
| 618 | 
            +
             | 
| 619 | 
             
                    return self.llm(input_ids=None, position_ids=position_ids, inputs_embeds=vllm_embedding, **kwargs)
         | 
| 620 |  | 
| 621 | 
             
                def _decode(self, inputs_embeds, tokenizer, attention_mask, **kwargs):
         | 
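The key-stripping loop in `forward` above exists because trainers such as LLaMA-Factory forward `input_ids`/`position_ids` through `**kwargs` even though this model passes them explicitly to `self.llm`; a tiny self-contained illustration of the pattern (dummy tensors only):

```python
import torch

kwargs = {
    "input_ids": torch.tensor([[1, 2, 3]]),     # would collide with the explicit arguments
    "position_ids": torch.tensor([[0, 1, 2]]),
    "labels": torch.tensor([[1, 2, 3]]),        # unrelated keys are passed through untouched
}
for key in ["input_ids", "inputs_embeds", "position_ids"]:
    if key in kwargs:
        del kwargs[key]  # forward() supplies these itself, so drop the duplicates

assert set(kwargs) == {"labels"}
```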
|  | |
| 659 | 
             
                        result_text.append(tokenizer.decode(result))
         | 
| 660 | 
             
                    return result_text
         | 
| 661 |  | 
| 662 | 
            +
                def get_sys_prompt(self, ref_audio=None, mode="default", language="zh"):
         | 
| 663 | 
            +
                    """
         | 
| 664 | 
            +
                    Choose different system prompts according to different tasks
         | 
| 665 | 
            +
                    Args:
         | 
| 666 | 
            +
                    ref_audio: if ref_audio is not None, the voice cloning prompts are used, and the voice
         | 
| 667 | 
            +
                               generated by the model will follow the timbre of the reference audio
         | 
| 668 | 
            +
                        mode:
         | 
| 669 | 
            +
                            "default": default system prompt and not refer to any task
         | 
| 670 | 
            +
                            "omni": input video and audio simultaneously
         | 
| 671 | 
            +
                            "audio_assistant": Default voice-only mode, the model will use the ref_audio's voice to reply user as a helpful assistant.
         | 
| 672 | 
            +
                            "audio_roleplay": Roleplay voice-only model, the model will use the ref_audio's voice to reply, and also role-play the character based on the audio prompt.
         | 
| 673 | 
            +
                            "voice_cloning": TTS mode, the model will clone the voice of ref_audio
         | 
| 674 | 
            +
                    language: prompt language; the model can automatically select the response language
         | 
| 675 | 
            +
                                based on the question language
         | 
| 676 | 
            +
                    Returns:
         | 
| 677 | 
            +
                    sys_msgs (dict): a single user message whose "content" list carries the prompt text (and the reference audio, if provided)
         | 
| 678 | 
            +
                    """
         | 
| 679 | 
            +
                    if ref_audio is not None:
         | 
| 680 | 
            +
                        assert isinstance(ref_audio, np.ndarray), "ref_audio must be a numpy.ndarray"
         | 
| 681 | 
            +
                    if mode == "omni":
         | 
| 682 | 
            +
                        if language == "zh":
         | 
| 683 | 
            +
                            sys_prompt = "你是一个AI助手。你能接受视频,音频和文本输入并输出语音和文本。"
         | 
| 684 | 
            +
                            vc_prompt_prefix = sys_prompt + "模仿输入音频中的声音特征。"
         | 
| 685 | 
            +
                            vc_prompt_suffix = "作为助手,你将使用这种声音风格说话。"
         | 
| 686 | 
            +
                        else:
         | 
| 687 | 
            +
                            sys_prompt = "You are a helpful assistant. You can accept video, audio and text input and output voice and text. "
         | 
| 688 | 
            +
                            vc_prompt_prefix = sys_prompt + "Clone the voice in the provided audio prompt."
         | 
| 689 | 
            +
                            vc_prompt_suffix = "As an assistant, you will speak using this voice style."
         | 
| 690 | 
            +
             | 
| 691 | 
            +
                        if ref_audio is not None:
         | 
| 692 | 
            +
                            sys_msgs = {"role": "user", "content": [vc_prompt_prefix, ref_audio, vc_prompt_suffix]}
         | 
| 693 | 
            +
             | 
| 694 | 
            +
                        else:
         | 
| 695 | 
            +
                            sys_msgs = {"role": "user", "content": [sys_prompt]}
         | 
| 696 | 
            +
             | 
| 697 | 
            +
                        return sys_msgs
         | 
| 698 | 
            +
                    elif mode == "audio_assistant":
         | 
| 699 | 
            +
                        if language == "zh":
         | 
| 700 | 
            +
                            vc_prompt_prefix = "模仿输入音频中的声音特征。"
         | 
| 701 | 
            +
                            vc_prompt_suffix = "作为助手,你将使用这种声音风格说话。"
         | 
| 702 | 
            +
                        else:
         | 
| 703 | 
            +
                            vc_prompt_prefix = "Clone the voice in the provided audio prompt."
         | 
| 704 | 
            +
                            vc_prompt_suffix = "As an assistant, you will speak using this voice style."
         | 
| 705 | 
            +
             | 
| 706 | 
            +
                        if ref_audio is not None:
         | 
| 707 | 
            +
                            sys_msgs = {"role": "user", "content": [vc_prompt_prefix, ref_audio, vc_prompt_suffix]}
         | 
| 708 | 
            +
             | 
| 709 | 
            +
                        else:
         | 
| 710 | 
            +
                            logger.warning(
         | 
| 711 | 
            +
                                "Warning: ref_audio is None, speech generation will be performed based on the default voice."
         | 
| 712 | 
            +
                            )
         | 
| 713 | 
            +
                            sys_msgs = {"role": "user", "content": ["Use the <reserved_53> voice.", vc_prompt_suffix]}
         | 
| 714 | 
            +
             | 
| 715 | 
            +
                        return sys_msgs
         | 
| 716 | 
            +
                    elif mode == "audio_roleplay":
         | 
| 717 | 
            +
                        if language == "zh":
         | 
| 718 | 
            +
                            vc_prompt_prefix = "模仿输入音频中的声音特征。"
         | 
| 719 | 
            +
                            vc_prompt_suffix = "假装你是上述音频中的人物,与我进行对话。"
         | 
| 720 | 
            +
                        else:
         | 
| 721 | 
            +
                            vc_prompt_prefix = "Clone the voice in the provided audio prompt."
         | 
| 722 | 
            +
                            vc_prompt_suffix = "Try to role-play the character based on the audio prompt above."
         | 
| 723 | 
            +
             | 
| 724 | 
            +
                        if ref_audio is not None:
         | 
| 725 | 
            +
                            sys_msgs = {"role": "user", "content": [vc_prompt_prefix, ref_audio, vc_prompt_suffix]}
         | 
| 726 | 
            +
                        else:
         | 
| 727 | 
            +
                            print("Warning: ref_audio is None, speech generation will be performed based on the default voice.")
         | 
| 728 | 
            +
                            sys_msgs = {"role": "user", "content": ["Use the <reserved_53> voice.", vc_prompt_suffix]}
         | 
| 729 | 
            +
             | 
| 730 | 
            +
                        return sys_msgs
         | 
| 731 | 
            +
                    elif mode == "voice_cloning":
         | 
| 732 | 
            +
                        if language == "zh":
         | 
| 733 | 
            +
                            vc_prompt_prefix = "模仿输入音频中的声音特征。"
         | 
| 734 | 
            +
                        else:
         | 
| 735 | 
            +
                            vc_prompt_prefix = "Clone the voice in the provided audio prompt."
         | 
| 736 | 
            +
             | 
| 737 | 
            +
                        if ref_audio is not None:
         | 
| 738 | 
            +
                            sys_msgs = {"role": "user", "content": [vc_prompt_prefix, ref_audio]}
         | 
| 739 | 
            +
                        else:
         | 
| 740 | 
            +
                            raise ValueError("ref_audio con't be None in voice_cloning mode.")
         | 
| 741 | 
            +
             | 
| 742 | 
            +
                        return sys_msgs
         | 
| 743 | 
            +
                    else:
         | 
| 744 | 
            +
                        sys_prompt = "You are a helpful assistant. You can accept audio and text input and output voice and text."
         | 
| 745 | 
            +
                        sys_msgs = {"role": "user", "content": [sys_prompt]}
         | 
| 746 | 
            +
             | 
| 747 | 
            +
                        return sys_msgs
         | 
| 748 | 
            +
             | 
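A short usage sketch of the modes above; `model` is assumed to be a loaded MiniCPM-o instance, and the audio file name, sampling rate, and use of librosa are illustrative assumptions:

```python
import librosa

# model = AutoModel.from_pretrained("openbmb/MiniCPM-o-2_6", trust_remote_code=True, ...)

# mono reference voice; 16 kHz is an assumption, use whatever the processor expects
ref_audio, _ = librosa.load("ref_voice.wav", sr=16000, mono=True)

# voice-only assistant that mimics the reference speaker
assistant_sys = model.get_sys_prompt(ref_audio=ref_audio, mode="audio_assistant", language="en")

# pure voice cloning (ref_audio is mandatory in this mode)
cloning_sys = model.get_sys_prompt(ref_audio=ref_audio, mode="voice_cloning", language="en")

# omni (video + audio) session without a reference voice falls back to the plain system prompt
omni_sys = model.get_sys_prompt(mode="omni", language="en")

msgs = [assistant_sys, {"role": "user", "content": ["Introduce yourself briefly."]}]
```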
| 749 | 
             
                def generate(
         | 
| 750 | 
             
                    self,
         | 
| 751 | 
             
                    input_ids=None,
         | 
|  | |
| 816 | 
             
                    omni_input=False,
         | 
| 817 | 
             
                    max_slice_nums=None,
         | 
| 818 | 
             
                    use_image_id=None,
         | 
| 819 | 
            +
                    use_tts_template=False,
         | 
| 820 | 
             
                    generate_audio=False,
         | 
| 821 | 
             
                    return_spk_embed=False,
         | 
| 822 | 
             
                    return_dict=False,
         | 
|  | |
| 840 | 
             
                        omni_input: determine whether it is omni mode
         | 
| 841 | 
             
                        max_slice_nums: control the maximum number of image slices
         | 
| 842 | 
             
                        use_image_id: for video understanding or omni understanding, use_image_id should be False
         | 
| 843 | 
            +
                        use_tts_template: if the msgs contain audio, use_tts_template should be True
         | 
| 844 | 
             
                        generate_audio: whether to generate audio output, only used when return_dict=True
         | 
| 845 | 
             
                        return_spk_embed: whether to return spk embedding, only used when return_dict=True
         | 
| 846 | 
             
                        return_dict: whether to return dict
         | 
|  | |
| 917 | 
             
                            for c in content:
         | 
| 918 | 
             
                                if isinstance(c, Image.Image):
         | 
| 919 | 
             
                                    images.append(c)
         | 
| 920 | 
            +
                                    cur_msgs.append("(<image>./</image>)")
         | 
| 921 | 
             
                                elif isinstance(c, np.ndarray):  # audio
         | 
| 922 | 
             
                                    audios.append(c)
         | 
| 923 | 
             
                                    audio_parts.append(i)
         | 
| 924 | 
            +
                                    cur_msgs.append("(<audio>./</audio>)")
         | 
| 925 | 
            +
                                    use_tts_template = True
         | 
| 926 | 
             
                                elif isinstance(c, str):
         | 
| 927 | 
             
                                    cur_msgs.append(c)
         | 
| 928 | 
             
                            if omni_input:
         | 
|  | |
| 935 | 
             
                                copy_msgs,
         | 
| 936 | 
             
                                tokenize=False,
         | 
| 937 | 
             
                                add_generation_prompt=True,
         | 
| 938 | 
            +
                                chat_template=self.default_tts_chat_template if use_tts_template else None,
         | 
| 939 | 
             
                            )
         | 
| 940 | 
             
                        )
         | 
| 941 | 
             
                        input_images_list.append(images)
         | 
|  | |
| 1005 | 
             
                        else:
         | 
| 1006 | 
             
                            answer = res[0]
         | 
| 1007 |  | 
| 1008 | 
            +
                            if use_tts_template and generate_audio:
         | 
| 1009 | 
             
                                mel_spec = self._generate_mel_spec(inputs, outputs, answer)
         | 
| 1010 | 
             
                                wav_numpy, sr = self.decode_mel_to_audio(mel_spec, output_audio_path)
         | 
| 1011 |  | 
| 1012 | 
             
                        if return_spk_embed:
         | 
| 1013 | 
             
                            spk_embeds = self._get_last_spk_embeds(inputs, outputs)
         | 
| 1014 |  | 
| 1015 | 
            +
                        if isinstance(answer, list):
         | 
| 1016 | 
            +
                            answer = [i.replace(tokenizer.tts_end, "") for i in answer]
         | 
| 1017 | 
            +
                        else:
         | 
| 1018 | 
            +
                            answer = answer.replace(tokenizer.tts_end, "")
         | 
| 1019 | 
            +
             | 
| 1020 | 
             
                        if return_dict:
         | 
| 1021 | 
             
                            return OmniOutput(text=answer, spk_embeds=spk_embeds, audio_wav=wav_numpy, sampling_rate=sr)
         | 
| 1022 | 
             
                        else:
         | 
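With `return_dict=True` the caller receives an `OmniOutput` carrying the text reply and, when `generate_audio=True`, the synthesized waveform. A minimal sketch of consuming it; `soundfile` is an assumed writer, any wav writer works:

```python
import soundfile as sf  # assumed dependency for writing the wav

def save_omni_output(res, wav_path="output.wav"):
    """`res` is the OmniOutput returned by the branch above (return_dict=True)."""
    print("text reply:", res.text)
    if res.audio_wav is not None:
        sf.write(wav_path, res.audio_wav, samplerate=res.sampling_rate)
```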
|  | |
| 1028 | 
             
                    session_id,
         | 
| 1029 | 
             
                    msgs,
         | 
| 1030 | 
             
                    tokenizer,
         | 
| 1031 | 
            +
                    omni_input=True,
         | 
| 1032 | 
             
                    max_slice_nums=None,
         | 
| 1033 | 
             
                    ls_temperature=1.0,
         | 
| 1034 | 
             
                    **kwargs,
         | 
|  | |
| 1058 | 
             
                    for j, c in enumerate(content):
         | 
| 1059 | 
             
                        if isinstance(c, Image.Image):
         | 
| 1060 | 
             
                            images.append(c)
         | 
| 1061 | 
            +
                            cur_msgs.append("(<image>./</image>)")
         | 
| 1062 | 
             
                        elif isinstance(c, np.ndarray):  # audio
         | 
| 1063 | 
             
                            audios.append(c)
         | 
| 1064 | 
            +
                            cur_msgs.append("(<audio>./</audio>)")
         | 
| 1065 | 
             
                        elif isinstance(c, str):
         | 
| 1066 | 
             
                            cur_msgs.append(c)
         | 
| 1067 | 
             
                        else:
         | 
| 1068 | 
             
                            logger.error("Invalid content type:", c)
         | 
| 1069 |  | 
| 1070 | 
            +
                    cur_contents = "".join(cur_msgs) if omni_input else "\n".join(omni_input)
         | 
| 1071 | 
             
                    if not self.is_first and self.new_user_msg and msg["role"] == "user":  # new user add im_start
         | 
| 1072 | 
             
                        if self.llm_generated:
         | 
| 1073 | 
             
                            if self.llm_generate_completed:
         | 
| 1074 | 
            +
                                msg["content"] = "<|im_end|>\n<|im_start|>user\n" + cur_contents
         | 
| 1075 | 
             
                            else:  # break llm gen, add tts_eos
         | 
| 1076 | 
            +
                                msg["content"] = "<|tts_eos|><|im_end|>\n<|im_start|>user\n" + cur_contents
         | 
| 1077 | 
             
                        else:
         | 
| 1078 | 
            +
                            msg["content"] = "<|im_start|>user\n" + cur_contents
         | 
| 1079 | 
             
                        self.new_user_msg = False
         | 
| 1080 | 
             
                    else:
         | 
| 1081 | 
            +
                        msg["content"] = cur_contents
         | 
| 1082 |  | 
| 1083 | 
             
                    if msg["role"] in ["system", "assistant"]:
         | 
| 1084 | 
             
                        self.new_user_msg = True
         | 
|  | |
| 1086 |  | 
| 1087 | 
             
                    if self.is_first:
         | 
| 1088 | 
             
                        # init past_key_values
         | 
| 1089 | 
            +
                        logger.info(f"new session_id: {session_id}, reset kv cache")
         | 
| 1090 | 
            +
                        self.reset_session()
         | 
| 1091 | 
             
                        self.session_id = session_id
         | 
|  | |
|  | |
|  | |
| 1092 |  | 
| 1093 | 
             
                        prompt = tokenizer.apply_chat_template(
         | 
| 1094 | 
             
                            copy_msgs, tokenize=False, add_generation_prompt=False, chat_template=self.default_tts_chat_template
         | 
|  | |
| 1139 | 
             
                        return_dict=True,
         | 
| 1140 | 
             
                    )
         | 
| 1141 | 
             
                    self.llm_past_key_values = outputs["past_key_values"]
         | 
| 1142 | 
            +
                    return
         | 
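The prefill path above only extends the per-session LLM KV cache (`self.llm_past_key_values`) and returns nothing; the caller drives it message by message. A hedged sketch, assuming the method is exposed as `streaming_prefill` with the parameters shown in this hunk (`session_id`, `msgs`, `tokenizer`); the message contents are placeholders:

```python
import uuid
import numpy as np

session_id = str(uuid.uuid4())

# prefill the system prompt first, then each incoming user chunk
sys_msg = model.get_sys_prompt(mode="omni", language="en")
model.streaming_prefill(session_id=session_id, msgs=[sys_msg], tokenizer=tokenizer)

# placeholder: one second of silence at 16 kHz (real code would stream microphone chunks)
audio_chunk = np.zeros(16000, dtype=np.float32)
user_msg = {"role": "user", "content": ["Please answer in speech.", audio_chunk]}
model.streaming_prefill(session_id=session_id, msgs=[user_msg], tokenizer=tokenizer)
```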
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
| 1143 |  | 
| 1144 | 
             
                @torch.inference_mode()
         | 
| 1145 | 
             
                def streaming_generate(
         | 
|  | |
| 1149 | 
             
                    max_new_tokens=512,
         | 
| 1150 | 
             
                    min_new_tokens=0,
         | 
| 1151 | 
             
                    sampling=True,
         | 
| 1152 | 
            +
                    generate_audio=True,
         | 
| 1153 | 
             
                    enable_regenerate=False,
         | 
| 1154 | 
             
                    **kwargs,
         | 
| 1155 | 
             
                ):
         | 
|  | |
| 1196 | 
             
                    generation_config["max_new_tokens"] = max_new_tokens
         | 
| 1197 | 
             
                    streamer = self.llm_generate_chunk(input_ids, attention_mask, tokenizer, terminators, generation_config)
         | 
| 1198 |  | 
| 1199 | 
            +
                    if generate_audio:
         | 
| 1200 | 
             
                        result = self._generate_mel_spec_audio_streaming(
         | 
| 1201 | 
             
                            spk_bounds, streamer, output_chunk_size=25, enable_regenerate=enable_regenerate
         | 
| 1202 | 
             
                        )
         | 
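After prefilling, `streaming_generate` yields results chunk by chunk; with `generate_audio=True` each item is an `OmniOutput` with a text delta plus a wav fragment (see the yields further below). A sketch of draining the stream; passing `session_id` and `tokenizer` here is an assumption carried over from the prefill sketch:

```python
import numpy as np

text_parts, wav_parts, sr = [], [], None
for chunk in model.streaming_generate(
    session_id=session_id,   # assumed keyword, matching the prefill sketch above
    tokenizer=tokenizer,
    generate_audio=True,
):
    text_parts.append(chunk.text)
    if chunk.audio_wav is not None:
        wav_parts.append(chunk.audio_wav)
        sr = chunk.sampling_rate

full_text = "".join(text_parts)
full_wav = np.concatenate(wav_parts) if wav_parts else None
```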
|  | |
| 1440 | 
             
                    return mel_spec
         | 
| 1441 |  | 
| 1442 | 
             
                def _linear_overlap_add2_wav(self, frames: List[torch.Tensor], overlap: int):
         | 
| 1443 | 
            +
                    """
         | 
| 1444 | 
            +
                    Smoothly merge two audio waveforms in streaming audio generation.
         | 
| 1445 | 
            +
                    Borrows code from `https://github.com/huggingface/transformers/blob/main/src/transformers/models/encodec/modeling_encodec.py`
         | 
| 1446 | 
            +
                    """
         | 
| 1447 | 
             
                    assert len(frames) == 2
         | 
| 1448 | 
             
                    device = frames[0].device
         | 
| 1449 | 
             
                    dtype = frames[0].dtype
         | 
|  | |
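`_linear_overlap_add2_wav` crossfades exactly two waveform chunks over `overlap` samples so consecutive streaming chunks join without clicks. Conceptually it is the linear overlap-add below; this is a standalone NumPy re-derivation for illustration, not the model's exact weighting:

```python
import numpy as np

def linear_overlap_add2(frame_a: np.ndarray, frame_b: np.ndarray, overlap: int) -> np.ndarray:
    """Concatenate two 1-D waveforms, linearly crossfading the last `overlap`
    samples of `frame_a` with the first `overlap` samples of `frame_b`."""
    assert overlap > 0 and len(frame_a) >= overlap and len(frame_b) >= overlap
    fade = np.linspace(0.0, 1.0, overlap, endpoint=False)
    head = frame_a[:-overlap]
    mixed = frame_a[-overlap:] * (1.0 - fade) + frame_b[:overlap] * fade
    tail = frame_b[overlap:]
    return np.concatenate([head, mixed, tail])

# e.g. merging two 0.5 s chunks with a 512-sample overlap (tts_hop256*2, per the comments below)
a, b = np.random.randn(12000), np.random.randn(12000)
merged = linear_overlap_add2(a, b, overlap=512)
assert merged.shape[0] == 12000 + 12000 - 512
```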
| 1690 | 
             
                                    prev_wav = wav_np[len(prev_wav) :]
         | 
| 1691 | 
             
                                    cur_text = gen_text_raw[prev_text_len:]
         | 
| 1692 | 
             
                                    prev_text_len = len(gen_text_raw)
         | 
| 1693 | 
            +
                                    yield OmniOutput(text=cur_text, audio_wav=wav_y, sampling_rate=sr)
         | 
| 1694 | 
            +
             | 
| 1695 | 
             
                                else:
         | 
| 1696 | 
             
                                    prev_wav = wav_np
         | 
| 1697 | 
             
                            else:
         | 
|  | |
| 1702 | 
             
                                    )  # tts_hop256*2
         | 
| 1703 | 
             
                                    cur_text = gen_text_raw[prev_text_len:]
         | 
| 1704 | 
             
                                    prev_text_len = len(gen_text_raw)
         | 
| 1705 | 
            +
                                    yield OmniOutput(text=cur_text, audio_wav=wav_np, sampling_rate=sr)
         | 
| 1706 | 
            +
             | 
| 1707 | 
             
                                else:
         | 
| 1708 | 
             
                                    prev_wav = wav_np
         | 
| 1709 |  | 
|  | |
| 1801 | 
             
                                    prev_wav = wav_np[len(prev_wav) :]
         | 
| 1802 | 
             
                                    cur_text = gen_text_raw[prev_text_len:]
         | 
| 1803 | 
             
                                    prev_text_len = len(gen_text_raw)
         | 
| 1804 | 
            +
                                    yield OmniOutput(text=cur_text, audio_wav=wav_y, sampling_rate=sr)
         | 
| 1805 | 
             
                                else:
         | 
| 1806 | 
             
                                    prev_wav = wav_np
         | 
| 1807 | 
             
                            else:
         | 
|  | |
| 1812 | 
             
                                    )  # tts_hop256*2
         | 
| 1813 | 
             
                                    cur_text = gen_text_raw[prev_text_len:]
         | 
| 1814 | 
             
                                    prev_text_len = len(gen_text_raw)
         | 
| 1815 | 
            +
                                    yield OmniOutput(text=cur_text, audio_wav=wav_np, sampling_rate=sr)
         | 
| 1816 | 
             
                                else:
         | 
| 1817 | 
             
                                    prev_wav = wav_np
         | 
| 1818 |  | 
|  | |
| 1826 |  | 
| 1827 | 
             
                    if prev_wav is not None:
         | 
| 1828 | 
             
                        cur_text = gen_text_raw[prev_text_len:]
         | 
| 1829 | 
            +
                        yield OmniOutput(text=cur_text, audio_wav=prev_wav, sampling_rate=sr)  # yield the last chunk of wav without smoothing
         | 
| 1830 |  | 
| 1831 | 
             
                    if new_segment_gen and not stop:
         | 
| 1832 | 
             
                        logger.debug(
         | 
|  | |
| 1860 | 
             
                    return wav_numpy, sr
         | 
| 1861 |  | 
| 1862 |  | 
| 1863 | 
            +
            # Copied from transformers.models.whisper.modeling_whisper.WhisperEncoderLayer, with use_cache added for streaming inference
         | 
| 1864 | 
             
            class MiniCPMWhisperEncoderLayer(nn.Module):
         | 
| 1865 | 
             
                def __init__(self, config: WhisperConfig, layer_idx: int = None):
         | 
| 1866 | 
             
                    super().__init__()
         | 
|  | |
| 1889 | 
             
                    past_key_values: Optional[EncoderDecoderCache] = None,
         | 
| 1890 | 
             
                    use_cache: Optional[bool] = False,
         | 
| 1891 | 
             
                ) -> torch.Tensor:
         | 
| 1892 | 
            +
                    r"""
         | 
| 1893 | 
            +
                    Args:
         | 
| 1894 | 
            +
                        hidden_states (`torch.FloatTensor` of shape `(batch_size, seq_len, embed_dim)`):
         | 
| 1895 | 
            +
                            Hidden states to be fed into the encoder layer.
         | 
| 1896 | 
            +
                        attention_mask (`torch.FloatTensor` of shape `(batch_size, 1, tgt_len, src_len)`):
         | 
| 1897 | 
            +
                            Attention mask where padding elements are indicated by large negative values.
         | 
| 1898 | 
            +
                        layer_head_mask (`torch.FloatTensor` of shape `(encoder_attention_heads,)`):
         | 
| 1899 | 
            +
                            Mask to nullify selected heads of the attention modules.
         | 
| 1900 | 
            +
                        output_attentions (`bool`, *optional*):
         | 
| 1901 | 
            +
                            Whether or not to return the attention weights.
         | 
| 1902 | 
            +
                        past_key_values (`EncoderDecoderCache`, *optional*):
         | 
| 1903 | 
            +
                            Past key-value pairs used for incremental decoding.
         | 
| 1904 | 
            +
                        use_cache (`bool`, *optional*):
         | 
| 1905 | 
            +
                            Whether or not to return updated `past_key_values` for caching.
         | 
| 1906 | 
            +
             | 
| 1907 | 
            +
                    Returns:
         | 
| 1908 | 
            +
                        A tuple of the form `(hidden_states, optional(attn_weights), optional(past_key_values))`.
         | 
| 1909 | 
            +
                    """
         | 
| 1910 | 
             
                    residual = hidden_states
         | 
| 1911 | 
             
                    hidden_states = self.self_attn_layer_norm(hidden_states)
         | 
| 1912 | 
             
                    hidden_states, attn_weights, past_key_values = self.self_attn(
         | 
|  | |
| 1944 | 
             
                    return outputs
         | 
| 1945 |  | 
| 1946 |  | 
| 1947 | 
            +
            # Copied from transformers.models.whisper.modeling_whisper.WhisperEncoder, with use_cache added for streaming inference
         | 
| 1948 | 
             
            class MiniCPMWhisperEncoder(WhisperEncoder):
         | 
| 1949 |  | 
| 1950 | 
             
                def __init__(self, config: WhisperConfig):
         | 
|  | |
| 1964 | 
             
                    past_key_values: Optional[EncoderDecoderCache] = None,
         | 
| 1965 | 
             
                    use_cache: Optional[bool] = None,
         | 
| 1966 | 
             
                ):
         | 
| 1967 | 
            +
                    r"""
         | 
| 1968 | 
            +
                    Forward pass of the Whisper encoder.
         | 
| 1969 | 
            +
             | 
| 1970 | 
            +
                    Args:
         | 
| 1971 | 
            +
                        input_features (`torch.FloatTensor` of shape `(batch_size, feature_size, sequence_length)`):
         | 
| 1972 | 
            +
                            Float values of log-mel features extracted from the raw audio waveform. Typically generated
         | 
| 1973 | 
            +
                            by a feature extractor (e.g., `WhisperFeatureExtractor`) that processes `.flac` or `.wav`
         | 
| 1974 | 
            +
                            files into padded 2D mel spectrogram frames. These features are projected via convolution layers
         | 
| 1975 | 
            +
                            (`conv1` and `conv2`) and then transformed into embeddings for the encoder.
         | 
| 1976 | 
            +
             | 
| 1977 | 
            +
                        attention_mask (`torch.Tensor`, *optional*):
         | 
| 1978 | 
            +
                            Not used by Whisper for masking `input_features`, but included for API compatibility with
         | 
| 1979 | 
            +
                            other models. If provided, it is simply ignored within the model. By default, Whisper
         | 
| 1980 | 
            +
                            effectively ignores silence in the input log-mel spectrogram.
         | 
| 1981 | 
            +
             | 
| 1982 | 
            +
                        head_mask (`torch.Tensor` of shape `(encoder_layers, encoder_attention_heads)`, *optional*):
         | 
| 1983 | 
            +
                            Mask to nullify selected attention heads. The elements should be either 1 or 0, where:
         | 
| 1984 | 
            +
                            - 1 indicates the head is **not masked**,
         | 
| 1985 | 
            +
                            - 0 indicates the head is **masked** (i.e., the attention head is dropped).
         | 
| 1986 | 
            +
             | 
| 1987 | 
            +
                        output_attentions (`bool`, *optional*):
         | 
| 1988 | 
            +
                            Whether or not to return the attention tensors of all encoder layers. If set to `True`, the
         | 
| 1989 | 
            +
                            returned tuple (or `BaseModelOutputWithPast`) will contain an additional element with
         | 
| 1990 | 
            +
                            attention weights for each encoder layer.
         | 
| 1991 | 
            +
             | 
| 1992 | 
            +
                        output_hidden_states (`bool`, *optional*):
         | 
| 1993 | 
            +
                            Whether or not to return the hidden states of all layers. If set to `True`, the returned
         | 
| 1994 | 
            +
                            tuple (or `BaseModelOutputWithPast`) will contain a tuple of hidden states, including the
         | 
| 1995 | 
            +
                            initial embedding output as well as the outputs of each layer.
         | 
| 1996 | 
            +
             | 
| 1997 | 
            +
                        return_dict (`bool`, *optional*):
         | 
| 1998 | 
            +
                            Whether or not to return a `BaseModelOutputWithPast` (a subclass of `ModelOutput`) instead
         | 
| 1999 | 
            +
                            of a plain tuple. If set to `True`, the output will be a `BaseModelOutputWithPast` object,
         | 
| 2000 | 
            +
                            otherwise it will be a tuple.
         | 
| 2001 | 
            +
             | 
| 2002 | 
            +
                        past_key_values (`EncoderDecoderCache`, *optional*):
         | 
| 2003 | 
            +
                            When using caching for faster inference, this is an object that stores the key-value pairs
         | 
| 2004 | 
            +
                            for attention states. If provided, the model will append new states to the existing cache
         | 
| 2005 | 
            +
                            and return the updated cache. This speeds up sequential decoding or chunked inference.
         | 
| 2006 | 
            +
             | 
| 2007 | 
            +
                            - If `past_key_values` is `None`, no past states are used or returned.
         | 
| 2008 | 
            +
                            - If `past_key_values` is not `None` and `use_cache=True`, the model will use the provided
         | 
| 2009 | 
            +
                            cache and return the updated cache (as `next_encoder_cache`).
         | 
| 2010 | 
            +
             | 
| 2011 | 
            +
                        use_cache (`bool`, *optional*):
         | 
| 2012 | 
            +
                            Whether or not the model should use caching (`past_key_values`) to speed up processing
         | 
| 2013 | 
            +
                            during inference. When set to `True`, the model will:
         | 
| 2014 | 
            +
                            - Inspect and use `past_key_values` if provided.
         | 
| 2015 | 
            +
                            - Return updated `past_key_values` (under the name `next_encoder_cache` in
         | 
| 2016 | 
            +
                                `BaseModelOutputWithPast`).
         | 
| 2017 | 
            +
             | 
| 2018 | 
            +
                    Returns:
         | 
| 2019 | 
            +
                        `BaseModelOutputWithPast` or `tuple` (depending on `return_dict`):
         | 
| 2020 | 
            +
                            If `return_dict=True`, a `BaseModelOutputWithPast` is returned, which contains:
         | 
| 2021 | 
            +
                            - **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
         | 
| 2022 | 
            +
                            The output of the final encoder layer.
         | 
| 2023 | 
            +
                            - **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned if `output_hidden_states=True`):
         | 
| 2024 | 
            +
                            Hidden states of the model at each layer (including the initial projection).
         | 
| 2025 | 
            +
                            - **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned if `output_attentions=True`):
         | 
| 2026 | 
            +
                            Attention weights from each encoder layer.
         | 
| 2027 | 
            +
                            - **past_key_values** (an object of type `EncoderDecoderCache` or `None`, *optional*):
         | 
| 2028 | 
            +
                            Updated cache of key-value pairs if `use_cache=True`.
         | 
| 2029 | 
            +
             | 
| 2030 | 
            +
                            If `return_dict=False`, a tuple is returned, where the format is:
         | 
| 2031 | 
            +
                            `(last_hidden_state, hidden_states, attentions)`, with `hidden_states` and `attentions`
         | 
| 2032 | 
            +
                            only present if their respective `output_*` arguments are set to `True`.
         | 
| 2033 | 
            +
             | 
| 2034 | 
            +
                    Example:
         | 
| 2035 | 
            +
                        >>> from transformers import AutoFeatureExtractor, WhisperConfig, WhisperForConditionalGeneration
         | 
| 2036 | 
            +
                        >>> import torch
         | 
| 2037 | 
            +
             | 
| 2038 | 
            +
                        >>> # Load a feature extractor and a Whisper model
         | 
| 2039 | 
            +
                        >>> feature_extractor = AutoFeatureExtractor.from_pretrained("openai/whisper-tiny.en")
         | 
| 2040 | 
            +
                        >>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
         | 
| 2041 | 
            +
             | 
| 2042 | 
            +
                        >>> # Assume you have audio (list of floats or numpy array) loaded from a file
         | 
| 2043 | 
            +
                        >>> # Then extract the mel features:
         | 
| 2044 | 
            +
                        >>> input_features = feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features
         | 
| 2045 | 
            +
             | 
| 2046 | 
            +
                        >>> # Forward pass
         | 
| 2047 | 
            +
                        >>> outputs = model.encoder(
         | 
| 2048 | 
            +
                        ...     input_features=input_features,
         | 
| 2049 | 
            +
                        ...     output_hidden_states=True,
         | 
| 2050 | 
            +
                        ...     output_attentions=True,
         | 
| 2051 | 
            +
                        ...     use_cache=True
         | 
| 2052 | 
            +
                        ... )
         | 
| 2053 | 
            +
             | 
| 2054 | 
            +
                        >>> # Retrieve the last hidden state
         | 
| 2055 | 
            +
                        >>> last_hidden_state = outputs.last_hidden_state
         | 
| 2056 | 
            +
                        >>> print(last_hidden_state.shape)
         | 
| 2057 | 
            +
                        torch.Size([batch_size, seq_length, hidden_size])
         | 
| 2058 | 
            +
             | 
| 2059 | 
            +
                        >>> # Retrieve the intermediate hidden states if output_hidden_states=True
         | 
| 2060 | 
            +
                        >>> all_encoder_hidden_states = outputs.hidden_states
         | 
| 2061 | 
            +
             | 
| 2062 | 
            +
                        >>> # Retrieve attention weights if output_attentions=True
         | 
| 2063 | 
            +
                        >>> all_encoder_attentions = outputs.attentions
         | 
| 2064 | 
            +
             | 
| 2065 | 
            +
                        >>> # Retrieve updated past key values if use_cache=True
         | 
| 2066 | 
            +
                        >>> encoder_cache = outputs.past_key_values
         | 
| 2067 | 
            +
                    """
         | 
| 2068 | 
             
                    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
         | 
| 2069 | 
             
                    output_hidden_states = (
         | 
| 2070 | 
             
                        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
         | 
|  | |
| 2179 | 
             
                    )
         | 
| 2180 |  | 
| 2181 |  | 
| 2182 | 
            +
            # Borrowed from `https://github.com/2noise/ChatTTS/blob/main/ChatTTS/model/dvae.py`
         | 
| 2183 | 
             
            class ConvNeXtBlock(nn.Module):
         | 
| 2184 | 
             
                def __init__(
         | 
| 2185 | 
             
                    self,
         | 
|  | |
| 2233 | 
             
                    return x
         | 
| 2234 |  | 
| 2235 |  | 
| 2236 | 
            +
            # Borrowed from `https://github.com/2noise/ChatTTS/blob/main/ChatTTS/model/dvae.py`
         | 
| 2237 | 
             
            class GFSQ(nn.Module):
         | 
| 2238 | 
             
                def __init__(
         | 
| 2239 | 
             
                    self,
         | 
|  | |
| 2276 | 
             
                    return ind.transpose_(1, 2) if self.transpose else ind
         | 
| 2277 |  | 
| 2278 |  | 
| 2279 | 
            +
            # Borrowed from `https://github.com/2noise/ChatTTS/blob/main/ChatTTS/model/dvae.py`
         | 
| 2280 | 
             
            class DVAEDecoder(nn.Module):
         | 
| 2281 | 
             
                def __init__(
         | 
| 2282 | 
             
                    self,
         | 
|  | |
| 2321 | 
             
                    return x
         | 
| 2322 |  | 
| 2323 |  | 
| 2324 | 
            +
            # Borrowed from `https://github.com/2noise/ChatTTS/blob/main/ChatTTS/model/dvae.py`
         | 
| 2325 | 
             
            class DVAE(nn.Module):
         | 
| 2326 | 
             
                def __init__(
         | 
| 2327 | 
             
                    self,
         | 
|  | |
| 2400 | 
             
                    return torch.mul(dec_out, self.coef, out=dec_out)
         | 
| 2401 |  | 
| 2402 |  | 
|  | |
| 2403 | 
             
            def apply_spk_emb(
         | 
| 2404 | 
             
                input_ids: torch.Tensor = None,
         | 
| 2405 | 
             
                spk_emb: torch.Tensor = None,
         | 
|  | |
| 2408 | 
             
                num_spk_embs: int = 1,
         | 
| 2409 | 
             
            ):
         | 
| 2410 | 
             
                """
         | 
| 2411 | 
            +
                Replace consecutive `num_spk_embs` speaker embedding placeholders in input_embeds with pre-prepared speaker embeddings. This is an in-place replacement: no new tensor is created and no value is returned.
         | 
| 2412 |  | 
| 2413 | 
             
                Args:
         | 
| 2414 | 
             
                    input_ids (torch.Tensor): Input ID tensor, shape [batch_size, seq_len_max]
         | 
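To make the in-place contract of `apply_spk_emb` concrete, here is a toy version of the same idea; the placeholder token id, shapes, and the helper name are made up for illustration:

```python
import torch

def toy_apply_spk_emb(input_ids, inputs_embeds, spk_emb, spk_emb_token_id, num_spk_embs=1):
    """Overwrite the embedding rows at the speaker-placeholder positions.

    Mirrors apply_spk_emb's contract: inputs_embeds is modified in place, nothing is returned.
    """
    for b in range(input_ids.shape[0]):
        positions = (input_ids[b] == spk_emb_token_id).nonzero(as_tuple=True)[0]
        assert positions.numel() == num_spk_embs, "expected exactly num_spk_embs placeholders"
        inputs_embeds[b, positions] = spk_emb[b].to(inputs_embeds.dtype)

# toy usage: batch of 1, seq len 5, hidden 8, one placeholder (id 99) at position 2
ids = torch.tensor([[3, 7, 99, 5, 2]])
embeds = torch.zeros(1, 5, 8)
spk = torch.ones(1, 1, 8)  # [batch, num_spk_embs, hidden]
toy_apply_spk_emb(ids, embeds, spk, spk_emb_token_id=99)
assert bool(torch.all(embeds[0, 2] == 1))
```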
|  | |
| 2447 | 
             
                use_spk_emb: bool = True,
         | 
| 2448 | 
             
            ) -> torch.Tensor:
         | 
| 2449 | 
             
                """
         | 
| 2450 | 
            +
                In streaming audio generation, determine which `text` positions the TTS model can attend to when generating each chunk of `audio` tokens.
         | 
| 2451 |  | 
| 2452 | 
             
                This function creates a mask that allows the model to attend to a specific chunk of text
         | 
| 2453 | 
             
                tokens when generating each chunk of audio tokens, enabling streaming TTS generation.
         | 
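The chunked text-audio visibility described above can be pictured with a small boolean mask: the k-th chunk of audio tokens may read only the first k+1 chunks of text. The sketch below is a simplified illustration with made-up chunk sizes, not the function's actual 4-D causal mask:

```python
import torch

def toy_streaming_text_mask(num_text_chunks: int, text_chunk_len: int, audio_chunk_len: int) -> torch.Tensor:
    """Boolean mask of shape [audio_len, text_len]; audio chunk k may attend to text chunks 0..k."""
    text_len = num_text_chunks * text_chunk_len
    audio_len = num_text_chunks * audio_chunk_len
    mask = torch.zeros(audio_len, text_len, dtype=torch.bool)
    for k in range(num_text_chunks):
        audio_rows = slice(k * audio_chunk_len, (k + 1) * audio_chunk_len)
        visible_text = slice(0, (k + 1) * text_chunk_len)
        mask[audio_rows, visible_text] = True
    return mask

# e.g. 3 text chunks of 10 tokens, each unlocking 25 audio tokens
m = toy_streaming_text_mask(3, 10, 25)
assert m[0, :10].all() and not m[0, 10:].any()  # first audio chunk sees only text chunk 0
assert m[-1].all()                              # last audio chunk sees all prefilled text
```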
|  | |
| 2504 | 
             
                return causal_mask
         | 
| 2505 |  | 
| 2506 |  | 
| 2507 | 
            +
            # Borrowed from `https://github.com/2noise/ChatTTS/blob/main/ChatTTS/model/processors.py`
         | 
| 2508 | 
             
            class CustomRepetitionPenaltyLogitsProcessorRepeat:
         | 
| 2509 | 
             
                def __init__(self, penalty: float, max_input_ids: int, past_window: int):
         | 
| 2510 | 
             
                    if not isinstance(penalty, float) or not (penalty > 0):
         | 
|  | |
| 2563 |  | 
| 2564 |  | 
| 2565 | 
             
            class ConditionalChatTTS(PreTrainedModel):
         | 
| 2566 | 
            +
                """A conditional text-to-speech model that can generate speech from text with speaker conditioning.
         | 
| 2567 | 
            +
             | 
| 2568 | 
            +
                This model extends PreTrainedModel to provide text-to-speech capabilities with:
         | 
| 2569 | 
            +
                - LLM hidden state conditioning
         | 
| 2570 | 
            +
                - Streaming generation
         | 
| 2571 | 
            +
             | 
| 2572 | 
            +
                The model uses a transformer architecture with LLM hidden states and can operate in both
         | 
| 2573 | 
            +
                streaming and non-streaming modes for flexible deployment.
         | 
| 2574 | 
            +
             | 
| 2575 | 
            +
                The model processes sequences in the following format:
         | 
| 2576 | 
            +
                | text bos token | LLM embedding projected to tts embedding space | text tokens (fixed length, reserved for future tokens) | audio bos token | audio tokens (audio token length is not fixed) | audio eos token |
         | 
| 2577 | 
            +
             | 
| 2578 | 
            +
                The format is designed to support LLM-conditioned streaming audio generation.
         | 
| 2579 | 
            +
             | 
| 2580 | 
            +
                Usage:
         | 
| 2581 | 
            +
                To support streaming generation, two global variables should be maintained outside of the model.
         | 
| 2582 | 
            +
                    1. `audio_input_ids`: stores *discrete* audio codes. It is a tensor with shape [1, sequence length+1, num_vq].
         | 
| 2583 | 
            +
                    2. `past_key_values`: stores the KV cache for both text tokens and audio codes. It is a list of tuples, each tuple contains two tensors with shape [1, num_attention_heads, sequence length, hidden_size // num_attention_heads]
         | 
| 2584 | 
            +
             | 
| 2585 | 
            +
                where `num_vq` is the number of audio codebooks; in the default setting it is `4`.
         | 
| 2586 | 
            +
             | 
| 2587 | 
            +
                1. Create an empty `past_key_values` with
         | 
| 2588 | 
            +
                ```python
         | 
| 2589 | 
            +
                initial_kv_cache_length = 1 + model.num_spk_embs + model.streaming_text_reserved_len # where `1` denotes the `bos` token
         | 
| 2590 | 
            +
                dtype = model.emb_text.weight.dtype
         | 
| 2591 | 
            +
                device = model.emb_text.weight.device
         | 
| 2592 | 
            +
                past_key_values = [
         | 
| 2593 | 
            +
                    (
         | 
| 2594 | 
            +
                        torch.zeros(1, model.config.num_attention_heads, initial_kv_cache_length, model.config.hidden_size // model.config.num_attention_heads, dtype=dtype, device=device),
         | 
| 2595 | 
            +
                        torch.zeros(1, model.config.num_attention_heads, initial_kv_cache_length, model.config.hidden_size // model.config.num_attention_heads, dtype=dtype, device=device)
         | 
| 2596 | 
            +
                    )
         | 
| 2597 | 
            +
                    for _ in range(model.config.num_hidden_layers)
         | 
| 2598 | 
            +
                ]
         | 
| 2599 | 
            +
                ```
         | 
| 2600 | 
            +
                2. At the same time, create an empty `audio_input_ids` with shape [1, sequence length, num_vq], where `num_vq` is the number of audio codebook layers. Text positions are also included in this sequence, but they stay zero and are never used; they are just placeholders.
         | 
| 2601 | 
            +
             | 
| 2602 | 
            +
                ```python
         | 
| 2603 | 
            +
                initial_audio_input_ids_length = 1 + model.num_spk_embs + model.streaming_text_reserved_len + 1
         | 
| 2604 | 
            +
                # [bos token, speaker embeddings, text tokens, audio bos token]
         | 
| 2605 | 
            +
                audio_input_ids = torch.zeros(1, initial_audio_input_ids_length, model.num_vq)
         | 
| 2606 | 
            +
                ```
         | 
| 2607 | 
            +
             | 
| 2608 | 
            +
                3. Prefill some text tokens into the TTS model (for example, 10 tokens) using the `prefill_text` method.
         | 
| 2609 | 
            +
             | 
| 2610 | 
            +
                ```python
         | 
| 2611 | 
            +
                outputs = llm.generate(**kwargs)
         | 
| 2612 | 
            +
                llm_tokens = some_function_to_extract_llm_tokens(outputs)
         | 
| 2613 | 
            +
                lm_spk_emb_last_hidden_states = some_function_to_extract_lm_spk_emb_last_hidden_states(outputs)
         | 
| 2614 | 
            +
                tts_text_input_ids = tts_tokenizer.encode(llm_tokenizer.decode(llm_tokens))
         | 
| 2615 | 
            +
                # here we assume we are prefilling text tokens 0 to 9 (inclusive), 10 tokens in total.
         | 
| 2616 | 
            +
                begin = 0
         | 
| 2617 | 
            +
                end = 9+1
         | 
| 2618 | 
            +
                position_ids = torch.arange(begin, end, dtype=torch.long, device=device)
         | 
| 2619 | 
            +
             | 
| 2620 | 
            +
                past_key_values = model.prefill_text(
         | 
| 2621 | 
            +
                    input_ids=tts_text_input_ids,
         | 
| 2622 | 
            +
                    position_ids=position_ids,
         | 
| 2623 | 
            +
                    past_key_values=past_key_values,
         | 
| 2624 | 
            +
                    lm_spk_emb_last_hidden_states=lm_spk_emb_last_hidden_states,
         | 
| 2625 | 
            +
                )
         | 
| 2626 | 
            +
                ```
         | 
| 2627 | 
            +
             | 
| 2628 | 
            +
                4. Make a `streaming_tts_text_mask` to denote which positions contain valid text tokens, similar to `attention_mask` in standard causal attention.
         | 
| 2629 | 
            +
             | 
| 2630 | 
            +
                ```python
         | 
| 2631 | 
            +
                streaming_tts_text_mask = torch.zeros(model.streaming_text_reserved_len)
         | 
| 2632 | 
            +
                streaming_tts_text_mask[0:end] = 1  # mark these positions as containing valid (already prefilled) text tokens
         | 
| 2633 | 
            +
                ```
         | 
| 2634 | 
            +
             | 
| 2635 | 
            +
                5. Generate audio codes using the `generate` method.
         | 
| 2636 | 
            +
             | 
| 2637 | 
            +
                ```python
         | 
| 2638 | 
            +
                outputs = model.generate(
         | 
| 2639 | 
            +
                    input_ids=audio_input_ids,
         | 
| 2640 | 
            +
                    past_key_values=past_key_values,
         | 
| 2641 | 
            +
                    streaming_tts_text_mask=streaming_tts_text_mask,
         | 
| 2642 | 
            +
                    max_new_token=50,
         | 
| 2643 | 
            +
                )
         | 
| 2644 | 
            +
             | 
| 2645 | 
            +
                # update past_key_values and input_ids
         | 
| 2646 | 
            +
                past_key_values = outputs.past_key_values
         | 
| 2647 | 
            +
                audio_input_ids = outputs.input_ids
         | 
| 2648 | 
            +
                ```
         | 
| 2649 | 
            +
             | 
| 2650 | 
            +
                After calling `generate`, both `past_key_values` and `audio_input_ids` are extended by `max_new_token=50`.
         | 
| 2651 | 
            +
             | 
| 2652 | 
            +
                6. Note that after prefilling `10` text tokens, the model can generate up to `50` audio tokens; to generate more audio tokens, prefill the next `10` text tokens. It is also fine to generate only `25` audio tokens for a faster initial response.
         | 
| 2653 | 
            +
             | 
| 2654 | 
            +
                7. Repeat steps `3-5` as needed in your streaming audio generation use case, ensuring usage complies with the guidelines discussed above.
         | 
| 2655 | 
            +
                """
         | 
| 2656 | 
            +
             | 
| 2657 | 
             
                config_class = ConditionalChatTTSConfig
         | 
| 2658 |  | 
| 2659 | 
             
                def __init__(self, config: ConditionalChatTTSConfig):
         | 
|  | |
| 2711 | 
             
                    self.model = model
         | 
| 2712 |  | 
| 2713 | 
             
                @torch.inference_mode()
         | 
| 2714 | 
            +
                def merge_inputs_embeds(
         | 
| 2715 | 
             
                    self,
         | 
| 2716 | 
             
                    input_ids: torch.Tensor,
         | 
| 2717 | 
             
                    lm_spk_emb_last_hidden_states: Optional[torch.Tensor] = None,
         | 
|  | |
| 2718 | 
             
                ):
         | 
| 2719 | 
            +
                    """Merge `input_ids` and `lm_spk_emb_last_hidden_states` to `inputs_embeds`.
         | 
|  | |
| 2720 |  | 
| 2721 | 
             
                    Args:
         | 
| 2722 | 
             
                        input_ids (torch.Tensor): Input token IDs.
         | 
| 2723 | 
             
                        lm_spk_emb_last_hidden_states (Optional[torch.Tensor], optional): Last hidden states of speaker embeddings from the language model. Defaults to None.
         | 
|  | |
| 2724 |  | 
| 2725 | 
             
                    Raises:
         | 
| 2726 | 
             
                        NotImplementedError: If speaker embedding is not used and language model hidden states are not implemented.
         | 
|  | |
| 2750 | 
             
                                num_spk_embs=self.num_spk_embs,
         | 
| 2751 | 
             
                            )
         | 
| 2752 | 
             
                    else:
         | 
|  | |
|  | |
| 2753 | 
             
                        raise NotImplementedError
         | 
| 2754 |  | 
| 2755 | 
             
                    return inputs_embeds
         | 
|  | |
| 2761 | 
             
                    position_ids: torch.LongTensor,
         | 
| 2762 | 
             
                    past_key_values: List[Tuple[torch.Tensor, torch.Tensor]],
         | 
| 2763 | 
             
                    lm_spk_emb_last_hidden_states: Optional[torch.Tensor] = None,
         | 
|  | |
| 2764 | 
             
                ):
         | 
| 2765 | 
             
                    """Prefill a chunk of new text tokens in streaming setting.
         | 
| 2766 | 
            +
        Specifically, update `past_key_values` with the new text tokens so that the model reads them.
         | 
| 2767 |  | 
| 2768 | 
             
                    Args:
         | 
| 2769 | 
             
                        input_ids (Tensor): Tensor of shape [batch_size, seq_len]
         | 
|  | |
| 2777 | 
             
                    assert input_ids.shape[0] == 1
         | 
| 2778 | 
             
                    assert past_key_values is not None
         | 
| 2779 |  | 
| 2780 | 
            +
                    # Merge text and LLM embeddings
         | 
| 2781 | 
            +
                    inputs_embeds = self.merge_inputs_embeds(
         | 
| 2782 | 
             
                        input_ids=input_ids,
         | 
| 2783 | 
             
                        lm_spk_emb_last_hidden_states=lm_spk_emb_last_hidden_states,
         | 
|  | |
| 2784 | 
             
                    )
         | 
| 2785 |  | 
| 2786 | 
             
                    # Clone KV Cache
         | 
|  | |
| 2807 | 
             
                    # Get model updated KV Cache
         | 
| 2808 | 
             
                    past_key_values_for_prefill_updated = outputs_prefill.past_key_values
         | 
| 2809 |  | 
| 2810 | 
            +
                    # Update generated KV Cache to input `past_key_values`
         | 
| 2811 | 
             
                    for layer_idx in range(len(past_key_values)):
         | 
| 2812 | 
             
                        # Update keys
         | 
| 2813 | 
             
                        past_key_values[layer_idx][0][:, :, position_ids[:, 0] : position_ids[:, -1] + 1, :] = (
         | 
|  | |
| 2835 | 
             
                    streaming_tts_text_mask=None,
         | 
| 2836 | 
             
                    add_audio_bos: bool = True,
         | 
| 2837 | 
             
                ):
         | 
| 2838 | 
            +
                    """Prefill a chunk of audio ids to the model. Used in sliding-window long audio generation.
         | 
| 2839 | 
            +
        Specifically, prefill many audio ids (typically from the last window) to the model in the new window.
         | 
| 2840 | 
            +
             | 
| 2841 | 
             
                    Args:
         | 
| 2842 | 
             
                        input_ids (torch.Tensor): (1, seq_len, num_vq) Audio input token ids.
         | 
| 2843 | 
             
                        past_key_values (List[Tuple[torch.Tensor, torch.Tensor]]): Past key values for attention mechanism.
         | 
|  | |
| 2867 | 
             
                        streaming_tts_text_mask=streaming_tts_text_mask,
         | 
| 2868 | 
             
                        streaming_reserved_length=self.streaming_text_reserved_len,
         | 
| 2869 | 
             
                        streaming_text_chunk_size=self.streaming_text_chunk_size,
         | 
| 2870 | 
            +
                    )  # [1, 1, 1, past_key_values_length + input_len]
         | 
| 2871 |  | 
| 2872 | 
             
                    # Model forward
         | 
| 2873 | 
             
                    outputs: BaseModelOutputWithPast = self.model(
         | 
|  | |
| 2897 | 
             
                    logits_processors: List[CustomRepetitionPenaltyLogitsProcessorRepeat] = [],
         | 
| 2898 | 
             
                    show_tqdm=False,
         | 
| 2899 | 
             
                ):
         | 
| 2900 | 
            +
                    """Generate audio codes in streaming setting or non-streaming setting.
         | 
| 2901 | 
             
                    Specifically speaking, generate audio codes when not all text tokens are prefilled.
         | 
| 2902 |  | 
| 2903 | 
            +
        Always pass a valid `past_key_values` to this method. The method does not do `prefill` by itself; it relies on the `prefill_text` method to provide a valid `past_key_values`. Please refer to the docstring of this class for more details.
         | 
|  | |
| 2904 |  | 
| 2905 | 
            +
        In this method, we borrowed a lot of code from `https://github.com/2noise/ChatTTS/blob/main/ChatTTS/model/gpt.py`.
         | 
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
| 2906 |  | 
| 2907 | 
             
                    Args:
         | 
| 2908 | 
             
                        input_ids (torch.Tensor): Input token ids.
         | 
|  | |
| 2914 | 
             
                        logits_warpers (List[LogitsWarper], optional): List of logits warpers. Defaults to [].
         | 
| 2915 | 
             
                        logits_processors (List[CustomRepetitionPenaltyLogitsProcessorRepeat], optional): List of logits processors. Defaults to [].
         | 
| 2916 | 
             
                        show_tqdm (bool, optional): Whether to show progress bar. Defaults to True.
         | 
| 2917 | 
            +
             | 
|  | |
| 2918 | 
             
                    Returns:
         | 
| 2919 | 
             
                        GenerationOutputs: Generation outputs.
         | 
| 2920 | 
             
                    """
         | 
|  | |
| 2942 | 
             
                        device=input_ids.device,
         | 
| 2943 | 
             
                    )
         | 
| 2944 |  | 
| 2945 | 
            +
                    # Copy existing `input_ids` to `input_ids_buf`
         | 
| 2946 | 
             
                    input_ids_buf.narrow(1, 0, progress).copy_(input_ids)
         | 
| 2947 |  | 
| 2948 | 
             
                    del input_ids
         | 
|  | |
| 2961 | 
             
                    for i in range(max_new_token):
         | 
| 2962 | 
             
                        # Prepare generation inputs
         | 
| 2963 | 
             
                        audio_bos = False
         | 
| 2964 | 
            +
             | 
| 2965 | 
            +
                        # If this is the first audio token, the case is SPECIAL
         | 
| 2966 | 
             
                        if progress == condition_length:
         | 
| 2967 | 
             
                            audio_bos = True
         | 
| 2968 |  | 
| 2969 | 
            +
                        assert progress == (
         | 
| 2970 | 
            +
                            past_key_values[0][0].shape[2] + 1
         | 
| 2971 | 
            +
                )  # If you follow the usage guidelines above, this assertion should hold.
         | 
| 2972 | 
            +
             | 
| 2973 | 
             
                        if audio_bos:
         | 
| 2974 | 
            +
                    # Generate the first token: activate the model with `self.audio_bos_token_id`, and the model will predict a new audio token. This is a special case because, without the audio bos token, it is impossible to generate the first audio token in our streaming setting.
         | 
|  | |
| 2975 | 
             
                            narrowed_input_ids = torch.tensor([[self.audio_bos_token_id]], dtype=torch.long, device=self.device)
         | 
| 2976 | 
             
                            inputs_embeds = self.emb_text(narrowed_input_ids)
         | 
| 2977 | 
             
                            del narrowed_input_ids
         | 
| 2978 | 
             
                        else:
         | 
| 2979 | 
            +
                    # Generate the subsequent audio tokens. This applies to all other cases, including the second and later calls to `generate`.
         | 
|  | |
| 2980 | 
             
                            narrowed_input_ids = input_ids.narrow(dim=1, start=input_ids.shape[1] - 1, length=1)
         | 
| 2981 | 
             
                            code_emb = [self.emb_code[i](narrowed_input_ids[:, :, i]) for i in range(self.num_vq)]
         | 
| 2982 | 
             
                            inputs_embeds = torch.stack(code_emb, 3).sum(3)
         | 
|  | |
| 2986 | 
             
                        ).unsqueeze(0)
         | 
| 2987 |  | 
| 2988 | 
             
                        cache_position = position_ids.clone()
         | 
| 2989 | 
            +
             | 
| 2990 | 
            +
                        # Make causal mask
         | 
| 2991 | 
             
                        causal_mask = make_streaming_chunk_mask_generation(
         | 
| 2992 | 
             
                            inputs_embeds=inputs_embeds,
         | 
| 2993 | 
             
                            past_seen_tokens=past_key_values[0][0].shape[2],
         | 
|  | |
| 3079 | 
             
                        finish.logical_or_(finish_or)
         | 
| 3080 |  | 
| 3081 | 
             
                        del finish_or
         | 
| 3082 | 
            +
                        # Store new `token` into `input_ids_buf`
         | 
| 3083 | 
             
                        input_ids_buf.narrow(1, progress, 1).copy_(idx_next.unsqueeze_(1))
         | 
| 3084 |  | 
| 3085 | 
             
                        if i == 0 and finish.any():
         | 
|  | |
| 3123 | 
             
                def decode_to_mel_specs(
         | 
| 3124 | 
             
                    self,
         | 
| 3125 | 
             
                    result_list: List[torch.Tensor],
         | 
|  | |
| 3126 | 
             
                ):
         | 
| 3127 | 
            +
                    """Decode discrete audio codes to mel spectrograms.
         | 
| 3128 | 
            +
             | 
| 3129 | 
            +
                    Borrowed from `https://github.com/2noise/ChatTTS/blob/main/ChatTTS/core.py`
         | 
| 3130 | 
            +
             | 
| 3131 | 
            +
                    Args:
         | 
| 3132 | 
            +
                        result_list (List[torch.Tensor]): Audio codes output from `generate`.
         | 
| 3133 | 
            +
             | 
| 3134 | 
            +
                    Returns:
         | 
| 3135 | 
            +
                        torch.Tensor: Mel spectrograms.
         | 
| 3136 | 
            +
                    """
         | 
| 3137 | 
            +
             | 
| 3138 | 
             
                    decoder = self.dvae
         | 
| 3139 | 
             
                    max_x_len = -1
         | 
| 3140 | 
             
                    if len(result_list) == 0:
         | 
|  | |
| 3157 | 
             
                    return mel_specs
         | 
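    For orientation, a hypothetical end-of-pipeline sketch tying `generate` and `decode_to_mel_specs` together; `tts_model`, `collected_audio_codes`, and the vocoder step are placeholders, not part of this class:

    ```python
    # Illustrative only: decode audio codes collected from `generate` into mel spectrograms,
    # then hand them to whatever vocoder your pipeline uses (vocoding happens outside this class).
    result_list = list(collected_audio_codes)  # List[torch.Tensor] of audio codes, as in the docstring
    mel_specs = tts_model.decode_to_mel_specs(result_list)  # `tts_model` assumed to be a ConditionalChatTTS instance
    # waveform = vocoder(mel_specs)  # hypothetical vocoding step
    ```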
| 3158 |  | 
| 3159 |  | 
| 3160 | 
            +
            # Borrowed from `https://github.com/2noise/ChatTTS/blob/main/ChatTTS/model/processors.py`
         | 
| 3161 | 
             
            def gen_logits(
         | 
| 3162 | 
             
                num_code: int,
         | 
| 3163 | 
             
                top_P=0.7,
         | 
    	
        modeling_navit_siglip.py
    CHANGED
    
    | @@ -851,6 +851,7 @@ class SiglipVisionTransformer(SiglipPreTrainedModel): | |
| 851 | 
             
                config_class = SiglipVisionConfig
         | 
| 852 | 
             
                main_input_name = "pixel_values"
         | 
| 853 | 
             
                _supports_flash_attn_2 = True
         | 
|  | |
| 854 |  | 
| 855 | 
             
                def __init__(self, config: SiglipVisionConfig):
         | 
| 856 | 
             
                    super().__init__(config)
         | 
|  | |
| 851 | 
             
                config_class = SiglipVisionConfig
         | 
| 852 | 
             
                main_input_name = "pixel_values"
         | 
| 853 | 
             
                _supports_flash_attn_2 = True
         | 
| 854 | 
            +
                _no_split_modules = []
         | 
| 855 |  | 
| 856 | 
             
                def __init__(self, config: SiglipVisionConfig):
         | 
| 857 | 
             
                    super().__init__(config)
         | 
    	
        processing_minicpmo.py
    CHANGED
    
    | @@ -309,8 +309,10 @@ class MiniCPMOProcessor(ProcessorMixin): | |
| 309 | 
             
                        )
         | 
| 310 | 
             
                        return MiniCPMOBatchFeature(data={**model_inputs})
         | 
| 311 |  | 
| 312 | 
            -
                     | 
| 313 | 
            -
                     | 
|  | |
|  | |
| 314 | 
             
                    split_pattern = f"({image_pattern}|{audio_pattern})"
         | 
| 315 |  | 
| 316 | 
             
                    if isinstance(texts, str):
         | 
| @@ -343,13 +345,13 @@ class MiniCPMOProcessor(ProcessorMixin): | |
| 343 | 
             
                        image_id = 0
         | 
| 344 | 
             
                        audio_id = 0
         | 
| 345 | 
             
                        for i, chunk in enumerate(text_chunks):
         | 
| 346 | 
            -
                            if chunk ==  | 
| 347 | 
             
                                image_placeholder = self.image_processor.get_slice_image_placeholder(
         | 
| 348 | 
             
                                    image_sizes[index][image_id], image_id, max_slice_nums, use_image_id
         | 
| 349 | 
             
                                )
         | 
| 350 | 
             
                                image_id += 1
         | 
| 351 | 
             
                                text_chunks[i] = image_placeholder
         | 
| 352 | 
            -
                            elif chunk ==  | 
| 353 | 
             
                                audio_placeholder = audio_phs[index][audio_id]
         | 
| 354 | 
             
                                audio_id += 1
         | 
| 355 | 
             
                                text_chunks[i] = audio_placeholder
         | 
| @@ -494,9 +496,6 @@ class ChatTTSProcessor: | |
| 494 | 
             
                        try:
         | 
| 495 | 
             
                            mel = self.audio_processor(audio)  # [100(num_mel_bins), seq_len_mel]
         | 
| 496 | 
             
                        except Exception as e:
         | 
| 497 | 
            -
                            print(
         | 
| 498 | 
            -
                                "fuck! there is an error with audio waveform. If you use a dataset __getitem__, will skip and use next data as compensate, will not halt training."
         | 
| 499 | 
            -
                            )
         | 
| 500 | 
             
                            raise e
         | 
| 501 | 
             
                        audio_features_varlen.append(mel)
         | 
| 502 |  | 
|  | |
| 309 | 
             
                        )
         | 
| 310 | 
             
                        return MiniCPMOBatchFeature(data={**model_inputs})
         | 
| 311 |  | 
| 312 | 
            +
                    image_tag = "(<image>./</image>)"
         | 
| 313 | 
            +
                    image_pattern = "\(<image>./</image>\)"
         | 
| 314 | 
            +
                    audio_tag = "(<audio>./</audio>)"
         | 
| 315 | 
            +
                    audio_pattern = "\(<audio>./</audio>\)"
         | 
| 316 | 
             
                    split_pattern = f"({image_pattern}|{audio_pattern})"
         | 
| 317 |  | 
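    As an aside, a standalone sketch of how `split_pattern` carves a prompt into text and placeholder chunks (the tag strings mirror the ones defined above; the sample prompt is made up):

    ```python
    import re

    image_pattern = r"\(<image>./</image>\)"
    audio_pattern = r"\(<audio>./</audio>\)"
    split_pattern = f"({image_pattern}|{audio_pattern})"

    text = "Describe (<image>./</image>) and transcribe (<audio>./</audio>) please."
    chunks = re.split(split_pattern, text)
    # The capturing group keeps the matched tags in the result:
    # ['Describe ', '(<image>./</image>)', ' and transcribe ', '(<audio>./</audio>)', ' please.']
    print(chunks)
    ```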
| 318 | 
             
                    if isinstance(texts, str):
         | 
|  | |
| 345 | 
             
                        image_id = 0
         | 
| 346 | 
             
                        audio_id = 0
         | 
| 347 | 
             
                        for i, chunk in enumerate(text_chunks):
         | 
| 348 | 
            +
                            if chunk == image_tag:
         | 
| 349 | 
             
                                image_placeholder = self.image_processor.get_slice_image_placeholder(
         | 
| 350 | 
             
                                    image_sizes[index][image_id], image_id, max_slice_nums, use_image_id
         | 
| 351 | 
             
                                )
         | 
| 352 | 
             
                                image_id += 1
         | 
| 353 | 
             
                                text_chunks[i] = image_placeholder
         | 
| 354 | 
            +
                            elif chunk == audio_tag:
         | 
| 355 | 
             
                                audio_placeholder = audio_phs[index][audio_id]
         | 
| 356 | 
             
                                audio_id += 1
         | 
| 357 | 
             
                                text_chunks[i] = audio_placeholder
         | 
|  | |
| 496 | 
             
                        try:
         | 
| 497 | 
             
                            mel = self.audio_processor(audio)  # [100(num_mel_bins), seq_len_mel]
         | 
| 498 | 
             
                        except Exception as e:
         | 
|  | |
|  | |
|  | |
| 499 | 
             
                            raise e
         | 
| 500 | 
             
                        audio_features_varlen.append(mel)
         | 
| 501 |  | 
    	
        utils.py
    CHANGED
    
    | @@ -13,8 +13,8 @@ | |
| 13 | 
             
            # See the License for the specific language governing permissions and
         | 
| 14 | 
             
            # limitations under the License.
         | 
| 15 |  | 
| 16 | 
            -
            import re
         | 
| 17 | 
             
            import logging
         | 
|  | |
| 18 |  | 
| 19 | 
             
            import librosa
         | 
| 20 | 
             
            import numpy as np
         | 
| @@ -42,6 +42,28 @@ def sentence_end(txt): | |
| 42 |  | 
| 43 |  | 
| 44 | 
             
            class NumberToTextConverter:
         | 
| 45 | 
             
                def __init__(self):
         | 
| 46 | 
             
                    self.num_to_chinese = {
         | 
| 47 | 
             
                        "0": "零",
         | 
| @@ -103,6 +125,31 @@ class NumberToTextConverter: | |
| 103 |  | 
| 104 |  | 
| 105 | 
             
            class VoiceChecker:
         | 
| 106 | 
             
                def __init__(self):
         | 
| 107 | 
             
                    self.previous_mel = None
         | 
| 108 | 
             
                    self.consecutive_zeros = 0
         | 
| @@ -129,7 +176,9 @@ class VoiceChecker: | |
| 129 | 
             
                        mel_spec_chunk = mel_spec[:, i * mel_chunk_size : (i + 1) * mel_chunk_size]
         | 
| 130 |  | 
| 131 | 
             
                        distance = self.compute_distance(audio_chunk, mel_spec_chunk)
         | 
| 132 | 
            -
                        logger.warning( | 
|  | |
|  | |
| 133 | 
             
                        if distance == 0:
         | 
| 134 | 
             
                            self.consecutive_low_distance = 0  # reset
         | 
| 135 | 
             
                            self.consecutive_zeros += 1
         | 
|  | |
| 13 | 
             
            # See the License for the specific language governing permissions and
         | 
| 14 | 
             
            # limitations under the License.
         | 
| 15 |  | 
|  | |
| 16 | 
             
            import logging
         | 
| 17 | 
            +
            import re
         | 
| 18 |  | 
| 19 | 
             
            import librosa
         | 
| 20 | 
             
            import numpy as np
         | 
|  | |
| 42 |  | 
| 43 |  | 
| 44 | 
             
            class NumberToTextConverter:
         | 
| 45 | 
            +
                r"""
         | 
| 46 | 
            +
                A helper class to ensure text-to-speech (TTS) systems read numeric digits
         | 
| 47 | 
            +
                in the desired language (Chinese or English) digit-by-digit. It forcibly
         | 
| 48 | 
            +
                replaces all numeric substrings in text with their language-specific
         | 
| 49 | 
            +
                textual representations, thereby reducing the likelihood of TTS mistakes
         | 
| 50 | 
            +
                on numbers.
         | 
| 51 | 
            +
                Note: MiniCPM-o 2.6 only use this in streaming mode.
         | 
| 52 | 
            +
             | 
| 53 | 
            +
                Attributes:
         | 
| 54 | 
            +
                    num_to_chinese (dict):
         | 
| 55 | 
            +
                        Mapping from digit (str) to its Chinese textual form (str).
         | 
| 56 | 
            +
                    num_to_english (dict):
         | 
| 57 | 
            +
                        Mapping from digit (str) to its English textual form (str).
         | 
| 58 | 
            +
             | 
| 59 | 
            +
                Example:
         | 
| 60 | 
            +
                    >>> converter = NumberToTextConverter()
         | 
| 61 | 
            +
                    >>> converter.replace_numbers_with_text("我有2个苹果", language="chinese")
         | 
| 62 | 
            +
                    '我有两个苹果'
         | 
| 63 | 
            +
                    >>> converter.replace_numbers_with_text("I have 23 books", language="english")
         | 
| 64 | 
            +
                    'I have two three books'
         | 
| 65 | 
            +
                """
         | 
| 66 | 
            +
             | 
| 67 | 
             
                def __init__(self):
         | 
| 68 | 
             
                    self.num_to_chinese = {
         | 
| 69 | 
             
                        "0": "零",
         | 
|  | |
| 125 |  | 
| 126 |  | 
| 127 | 
             
            class VoiceChecker:
         | 
| 128 | 
            +
                r"""
         | 
| 129 | 
            +
                A simple utility class to detect silence or low variation in consecutive audio chunks by comparing
         | 
| 130 | 
            +
                the mel-spectrogram distances. It keeps track of consecutive zero-distance and low-distance chunks
         | 
| 131 | 
            +
                to decide if the audio is considered "bad" (e.g., overly silent or not changing enough).
         | 
| 132 | 
            +
             | 
| 133 | 
            +
                Attributes:
         | 
| 134 | 
            +
                    previous_mel (`np.ndarray` or `None`):
         | 
| 135 | 
            +
                        Holds the previously observed mel-spectrogram in decibel scale. Used to compute
         | 
| 136 | 
            +
                        the next distance; reset via :meth:`reset`.
         | 
| 137 | 
            +
                    consecutive_zeros (`int`):
         | 
| 138 | 
            +
                        The number of consecutive chunks that were detected as silent (distance = 0).
         | 
| 139 | 
            +
                    consecutive_low_distance (`int`):
         | 
| 140 | 
            +
                        The number of consecutive chunks whose distance was below the threshold.
         | 
| 141 | 
            +
             | 
| 142 | 
            +
                Example:
         | 
| 143 | 
            +
                    >>> checker = VoiceChecker()
         | 
| 144 | 
            +
                    >>> # Suppose we have audio_wav (list or np.ndarray) and mel_spec (np.ndarray)
         | 
| 145 | 
            +
                    >>> # We split them into chunks and call checker.is_bad(...)
         | 
| 146 | 
            +
                    >>> is_audio_bad = checker.is_bad(audio_wav, mel_spec, chunk_size=2560, thresh=100.0)
         | 
| 147 | 
            +
                    >>> if is_audio_bad:
         | 
| 148 | 
            +
                    ...     print("Audio deemed bad!")
         | 
| 149 | 
            +
                    >>> # Reset states if needed
         | 
| 150 | 
            +
                    >>> checker.reset()
         | 
| 151 | 
            +
                """
         | 
| 152 | 
            +
             | 
| 153 | 
             
                def __init__(self):
         | 
| 154 | 
             
                    self.previous_mel = None
         | 
| 155 | 
             
                    self.consecutive_zeros = 0
         | 
|  | |
| 176 | 
             
                        mel_spec_chunk = mel_spec[:, i * mel_chunk_size : (i + 1) * mel_chunk_size]
         | 
| 177 |  | 
| 178 | 
             
                        distance = self.compute_distance(audio_chunk, mel_spec_chunk)
         | 
| 179 | 
            +
                        logger.warning(
         | 
| 180 | 
            +
                            f"mel dist: {distance:.1f}, zero: {self.consecutive_zeros}, low: {self.consecutive_low_distance}"
         | 
| 181 | 
            +
                        )
         | 
| 182 | 
             
                        if distance == 0:
         | 
| 183 | 
             
                            self.consecutive_low_distance = 0  # reset
         | 
| 184 | 
             
                            self.consecutive_zeros += 1
         | 
