Update README.md

README.md CHANGED
```diff
@@ -1,5 +1,5 @@
 ---
-title:
+title: WAN 2.1 FAST VIDEO with AUDIO
 emoji: 🔊
 colorFrom: blue
 colorTo: indigo
@@ -7,157 +7,3 @@ sdk: gradio
 app_file: app.py
 pinned: false
 ---
```

This commit sets the Space title and removes the body of the previous README (the MMAudio documentation reproduced below).
# [Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis](https://hkchengrex.github.io/MMAudio)

[Ho Kei Cheng](https://hkchengrex.github.io/), [Masato Ishii](https://scholar.google.co.jp/citations?user=RRIO1CcAAAAJ), [Akio Hayakawa](https://scholar.google.com/citations?user=sXAjHFIAAAAJ), [Takashi Shibuya](https://scholar.google.com/citations?user=XCRO260AAAAJ), [Alexander Schwing](https://www.alexander-schwing.de/), [Yuki Mitsufuji](https://www.yukimitsufuji.com/)

University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation

[[Paper (being prepared)]](https://hkchengrex.github.io/MMAudio) [[Project Page]](https://hkchengrex.github.io/MMAudio)

**Note: This repository is still under construction. Single-example inference should work as expected. The training code will be added. Code is subject to non-backward-compatible changes.**
## Highlight

MMAudio generates synchronized audio given video and/or text inputs.
Our key innovation is multimodal joint training, which allows training on a wide range of audio-visual and audio-text datasets.
Moreover, a synchronization module aligns the generated audio with the video frames.
## Results

(All audio generated by our algorithm, MMAudio.)

Videos from Sora:

https://github.com/user-attachments/assets/82afd192-0cee-48a1-86ca-bd39b8c8f330

Videos from MovieGen/Hunyuan Video/VGGSound:

https://github.com/user-attachments/assets/29230d4e-21c1-4cf8-a221-c28f2af6d0ca

For more results, visit https://hkchengrex.com/MMAudio/video_main.html.
## Installation

We have only tested this on Ubuntu.

### Prerequisites

We recommend using a [miniforge](https://github.com/conda-forge/miniforge) environment (one possible setup is sketched below this list).

- Python 3.8+
- PyTorch **2.5.1+** and the corresponding torchvision/torchaudio (pick your CUDA version at https://pytorch.org/)
- ffmpeg<7 ([this is required by torchaudio](https://pytorch.org/audio/master/installation.html#optional-dependencies); you can install it in a miniforge environment with `conda install -c conda-forge 'ffmpeg<7'`)
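Putting these together, a minimal environment setup consistent with the prerequisites above might look like the following sketch. The environment name and the CUDA wheel index (`cu121`) are illustrative assumptions; pick the index that matches your CUDA version at https://pytorch.org/.

```bash
# Illustrative sketch: the environment name and CUDA version are assumptions.
conda create -n mmaudio python=3.11 -y
conda activate mmaudio

# ffmpeg<7 is required by torchaudio.
conda install -c conda-forge 'ffmpeg<7' -y

# Pick the index URL matching your CUDA version (cu121 shown only as an example).
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```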
**Clone our repository:**

```bash
git clone https://github.com/hkchengrex/MMAudio.git
```

**Install with pip:**

```bash
cd MMAudio
pip install -e .
```

(If you encounter the `File "setup.py" not found` error, upgrade pip with `pip install --upgrade pip`.)
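To sanity-check the editable install, you can try importing the package. The top-level module name `mmaudio` is inferred from the `mmaudio/utils/download_utils.py` path referenced below, so treat it as an assumption about the package layout.

```bash
# Assumes the package exposes a top-level `mmaudio` module.
python -c "import mmaudio; print('MMAudio import OK')"
```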
**Pretrained models:**

The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`.

| Model | Download link | File size |
| -------- | ------- | ------- |
| Flow prediction network, small 16kHz | <a href="https://databank.illinois.edu/datafiles/k6jve/download" download="mmaudio_small_16k.pth">mmaudio_small_16k.pth</a> | 601M |
| Flow prediction network, small 44.1kHz | <a href="https://databank.illinois.edu/datafiles/864ya/download" download="mmaudio_small_44k.pth">mmaudio_small_44k.pth</a> | 601M |
| Flow prediction network, medium 44.1kHz | <a href="https://databank.illinois.edu/datafiles/pa94t/download" download="mmaudio_medium_44k.pth">mmaudio_medium_44k.pth</a> | 2.4G |
| Flow prediction network, large 44.1kHz **(recommended)** | <a href="https://databank.illinois.edu/datafiles/4jx76/download" download="mmaudio_large_44k.pth">mmaudio_large_44k.pth</a> | 3.9G |
| 16kHz VAE | <a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-16.pth">v1-16.pth</a> | 655M |
| 16kHz BigVGAN vocoder | <a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/best_netG.pt">best_netG.pt</a> | 429M |
| 44.1kHz VAE | <a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-44.pth">v1-44.pth</a> | 1.2G |
| Synchformer visual encoder | <a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/synchformer_state_dict.pth">synchformer_state_dict.pth</a> | 907M |

The 44.1kHz vocoder will be downloaded automatically.
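If you download checkpoints manually, you can compare their digests against the MD5 checksums in `mmaudio/utils/download_utils.py`; the paths below follow the directory layout shown next.

```bash
# Print MD5 digests and compare them with the values in mmaudio/utils/download_utils.py.
md5sum weights/mmaudio_large_44k.pth ext_weights/v1-44.pth ext_weights/synchformer_state_dict.pth
```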
The expected directory structure (full):

```bash
MMAudio
├── ext_weights
│   ├── best_netG.pt
│   ├── synchformer_state_dict.pth
│   ├── v1-16.pth
│   └── v1-44.pth
├── weights
│   ├── mmaudio_small_16k.pth
│   ├── mmaudio_small_44k.pth
│   ├── mmaudio_medium_44k.pth
│   └── mmaudio_large_44k.pth
└── ...
```
The expected directory structure (minimal, for the recommended model only):

```bash
MMAudio
├── ext_weights
│   ├── synchformer_state_dict.pth
│   └── v1-44.pth
├── weights
│   └── mmaudio_large_44k.pth
└── ...
```
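Although the demo script downloads these files automatically, one way to fetch the minimal set by hand is sketched below, using the links from the table above (run from the repository root).

```bash
# Manual download of the minimal set for the recommended large_44k model.
mkdir -p weights ext_weights
wget -O weights/mmaudio_large_44k.pth https://databank.illinois.edu/datafiles/4jx76/download
wget -O ext_weights/v1-44.pth https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-44.pth
wget -O ext_weights/synchformer_state_dict.pth https://github.com/hkchengrex/MMAudio/releases/download/v0.1/synchformer_state_dict.pth
```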
## Demo

By default, these scripts use the `large_44k` model.
In our experiments, inference only takes around 6GB of GPU memory (in 16-bit mode), which should fit in most modern GPUs.

### Command-line interface

With `demo.py`:

```bash
python demo.py --duration=8 --video=<path to video> --prompt "your prompt"
```

The output (audio in `.flac` format and video in `.mp4` format) will be saved in `./output`.
See the file for more options.
Simply omit the `--video` option for text-to-audio synthesis; an example is sketched below.
The default output (and training) duration is 8 seconds. Longer or shorter durations can also work, but a large deviation from the training duration may result in lower quality.
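For instance, a text-to-audio run with the default duration might look like this (the prompt is only illustrative):

```bash
# Text-to-audio: omit --video and describe the desired sound in the prompt.
python demo.py --duration=8 --prompt "waves crashing on a rocky shore"
```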
### Gradio interface

Supports video-to-audio and text-to-audio synthesis.

```bash
python gradio_demo.py
```
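If the default port is unavailable or you need to expose the interface on your network, Gradio honours the `GRADIO_SERVER_NAME` and `GRADIO_SERVER_PORT` environment variables; this assumes `gradio_demo.py` does not hard-code its own server settings.

```bash
# Assumption: gradio_demo.py uses Gradio's default launch settings.
GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7861 python gradio_demo.py
```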
### Known limitations

1. The model sometimes generates undesired, unintelligible human speech-like sounds.
2. The model sometimes generates undesired background music.
3. The model struggles with unfamiliar concepts, e.g., it can generate "gunfire" but not "RPG firing".

We believe all three of these limitations can be addressed with more high-quality training data.
## Training

Work in progress.

## Evaluation

Work in progress.

## Acknowledgement

Many thanks to:

- [Make-An-Audio 2](https://github.com/bytedance/Make-An-Audio-2) for the 16kHz BigVGAN pretrained model
- [BigVGAN](https://github.com/NVIDIA/BigVGAN)
- [Synchformer](https://github.com/v-iashin/Synchformer)