James Zhou
committed on
Commit
·
2e936cf
1
Parent(s):
9c301e6
[update] readme
README.md
CHANGED
|
@@ -105,7 +105,7 @@ Professional-grade audio generation with crystal clarity
|
|
| 105 |
|
| 106 |
## **Abstract**
|
| 107 |
|
| 108 |
-
<div align="center" style="background: linear-gradient(135deg, #ffeef8 0%, #f0f8ff 100%); padding: 30px; border-radius: 20px; margin: 20px 0; border-left: 5px solid #ff6b9d;">
|
| 109 |
|
| 110 |
**Tencent Hunyuan** proudly open-sources **HunyuanVideo-Foley** - an end-to-end video sound effect generation model!
|
| 111 |
|
|
@@ -117,21 +117,21 @@ Professional-grade audio generation with crystal clarity
|
|
| 117 |
|
| 118 |
<div style="display: grid; grid-template-columns: 1fr; gap: 15px; margin: 20px 0;">
|
| 119 |
|
| 120 |
-
<div style="border-left: 4px solid #4CAF50; padding: 15px; background: #f8f9fa; border-radius: 8px;">
|
| 121 |
|
| 122 |
**🎬 Multi-scenario Audio-Visual Synchronization**
|
| 123 |
Supports generating high-quality audio that is synchronized and semantically aligned with complex video scenes, enhancing realism and immersive experience for film/TV and gaming applications.
|
| 124 |
|
| 125 |
</div>
|
| 126 |
|
| 127 |
-
<div style="border-left: 4px solid #2196F3; padding: 15px; background: #f8f9fa; border-radius: 8px;">
|
| 128 |
|
| 129 |
**⚖️ Multi-modal Semantic Balance**
|
| 130 |
Intelligently balances visual and textual information analysis, comprehensively orchestrates sound effect elements, avoids one-sided generation, and meets personalized dubbing requirements.
|
| 131 |
|
| 132 |
</div>
|
| 133 |
|
| 134 |
-
<div style="border-left: 4px solid #FF9800; padding: 15px; background: #f8f9fa; border-radius: 8px;">
|
| 135 |
|
| 136 |
**🎵 High-fidelity Audio Output**
|
| 137 |
Self-developed 48kHz audio VAE perfectly reconstructs sound effects, music, and vocals, achieving professional-grade audio generation quality.
|
|
@@ -140,7 +140,7 @@ Self-developed 48kHz audio VAE perfectly reconstructs sound effects, music, and
|
|
| 140 |
|
| 141 |
</div>
|
| 142 |
|
| 143 |
-
<div align="center" style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 15px; margin: 20px 0;">
|
| 144 |
|
| 145 |
**SOTA Performance Achieved**
|
| 146 |
|
|
@@ -168,7 +168,7 @@ Self-developed 48kHz audio VAE perfectly reconstructs sound effects, music, and
|
|
| 168 |
|
| 169 |
</div>
|
| 170 |
|
| 171 |
-
<div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #17a2b8; margin: 20px 0;">
|
| 172 |
|
| 173 |
The **TV2A (Text-Video-to-Audio)** task presents a complex multimodal generation challenge requiring large-scale, high-quality datasets. Our comprehensive data pipeline systematically identifies and excludes unsuitable content to produce robust and generalizable audio generation capabilities.
|
| 174 |
|
|
@@ -183,7 +183,7 @@ The **TV2A (Text-Video-to-Audio)** task presents a complex multimodal generation
|
|
| 183 |
|
| 184 |
</div>
|
| 185 |
|
| 186 |
-
<div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #28a745; margin: 20px 0;">
|
| 187 |
|
| 188 |
**HunyuanVideo-Foley** employs a sophisticated hybrid architecture:
|
| 189 |
|
|
@@ -276,7 +276,7 @@ cd HunyuanVideo-Foley
|
|
| 276 |
|
| 277 |
#### **Step 2: Environment Setup**
|
| 278 |
|
| 279 |
-
<div style="background: #fff3cd; padding: 15px; border-radius: 8px; border-left: 4px solid #ffc107; margin: 10px 0;">
|
| 280 |
|
| 281 |
💡 **Tip**: We recommend using [Conda](https://docs.anaconda.com/free/miniconda/index.html) for Python environment management.
|
| 282 |
|
|
@@ -289,7 +289,7 @@ pip install -r requirements.txt
|
|
| 289 |
|
| 290 |
#### **Step 3: Download Pretrained Models**
|
| 291 |
|
| 292 |
-
<div style="background: #d1ecf1; padding: 15px; border-radius: 8px; border-left: 4px solid #17a2b8; margin: 10px 0;">
|
| 293 |
|
| 294 |
**Download Model weights from Huggingface**
|
| 295 |
```bash
|
|
@@ -309,7 +309,7 @@ huggingface-cli download tencent/HunyuanVideo-Foley
|
|
| 309 |
|
| 310 |
### 🎬 **Single Video Generation**
|
| 311 |
|
| 312 |
-
<div style="background: #e8f5e8; padding: 15px; border-radius: 8px; border-left: 4px solid #28a745; margin: 10px 0;">
|
| 313 |
|
| 314 |
Generate Foley audio for a single video file with text description:
|
| 315 |
|
|
@@ -326,7 +326,7 @@ python3 infer.py \
|
|
| 326 |
|
| 327 |
### **Batch Processing**
|
| 328 |
|
| 329 |
-
<div style="background: #fff3e0; padding: 15px; border-radius: 8px; border-left: 4px solid #ff9800; margin: 10px 0;">
|
| 330 |
|
| 331 |
Process multiple videos using a CSV file with video paths and descriptions:
|
| 332 |
|
|
@@ -342,7 +342,7 @@ python3 infer.py \
|
|
| 342 |
|
| 343 |
### **Interactive Web Interface**
|
| 344 |
|
| 345 |
-
<div style="background: #f3e5f5; padding: 15px; border-radius: 8px; border-left: 4px solid #9c27b0; margin: 10px 0;">
|
| 346 |
|
| 347 |
Launch a user-friendly Gradio web interface for easy interaction:
|
| 348 |
|
|
@@ -353,7 +353,7 @@ export HIFI_FOLEY_MODEL_PATH=PRETRAINED_MODEL_PATH_DIR
|
|
| 353 |
python3 gradio_app.py
|
| 354 |
```
|
| 355 |
|
| 356 |
-
<div align="center" style="margin: 20px 0;">
|
| 357 |
|
| 358 |
*Then open your browser and navigate to the provided local URL to start generating Foley audio!*
|
| 359 |
|
|
@@ -363,7 +363,7 @@ python3 gradio_app.py
|
|
| 363 |
|
| 364 |
## **Citation**
|
| 365 |
|
| 366 |
-
<div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #6c757d; margin: 20px 0;">
|
| 367 |
|
| 368 |
If you find **HunyuanVideo-Foley** useful for your research, please consider citing our paper:
|
| 369 |
|
|
|
|
| 105 |
|
| 106 |
## **Abstract**
|
| 107 |
|
| 108 |
+
<div align="center" style="background: linear-gradient(135deg, #ffeef8 0%, #f0f8ff 100%); padding: 30px; border-radius: 20px; margin: 20px 0; border-left: 5px solid #ff6b9d; color: #333;">
|
| 109 |
|
| 110 |
**Tencent Hunyuan** proudly open-sources **HunyuanVideo-Foley** - an end-to-end video sound effect generation model!
|
| 111 |
|
|
|
|
| 117 |
|
| 118 |
<div style="display: grid; grid-template-columns: 1fr; gap: 15px; margin: 20px 0;">
|
| 119 |
|
| 120 |
+
<div style="border-left: 4px solid #4CAF50; padding: 15px; background: #f8f9fa; border-radius: 8px; color: #333;">
|
| 121 |
|
| 122 |
**🎬 Multi-scenario Audio-Visual Synchronization**
|
| 123 |
Supports generating high-quality audio that is synchronized and semantically aligned with complex video scenes, enhancing realism and immersive experience for film/TV and gaming applications.
|
| 124 |
|
| 125 |
</div>
|
| 126 |
|
| 127 |
+
<div style="border-left: 4px solid #2196F3; padding: 15px; background: #f8f9fa; border-radius: 8px; color: #333;">
|
| 128 |
|
| 129 |
**⚖️ Multi-modal Semantic Balance**
|
| 130 |
Intelligently balances visual and textual information analysis, comprehensively orchestrates sound effect elements, avoids one-sided generation, and meets personalized dubbing requirements.
|
| 131 |
|
| 132 |
</div>
|
| 133 |
|
| 134 |
+
<div style="border-left: 4px solid #FF9800; padding: 15px; background: #f8f9fa; border-radius: 8px; color: #333;">
|
| 135 |
|
| 136 |
**🎵 High-fidelity Audio Output**
|
| 137 |
Self-developed 48kHz audio VAE perfectly reconstructs sound effects, music, and vocals, achieving professional-grade audio generation quality.
|
|
|
|
| 140 |
|
| 141 |
</div>
|
| 142 |
|
| 143 |
+
<div align="center" style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 15px; margin: 20px 0; color: #333;">
|
| 144 |
|
| 145 |
**SOTA Performance Achieved**
|
| 146 |
|
|
|
|
| 168 |
|
| 169 |
</div>
|
| 170 |
|
| 171 |
+
<div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #17a2b8; margin: 20px 0; color: #333;">
|
| 172 |
|
| 173 |
The **TV2A (Text-Video-to-Audio)** task presents a complex multimodal generation challenge requiring large-scale, high-quality datasets. Our comprehensive data pipeline systematically identifies and excludes unsuitable content to produce robust and generalizable audio generation capabilities.
|
| 174 |
|
|
|
|
| 183 |
|
| 184 |
</div>
|
| 185 |
|
| 186 |
+
<div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #28a745; margin: 20px 0; color: #333;">
|
| 187 |
|
| 188 |
**HunyuanVideo-Foley** employs a sophisticated hybrid architecture:
|
| 189 |
|
|
|
|
| 276 |
|
| 277 |
#### **Step 2: Environment Setup**
|
| 278 |
|
| 279 |
+
<div style="background: #fff3cd; padding: 15px; border-radius: 8px; border-left: 4px solid #ffc107; margin: 10px 0; color: #333;">
|
| 280 |
|
| 281 |
💡 **Tip**: We recommend using [Conda](https://docs.anaconda.com/free/miniconda/index.html) for Python environment management.
|
| 282 |
|
|
|
|
| 289 |
|
| 290 |
#### **Step 3: Download Pretrained Models**
|
| 291 |
|
| 292 |
+
<div style="background: #d1ecf1; padding: 15px; border-radius: 8px; border-left: 4px solid #17a2b8; margin: 10px 0; color: #333;">
|
| 293 |
|
| 294 |
**Download Model weights from Huggingface**
|
| 295 |
```bash
|
|
|
|
| 309 |
|
| 310 |
### 🎬 **Single Video Generation**
|
| 311 |
|
| 312 |
+
<div style="background: #e8f5e8; padding: 15px; border-radius: 8px; border-left: 4px solid #28a745; margin: 10px 0; color: #333;">
|
| 313 |
|
| 314 |
Generate Foley audio for a single video file with text description:
|
| 315 |
|
|
|
|
| 326 |
|
| 327 |
### **Batch Processing**
|
| 328 |
|
| 329 |
+
<div style="background: #fff3e0; padding: 15px; border-radius: 8px; border-left: 4px solid #ff9800; margin: 10px 0; color: #333;">
|
| 330 |
|
| 331 |
Process multiple videos using a CSV file with video paths and descriptions:
|
| 332 |
|
|
|
|
| 342 |
|
| 343 |
### **Interactive Web Interface**
|
| 344 |
|
| 345 |
+
<div style="background: #f3e5f5; padding: 15px; border-radius: 8px; border-left: 4px solid #9c27b0; margin: 10px 0; color: #333;">
|
| 346 |
|
| 347 |
Launch a user-friendly Gradio web interface for easy interaction:
|
| 348 |
|
|
|
|
| 353 |
python3 gradio_app.py
|
| 354 |
```
|
| 355 |
|
| 356 |
+
<div align="center" style="margin: 20px 0; color: #333;">
|
| 357 |
|
| 358 |
*Then open your browser and navigate to the provided local URL to start generating Foley audio!*
|
| 359 |
|
|
|
|
| 363 |
|
| 364 |
## **Citation**
|
| 365 |
|
| 366 |
+
<div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #6c757d; margin: 20px 0; color: #333;">
|
| 367 |
|
| 368 |
If you find **HunyuanVideo-Foley** useful for your research, please consider citing our paper:
|
| 369 |
|