---
title: MIMO - Character Video Synthesis
emoji: 🎭
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.7.1
app_file: app.py
pinned: false
license: apache-2.0
python_version: '3.10'
---

# MIMO - Controllable Character Video Synthesis

**🎬 Complete Implementation** - Optimized for HuggingFace Spaces

Transform character images into animated videos with controllable motion and advanced video editing capabilities.

## 🚀 Quick Start

1. **Setup Models**: Click the "Setup Models" button (downloads the required models)
2. **Load Model**: Click the "Load Model" button (initializes the MIMO pipeline)
3. **Upload Image**: Provide a character image (person, anime, cartoon, etc.)
4. **Choose Template (Optional)**: Select a motion template or use the reference image only
5. **Generate**: Create the animated video

**Note on Templates**: Video templates are optional; see `TEMPLATES_SETUP.md` for adding custom templates. A minimal sketch of how these controls could be wired together is shown below.
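
The UI is a Gradio app (`app.py`, Gradio 4.x per the SDK version above). The sketch below shows one way the three buttons could be wired; the function names (`setup_models`, `load_pipeline`, `generate_video`) and the template choices are illustrative placeholders, not the actual names used in `app.py`.

```python
import gradio as gr

def setup_models() -> str:
    # Hypothetical: download the required checkpoints on first use.
    return "Models downloaded."

def load_pipeline() -> str:
    # Hypothetical: initialize the MIMO pipeline once and cache it.
    return "Pipeline loaded."

def generate_video(image_path: str, template: str) -> str:
    # Hypothetical: run inference and return the path to the rendered video.
    return "output.mp4"

with gr.Blocks() as demo:
    status = gr.Textbox(label="Status")
    image = gr.Image(type="filepath", label="Character image")
    template = gr.Dropdown(
        choices=["(reference image only)", "basketball_gym", "dance_indoor"],
        label="Motion template (optional)",
    )
    result = gr.Video(label="Generated video")

    gr.Button("Setup Models").click(setup_models, outputs=status)
    gr.Button("Load Model").click(load_pipeline, outputs=status)
    gr.Button("Generate").click(generate_video, inputs=[image, template], outputs=result)

demo.launch()
```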

## ⚡ Why This Approach?

To prevent HuggingFace Spaces build timeouts, the Space uses progressive loading (see the sketch below):

- Minimal dependencies at startup (fast build)
- Runtime installation of heavy packages (TensorFlow, OpenCV)
- Full features available after a one-time setup
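
A minimal sketch of the runtime-installation idea, assuming heavy packages are pulled in with `pip` the first time they are needed rather than at build time; the package shown is an example, not the Space's exact install set.

```python
import importlib
import subprocess
import sys

def ensure_package(pip_name: str, module_name: str | None = None):
    """Import a module, installing its pip package at runtime if it is missing."""
    module_name = module_name or pip_name
    try:
        return importlib.import_module(module_name)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])
        return importlib.import_module(module_name)

# Heavy dependencies are only pulled in during the one-time "Setup Models" step.
cv2 = ensure_package("opencv-python-headless", "cv2")
```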

## Features

### 🎭 Character Animation Mode

- Simple character animation with motion templates
- Based on `run_animate.py` from the original repository
- Fast generation (512x512, 20 steps)

### 🎬 Video Character Editing Mode

- Advanced editing with background preservation (see the compositing sketch below)
- Human segmentation and occlusion handling
- Based on `run_edit.py` from the original repository
- High quality (784x784, 25 steps)
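
Background preservation can be illustrated with a simple mask-based composite: the edited character is blended back over the untouched source frame wherever the person mask is set. This is a conceptual sketch, not the repository's actual implementation.

```python
import numpy as np

def composite(background: np.ndarray, edited: np.ndarray, person_mask: np.ndarray) -> np.ndarray:
    """Blend the edited character over the original background.

    background, edited: HxWx3 uint8 RGB frames of the same size.
    person_mask: HxW float mask in [0, 1]; 1 where the character is.
    """
    alpha = person_mask[..., None]                      # broadcast over the RGB channels
    blended = edited * alpha + background * (1.0 - alpha)
    return blended.astype(np.uint8)
```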

## Available Templates

- **Sports**: basketball_gym, nba_dunk, nba_pass, football
- **Action**: kungfu_desert, kungfu_match, parkour, BruceLee
- **Dance**: dance_indoor, irish_dance
- **Synthetic**: syn_basketball, syn_dancing

## Technical Details

- **Models**: Stable Diffusion v1.5 + 3D UNet + Pose Guider
- **GPU**: Auto-detection (T4/A10G/A100) with FP16/FP32
- **Resolution**: 512x512 (Animation), 784x784 (Editing)
- **Processing time**: 2-5 minutes depending on the template
- **Video I/O**: PyAV (the `av` pip package) for frame decoding/encoding (see the sketch below)
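
As one concrete illustration of the GPU and video I/O bullets, the sketch below shows dtype auto-selection with PyTorch and frame decoding with PyAV; it is a generic example, not the Space's exact code.

```python
import av
import numpy as np
import torch

# Pick the device and precision: FP16 on GPU (T4/A10G/A100), FP32 on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

def read_frames(path: str) -> list[np.ndarray]:
    """Decode a template video into a list of HxWx3 RGB frames with PyAV."""
    with av.open(path) as container:
        return [frame.to_ndarray(format="rgb24") for frame in container.decode(video=0)]
```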

## Credits

- **Paper**: MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling
- **Authors**: Yifang Men, Yuan Yao, Miaomiao Cui, Liefeng Bo (Alibaba Group)
- **Conference**: CVPR 2025
- **Code**: GitHub