# MUG-V 10B Training Checkpoints

Pre-trained Megatron-format checkpoints for the MUG-V 10B video generation model.
## Available Checkpoints

### MUG-V-10B-torch_dist (Recommended)

Torch Distributed checkpoint with flexible parallelism support.

- Format: Torch Distributed (`.distcp`)
- Parallelism: can be loaded with any TP/PP configuration
- Use Case: production training, flexible distributed setup

```shell
huggingface-cli download MUG-V/MUG-V-training --local-dir ./checkpoints --include "MUG-V-10B-torch_dist/*"
```
### MUG-V-10B-TP4-legacy

Torch-format checkpoint (legacy) with fixed TP=4.

- Format: Torch format (`mp_rank_XX/model_optim_rng.pt`)
- Parallelism: must be loaded with TP=4
- Use Case: fixed TP setup, or conversion to Torch Distributed

```shell
huggingface-cli download MUG-V/MUG-V-training --local-dir ./checkpoints --include "MUG-V-10B-TP4-legacy/*"
```
## Quick Start

### Option 1: Direct Training

Use the Torch Distributed checkpoint directly for training:

```shell
# Download checkpoint
huggingface-cli download MUG-V/MUG-V-training --local-dir ./checkpoints --include "MUG-V-10B-torch_dist/*"

# Download sample data
huggingface-cli download MUG-V/MUG-V-Training-Samples --repo-type dataset --local-dir ./sample_dataset

# Set environment variables
export CHECKPOINT_DIR="./checkpoints/MUG-V-10B-torch_dist/torch_dist"
export MODEL_TYPE="mugdit_10b"
export DATA_TRAIN="./sample_dataset/train.csv"

# Start training (8 GPUs)
bash examples/mugv/pretrain_slurm.sh
```
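Before launching a multi-GPU job, it can save time to verify that the variables and paths above actually resolve. A minimal preflight sketch (the `preflight` helper is hypothetical, not part of the training scripts; it checks only the three variables used above):

```python
import os
from pathlib import Path

def preflight(env: dict) -> list[str]:
    """Return a list of problems with the training environment.

    Minimal sketch: checks only CHECKPOINT_DIR, MODEL_TYPE, and DATA_TRAIN
    as exported above; the actual launch script may require more variables.
    """
    problems = []
    for var in ("CHECKPOINT_DIR", "MODEL_TYPE", "DATA_TRAIN"):
        if not env.get(var):
            problems.append(f"{var} is not set")
    ckpt = env.get("CHECKPOINT_DIR")
    if ckpt and not Path(ckpt).is_dir():
        problems.append(f"checkpoint dir not found: {ckpt}")
    data = env.get("DATA_TRAIN")
    if data and not Path(data).is_file():
        problems.append(f"train csv not found: {data}")
    return problems

if __name__ == "__main__":
    for issue in preflight(dict(os.environ)):
        print("WARNING:", issue)
```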
### Option 2: Convert to HuggingFace Format

Convert a Megatron checkpoint to HuggingFace format for inference:

```shell
python -m examples.mugv.convertor.mugdit_mcore2hf \
    --dcp-dir ./checkpoints/MUG-V-10B-torch_dist/torch_dist/iter_0000000 \
    --output ./mugdit_10b_hf.pt \
    --model-size 10B
```
## Checkpoint Formats Comparison

| Format | Parallelism | File Structure | Training | Conversion |
|---|---|---|---|---|
| Torch Distributed | Flexible TP/PP | `*.distcp` files | ✅ Recommended | ✅ To HF |
| Torch (Legacy) | Fixed TP=4 | `mp_rank_XX/` dirs | ⚠️ TP=4 only | ✅ To Torch Dist / HF |
| HuggingFace | None (inference) | Single `.pt` file | ❌ Not for training | - |
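The three layouts can be told apart from their file structure alone. A small sketch of that heuristic (the `detect_ckpt_format` helper is hypothetical and only keys off the structures listed in the table; real checkpoint directories may contain extra metadata files):

```python
from pathlib import Path

def detect_ckpt_format(path: str) -> str:
    """Guess which of the three checkpoint layouts a path uses."""
    p = Path(path)
    if not p.exists():
        return "unknown"
    if p.is_file() and p.suffix == ".pt":
        return "huggingface"      # single .pt file
    if any(p.glob("**/*.distcp")):
        return "torch_dist"       # Torch Distributed shards
    if any(c.name.startswith("mp_rank_") for c in p.iterdir() if c.is_dir()):
        return "torch_legacy"     # mp_rank_XX/ directories
    return "unknown"
```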
## Model Architecture

- Parameters: ~10 billion
- Architecture: Diffusion Transformer (DiT)
- Hidden Size: 3456
- Attention Heads: 48
- Layers: 56
- Compression: VideoVAE 8×8×8
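As a back-of-envelope sanity check, the listed dimensions are consistent with the ~10B parameter count: standard transformer blocks contribute roughly 12·h² parameters per layer (4h² attention + 8h² MLP with a 4× expansion). This sketch assumes that standard block shape; conditioning layers, embeddings, and projections (not counted here) make up the remainder of the total:

```python
# Transformer-block parameters from the architecture numbers above.
hidden = 3456
layers = 56
per_block = 12 * hidden**2        # attention (~4h^2) + MLP (~8h^2), assumed shape
total_blocks = layers * per_block
print(f"~{total_blocks / 1e9:.1f}B parameters in transformer blocks")

# VideoVAE 8x8x8 compression: each spatio-temporal axis shrinks by 8.
frames, height, width = 16, 256, 256   # example clip size (illustrative)
latent = (frames // 8, height // 8, width // 8)
print("latent shape (T, H, W):", latent)
```

The blocks alone land around 8B; the stated ~10B total includes the components excluded above.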
## Related Resources

- Training Code: MUG-V-Megatron-LM-Training
- Inference Code: MUG-V
- Inference Weights: MUG-V-inference
- Sample Dataset: MUG-V-Training-Samples

## Documentation

- Training Guide: `examples/mugv/README.md`
- Checkpoint Conversion: Conversion Guide
## Citation

```bibtex
@article{zhang2025mugv10b,
  title={MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models},
  author={Zhang, Yongshun and Fan, Zhongyi and Zhang, Yonghang and Li, Zhangzikang and Chen, Weifeng and Feng, Zhongwei and Wang, Chaoyue and Hou, Peng and Zeng, Anxiang},
  journal={arXiv preprint},
  year={2025}
}
```
## License

Apache License 2.0

Developed by the Shopee Multimodal Understanding and Generation (MUG) Team.