MUG-V 10B Training Checkpoints

Pre-trained Megatron-format checkpoints for the MUG-V 10B video generation model.

Available Checkpoints

MUG-V-10B-torch_dist (Recommended)

Torch Distributed Checkpoint - Flexible parallelism support

  • Format: Torch Distributed (.distcp)
  • Parallelism: Can be loaded with any TP/PP configuration
  • Use Case: Production training, flexible distributed setup
huggingface-cli download MUG-V/MUG-V-training --local-dir ./checkpoints --include "MUG-V-10B-torch_dist/*"
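Because the Torch Distributed format is not tied to a fixed parallel layout, the same checkpoint can be resumed under different TP/PP splits via Megatron-LM's standard flags. A hedged sketch (the training entry point `pretrain_mugv.py` is a placeholder; use your actual launch script):

```shell
# Hypothetical sketch: resume the torch_dist checkpoint under TP=2, PP=2 on 8 GPUs.
# --load, --ckpt-format, --tensor-model-parallel-size, and
# --pipeline-model-parallel-size are standard Megatron-LM flags.
torchrun --nproc_per_node=8 pretrain_mugv.py \
    --load ./checkpoints/MUG-V-10B-torch_dist/torch_dist \
    --ckpt-format torch_dist \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 2
```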

MUG-V-10B-TP4-legacy

Torch Format (Legacy) - Fixed TP=4

  • Format: Torch format (mp_rank_XX/model_optim_rng.pt)
  • Parallelism: Must be loaded with TP=4
  • Use Case: Fixed TP setup or conversion to Torch Distributed
huggingface-cli download MUG-V/MUG-V-training --local-dir ./checkpoints --include "MUG-V-10B-TP4-legacy/*"

Quick Start

Option 1: Direct Training

Use the Torch Distributed checkpoint directly for training:

# Download checkpoint
huggingface-cli download MUG-V/MUG-V-training --local-dir ./checkpoints --include "MUG-V-10B-torch_dist/*"

# Download sample data
huggingface-cli download MUG-V/MUG-V-Training-Samples --repo-type dataset --local-dir ./sample_dataset

# Set environment variables
export CHECKPOINT_DIR="./checkpoints/MUG-V-10B-torch_dist/torch_dist"
export MODEL_TYPE="mugdit_10b"
export DATA_TRAIN="./sample_dataset/train.csv"

# Start training (8 GPUs)
bash examples/mugv/pretrain_slurm.sh

Option 2: Convert to HuggingFace Format

Convert Megatron checkpoint to HuggingFace format for inference:

python -m examples.mugv.convertor.mugdit_mcore2hf \
    --dcp-dir ./checkpoints/MUG-V-10B-torch_dist/torch_dist/iter_0000000 \
    --output ./mugdit_10b_hf.pt \
    --model-size 10B
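The conversion writes a single .pt file that can be inspected with plain `torch.load`. A minimal, self-contained sketch using a dummy stand-in file (the real checkpoint's key layout, e.g. a top-level "model" dict and the parameter name shown, is an assumption, not the documented schema):

```python
import torch

# Dummy stand-in for the converted checkpoint; in practice, point
# torch.load at ./mugdit_10b_hf.pt. The "model" key and the parameter
# name below are hypothetical, for illustration only.
dummy = {"model": {"blocks.0.attn.qkv.weight": torch.zeros(4, 4)}}
torch.save(dummy, "demo_checkpoint.pt")

# Load on CPU and list parameter names with their shapes.
ckpt = torch.load("demo_checkpoint.pt", map_location="cpu")
for name, tensor in ckpt["model"].items():
    print(name, tuple(tensor.shape))
```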

Checkpoint Formats Comparison

Format              Parallelism        File Structure       Training              Conversion
Torch Distributed   Flexible TP/PP     *.distcp files       ✅ Recommended        ✅ To HF
Torch (Legacy)      Fixed TP=4         mp_rank_XX/ dirs     ⚠️ TP=4 only          ✅ To Torch Dist / HF
HuggingFace         None (inference)   Single .pt file      ❌ Not for training   -

Model Architecture

  • Parameters: ~10 billion
  • Architecture: Diffusion Transformer (DiT)
  • Hidden Size: 3456
  • Attention Heads: 48
  • Layers: 56
  • Compression: VideoVAE 8×8×8
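As a sanity check on the figures above, a back-of-envelope parameter count for a vanilla transformer stack (~4h² for attention projections plus ~8h² for a 4×-expansion MLP per layer, ignoring embeddings, adaLN modulation, and norms) lands in the right ballpark:

```python
# Lower-bound sketch of the parameter count, not the exact 10B figure.
hidden_size = 3456
num_layers = 56

params_per_layer = 12 * hidden_size ** 2   # 4h^2 (attn) + 8h^2 (MLP)
total = num_layers * params_per_layer
print(f"~{total / 1e9:.1f}B core transformer parameters")  # ~8.0B
```

The remaining gap to ~10B is consistent with the components the estimate deliberately omits.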

Related Resources

Documentation

Citation

@article{zhang2025mugv10b,
  title={MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models},
  author={Zhang, Yongshun and Fan, Zhongyi and Zhang, Yonghang and Li, Zhangzikang and Chen, Weifeng and Feng, Zhongwei and Wang, Chaoyue and Hou, Peng and Zeng, Anxiang},
  journal={arXiv preprint},
  year={2025}
}

License

Apache License 2.0


Developed by Shopee Multimodal Understanding and Generation (MUG) Team
