WAN 2.2 FP8 Text Encoders

Optimized FP8 text encoder models for the WAN 2.2 video generation pipeline. These quantized encoders substantially reduce the memory footprint while maintaining high-quality text understanding for video generation tasks.

Model Description

This repository contains FP8-quantized text encoder models specifically optimized for the WAN 2.2 video generation system. The models enable text-to-video and image-to-video generation with substantially lower VRAM requirements compared to FP16 variants.

Key Features:

  • FP8 Quantization: Reduces model size by ~50% compared to FP16 with minimal quality loss
  • Dual Encoder Support: Includes both T5-XXL and UMT5-XXL encoders for flexible text understanding
  • Memory Efficient: Enables video generation on GPUs with 16GB+ VRAM
  • Drop-in Replacement: Compatible with WAN 2.2 diffusers pipeline

Capabilities:

  • Text-to-video generation with natural language prompts
  • Enhanced multilingual support (UMT5-XXL)
  • High-quality semantic understanding for video synthesis
  • Optimized for batch processing and long video generation

Repository Contents

wan22-fp8-encoders/
└── text_encoders/
    ├── t5-xxl-fp8.safetensors      # 4.6 GB - T5-XXL FP8 text encoder
    └── umt5-xxl-fp8.safetensors    # 6.3 GB - UMT5-XXL FP8 multilingual encoder

Total Repository Size: 11 GB

Model Files

| File | Size | Description | Use Case |
|------|------|-------------|----------|
| t5-xxl-fp8.safetensors | 4.6 GB | T5-XXL FP8 encoder | English text understanding |
| umt5-xxl-fp8.safetensors | 6.3 GB | UMT5-XXL FP8 encoder | Multilingual text support |
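
The safetensors header stores each tensor's dtype, so the FP8 claim can be verified without loading the weights. A minimal sketch that parses the header directly per the safetensors file format (the path is illustrative; adjust to your local layout):

import json
import struct

path = "text_encoders/t5-xxl-fp8.safetensors"  # illustrative local path

with open(path, "rb") as f:
    # safetensors layout: 8-byte little-endian header size, then a JSON header
    header_len = struct.unpack("<Q", f.read(8))[0]
    header = json.loads(f.read(header_len))

# Count tensors per dtype; FP8 weights typically appear as "F8_E4M3"
counts = {}
for name, meta in header.items():
    if name == "__metadata__":  # optional metadata entry, not a tensor
        continue
    counts[meta["dtype"]] = counts.get(meta["dtype"], 0) + 1
print(counts)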

Hardware Requirements

Minimum Requirements

  • VRAM: 16 GB (with FP8 encoders + base model)
  • System RAM: 32 GB recommended
  • Disk Space: 11 GB for encoders + additional space for base models
  • GPU: NVIDIA RTX 3090, RTX 4090, or better

Recommended Requirements

  • VRAM: 24 GB+ (for higher resolution and longer videos)
  • System RAM: 64 GB
  • Disk Space: 50 GB+ (including all WAN 2.2 components)
  • GPU: NVIDIA RTX 4090, A6000, or better
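
A quick pre-flight check against the minimums above, using only PyTorch and the standard library:

import shutil
import torch

# VRAM check (16 GB minimum with FP8 encoders + base model)
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"{props.name}: {vram_gb:.1f} GB VRAM",
          "OK" if vram_gb >= 16 else "below the 16 GB minimum")
else:
    print("No CUDA device found")

# Disk check (11 GB for the encoders alone); "." is an illustrative target dir
free_gb = shutil.disk_usage(".").free / 1024**3
print(f"Free disk: {free_gb:.1f} GB",
      "OK" if free_gb >= 11 else "below the 11 GB needed for the encoders")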

Performance Notes

  • FP8 encoders reduce VRAM usage by ~4-6 GB compared to FP16 (a back-of-the-envelope estimate is sketched after this list)
  • UMT5-XXL provides better multilingual support but uses more VRAM
  • T5-XXL is recommended for English-only workflows
  • Batch size and video length may require additional VRAM scaling
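
The ~4-6 GB figure follows from halving the bytes per parameter. A rough sketch (the encoder parameter counts are assumptions inferred from the file sizes, not measured values):

def weight_vram_gb(n_params: float, bytes_per_param: int) -> float:
    """Weight-only estimate; ignores activations and runtime overhead."""
    return n_params * bytes_per_param / 1024**3

# Assumed encoder-only parameter counts, roughly matching the FP8 file sizes
for name, n_params in [("T5-XXL encoder", 4.8e9), ("UMT5-XXL encoder", 6.7e9)]:
    fp16 = weight_vram_gb(n_params, 2)
    fp8 = weight_vram_gb(n_params, 1)
    print(f"{name}: FP16 ~{fp16:.1f} GB -> FP8 ~{fp8:.1f} GB "
          f"(saves ~{fp16 - fp8:.1f} GB)")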

Usage Examples

Basic Usage with Diffusers

from diffusers import WanPipeline
from diffusers.utils import export_to_video
import torch

# Load the WAN pipeline with the FP8 text encoder.
# Note: text_encoder_path is assumed here to be supported by your WAN 2.2
# pipeline build; stock diffusers may instead expect a preloaded text_encoder.
pipe = WanPipeline.from_pretrained(
    "path/to/wan22-base-model",
    text_encoder_path="E:/huggingface/wan22-fp8-encoders/text_encoders/t5-xxl-fp8.safetensors",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.to("cuda")

# Generate video from a text prompt
prompt = "A serene mountain landscape at sunset with clouds moving gently"
video = pipe(
    prompt=prompt,
    num_frames=48,
    height=512,
    width=512,
    num_inference_steps=30,
    guidance_scale=7.5
).frames[0]  # .frames is batched per prompt; take the first video

# Save the frames as an MP4 file
export_to_video(video, "output.mp4", fps=24)

Using UMT5-XXL for Multilingual Support

from diffusers import WanPipeline
import torch

# Load with the multilingual encoder (same caveat on text_encoder_path as above)
pipe = WanPipeline.from_pretrained(
    "path/to/wan22-base-model",
    text_encoder_path="E:/huggingface/wan22-fp8-encoders/text_encoders/umt5-xxl-fp8.safetensors",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.to("cuda")

# Generate with a non-English prompt
# ("Beautiful cherry blossom trees bloom in spring, petals drifting on the wind")
prompt = "美丽的樱花树在春天盛开，花瓣随风飘落"
video = pipe(prompt=prompt, num_frames=48).frames[0]

Memory-Efficient Configuration

import torch
from diffusers import WanPipeline

# Enable memory optimizations
pipe = WanPipeline.from_pretrained(
    "path/to/wan22-base-model",
    text_encoder_path="E:/huggingface/wan22-fp8-encoders/text_encoders/t5-xxl-fp8.safetensors",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.to("cuda")

# Enable memory-efficient attention
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()

# Generate with lower memory usage
video = pipe(
    prompt="Your prompt here",
    num_frames=24,  # Reduced frame count
    height=512,
    width=512
).frames[0]  # take the first video from the batched output
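
If VRAM is still tight after slicing, diffusers pipelines can also stream submodules to the GPU only while they run. Replace pipe.to("cuda") with the line below (do not call both):

# Keep weights in system RAM; each submodule moves to the GPU on demand
pipe.enable_model_cpu_offload()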

Model Specifications

T5-XXL FP8 Encoder

  • Architecture: T5 (Text-to-Text Transfer Transformer)
  • Size: XXL variant (encoder portion of the ~11-billion-parameter model)
  • Precision: FP8 (8-bit floating point)
  • Format: SafeTensors
  • Language: English-optimized
  • Context Length: 512 tokens
  • Embedding Dimension: 4096

UMT5-XXL FP8 Encoder

  • Architecture: UMT5 (Unified Multilingual T5)
  • Size: XXL variant (encoder portion of the ~11-billion-parameter model)
  • Precision: FP8 (8-bit floating point)
  • Format: SafeTensors
  • Languages: 100+ languages supported
  • Context Length: 512 tokens
  • Embedding Dimension: 4096
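
Both encoders share the 512-token context limit, so longer prompts are silently truncated. A minimal sketch of checking this, assuming the tokenizer from the upstream google/umt5-xxl checkpoint matches the quantized weights:

from transformers import AutoTokenizer

# The FP8 files ship weights only, so the tokenizer comes from upstream
tok = AutoTokenizer.from_pretrained("google/umt5-xxl")

prompt = "A serene mountain landscape at sunset " * 100  # deliberately long
ids = tok(prompt, truncation=True, max_length=512).input_ids
print(len(ids))  # 512: everything past the context limit was dropped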

Performance Tips and Optimization

Memory Optimization

  1. Use T5-XXL for English: Save 1.7 GB VRAM with T5-XXL vs UMT5-XXL
  2. Enable Attention Slicing: Reduces peak memory usage by 20-30%
  3. Enable VAE Slicing: Further reduces memory for longer videos
  4. Reduce Frame Count: Start with 24-48 frames for testing
  5. Lower Resolution: Use 512x512 instead of 1024x1024 for testing

Quality Optimization

  1. Increase Inference Steps: 30-50 steps for higher quality (default: 30)
  2. Adjust Guidance Scale: 7.0-9.0 range for better prompt adherence
  3. Use UMT5 for Complex Prompts: Better semantic understanding
  4. Longer Prompts: Detailed descriptions produce better results
  5. Seed Control: Use fixed seeds for reproducible results (see the sketch after this list)
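
Seed control in diffusers goes through a torch.Generator passed to the pipeline call; combined with a guidance scale in the suggested range, runs become repeatable (pipe is the pipeline loaded in the usage examples above):

import torch

# Fixed seed -> identical initial noise -> same output for the same settings
generator = torch.Generator(device="cuda").manual_seed(42)

video = pipe(
    prompt="A serene mountain landscape at sunset",
    num_frames=48,
    guidance_scale=8.0,  # within the suggested 7.0-9.0 range
    generator=generator,
).frames[0]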

Performance Benchmarks

| Configuration | VRAM Usage | Generation Time (48 frames) |
|---------------|------------|------------------------------|
| T5-XXL FP8 + Base Model | ~16 GB | ~120 seconds (RTX 4090) |
| UMT5-XXL FP8 + Base Model | ~18 GB | ~130 seconds (RTX 4090) |
| With Attention Slicing | -20% VRAM | +10% time |
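
To reproduce numbers like these on your own hardware, wrap a single generation in a timer and read PyTorch's peak-allocation counter (settings are illustrative; the counter tracks PyTorch allocations only, not total process VRAM):

import time
import torch

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()

video = pipe(prompt="A serene mountain landscape at sunset",
             num_frames=48, height=512, width=512).frames[0]

torch.cuda.synchronize()  # ensure all GPU work finished before stopping the clock
elapsed = time.perf_counter() - start
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"{elapsed:.0f} s, peak VRAM {peak_gb:.1f} GB")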

License

These FP8-quantized text encoders are derived from the original T5 and UMT5 models:

  • T5-XXL: Apache 2.0 License
  • UMT5-XXL: Apache 2.0 License
  • Quantization: Community contribution under Apache 2.0

License Terms: These models may be used for research and commercial purposes. Attribution to the original T5/UMT5 authors and the WAN project is appreciated but not required under Apache 2.0 terms.

Disclaimer: These are quantized versions optimized for memory efficiency. For critical applications, validate output quality against FP16 versions.

Citation

If you use these FP8 encoders in your research or applications, please cite:

@misc{wan22-fp8-encoders,
  title={WAN 2.2 FP8 Text Encoders},
  author={WAN Community Contributors},
  year={2024},
  publisher={Hugging Face},
  note={FP8-quantized T5-XXL and UMT5-XXL encoders for video generation}
}

@article{t5,
  title={Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  author={Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J},
  journal={Journal of Machine Learning Research},
  volume={21},
  number={140},
  pages={1--67},
  year={2020}
}

@article{umt5,
  title={UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining},
  author={Chung, Hyung Won and Constant, Noah and Garcia, Xavier and Roberts, Adam and Tay, Yi and Narang, Sharan and Firat, Orhan},
  journal={arXiv preprint arXiv:2304.09151},
  year={2023}
}

Contact and Resources

Related Models

  • WAN 2.2 Base Model: Full video generation pipeline
  • WAN 2.2 VAE: Video autoencoder
  • WAN Enhancement LoRAs: Camera control, lighting, quality improvements

Support

For questions, issues, or feature requests:

  1. Check the official documentation
  2. Search existing issues
  3. Join the community Discord
  4. Open a new issue with detailed information

Note: These FP8 encoders are part of the WAN 2.2 ecosystem. Ensure you have the complete WAN 2.2 pipeline installed for full functionality. Visit the official repository for installation instructions and additional components.
