WAN 2.2 FP8 Text Encoders

Optimized FP8 text encoder models for the WAN 2.2 video generation pipeline. These quantized encoders substantially reduce the memory footprint while maintaining high-quality text understanding for video generation tasks.

Model Description

This repository contains FP8-quantized text encoder models specifically optimized for the WAN 2.2 video generation system. The models enable text-to-video and image-to-video generation with substantially lower VRAM requirements compared to FP16 variants.

Key Features:

  • FP8 Quantization: Reduces model size by ~50% compared to FP16 with minimal quality loss
  • Dual Encoder Support: Includes both T5-XXL and UMT5-XXL encoders for flexible text understanding
  • Memory Efficient: Enables video generation on GPUs with 16GB+ VRAM
  • Drop-in Replacement: Compatible with WAN 2.2 diffusers pipeline

Capabilities:

  • Text-to-video generation with natural language prompts
  • Enhanced multilingual support (UMT5-XXL)
  • High-quality semantic understanding for video synthesis
  • Optimized for batch processing and long video generation

Repository Contents

wan22-fp8-encoders/
└── text_encoders/
    ├── t5-xxl-fp8.safetensors      # 4.6 GB - T5-XXL FP8 text encoder
    └── umt5-xxl-fp8.safetensors    # 6.3 GB - UMT5-XXL FP8 multilingual encoder

Total Repository Size: 11 GB

Model Files

| File | Size | Description | Use Case |
|------|------|-------------|----------|
| t5-xxl-fp8.safetensors | 4.6 GB | T5-XXL FP8 encoder | English text understanding |
| umt5-xxl-fp8.safetensors | 6.3 GB | UMT5-XXL FP8 encoder | Multilingual text support |
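
The safetensors header stores each tensor's dtype, so the FP8 claim can be verified without loading the weights. A minimal sketch that parses the header directly per the safetensors file format (the path is illustrative; adjust to your local layout):

import json
import struct

path = "text_encoders/t5-xxl-fp8.safetensors"  # illustrative local path

with open(path, "rb") as f:
    # safetensors layout: 8-byte little-endian header size, then a JSON header
    header_len = struct.unpack("<Q", f.read(8))[0]
    header = json.loads(f.read(header_len))

# Count tensors per dtype; FP8 weights typically appear as "F8_E4M3"
counts = {}
for name, meta in header.items():
    if name == "__metadata__":  # optional metadata entry, not a tensor
        continue
    counts[meta["dtype"]] = counts.get(meta["dtype"], 0) + 1
print(counts)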

Hardware Requirements

Minimum Requirements

  • VRAM: 16 GB (with FP8 encoders + base model)
  • System RAM: 32 GB recommended
  • Disk Space: 11 GB for encoders + additional space for base models
  • GPU: NVIDIA RTX 3090, RTX 4090, or better

Recommended Requirements

  • VRAM: 24 GB+ (for higher resolution and longer videos)
  • System RAM: 64 GB
  • Disk Space: 50 GB+ (including all WAN 2.2 components)
  • GPU: NVIDIA RTX 4090, A6000, or better
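
A quick pre-flight check against the minimums above, using only PyTorch and the standard library:

import shutil
import torch

# VRAM check (16 GB minimum with FP8 encoders + base model)
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"{props.name}: {vram_gb:.1f} GB VRAM",
          "OK" if vram_gb >= 16 else "below the 16 GB minimum")
else:
    print("No CUDA device found")

# Disk check (11 GB for the encoders alone); "." is an illustrative target dir
free_gb = shutil.disk_usage(".").free / 1024**3
print(f"Free disk: {free_gb:.1f} GB",
      "OK" if free_gb >= 11 else "below the 11 GB needed for the encoders")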

Performance Notes

  • FP8 encoders reduce VRAM usage by ~4-6 GB compared to FP16 (a back-of-the-envelope estimate is sketched after this list)
  • UMT5-XXL provides better multilingual support but uses more VRAM
  • T5-XXL is recommended for English-only workflows
  • Batch size and video length may require additional VRAM scaling
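
The ~4-6 GB figure follows from halving the bytes per parameter. A rough sketch (the encoder parameter counts are assumptions inferred from the file sizes, not measured values):

def weight_vram_gb(n_params: float, bytes_per_param: int) -> float:
    """Weight-only estimate; ignores activations and runtime overhead."""
    return n_params * bytes_per_param / 1024**3

# Assumed encoder-only parameter counts, roughly matching the FP8 file sizes
for name, n_params in [("T5-XXL encoder", 4.8e9), ("UMT5-XXL encoder", 6.7e9)]:
    fp16 = weight_vram_gb(n_params, 2)
    fp8 = weight_vram_gb(n_params, 1)
    print(f"{name}: FP16 ~{fp16:.1f} GB -> FP8 ~{fp8:.1f} GB "
          f"(saves ~{fp16 - fp8:.1f} GB)")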

Usage Examples

Basic Usage with Diffusers

from diffusers import WanPipeline
from diffusers.utils import export_to_video
import torch

# Load the WAN pipeline with the FP8 text encoder.
# Note: text_encoder_path is assumed here to be supported by your WAN 2.2
# pipeline build; stock diffusers may instead expect a preloaded text_encoder.
pipe = WanPipeline.from_pretrained(
    "path/to/wan22-base-model",
    text_encoder_path="E:/huggingface/wan22-fp8-encoders/text_encoders/t5-xxl-fp8.safetensors",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.to("cuda")

# Generate video from a text prompt
prompt = "A serene mountain landscape at sunset with clouds moving gently"
video = pipe(
    prompt=prompt,
    num_frames=48,
    height=512,
    width=512,
    num_inference_steps=30,
    guidance_scale=7.5
).frames[0]  # .frames is batched per prompt; take the first video

# Save the frames as an MP4 file
export_to_video(video, "output.mp4", fps=24)

Using UMT5-XXL for Multilingual Support

from diffusers import WanPipeline
import torch

# Load with the multilingual encoder (same caveat on text_encoder_path as above)
pipe = WanPipeline.from_pretrained(
    "path/to/wan22-base-model",
    text_encoder_path="E:/huggingface/wan22-fp8-encoders/text_encoders/umt5-xxl-fp8.safetensors",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.to("cuda")

# Generate with a non-English prompt
# ("Beautiful cherry blossom trees bloom in spring, petals drifting on the wind")
prompt = "美丽的樱花树在春天盛开，花瓣随风飘落"
video = pipe(prompt=prompt, num_frames=48).frames[0]

Memory-Efficient Configuration

import torch
from diffusers import WanPipeline

# Enable memory optimizations
pipe = WanPipeline.from_pretrained(
    "path/to/wan22-base-model",
    text_encoder_path="E:/huggingface/wan22-fp8-encoders/text_encoders/t5-xxl-fp8.safetensors",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.to("cuda")

# Enable memory-efficient attention
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()

# Generate with lower memory usage
video = pipe(
    prompt="Your prompt here",
    num_frames=24,  # Reduced frame count
    height=512,
    width=512
).frames[0]  # take the first video from the batched output
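
If VRAM is still tight after slicing, diffusers pipelines can also stream submodules to the GPU only while they run. Replace pipe.to("cuda") with the line below (do not call both):

# Keep weights in system RAM; each submodule moves to the GPU on demand
pipe.enable_model_cpu_offload()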

Model Specifications

T5-XXL FP8 Encoder

  • Architecture: T5 (Text-to-Text Transfer Transformer)
  • Size: XXL variant (encoder portion of the ~11-billion-parameter model)
  • Precision: FP8 (8-bit floating point)
  • Format: SafeTensors
  • Language: English-optimized
  • Context Length: 512 tokens
  • Embedding Dimension: 4096

UMT5-XXL FP8 Encoder

  • Architecture: UMT5 (Unified Multilingual T5)
  • Size: XXL variant (encoder portion of the ~11-billion-parameter model)
  • Precision: FP8 (8-bit floating point)
  • Format: SafeTensors
  • Languages: 100+ languages supported
  • Context Length: 512 tokens
  • Embedding Dimension: 4096
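
Both encoders share the 512-token context limit, so longer prompts are silently truncated. A minimal sketch of checking this, assuming the tokenizer from the upstream google/umt5-xxl checkpoint matches the quantized weights:

from transformers import AutoTokenizer

# The FP8 files ship weights only, so the tokenizer comes from upstream
tok = AutoTokenizer.from_pretrained("google/umt5-xxl")

prompt = "A serene mountain landscape at sunset " * 100  # deliberately long
ids = tok(prompt, truncation=True, max_length=512).input_ids
print(len(ids))  # 512: everything past the context limit was dropped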

Performance Tips and Optimization

Memory Optimization

  1. Use T5-XXL for English: Save 1.7 GB VRAM with T5-XXL vs UMT5-XXL
  2. Enable Attention Slicing: Reduces peak memory usage by 20-30%
  3. Enable VAE Slicing: Further reduces memory for longer videos
  4. Reduce Frame Count: Start with 24-48 frames for testing
  5. Lower Resolution: Use 512x512 instead of 1024x1024 for testing

Quality Optimization

  1. Increase Inference Steps: 30-50 steps for higher quality (default: 30)
  2. Adjust Guidance Scale: 7.0-9.0 range for better prompt adherence
  3. Use UMT5 for Complex Prompts: Better semantic understanding
  4. Longer Prompts: Detailed descriptions produce better results
  5. Seed Control: Use fixed seeds for reproducible results (see the sketch after this list)
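
Seed control in diffusers goes through a torch.Generator passed to the pipeline call; combined with a guidance scale in the suggested range, runs become repeatable (pipe is the pipeline loaded in the usage examples above):

import torch

# Fixed seed -> identical initial noise -> same output for the same settings
generator = torch.Generator(device="cuda").manual_seed(42)

video = pipe(
    prompt="A serene mountain landscape at sunset",
    num_frames=48,
    guidance_scale=8.0,  # within the suggested 7.0-9.0 range
    generator=generator,
).frames[0]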

Performance Benchmarks

| Configuration | VRAM Usage | Generation Time (48 frames) |
|---------------|------------|------------------------------|
| T5-XXL FP8 + Base Model | ~16 GB | ~120 seconds (RTX 4090) |
| UMT5-XXL FP8 + Base Model | ~18 GB | ~130 seconds (RTX 4090) |
| With Attention Slicing | -20% VRAM | +10% time |
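
To reproduce numbers like these on your own hardware, wrap a single generation in a timer and read PyTorch's peak-allocation counter (settings are illustrative; the counter tracks PyTorch allocations only, not total process VRAM):

import time
import torch

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()

video = pipe(prompt="A serene mountain landscape at sunset",
             num_frames=48, height=512, width=512).frames[0]

torch.cuda.synchronize()  # ensure all GPU work finished before stopping the clock
elapsed = time.perf_counter() - start
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"{elapsed:.0f} s, peak VRAM {peak_gb:.1f} GB")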

License

These FP8-quantized text encoders are derived from the original T5 and UMT5 models:

  • T5-XXL: Apache 2.0 License
  • UMT5-XXL: Apache 2.0 License
  • Quantization: Community contribution under Apache 2.0

License Terms: These models may be used for research and commercial purposes. Attribution to the original T5/UMT5 authors and the WAN project is appreciated but not required under Apache 2.0 terms.

Disclaimer: These are quantized versions optimized for memory efficiency. For critical applications, validate output quality against FP16 versions.

Citation

If you use these FP8 encoders in your research or applications, please cite:

@misc{wan22-fp8-encoders,
  title={WAN 2.2 FP8 Text Encoders},
  author={WAN Community Contributors},
  year={2024},
  publisher={Hugging Face},
  note={FP8-quantized T5-XXL and UMT5-XXL encoders for video generation}
}

@article{t5,
  title={Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  author={Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J},
  journal={Journal of Machine Learning Research},
  volume={21},
  number={140},
  pages={1--67},
  year={2020}
}

@article{umt5,
  title={UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining},
  author={Chung, Hyung Won and Constant, Noah and Garcia, Xavier and Roberts, Adam and Tay, Yi and Narang, Sharan and Firat, Orhan},
  journal={arXiv preprint arXiv:2304.09151},
  year={2023}
}

Contact and Resources

Related Models

  • WAN 2.2 Base Model: Full video generation pipeline
  • WAN 2.2 VAE: Video autoencoder
  • WAN Enhancement LoRAs: Camera control, lighting, quality improvements

Support

For questions, issues, or feature requests:

  1. Check the official documentation
  2. Search existing issues
  3. Join the community Discord
  4. Open a new issue with detailed information

Note: These FP8 encoders are part of the WAN 2.2 ecosystem. Ensure you have the complete WAN 2.2 pipeline installed for full functionality. Visit the official repository for installation instructions and additional components.
