WAN 2.2 QX Text Encoders (GGUF)
Quantized text encoder models for WAN (World Animated Network) 2.2 QX video generation system. These GGUF-format encoders provide efficient text-to-embedding conversion for text-to-video and image-to-video generation workflows with significantly reduced VRAM requirements.
Model Description
This repository contains 8 quantized variants of the UMT5-XXL text encoder, optimized for WAN 2.2 QX video generation pipelines. The GGUF format enables efficient inference with reduced memory footprint while maintaining high-quality text understanding for video generation prompts.
Key Features:
- Multiple Precision Levels: Q3_K_S to Q8_0 quantization options
- VRAM Optimization: 30-70% memory reduction compared to FP16
- Quality vs Size Trade-offs: Choose optimal balance for your hardware
- Direct Integration: Compatible with WAN 2.2 QX diffusers pipeline
- UMT5-XXL Architecture: Advanced multilingual text understanding
Repository Contents
Text Encoders (text_encoders/)
| File | Size | Quantization | VRAM | Quality |
|---|---|---|---|---|
| umt5-xxl-encoder-q3-k-s.gguf | 2.7 GB | Q3_K_S | ~3 GB | Good |
| umt5-xxl-encoder-q3-k-m.gguf | 2.9 GB | Q3_K_M | ~3.5 GB | Better |
| umt5-xxl-encoder-q4-k-s.gguf | 3.3 GB | Q4_K_S | ~4 GB | Very Good |
| umt5-xxl-encoder-q4-k-m.gguf | 3.5 GB | Q4_K_M | ~4.5 GB | Excellent |
| umt5-xxl-encoder-q5-k-s.gguf | 3.8 GB | Q5_K_S | ~5 GB | Excellent |
| umt5-xxl-encoder-q5-k-m.gguf | 3.9 GB | Q5_K_M | ~5.5 GB | Near-Original |
| umt5-xxl-encoder-q6-k.gguf | 4.4 GB | Q6_K | ~6 GB | Near-Original |
| umt5-xxl-encoder-q8-0.gguf | 5.7 GB | Q8_0 | ~7.5 GB | Original Quality |
Total Repository Size: ~30.2 GB
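If you only need one variant, you can download a single file instead of cloning the whole repository by using the huggingface_hub client. A minimal sketch; the repo_id below is a placeholder for this repository's actual ID.

from huggingface_hub import hf_hub_download

# repo_id is a placeholder; substitute this repository's actual ID
encoder_path = hf_hub_download(
    repo_id="your-org/wan22-qx-encoders-gguf",
    filename="text_encoders/umt5-xxl-encoder-q4-k-m.gguf",
)
print(encoder_path)  # local cache path of the downloaded .gguf file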
Hardware Requirements
Minimum Requirements
- VRAM: 4 GB (Q3_K_S quantization)
- RAM: 8 GB system memory
- Disk Space: 3 GB (single encoder variant)
- GPU: NVIDIA RTX 2060 or equivalent (CUDA support recommended)
Recommended Requirements
- VRAM: 8+ GB (Q4_K_M or higher quantization)
- RAM: 16 GB system memory
- Disk Space: 10 GB (multiple variants for testing)
- GPU: NVIDIA RTX 3060 Ti or better
Optimal Requirements
- VRAM: 12+ GB (Q6_K or Q8_0 quantization)
- RAM: 32 GB system memory
- Disk Space: 30 GB (full repository)
- GPU: NVIDIA RTX 4070 Ti or better
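To check which tier your GPU falls into, you can query total VRAM with PyTorch. A minimal sketch; the thresholds simply mirror the tiers listed above.

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    if vram_gb >= 12:
        print("Optimal tier: Q6_K or Q8_0")
    elif vram_gb >= 8:
        print("Recommended tier: Q4_K_M or higher")
    else:
        print("Minimum tier: Q3_K_S / Q3_K_M")
else:
    print("No CUDA-capable GPU detected")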
Usage Examples
Basic Text Encoder Loading
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
import torch

# Load WAN 2.2 QX pipeline with quantized text encoder
pipe = DiffusionPipeline.from_pretrained(
    "path/to/wan-2.2-qx",
    text_encoder_path="E:/huggingface/wan22-qx-encoders-gguf/text_encoders/umt5-xxl-encoder-q4-k-m.gguf",
    torch_dtype=torch.float16,
    variant="fp16"
)

# Move pipeline to GPU
pipe = pipe.to("cuda")

# Generate video from text
prompt = "A serene mountain landscape at sunset with flowing clouds"
video_frames = pipe(
    prompt=prompt,
    num_frames=24,
    num_inference_steps=30
).frames[0]  # frames for the first (and only) prompt

# Save video using diffusers' export_to_video helper
export_to_video(video_frames, "output.mp4", fps=8)
Memory-Optimized Configuration (Low VRAM)
from diffusers import DiffusionPipeline
import torch
# Use Q3_K_S encoder for minimum VRAM usage
pipe = DiffusionPipeline.from_pretrained(
    "path/to/wan-2.2-qx",
    text_encoder_path="E:/huggingface/wan22-qx-encoders-gguf/text_encoders/umt5-xxl-encoder-q3-k-s.gguf",
    torch_dtype=torch.float16
)

# Enable memory optimizations
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
pipe = pipe.to("cuda")

# Generate with lower resolution
video_frames = pipe(
    prompt="A cat playing with a ball of yarn",
    height=512,
    width=512,
    num_frames=16,
    num_inference_steps=25
).frames[0]
Quality-Optimized Configuration (High VRAM)
from diffusers import DiffusionPipeline
import torch
# Use Q8_0 encoder for maximum quality
pipe = DiffusionPipeline.from_pretrained(
    "path/to/wan-2.2-qx",
    text_encoder_path="E:/huggingface/wan22-qx-encoders-gguf/text_encoders/umt5-xxl-encoder-q8-0.gguf",
    torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate high-quality video
video_frames = pipe(
    prompt="Cinematic shot of a futuristic cityscape with flying vehicles",
    height=1024,
    width=1024,
    num_frames=48,
    num_inference_steps=50,
    guidance_scale=7.5
).frames[0]
Batch Processing with Different Encoders
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Test different quantization levels
encoders = {
    "q3_k_m": "E:/huggingface/wan22-qx-encoders-gguf/text_encoders/umt5-xxl-encoder-q3-k-m.gguf",
    "q4_k_m": "E:/huggingface/wan22-qx-encoders-gguf/text_encoders/umt5-xxl-encoder-q4-k-m.gguf",
    "q5_k_m": "E:/huggingface/wan22-qx-encoders-gguf/text_encoders/umt5-xxl-encoder-q5-k-m.gguf",
}
prompt = "A beautiful garden with blooming flowers in spring"

for name, encoder_path in encoders.items():
    pipe = DiffusionPipeline.from_pretrained(
        "path/to/wan-2.2-qx",
        text_encoder_path=encoder_path,
        torch_dtype=torch.float16
    ).to("cuda")
    video = pipe(prompt=prompt, num_frames=24).frames[0]
    export_to_video(video, f"output_{name}.mp4", fps=8)
    # Clear VRAM before loading the next encoder
    del pipe
    torch.cuda.empty_cache()
Model Specifications
Architecture
- Base Model: UMT5-XXL (Unified Multilingual T5)
- Parameters: ~13 billion for the full UMT5-XXL model (these files contain only the encoder stack)
- Context Length: 512 tokens
- Vocabulary Size: 250,000+ tokens
- Language Support: Multilingual (100+ languages)
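Because the context length is 512 tokens, longer prompts are truncated by the encoder. A hedged sketch for checking prompt length ahead of time, assuming the upstream google/umt5-xxl tokenizer on the Hub matches the one these encoders were quantized from:

from transformers import AutoTokenizer

# Assumption: the upstream UMT5-XXL tokenizer matches these encoders
tokenizer = AutoTokenizer.from_pretrained("google/umt5-xxl")

prompt = "A serene mountain landscape at sunset with flowing clouds"
num_tokens = len(tokenizer(prompt).input_ids)
print(f"{num_tokens} tokens (encoder context limit: 512)")
if num_tokens > 512:
    print("Prompt will be truncated by the text encoder")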
Quantization Details
| Level | Bits | Method | Quality | Use Case |
|---|---|---|---|---|
| Q3_K_S | 3-bit | K-quant Small | 85% | Minimum VRAM, prototyping |
| Q3_K_M | 3-bit | K-quant Medium | 87% | Low VRAM, good quality |
| Q4_K_S | 4-bit | K-quant Small | 92% | Balanced VRAM/quality |
| Q4_K_M | 4-bit | K-quant Medium | 94% | Recommended default |
| Q5_K_S | 5-bit | K-quant Small | 96% | High quality, moderate VRAM |
| Q5_K_M | 5-bit | K-quant Medium | 97% | High quality production |
| Q6_K | 6-bit | K-quant | 98% | Near-lossless quality |
| Q8_0 | 8-bit | Round-to-nearest, per-block scale | 99% | Maximum quality |
Format
- File Format: GGUF (GPT-Generated Unified Format)
- Precision: Mixed precision quantization (K-quant variants)
- Compression: No additional compression; the size reduction comes from quantization, which is lossy relative to FP16
- Compatibility: llama.cpp ecosystem, diffusers integration
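The gguf Python package (published from the llama.cpp repository) can be used to inspect a file's header metadata and per-tensor quantization types. A minimal sketch; attribute names follow the gguf reader API as of recent releases:

from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("text_encoders/umt5-xxl-encoder-q4-k-m.gguf")

# Key/value metadata stored in the GGUF header
for key in reader.fields:
    print(key)

# Tensor names, shapes, and quantization types
for tensor in reader.tensors:
    print(tensor.name, list(tensor.shape), tensor.tensor_type.name)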
Performance Tips
Quantization Selection Guide
- 8GB VRAM or less: Use Q3_K_M or Q4_K_S
- 12GB VRAM: Use Q4_K_M or Q5_K_S (recommended)
- 16GB VRAM: Use Q5_K_M or Q6_K
- 24GB+ VRAM: Use Q8_0 for maximum quality
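The guide above can be folded into a small helper that picks a variant from the GPU's total VRAM. A sketch only; the filenames come from this repository and the thresholds follow the list above.

import torch

# Minimum VRAM (GB) -> encoder filename, highest quality first
VARIANTS = [
    (24, "umt5-xxl-encoder-q8-0.gguf"),
    (16, "umt5-xxl-encoder-q5-k-m.gguf"),
    (12, "umt5-xxl-encoder-q4-k-m.gguf"),
    (8, "umt5-xxl-encoder-q4-k-s.gguf"),
    (0, "umt5-xxl-encoder-q3-k-m.gguf"),
]

def pick_encoder() -> str:
    """Return the encoder filename suited to the current GPU's total VRAM."""
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    for min_gb, filename in VARIANTS:
        if vram_gb >= min_gb:
            return filename
    return VARIANTS[-1][1]

print(pick_encoder())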
Optimization Strategies
Memory Optimization:
- Enable enable_attention_slicing() for low VRAM
- Use enable_vae_slicing() for large videos
- Reduce num_frames and resolution for faster generation

Quality Optimization:
- Q4_K_M offers best quality/performance balance
- Q6_K or Q8_0 for production-quality outputs
- Higher num_inference_steps improves coherence
Speed Optimization:
- Lower quantization levels (Q3_K) are slightly faster
- Reduce inference steps for draft iterations
- Use torch.compile() for additional speedup (PyTorch 2.0+)
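If the loaded pipeline exposes its denoiser as a module attribute, torch.compile can be applied to it once after loading. A hedged sketch reusing the pipe object from the usage examples above; the attribute names transformer and unet are assumptions about the pipeline's layout.

import torch

# Compile the denoiser once; the first generation pays graph-capture cost,
# later generations run faster. Attribute names are assumptions.
if hasattr(pipe, "transformer"):
    pipe.transformer = torch.compile(pipe.transformer, mode="reduce-overhead")
elif hasattr(pipe, "unet"):
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead")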
Benchmark Performance (RTX 4090, 24GB VRAM)
| Encoder | Load Time | VRAM Usage | Quality Score | Speed |
|---|---|---|---|---|
| Q3_K_S | 8s | 2.9 GB | 8.2/10 | Fast |
| Q4_K_M | 10s | 4.1 GB | 9.1/10 | Fast |
| Q5_K_M | 12s | 5.3 GB | 9.5/10 | Medium |
| Q8_0 | 15s | 7.2 GB | 9.8/10 | Medium |
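These numbers can be reproduced on your own hardware with a small measurement loop. A sketch reusing the loading call from the usage examples above; it reports wall-clock load time and peak allocated VRAM.

import time
import torch
from diffusers import DiffusionPipeline

def benchmark_encoder(encoder_path: str) -> None:
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    pipe = DiffusionPipeline.from_pretrained(
        "path/to/wan-2.2-qx",
        text_encoder_path=encoder_path,  # as in the usage examples above
        torch_dtype=torch.float16,
    ).to("cuda")
    load_time = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{encoder_path}: load {load_time:.1f}s, peak VRAM {peak_gb:.2f} GB")
    del pipe
    torch.cuda.empty_cache()

benchmark_encoder("text_encoders/umt5-xxl-encoder-q4-k-m.gguf")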
License
This model repository uses a custom license. Please review the WAN model license terms before use.
- License Type: other (WAN License)
- Commercial Use: Check WAN license terms
- Attribution: Required for derivative works
Citation
If you use these models in your research or projects, please cite:
@misc{wan22-qx-encoders-gguf,
title={WAN 2.2 QX Text Encoders (GGUF)},
author={WAN Team},
year={2024},
howpublished={\url{https://huggingface.co/wan22-qx-encoders-gguf}},
note={Quantized UMT5-XXL text encoders for video generation}
}
Related Resources
- WAN Official Documentation: WAN Docs
- Diffusers Library: https://github.com/huggingface/diffusers
- GGUF Format: https://github.com/ggerganov/llama.cpp
- UMT5 Paper: Unified Multilingual T5
Technical Support
For issues, questions, or feature requests:
- GitHub Issues: Report issues
- Hugging Face Discussions: Community support
- Documentation: WAN User Guide
Model Card Metadata
- Developed by: WAN Team
- Model type: Text Encoder (Quantized)
- Language(s): Multilingual (100+ languages)
- License: Custom WAN License
- Finetuned from: UMT5-XXL
- Model Format: GGUF (quantized)
- Precision Variants: Q3_K_S, Q3_K_M, Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, Q6_K, Q8_0