WAN 2.2 QX Text Encoders (GGUF)

Quantized text encoder models for the WAN 2.2 QX video generation system. These GGUF-format encoders provide efficient text-to-embedding conversion for text-to-video and image-to-video generation workflows with significantly reduced VRAM requirements.

Model Description

This repository contains 8 quantized variants of the UMT5-XXL text encoder, optimized for WAN 2.2 QX video generation pipelines. The GGUF format enables efficient inference with reduced memory footprint while maintaining high-quality text understanding for video generation prompts.

Key Features:

  • Multiple Precision Levels: Q3_K_S to Q8_0 quantization options
  • VRAM Optimization: 30-70% memory reduction compared to FP16
  • Quality vs Size Trade-offs: Choose optimal balance for your hardware
  • Direct Integration: Compatible with WAN 2.2 QX diffusers pipeline
  • UMT5-XXL Architecture: Advanced multilingual text understanding

Repository Contents

Text Encoders (text_encoders/)

| File | Size | Quantization | VRAM | Quality |
|------|------|--------------|------|---------|
| umt5-xxl-encoder-q3-k-s.gguf | 2.7 GB | Q3_K_S | ~3 GB | Good |
| umt5-xxl-encoder-q3-k-m.gguf | 2.9 GB | Q3_K_M | ~3.5 GB | Better |
| umt5-xxl-encoder-q4-k-s.gguf | 3.3 GB | Q4_K_S | ~4 GB | Very Good |
| umt5-xxl-encoder-q4-k-m.gguf | 3.5 GB | Q4_K_M | ~4.5 GB | Excellent |
| umt5-xxl-encoder-q5-k-s.gguf | 3.8 GB | Q5_K_S | ~5 GB | Excellent |
| umt5-xxl-encoder-q5-k-m.gguf | 3.9 GB | Q5_K_M | ~5.5 GB | Near-Original |
| umt5-xxl-encoder-q6-k.gguf | 4.4 GB | Q6_K | ~6 GB | Near-Original |
| umt5-xxl-encoder-q8-0.gguf | 5.7 GB | Q8_0 | ~7.5 GB | Original Quality |

Total Repository Size: ~30.2 GB
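
To download a single variant instead of cloning the full ~30 GB repository, huggingface_hub's hf_hub_download fetches one file at a time. A minimal sketch, assuming the repository id wangkanai/wan22-qx-encoders-gguf and the text_encoders/ layout shown above:

from huggingface_hub import hf_hub_download

# Fetch one encoder variant into the local Hugging Face cache and return its path
encoder_path = hf_hub_download(
    repo_id="wangkanai/wan22-qx-encoders-gguf",
    filename="text_encoders/umt5-xxl-encoder-q4-k-m.gguf",
)
print(encoder_path)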

Hardware Requirements

Minimum Requirements

  • VRAM: 4 GB (Q3_K_S quantization)
  • RAM: 8 GB system memory
  • Disk Space: 3 GB (single encoder variant)
  • GPU: NVIDIA RTX 2060 or equivalent (CUDA support recommended)

Recommended Requirements

  • VRAM: 8+ GB (Q4_K_M or higher quantization)
  • RAM: 16 GB system memory
  • Disk Space: 10 GB (multiple variants for testing)
  • GPU: NVIDIA RTX 3060 Ti or better

Optimal Requirements

  • VRAM: 12+ GB (Q6_K or Q8_0 quantization)
  • RAM: 32 GB system memory
  • Disk Space: 30 GB (full repository)
  • GPU: NVIDIA RTX 4070 Ti or better
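
To see which tier your GPU falls into, you can query the device memory directly. A minimal sketch using PyTorch's CUDA utilities:

import torch

# Report the total VRAM of the first CUDA device in GiB
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB VRAM")
else:
    print("No CUDA device detected")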

Usage Examples

Basic Text Encoder Loading

from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
import torch

# Load the WAN 2.2 QX pipeline, pointing text_encoder_path at a quantized GGUF encoder
pipe = DiffusionPipeline.from_pretrained(
    "path/to/wan-2.2-qx",
    text_encoder_path="E:/huggingface/wan22-qx-encoders-gguf/text_encoders/umt5-xxl-encoder-q4-k-m.gguf",
    torch_dtype=torch.float16
)

# Move pipeline to GPU
pipe = pipe.to("cuda")

# Generate video from text; .frames holds one video per prompt in the batch
prompt = "A serene mountain landscape at sunset with flowing clouds"
video_frames = pipe(
    prompt=prompt,
    num_frames=24,
    num_inference_steps=30
).frames[0]

# Save video
export_to_video(video_frames, "output.mp4", fps=8)

Memory-Optimized Configuration (Low VRAM)

from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
import torch

# Use the Q3_K_S encoder for minimum VRAM usage
pipe = DiffusionPipeline.from_pretrained(
    "path/to/wan-2.2-qx",
    text_encoder_path="E:/huggingface/wan22-qx-encoders-gguf/text_encoders/umt5-xxl-encoder-q3-k-s.gguf",
    torch_dtype=torch.float16
)

# Enable memory optimizations before moving to GPU
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
pipe = pipe.to("cuda")

# Generate at a lower resolution and frame count to stay within VRAM limits
video_frames = pipe(
    prompt="A cat playing with a ball of yarn",
    height=512,
    width=512,
    num_frames=16,
    num_inference_steps=25
).frames[0]

export_to_video(video_frames, "output_low_vram.mp4", fps=8)

Quality-Optimized Configuration (High VRAM)

from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
import torch

# Use the Q8_0 encoder for maximum quality
pipe = DiffusionPipeline.from_pretrained(
    "path/to/wan-2.2-qx",
    text_encoder_path="E:/huggingface/wan22-qx-encoders-gguf/text_encoders/umt5-xxl-encoder-q8-0.gguf",
    torch_dtype=torch.float16
)

pipe = pipe.to("cuda")

# Generate a high-resolution, longer clip
video_frames = pipe(
    prompt="Cinematic shot of a futuristic cityscape with flying vehicles",
    height=1024,
    width=1024,
    num_frames=48,
    num_inference_steps=50,
    guidance_scale=7.5
).frames[0]

export_to_video(video_frames, "output_high_quality.mp4", fps=8)

Batch Processing with Different Encoders

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Compare several quantization levels on the same prompt
encoders = {
    "q3_k_m": "E:/huggingface/wan22-qx-encoders-gguf/text_encoders/umt5-xxl-encoder-q3-k-m.gguf",
    "q4_k_m": "E:/huggingface/wan22-qx-encoders-gguf/text_encoders/umt5-xxl-encoder-q4-k-m.gguf",
    "q5_k_m": "E:/huggingface/wan22-qx-encoders-gguf/text_encoders/umt5-xxl-encoder-q5-k-m.gguf",
}

prompt = "A beautiful garden with blooming flowers in spring"

for name, encoder_path in encoders.items():
    pipe = DiffusionPipeline.from_pretrained(
        "path/to/wan-2.2-qx",
        text_encoder_path=encoder_path,
        torch_dtype=torch.float16
    ).to("cuda")

    video = pipe(prompt=prompt, num_frames=24).frames[0]
    export_to_video(video, f"output_{name}.mp4", fps=8)

    # Free VRAM before loading the next variant
    del pipe
    torch.cuda.empty_cache()

Model Specifications

Architecture

  • Base Model: UMT5-XXL (Unified Multilingual T5)
  • Parameters: ~6 billion (encoder only; quantized from the ~13B-parameter UMT5-XXL encoder-decoder)
  • Context Length: 512 tokens
  • Vocabulary Size: 250,000+ tokens
  • Language Support: Multilingual (100+ languages)
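
Because prompts beyond the 512-token context are truncated, it can be worth checking prompt length before generation. A minimal sketch, assuming the google/umt5-xxl tokenizer matches the one bundled with the pipeline:

from transformers import AutoTokenizer

# Tokenize the prompt and warn if it exceeds the encoder's 512-token context
tokenizer = AutoTokenizer.from_pretrained("google/umt5-xxl")
prompt = "A serene mountain landscape at sunset with flowing clouds"
token_count = len(tokenizer(prompt).input_ids)
if token_count > 512:
    print(f"Prompt is {token_count} tokens and will be truncated to 512")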

Quantization Details

| Level | Bits | Method | Quality | Use Case |
|-------|------|--------|---------|----------|
| Q3_K_S | 3-bit | K-quant Small | 85% | Minimum VRAM, prototyping |
| Q3_K_M | 3-bit | K-quant Medium | 87% | Low VRAM, good quality |
| Q4_K_S | 4-bit | K-quant Small | 92% | Balanced VRAM/quality |
| Q4_K_M | 4-bit | K-quant Medium | 94% | Recommended default |
| Q5_K_S | 5-bit | K-quant Small | 96% | High quality, moderate VRAM |
| Q5_K_M | 5-bit | K-quant Medium | 97% | High-quality production |
| Q6_K | 6-bit | K-quant | 98% | Near-lossless quality |
| Q8_0 | 8-bit | Block-wise (legacy) | 99% | Maximum quality |

Format

  • File Format: GGUF (GPT-Generated Unified Format)
  • Precision: Mixed-precision quantization (K-quant variants)
  • Compression: Quantization is lossy relative to the FP16 originals; GGUF files themselves are stored without additional compression
  • Compatibility: llama.cpp ecosystem, diffusers integration
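
To verify what a downloaded file actually contains, the gguf Python package (pip install gguf) can read the header without loading any weights. A minimal sketch, assuming a local file path:

from gguf import GGUFReader

# Read GGUF metadata and tensor info without loading weights into memory
reader = GGUFReader("umt5-xxl-encoder-q4-k-m.gguf")
for name in reader.fields:
    print(name)  # e.g. general.architecture, general.name
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.tensor_type, tensor.shape)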

Performance Tips

Quantization Selection Guide

  • 8GB VRAM or less: Use Q3_K_M or Q4_K_S
  • 12GB VRAM: Use Q4_K_M or Q5_K_S (recommended)
  • 16GB VRAM: Use Q5_K_M or Q6_K
  • 24GB+ VRAM: Use Q8_0 for maximum quality
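
A hypothetical helper that applies this guide programmatically; select_encoder is an illustrative name, not part of any library, and its thresholds mirror the tiers above:

import torch

def select_encoder(vram_gb: float) -> str:
    """Suggest an encoder variant for the given VRAM budget (GiB)."""
    if vram_gb >= 24:
        return "umt5-xxl-encoder-q8-0.gguf"
    if vram_gb >= 16:
        return "umt5-xxl-encoder-q6-k.gguf"
    if vram_gb >= 12:
        return "umt5-xxl-encoder-q4-k-m.gguf"
    return "umt5-xxl-encoder-q3-k-m.gguf"

vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(select_encoder(vram_gb))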

Optimization Strategies

  1. Memory Optimization:
    • Enable enable_attention_slicing() for low VRAM
    • Use enable_vae_slicing() for large videos
    • Reduce num_frames and resolution for faster generation
  2. Quality Optimization:
    • Q4_K_M offers the best quality/performance balance
    • Use Q6_K or Q8_0 for production-quality outputs
    • Higher num_inference_steps improves temporal coherence
  3. Speed Optimization:
    • Lower quantization levels (Q3_K) are slightly faster
    • Reduce inference steps for draft iterations
    • Use torch.compile() for additional speedup (PyTorch 2.0+); a sketch follows this list
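
A minimal torch.compile sketch; pipe.transformer is an assumption about how the pipeline exposes its denoising backbone, so adjust the attribute to match your pipeline:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "path/to/wan-2.2-qx",
    torch_dtype=torch.float16
).to("cuda")

# Compile the denoising backbone (assumed attribute name); the first call pays
# a one-time compilation cost, later calls reuse the compiled graph
pipe.transformer = torch.compile(pipe.transformer, mode="reduce-overhead")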

Benchmark Performance (RTX 4090, 24GB VRAM)

| Encoder | Load Time | VRAM Usage | Quality Score | Speed |
|---------|-----------|------------|---------------|-------|
| Q3_K_S | 8 s | 2.9 GB | 8.2/10 | Fast |
| Q4_K_M | 10 s | 4.1 GB | 9.1/10 | Fast |
| Q5_K_M | 12 s | 5.3 GB | 9.5/10 | Medium |
| Q8_0 | 15 s | 7.2 GB | 9.8/10 | Medium |
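
To collect comparable numbers on your own hardware, load time and peak VRAM can be measured with standard PyTorch utilities. A minimal sketch, reusing the loading pattern from the usage examples:

import time
import torch
from diffusers import DiffusionPipeline

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()

pipe = DiffusionPipeline.from_pretrained(
    "path/to/wan-2.2-qx",
    text_encoder_path="text_encoders/umt5-xxl-encoder-q4-k-m.gguf",
    torch_dtype=torch.float16
).to("cuda")

print(f"Load time: {time.perf_counter() - start:.1f} s")
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")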

License

This model repository uses a custom license. Please review the WAN model license terms before use.

  • License Type: other (WAN License)
  • Commercial Use: Check WAN license terms
  • Attribution: Required for derivative works

Citation

If you use these models in your research or projects, please cite:

@misc{wan22-qx-encoders-gguf,
  title={WAN 2.2 QX Text Encoders (GGUF)},
  author={WAN Team},
  year={2024},
  howpublished={\url{https://huggingface.co/wangkanai/wan22-qx-encoders-gguf}},
  note={Quantized UMT5-XXL text encoders for video generation}
}

Technical Support

For issues, questions, or feature requests, open a discussion in this repository's Community tab on Hugging Face.

Model Card Metadata

  • Developed by: WAN Team
  • Model type: Text Encoder (Quantized)
  • Language(s): Multilingual (100+ languages)
  • License: Custom WAN License
  • Finetuned from: UMT5-XXL
  • Model Format: GGUF (quantized)
  • Precision Variants: Q3_K_S, Q3_K_M, Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, Q6_K, Q8_0