Kandinsky 5.0
Kandinsky 5.0 is created by the Kandinsky team: Alexey Letunovskiy, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Dmitrii Mikhailov, Anna Averchenkova, Andrey Shutkin, Julia Agafonova, Olga Kim, Anastasiia Kargapoltseva, Nikita Kiselev, Anna Dmitrienko, Anastasia Maltseva, Kirill Chernyshev, Ilia Vasiliev, Viacheslav Vasilev, Vladimir Polovnikov, Yury Kolabushin, Alexander Belykh, Mikhail Mamaev, Anastasia Aliaskina, Tatiana Nikulina, Polina Gavrilova, Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Denis Dimitrov
Kandinsky 5.0 is a family of diffusion models for Video & Image generation. Kandinsky 5.0 T2V Lite is a lightweight video generation model (2B parameters) that ranks #1 among open-source models in its class. It outperforms larger models and offers the best understanding of Russian concepts in the open-source ecosystem.
The model introduces several key innovations:
- Latent diffusion pipeline with Flow Matching for improved training stability
- Diffusion Transformer (DiT) as the main generative backbone with cross-attention to text embeddings
- Dual text encoding using Qwen2.5-VL and CLIP for comprehensive text understanding
- HunyuanVideo 3D VAE for efficient video encoding and decoding
- Sparse attention mechanisms (NABLA) for efficient long-sequence processing
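The Flow Matching idea in the first bullet can be illustrated with a small, library-free sketch of Euler integration along a straight-line path between noise and data. This is only an illustration under stated assumptions, not the Kandinsky 5.0 sampling code: the names `euler_flow_sample` and `toy_velocity` are hypothetical, the convention that t=1 is pure noise and t=0 is data is assumed, and the analytic velocity field stands in for the DiT prediction.

```python
from typing import Callable, List

Vector = List[float]


def euler_flow_sample(
    noise: Vector,
    velocity_fn: Callable[[Vector, float], Vector],
    num_steps: int = 50,
) -> Vector:
    """Integrate dx/dt = v(x, t) from t=1 (noise) down to t=0 (data) with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        # Time runs backwards (1 -> 0), so each step subtracts dt * v.
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


# Toy setup (assumption): on the straight-line path
# x_t = (1 - t) * data + t * noise, the ideal velocity is constant: noise - data.
data = [1.0, -2.0, 0.5]
noise = [0.3, 0.7, -1.1]


def toy_velocity(x: Vector, t: float) -> Vector:
    """Stand-in for the learned model; ignores x and t in this toy case."""
    return [n - d for n, d in zip(noise, data)]


sample = euler_flow_sample(noise, toy_velocity, num_steps=16)
```

Because the toy velocity is constant, the Euler steps land on `data` up to floating-point error. A learned velocity field is only approximately straight, which is one reason the number of inference steps matters in practice.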
The original codebase can be found at ai-forever/Kandinsky-5.
Check out the AI Forever organization on the Hub for the official model checkpoints for text-to-video generation, including pretrained, SFT, no-CFG, and distilled variants.
Available Models
Kandinsky 5.0 T2V Lite comes in several variants optimized for different use cases:
| model_id | Description | Use Cases |
|---|---|---|
| ai-forever/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers | 5 second Supervised Fine-Tuned model | Highest generation quality |
| ai-forever/Kandinsky-5.0-T2V-Lite-sft-10s-Diffusers | 10 second Supervised Fine-Tuned model | Highest generation quality |
| ai-forever/Kandinsky-5.0-T2V-Lite-nocfg-5s-Diffusers | 5 second Classifier-Free Guidance distilled | 2× faster inference |
| ai-forever/Kandinsky-5.0-T2V-Lite-nocfg-10s-Diffusers | 10 second Classifier-Free Guidance distilled | 2× faster inference |
| ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-5s-Diffusers | 5 second Diffusion distilled to 16 steps | 6× faster inference, minimal quality loss |
| ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-10s-Diffusers | 10 second Diffusion distilled to 16 steps | 6× faster inference, minimal quality loss |
| ai-forever/Kandinsky-5.0-T2V-Lite-pretrain-5s-Diffusers | 5 second Base pretrained model | Research and fine-tuning |
| ai-forever/Kandinsky-5.0-T2V-Lite-pretrain-10s-Diffusers | 10 second Base pretrained model | Research and fine-tuning |
All models are available in 5-second and 10-second video generation versions.
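The naming scheme in the table is regular, so a checkpoint ID and matching call settings can be derived from a variant name and a duration. The helper below is a hypothetical convenience sketch, not part of Diffusers. The frame counts (121 and 241), the 16-step setting, and `guidance_scale=1.0` for the nocfg and distilled variants come from the usage examples on this page. Using 50 steps for nocfg and `guidance_scale=5.0` with 50 steps for the pretrain variant are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantSettings:
    model_id: str
    num_frames: int
    num_inference_steps: int
    guidance_scale: float


VARIANTS = ("sft", "nocfg", "distilled16steps", "pretrain")
# Frame counts used in the examples on this page (exported at 24 fps).
FRAMES_BY_SECONDS = {5: 121, 10: 241}


def settings_for(variant: str, seconds: int) -> VariantSettings:
    """Build the Hub model ID and suggested call arguments (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if seconds not in FRAMES_BY_SECONDS:
        raise ValueError(f"unsupported duration: {seconds!r}")
    model_id = f"ai-forever/Kandinsky-5.0-T2V-Lite-{variant}-{seconds}s-Diffusers"
    # nocfg and distilled checkpoints are run without classifier-free guidance.
    guidance_scale = 1.0 if variant in ("nocfg", "distilled16steps") else 5.0
    # Only the distilled checkpoint is documented with 16 steps; 50 elsewhere is assumed.
    num_inference_steps = 16 if variant == "distilled16steps" else 50
    return VariantSettings(
        model_id=model_id,
        num_frames=FRAMES_BY_SECONDS[seconds],
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
    )


distilled_5s = settings_for("distilled16steps", 5)
sft_10s = settings_for("sft", 10)
```

The fields of the returned object line up with the `__call__` arguments of the same names, so they can be passed through as keyword arguments.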
Kandinsky5T2VPipeline
class diffusers.Kandinsky5T2VPipeline
( transformer: Kandinsky5Transformer3DModel vae: AutoencoderKLHunyuanVideo text_encoder: Qwen2_5_VLForConditionalGeneration tokenizer: Qwen2VLProcessor text_encoder_2: CLIPTextModel tokenizer_2: CLIPTokenizer scheduler: FlowMatchEulerDiscreteScheduler )
Parameters
- transformer (Kandinsky5Transformer3DModel) — Conditional Transformer to denoise the encoded video latents.
- vae (AutoencoderKLHunyuanVideo) — Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
- text_encoder (Qwen2_5_VLForConditionalGeneration) — Frozen text-encoder (Qwen2.5-VL).
- tokenizer (AutoProcessor) — Tokenizer for Qwen2.5-VL.
- text_encoder_2 (CLIPTextModel) — Frozen CLIP text encoder.
- tokenizer_2 (CLIPTokenizer) — Tokenizer for CLIP.
- scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded video latents.
Pipeline for text-to-video generation using Kandinsky 5.0.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
__call__
( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 512 width: int = 768 num_frames: int = 121 num_inference_steps: int = 50 guidance_scale: float = 5.0 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds_qwen: typing.Optional[torch.Tensor] = None prompt_embeds_clip: typing.Optional[torch.Tensor] = None negative_prompt_embeds_qwen: typing.Optional[torch.Tensor] = None negative_prompt_embeds_clip: typing.Optional[torch.Tensor] = None prompt_cu_seqlens: typing.Optional[torch.Tensor] = None negative_prompt_cu_seqlens: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 **kwargs ) → KandinskyPipelineOutput or tuple
Parameters
- prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, pass prompt_embeds_qwen and prompt_embeds_clip instead.
- negative_prompt (str or List[str], optional) — The prompt or prompts to avoid during video generation. If not defined, pass negative_prompt_embeds_qwen and negative_prompt_embeds_clip instead. Ignored when not using guidance (guidance_scale < 1).
- height (int, defaults to 512) — The height in pixels of the generated video.
- width (int, defaults to 768) — The width in pixels of the generated video.
- num_frames (int, defaults to 121) — The number of frames in the generated video.
- num_inference_steps (int, defaults to 50) — The number of denoising steps.
- guidance_scale (float, defaults to 5.0) — Guidance scale as defined in classifier-free guidance.
- num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
- generator (torch.Generator or List[torch.Generator], optional) — A torch generator to make generation deterministic.
- latents (torch.Tensor, optional) — Pre-generated noisy latents.
- prompt_embeds_qwen (torch.Tensor, optional) — Pre-generated Qwen text embeddings.
- prompt_embeds_clip (torch.Tensor, optional) — Pre-generated CLIP text embeddings.
- negative_prompt_embeds_qwen (torch.Tensor, optional) — Pre-generated Qwen negative text embeddings.
- negative_prompt_embeds_clip (torch.Tensor, optional) — Pre-generated CLIP negative text embeddings.
- prompt_cu_seqlens (torch.Tensor, optional) — Pre-computed cumulative sequence lengths for the Qwen positive prompt.
- negative_prompt_cu_seqlens (torch.Tensor, optional) — Pre-computed cumulative sequence lengths for the Qwen negative prompt.
- output_type (str, optional, defaults to "pil") — The output format of the generated video.
- return_dict (bool, optional, defaults to True) — Whether or not to return a KandinskyPipelineOutput.
- callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function that is called at the end of each denoising step.
- callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function.
- max_sequence_length (int, defaults to 512) — The maximum sequence length for text encoding.
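A step-end callback can be sketched as a plain function. The four-argument form below (pipeline, step index, timestep, dict of tensors) is the convention used by other Diffusers pipelines and is assumed, not confirmed, for this one. The name `log_step` and the placeholder values in the dry run are hypothetical.

```python
from typing import Any, Dict, List

# Collects the step indices seen during a run.
step_log: List[int] = []


def log_step(
    pipe: Any,
    step: int,
    timestep: Any,
    callback_kwargs: Dict[str, Any],
) -> Dict[str, Any]:
    """Record the step index and hand the tensors back unchanged."""
    step_log.append(step)
    # callback_kwargs holds the tensors named in callback_on_step_end_tensor_inputs
    # (by default only "latents"); returning it unmodified leaves them as they are.
    return callback_kwargs


# Stand-alone dry run with placeholder values; no pipeline or GPU is needed.
result = log_step(None, 0, 999, {"latents": "placeholder"})
```

Under that assumption, the function would be passed as `callback_on_step_end=log_step`, with `callback_on_step_end_tensor_inputs=["latents"]` selecting which tensors appear in the dict.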
Returns
KandinskyPipelineOutput or tuple
If return_dict is True, a KandinskyPipelineOutput is returned; otherwise a tuple is returned
whose first element is a list with the generated video frames.
The call function to the pipeline for generation.
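The effect of guidance_scale can be illustrated with the standard classifier-free guidance combination of a conditional and an unconditional prediction. This is a generic sketch on plain Python lists. Whether the pipeline applies exactly this formula internally is an assumption, and `apply_cfg` is a hypothetical name.

```python
from typing import List


def apply_cfg(cond: List[float], uncond: List[float], guidance_scale: float) -> List[float]:
    """Standard CFG mix: uncond + scale * (cond - uncond), element by element."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


cond_pred = [1.0, 2.0]    # toy prediction for the prompt
uncond_pred = [0.5, 1.0]  # toy prediction for the negative/empty prompt

no_guidance = apply_cfg(cond_pred, uncond_pred, guidance_scale=1.0)  # equals cond_pred
guided = apply_cfg(cond_pred, uncond_pred, guidance_scale=5.0)       # [3.0, 6.0]
```

At `guidance_scale=1.0` the unconditional term cancels and only the conditional prediction remains, which is consistent with running the nocfg and distilled checkpoints at that value.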
Examples:
>>> import torch
>>> from diffusers import Kandinsky5T2VPipeline
>>> from diffusers.utils import export_to_video
>>> # Available models:
>>> # ai-forever/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers
>>> # ai-forever/Kandinsky-5.0-T2V-Lite-nocfg-5s-Diffusers
>>> # ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-5s-Diffusers
>>> # ai-forever/Kandinsky-5.0-T2V-Lite-pretrain-5s-Diffusers
>>> model_id = "ai-forever/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers"
>>> pipe = Kandinsky5T2VPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe = pipe.to("cuda")
>>> prompt = "A cat and a dog baking a cake together in a kitchen."
>>> negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"
>>> output = pipe(
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     height=512,
...     width=768,
...     num_frames=121,
...     num_inference_steps=50,
...     guidance_scale=5.0,
... ).frames[0]
>>> export_to_video(output, "output.mp4", fps=24, quality=9)
check_inputs
( prompt negative_prompt height width prompt_embeds_qwen = None prompt_embeds_clip = None negative_prompt_embeds_qwen = None negative_prompt_embeds_clip = None prompt_cu_seqlens = None negative_prompt_cu_seqlens = None callback_on_step_end_tensor_inputs = None )
Parameters
- prompt — Input prompt
- negative_prompt — Negative prompt for guidance
- height — Video height
- width — Video width
- prompt_embeds_qwen — Pre-computed Qwen prompt embeddings
- prompt_embeds_clip — Pre-computed CLIP prompt embeddings
- negative_prompt_embeds_qwen — Pre-computed Qwen negative prompt embeddings
- negative_prompt_embeds_clip — Pre-computed CLIP negative prompt embeddings
- prompt_cu_seqlens — Pre-computed cumulative sequence lengths for Qwen positive prompt
- negative_prompt_cu_seqlens — Pre-computed cumulative sequence lengths for Qwen negative prompt
- callback_on_step_end_tensor_inputs — Callback tensor inputs
Raises
ValueError — If inputs are invalid.
Validate input parameters for the pipeline.
encode_prompt
( prompt: typing.Union[str, typing.List[str]] num_videos_per_prompt: int = 1 max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None ) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Parameters
- prompt (str or List[str]) — Prompt to be encoded.
- num_videos_per_prompt (int, optional, defaults to 1) — Number of videos to generate per prompt.
- max_sequence_length (int, optional, defaults to 512) — Maximum sequence length for text encoding.
- device (torch.device, optional) — Torch device.
- dtype (torch.dtype, optional) — Torch dtype.
Returns
Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
- Qwen text embeddings of shape (batch_size * num_videos_per_prompt, sequence_length, embedding_dim)
- CLIP pooled embeddings of shape (batch_size * num_videos_per_prompt, clip_embedding_dim)
- Cumulative sequence lengths (cu_seqlens) for Qwen embeddings of shape (batch_size * num_videos_per_prompt + 1,)
Encodes a single prompt (positive or negative) into text encoder hidden states.
This method combines embeddings from both Qwen2.5-VL and CLIP text encoders to create comprehensive text representations for video generation.
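The third return value follows the usual layout for cumulative sequence lengths: a leading zero followed by running totals of per-prompt token counts, which is why it has one more entry than there are sequences. The sketch below shows that layout with plain lists. The helper names and token counts are hypothetical, and how the pipeline packs its Qwen embeddings internally is an assumption.

```python
from itertools import accumulate
from typing import List


def cumulative_seqlens(token_counts: List[int]) -> List[int]:
    """Return [0, n0, n0 + n1, ...]; length is len(token_counts) + 1."""
    return [0] + list(accumulate(token_counts))


def split_packed(packed: List[str], cu_seqlens: List[int]) -> List[List[str]]:
    """Recover per-sequence slices from a packed list using the boundaries."""
    return [packed[start:end] for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:])]


token_counts = [5, 3, 7]               # hypothetical token counts for three prompts
cu = cumulative_seqlens(token_counts)  # [0, 5, 8, 15]

packed_tokens = [f"tok{i}" for i in range(cu[-1])]
per_prompt = split_packed(packed_tokens, cu)
```

Consecutive entries of `cu` mark where each prompt starts and ends in a packed sequence, so variable-length prompts can be handled without padding every one to max_sequence_length.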
fast_sta_nabla
( T: int H: int W: int wT: int = 3 wH: int = 3 wW: int = 3 device = 'cuda' ) → torch.Tensor
Parameters
- T (int) — Number of temporal frames
- H (int) — Height in latent space
- W (int) — Width in latent space
- wT (int) — Temporal attention window size
- wH (int) — Height attention window size
- wW (int) — Width attention window size
- device (str) — Device to create tensor on
Returns
torch.Tensor
Sparse attention mask of shape (THW, THW)
Create a sparse temporal attention (STA) mask for efficient video generation.
This method generates a mask that limits attention to nearby frames and spatial positions, reducing computational complexity for video generation.
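The idea of limiting attention to nearby frames and spatial positions can be sketched in pure Python. The function below is an illustrative stand-in, not the fast_sta_nabla implementation: it assumes symmetric windows centered on each position, works per position rather than per block of tokens, and returns nested lists instead of a tensor. The name `local_window_mask` is hypothetical.

```python
from itertools import product
from typing import List


def local_window_mask(
    T: int, H: int, W: int, wT: int = 3, wH: int = 3, wW: int = 3
) -> List[List[bool]]:
    """Boolean (T*H*W, T*H*W) mask; True where a query may attend to a key."""
    # Flatten the (t, h, w) grid in row-major order.
    coords = list(product(range(T), range(H), range(W)))
    # A window of size w reaches w // 2 positions to each side (assumption).
    rT, rH, rW = wT // 2, wH // 2, wW // 2
    mask = []
    for t1, h1, w1 in coords:
        row = [
            abs(t1 - t2) <= rT and abs(h1 - h2) <= rH and abs(w1 - w2) <= rW
            for t2, h2, w2 in coords
        ]
        mask.append(row)
    return mask


mask = local_window_mask(T=4, H=4, W=4)
# Fraction of query/key pairs that remain active.
density = sum(sum(row) for row in mask) / (len(mask) ** 2)
```

Each row keeps at most wT * wH * wW active entries no matter how large the grid is, so the share of active pairs shrinks as the video gets longer. That is where the savings over dense attention come from.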
get_sparse_params
( sample device ) → Dict
Generate sparse attention parameters for the transformer based on sample dimensions.
This method computes the sparse attention configuration needed for efficient video processing in the transformer model.
prepare_latents
( batch_size: int num_channels_latents: int = 16 height: int = 480 width: int = 832 num_frames: int = 81 dtype: typing.Optional[torch.dtype] = None device: typing.Optional[torch.device] = None generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None ) → torch.Tensor
Parameters
- batch_size (int) — Number of videos to generate
- num_channels_latents (int) — Number of channels in latent space
- height (int) — Height of generated video
- width (int) — Width of generated video
- num_frames (int) — Number of frames in video
- dtype (torch.dtype) — Data type for latents
- device (torch.device) — Device to create latents on
- generator (torch.Generator) — Random number generator
- latents (torch.Tensor) — Pre-existing latents to use
Returns
torch.Tensor
Prepared latent tensor
Prepare initial latent variables for video generation.
This method creates random noise latents or uses provided latents as starting point for the denoising process.
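The size of the latent tensor can be estimated from the video dimensions. The sketch below assumes a VAE with 4x temporal and 8x spatial compression, which are the factors commonly associated with the HunyuanVideo 3D VAE, and a channels-last axis order. Both are assumptions rather than documented behaviour of prepare_latents, and `latent_shape` is a hypothetical helper.

```python
from typing import Tuple


def latent_shape(
    batch_size: int,
    num_frames: int = 121,
    height: int = 512,
    width: int = 768,
    num_channels_latents: int = 16,
    temporal_factor: int = 4,  # assumed temporal compression of the VAE
    spatial_factor: int = 8,   # assumed spatial compression of the VAE
) -> Tuple[int, int, int, int, int]:
    """Return (batch, latent_frames, latent_height, latent_width, channels)."""
    # The first frame is kept; the remaining frames are compressed in groups.
    latent_frames = (num_frames - 1) // temporal_factor + 1
    return (
        batch_size,
        latent_frames,
        height // spatial_factor,
        width // spatial_factor,
        num_channels_latents,
    )


shape_5s = latent_shape(1)                   # (1, 31, 64, 96, 16)
shape_10s = latent_shape(1, num_frames=241)  # (1, 61, 64, 96, 16)
```

Under these assumptions a 10-second clip roughly doubles the number of latent positions compared with a 5-second clip, which grows the attention sequence accordingly.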
Usage Examples
Basic Text-to-Video Generation
import torch
from diffusers import Kandinsky5T2VPipeline
from diffusers.utils import export_to_video

# Load the pipeline
model_id = "ai-forever/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers"
pipe = Kandinsky5T2VPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

# Generate video
prompt = "A cat and a dog baking a cake together in a kitchen."
negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"
output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=512,
    width=768,
    num_frames=121,  # ~5 seconds at 24fps
    num_inference_steps=50,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=24, quality=9)
10 second Models
⚠️ Warning! All 10-second models should be used with the Flex attention backend and max-autotune-no-cudagraphs compilation:
import torch
from diffusers import Kandinsky5T2VPipeline
from diffusers.utils import export_to_video

pipe = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-sft-10s-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

# Set the attention backend to Flex
pipe.transformer.set_attention_backend("flex")
# Compile with max-autotune-no-cudagraphs
pipe.transformer.compile(mode="max-autotune-no-cudagraphs", dynamic=True)

prompt = "A cat and a dog baking a cake together in a kitchen."
negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"
output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=512,
    width=768,
    num_frames=241,  # ~10 seconds at 24fps
    num_inference_steps=50,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=24, quality=9)
Diffusion Distilled model
⚠️ Warning! All nocfg and diffusion-distilled models should be run without CFG (guidance_scale=1.0):
import torch
from diffusers import Kandinsky5T2VPipeline
from diffusers.utils import export_to_video

model_id = "ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-5s-Diffusers"
pipe = Kandinsky5T2VPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

output = pipe(
    prompt="A beautiful sunset over mountains",
    num_inference_steps=16,  # the model is distilled to 16 steps
    guidance_scale=1.0,  # no CFG
).frames[0]
export_to_video(output, "output.mp4", fps=24, quality=9)
Citation
@misc{kandinsky2025,
author = {Alexey Letunovskiy and Maria Kovaleva and Ivan Kirillov and Lev Novitskiy and Denis Koposov and
Dmitrii Mikhailov and Anna Averchenkova and Andrey Shutkin and Julia Agafonova and Olga Kim and
Anastasiia Kargapoltseva and Nikita Kiselev and Vladimir Arkhipkin and Vladimir Korviakov and
Nikolai Gerasimenko and Denis Parkhomenko and Anna Dmitrienko and Anastasia Maltseva and
Kirill Chernyshev and Ilia Vasiliev and Viacheslav Vasilev and Vladimir Polovnikov and
Yury Kolabushin and Alexander Belykh and Mikhail Mamaev and Anastasia Aliaskina and
Tatiana Nikulina and Polina Gavrilova and Denis Dimitrov},
title = {Kandinsky 5.0: A family of diffusion models for Video & Image generation},
howpublished = {\url{https://github.com/ai-forever/Kandinsky-5}},
year = 2025
}