Discrete Diffusion Divergence Instruct (DiDi-Instruct)
DiDi-Instruct is a training-based framework for ultra-fast language generation. It initializes a student from a pre-trained discrete diffusion language model (dLLM) and distills it into a few-step generator.
The distilled DiDi-Instruct model matches or surpasses its dLLM teacher and the GPT-2 baseline while enabling up to 64x faster sampling. The method is built on an integral KL-divergence minimization framework and additionally introduces grouped reward normalization, intermediate-state matching, and a reward-guided ancestral sampler (RGAS), which together improve training stability, model coverage, and inference quality.
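The paper defines these components formally; as a rough, illustrative sketch only, grouped reward normalization can be thought of as a per-group standardization of discriminator rewards. The function name and tensor shapes below are assumptions made for illustration and are not the repository's API.
import torch

def grouped_reward_normalization(rewards: torch.Tensor) -> torch.Tensor:
    # Illustrative sketch (not the repository's code): standardize rewards within
    # each group of samples so the distillation signal is relative rather than
    # absolute, reducing variance across groups.
    # rewards: tensor of shape (num_groups, group_size)
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)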
Model Details
- Model Architecture: This model is a 169M-parameter Masked Diffusion Model with a Diffusion Transformer (DiT) backbone, using a hidden dimension of 768, 12 transformer layers, and 12 attention heads (see the summary sketch after this list).
- Training Data: The model was trained on the OpenWebText dataset. The corpus was tokenized using the GPT-2 tokenizer with a context length of 1024.
- Training Procedure: The student model was distilled from a 169M parameter MDLM teacher using the DiDi-Instruct framework for 10,000 iterations. The distillation process requires only about one hour on a single NVIDIA H100 GPU.
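For convenience, the reported architecture and training setup above can be summarized as a plain Python dictionary. The key names are illustrative only and do not match the repository's configuration schema.
# Illustrative summary of the reported setup; key names are for readability only.
didi_instruct_setup = {
    "backbone": "DiT",                 # Diffusion Transformer
    "objective": "masked diffusion",
    "parameters": "169M",
    "hidden_size": 768,
    "num_layers": 12,
    "num_attention_heads": 12,
    "tokenizer": "gpt2",
    "context_length": 1024,
    "training_data": "OpenWebText",
    "distillation_iterations": 10_000,
    "teacher": "169M MDLM",
}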
How to Use
To use the full guided-inference capabilities, you need to load both the generator (the main model) and the discriminator.
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

from models.dit import DIT, DiscriminatorHead

model_id = "haoyangzheng/didi-instruct-model"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the few-step generator (the main distilled model).
print("Loading generator...")
generator = DIT.from_pretrained(model_id).to(device)
generator.eval()

# Build the discriminator with the same backbone, then swap in the discriminator head.
print("Loading discriminator...")
config = generator.config
vocab_size = generator.vocab_size
discriminator = DIT(config, vocab_size=vocab_size)
discriminator.output_layer = DiscriminatorHead(config.model.hidden_size)

# Download the discriminator weights and load them.
# Note: .safetensors checkpoints are loaded with safetensors.torch.load_file, not torch.load.
discriminator_path = hf_hub_download(repo_id=model_id, filename="discriminator.safetensors")
discriminator.load_state_dict(load_file(discriminator_path))
discriminator = discriminator.to(device)
discriminator.eval()

print("\nGenerator and Discriminator loaded successfully!")
Performance
DiDi-Instruct excels at few-step generation, consistently outperforming baselines across various numbers of function evaluations (NFEs).
- On the OpenWebText benchmark, DiDi-Instruct achieves a perplexity ranging from 62.2 (8 NFEs) to 18.4 (128 NFEs), outperforming prior accelerated dLLMs and the GPT-2 baseline.
- With only 16 NFEs, the model already achieves lower (better) perplexity than the 1024-step teacher.
- These performance gains are achieved with negligible entropy loss (around 1%), indicating that sample diversity is well-preserved.
Citation
If you use DiDi-Instruct in your work, please cite our paper:
@article{zheng2025ultra,
title={{Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct}},
author={Zheng, Haoyang and Liu, Xinyang and Kong, Cindy Xiangrui and Jiang, Nan and Hu, Zheyuan and Luo, Weijian and Deng, Wei and Lin, Guang},
journal={arXiv preprint arXiv:2509.25035},
year={2025}
}
Model Card Contact
Haoyang Zheng ([email protected])