Discrete Diffusion Divergence Instruct (DiDi-Instruct)
DiDi-Instruct is a training-based framework for ultra-fast language generation. It initializes a student from a pre-trained discrete diffusion language model (dLLM) and distills it into a few-step generator.
The distilled DiDi-Instruct model matches or surpasses its dLLM teacher and the GPT-2 baseline while enabling up to 64x faster sampling. The method is built on an integral KL-divergence minimization framework and additionally introduces grouped reward normalization, intermediate-state matching, and a reward-guided ancestral sampler (RGAS), which together improve training stability, model coverage, and inference quality.
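The paper defines these components formally; as a rough, illustrative sketch only, grouped reward normalization can be thought of as a per-group standardization of discriminator rewards. The function name and tensor shapes below are assumptions made for illustration and are not the repository's API.
import torch

def grouped_reward_normalization(rewards: torch.Tensor) -> torch.Tensor:
    # Illustrative sketch (not the repository's code): standardize rewards within
    # each group of samples so the distillation signal is relative rather than
    # absolute, reducing variance across groups.
    # rewards: tensor of shape (num_groups, group_size)
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)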
Model Details
- Model Architecture: This model is a 169M-parameter Masked Diffusion Model with a Diffusion Transformer (DiT) backbone, using a hidden dimension of 768, 12 transformer layers, and 12 attention heads (see the summary sketch after this list).
- Training Data: The model was trained on the OpenWebText dataset. The corpus was tokenized using the GPT-2 tokenizer with a context length of 1024.
- Training Procedure: The student model was distilled from a 169M parameter MDLM teacher using the DiDi-Instruct framework for 10,000 iterations. The distillation process requires only about one hour on a single NVIDIA H100 GPU.
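For convenience, the reported architecture and training setup above can be summarized as a plain Python dictionary. The key names are illustrative only and do not match the repository's configuration schema.
# Illustrative summary of the reported setup; key names are for readability only.
didi_instruct_setup = {
    "backbone": "DiT",                 # Diffusion Transformer
    "objective": "masked diffusion",
    "parameters": "169M",
    "hidden_size": 768,
    "num_layers": 12,
    "num_attention_heads": 12,
    "tokenizer": "gpt2",
    "context_length": 1024,
    "training_data": "OpenWebText",
    "distillation_iterations": 10_000,
    "teacher": "169M MDLM",
}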
How to Use
To use the full guided-inference capabilities, you need to load both the generator (the main model) and the discriminator.
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

from models.dit import DIT, DiscriminatorHead

model_id = "haoyangzheng/didi-instruct-model"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the few-step generator (the main distilled model).
print("Loading generator...")
generator = DIT.from_pretrained(model_id).to(device)
generator.eval()

# Build the discriminator with the same backbone, then swap in the discriminator head.
print("Loading discriminator...")
config = generator.config
vocab_size = generator.vocab_size
discriminator = DIT(config, vocab_size=vocab_size)
discriminator.output_layer = DiscriminatorHead(config.model.hidden_size)

# Download the discriminator weights and load them.
# Note: .safetensors checkpoints are loaded with safetensors.torch.load_file, not torch.load.
discriminator_path = hf_hub_download(repo_id=model_id, filename="discriminator.safetensors")
discriminator.load_state_dict(load_file(discriminator_path))
discriminator = discriminator.to(device)
discriminator.eval()

print("\nGenerator and Discriminator loaded successfully!")
Performance
DiDi-Instruct excels at few-step generation, consistently outperforming baselines across various numbers of function evaluations (NFEs).
- On the OpenWebText benchmark, DiDi-Instruct achieves a perplexity ranging from 62.2 (8 NFEs) to 18.4 (128 NFEs), outperforming prior accelerated dLLMs and the GPT-2 baseline.
- With only 16 NFEs, the model already achieves lower (better) perplexity than the 1024-step teacher.
- These performance gains are achieved with negligible entropy loss (around 1%), indicating that sample diversity is well-preserved.
Citation
If you use DiDi-Instruct in your work, please cite our paper:
@article{zheng2025ultra,
title={{Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct}},
author={Zheng, Haoyang and Liu, Xinyang and Kong, Cindy Xiangrui and Jiang, Nan and Hu, Zheyuan and Luo, Weijian and Deng, Wei and Lin, Guang},
journal={arXiv preprint arXiv:2509.25035},
year={2025}
}
Model Card Contact
Haoyang Zheng ([email protected])