AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
Abstract
AdaSPEC enhances speculative decoding by selectively filtering tokens during knowledge distillation, improving token acceptance rates without sacrificing generation quality.
Speculative Decoding (SD) accelerates large language model inference by employing a small draft model to generate predictions, which are then verified by a larger target model. The effectiveness of SD hinges on the alignment between these models, which is typically enhanced by Knowledge Distillation (KD). However, conventional KD methods aim to minimize the KL divergence between the draft and target models across all tokens, a goal that is misaligned with the true objective of SD, which is to maximize the token acceptance rate. Moreover, draft models often struggle to fully assimilate the target model's knowledge due to capacity constraints, leading to suboptimal performance. To address this challenge, we propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process. AdaSPEC utilizes a reference model to identify and filter out difficult-to-fit tokens, enabling the distillation of a draft model that better aligns with the target model on simpler tokens. This approach improves the overall token acceptance rate without compromising generation quality. We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction-following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results demonstrate that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15%). The code is publicly available at https://github.com/yuezhouhu/adaspec.
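For context on the objective the abstract optimizes for, here is a minimal sketch (not from the paper) of the standard speculative-sampling acceptance rule. It assumes `draft_probs` and `target_probs` hold the per-position next-token distributions for one drafted block; the function name and tensor shapes are illustrative assumptions.

```python
import torch

def count_accepted(draft_probs: torch.Tensor,
                   target_probs: torch.Tensor,
                   draft_tokens: torch.Tensor) -> int:
    """Standard speculative-sampling acceptance rule: accept drafted token x_t
    with probability min(1, p_target(x_t) / p_draft(x_t)); stop at the first
    rejection. Shapes: probs are [block_len, vocab], draft_tokens is [block_len]."""
    accepted = 0
    for t, tok in enumerate(draft_tokens.tolist()):
        p = target_probs[t, tok]
        q = draft_probs[t, tok]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            accepted += 1
        else:
            break  # first rejection ends the accepted prefix
    return accepted
```

The closer the draft distribution is to the target on the tokens it actually proposes, the longer the accepted prefix per verification step, which is the quantity AdaSPEC's selective distillation is designed to improve.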
Community
AdaSPEC introduces a two-stage selective knowledge distillation framework to train draft models that better align with the target model in Speculative Decoding.
Reference Model as a Difficulty Analyzer:
A reference model (initialized identically to the draft model) is first distilled from the target model using standard knowledge distillation (e.g., forward KL divergence). This reference model serves not as the final draft, but as a proxy to estimate token-wise learning difficulty.
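Below is a minimal sketch of this first stage, assuming PyTorch-style logits of shape [batch, seq_len, vocab]; the function names are illustrative and not taken from the released code.

```python
import torch
import torch.nn.functional as F

def forward_kl_per_token(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor) -> torch.Tensor:
    """Forward KL(teacher || student) at every token position.
    Logits: [batch, seq_len, vocab] -> per-token KL: [batch, seq_len]."""
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    return (teacher_logp.exp() * (teacher_logp - student_logp)).sum(dim=-1)

def reference_distill_loss(ref_logits: torch.Tensor,
                           target_logits: torch.Tensor) -> torch.Tensor:
    """Stage 1: distill the reference model from the target on *all* tokens."""
    return forward_kl_per_token(ref_logits, target_logits.detach()).mean()
```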
Selective Token Filtering:
When distilling the actual draft model, AdaSPEC computes a per-token KL divergence loss against the target for both the draft and the reference model, and forms the loss gap ΔL = L_draft − L_ref. Tokens with a larger ΔL are treated as easier to learn: the equally sized reference model already fits them well, so the draft still has room to improve on them. AdaSPEC selects the top-k% of these "easy" tokens and trains the draft model only on this filtered subset, as sketched below.
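A sketch of the filtering step under the same assumptions as above (per-token forward KL against a frozen target); `keep_ratio` is an illustrative stand-in for the top-k% hyperparameter, not a name from the paper.

```python
import torch
import torch.nn.functional as F

def adaspec_selective_loss(draft_logits, ref_logits, target_logits, keep_ratio=0.5):
    """Compute per-token forward KL vs. the target for the draft and reference
    models, keep the top-k% of tokens by the gap dL = L_draft - L_ref, and
    average the draft loss over that subset only."""
    target_logp = F.log_softmax(target_logits, dim=-1)
    target_p = target_logp.exp()

    def per_token_kl(student_logits):
        student_logp = F.log_softmax(student_logits, dim=-1)
        return (target_p * (target_logp - student_logp)).sum(dim=-1)  # [batch, seq]

    loss_draft = per_token_kl(draft_logits)             # keeps gradients
    with torch.no_grad():
        loss_ref = per_token_kl(ref_logits)
        gap = loss_draft.detach() - loss_ref            # larger gap = more headroom
        k = max(1, int(keep_ratio * gap.numel()))
        keep_idx = gap.flatten().topk(k).indices        # "easy" tokens to keep
        mask = torch.zeros(gap.numel(), device=gap.device)
        mask[keep_idx] = 1.0
        mask = mask.view_as(gap)

    return (loss_draft * mask).sum() / mask.sum()
```

In this sketch only the draft model receives gradients; the reference and target models are used purely to score token difficulty.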
By focusing the draft model’s limited capacity on tokens it can reliably learn, AdaSPEC achieves higher alignment with the target model, leading to consistently improved acceptance rates across diverse tasks—without sacrificing generation quality.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- 3-Model Speculative Decoding (2025)
- FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction (2025)
- TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs (2025)
- Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding (2025)
- AdaSwitch: Adaptive Switching Generation for Knowledge Distillation (2025)
- Speculate Deep and Accurate: Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding (2025)
- Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding (2025)