AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
Abstract
AdaSPEC enhances speculative decoding by selectively filtering tokens during knowledge distillation, improving token acceptance rates without sacrificing generation quality.
Speculative Decoding (SD) accelerates large language model inference by employing a small draft model to generate predictions, which are then verified by a larger target model. The effectiveness of SD hinges on the alignment between these models, which is typically enhanced by Knowledge Distillation (KD). However, conventional KD methods aim to minimize the KL divergence between the draft and target models across all tokens, a goal that is misaligned with the true objective of SD, which is to maximize the token acceptance rate. Moreover, draft models often struggle to fully assimilate the target model's knowledge due to capacity constraints, leading to suboptimal performance. To address this challenge, we propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process. AdaSPEC utilizes a reference model to identify and filter out difficult-to-fit tokens, enabling the distillation of a draft model that better aligns with the target model on simpler tokens. This approach improves the overall token acceptance rate without compromising generation quality. We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction-following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results demonstrate that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15%). The code is publicly available at https://github.com/yuezhouhu/adaspec.
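For context on the objective the abstract optimizes for, here is a minimal sketch (not from the paper) of the standard speculative-sampling acceptance rule. It assumes `draft_probs` and `target_probs` hold the per-position next-token distributions for one drafted block; the function name and tensor shapes are illustrative assumptions.

```python
import torch

def count_accepted(draft_probs: torch.Tensor,
                   target_probs: torch.Tensor,
                   draft_tokens: torch.Tensor) -> int:
    """Standard speculative-sampling acceptance rule: accept drafted token x_t
    with probability min(1, p_target(x_t) / p_draft(x_t)); stop at the first
    rejection. Shapes: probs are [block_len, vocab], draft_tokens is [block_len]."""
    accepted = 0
    for t, tok in enumerate(draft_tokens.tolist()):
        p = target_probs[t, tok]
        q = draft_probs[t, tok]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            accepted += 1
        else:
            break  # first rejection ends the accepted prefix
    return accepted
```

The closer the draft distribution is to the target on the tokens it actually proposes, the longer the accepted prefix per verification step, which is the quantity AdaSPEC's selective distillation is designed to improve.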
Community
AdaSPEC introduces a two-stage selective knowledge distillation framework to train draft models that better align with the target model in Speculative Decoding.
Reference Model as a Difficulty Analyzer:
A reference model (initialized identically to the draft model) is first distilled from the target model using standard knowledge distillation (e.g., forward KL divergence). This reference model serves not as the final draft, but as a proxy to estimate token-wise learning difficulty.
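Below is a minimal sketch of this first stage, assuming PyTorch-style logits of shape [batch, seq_len, vocab]; the function names are illustrative and not taken from the released code.

```python
import torch
import torch.nn.functional as F

def forward_kl_per_token(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor) -> torch.Tensor:
    """Forward KL(teacher || student) at every token position.
    Logits: [batch, seq_len, vocab] -> per-token KL: [batch, seq_len]."""
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    return (teacher_logp.exp() * (teacher_logp - student_logp)).sum(dim=-1)

def reference_distill_loss(ref_logits: torch.Tensor,
                           target_logits: torch.Tensor) -> torch.Tensor:
    """Stage 1: distill the reference model from the target on *all* tokens."""
    return forward_kl_per_token(ref_logits, target_logits.detach()).mean()
```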
Selective Token Filtering:
When distilling the actual draft model, AdaSPEC computes a per-token KL divergence loss against the target for both the draft and the reference model, and forms the loss gap ΔL = L_draft − L_ref. Tokens with a larger ΔL are treated as easier to learn: the equally sized reference model already fits them well, so the draft still has room to improve on them. AdaSPEC selects the top-k% of these "easy" tokens and trains the draft model only on this filtered subset, as sketched below.
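A sketch of the filtering step under the same assumptions as above (per-token forward KL against a frozen target); `keep_ratio` is an illustrative stand-in for the top-k% hyperparameter, not a name from the paper.

```python
import torch
import torch.nn.functional as F

def adaspec_selective_loss(draft_logits, ref_logits, target_logits, keep_ratio=0.5):
    """Compute per-token forward KL vs. the target for the draft and reference
    models, keep the top-k% of tokens by the gap dL = L_draft - L_ref, and
    average the draft loss over that subset only."""
    target_logp = F.log_softmax(target_logits, dim=-1)
    target_p = target_logp.exp()

    def per_token_kl(student_logits):
        student_logp = F.log_softmax(student_logits, dim=-1)
        return (target_p * (target_logp - student_logp)).sum(dim=-1)  # [batch, seq]

    loss_draft = per_token_kl(draft_logits)             # keeps gradients
    with torch.no_grad():
        loss_ref = per_token_kl(ref_logits)
        gap = loss_draft.detach() - loss_ref            # larger gap = more headroom
        k = max(1, int(keep_ratio * gap.numel()))
        keep_idx = gap.flatten().topk(k).indices        # "easy" tokens to keep
        mask = torch.zeros(gap.numel(), device=gap.device)
        mask[keep_idx] = 1.0
        mask = mask.view_as(gap)

    return (loss_draft * mask).sum() / mask.sum()
```

In this sketch only the draft model receives gradients; the reference and target models are used purely to score token difficulty.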
By focusing the draft model’s limited capacity on tokens it can reliably learn, AdaSPEC achieves higher alignment with the target model, leading to consistently improved acceptance rates across diverse tasks—without sacrificing generation quality.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- 3-Model Speculative Decoding (2025)
- FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction (2025)
- TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs (2025)
- Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding (2025)
- AdaSwitch: Adaptive Switching Generation for Knowledge Distillation (2025)
- Speculate Deep and Accurate: Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding (2025)
- Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding (2025)