A before-and-after comparison of Multi-Token Prediction: real-world results on Qwen3.6
Have you ever waited for an AI to respond to a question? Sometimes it takes several seconds, or even dozens of seconds. The reason is simple: the AI writes every word one at a time.
MTP Head (Multi-Token Prediction) is a new technique that lets the AI predict the next several words in advance and then write them all at once, boosting throughput by 2.5 to 3 times while maintaining quality.
Without MTP, generation is like reading a book word by word:
"Hello" → stop → "world" → stop → "again" → stop ...
Every word requires the AI to "think" individually. If the response is 100 words, the AI has to run 100 times.
With MTP, it's more like reading with understanding, where your brain predicts not only the next word but also the rest of the sentence:
"On the road of hardship..." → jump ahead → "suffering will come" → jump ahead → "but happiness will come after"
During training, the model is taught to:
Each position in the neural network gains additional auxiliary heads that predict future tokens. These heads learn to predict from the main model's hidden state.
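To make "predict future tokens" concrete, here is a minimal sketch of how the training targets for each head could be laid out. It only illustrates the label shifting; the function name `mtp_targets` and the three-head setup are assumptions for illustration, and the actual head architecture and loss used in Qwen3.6 are not described here.

```python
from typing import List, Optional


def mtp_targets(tokens: List[int], num_heads: int) -> List[List[Optional[int]]]:
    """Build per-position training targets for a model with several prediction heads.

    Head 0 is the ordinary next-token head (target: tokens[i + 1]).
    Head k is an auxiliary head that looks further ahead (target: tokens[i + 1 + k]).
    Positions too close to the end of the sequence have no target (None).
    """
    targets = []
    for i in range(len(tokens)):
        row = []
        for k in range(num_heads):
            j = i + 1 + k
            row.append(tokens[j] if j < len(tokens) else None)
        targets.append(row)
    return targets


tokens = [10, 11, 12, 13, 14]
for position, row in enumerate(mtp_targets(tokens, num_heads=3)):
    print(position, row)
```

Each row lists what the heads at that position are asked to predict: the next token, the one after it, and so on.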
When generating text, MTP heads act like a built-in draft model:
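The draft-and-verify idea can be sketched with two toy "models". This is an illustrative greedy loop under stated assumptions, not the code of any particular inference engine: `main_next` and `draft_next` are hypothetical stand-ins, and a real engine verifies all drafted positions in one batched forward pass instead of calling the main model once per position.

```python
from typing import Callable, List, Tuple


def main_next(ctx: List[int]) -> int:
    """Toy main model: always continues the sequence by counting up."""
    return (ctx[-1] + 1) % 50


def draft_next(ctx: List[int]) -> int:
    """Toy draft head: usually agrees with the main model, but guesses wrong after multiples of 7."""
    if ctx[-1] % 7 == 0:
        return 0
    return (ctx[-1] + 1) % 50


def generate_plain(prompt: List[int], num_new: int) -> List[int]:
    """Baseline: one main-model call per generated token."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(main_next(out))
    return out


def generate_speculative(
    main: Callable[[List[int]], int],
    draft: Callable[[List[int]], int],
    prompt: List[int],
    num_new: int,
    draft_len: int = 3,
) -> Tuple[List[int], int]:
    """Greedy draft-and-verify. Returns (tokens, number of verification passes)."""
    out = list(prompt)
    target_len = len(prompt) + num_new
    passes = 0
    while len(out) < target_len:
        # 1. Draft: cheaply propose draft_len tokens.
        proposal: List[int] = []
        for _ in range(draft_len):
            proposal.append(draft(out + proposal))
        # 2. Verify: counted as ONE main-model pass over all drafted positions.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = main(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # first mismatch: keep the main model's token
                break
        else:
            # Every drafted token matched, so the pass also yields one extra token.
            accepted.append(main(out + accepted))
        out.extend(accepted)
    return out[:target_len], passes


baseline = generate_plain([0], 20)
out, passes = generate_speculative(main_next, draft_next, [0], 20)
print(out == baseline, passes)
```

Because every emitted token is the one the main model would have chosen, the output matches the baseline exactly, while the number of verification passes is well below the number of generated tokens.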
| Indicator | Before MTP | After MTP | Change |
|---|---|---|---|
| Decode rate | 1x (baseline) | 2.5x to 3x | ⬆️ Up to 3x faster |
| Quality | 100% | 100% | ✅ No degradation |
| Acceptance rate | N/A | ~98% | ⬇️ Only about 2% of drafted tokens are rejected |
| GPU cost | High | Lower | 💰 Cuts the number of forward passes on the GPU by about 2/3 |
| Setup | Simple | No draft model needed | 🔧 Simpler |
--spec-type draft-mtp --spec-draft-n-max 3
| Task | Without MTP | With MTP | Speed up |
|---|---|---|---|
| Narrative | ~50 tok/s | ~134 tok/s | ~2.7x |
| Code | ~60 tok/s | ~177 tok/s | ~3.0x |
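As a quick sanity check, the speed-up column can be recomputed from the two throughput columns. The snippet below uses only the figures from the table; the `speedup` helper is introduced here for illustration.

```python
def speedup(before_tok_s: float, after_tok_s: float) -> float:
    """Throughput ratio: how many times faster decoding is with MTP."""
    return after_tok_s / before_tok_s


# (tokens/s without MTP, tokens/s with MTP), taken from the table above
results = {"Narrative": (50, 134), "Code": (60, 177)}

for task, (before, after) in results.items():
    print(f"{task}: {speedup(before, after):.2f}x")
```

The ratios come out at 2.68 and 2.95, which round to the ~2.7x and ~3.0x shown in the table.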
For every draft of 3 tokens, the main model verifies the draft. Results:
The AI doesn't always guess right. It's like speed-reading a book and skipping words.
But the average speed is still up to about 3x faster, and most importantly, the output quality is not compromised: the main model still verifies every token before accepting it.
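A rough way to relate the acceptance rate to the speed-up is to estimate how many tokens each verification pass produces. The sketch below assumes every drafted token is accepted independently with the same probability, which is a simplification; `expected_tokens_per_pass` is a helper introduced here, not part of any library.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per main-model verification pass.

    Assumes each drafted token is accepted independently with probability
    accept_rate. The count includes the one token the main model itself
    supplies on every pass, so the result is
    1 + a + a^2 + ... + a^draft_len.
    """
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


print(round(expected_tokens_per_pass(0.98, 3), 2))  # ~98% acceptance, drafts of 3
print(round(expected_tokens_per_pass(0.70, 3), 2))  # a much weaker draft, for contrast
```

With ~98% acceptance and drafts of 3 tokens this gives about 3.88 tokens per pass. That is an upper bound on the speed-up under these assumptions; the measured 2.5x to 3x is lower because drafting and verification have their own cost.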
MTP Head is like the AI learning to "read faster": it doesn't just predict the next word, but also an entire passage. Results:
And most importantly: it works out of the box on models that were trained with MTP, such as Qwen3.6, DeepSeek-V3, and Nemotron 3 Super.
References: Sebastian Raschka's LLM Architecture Gallery, vLLM Documentation, Qwen3.6 Technical Report, DeepSeek-V3 Technical Report, NVIDIA Megatron-Bridge, FastMTP (Tencent, 2025)