🚀 MTP Head: When AI Learned to "Skip Ahead" to Respond up to 3x Faster

A before-and-after comparison of Multi-Token Prediction, with real-world results on Qwen3.6


Have you ever waited for an AI to respond to a question? Sometimes it takes several seconds, or even dozens of seconds. The reason is simple: the AI writes its answer one word at a time.

MTP Head (Multi-Token Prediction) is a new technique that lets the AI predict several upcoming words in advance and then write them all at once, boosting throughput by 2.5 to 3 times while maintaining quality.

📖 What is MTP? A Simple Explanation

Without MTP (the old way)

It's like reading a book word by word:

"Hello" → stop → "world" → stop → "again" → stop ...

Every word requires the AI to "think" individually. If the response is 100 words → the AI has to run 100 times.

With MTP (the new way)

It's like reading with understanding: your brain predicts not only the next word, but also the words that follow it:

"On the road of hardship..." → jump ahead → "hardship will come" → jump ahead → "but happiness will come after", and so on

💡 MTP (Multi-Token Prediction): Instead of only predicting the next word, the AI learns to predict 3-4 future words at once based on context. It's like a speed reader, who doesn't read word by word but takes in entire sentences.

🔬 How Does MTP Work?

Training Mechanism

During training, the model is taught to:

  • Old approach: At position t → predict t+1
  • With MTP: At position t → predict t+1, t+2, t+3, t+4

The network gains additional auxiliary heads that predict these future tokens at each position. The heads learn to predict from the main model's hidden state.
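To make the difference in training targets concrete, here is a minimal sketch of what each position is asked to predict. The helper name `mtp_targets` and the toy sentence are illustrative assumptions, not taken from any real training code, and the loss that would be computed over these targets is left out.

```python
def mtp_targets(tokens, n_future=4):
    """For each position t, list the tokens at t+1 .. t+n_future.

    Positions that would run past the end of the sequence get None.
    With n_future=1 this is ordinary next-token prediction.
    """
    targets = []
    for t in range(len(tokens)):
        row = [tokens[t + k] if t + k < len(tokens) else None
               for k in range(1, n_future + 1)]
        targets.append(row)
    return targets


tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Old approach: one target per position.
print(mtp_targets(tokens, n_future=1)[0])   # ['cat']

# With MTP: four targets per position, one per auxiliary head.
print(mtp_targets(tokens, n_future=4)[0])   # ['cat', 'sat', 'on', 'the']
```

In this picture, each auxiliary head would be trained against one column of the target rows.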

Inference Mechanism (When Responding)

When generating text, MTP heads act like a built-in draft model:

  1. The main model runs one forward pass → returns its hidden state
  2. MTP heads use this hidden state to draft 3-4 tokens at once
  3. The main model verifies the drafted tokens: it keeps the ones that match its own prediction and replaces the first one that does not
  4. If correct → saves time (the accepted tokens are "free", since several tokens come out of one forward pass)
  5. If wrong → the rest of the draft is discarded and generation resumes from the corrected token; even so, the average speed is better
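The draft-and-verify loop above can be sketched with two toy stand-ins. Everything here is hypothetical: `main_next` plays the main model, `mtp_draft` plays the MTP heads (deliberately wrong on one word), and a real engine would check the whole draft in one batched forward pass rather than token by token.

```python
TARGET = "the quick brown fox jumps over the lazy dog".split()


def main_next(context):
    """Toy main model: always knows the correct next token."""
    return TARGET[len(context)] if len(context) < len(TARGET) else None


def mtp_draft(context, k=3):
    """Toy MTP heads: guess the next k tokens, but get 'over' wrong."""
    guesses = TARGET[len(context):len(context) + k]
    return ["under" if g == "over" else g for g in guesses]


def generate(k=3):
    out, passes = [], 0
    while len(out) < len(TARGET):
        draft = mtp_draft(out, k)
        passes += 1  # one main-model pass verifies the whole draft
        for guess in draft:
            truth = main_next(out)
            if guess == truth:
                out.append(guess)   # accepted: a "free" token
            else:
                out.append(truth)   # rejected: take the main model's token
                break               # and discard the rest of the draft
        else:
            # Whole draft accepted: the same pass also yields one more token.
            bonus = main_next(out)
            if bonus is not None:
                out.append(bonus)
    return out, passes


out, passes = generate(k=3)
print(" ".join(out))
print(f"{passes} passes instead of {len(TARGET)}")  # 3 passes instead of 9
```

The output is identical to what the toy main model would write on its own, because every token is either confirmed or replaced by it. Only the number of passes changes.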

📊 Before-and-After MTP Comparison

Indicator | Before MTP | After MTP | Change
Decode rate | 1x (baseline) | 2.5-3x | ⬆️ Up to 3x faster
Quality | 100% | 100% | ✅ No degradation
Acceptance rate | N/A | ~98% | ⬇️ About 2% of drafted tokens are rejected, an acceptable loss
GPU cost | High | Lower | 💰 About 2/3 fewer forward passes on the GPU
Setup | Simple | No draft model needed | 🔧 Simpler

🎯 Real-World Results on Qwen3.6

💻 Model: Qwen3.6 35B-A3B-MTP-GGUF (UD-Q3_K_XL)
📈 GPU: RTX 3090 (24GB VRAM)
⚙️ Setup: llama.cpp with --spec-type draft-mtp --spec-draft-n-max 3

Decode Speed

Task | Without MTP | With MTP | Speedup
📝 Narrative | ~50 tok/s | ~134 tok/s | ~2.7x
💻 Code | ~60 tok/s | ~177 tok/s | ~3.0x
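The speedup column follows directly from the two throughput columns. As a quick check, the sketch below redoes the arithmetic and also converts each speedup into the share of decoding time saved per token. The function names are illustrative only.

```python
def speedup(before_tok_s, after_tok_s):
    """Ratio of decode throughput with MTP to throughput without it."""
    return after_tok_s / before_tok_s


def time_saved(before_tok_s, after_tok_s):
    """Fraction of per-token decode time that disappears with MTP."""
    return 1 - before_tok_s / after_tok_s


results = {"Narrative": (50, 134), "Code": (60, 177)}
for task, (before, after) in results.items():
    print(f"{task}: {speedup(before, after):.2f}x faster, "
          f"{time_saved(before, after):.0%} less time per token")
# Narrative: 2.68x faster, 63% less time per token
# Code: 2.95x faster, 66% less time per token
```

Both tasks land close to the "about 2/3" saving quoted elsewhere in this post.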

Acceptance Rate (token accuracy)

The main model verifies every 3-token draft. Results:

  • Average acceptance rate: ~98%
  • Sometimes MTP is wrong → the draft has to be rewritten (about 2% of the time)
  • The rest of the time the draft is accepted, which saves about 2/3 of forward passes
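As a rough sanity check on these numbers, the sketch below estimates how many tokens one main-model pass produces on average. It rests on two simplifying assumptions that are not from the measurements above: each drafted token is accepted independently with probability p, and every verification pass also contributes one token of its own. The function name is hypothetical.

```python
def expected_tokens_per_pass(p, k):
    """Expected tokens per main-model pass with k drafted tokens.

    A draft token only counts if all tokens before it were accepted,
    so the i-th draft token contributes p**i. The leading 1 is the
    token the main model itself supplies in the verification pass.
    """
    return 1 + sum(p ** i for i in range(1, k + 1))


tokens = expected_tokens_per_pass(p=0.98, k=3)
print(f"{tokens:.2f} tokens per pass")          # 3.88 tokens per pass
print(f"{1 - 1 / tokens:.0%} fewer passes")     # 74% fewer passes
```

This idealized ceiling sits above the measured 2.7 to 3x, which is plausible because drafting and verifying several tokens per pass is not free.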

❓ Why Isn't Acceptance Rate 100%?

The AI doesn't always guess right. It's like speed reading a book and skipping ahead:

  • Correct: You save time, with no need to read that passage again
  • Wrong: You have to go back and reread from the beginning of that passage, which costs a bit of time

But the average speed is still about 2.7 to 3x faster and, most importantly, the output quality is not compromised. The main model still verifies every token before accepting it.

πŸ” How Does MTP Differ From Old Ways?

Before MTP (Traditional Speculative Decoding)

  • Requires two separate models: main model + draft model
  • Increases complexity and GPU cost (two models to load and run)
  • Modest speed improvement (~1.5-2x)

After MTP (Built-in Speculative Decoding)

  • Built into the model, nothing else needed
  • Speedup of 2.5 to 3x
  • GPU cost does not increase: only 1 model is run
  • The MTP heads serve as the built-in draft model
  • Drafted tokens are more accurate because the heads were trained directly on that model

πŸ“ Summary

MTP Head is like the AI learning to "read faster": it doesn't just predict the next word, but also the few words after it. Results:

  • ✅ Up to 3x faster: from ~50 to ~134 tok/s on narrative text (~2.7x) and from ~60 to ~177 tok/s on code (~3.0x)
  • ✅ Quality unchanged: the main model still verifies everything
  • ✅ No extra hardware needed: only a model with an MTP head
  • ✅ Cost savings: about 2/3 fewer forward passes on the GPU
  • ✅ Easy integration: only 2 extra parameters in llama.cpp

And most importantly: it works immediately on models that have been trained with MTP like Qwen3.6, DeepSeek-V3, Nemotron 3 Super.


References: Sebastian Raschka's LLM Architecture Gallery, vLLM Documentation, Qwen3.6 Technical Report, DeepSeek-V3 Technical Report, NVIDIA Megatron-Bridge, FastMTP (Tencent, 2025)