Gemma 4 MTP on GB10 — Inference 1.83x faster on 26B, 3.52x on 31B Dense
3.52x for code generation and ~2.5x for natural language on 31B Dense — an independent single-GPU measurement that comes in above Google's "up to 3x" headline
Headline Result
I measured Google’s Gemma 4 MTP (Multi-Token Prediction) drafter on NVIDIA GB10 (GX10, 121.6GB unified LPDDR5X, 273 GB/s). MTP delivers a real speedup across all three workloads tested. On 31B Dense, Python code generation reaches 3.52x — at or above Google’s headline “up to 3x faster” claim, on a single GPU.
| Category | 26B-A4B + MTP+patch | 31B Dense + MTP+patch |
|---|---|---|
| Japanese financial analysis | 1.20x (16.47s → 13.74s) | 2.54x (91.95s → 36.23s) |
| English macro analysis | 1.36x (15.15s → 11.14s) | 2.52x (91.41s → 36.22s) |
| Python code generation | 1.83x (15.17s → 8.30s) | 3.52x (91.41s → 26.00s) |
Plotting 26B (light blue) and 31B (green) side by side makes the gap obvious — Python on 31B in particular pulls far ahead.
Within my search, no independent benchmark of 31B Dense on a single GPU with MTP turned up in public blogs or GitHub issues, so this measurement appears to be the first of its kind. Note that the freshly published vLLM image does not work as-is: its bundled gemma4_mtp.py returns 0% acceptance until the head commit of PR #41745 is bind-mounted. Section 5 covers this.
1. Test Environment
| Item | Value |
|---|---|
| Hardware | NVIDIA GB10 (GX10 SoC, ARM aarch64, SM 12.1) |
| Memory | LPDDR5X 128GB unified (≈121.6GB usable; CPU/GPU shared, cache coherent) |
| Memory bandwidth | 273 GB/s |
| Target model | RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic (26B-A4B MoE) / RedHatAI/gemma-4-31B-it-FP8-Dynamic (31B Dense) |
| Drafter | google/gemma-4-26B-A4B-it-assistant (832MB) / google/gemma-4-31B-it-assistant (832MB) |
| vLLM image | vllm/vllm-openai:gemma4-0505-arm64-cu130 (pushed the same day Google released MTP) |
| Patch | PR #41745 head commit d8b3826, applied via bind mount |
| Settings | num_speculative_tokens=4 (= K), kv-cache-dtype fp8, max-model-len 4096, temperature=0.7 |
References to “production” mean my daily news-summarization vLLM cron (02:00 JST, 31B-FP8-block).
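All timings are wall-clock over the OpenAI-compatible endpoint the launch commands in section 4 expose. A minimal sketch of the measurement arithmetic — the helper names are mine, and the request body is only constructed here, not sent:

```python
import time

def chars_per_sec(text: str, elapsed_s: float) -> float:
    """Characters per second, the throughput metric reported in the tables below."""
    return len(text) / elapsed_s

def speedup(baseline_s: float, mtp_s: float) -> float:
    """Wall-clock speedup of MTP over the no-MTP baseline."""
    return baseline_s / mtp_s

# Request body matching the settings in section 1; "gemma4-26b-mtp"
# matches --served-model-name in the launch command.
payload = {
    "model": "gemma4-26b-mtp",
    "messages": [{"role": "user", "content": "..."}],  # one of the three prompts
    "temperature": 0.7,
    "max_tokens": 600,
}

# Example with the measured py_code numbers (15.17s -> 8.30s):
print(round(speedup(15.17, 8.30), 2))  # 1.83
```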
2. Detailed Results
2.1 Speedup by language (3 prompts × 3 runs)
Three prompts measured at num_speculative_tokens=4. Full prompts are in section 4.5 (reproduction material).
26B-A4B + MTP+patch
| Category | baseline (Dynamic, no MTP) | + MTP+patch (K=4) | Speedup |
|---|---|---|---|
| ja_finance | 16.47s, 60.2 ch/s, 992 chars | 13.74s, 73.6 ch/s, 1011 chars | 1.20x |
| en_finance | 15.15s, 170.1 ch/s, 2576 chars | 11.14s, 232.7 ch/s, 2592 chars | 1.36x |
| py_code | 15.17s, 148.9 ch/s, 2260 chars | 8.30s, 279.8 ch/s, 2321 chars | 1.83x |
31B Dense + MTP+patch
| Category | baseline (Dynamic, no MTP) | + MTP+patch (K=4) | Speedup |
|---|---|---|---|
| ja_finance | 91.95s, 10.5 ch/s, 967 chars | 36.23s, 26.3 ch/s, 952 chars | 2.54x |
| en_finance | 91.41s, 27.9 ch/s, 2551 chars | 36.22s, 71.0 ch/s, 2572 chars | 2.52x |
| py_code | 91.41s, 24.9 ch/s, 2277 chars | 26.00s, 86.7 ch/s, 2254 chars | 3.52x |
The “code > English > Japanese” ordering reflects how predictable each output is for the drafter:
- Python: deterministic continuations (`def`, `import`, `pd.DataFrame(`, etc.) → high acceptance
- English: natural language, but predictable word order and structure → middle
- Japanese: free word order and fewer characters per token (CJK ≈ 1.5-2 chars per token vs ≈4 for English) → the drafter struggles, and each accepted token advances fewer characters
2.2 Per-position acceptance — 31B drafter beats 26B at every position
Comparing per-position acceptance at K=4 shows the 31B drafter ahead at every position.
The gap is +6.5pt at pos 0, +5.3pt at pos 1, +4.4pt at pos 2, +3.7pt at pos 3 — narrowing at deeper positions but averaging about +5pt. The 31B-specific drafter shares the same 32-layer / hidden-dim structure as the 31B target, so the predict-then-verify hidden-state alignment is naturally tighter.
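The per-position rates and the mean accepted length are directly related: with vLLM's prefix-based accounting (accepted-at-position-i divided by total drafts), the mean accepted length is simply the sum of the per-position rates. A quick check against the numbers from Appendix A:

```python
# Per-position acceptance at K=4, from Appendix A (A.1 and A.3)
POS_RATES = {
    "26B": [0.722, 0.536, 0.412, 0.309],
    "31B": [0.787, 0.589, 0.456, 0.346],
}

def mean_accepted(rates: list[float]) -> float:
    # With prefix-based accounting, the expected accepted length
    # per draft is the sum of the per-position acceptance rates.
    return sum(rates)

# 31B-minus-26B gap at each position, in percentage points
gaps_pt = [round((b - a) * 100, 1) for a, b in zip(POS_RATES["26B"], POS_RATES["31B"])]
print(gaps_pt)                                     # [6.5, 5.3, 4.4, 3.7]
print(round(mean_accepted(POS_RATES["26B"]), 2))   # 1.98
print(round(mean_accepted(POS_RATES["31B"]), 2))   # 2.18
```

The sums land exactly on the reported mean accepted lengths (1.98 and 2.18), which is a useful consistency check when reading vLLM's spec-decode logs.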
2.3 26B vs 31B head-to-head (Python yfinance task)
Lining up four configurations on absolute speed (chars/s) makes the magnitude of 26B+MTP’s lead obvious.
| Configuration | sec | chars/s | vs 26B baseline | Use case |
|---|---|---|---|---|
| 26B baseline | 15.17s | 148.9 | 1.00× | (reference) |
| 26B + MTP+patch | 8.30s | 279.8 | 1.83× ← fastest | Speed first, MoE |
| 31B baseline | 91.41s | 24.9 | 0.17× | (current production, quality first) |
| 31B + MTP+patch | 26.00s | 86.7 | 0.58× | 31B quality + MTP, practical option |
Absolute speed crown goes to 26B + MTP+patch; the largest speedup ratio (and thus the strongest case for MTP) goes to 31B + MTP+patch. The right choice depends on the workload — section 6 covers this.
2.4 Why 31B’s speedup ratio is larger
The gap from 26B 1.83x to 31B 3.52x is roughly 2x. Three contributing factors:
1. Memory-bandwidth bound is more dominant
Comparing baseline absolute speed shows that a larger target is more deeply memory-bandwidth bound:
- 26B baseline: 15.17s for 600 tokens = 39.5 tok/s
- 31B baseline: 91.41s for 600 tokens = 6.56 tok/s (= 1/6 of 26B)
31B Dense is 5-7x heavier on weight loads than 26B-A4B (which only activates 4B with MoE). GB10’s 273 GB/s is more saturated, and that is precisely the regime where MTP’s amortization (sharing one memory read across K tokens) helps most.
2. Drafter accept rate is higher
26B mean 1.98 → 31B mean 2.18 — effective K (the number of tokens that actually advance) goes up by 10%.
3. Drafter overhead is relatively smaller against the larger target
The drafter is the same gemma-4-*-it-assistant (832MB) in both cases. The bigger the target, the smaller the relative cost of the drafter forward.
Theoretical max speedup = (mean_accepted + 1) / (1 + drafter_overhead):
| Target | mean_accepted + 1 | 1 + drafter_overhead | Theoretical | Measured (Python) | Measured/theoretical |
|---|---|---|---|---|---|
| 26B | ~2.98 | ~1.7 | 1.75× | 1.83× | 105% |
| 31B | ~3.18 | ~0.9 | 3.5× | 3.52× | 100% |
The gap between theory and measurement closes as target grows.
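Factor 1 can be sanity-checked with a crude roofline. Under the simplifying assumption that decode is purely weight-bandwidth bound — ignoring KV cache and activation traffic, so this is an estimate, not a measurement — the per-token ceiling is memory bandwidth divided by bytes of weights read per token:

```python
BANDWIDTH_GBPS = 273.0  # GB10 LPDDR5X, from section 1

def decode_ceiling_tok_s(active_weight_gb: float) -> float:
    """Upper bound on decode tok/s if every token re-reads the active weights."""
    return BANDWIDTH_GBPS / active_weight_gb

# 31B Dense FP8: ~31 GB of weights read per token
print(round(decode_ceiling_tok_s(31.0), 1))  # ~8.8 vs 6.56 tok/s measured
# 26B-A4B MoE FP8: only ~4 GB of expert weights active per token
print(round(decode_ceiling_tok_s(4.0), 1))   # ~68.2 vs 39.5 tok/s measured
```

31B runs at roughly 75% of this crude ceiling while 26B-A4B runs at roughly 58%: the dense model really is closer to the bandwidth wall, which is exactly where amortizing one weight read across K speculative tokens pays off most.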
2.5 Python execution = fast and correct
Speed numbers alone do not guarantee that MTP produced correct code. To check, I ran the generated Python code in a subprocess against yfinance + pandas + matplotlib.
Trace data (max_tokens=1500, 26B, full version)
| Variant | Total | Chars | Function name |
|---|---|---|---|
| baseline (run2) | 26.81s | 2816 | fetch_nikkei_monthly_returns() |
| MTP+patch (run3) | 15.58s | 2931 | get_nikkei_monthly_returns() |
Both models produced code that:
- Fetched ten years of monthly close prices for `^N225` via the yfinance API
- Computed monthly returns with `pct_change()`
- Returned a `[date, close, monthly_return]` DataFrame
- Successfully retrieved 119 months of data
- Saved a matplotlib bar chart of monthly returns (red = down, green = up)
End-to-end (generate + execute):
- baseline: 26.81 + 2.35 = 29.16s
- MTP+patch: 15.58 + 2.28 = 17.86s
- end-to-end speedup = 1.63x (saving 11.30s)
Fast (1.72x on generation alone, 1.63x end-to-end) and correct (both produced working code) at the same time. MTP is lossless on quality, not just speed. The video at the top of this article shows this end-to-end: generation race → execution → both models producing equivalent Nikkei 225 monthly-return charts.
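The execution check can be sketched as a small harness — a minimal version assuming the model wraps its answer in a ```python fence; the helper names are mine, not from the production script:

```python
import re
import subprocess
import sys
import tempfile

def extract_code(completion: str) -> str:
    """Pull the first ```python fenced block out of a model completion."""
    m = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
    if not m:
        raise ValueError("no python code block found in completion")
    return m.group(1)

def run_generated(code: str, timeout_s: float = 60.0) -> subprocess.CompletedProcess:
    """Execute generated code in a throwaway subprocess with a timeout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    return subprocess.run([sys.executable, path],
                          capture_output=True, text=True, timeout=timeout_s)

# Smoke test with a trivial completion instead of a real model response:
completion = "Here you go:\n```python\nprint(2 + 2)\n```"
result = run_generated(extract_code(completion))
print(result.stdout.strip())  # 4
```

The real run does the same thing with the yfinance code and checks the saved chart and row count afterwards.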
2.6 Multi-language videos (generation only)
Japanese (BoJ YCC → bank stocks)
baseline 16.47s, 60.2 ch/s → MTP+patch 13.74s, 73.6 ch/s = 1.20x speedup. Japanese is harder for the drafter to land at deeper positions, so the speedup is modest but consistent.
English (Fed QT → US 10Y yield)
baseline 15.15s, 170.1 ch/s → MTP+patch 11.14s, 232.7 ch/s = 1.36x speedup. English yields higher acceptance than Japanese and pulls more speedup.
3. K-sweep (num_speculative_tokens)
I ran four container restarts at K=1, 2, 4, 8, each over 3 prompts × 3 runs (max_tokens=600) on 26B-A4B.
3.1 Speedup vs K (3-run average, 26B-A4B)
baseline (Dynamic, no MTP): ja 16.47s, en 15.15s, py 15.17s (= 1.0x reference)
| K | ja_finance | en_finance | py_code |
|---|---|---|---|
| 1 | 11.89s, 1.39x | 11.57s, 1.31x | 10.74s, 1.41x |
| 2 | 12.19s, 1.35x | 11.33s, 1.34x | 9.88s, 1.54x |
| 4 | 11.90s, 1.38x | 11.38s, 1.33x | 8.66s, 1.75x |
| 8 | 15.77s, 1.04x | 14.12s, 1.07x | 9.54s, 1.59x |
3.2 Practical recommendation
- Code generation / structured output (yaml/json/SQL): K=4 is best
- Natural language (Japanese summaries / English blog text): K=1 is enough (K=2-4 barely changes results)
- Do not use K=8 — pos 7 acceptance drops to 8.9% (essentially noise), and eight drafter forwards plus reject-time KV cache rollback costs eat the accept gain
vLLM defaults to K=4, but for natural-language workloads K=1 saves both memory and compute. Detailed per-position acceptance and theoretical comparison are in Appendix A.
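Switching K is just a change to the --speculative-config JSON. A tiny helper that encodes the recommendation above (the workload→K mapping is mine; the drafter name is from section 1):

```python
import json

# K per workload class, per the recommendation above
K_BY_WORKLOAD = {"code": 4, "structured": 4, "natural_language": 1}

def speculative_config(drafter: str, workload: str) -> str:
    """Build the JSON string passed to vLLM's --speculative-config flag."""
    return json.dumps({
        "method": "mtp",
        "model": drafter,
        "num_speculative_tokens": K_BY_WORKLOAD[workload],
    })

cfg = speculative_config("google/gemma-4-26B-A4B-it-assistant", "natural_language")
print(cfg)
```

Swap the resulting string into the launch commands in section 4 to run a K=1 natural-language configuration.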
4. Reproduction
4.1 26B-A4B + MTP+patch launch command
```shell
docker run -d --name vllm-mtp \
  --gpus all --ipc host --shm-size 64gb \
  -p 8000:8000 \
  -v $HOME/ai_news2fund/hf_cache:/root/.cache/huggingface \
  -v $HOME/mtp_assets/gemma4_mtp.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4_mtp.py:ro \
  vllm/vllm-openai:gemma4-0505-arm64-cu130 \
  RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic \
  --served-model-name gemma4-26b-mtp \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","model":"google/gemma-4-26B-A4B-it-assistant","num_speculative_tokens":4}'
```
4.2 31B Dense + MTP+patch launch command
```shell
docker run -d --name vllm-mtp \
  --gpus all --ipc host --shm-size 64gb \
  -p 8000:8000 \
  -v $HOME/ai_news2fund/hf_cache:/root/.cache/huggingface \
  -v $HOME/mtp_assets/gemma4_mtp.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4_mtp.py:ro \
  vllm/vllm-openai:gemma4-0505-arm64-cu130 \
  RedHatAI/gemma-4-31B-it-FP8-Dynamic \
  --served-model-name gemma4-31b-mtp \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":4}'
```
Neither --enforce-eager nor --enable-chunked-prefill is needed. The patch (bind mount) alone is enough.
4.3 Patch file
Download gemma4_mtp.py directly from PR #41745 head commit d8b3826 and save to $HOME/mtp_assets/gemma4_mtp.py.
```shell
mkdir -p $HOME/mtp_assets
curl -fsSL -o $HOME/mtp_assets/gemma4_mtp.py \
  https://raw.githubusercontent.com/vllm-project/vllm/d8b3826648da6b407f8c55457a2103be9aeb5d83/vllm/model_executor/models/gemma4_mtp.py
```
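After restarting the container, confirm the patch actually took effect by checking the server's Prometheus endpoint (/metrics) for non-zero draft acceptance instead of eyeballing latency. Spec-decode metric names vary across vLLM versions, so the sample text below is illustrative, not verbatim — grep your own /metrics output for `spec_decode` first:

```python
import re

def spec_decode_counters(metrics_text: str) -> dict[str, float]:
    """Collect spec_decode-related counters from Prometheus metrics text."""
    out: dict[str, float] = {}
    for line in metrics_text.splitlines():
        if line.startswith("#") or "spec_decode" not in line:
            continue
        name, _, value = line.rpartition(" ")
        out[name] = float(value)
    return out

# Illustrative sample (metric names assumed, values from the 26B K=4 run):
sample = """\
# HELP vllm:spec_decode_num_draft_tokens_total ...
vllm:spec_decode_num_draft_tokens_total 7256.0
vllm:spec_decode_num_accepted_tokens_total 3591.0
"""
counters = spec_decode_counters(sample)
accepted = counters["vllm:spec_decode_num_accepted_tokens_total"]
assert accepted > 0, "0 accepted tokens: the unpatched gemma4_mtp.py symptom"
print(round(accepted / counters["vllm:spec_decode_num_draft_tokens_total"], 3))  # 0.495
```

If the accepted counter stays at zero after real traffic, the bind mount did not land (section 5).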
4.4 Memory footprint
- 26B-A4B: target 26GB FP8 + drafter 832MB BF16 + KV cache (max-model-len 4096 ≈ 4GB) ≈ 31GB GPU RAM
- 31B Dense: target 31GB FP8 + drafter 832MB BF16 + KV cache ≈ 36GB GPU RAM
GB10 with 121.6GB unified memory has plenty of headroom. DGX Spark / B200 / H100 80GB / RTX 6000 48GB should all run this fine.
4.5 Full prompts used in this validation
For reproducibility, the exact prompts used are reproduced verbatim below (defined in tools/mtp_traces/prompts.json). temperature=0.7, max_tokens=600 (Python additionally re-run at 1500 in a later subsection).
ja_finance — Japanese-language financial analysis task
日銀のYCC撤廃が日本の銀行株 (三菱UFJ / 三井住友 / みずほ) に与える影響を、NIM・含み損・配当方針の3点で詳細に解説してください。各論点について、足元の数値や具体的な政策のメカニズムを盛り込み、投資家が実務で使える形で整理してください。
(English gloss: explain how the BoJ’s YCC removal affects the major Japanese bank stocks (MUFG / SMFG / Mizuho), covering NIM, unrealized losses, and dividend policy with current data and concrete policy mechanisms, organized for practical investor use.)
en_finance — English-language macro analysis task
Explain how the Fed’s quantitative tightening (QT) affects the US 10Y treasury yield through three channels: TGA drawdown, RRP utilization, and bank reserve scarcity. Include current data context (2025-2026), the transmission mechanism for each channel, and provide three actionable insights for fixed income traders managing duration risk.
py_code — Python code generation task (PEP 8 + type hints + error handling specified)
Write a Python function using yfinance that fetches monthly close prices for the Nikkei 225 (`^N225`) for the past 10 years, computes month-over-month returns, and returns a pandas DataFrame with columns `[date, close, monthly_return]`. Include error handling for missing data and an example usage block. Use type hints and follow PEP 8.
Selection criteria: (1) reflects three workload types representative of my actual production needs (Japanese summarization / English summarization / code generation); (2) covers different drafter difficulty levels (free word order in natural language vs deterministic structure in code); (3) yields output ranges of 600-2900 chars at the same max_tokens (CJK ≈ 1.5-2 chars per token, English ≈ 4 chars per token, code in between).
5. Why the patch is needed
Using vllm/vllm-openai:gemma4-0505-arm64-cu130 as-is, MTP fails on any quantized target — acceptance rate stays at 0% no matter what flags you try. Adding --enforce-eager, dropping --kv-cache-dtype fp8, lowering num_speculative_tokens to 1 — none of these change the result.
The cause is two bugs in the bundled gemma4_mtp.py, identified by @hospedales in PR #41745 with a patch attached.
| # | Location | Bug | Effect |
|---|---|---|---|
| 1 | intermediate_size lookup (line 297) | The drafter MLP requires text_config.intermediate_size = 8192, but the original code pulls 4096 from the top-level config — half the size | Weight load fails or computation is corrupted |
| 2 | quant_config propagation (lines 186/193/299) | The target’s quant_config (FP8-block / FP8-Dynamic / NVFP4) is propagated into the drafter’s Linear layers. The drafter is supposed to hold BF16 weights | vLLM applies packing to drafter parameters → shape mismatch at weight-load time, or silent corruption. Always triggers when the target is quantized |
Mechanically, the Docker image (gemma4-0505-arm64-cu130) was built from the PR’s first commit and does not include the bug fixes that landed afterward. Bind-mounting the head commit’s gemma4_mtp.py is the shortest fix until the image is rebuilt.
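The two fixes reduce to small, local changes. A schematic sketch of their effect — not the actual PR diff, and the helper names are mine; only the field names follow the bug table above:

```python
from types import SimpleNamespace

def drafter_mlp_intermediate_size(config) -> int:
    """Bug 1 fix: read intermediate_size from text_config when present,
    instead of the top-level config (which holds the wrong 4096)."""
    text_cfg = getattr(config, "text_config", None)
    if text_cfg is not None:
        return text_cfg.intermediate_size
    return config.intermediate_size

def drafter_quant_config(target_quant_config):
    """Bug 2 fix: never propagate the target's quant_config into the
    drafter's Linear layers; the drafter holds plain BF16 weights."""
    return None  # drafter stays unquantized regardless of the target

# Dummy config mimicking the nested shape of the HF config:
cfg = SimpleNamespace(intermediate_size=4096,
                      text_config=SimpleNamespace(intermediate_size=8192))
print(drafter_mlp_intermediate_size(cfg))  # 8192, not 4096
print(drafter_quant_config("fp8-dynamic"))  # None
```

Bug 2 explains why the failure only triggers with quantized targets: with a BF16 target there is no quant_config to mis-propagate.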
Lessons:
- Do not trust newly published vLLM images — when a PR’s first commit is what is bundled, later bug fixes that land in the same PR are not present
- Read the "complete command" of third-party working examples carefully — the bind-mount patch was tucked outside the visible code block in ai-muninn's DGX Spark 26B article, easy to miss if you only scan the visible `docker run` line
- Until the upstream image is rebuilt, bind-mounting is the practical workaround
6. Production guidance
| Use case | Recommended config | Expected performance |
|---|---|---|
| Speed first (code generation, structured output) | 26B-A4B + MTP+patch, K=4 | Python finishes in 8.30s; 270+ chars/s on English |
| Need 31B quality and want speed | 31B Dense + MTP+patch, K=4 | 28% of 31B baseline time (= 3.52x faster); comfortably interactive |
| Natural-language summarization (e.g., Japanese news) | 26B / 31B + MTP+patch, K=1 | Same speedup as K=4 with less memory and compute |
| Balanced quality and speed | 26B-A4B baseline (no MTP) | 39.5 tok/s; sufficient for many use cases, simpler config |
My production setup (currently 31B-FP8-block at 6.46 tok/s) becomes 31B-Dynamic + MTP+patch at ≈28.5 tok/s, 4.4x faster at the same quality — worth seriously considering as a production swap (pending K=1 validation for Japanese fit).
Appendix A: K-sweep details
A.1 Acceptance metrics (cumulative across 9 traces × 4 K, 26B-A4B)
| K | drafts | draft tok | accepted | overall rate | mean accepted | per-position rate (%) |
|---|---|---|---|---|---|---|
| 1 | 3027 | 3027 | 2367 | 78.2% | 0.78/1 | 78.2 |
| 2 | 2306 | 4612 | 3091 | 67.0% | 1.34/2 | 76.2 / 57.8 |
| 4 | 1814 | 7256 | 3591 | 49.5% | 1.98/4 | 72.2 / 53.6 / 41.2 / 30.9 |
| 8 | 1576 | 12608 | 3829 | 30.4% | 2.43/8 | 73.4 / 52.3 / 37.1 / 25.8 / 19.3 / 15.2 / 11.0 / 8.9 |
A.2 Mean accepted length vs effective speedup (26B, Python)
Theoretical speedup ≈ (mean_accepted + 1) / (1 + drafter_overhead) ≈ mean_accepted + 1.
| K | mean accept | Theoretical speedup | Measured (py) | Measured/theoretical |
|---|---|---|---|---|
| 1 | 0.78 | ~1.78x | 1.41x | 79% |
| 2 | 1.34 | ~2.34x | 1.54x | 66% |
| 4 | 1.98 | ~2.98x | 1.75x | 59% |
| 8 | 2.43 | ~3.43x | 1.59x | 46% |
Larger K means a larger gap between theory and measurement. Drafter forwards, KV cache rollback, and scheduling overhead all surface — that is exactly why K=8 turns counterproductive.
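The A.2 table can be regenerated from the acceptance numbers alone — a sketch using the simplified formula above (drafter overhead dropped), with the mean-accepted and measured values copied from A.1 and A.2:

```python
MEAN_ACCEPT = {1: 0.78, 2: 1.34, 4: 1.98, 8: 2.43}  # from A.1
MEASURED_PY = {1: 1.41, 2: 1.54, 4: 1.75, 8: 1.59}  # Python, from section 3.1

for k in (1, 2, 4, 8):
    theoretical = MEAN_ACCEPT[k] + 1.0  # simplified: overhead term dropped
    efficiency = MEASURED_PY[k] / theoretical
    print(f"K={k}: theoretical ~{theoretical:.2f}x, "
          f"measured {MEASURED_PY[k]:.2f}x, efficiency {efficiency:.0%}")
```

The efficiency column falls monotonically (79% → 66% → 59% → 46%), which is the quantitative version of "larger K widens the gap between theory and measurement."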
A.3 31B acceptance metrics (K=4, cumulative across 9 traces + 1 probe)
- drafts = 1890
- draft tokens = 7560 (= 1890 × 4)
- accepted = 4117
- overall accept = 54.5% (26B: 49.5%)
- mean accepted = 2.18 / 4 (26B: 1.98 / 4)

Per-position acceptance:
- pos 0: 78.7% (26B: 72.2%)
- pos 1: 58.9% (26B: 53.6%)
- pos 2: 45.6% (26B: 41.2%)
- pos 3: 34.6% (26B: 30.9%)
Sources
- Google blog: Accelerating Gemma 4 with MTP drafters
- vLLM Recipes: Gemma 4 Usage Guide
- NVIDIA Developer Forum: Gemma 4 MTP on DGX Spark — TP=1 working config
- ai-muninn: DGX Spark Gemma 4 hits 108 tok/s — 26B success case
- vLLM Issue #36331: 0% acceptance with Qwen NVFP4
- vLLM PR #40898: SWA drafter low acceptance fix
- vLLM PR #41745: Gemma 4 MTP support — patch source
- HuggingFace: google/gemma-4-31B-it-assistant
- HuggingFace: RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic
- HuggingFace: RedHatAI/gemma-4-31B-it-FP8-Dynamic