
Gemma 4 MTP on GB10 — Inference 1.83x faster on 26B, 3.52x on 31B Dense

3.52x for code generation and ~2.5x for natural language on 31B Dense — independent measurement on a single GPU (above Google's '3x' headline)

Headline Result

I measured Google’s Gemma 4 MTP (Multi-Token Prediction) drafter on NVIDIA GB10 (GX10, 121.6GB unified LPDDR5X, 273 GB/s). MTP delivers a real speedup across all three workloads tested. On 31B Dense, Python code generation reaches 3.52x — at or above Google’s headline “up to 3x faster” claim, on a single GPU.

| Category | 26B-A4B + MTP+patch | 31B Dense + MTP+patch |
|---|---|---|
| Japanese financial analysis | 1.20x (16.47s → 13.74s) | 2.54x (91.95s → 36.23s) |
| English macro analysis | 1.36x (15.15s → 11.14s) | 2.52x (91.41s → 36.22s) |
| Python code generation | 1.83x (15.17s → 8.30s) | 3.52x (91.41s → 26.00s) |

Plotting 26B (light blue) and 31B (green) side by side makes the gap obvious — Python on 31B in particular pulls far ahead.

[Figure: Speedup ratio by workload (baseline = 1.00×, K=4). Japanese financial analysis: 26B+MTP 1.20×, 31B+MTP 2.54×; English macro analysis: 1.36× vs 2.52×; Python code generation: 1.83× vs 3.52×. Bars normalized to 3.52× = 100%; 31B (green) clearly outpaces 26B (light blue) on every workload.]

As far as I could find, no independent benchmark of 31B Dense + single GPU + MTP has been published in blogs or GitHub issues, so this measurement appears to be the first of its kind. Note that the freshly published vLLM image does not work as-is: its bundled gemma4_mtp.py returns 0% acceptance until the head commit of PR #41745 is bind-mounted over it. Section 5 covers this.


1. Test Environment

| Item | Value |
|---|---|
| Hardware | NVIDIA GB10 (GX10 SoC, ARM aarch64, SM 12.1) |
| Memory | LPDDR5X 128GB unified, CPU/GPU shared, cache coherent (~121.6GB usable) |
| Memory bandwidth | 273 GB/s |
| Target model | RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic (26B-A4B MoE) / RedHatAI/gemma-4-31B-it-FP8-Dynamic (31B Dense) |
| Drafter | google/gemma-4-26B-A4B-it-assistant (832MB) / google/gemma-4-31B-it-assistant (832MB) |
| vLLM image | vllm/vllm-openai:gemma4-0505-arm64-cu130 (pushed the same day Google released MTP) |
| Patch | PR #41745 head commit d8b3826, applied via bind mount |
| Settings | num_speculative_tokens=4 (= K), kv-cache-dtype fp8, max-model-len 4096, temperature=0.7 |

References to “production” mean my daily news-summarization vLLM cron (02:00 JST, 31B-FP8-block).


2. Detailed Results

2.1 Speedup by language (3 prompts × 3 runs)

Three prompts measured at num_speculative_tokens=4. Full prompts are in section 4.5 (reproduction material).

26B-A4B + MTP+patch

| Category | baseline (Dynamic, no MTP) | + MTP+patch (K=4) | Speedup |
|---|---|---|---|
| ja_finance | 16.47s, 60.2 ch/s, 992 chars | 13.74s, 73.6 ch/s, 1011 chars | 1.20x |
| en_finance | 15.15s, 170.1 ch/s, 2576 chars | 11.14s, 232.7 ch/s, 2592 chars | 1.36x |
| py_code | 15.17s, 148.9 ch/s, 2260 chars | 8.30s, 279.8 ch/s, 2321 chars | 1.83x |

31B Dense + MTP+patch

| Category | baseline (Dynamic, no MTP) | + MTP+patch (K=4) | Speedup |
|---|---|---|---|
| ja_finance | 91.95s, 10.5 ch/s, 967 chars | 36.23s, 26.3 ch/s, 952 chars | 2.54x |
| en_finance | 91.41s, 27.9 ch/s, 2551 chars | 36.22s, 71.0 ch/s, 2572 chars | 2.52x |
| py_code | 91.41s, 24.9 ch/s, 2277 chars | 26.00s, 86.7 ch/s, 2254 chars | 3.52x |

The “code > English > Japanese” ordering reflects how predictable each output is for the drafter:

  • Python: deterministic continuations (def, import, pd.DataFrame(, etc.) → high acceptance
  • English: natural language, but predictable word order and structure → middle
  • Japanese: free word order, and fewer characters per token (CJK ≈ 1.5-2 chars per token vs ≈4 for English) → the drafter struggles, and each accepted token advances fewer characters

2.2 Per-position acceptance — 31B drafter beats 26B at every position

Comparing per-position acceptance at K=4 shows the 31B drafter ahead at every position.

Per-position acceptance rate (K=4):

| Position | 26B | 31B |
|---|---|---|
| pos 0 | 72.2% | 78.7% |
| pos 1 | 53.6% | 58.9% |
| pos 2 | 41.2% | 45.6% |
| pos 3 | 30.9% | 34.6% |

The gap is +6.5pt at pos 0, +5.3pt at pos 1, +4.4pt at pos 2, +3.7pt at pos 3 — narrowing at deeper positions but averaging about +5pt. The 31B-specific drafter shares the same 32-layer / hidden-dim structure as the 31B target, so the predict-then-verify hidden-state alignment is naturally tighter.
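These per-position rates are cumulative (a draft token at position i counts as accepted only if positions 0 through i-1 were also accepted), so their sum equals the mean accepted length reported in Appendix A. A quick check in Python:

pos_26b = [0.722, 0.536, 0.412, 0.309]
pos_31b = [0.787, 0.589, 0.456, 0.346]
# Sum of cumulative per-position acceptance = expected accepted tokens per draft.
print(round(sum(pos_26b), 2))  # 1.98, the 26B "mean accepted" in Appendix A.1
print(round(sum(pos_31b), 2))  # 2.18, the 31B value in A.3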

2.3 26B vs 31B head-to-head (Python yfinance task)

Lining up four configurations on absolute speed (chars/s) makes the magnitude of 26B+MTP’s lead obvious.

| Configuration | sec | chars/s | vs 26B baseline | Use case |
|---|---|---|---|---|
| 26B baseline | 15.17s | 148.9 | 1.00× | (reference) |
| 26B + MTP+patch | 8.30s | 279.8 | 1.83× (fastest) | Speed first, MoE |
| 31B baseline | 91.41s | 24.9 | 0.17× | (current production, quality first) |
| 31B + MTP+patch | 26.00s | 86.7 | 0.58× | 31B quality + MTP, practical option |

[Figure: Python generation speed, absolute (chars/s): 26B baseline 148.9; 26B + MTP+patch 279.8 (fastest); 31B baseline 24.9; 31B + MTP+patch 86.7.]

Absolute speed crown goes to 26B + MTP+patch; the largest speedup ratio (and thus the strongest case for MTP) goes to 31B + MTP+patch. The right choice depends on the workload — section 6 covers this.

2.4 Why 31B’s speedup ratio is larger

The gap from 26B 1.83x to 31B 3.52x is roughly 2x. Three contributing factors:

1. Memory-bandwidth bound is more dominant

Comparing baseline absolute speed shows that a larger target is more deeply memory-bandwidth bound:

  • 26B baseline: 15.17s for 600 tokens = 39.5 tok/s
  • 31B baseline: 91.41s for 600 tokens = 6.56 tok/s (= 1/6 of 26B)

31B Dense is 5-7x heavier on weight loads than 26B-A4B (which only activates 4B with MoE). GB10’s 273 GB/s is more saturated, and that is precisely the regime where MTP’s amortization (sharing one memory read across K tokens) helps most.
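A back-of-envelope check (my estimate, not a measured counter: it assumes reading the active parameter bytes once per token dominates memory traffic, and ignores KV-cache and activation reads) shows how much closer the 31B baseline sits to the 273 GB/s ceiling:

BW = 273.0  # GB10 memory bandwidth, GB/s
for name, active_gb, tok_s in [
    ("26B-A4B (~4B active, FP8)", 4.0, 39.5),
    ("31B Dense (FP8)", 31.0, 6.56),
]:
    traffic = active_gb * tok_s  # implied GB/s of weight reads at that speed
    print(f"{name}: ~{traffic:.0f} GB/s ({traffic / BW:.0%} of peak)")
# ~158 GB/s (58%) for 26B-A4B vs ~203 GB/s (75%) for 31B Dense: the 31B
# baseline runs much closer to the bandwidth ceiling, the regime where
# MTP's K-token amortization helps most.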

2. Drafter accept rate is higher

26B mean 1.98 → 31B mean 2.18 — effective K (the number of tokens that actually advance) goes up by 10%.

3. Drafter overhead is relatively smaller against the larger target

The drafter is the same gemma-4-*-it-assistant (832MB) in both cases. The bigger the target, the smaller the relative cost of the drafter forward.

Theoretical max speedup = (mean_accepted + 1) / (1 + drafter_overhead):

| Target | mean_accepted + 1 | 1 + drafter_overhead | Theoretical | Measured (Python) | Measured/theoretical |
|---|---|---|---|---|---|
| 26B | ~2.98 | ~1.7 | 1.75× | 1.83× | 105% |
| 31B | ~3.18 | ~0.9 | 3.5× | 3.52× | 100% |

The gap between theory and measurement closes as the target grows.

2.5 Python execution = fast and correct

Speed numbers alone do not guarantee that MTP produced correct code. To check, I ran the generated Python code in a subprocess against yfinance + pandas + matplotlib.
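A minimal sketch of that check (assumed shape of the harness, not the exact script): write the generated code to a temp file and execute it in a fresh interpreter, timing the run. It assumes yfinance, pandas, and matplotlib are installed, as the prompt in 4.5 requires.

import subprocess, sys, tempfile, time

def execute_generated(code: str, timeout: float = 120.0) -> float:
    """Run generated Python in a subprocess; return execution time in seconds."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    t0 = time.perf_counter()
    result = subprocess.run([sys.executable, path],
                            capture_output=True, text=True, timeout=timeout)
    elapsed = time.perf_counter() - t0
    if result.returncode != 0:
        raise RuntimeError(result.stderr)
    return elapsed  # ~2.3s for both variants in the end-to-end numbers below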

Trace data (max_tokens=1500, 26B, full version)

| Variant | Total | Chars | Function name |
|---|---|---|---|
| baseline (run2) | 26.81s | 2816 | fetch_nikkei_monthly_returns() |
| MTP+patch (run3) | 15.58s | 2931 | get_nikkei_monthly_returns() |

Both runs produced code that:

  • Fetched ten years of monthly close for ^N225 via the yfinance API
  • Computed monthly returns with pct_change()
  • Returned a [date, close, monthly_return] DataFrame
  • Successfully retrieved 119 months of data
  • Saved a matplotlib bar chart of monthly returns (red = down, green = up)

End-to-end (generate + execute):

baseline:  26.81 + 2.35 = 29.16s
MTP+patch: 15.58 + 2.28 = 17.86s
→ end-to-end speedup = 1.63x (saving 11.30s)

Fast (1.72x on generation alone, 1.63x end to end) and correct (both runs produced working code) at the same time. MTP is lossless on quality, not just speed. The video at the top of this article shows this end-to-end: generation race → execution → both variants producing equivalent Nikkei 225 monthly-return charts.

2.6 Multi-language videos (generation only)

Japanese (BoJ YCC → bank stocks)

baseline 16.47s, 60.2 ch/s → MTP+patch 13.74s, 73.6 ch/s = 1.20x speedup. Japanese is harder for the drafter to land at deeper positions, so the speedup is modest but consistent.

English (Fed QT → US 10Y yield)

baseline 15.15s, 170.1 ch/s → MTP+patch 11.14s, 232.7 ch/s = 1.36x speedup. English yields higher acceptance than Japanese and pulls more speedup.


3. K-sweep (num_speculative_tokens)

I ran four container restarts at K=1, 2, 4, 8, each over 3 prompts × 3 runs (max_tokens=600) on 26B-A4B.
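The restart loop looked roughly like this (a sketch under assumptions: flags abbreviated to the essentials from section 4.1, readiness wait simplified, per-prompt timing done with the harness sketched in section 4.5):

import json, subprocess, time

IMAGE = "vllm/vllm-openai:gemma4-0505-arm64-cu130"
TARGET = "RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic"
DRAFTER = "google/gemma-4-26B-A4B-it-assistant"

def relaunch(k: int) -> None:
    """Tear down the container and relaunch with num_speculative_tokens=k."""
    subprocess.run(["docker", "rm", "-f", "vllm-mtp"], check=False)
    spec = json.dumps({"method": "mtp", "model": DRAFTER,
                       "num_speculative_tokens": k})
    subprocess.run(
        ["docker", "run", "-d", "--name", "vllm-mtp",
         "--gpus", "all", "--ipc", "host", "--shm-size", "64gb",
         "-p", "8000:8000",
         # plus the HF-cache and gemma4_mtp.py bind mounts from section 4.1
         IMAGE, TARGET,
         "--served-model-name", "gemma4-26b-mtp",
         "--max-model-len", "4096",
         "--kv-cache-dtype", "fp8",
         "--speculative-config", spec],
        check=True)
    time.sleep(180)  # crude: in practice, poll /v1/models until it responds

for k in (1, 2, 4, 8):
    relaunch(k)
    # ...then run 3 prompts × 3 runs with the timing harness from section 4.5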

3.1 Speedup vs K (3-run average, 26B-A4B)

baseline (Dynamic, no MTP): ja 16.47s, en 15.15s, py 15.17s (= 1.0x reference)

| K | ja_finance | en_finance | py_code |
|---|---|---|---|
| 1 | 11.89s, 1.39x | 11.57s, 1.31x | 10.74s, 1.41x |
| 2 | 12.19s, 1.35x | 11.33s, 1.34x | 9.88s, 1.54x |
| 4 | 11.90s, 1.38x | 11.38s, 1.33x | 8.66s, 1.75x |
| 8 | 15.77s, 1.04x | 14.12s, 1.07x | 9.54s, 1.59x |

3.2 Practical recommendation

  • Code generation / structured output (yaml/json/SQL): K=4 is best
  • Natural language (Japanese summaries / English blog text): K=1 is enough (K=2-4 barely changes results)
  • Do not use K=8 — pos 7 acceptance drops to 8.9% (essentially noise), and eight drafter forwards plus reject-time KV cache rollback costs eat the accept gain

vLLM defaults to K=4, but for natural-language workloads K=1 saves both memory and compute: simply set "num_speculative_tokens": 1 in the --speculative-config JSON shown in section 4. Detailed per-position acceptance and the theoretical comparison are in Appendix A.


4. Reproduction

4.1 26B-A4B + MTP+patch launch command

docker run -d --name vllm-mtp \
  --gpus all --ipc host --shm-size 64gb \
  -p 8000:8000 \
  -v $HOME/ai_news2fund/hf_cache:/root/.cache/huggingface \
  -v $HOME/mtp_assets/gemma4_mtp.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4_mtp.py:ro \
  vllm/vllm-openai:gemma4-0505-arm64-cu130 \
  RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic \
  --served-model-name gemma4-26b-mtp \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","model":"google/gemma-4-26B-A4B-it-assistant","num_speculative_tokens":4}'

4.2 31B Dense + MTP+patch launch command

docker run -d --name vllm-mtp \
  --gpus all --ipc host --shm-size 64gb \
  -p 8000:8000 \
  -v $HOME/ai_news2fund/hf_cache:/root/.cache/huggingface \
  -v $HOME/mtp_assets/gemma4_mtp.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4_mtp.py:ro \
  vllm/vllm-openai:gemma4-0505-arm64-cu130 \
  RedHatAI/gemma-4-31B-it-FP8-Dynamic \
  --served-model-name gemma4-31b-mtp \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":4}'

Neither --enforce-eager nor --enable-chunked-prefill is needed. The patch (bind mount) alone is enough.

4.3 Patch file

Download gemma4_mtp.py directly from PR #41745 head commit d8b3826 and save to $HOME/mtp_assets/gemma4_mtp.py.

mkdir -p $HOME/mtp_assets
curl -fsSL -o $HOME/mtp_assets/gemma4_mtp.py \
  https://raw.githubusercontent.com/vllm-project/vllm/d8b3826648da6b407f8c55457a2103be9aeb5d83/vllm/model_executor/models/gemma4_mtp.py

4.4 Memory footprint

  • 26B-A4B: target 26GB FP8 + drafter 832MB BF16 + KV cache (max-model-len 4096 ≈ 4GB) ≈ 31GB GPU RAM
  • 31B Dense: target 31GB FP8 + drafter 832MB BF16 + KV cache ≈ 36GB GPU RAM

GB10 with 121.6GB unified memory has plenty of headroom. DGX Spark / B200 / H100 80GB / RTX 6000 48GB should all run this fine.
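To size headroom on other hardware, a generic KV-cache estimator helps. Note the layer/head/dim values below are hypothetical placeholders, not Gemma 4's actual architecture, and vLLM reserves its KV pool up front from --gpu-memory-utilization rather than per sequence:

# Generic KV-cache sizing: 2 tensors (K and V) per layer, per token.
# NOTE: n_layers / n_kv_heads / head_dim are illustrative placeholders,
# NOT Gemma 4's real hyperparameters.
def kv_bytes(seq_len: int, n_layers: int = 48, n_kv_heads: int = 8,
             head_dim: int = 128, bytes_per_elt: int = 1) -> int:  # fp8 = 1 byte
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

print(kv_bytes(4096) / 2**30)  # ~0.38 GiB per sequence with these placeholders;
# the ~4GB figure above is the pool reserved for many concurrent sequences.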

4.5 Full prompts used in this validation

For reproducibility, the exact prompts used are reproduced verbatim below (defined in tools/mtp_traces/prompts.json). temperature=0.7, max_tokens=600 (the Python prompt was additionally re-run at max_tokens=1500 for the execution check in section 2.5).

ja_finance — Japanese-language financial analysis task

日銀のYCC撤廃が日本の銀行株 (三菱UFJ / 三井住友 / みずほ) に与える影響を、NIM・含み損・配当方針の3点で詳細に解説してください。各論点について、足元の数値や具体的な政策のメカニズムを盛り込み、投資家が実務で使える形で整理してください。

(English gloss: explain how the BoJ’s YCC removal affects the major Japanese bank stocks (MUFG / SMFG / Mizuho), covering NIM, unrealized losses, and dividend policy with current data and concrete policy mechanisms, organized for practical investor use.)

en_finance — English-language macro analysis task

Explain how the Fed’s quantitative tightening (QT) affects the US 10Y treasury yield through three channels: TGA drawdown, RRP utilization, and bank reserve scarcity. Include current data context (2025-2026), the transmission mechanism for each channel, and provide three actionable insights for fixed income traders managing duration risk.

py_code — Python code generation task (PEP 8 + type hints + error handling specified)

Write a Python function using yfinance that fetches monthly close prices for the Nikkei 225 (^N225) for the past 10 years, computes month-over-month returns, and returns a pandas DataFrame with columns [date, close, monthly_return]. Include error handling for missing data and an example usage block. Use type hints and follow PEP 8.

Selection criteria: (1) reflects three workload types representative of my actual production needs (Japanese summarization / English summarization / code generation); (2) covers different drafter difficulty levels (free word order in natural language vs deterministic structure in code); (3) yields output ranges of 600-2900 chars at the same max_tokens (CJK ≈ 1.5-2 chars per token, English ≈ 4 chars per token, code in between).
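For completeness, a minimal timing harness in the spirit of the one behind these tables (a sketch: it assumes prompts.json maps category names to prompt strings and that the server from 4.1 or 4.2 is listening on :8000; stdlib only):

import json, time, urllib.request

URL = "http://localhost:8000/v1/chat/completions"
PROMPTS = json.load(open("tools/mtp_traces/prompts.json"))  # assumed {name: prompt}

def run_once(model: str, prompt: str, max_tokens: int = 600):
    """POST one chat completion and return (wall seconds, output chars)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    t0 = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        text = json.load(resp)["choices"][0]["message"]["content"]
    return time.perf_counter() - t0, len(text)

for name, prompt in PROMPTS.items():
    runs = [run_once("gemma4-26b-mtp", prompt) for _ in range(3)]  # 3 runs, as in 2.1
    sec = sum(s for s, _ in runs) / len(runs)
    chars = sum(c for _, c in runs) / len(runs)
    print(f"{name}: {sec:.2f}s, {chars / sec:.1f} ch/s, {chars:.0f} chars")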


5. Why the patch is needed

Using vllm/vllm-openai:gemma4-0505-arm64-cu130 as-is, MTP fails on any quantized target — acceptance rate stays at 0% no matter what flags you try. Adding --enforce-eager, dropping --kv-cache-dtype fp8, lowering num_speculative_tokens to 1 — none of these change the result.

The cause is two bugs in the bundled gemma4_mtp.py, identified by @hospedales in PR #41745 with a patch attached.

| # | Location | Bug | Effect |
|---|---|---|---|
| 1 | intermediate_size lookup (line 297) | The drafter MLP requires text_config.intermediate_size = 8192, but the original code pulls 4096 from the top-level config, half the required size | Weight load fails or computation is corrupted |
| 2 | quant_config propagation (lines 186/193/299) | The target's quant_config (FP8-block / FP8-Dynamic / NVFP4) is propagated into the drafter's Linear layers, though the drafter is supposed to hold BF16 weights | vLLM applies packing to drafter parameters → shape mismatch at weight-load time, or silent corruption; always triggers when the target is quantized |

Mechanically, the Docker image (gemma4-0505-arm64-cu130) was built from the PR’s first commit and does not include the bug fixes that landed afterward. Bind-mounting the head commit’s gemma4_mtp.py is the shortest fix until the image is rebuilt.
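Schematically, the fixes look like this (my paraphrase of the class of change, not the actual PR #41745 diff; SomeParallelLinear is a stand-in for vLLM's real parallel linear layers):

def build_drafter_mlp(config, quant_config):
    """Sketch of how the drafter's MLP should be constructed."""
    # Bug 1 fix: the MLP width lives on the nested text config (8192);
    # the top-level config holds 4096, half the required size.
    intermediate_size = config.text_config.intermediate_size
    # Bug 2 fix: pass quant_config=None so the BF16 drafter does not inherit
    # the target's FP8/NVFP4 packing (which caused shape mismatches on load).
    return SomeParallelLinear(config.text_config.hidden_size,
                              intermediate_size,
                              quant_config=None)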

Lessons:

  1. Do not trust newly published vLLM images — when a PR’s first commit is what is bundled, later bug fixes that land in the same PR are not present
  2. Read the “complete command” of third-party working examples carefully — in ai-muninn’s DGX Spark 26B article, the bind-mount patch was tucked outside the main code block, easy to miss if you only scan the docker run line
  3. Until upstream image rebuilds, bind-mounting is the practical workaround

6. Production guidance

| Use case | Recommended config | Expected performance |
|---|---|---|
| Speed first (code generation, structured output) | 26B-A4B + MTP+patch, K=4 | Python finishes in 8.30s; 270+ chars/s on English |
| Need 31B quality and want speed | 31B Dense + MTP+patch, K=4 | 28% of 31B baseline time (= 3.52x faster); comfortably interactive |
| Natural-language summarization (e.g., Japanese news) | 26B / 31B + MTP+patch, K=1 | Same speedup as K=4 with less memory and compute |
| Balanced quality and speed | 26B-A4B baseline (no MTP) | 39.5 tok/s; sufficient for many use cases, simpler config |

My production setup (currently 31B-FP8-block at 6.46 tok/s) becomes 31B-Dynamic + MTP+patch at ≈28.5 tok/s, 4.4x faster at the same quality — worth seriously considering as a production swap (pending K=1 validation for Japanese fit).


Appendix A: K-sweep details

A.1 Acceptance metrics (cumulative across 9 traces × 4 K, 26B-A4B)

| K | drafts | draft tokens | accepted | overall rate | mean accepted | per-position rate (%) |
|---|---|---|---|---|---|---|
| 1 | 3027 | 3027 | 2367 | 78.2% | 0.78/1 | 78.2 |
| 2 | 2306 | 4612 | 3091 | 67.0% | 1.34/2 | 76.2 / 57.8 |
| 4 | 1814 | 7256 | 3591 | 49.5% | 1.98/4 | 72.2 / 53.6 / 41.2 / 30.9 |
| 8 | 1576 | 12608 | 3829 | 30.4% | 2.43/8 | 73.4 / 52.3 / 37.1 / 25.8 / 19.3 / 15.2 / 11.0 / 8.9 |

A.2 Mean accepted length vs effective speedup (26B, Python)

Theoretical speedup ≈ (mean_accepted + 1) / (1 + drafter_overhead) ≈ mean_accepted + 1.

| K | mean accept | Theoretical speedup | Measured (py) | Measured/theoretical |
|---|---|---|---|---|
| 1 | 0.78 | ~1.78x | 1.41x | 79% |
| 2 | 1.34 | ~2.34x | 1.54x | 66% |
| 4 | 1.98 | ~2.98x | 1.75x | 59% |
| 8 | 2.43 | ~3.43x | 1.59x | 46% |

Larger K means a larger gap between theory and measurement. Drafter forwards, KV cache rollback, and scheduling overhead all surface — that is exactly why K=8 turns counterproductive.

A.3 31B acceptance metrics (K=4, cumulative across 9 traces + 1 probe)

drafts          = 1890
draft_tokens    = 7560 (= 1890 × 4)
accepted        = 4117
overall accept  = 54.5%   (26B was 49.5%)
mean accepted   = 2.18 / 4   (26B was 1.98 / 4)

per-position acceptance:
  pos 0: 78.7%   (26B: 72.2%)
  pos 1: 58.9%   (26B: 53.6%)
  pos 2: 45.6%   (26B: 41.2%)
  pos 3: 34.6%   (26B: 30.9%)
