Gemma 4 MTP on GB10 — Inference 1.83x faster on 26B, 3.52x on 31B Dense
3.52x for code generation and ~2.5x for natural language on 31B Dense — an independent single-GPU measurement that comes in above Google's "up to 3x" headline
Headline Result
I measured Google’s Gemma 4 MTP (Multi-Token Prediction) drafter on NVIDIA GB10 (GX10, 121.6GB unified LPDDR5X, 273 GB/s). MTP delivers a real speedup across all three workloads tested. On 31B Dense, Python code generation reaches 3.52x — at or above Google’s headline “up to 3x faster” claim, on a single GPU.
| Category | 26B-A4B + MTP+patch | 31B Dense + MTP+patch |
|---|---|---|
| Japanese financial analysis | 1.20x (16.47s → 13.74s) | 2.54x (91.95s → 36.23s) |
| English macro analysis | 1.36x (15.15s → 11.14s) | 2.52x (91.41s → 36.22s) |
| Python code generation | 1.83x (15.17s → 8.30s) | 3.52x (91.41s → 26.00s) |
Plotting 26B (light blue) and 31B (green) side by side makes the gap obvious — Python on 31B in particular pulls far ahead.
Within my search, no independent benchmark of 31B Dense on a single GPU with MTP turned up in public blogs or GitHub issues, so this measurement appears to be the first of its kind. Note that the freshly published vLLM image does not work as-is: its bundled gemma4_mtp.py returns 0% acceptance until the head commit of PR #41745 is bind-mounted. Section 5 covers this.
1. Test Environment
| Item | Value |
|---|---|
| Hardware | NVIDIA GB10 (GX10 SoC, ARM aarch64, SM 12.1) |
| Memory | LPDDR5X 128GB unified (≈121.6GB usable; CPU/GPU shared, cache coherent) |
| Memory bandwidth | 273 GB/s |
| Target model | RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic (26B-A4B MoE) / RedHatAI/gemma-4-31B-it-FP8-Dynamic (31B Dense) |
| Drafter | google/gemma-4-26B-A4B-it-assistant (832MB) / google/gemma-4-31B-it-assistant (832MB) |
| vLLM image | vllm/vllm-openai:gemma4-0505-arm64-cu130 (pushed the same day Google released MTP) |
| Patch | PR #41745 head commit d8b3826, applied via bind mount |
| Settings | num_speculative_tokens=4 (= K), kv-cache-dtype fp8, max-model-len 4096, temperature=0.7 |
References to “production” mean my daily news-summarization vLLM cron (02:00 JST, 31B-FP8-block).
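All timings are wall-clock over the OpenAI-compatible endpoint the launch commands in section 4 expose. A minimal sketch of the measurement arithmetic — the helper names are mine, and the request body is only constructed here, not sent:

```python
import time

def chars_per_sec(text: str, elapsed_s: float) -> float:
    """Characters per second, the throughput metric reported in the tables below."""
    return len(text) / elapsed_s

def speedup(baseline_s: float, mtp_s: float) -> float:
    """Wall-clock speedup of MTP over the no-MTP baseline."""
    return baseline_s / mtp_s

# Request body matching the settings in section 1; "gemma4-26b-mtp"
# matches --served-model-name in the launch command.
payload = {
    "model": "gemma4-26b-mtp",
    "messages": [{"role": "user", "content": "..."}],  # one of the three prompts
    "temperature": 0.7,
    "max_tokens": 600,
}

# Example with the measured py_code numbers (15.17s -> 8.30s):
print(round(speedup(15.17, 8.30), 2))  # 1.83
```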
2. Detailed Results
2.1 Speedup by language (3 prompts × 3 runs)
Three prompts measured at num_speculative_tokens=4. Full prompts are in section 4.5 (reproduction material).
26B-A4B + MTP+patch
| Category | baseline (Dynamic, no MTP) | + MTP+patch (K=4) | Speedup |
|---|---|---|---|
| ja_finance | 16.47s, 60.2 ch/s, 992 chars | 13.74s, 73.6 ch/s, 1011 chars | 1.20x |
| en_finance | 15.15s, 170.1 ch/s, 2576 chars | 11.14s, 232.7 ch/s, 2592 chars | 1.36x |
| py_code | 15.17s, 148.9 ch/s, 2260 chars | 8.30s, 279.8 ch/s, 2321 chars | 1.83x |
31B Dense + MTP+patch
| Category | baseline (Dynamic, no MTP) | + MTP+patch (K=4) | Speedup |
|---|---|---|---|
| ja_finance | 91.95s, 10.5 ch/s, 967 chars | 36.23s, 26.3 ch/s, 952 chars | 2.54x |
| en_finance | 91.41s, 27.9 ch/s, 2551 chars | 36.22s, 71.0 ch/s, 2572 chars | 2.52x |
| py_code | 91.41s, 24.9 ch/s, 2277 chars | 26.00s, 86.7 ch/s, 2254 chars | 3.52x |
The “code > English > Japanese” ordering reflects how predictable each output is for the drafter:
- Python: deterministic continuations (`def`, `import`, `pd.DataFrame(`, etc.) → high acceptance
- English: natural language, but predictable word order and structure → middle
- Japanese: free word order and fewer characters per token (CJK ≈ 1.5-2 chars per token vs ≈4 for English) → the drafter struggles, and each accepted token advances fewer characters
2.2 Per-position acceptance — 31B drafter beats 26B at every position
Comparing per-position acceptance at K=4 shows the 31B drafter ahead at every position.
The gap is +6.5pt at pos 0, +5.3pt at pos 1, +4.4pt at pos 2, +3.7pt at pos 3 — narrowing at deeper positions but averaging about +5pt. The 31B-specific drafter shares the same 32-layer / hidden-dim structure as the 31B target, so the predict-then-verify hidden-state alignment is naturally tighter.
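The per-position rates and the mean accepted length are directly related: with vLLM's prefix-based accounting (accepted-at-position-i divided by total drafts), the mean accepted length is simply the sum of the per-position rates. A quick check against the numbers from Appendix A:

```python
# Per-position acceptance at K=4, from Appendix A (A.1 and A.3)
POS_RATES = {
    "26B": [0.722, 0.536, 0.412, 0.309],
    "31B": [0.787, 0.589, 0.456, 0.346],
}

def mean_accepted(rates: list[float]) -> float:
    # With prefix-based accounting, the expected accepted length
    # per draft is the sum of the per-position acceptance rates.
    return sum(rates)

# 31B-minus-26B gap at each position, in percentage points
gaps_pt = [round((b - a) * 100, 1) for a, b in zip(POS_RATES["26B"], POS_RATES["31B"])]
print(gaps_pt)                                     # [6.5, 5.3, 4.4, 3.7]
print(round(mean_accepted(POS_RATES["26B"]), 2))   # 1.98
print(round(mean_accepted(POS_RATES["31B"]), 2))   # 2.18
```

The sums land exactly on the reported mean accepted lengths (1.98 and 2.18), which is a useful consistency check when reading vLLM's spec-decode logs.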
2.3 26B vs 31B head-to-head (Python yfinance task)
Lining up four configurations on absolute speed (chars/s) makes the magnitude of 26B+MTP’s lead obvious.
| Configuration | sec | chars/s | vs 26B baseline | Use case |
|---|---|---|---|---|
| 26B baseline | 15.17s | 148.9 | 1.00× | (reference) |
| 26B + MTP+patch | 8.30s | 279.8 | 1.83× ← fastest | Speed first, MoE |
| 31B baseline | 91.41s | 24.9 | 0.17× | (current production, quality first) |
| 31B + MTP+patch | 26.00s | 86.7 | 0.58× | 31B quality + MTP, practical option |
Absolute speed crown goes to 26B + MTP+patch; the largest speedup ratio (and thus the strongest case for MTP) goes to 31B + MTP+patch. The right choice depends on the workload — section 6 covers this.
2.4 Why 31B’s speedup ratio is larger
The gap from 26B 1.83x to 31B 3.52x is roughly 2x. Three contributing factors:
1. Memory-bandwidth bound is more dominant
Comparing baseline absolute speed shows that a larger target is more deeply memory-bandwidth bound:
- 26B baseline: 15.17s for 600 tokens = 39.5 tok/s
- 31B baseline: 91.41s for 600 tokens = 6.56 tok/s (= 1/6 of 26B)
31B Dense is 5-7x heavier on weight loads than 26B-A4B (which only activates 4B with MoE). GB10’s 273 GB/s is more saturated, and that is precisely the regime where MTP’s amortization (sharing one memory read across K tokens) helps most.
2. Drafter accept rate is higher
26B mean 1.98 → 31B mean 2.18 — effective K (the number of tokens that actually advance) goes up by 10%.
3. Drafter overhead is relatively smaller against the larger target
The drafter is the same gemma-4-*-it-assistant (832MB) in both cases. The bigger the target, the smaller the relative cost of the drafter forward.
Theoretical max speedup = (mean_accepted + 1) / (1 + drafter_overhead):
| Target | mean_accepted + 1 | 1 + drafter_overhead | Theoretical | Measured (Python) | Measured/theoretical |
|---|---|---|---|---|---|
| 26B | ~2.98 | ~1.7 | 1.75× | 1.83× | 105% |
| 31B | ~3.18 | ~0.9 | 3.5× | 3.52× | 100% |
The gap between theory and measurement closes as target grows.
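Factor 1 can be sanity-checked with a crude roofline. Under the simplifying assumption that decode is purely weight-bandwidth bound — ignoring KV cache and activation traffic, so this is an estimate, not a measurement — the per-token ceiling is memory bandwidth divided by bytes of weights read per token:

```python
BANDWIDTH_GBPS = 273.0  # GB10 LPDDR5X, from section 1

def decode_ceiling_tok_s(active_weight_gb: float) -> float:
    """Upper bound on decode tok/s if every token re-reads the active weights."""
    return BANDWIDTH_GBPS / active_weight_gb

# 31B Dense FP8: ~31 GB of weights read per token
print(round(decode_ceiling_tok_s(31.0), 1))  # ~8.8 vs 6.56 tok/s measured
# 26B-A4B MoE FP8: only ~4 GB of expert weights active per token
print(round(decode_ceiling_tok_s(4.0), 1))   # ~68.2 vs 39.5 tok/s measured
```

31B runs at roughly 75% of this crude ceiling while 26B-A4B runs at roughly 58%: the dense model really is closer to the bandwidth wall, which is exactly where amortizing one weight read across K speculative tokens pays off most.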
2.5 Python execution = fast and correct
Speed numbers alone do not guarantee that MTP produced correct code. To check, I ran the generated Python code in a subprocess against yfinance + pandas + matplotlib.
Trace data (max_tokens=1500, 26B, full version)
| Variant | Total | Chars | Function name |
|---|---|---|---|
| baseline (run2) | 26.81s | 2816 | fetch_nikkei_monthly_returns() |
| MTP+patch (run3) | 15.58s | 2931 | get_nikkei_monthly_returns() |
Both models produced code that:
- Fetched ten years of monthly close prices for `^N225` via the yfinance API
- Computed monthly returns with `pct_change()`
- Returned a `[date, close, monthly_return]` DataFrame
- Successfully retrieved 119 months of data
- Saved a matplotlib bar chart of monthly returns (red = down, green = up)
End-to-end (generate + execute):
- baseline: 26.81 + 2.35 = 29.16s
- MTP+patch: 15.58 + 2.28 = 17.86s
- end-to-end speedup = 1.63x (saving 11.30s)
Fast (1.72x on generation alone, 1.63x end-to-end) and correct (both produced working code) at the same time. MTP is lossless on quality, not just speed. The video at the top of this article shows this end-to-end: generation race → execution → both models producing equivalent Nikkei 225 monthly-return charts.
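The execution check can be sketched as a small harness — a minimal version assuming the model wraps its answer in a ```python fence; the helper names are mine, not from the production script:

```python
import re
import subprocess
import sys
import tempfile

def extract_code(completion: str) -> str:
    """Pull the first ```python fenced block out of a model completion."""
    m = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
    if not m:
        raise ValueError("no python code block found in completion")
    return m.group(1)

def run_generated(code: str, timeout_s: float = 60.0) -> subprocess.CompletedProcess:
    """Execute generated code in a throwaway subprocess with a timeout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    return subprocess.run([sys.executable, path],
                          capture_output=True, text=True, timeout=timeout_s)

# Smoke test with a trivial completion instead of a real model response:
completion = "Here you go:\n```python\nprint(2 + 2)\n```"
result = run_generated(extract_code(completion))
print(result.stdout.strip())  # 4
```

The real run does the same thing with the yfinance code and checks the saved chart and row count afterwards.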
2.6 Multi-language videos (generation only)
Japanese (BoJ YCC → bank stocks)
baseline 16.47s, 60.2 ch/s → MTP+patch 13.74s, 73.6 ch/s = 1.20x speedup. Japanese is harder for the drafter to land at deeper positions, so the speedup is modest but consistent.
English (Fed QT → US 10Y yield)
baseline 15.15s, 170.1 ch/s → MTP+patch 11.14s, 232.7 ch/s = 1.36x speedup. English yields higher acceptance than Japanese and pulls more speedup.
3. K-sweep (num_speculative_tokens)
I ran four container restarts at K=1, 2, 4, 8, each over 3 prompts × 3 runs (max_tokens=600) on 26B-A4B.
3.1 Speedup vs K (3-run average, 26B-A4B)
baseline (Dynamic, no MTP): ja 16.47s, en 15.15s, py 15.17s (= 1.0x reference)
| K | ja_finance | en_finance | py_code |
|---|---|---|---|
| 1 | 11.89s, 1.39x | 11.57s, 1.31x | 10.74s, 1.41x |
| 2 | 12.19s, 1.35x | 11.33s, 1.34x | 9.88s, 1.54x |
| 4 | 11.90s, 1.38x | 11.38s, 1.33x | 8.66s, 1.75x |
| 8 | 15.77s, 1.04x | 14.12s, 1.07x | 9.54s, 1.59x |
3.2 Practical recommendation
- Code generation / structured output (yaml/json/SQL): K=4 is best
- Natural language (Japanese summaries / English blog text): K=1 is enough (K=2-4 barely changes results)
- Do not use K=8 — pos 7 acceptance drops to 8.9% (essentially noise), and eight drafter forwards plus reject-time KV cache rollback costs eat the accept gain
vLLM defaults to K=4, but for natural-language workloads K=1 saves both memory and compute. Detailed per-position acceptance and theoretical comparison are in Appendix A.
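Switching K is just a change to the --speculative-config JSON. A tiny helper that encodes the recommendation above (the workload→K mapping is mine; the drafter name is from section 1):

```python
import json

# K per workload class, per the recommendation above
K_BY_WORKLOAD = {"code": 4, "structured": 4, "natural_language": 1}

def speculative_config(drafter: str, workload: str) -> str:
    """Build the JSON string passed to vLLM's --speculative-config flag."""
    return json.dumps({
        "method": "mtp",
        "model": drafter,
        "num_speculative_tokens": K_BY_WORKLOAD[workload],
    })

cfg = speculative_config("google/gemma-4-26B-A4B-it-assistant", "natural_language")
print(cfg)
```

Swap the resulting string into the launch commands in section 4 to run a K=1 natural-language configuration.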
4. Reproduction
4.1 26B-A4B + MTP+patch launch command
```shell
docker run -d --name vllm-mtp \
  --gpus all --ipc host --shm-size 64gb \
  -p 8000:8000 \
  -v $HOME/ai_news2fund/hf_cache:/root/.cache/huggingface \
  -v $HOME/mtp_assets/gemma4_mtp.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4_mtp.py:ro \
  vllm/vllm-openai:gemma4-0505-arm64-cu130 \
  RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic \
  --served-model-name gemma4-26b-mtp \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","model":"google/gemma-4-26B-A4B-it-assistant","num_speculative_tokens":4}'
```
4.2 31B Dense + MTP+patch launch command
```shell
docker run -d --name vllm-mtp \
  --gpus all --ipc host --shm-size 64gb \
  -p 8000:8000 \
  -v $HOME/ai_news2fund/hf_cache:/root/.cache/huggingface \
  -v $HOME/mtp_assets/gemma4_mtp.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4_mtp.py:ro \
  vllm/vllm-openai:gemma4-0505-arm64-cu130 \
  RedHatAI/gemma-4-31B-it-FP8-Dynamic \
  --served-model-name gemma4-31b-mtp \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":4}'
```
Neither --enforce-eager nor --enable-chunked-prefill is needed. The patch (bind mount) alone is enough.
4.3 Patch file
Download gemma4_mtp.py directly from PR #41745 head commit d8b3826 and save to $HOME/mtp_assets/gemma4_mtp.py.
```shell
mkdir -p $HOME/mtp_assets
curl -fsSL -o $HOME/mtp_assets/gemma4_mtp.py \
  https://raw.githubusercontent.com/vllm-project/vllm/d8b3826648da6b407f8c55457a2103be9aeb5d83/vllm/model_executor/models/gemma4_mtp.py
```
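After restarting the container, confirm the patch actually took effect by checking the server's Prometheus endpoint (/metrics) for non-zero draft acceptance instead of eyeballing latency. Spec-decode metric names vary across vLLM versions, so the sample text below is illustrative, not verbatim — grep your own /metrics output for `spec_decode` first:

```python
import re

def spec_decode_counters(metrics_text: str) -> dict[str, float]:
    """Collect spec_decode-related counters from Prometheus metrics text."""
    out: dict[str, float] = {}
    for line in metrics_text.splitlines():
        if line.startswith("#") or "spec_decode" not in line:
            continue
        name, _, value = line.rpartition(" ")
        out[name] = float(value)
    return out

# Illustrative sample (metric names assumed, values from the 26B K=4 run):
sample = """\
# HELP vllm:spec_decode_num_draft_tokens_total ...
vllm:spec_decode_num_draft_tokens_total 7256.0
vllm:spec_decode_num_accepted_tokens_total 3591.0
"""
counters = spec_decode_counters(sample)
accepted = counters["vllm:spec_decode_num_accepted_tokens_total"]
assert accepted > 0, "0 accepted tokens: the unpatched gemma4_mtp.py symptom"
print(round(accepted / counters["vllm:spec_decode_num_draft_tokens_total"], 3))  # 0.495
```

If the accepted counter stays at zero after real traffic, the bind mount did not land (section 5).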
4.4 Memory footprint
- 26B-A4B: target 26GB FP8 + drafter 832MB BF16 + KV cache (max-model-len 4096 ≈ 4GB) ≈ 31GB GPU RAM
- 31B Dense: target 31GB FP8 + drafter 832MB BF16 + KV cache ≈ 36GB GPU RAM
GB10 with 121.6GB unified memory has plenty of headroom. DGX Spark / B200 / H100 80GB / RTX 6000 48GB should all run this fine.
4.5 Full prompts used in this validation
For reproducibility, the exact prompts used are reproduced verbatim below (defined in tools/mtp_traces/prompts.json). temperature=0.7, max_tokens=600 (Python additionally re-run at 1500 in a later subsection).
ja_finance — Japanese-language financial analysis task
日銀のYCC撤廃が日本の銀行株 (三菱UFJ / 三井住友 / みずほ) に与える影響を、NIM・含み損・配当方針の3点で詳細に解説してください。各論点について、足元の数値や具体的な政策のメカニズムを盛り込み、投資家が実務で使える形で整理してください。
(English gloss: explain how the BoJ’s YCC removal affects the major Japanese bank stocks (MUFG / SMFG / Mizuho), covering NIM, unrealized losses, and dividend policy with current data and concrete policy mechanisms, organized for practical investor use.)
en_finance — English-language macro analysis task
Explain how the Fed’s quantitative tightening (QT) affects the US 10Y treasury yield through three channels: TGA drawdown, RRP utilization, and bank reserve scarcity. Include current data context (2025-2026), the transmission mechanism for each channel, and provide three actionable insights for fixed income traders managing duration risk.
py_code — Python code generation task (PEP 8 + type hints + error handling specified)
Write a Python function using yfinance that fetches monthly close prices for the Nikkei 225 (`^N225`) for the past 10 years, computes month-over-month returns, and returns a pandas DataFrame with columns `[date, close, monthly_return]`. Include error handling for missing data and an example usage block. Use type hints and follow PEP 8.
Selection criteria: (1) reflects three workload types representative of my actual production needs (Japanese summarization / English summarization / code generation); (2) covers different drafter difficulty levels (free word order in natural language vs deterministic structure in code); (3) yields output ranges of 600-2900 chars at the same max_tokens (CJK ≈ 1.5-2 chars per token, English ≈ 4 chars per token, code in between).
5. Why the patch is needed
Using vllm/vllm-openai:gemma4-0505-arm64-cu130 as-is, MTP fails on any quantized target — acceptance rate stays at 0% no matter what flags you try. Adding --enforce-eager, dropping --kv-cache-dtype fp8, lowering num_speculative_tokens to 1 — none of these change the result.
The cause is two bugs in the bundled gemma4_mtp.py, identified by @hospedales in PR #41745 with a patch attached.
| # | Location | Bug | Effect |
|---|---|---|---|
| 1 | intermediate_size lookup (line 297) | The drafter MLP requires text_config.intermediate_size = 8192, but the original code pulls 4096 from the top-level config — half the size | Weight load fails or computation is corrupted |
| 2 | quant_config propagation (lines 186/193/299) | The target’s quant_config (FP8-block / FP8-Dynamic / NVFP4) is propagated into the drafter’s Linear layers. The drafter is supposed to hold BF16 weights | vLLM applies packing to drafter parameters → shape mismatch at weight-load time, or silent corruption. Always triggers when the target is quantized |
Mechanically, the Docker image (gemma4-0505-arm64-cu130) was built from the PR’s first commit and does not include the bug fixes that landed afterward. Bind-mounting the head commit’s gemma4_mtp.py is the shortest fix until the image is rebuilt.
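The two fixes reduce to small, local changes. A schematic sketch of their effect — not the actual PR diff, and the helper names are mine; only the field names follow the bug table above:

```python
from types import SimpleNamespace

def drafter_mlp_intermediate_size(config) -> int:
    """Bug 1 fix: read intermediate_size from text_config when present,
    instead of the top-level config (which holds the wrong 4096)."""
    text_cfg = getattr(config, "text_config", None)
    if text_cfg is not None:
        return text_cfg.intermediate_size
    return config.intermediate_size

def drafter_quant_config(target_quant_config):
    """Bug 2 fix: never propagate the target's quant_config into the
    drafter's Linear layers; the drafter holds plain BF16 weights."""
    return None  # drafter stays unquantized regardless of the target

# Dummy config mimicking the nested shape of the HF config:
cfg = SimpleNamespace(intermediate_size=4096,
                      text_config=SimpleNamespace(intermediate_size=8192))
print(drafter_mlp_intermediate_size(cfg))  # 8192, not 4096
print(drafter_quant_config("fp8-dynamic"))  # None
```

Bug 2 explains why the failure only triggers with quantized targets: with a BF16 target there is no quant_config to mis-propagate.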
Lessons:
- Do not trust newly published vLLM images — when a PR’s first commit is what is bundled, later bug fixes that land in the same PR are not present
- Read the "complete command" of third-party working examples carefully — the bind-mount patch was tucked outside the visible code block in ai-muninn's DGX Spark 26B article, easy to miss if you only scan the visible `docker run` line
- Until the upstream image is rebuilt, bind-mounting is the practical workaround
6. Production guidance
| Use case | Recommended config | Expected performance |
|---|---|---|
| Speed first (code generation, structured output) | 26B-A4B + MTP+patch, K=4 | Python finishes in 8.30s; 270+ chars/s on English |
| Need 31B quality and want speed | 31B Dense + MTP+patch, K=4 | 28% of 31B baseline time (= 3.52x faster); comfortably interactive |
| Natural-language summarization (e.g., Japanese news) | 26B / 31B + MTP+patch, K=1 | Same speedup as K=4 with less memory and compute |
| Balanced quality and speed | 26B-A4B baseline (no MTP) | 39.5 tok/s; sufficient for many use cases, simpler config |
My production setup (currently 31B-FP8-block at 6.46 tok/s) becomes 31B-Dynamic + MTP+patch at ≈28.5 tok/s, 4.4x faster at the same quality — worth seriously considering as a production swap (pending K=1 validation for Japanese fit).
Appendix A: K-sweep details
A.1 Acceptance metrics (cumulative across 9 traces × 4 K, 26B-A4B)
| K | drafts | draft tok | accepted | overall rate | mean accepted | per-position rate (%) |
|---|---|---|---|---|---|---|
| 1 | 3027 | 3027 | 2367 | 78.2% | 0.78/1 | 78.2 |
| 2 | 2306 | 4612 | 3091 | 67.0% | 1.34/2 | 76.2 / 57.8 |
| 4 | 1814 | 7256 | 3591 | 49.5% | 1.98/4 | 72.2 / 53.6 / 41.2 / 30.9 |
| 8 | 1576 | 12608 | 3829 | 30.4% | 2.43/8 | 73.4 / 52.3 / 37.1 / 25.8 / 19.3 / 15.2 / 11.0 / 8.9 |
A.2 Mean accepted length vs effective speedup (26B, Python)
Theoretical speedup ≈ (mean_accepted + 1) / (1 + drafter_overhead) ≈ mean_accepted + 1.
| K | mean accept | Theoretical speedup | Measured (py) | Measured/theoretical |
|---|---|---|---|---|
| 1 | 0.78 | ~1.78x | 1.41x | 79% |
| 2 | 1.34 | ~2.34x | 1.54x | 66% |
| 4 | 1.98 | ~2.98x | 1.75x | 59% |
| 8 | 2.43 | ~3.43x | 1.59x | 46% |
Larger K means a larger gap between theory and measurement. Drafter forwards, KV cache rollback, and scheduling overhead all surface — that is exactly why K=8 turns counterproductive.
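The A.2 table can be regenerated from the acceptance numbers alone — a sketch using the simplified formula above (drafter overhead dropped), with the mean-accepted and measured values copied from A.1 and A.2:

```python
MEAN_ACCEPT = {1: 0.78, 2: 1.34, 4: 1.98, 8: 2.43}  # from A.1
MEASURED_PY = {1: 1.41, 2: 1.54, 4: 1.75, 8: 1.59}  # Python, from section 3.1

for k in (1, 2, 4, 8):
    theoretical = MEAN_ACCEPT[k] + 1.0  # simplified: overhead term dropped
    efficiency = MEASURED_PY[k] / theoretical
    print(f"K={k}: theoretical ~{theoretical:.2f}x, "
          f"measured {MEASURED_PY[k]:.2f}x, efficiency {efficiency:.0%}")
```

The efficiency column falls monotonically (79% → 66% → 59% → 46%), which is the quantitative version of "larger K widens the gap between theory and measurement."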
A.3 31B acceptance metrics (K=4, cumulative across 9 traces + 1 probe)
- drafts = 1890
- draft tokens = 7560 (= 1890 × 4)
- accepted = 4117
- overall accept = 54.5% (26B: 49.5%)
- mean accepted = 2.18 / 4 (26B: 1.98 / 4)

Per-position acceptance:
- pos 0: 78.7% (26B: 72.2%)
- pos 1: 58.9% (26B: 53.6%)
- pos 2: 45.6% (26B: 41.2%)
- pos 3: 34.6% (26B: 30.9%)
Sources
- Google blog: Accelerating Gemma 4 with MTP drafters
- vLLM Recipes: Gemma 4 Usage Guide
- NVIDIA Developer Forum: Gemma 4 MTP on DGX Spark — TP=1 working config
- ai-muninn: DGX Spark Gemma 4 hits 108 tok/s — 26B success case
- vLLM Issue #36331: 0% acceptance with Qwen NVFP4
- vLLM PR #40898: SWA drafter low acceptance fix
- vLLM PR #41745: Gemma 4 MTP support — patch source
- HuggingFace: google/gemma-4-31B-it-assistant
- HuggingFace: RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic
- HuggingFace: RedHatAI/gemma-4-31B-it-FP8-Dynamic