feature/gemma4-support

Gemma4 rocking on the 3090

1 M tokens at 19.8 tok/s (23.7 with MTP). 256 K at 63.82 tok/s (peak 67.95).
Includes: receipts, code, copy-paste configs.

6.64 J/tok best efficiency · 132 tok/s MoE peak (4K) · 63.82 tok/s MoE 256K ship (peak 67.95) · 1M ctx TQ3 on 24 GB · 20 distinct fixes · 101 commits, 4 days

Built in 4 days (May 8–11, 2026; 101 commits across 4 unique days) by humans and AI in collaboration — direction over autonomy, simple tools over magic: ggml, git, a 24 GB GPU.

Overview

Headline

Efficiency

Best Joules per token

6.64 J/tok — MoE + creative + dm=2

82.1 tok/s · AL 1.63 · 18.9 GB VRAM. Best speculation-on efficiency in the 24-cell sweep. Baseline (dm=1, no spec) is even more efficient at 5.70 J/tok but slower at 58 tok/s.

  • MoE+creative+dm=1 · 58 tok/s · 5.70 J/tok
  • MoE+creative+dm=2 · 82 tok/s · 6.64 J/tok ★
  • MoE+creative+dm=16 · 62 tok/s · 7.15 J/tok
MoE 26B-A4B

Speed king on code

132.1 tok/s — code, dm=16

AL 5.22 · 11.82 J/tok · 18.9 GB · ~130 W average. Long-context (post-rebase): 63.82 tok/s at 256K (Q8/Q8 + pflash + dm=4, peak observed 67.95); 1M context fits with TQ3/TQ3 + pflash at 19.8 tok/s, 22.3 GB (1.7 GB headroom). MTP γ=2 pushes 1M to 23.7 tok/s at 23.9 GB (sub-GB headroom).

  • Code AL plateau: 5.22
  • Creative AL plateau: 2.49 ← drafter is code-distribution-trained
  • 256K ship config: 63.82 tok/s · 21.75 GB (peak observed 67.95; 2× prior estimate)
  • 1M TQ3 + pflash: 19.8–20.1 tok/s on code prompts across three sweeps (prose 19.12); +MTP γ=2: 22.0–23.7 tok/s at 23.9 GB — cliff unlocked by daef232a6
Dense 31B

Generalist; better OOD drafter

82.5 tok/s — creative, dm=32

AL 5.12 · 10.75 J/tok · 22.1 GB · avg ~170 W (dense_creative_dm32, scientific sweep). dm=16 essentially identical (81.7 tok/s).

  • Code AL plateau: 4.20
  • Creative AL plateau: 5.12 ← drafter generalises OOD
  • Long-ctx viable ≥128K (~2.5 tok/s, real text); 64K cliff resolved (2.54 tok/s post-rebase, was 1.78 tok/s anomaly)

On a 24 GB 3090, the right Gemma 4 config is a regime function of (model, context). MoE 26B-A4B: Q8/Q8 KV + pflash + DFlash dm=4 dominates from 4K through 512K (60–132 tok/s depending on ctx). TurboQuant (TQ3) becomes mandatory only at 1M+ where Q8 KV pages — there DFlash holds at 26 ± 5 tok/s. Dense 31B: same Q8 stack peaks at 81.7 tok/s (4K, creative). Above 32K the model is VRAM-bound; MTP γ=2 + TQ3 + pflash is the only viable drafter (~10 tok/s through 128K). 256K+ Dense is infeasible on 24 GB. See §Recipes for the full decision matrix and §Discoveries for the 20 ggml/llama.cpp fixes this required.
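The decision matrix reduces to a few branches. A minimal sketch in Python, transcribing the recipe tables in §Recipes; the function name and exact thresholds are ours, not an API in the repo:

# Illustrative regime picker distilled from the §Recipes tables on this page.
def pick_config(model: str, ctx: int) -> str:
    """Recommended 24 GB RTX 3090 config for (model, context length)."""
    if model == "moe-26b-a4b":
        if ctx <= 524_288:                      # Q8 KV fits through 512K
            dm = 16 if ctx <= 4_096 else 8 if ctx <= 65_536 else 4
            return f"Q8/Q8 KV + pflash + DFlash dm={dm}"
        return "TQ3/TQ3 KV + pflash (Q8 KV pages at 1M)"
    if model == "dense-31b":
        if ctx <= 4_096:
            return "Q8/Q8 KV + pflash + DFlash dm=16"
        if ctx <= 32_768:
            return "Q8/Q8 KV + pflash, no drafter (DFlash draft KV breaches 24 GB)"
        if ctx <= 131_072:
            return "TQ3/TQ3 KV + pflash + MTP gamma=2"
        return "infeasible on 24 GB"
    raise ValueError(f"unknown model: {model}")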

Context

How we compare to public RTX 3090 + Gemma 4 numbers

Gemma 4 was released 2026-04-02 — five weeks before these runs. Public benchmarks for this model on RTX 3090 hardware are sparse, and none publish long-context numbers. The table below compares what we measured against the best community-reported figures we found.

Metric | Lucebox | Best public 3090 | Source / notes
MoE 26B-A4B peak decode @ 4K | 132.1 tok/s | 80–110 typical; 119 best | Q4_K_M, no spec decode in community reports
MoE 26B-A4B decode @ 256K | 63.82 tok/s (peak observed 67.95) | No published 3090 number at 256K | Uncontested; Q8/Q8 + DFlash + pflash + dm=4 (closing-sweep replicated mean)
MoE 26B-A4B max context (24 GB) | 1M @ 19.8 tok/s, no drafter (23.7 tok/s with MTP γ=2) | No published 3090 1M number | Uncontested; TQ3/TQ3 + pflash; 22.3 GB / 23.9 GB
Dense 31B peak decode @ 4K | 81.7 tok/s (creative dm=16, AL=5.12, 24-cell sweep; up to ~98 tok/s on single-prompt HumanEval checks, less rigorous) | 40–50 (FA hangs in mainstream) | Ollama #15350 FA hang; craftrigs.com reports 44 tok/s
Speculative AL, code / creative | 5.22 / 5.12 | No published 3090 + Gemma 4 drafter results | Uncontested; DFlash drafter; MoE code / Dense creative respectively
Dense 64K decode (TQ3, no drafter) | 2.54 tok/s | No published comparable | Post-rebase; 64K cliff resolved (was 1.78 tok/s anomaly)
Coverage caveat. Gemma 4 was released 2026-04-02. As of 2026-05-10, llama.cpp does not yet support Google's official MTP drafter architecture (Gemma4AssistantForCausalLM, discussion #22735). Lucebox runs both DFlash and MTP drafters today via a forked stack. Community coverage of Gemma 4 on consumer hardware is thin — we expect public numbers to improve as mainstream tools catch up. See also gemma4.wiki for an aggregated public scorecard.

Post-Rebase Validation

TQ3_0 Frontier — 1M context on 24 GB

Post-rebase (daef232a6 + 4b0c158), TQ3/TQ3 MoE generates coherently from 4K to 1M tokens with no cliff. Numbers below are measured on RTX 3090 24 GB, MoE 26B-A4B, TQ3_0/TQ3_0 KV, no drafter, n_predict=64.

MoE 26B-A4B — TQ3/TQ3, no drafter, no pflash (pre-pflash baseline; current ship adds pflash → 19.8–20.1 tok/s on code, 19.12 on prose, through 1M)

Context | Prefill tok/s | Decode tok/s | VRAM peak (GB) | KV cache saved (%)
4K | n/a | 6.95 | 17.43 | 41.7
16K | 1 803.5 | 6.59 | 17.77 | 72.9
32K | 1 801.0 | 6.56 | 17.83 | 78.1
64K | 1 424.7 | 5.77 | 18.04 | 80.7
96K | 1 406.2 | 5.64 | 18.18 | 81.6
128K | 1 407.3 | 5.63 | 18.31 | 82.0
256K | 1 392.9 | 5.65 | 19.14 | 82.7
384K | 1 452.6 | 5.89 | 19.61 | 82.9
512K | 1 441.2 | 5.77 | 20.13 | 83.0
768K | 1 441.5 | 5.82 | 21.23 | 83.1
1M | 1 444.6 | 5.82 | 22.26 | 83.2

Dense 31B — TQ3/TQ3, no drafter

Context | Prefill tok/s | Decode tok/s | VRAM peak (GB)
4K | 190.5 | 3.38 | 19.49
64K | 416.7 | 2.54 | 21.07
128K | 414.2 | 2.52 | 22.12
256K (saturated) | 47.4 | 2.45 | 24.00
Headline With pflash on (current ship), MoE TQ3 holds 19.8–20.1 tok/s on code prompts (prose 19.12) from 64K through 1M, no cliff. The table above is the pre-pflash tq3-frontier/ baseline (5.6–6.6 tok/s) — TQ3 prefill dominates without chunked+pflash. 22.3 GB peak VRAM at 1M leaves 1.7 GB headroom on a 3090. TQ3/TQ3 pays a context-dependent decode-tok/s tax vs Q8/Q8 (Q8 saturates ≥512K) but unlocks contexts Q8 cannot reach.

Practical First

Recipes & Regime Map

All measurements are from the current binary (post commits 4bcb972 + submodule 6715acf13, 2026-05-11) on a single 24 GB RTX 3090, WSL2, greedy decode (--temp 0 --seed 0 --ignore-eos --n-predict 64) on the 50 K-token code prompt unless noted. Earlier inflated framing has been walked back — see the cited evidence and the full dossier on GitHub.

MoE 26B-A4B — Q8 + pflash + DFlash wins through 512K

Context | Recommended config | Decode tok/s | J/tok | VRAM (GB)
4K | Q8/Q8 + pflash + DFlash dm=16 (code) | 132.1 | 11.82 | 18.90
64K | Q8/Q8 + pflash + DFlash dm=8 | 34.66 | 12.02 | 19.84
128K | Q8/Q8 + pflash + DFlash dm=4 | 66.97 | 6.26 | 20.45
256K | Q8/Q8 + pflash + DFlash dm=4 | 63.82 | 8.13 | 21.75
512K | Q8/Q8 + pflash + DFlash dm=4 | 62.40 | 6.82 | 24.00 (sat)
1M | TQ3/TQ3 + pflash + DFlash dm=4 (Q8 pages) | 26 ± 5 (triple-trial σ) | awaiting | 24.00 (sat)

Q8 KV fits comfortably to 512K on MoE 26B-A4B (KV is small because only 4B params are active). DFlash dm=4 amortizes verify across multiple tokens at 60+ tok/s through that band. At 1M, Q8 KV doesn't fit; TurboQuant becomes mandatory. The triple-trial σ at 1M reflects run-to-run variance under WSL2/CUDA allocator state (see discovery #15).
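Discovery #15 implies a measurement protocol: trials at saturated-VRAM contexts should each run in a fresh process rather than a warm shell. A sketch of such a cold-start harness, assuming a hypothetical bench entry point and log format (both ours; adjust to the real harness):

import re, statistics, subprocess, time

# HYPOTHETICAL bench invocation — substitute the real harness command.
CMD = ["python3", "dflash/scripts/bench.py", "--ctx", "1048576"]

def parse_decode_tps(log: str) -> float:
    # assumes the log prints something like "decode: 19.8 tok/s"; adjust as needed
    m = re.search(r"decode[^\d]*([\d.]+)\s*tok/s", log)
    if m is None:
        raise ValueError("decode tok/s not found in log")
    return float(m.group(1))

def cold_trials(n: int = 3, settle_s: float = 30.0) -> list[float]:
    vals = []
    for _ in range(n):
        # fresh process per trial: no allocator-state carryover between cells
        out = subprocess.run(CMD, capture_output=True, text=True, check=True)
        vals.append(parse_decode_tps(out.stdout))
        time.sleep(settle_s)             # let driver/allocator state quiesce
    return vals

v = cold_trials()
print(f"mean {statistics.mean(v):.2f} tok/s, sigma {statistics.stdev(v):.2f}")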

Dense 31B — VRAM-bound above 32K; MTP γ=2 takes over

Context | Recommended config | Decode tok/s | J/tok | VRAM (GB)
4K | Q8/Q8 + pflash + DFlash dm=16 (creative) | 81.7 | 8.82 | 22.08
32K | Q8/Q8 + pflash (no drafter — DFlash hurts here) | 18.33 | 35.04 | 21.36
64K | TQ3/TQ3 + pflash + MTP γ=2 | 10.07 | 33.79 | ~22
128K | TQ3/TQ3 + pflash + MTP γ=2 | 10.18 | 34.50 | ~23
≥256K | Infeasible on 24 GB — Dense 31B model + KV + drafter exceeds VRAM. Untested. | n/a | n/a | n/a

Dense 31B is much larger than MoE 26B-A4B's active footprint. By 32K, Q8 KV already pushes VRAM to 21 GB; adding DFlash's draft KV cache pushes it over 24 GB and decode regresses. MTP γ=2 cross-attends to the target KV (no extra cache), so it fits where DFlash doesn't. Accept rate 0.78 at 128K (code) — the MTP head transfers cleanly to long context.

Per-component contribution at MoE 64K code (OVAT)

Change from naive baseline | Decode tok/s | Δ vs Q8 naive | Verdict
Naive: Q8/Q8 + no pflash + no drafter | 23.25 | baseline |
Q8 → TQ3 KV (alone) | 20.09 | −14% | TQ3 is a decode cost when ctx fits in Q8
+ pflash (on Q8) | 23.65 | +2% | pflash is decode-neutral at 64K (huge prefill win)
+ DFlash dm=8 on TQ3+pflash | 34.66 | +49% | DFlash is the headline speedup AND best J/tok (12.02)
+ MTP γ=2 on TQ3+pflash | 23.10 | −0.6% | γ=2 ties the naive baseline — TQ3 penalty cancels MTP gain

One-variable-at-a-time ablation at MoE 26B-A4B × ctx=64K × code prompt. Earlier framing "γ=2 at 64K = +61 % over no-MTP" was vs the TQ3-handicapped baseline. Vs the proper naive Q8 baseline, MTP γ=2 essentially ties. Logs at mtp-gamma/ovat-moe-64k-code/.

Bottom line Q8 KV + pflash + DFlash is the speed king on MoE up to 512K (and on Dense at 4K). TurboQuant earns its place only when Q8 doesn't fit — MoE 1M, and as a fallback for Dense at 64K+ where adding DFlash would breach 24 GB. MTP γ>1 is implemented (Phases 1–3.5 today) and is the right drafter specifically for Dense above 32K where DFlash is infeasible — elsewhere DFlash dominates.

Practical Guide

What this means if you run a 3090

A $700–1 000 used RTX 3090 now runs Gemma 4 on real long-context workloads — coding, document analysis, agent loops — without offloading weights or KV data. The exact config to use lives in §Recipes; this section is about what those numbers actually buy you. Bottom-line decode rates: MoE 26B-A4B 60–67 tok/s at 128K–512K, ~26 tok/s at 1M. Dense 31B ~10 tok/s through 128K. Everything in 24 GB VRAM.

Real workloads now possible on 24 GB

  • Whole-codebase Q&A — load a 200 K-token monorepo into context once, ask interactively at ~64 tok/s (Q8/Q8 + pflash + DFlash dm=4). Prefill runs at ~5 000 tok/s so the 256 K initial load takes roughly 50 seconds, then questions are answered without reloading.
  • Long-document summarization — entire 500-page PDFs (~400 K tokens) in a single shot at ~62 tok/s decode (Q8/Q8 + DFlash dm=4 at 512K, VRAM saturated near 24 GB). No chunking, no retrieval hacks.
  • Agent loops with deep history — 1 M-token rolling context for tool-use agents that never need to re-summarise after every step. 19.8 tok/s decode (no drafter, TQ3/TQ3 + pflash, 22.3 GB / 1.7 GB headroom) or 23.7 tok/s with MTP γ=2 at 23.9 GB (sub-GB headroom, edge of the wall).
  • Local privacy-first AI — none of this requires offload to RAM or disk; all weights and KV state remain in VRAM. Your data never leaves the machine.
Bottom line If you have a 3090 and you're running an LLM locally, you no longer have to choose between context and speed below 64 K — and you can finally have ≥256 K context above it. The TQ3/TQ3 KV path closes the gap to datacenter inference for long-context use cases on a single $1 000 card.

At a glance

Four charts

Drawn directly from scientific/results.csv (24 cells, RTX 3090 24 GB, Q8/Q8 KV, 4 K ctx, n_predict=256, temp=0 seed=0, pflash on). Power integrated trapezoidally from nvidia-smi at ~5 Hz; decode energy apportioned by phase fraction.
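The energy accounting is simple enough to reproduce. A minimal stand-in with synthetic samples, assuming a (timestamp, watts) trace polled from nvidia-smi at ~5 Hz; real traces and phase timings come from the harness, and the variable names here are ours:

import numpy as np

t = np.linspace(0.0, 30.0, 151)          # ~5 Hz timestamps over a 30 s run (stand-in)
p = np.full_like(t, 130.0)               # board power samples in W (stand-in: flat 130 W)

# trapezoidal integral of P dt — total energy over the active window, in joules
total_energy_j = float(np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(t)))

# phase split by time fraction, as described above
wall_s, decode_s = 30.0, 27.26           # wall time and decode-phase duration (example)
decode_energy_j = total_energy_j * (decode_s / wall_s)

decoded_tokens = 256                     # n_predict in the sweep
print(decode_energy_j / decoded_tokens)  # decode J/token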

Decode J/token — efficiency

Sorted ascending. Lower is better. Bars solid = code, hatched = creative.

[Bar chart: decode J/token per config, from moe·creative·dm1 at 5.70 ★ up to dense·code·dm1 at 27.72. Legend: MoE code, MoE creative, Dense code, Dense creative. Per-config values appear in the 24-cell table in §Scientific Benchmark.]

MoE+creative configurations dominate the efficient end. dense+code is the most expensive across all dm budgets.

Decode tok/s — speed

Sorted descending. Higher is better. dm=32 ≈ dm=16 — the budget plateau.

[Bar chart: decode tok/s per config, from moe·code·dm32 at 132.5 and moe·code·dm16 at 132.1 ★ down to dense·code·dm1 at 30.4. Per-config values appear in the 24-cell table in §Scientific Benchmark.]

Top three rows are MoE+code at dm=8/16/32 — speeds within 1 % of each other. dm=16 is the practical winner; dm=32 wastes draft compute.

Acceptance length × draft_max — OOD gap

Lines plateau by dm=16. MoE drafter collapses on creative; Dense drafter does not.

[Line chart: acceptance length (AL) vs draft_max budget on a log₂ scale, plateauing by dm=16. dm=32 plateau values: MoE code 5.22, MoE creative 2.49, Dense code 4.20, Dense creative 5.12.]

Dense lines are tightly clustered. MoE creative diverges sharply downward — the drafter doesn't generalise out-of-distribution. Numbers in legend are dm=32 plateau values.
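A toy throughput model shows why the lines going flat makes dm=32 pointless. Assume (our simplification, not the harness's model) that the block-diffusion drafter emits dm tokens in one parallel pass of roughly fixed cost, and that each draft+verify round commits AL tokens on average; the time constants below are picked for illustration, not measured:

def decode_tps(al: float, t_verify_s: float, t_draft_s: float) -> float:
    # one speculation round: one parallel draft pass + one target verify pass,
    # committing AL accepted tokens on average
    return al / (t_verify_s + t_draft_s)

# Chart 3's plateau: MoE code AL is 5.22 at both dm=16 and dm=32, so the
# rounds are identical and dm=32 only drafts more tokens to throw away.
for dm, al in [(16, 5.22), (32, 5.22)]:
    print(f"dm={dm}: {decode_tps(al, t_verify_s=0.035, t_draft_s=0.004):.1f} tok/s")

Once AL stops growing with dm, the round time is unchanged and the extra drafted tokens are simply rejected: that is the dm=16 ≈ dm=32 plateau in the speed chart, and the worse J/tok at dm=32.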

Pareto — speed × efficiency

Each point a config. Up + right = win. Pareto frontier in coral.

[Scatter: decode tok/s vs decode J/token, one point per config (MoE/Dense × high-spec dm≥4 / low-spec dm≤2 × code/creative), with the Pareto frontier drawn in coral.]

The Pareto frontier is entirely MoE. Dense points sit strictly above (more J/tok) for any given speed. The frontier inflects sharply at moe_creative_dm2 — the most efficient real-speculation operating point.

Practical Recipes

Run it yourself — top configurations for RTX 3090 (24 GB)

Five copy-paste-ready server launches. Each starts the OpenAI-compatible endpoint on port 8080. Replace /path/to/… with your actual GGUF paths.

Pre-flight requirements

  • RTX 3090 (24 GB) or equivalent VRAM; CUDA 12.x toolkit
  • Build the test binary: cmake --build dflash/build -j 4 --target test_gemma4_dflash
  • For 24 GiB cards, build with cmake -DGGML_CUDA_NO_VMM=ON .. for cleaner long-context prefill (auto-set by server.py when it detects ≤24 GiB)
  • Pull GGUFs from Hugging Face: gemma-4-26B-A4B-it, gemma-4-31B-it, plus matching draft GGUFs. To quantize a draft from safetensors: python3 dflash/scripts/quantize_draft_q8.py --arch gemma4 ...

Config 1 — MoE Speed King MoE 26B-A4B

Use case: code completion, agent prompts under 4K tokens  ·  Expected throughput: ~132 tok/s decode, AL ≈ 5.22

python3 dflash/scripts/server.py \
  --target /path/to/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf \
  --draft  /path/to/gemma4-26b-a4b-dflash/draft-q8_0.gguf \
  --max-ctx 4096 \
  --ctk q8_0 --ctv q8_0 \
  --budget 64 \
  --pflash \
  --port 8080

--budget 64 maps to effective dm=16 via server's budget→dm mapping.
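The real mapping lives in dflash/scripts/server.py. Both published pairs on this page (--budget 64 → dm=16 here; --budget 16 → dm=4 in Config 2) are consistent with a divide-by-four rule, so a plausible reconstruction, explicitly a guess, is:

def budget_to_dm(budget: int) -> int:
    # HYPOTHETICAL reconstruction: both published pairs (64 -> 16, 16 -> 4)
    # fit budget // 4. Check dflash/scripts/server.py for the real rule.
    return max(1, budget // 4)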

Config 2 — MoE Long-Context Ship MoE 26B-A4B

Use case: 256K-context coding agents, document analysis  ·  Expected throughput: 63.82 tok/s decode (peak observed 67.95), 21.75 GB VRAM

python3 dflash/scripts/server.py \
  --target /path/to/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf \
  --draft  /path/to/gemma4-26b-a4b-dflash/draft-q8_0.gguf \
  --max-ctx 262144 \
  --ctk q8_0 --ctv q8_0 \
  --budget 16 \
  --pflash \
  --lazy-draft \
  --port 8080

Config 3 — MoE 1M Frontier MoE 26B-A4B

Use case: massive document analysis, multi-file repository inspection  ·  Expected throughput: 19.8 tok/s decode (23.7 with MTP γ=2; 1.7 GB headroom no-drafter, 120 MB headroom with MTP)

python3 dflash/scripts/server.py \
  --target /path/to/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf \
  --max-ctx 1048576 \
  --ctk tq3_0 --ctv tq3_0 \
  --pflash \
  --port 8080

No drafter — DFlash speculation hurts at 1M with VRAM-saturated TQ3 KV; this config is pure target-only decode. MTP γ=2, which adds no draft KV cache, is the exception: 23.7 tok/s at 23.9 GB (sub-GB headroom).

Config 4 — Dense 31B Quality Dense 31B

Use case: creative writing, OOD prompts where MoE drafter under-performs  ·  Expected throughput: 81.7 tok/s @ 4K, AL=5.12 (creative dm=16, from the 24-cell scientific sweep). HumanEval-class prompts have shown peaks up to ~98 tok/s in single-prompt checks; we list the rigorously-measured number as the headline.

python3 dflash/scripts/server.py \
  --target /path/to/gemma-4-31B-it-Q4_K_M.gguf \
  --draft  /path/to/gemma4-31b-dflash/draft-q8_0.gguf \
  --max-ctx 4096 \
  --ctk q8_0 --ctv q8_0 \
  --budget 64 \
  --pflash \
  --port 8080

Config 5 — Dense 31B Long-Context Min-VRAM Dense 31B

Use case: 64K dense context on 24 GB card with TQ3 compression  ·  Expected throughput: 2.54 tok/s decode (post-rebase fix; was 1.78 tok/s anomaly)

python3 dflash/scripts/server.py \
  --target /path/to/gemma-4-31B-it-Q4_K_M.gguf \
  --max-ctx 65536 \
  --ctk tq3_0 --ctv tq3_0 \
  --pflash \
  --port 8080

Common notes — OpenAI-compatible client

  • Drop-in OpenAI client: OPENAI_API_BASE=http://localhost:8080/v1 and OPENAI_API_KEY=sk-any (Python example below the curl test)
  • Test endpoint:
    curl http://localhost:8080/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{"model":"luce-dflash","messages":[{"role":"user","content":"hi"}],"stream":true}'

Benchmarks

What Worked

§1A — Dense 31B

Gemma-4-31B-it (dense). Drafter: Gemma-4-DFlash 5-layer block-diffusion.

§1.1 Matrix-v2 — 4K, target-only baseline

Matrix-v2 (4K context, target-only baseline)

Run | Draft | K / V | TTFT (ms) | Decode tok/s | Accept
M1_none_tq3 | none | TQ3_0 / TQ3_0 | 40.47 | 26.69 | n/a
M2_none_q8 | none | Q8_0 / Q8_0 | 32.22 | 33.86 | n/a
M3_mtp_tq3 | mtp | TQ3_0 / TQ3_0 | 48.81 | 25.28 | 39%
M4_mtp_q8 | mtp | Q8_0 / Q8_0 | n/a | crashed @ step 208 | 68% pre-crash

M4 crash: fattn.cu:652. Latent kernel-side issue with MTP + Q8 V-cache.

Pre-rebase numbers (before daef232a6 + 4b0c158). See §TQ3 Frontier for current TQ3 baselines.

§1.2 Matrix-v3 — 4K, K=Q8_0 fixed, V varied

Matrix-v3 (4K context, K=Q8_0 fixed, V-cache varied)

Run | V-cache | TTFT (ms) | Decode tok/s
N1_none_q8_tq3 | TQ3_0 | 38.67 | 30.21
N2_none_q8_q8 | Q8_0 | 31.21 | 34.79 (+15%)
N3_mtp_q8_tq3 | TQ3_0 | n/a | crashed @ step 216 (fattn.cu:652) — 5% accept

Pre-rebase numbers (before daef232a6 + 4b0c158). See §TQ3 Frontier for current TQ3 baselines.

§1.3 Matrix-64K-v2 — 50K-token prompt, post-fix rerun

Pre-rebase numbers (before daef232a6 + 4b0c158). See §TQ3 Frontier for current TQ3 baselines.

Matrix-64K-v2 (50K-token prompt, post-fix rerun)

Run | Draft | K / V | Prefill (s) | TTFT (ms) | Decode tok/s | Avg accept
V1_none | none | TQ3_0/TQ3_0 | 585.2 | 145.88 | 6.90 | n/a
V2_mtp (γ=1, pre-fix) | mtp | TQ3_0/TQ3_0 | 585.8 | 164.94 | 6.33 (regression) | 0.02 (5/256) — superseded by γ>1, see §Recipes
V3_dflash_dm8 | dflash | TQ3_0/TQ3_0 | 585.8 | 257.06 | 9.22 (+34%) | 2.29
§1.6 Dense-31B + DFlash + Q8/Q8 + dm=8 Ladder — pre-rebase 2026-05-10 (1.78 tok/s 64K anomaly resolved post-daef232a6; see §Recipes for current Dense recipes)

Dense-31B + DFlash + Q8/Q8 + dm=8 Ladder

ctx | Prefill tok/s | Decode tok/s | AL / 8 | VRAM
64K | 319 | 1.78 ← anomaly | 1.94 (24%) | 24 / 24 GB
128K | 256 | 24.89 | 7.11 (89%) | 24 / 24 GB
256K | 236 | 23.87 | 7.11 (89%) | 24 / 24 GB
AL caveat — the 128K and 256K AL=7.11 (89%) is partly inflated by a degenerate token-715 repetition loop in the output (token 715 dominates from step 5 onward). The drafter trivially predicts the loop. Real-quality acceptance under non-degenerate generation is unknown until a fresh prompt is run.
pFlash is active. Logs show [chunked+pflash, chunk_size=1024] in all three dense runs. The 15× prefill gap vs MoE (4912 vs 319 tok/s at 64K) is the dense-vs-MoE compute ratio (Dense 31B over 60 layers vs MoE ~4B active params over 30 layers ≈ 15× compute), not a pflash failure. pflash skips attention; FFN compute is unavoidable.

Surprise: Dense-31B is fine at 128K and 256K (~24 tok/s, healthy AL 7.11) but collapses specifically at 64K with AL crashing to 1.94. All three cells report 24.00/24.00 GB VRAM — likely a VRAM-allocator edge case at exactly 64K. Headline: dense 31B is viable at long context (≥128K) on a 24 GB GPU. MoE remains ~50% faster but dense is now an option.

§1B — MoE 26B-A4B

Gemma-4-26B-A4B (MoE, ~4B active). Drafter: Gemma-4-DFlash 5-layer block-diffusion. (No purpose-trained MoE MTP drafter exists yet — see open question P1 and discovery #17.)

§1.4 Scaling Sweep — MoE DFlash Q8/Q8, 16K → 262K (pre-pflash baseline 2026-05-10; current ship configs in §Recipes give 60+ tok/s across this range)

Scaling Sweep — MoE + DFlash (Q8_0/Q8_0)

Context | Prefill tok/s | TTFT (ms) | Decode tok/s | Avg accept
16K | 3 711 | 41.7 | 72.32 | 1.66
32K | 3 833 | 38.5 | 70.54 | 1.66
64K (dense) | 1 402 | 158 | 7.96 | n/a
65K (dflash) | 4 878 | 65.2 | 28.92 | 1.45
131K (dflash) | 4 888 | 62.3 | 29.90 | 1.45
262K (dflash) | 4 894 | 62.1 | 29.40 | 1.45

DFlash holds throughput near-constant from 65K to 262K — prefill tok/s rises slightly as batch efficiency improves. The "64K (dense)" row is the plain non-speculative path, which collapses without DFlash.

§1.5 dm-sweep — draft_max tuning, 50K code prompt, MoE

dm-sweep — draft_max tuning (50K code prompt, MoE)

ctx | dm | TTFT (ms) | Decode tok/s | Spec accept | Avg
64K | 1 | 69.5 | 23.01 | 256/256 | 1.00 (no wins)
64K | 2 | 62.8 | 33.81 | 169/256 | 1.51
64K | 4 | 64.0 | 36.57 | 143/256 | 1.79 ★
64K | 8 | 81.7 | 29.45 | 138/256 | 1.86
256K | 4 | 61.1 | 36.63 | 143/256 | 1.79 ✓
256K | 8 | 78.2 | 29.36 | 138/256 | 1.86 (degrade)

dm=4 is the global optimum for long code prompts. dm=16 wins only on short prompts (HumanEval): TTFT 88.8 ms, decode 97.81 tok/s, accept 6.56.

§1C — Scientific 24-cell sweep (energy-instrumented)

2 archs × 2 distributions × 6 dm budgets. RTX 3090 24 GB, Q8/Q8 KV, 4K ctx, n_predict=256, temp=0 seed=0 --ignore-eos, pflash on. Power sampled ~5 Hz via nvidia-smi; total energy is trapezoidal ∫ P dt; phase-split by reported phase fraction. See scientific/results.csv.

  • Peak decode: MoE+code+dm=16 → 132.1 tok/s, AL 5.22, 11.82 J/tok
  • Peak efficiency (real-spec): MoE+creative+dm=2 → 6.64 J/tok at 82.1 tok/s, AL 1.63
  • dm=32 wastes draft compute: identical to dm=16 in all four distribution × arch cells (e.g. moe·code 132.1 vs 132.5; dense·creative 81.7 vs 82.5).
  • OOD generalisation gap: Dense AL stable across distributions (4.20 code → 5.12 creative); MoE drafter collapses 5.22 → 2.49 — code-distribution-trained.
config | arch | dist | dm | decode tok/s | AL | decode J/tok | avg W | VRAM GB | wall s
moe_code_dm16 | moe | code | 16 | 132.11 ★ | 5.22 | 11.82 | 129.6 | 18.90 | 27.26
moe_code_dm32 | moe | code | 32 | 132.45 | 5.22 | 13.93 | 129.3 | 18.90 | 27.70
moe_code_dm8 | moe | code | 8 | 131.76 | 3.88 | 14.39 | 137.8 | 18.89 | 27.70
moe_code_dm4 | moe | code | 4 | 118.30 | 2.61 | 12.19 | 131.9 | 18.90 | 27.17
moe_code_dm2 | moe | code | 2 | 95.52 | 1.90 | 11.04 | 132.3 | 18.92 | 23.54
moe_creative_dm4 | moe | creative | 4 | 94.62 | 2.12 | 8.33 | 147.6 | 18.88 | 16.02
dense_creative_dm32 | dense | creative | 32 | 82.54 | 5.12 | 10.75 | 169.6 | 22.08 | 18.75
moe_creative_dm2 | moe | creative | 2 | 82.12 | 1.63 | 6.64 ★ | 205.4 | 18.90 | 10.17
dense_creative_dm16 | dense | creative | 16 | 81.68 | 5.12 | 8.82 | 176.9 | 22.08 | 14.47
moe_creative_dm8 | moe | creative | 8 | 68.87 | 2.03 | 7.85 | 196.0 | 18.88 | 11.41
dense_code_dm16 | dense | code | 16 | 68.00 | 4.20 | 17.33 | 147.6 | 22.11 | 33.79
dense_code_dm32 | dense | code | 32 | 67.60 | 4.20 | 17.87 | 146.5 | 22.11 | 34.69
moe_creative_dm16 | moe | creative | 16 | 61.91 | 2.49 | 7.15 | 175.7 | 18.89 | 11.71
moe_creative_dm32 | moe | creative | 32 | 61.35 | 2.49 | 9.55 | 154.4 | 18.89 | 17.34
dense_creative_dm4 | dense | creative | 4 | 58.45 | 3.05 | 12.43 | 182.9 | 22.08 | 18.55
moe_creative_dm1 | moe | creative | 1 | 58.04 | 1.00 | 5.70 | 209.5 | 18.87 | 7.70
moe_code_dm1 | moe | code | 1 | 56.67 | 1.00 | 14.46 | 139.7 | 18.89 | 29.18
dense_code_dm4 | dense | code | 4 | 55.73 | 2.91 | 22.63 | 159.6 | 22.10 | 33.64
dense_creative_dm8 | dense | creative | 8 | 52.34 | 4.41 | 12.25 | 230.2 | 22.07 | 16.63
dense_code_dm8 | dense | code | 8 | 47.93 | 4.06 | 20.28 | 163.9 | 22.10 | 34.52
dense_creative_dm2 | dense | creative | 2 | 47.49 | 1.87 | 18.15 | 171.6 | 22.10 | 29.30
dense_code_dm2 | dense | code | 2 | 46.49 | 1.82 | 20.08 | 163.6 | 22.09 | 34.19
dense_creative_dm1 | dense | creative | 1 | 30.75 | 1.00 | 13.84 | 317.5 | 22.07 | 12.04
dense_code_dm1 | dense | code | 1 | 30.42 | 1.00 | 27.72 | 180.2 | 22.10 | 36.32

Sorted by decode tok/s descending. ★ = peak speed (dm=16 — recommended over dm=32 because dm=32 wastes draft compute) and peak speculation-on efficiency. Baseline (dm=1) is more efficient still but slower.

Failures & Regressions

What Did Not Work

Hard Failures

  • Repeated fattn.cu:652 crash with MTP + Q8 V-cache: matrix-v2/M4_mtp_q8.log:60, matrix-v3/N3_mtp_q8_tq3.log:61, matrix-64k/MTP_humaneval.log:48 — 96% accept pre-crash, kernel-side latent issue.
  • T3_dflash.log rc=143 in matrix-64k — fixed by 5b6ba1b ("SWA mask coordinate frame").
  • dm-sweep 4K dm{4,8} — test misconfiguration: the prompt (50,003 tokens) exceeded ctx_size (4096).

Regressions (historical — most now resolved)

  • MTP @ 64K (V2_mtp): 6.33 vs 6.90 baseline (−8.2%), accept 0.02. Resolved: post-rebase γ=1 at 64K shows 0.69 accept; γ=2 chain shows 0.73 accept and +61% over the TQ3+pflash no-MTP baseline (roughly a tie vs the naive Q8 baseline; see the OVAT table in §Recipes). The 0.02 figure was pre-Bug-2 fix.
  • MTP @ 4K (M3_mtp_tq3): 25.28 vs 26.69 even at 39% accept. Resolved: γ=1 is structurally a "correctness gate" with zero amortisation. γ>1 + approach B (multi-row h_prev) gives γ=4 at 4K = 25.6 tok/s vs no-MTP 18.4 (+39%).
  • Sanity smoke: 0/8 accepted, garbled output. (early-stage stochastic-sampling bug; pre-greedy gate)
  • Plain decode @ 64K: 7.96 tok/s (MoE, §1.4); Dense @ 64K: 1.78 tok/s. (resolved per §1.6 64K cliff fix)

Quality Issues

  • V3_dflash_dm8 leaked <mask>, <unused94> tokens early (matrix-64k-v2/SUMMARY.md:55).
  • T2_mtp emitted token id 236772 repeatedly (vocab=262144). (pre-Bug-2 / pre-γ>1 chain; not reproducible on current binary)
  • MTP semantic mode collapse: tokens oscillated between id 109 / 49. Resolved: γ=1 single-step accept of 0.66 produces coherent output; γ=2 chain at 64K accepts both drafts most of the time.

Abandoned Approaches

  • Uniform TQ3_0 across all layers → narrow asymmetric (Q8 on captured full-attn, TQ3 on SWA), commit ce4da35.
  • Monotonic SWA full alloc → non-monotonic ring + correct mask (d68e7c4 / 19def9c / 5b6ba1b).
  • TQ3_0 preserved through ring-wrap → still F32 concat on wrap (workaround).
  • Standalone ggml_turbo_wht calls → fused into FA (c15f93a).

Key Intellectual Property

What We Discovered — 20 Findings

Each entry below is a non-obvious finding with a concrete fix location. These findings do not exist in upstream llama.cpp.

1. SWA ring geometry: The mask must map the absolute query position to a ring slot, (qpos − abs_win_start) % ring_size; without this, chunks 2+ silently read stale KV. (Toy sketch after this list.) Where: gemma4_target_graph.cpp:205-222; commit 5b6ba1b.
2. TQ3_0 KV alignment: TQ3_0 + CUDA FA needs ne[1] % 256 == 0. FATTN_KQ_STRIDE=256 is hardcoded in CUDA — no public ggml API exposes it. Where: gemma4_target_graph.cpp:99-101, 350-355; commits c56879c, 41e4848.
3. Non-monotonic ring restoration: Disabling the ring optimisation cost 50%+ VRAM on SWA layers; re-enabled with correct mask geometry. Where: commits 19def9c, d68e7c4.
4. Narrow asymmetric KV: The pFlash block-sparse path does not support TQ3_0 → forced Q8 on captured full-attn layers; TQ3 stays on SWA. Where: internal.h:593-597; commit ce4da35.
5. TQ3_0 type-tag preservation through FA: Casting K/V to F32/F16 before FA breaks the kernel's TQ3 → FWHT routing. Pass tensors native; the CHUNKED kernel handles FWHT internally. Where: gemma4_mtp_graph.cpp:401-414; commit c56879c.
6. GQA block-broadcast for MTP: MTP's n_head_norm ≠ target head_dim_kv; an explicit reshape is required before FA. Where: gemma4_mtp_graph.cpp:120-122, 225-336; commit 30b2b50.
7. MTP h_prev capture: The last-committed token's post-block hidden state from the last full-attn layer. KQ scale = 1.0 (not 1/√d). Assistant's own rope_freqs.weight. Where: gemma4_mtp_graph.cpp:266-285; commit 138de4d.
8. FA mask mandatory for head_dim ≥ 512: The CUDA MMA dispatcher requires gqa_opt_applies, K->ne[1] % 256 == 0, AND mask != nullptr; provide an all-zero mask when logically all positions are admitted. Where: gemma4_mtp_graph.cpp:468-481; commit f1f811e.
9. n_tokens=1 decode + SWA: Single-token decode also needs an allocated and filled SWA mask sized to ring_size. Where: commit 7b62c07.
10. Shared FA mask across layers: A single mask buffer is reused across all need-mask layers (256-align or head_dim ≥ 512). Where: internal.h:847-852.
11. DFlash decode correctness: BOS/EOS handling + per-layer SWA mask on the draft (4 SWA + 1 full); prefix-direct KV semantics. Where: commits 1386690, 9588c97.
12. TQ3 K dequant intercept: The MMA path silently strips the FWHT rotation when reading the TQ3_0 K-cache; forced chunked-path dispatch for TQ3 K avoids the intercept. Where: submodule daef232a6 (force-chunked dispatcher; drops the broken MMA intercept) + outer-repo 4b0c158 (graph-level FWHT rotation via ggml_cont wrap).
13. Drafter distribution coverage: The Dense DFlash drafter generalises out-of-distribution; the MoE drafter is code-distribution-bound. AL plateau dense·code 4.20 vs dense·creative 5.12 (uniform); MoE·code 5.22 vs MoE·creative 2.49 (collapse). Discovered empirically via the 24-cell scientific sweep. Where: scientific/results.csv; gemma4-journey.md §11.
14. TQ3 long-context unlocked: Post-daef232a6, MoE TQ3 sustains 5.6–6.6 tok/s flat from 4K to 1M with sub-linear VRAM growth (~0.34 GiB per 64K ctx); 1M context on 24 GB confirmed. Update: with pflash on (current ship config) this band shifts up to 19.8–20.1 tok/s on code prompts (prose 19.12) across 64K–1M — see Headline. Where: submodule daef232a6; logs: tq3-frontier/.
15. DFlash 1M run-to-run variance: The same CLI / seed / prompt yields tok/s ∈ [20, 30] on MoE 1M cold, and can degrade to 4.86 after ~19 prior cells in the same shell. WSL2/CUDA allocator state matters; sweep-harness numbers at saturated-VRAM contexts need a cold-start protocol. Where: mtp-gamma/triple-falsify/trial_{1,2,3}.log.
16. γ>1 MTP chain (first public implementation): Approach B (multi-row mtp_h_prev_batch) eliminates per-chain re-capture overhead; mirrors the Google HF reference (post-final-RMSNorm capture, Q-only cross-attn, weight-tied lm_head). γ=2 produces accept 0.55–0.78 at long ctx on Dense and 0.57 on MoE. Where: commits d8ebd12 (A), 4bcb972 (B); gemma4_target_graph.cpp:1106-1123.
17. MoE MTP assistant exists: AtomicChat/gemma-4-26B-A4B-it-assistant-GGUF on Hugging Face loads cleanly against the MoE 30-layer target — the loader auto-remaps donor layer indices written for the Dense 60-layer target. Accept rate 0.81 at 4K MoE. Where: models/gemma4-mtp-26b-a4b/.
18. MoE Q8 ceiling at 512K, not earlier: Q8 + pflash decode is flat at 24–25 tok/s from 64K through 512K with 0.86 GB headroom at 512K; it pages only at 1M. The earlier framing "TQ3 wins at long context" was wrong below 1M. Where: mtp-gamma/q8-ceiling/; mtp-gamma/closing/moe_*_q8_pf_*.log.
19. Dense + DFlash above 32K is anti-economical: The DFlash drafter's KV cache pushes Dense 31B over 24 GB at 64K; decode drops to 5–7 tok/s and prefill collapses to 130 tok/s. MTP γ=2, which cross-attends to the target KV with no cache of its own, is the right drafter for Dense ≥ 32K. Where: mtp-gamma/closing/dense_*.log; mtp-gamma/paired-matrix/dense_mtp_*.log.
20. Q8 + pflash + DFlash dm=4 is the MoE long-ctx winner: At MoE 128K–512K it delivers 60–67 tok/s decode at 6–8 J/tok, 2× faster than the TQ3 + pflash + DFlash stack at the same contexts (32–33 tok/s). The site's previous framing oversold TQ3 below 1M. Where: mtp-gamma/closing/moe_*_q8_pf_dfl4.log.
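Finding #1 is worth a toy illustration. The mapping is the formula quoted in the entry; everything else (names, window values) is ours:

def ring_slot(qpos: int, abs_win_start: int, ring_size: int) -> int:
    # absolute query position -> slot in the SWA ring buffer (finding #1)
    return (qpos - abs_win_start) % ring_size

# Window of 8 slots whose absolute start has advanced to position 21.
# A naive qpos % ring_size lookup points at different (stale) slots.
for qpos in range(21, 29):
    print(qpos, "windowed:", ring_slot(qpos, 21, 8), "naive:", qpos % 8)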

Why none of this is upstream — TQ3_0 is bleeding-edge (Google TurboQuant ~2024); MTP assistants and DFlash drafters are proprietary architectures; pFlash block-sparse paths are research-grade. Upstream llama.cpp ships none of these.

Scientific Benchmark

Super-Bench — scientific/results.csv

Wall-clock-sourced, energy-instrumented benchmark sweep. Generated when the super-bench harness completes; live-reloaded from scientific/results.csv.

Schema (expected CSV columns; a consumer sketch follows the list)
  • config — run identifier (e.g. moe_dflash_q8q8_dm4_64k)
  • arch — dense | moe
  • draft — none | dflash | mtp
  • kv_k / kv_v — tq3_0 | q8_0
  • ctx — context length tokens
  • dm — draft_max
  • wall_s — wall time T_end − T0 (seconds, date +%s.%N)
  • prefill_ms, prefill_tps
  • decode_ms, decode_tps, al, accept_rate, first_tok_ms
  • vram_peak_gb
  • avg_power_w — averaged over full active window
  • total_energy_j — trapezoidal ∫ P dt
  • prefill_energy_j, decode_energy_j — phase-split by time-fraction
  • decode_j_per_tok — efficiency metric (decode_energy_j / decoded_tokens)
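A small consumer of this schema: it reads scientific/results.csv using the columns above and prints the Pareto-efficient configs over speed vs decode J/token. The frontier logic is ours, not part of the harness:

import csv

with open("scientific/results.csv") as f:
    rows = [{"config": r["config"],
             "tps": float(r["decode_tps"]),
             "jtok": float(r["decode_j_per_tok"])}
            for r in csv.DictReader(f)]

def dominated(a, b):
    # b dominates a: at least as fast AND at least as efficient, strictly better in one
    return (b["tps"] >= a["tps"] and b["jtok"] <= a["jtok"]
            and (b["tps"] > a["tps"] or b["jtok"] < a["jtok"]))

frontier = [a for a in rows if not any(dominated(a, b) for b in rows)]
for p in sorted(frontier, key=lambda r: r["tps"], reverse=True):
    print(f"{p['config']:24s} {p['tps']:7.2f} tok/s  {p['jtok']:5.2f} J/tok")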

Chronology

Timeline — May 7–11, 2026

May 7

Plan: Gemma4 31B Dense + 26B-A4B MoE; design TQ3_0 KV cache architecture.

May 7–8

Target + draft baselines, TDD smoke tests; chunked prefill 12–16× speedup (3335ee2).

May 8

Draft KV cache; pFlash sparse SWA layer-by-layer (33b6e9d); first long-ctx output corruption observed.

May 9 AM

Diagnosed SWA mask coordinate frame bug (5b6ba1b); gate pFlash on supported KV types.

May 9 PM

Built MTP from scratch: loader (1115064), graph (d4659ca), wiring (05e36e4); accept rate stayed near zero on 64K tests.

May 9 eve

Five MTP fixes in 24 h: 138de4d, 30b2b50, c56879c, 7b62c07, f1f811e. MTP runs to completion but accept ≤2% at 64K — fixes restored correctness, not yield.

May 9–10

Bench matrices: matrix-v2 / matrix-v3 / matrix-64k / matrix-64k-v2; scaling 16K → 262K; dm-sweep. dm=4 confirmed as global optimum for long code prompts.

May 10

Debugging-journey blog (b441587, gemma4-journey.md). Dense-31B ladder: viable at 128K/256K (~24 tok/s) but decode collapses at 64K (1.78 tok/s, AL=1.94).

May 10 eve

γ>1 MTP plan written, momus-reviewed. Phases 1+2+3 (approach A — re-capture target forward) landed as commit d8ebd12. First γ=4 result at 4K: 20.75 tok/s (+10% over γ=1, but missed 1.5× target due to re-capture overhead).

May 11 AM

Phase 3.5 — approach B (4bcb972): multi-row mtp_h_prev_batch capture eliminates re-capture forward. γ=4 at 4K: 25.6 tok/s (+39% over no-MTP). γ=2 at 64K: 10.1 tok/s (+61% over no-MTP). 16K dead zone resolved. The pre-rebase "MTP 0.02 accept at 64K" claim is officially superseded — current 64K γ=2 accept is 0.73.

Next Steps

Open Questions

Resolved this session

  • Dense-31B 64K decode collapse — 1.78 tok/s, AL=1.94. Resolved post-rebase (commit daef232a6): Dense-31B 64K decode now 2.54 tok/s with TQ3/TQ3 (no drafter, no pflash); 6.26 tok/s with TQ3/TQ3 + pflash baseline; 10.07 tok/s with MTP γ=2 (+61% over the +pflash baseline). The "1.78 anomaly" was the pre-rebase TQ3 dispatcher bug.
  • MTP accept ≤ 2% at 64K. Resolved: post-rebase γ=1 accept = 0.69; γ=2 chain accept = 0.73. The earlier figure was pre-Bug-2 noise on a broken binary.
  • MTP semantic mode collapse (tokens cycle between id 109 / 49). Resolved: same binary fix; current γ=2 at 64K produces coherent output with mean_accept_drafts = 1.46 / 2.
  • fattn.cu:652 MTP+Q8 latent crash. Resolved by Bug-2 (daef232a6) + Bug-3 (f1f811e); MTP + Q8 V-cache now runs cleanly through the chain.

Still open

  • <mask> / <unused94> token leak in DFlash V3 output. TQ3 dequant artifact, or draft-KV mask off-by-one? Worth retesting on the current binary.
  • TQ3_0 ring-wrap path forces F32 concat — loses FWHT benefit on long-ctx MTP. Workaround only. Structural fix (rotate-once-at-cache-write) is the obvious next move.
  • FATTN_KQ_STRIDE=256 is private — if upstream changes it, our 256-pad assumption silently breaks.
  • Per-layer FA masks not implemented — asserted-equal across need-mask layers; long-ctx SWA-cap mismatch would trip it.
  • P1: No purpose-trained MoE MTP drafter exists — the MTP assistant was built and validated against the Dense-31B target; the donor-remapped load (discovery #17, accept 0.81 at 4K) runs against MoE, but MoE speculative decode still leans primarily on DFlash.
  • P1 (γ>1): γ=8 underperforms γ=4 at every context — drafter's autoregressive feedback drifts too far from target's hidden distribution past chain depth 4. Either a retrained MTP head with deeper chain awareness, or tree-mask drafting (llama.cpp PR #22838), would push the sweet spot higher.
  • P2 (γ>1): Stochastic γ>1 sampling is currently fatal — requires Leviathan/SpecInfer-style importance rescaling. Greedy-only for now.

Acknowledgements

Credits — built on the shoulders of others

Models

  • Google DeepMind — Gemma 4 — the underlying language models (Gemma-4-26B-A4B-it MoE and Gemma-4-31B-it Dense); released 2026-04-02.
  • Google — Gemma 4 MTP drafters — official 0.5B speculative drafters; released 2026-05-06. We run them alongside the DFlash drafter via our forked stack.

Inference & Frameworks

  • ggml-org / llama.cpp — the foundation. ggml CUDA backend, GGUF format, FlashAttention kernels. Everything stands on this.
  • z-lab / dflash — the DFlash speculative-decoding block-diffusion drafter architecture that our drafters extend.
  • ggml team — TQ3_0 / TurboQuant — 3.5-bit FWHT-rotated KV quantization. We extended TQ3_0 support for Gemma 4's SWA + full-attention hybrid layout.

Maintainers & Community

  • howard0su — Luce-Org/lucebox-hub maintainer; PR #131 reviewer; SWA truncation port (PR #129) we adapted to Gemma 4.
  • Luce-Org — upstream framework: daemon, server, prefix cache, scientific harness.
  • All llama.cpp contributors past and present — the 20 fixes we made would not have been possible without their infrastructure.

Reviewers & Tools

  • cubic-dev-ai bot — automated P1/P2 review feedback that caught real defensive bugs across the PR stack.

Inspiration

  • Speculative decoding — Leviathan et al., 2023. The theoretical backbone of all drafter experiments.
  • FlashAttention — Dao et al., 2022. The IO-aware attention algorithm that makes pFlash viable at scale.