feature/gemma4-support

Gemma4 rocking on the 3090

1 M tokens at 19.8 tok/s (23.7 with MTP). 256 K at 63.82 tok/s (peak 67.95).
Includes: receipts, code, copy-paste configs.

6.64 J/tok best efficiency · 132 tok/s MoE peak (4K) · 63.82 tok/s MoE 256K ship (peak 67.95) · 1M ctx TQ3 on 24 GB · 20 distinct fixes · 101 commits, 4 days

Built in 4 days (May 8–11, 2026; 101 commits across 4 unique days) by humans and AI in collaboration — direction over autonomy, simple tools over magic: ggml, git, a 24 GB GPU.

Overview

Headline

Efficiency

Best Joules per token

6.64 J/tok — MoE + creative + dm=2

82.1 tok/s · AL 1.63 · 18.9 GB VRAM. Best speculation-on efficiency in the 24-cell sweep. Baseline (dm=1, no spec) is even more efficient at 5.70 J/tok but slower at 58 tok/s.

  • MoE+creative+dm=1 · 58 tok/s · 5.70 J/tok
  • MoE+creative+dm=2 · 82 tok/s · 6.64 J/tok ★
  • MoE+creative+dm=16 · 62 tok/s · 7.15 J/tok
MoE 26B-A4B

Speed king on code

132.1 tok/s — code, dm=16

AL 5.22 · 11.82 J/tok · 18.9 GB · ~130 W average. Long-context (post-rebase): 63.82 tok/s at 256K (Q8/Q8 + pflash + dm=4, peak observed 67.95); 1M context fits with TQ3/TQ3 + pflash at 19.8 tok/s, 22.3 GB (1.7 GB headroom). MTP γ=2 pushes 1M to 23.7 tok/s at 23.9 GB (sub-GB headroom).

  • Code AL plateau: 5.22
  • Creative AL plateau: 2.49 ← drafter is code-distribution-trained
  • 256K ship config: 63.82 tok/s · 21.75 GB (peak observed 67.95; 2× prior estimate)
  • 1M TQ3 + pflash: 19.8–20.1 tok/s on code prompts across three sweeps (prose 19.12); +MTP γ=2: 22.0–23.7 tok/s at 23.9 GB — cliff unlocked by daef232a6
Dense 31B

Generalist; better OOD drafter

82.5 tok/s — creative, dm=32

AL 5.12 · 10.75 J/tok · 22.1 GB · avg ~170 W (dense_creative_dm32, scientific sweep). dm=16 essentially identical (81.7 tok/s).

  • Code AL plateau: 4.20
  • Creative AL plateau: 5.12 ← drafter generalises OOD
  • Long-ctx viable ≥128K (~2.5 tok/s, real text); 64K cliff resolved (2.54 tok/s post-rebase, was 1.78 tok/s anomaly)

On a 24 GB 3090, the right Gemma 4 config is a regime function of (model, context). MoE 26B-A4B: Q8/Q8 KV + pflash + DFlash dm=4 dominates from 4K through 512K (60–132 tok/s depending on ctx). TurboQuant (TQ3) becomes mandatory only at 1M+ where Q8 KV pages — there DFlash holds at 26 ± 5 tok/s. Dense 31B: same Q8 stack peaks at 81.7 tok/s (4K, creative). Above 32K the model is VRAM-bound; MTP γ=2 + TQ3 + pflash is the only viable drafter (~10 tok/s through 128K). 256K+ Dense is infeasible on 24 GB. See §Recipes for the full decision matrix and §Discoveries for the 20 ggml/llama.cpp fixes this required.
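The decision matrix reduces to a few branches. A minimal sketch in Python, transcribing the recipe tables in §Recipes; the function name and exact thresholds are ours, not an API in the repo:

# Illustrative regime picker distilled from the §Recipes tables on this page.
def pick_config(model: str, ctx: int) -> str:
    """Recommended 24 GB RTX 3090 config for (model, context length)."""
    if model == "moe-26b-a4b":
        if ctx <= 524_288:                      # Q8 KV fits through 512K
            dm = 16 if ctx <= 4_096 else 8 if ctx <= 65_536 else 4
            return f"Q8/Q8 KV + pflash + DFlash dm={dm}"
        return "TQ3/TQ3 KV + pflash (Q8 KV pages at 1M)"
    if model == "dense-31b":
        if ctx <= 4_096:
            return "Q8/Q8 KV + pflash + DFlash dm=16"
        if ctx <= 32_768:
            return "Q8/Q8 KV + pflash, no drafter (DFlash draft KV breaches 24 GB)"
        if ctx <= 131_072:
            return "TQ3/TQ3 KV + pflash + MTP gamma=2"
        return "infeasible on 24 GB"
    raise ValueError(f"unknown model: {model}")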

Context

How we compare to public RTX 3090 + Gemma 4 numbers

Gemma 4 was released 2026-04-02 — five weeks before these runs. Public benchmarks for this model on RTX 3090 hardware are sparse, and none publish long-context numbers. The table below compares what we measured against the best community-reported figures we found.

Metric | Lucebox | Best public 3090 | Source / notes
MoE 26B-A4B peak decode @ 4K | 132.1 tok/s | 80–110 typical; 119 best | Q4_K_M, no spec decode in community reports
MoE 26B-A4B decode @ 256K | 63.82 tok/s (peak observed 67.95) | No published 3090 number at 256K | Uncontested; Q8/Q8 + DFlash + pflash + dm=4 (closing-sweep replicated mean)
MoE 26B-A4B max context (24 GB) | 1M @ 19.8 tok/s, no drafter (23.7 tok/s with MTP γ=2) | No published 3090 1M number | Uncontested; TQ3/TQ3 + pflash; 22.3 GB / 23.9 GB
Dense 31B peak decode @ 4K | 81.7 tok/s (creative dm=16, AL=5.12, 24-cell sweep; up to ~98 tok/s on single-prompt HumanEval checks, less rigorous) | 40–50 (FA hangs in mainstream) | Ollama #15350 FA hang; craftrigs.com reports 44 tok/s
Speculative AL, code / creative | 5.22 / 5.12 | No published 3090 + Gemma 4 drafter results | Uncontested; DFlash drafter; MoE code / Dense creative respectively
Dense 64K decode (TQ3, no drafter) | 2.54 tok/s | No published comparable | Post-rebase; 64K cliff resolved (was 1.78 tok/s anomaly)
Coverage caveat. Gemma 4 was released 2026-04-02. As of 2026-05-10, llama.cpp does not yet support Google's official MTP drafter architecture (Gemma4AssistantForCausalLM, discussion #22735). Lucebox runs both DFlash and MTP drafters today via a forked stack. Community coverage of Gemma 4 on consumer hardware is thin — we expect public numbers to improve as mainstream tools catch up. See also gemma4.wiki for an aggregated public scorecard.

Post-Rebase Validation

TQ3_0 Frontier — 1M context on 24 GB

Post-rebase (daef232a6 + 4b0c158), TQ3/TQ3 MoE generates coherently from 4K to 1M tokens with no cliff. Numbers below are measured on RTX 3090 24 GB, MoE 26B-A4B, TQ3_0/TQ3_0 KV, no drafter, n_predict=64.

MoE 26B-A4B — TQ3/TQ3, no drafter, no pflash (pre-pflash baseline; current ship adds pflash → 19.8–20.1 tok/s on code, 19.12 on prose, through 1M)

Context | Prefill tok/s | Decode tok/s | VRAM peak (GB) | KV cache saved (%)
4K | n/a | 6.95 | 17.43 | 41.7
16K | 1 803.5 | 6.59 | 17.77 | 72.9
32K | 1 801.0 | 6.56 | 17.83 | 78.1
64K | 1 424.7 | 5.77 | 18.04 | 80.7
96K | 1 406.2 | 5.64 | 18.18 | 81.6
128K | 1 407.3 | 5.63 | 18.31 | 82.0
256K | 1 392.9 | 5.65 | 19.14 | 82.7
384K | 1 452.6 | 5.89 | 19.61 | 82.9
512K | 1 441.2 | 5.77 | 20.13 | 83.0
768K | 1 441.5 | 5.82 | 21.23 | 83.1
1M | 1 444.6 | 5.82 | 22.26 | 83.2

Dense 31B — TQ3/TQ3, no drafter

Context | Prefill tok/s | Decode tok/s | VRAM peak (GB)
4K | 190.5 | 3.38 | 19.49
64K | 416.7 | 2.54 | 21.07
128K | 414.2 | 2.52 | 22.12
256K (saturated) | 47.4 | 2.45 | 24.00
Headline With pflash on (current ship), MoE TQ3 holds 19.8–20.1 tok/s on code prompts (prose 19.12) from 64K through 1M, no cliff. The table above is the pre-pflash tq3-frontier/ baseline (5.6–6.6 tok/s) — TQ3 prefill dominates without chunked+pflash. 22.3 GB peak VRAM at 1M leaves 1.7 GB headroom on a 3090. TQ3/TQ3 pays a context-dependent decode-tok/s tax vs Q8/Q8 (Q8 saturates ≥512K) but unlocks contexts Q8 cannot reach.

Practical First

Recipes & Regime Map

All measurements are from the current binary (post commits 4bcb972 + submodule 6715acf13, 2026-05-11) on a single 24 GB RTX 3090, WSL2, greedy decode (--temp 0 --seed 0 --ignore-eos --n-predict 64) on the 50 K-token code prompt unless noted. Earlier inflated framing has been walked back — see the cited evidence and the full dossier on GitHub.

MoE 26B-A4B — Q8 + pflash + DFlash wins through 512K

Context | Recommended config | Decode tok/s | J/tok | VRAM (GB)
4K | Q8/Q8 + pflash + DFlash dm=16 (code) | 132.1 | 11.82 | 18.90
64K | Q8/Q8 + pflash + DFlash dm=8 | 34.66 | 12.02 | 19.84
128K | Q8/Q8 + pflash + DFlash dm=4 | 66.97 | 6.26 | 20.45
256K | Q8/Q8 + pflash + DFlash dm=4 | 63.82 | 8.13 | 21.75
512K | Q8/Q8 + pflash + DFlash dm=4 | 62.40 | 6.82 | 24.00 (sat)
1M | TQ3/TQ3 + pflash + DFlash dm=4 (Q8 pages) | 26 ± 5 (triple-trial σ) | awaiting | 24.00 (sat)

Q8 KV fits comfortably to 512K on MoE 26B-A4B (KV is small because only 4B params are active). DFlash dm=4 amortizes verify across multiple tokens at 60+ tok/s through that band. At 1M, Q8 KV doesn't fit; TurboQuant becomes mandatory. The triple-trial σ at 1M reflects run-to-run variance under WSL2/CUDA allocator state (see discovery #15).
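Discovery #15 implies a measurement protocol: trials at saturated-VRAM contexts should each run in a fresh process rather than a warm shell. A sketch of such a cold-start harness, assuming a hypothetical bench entry point and log format (both ours; adjust to the real harness):

import re, statistics, subprocess, time

# HYPOTHETICAL bench invocation — substitute the real harness command.
CMD = ["python3", "dflash/scripts/bench.py", "--ctx", "1048576"]

def parse_decode_tps(log: str) -> float:
    # assumes the log prints something like "decode: 19.8 tok/s"; adjust as needed
    m = re.search(r"decode[^\d]*([\d.]+)\s*tok/s", log)
    if m is None:
        raise ValueError("decode tok/s not found in log")
    return float(m.group(1))

def cold_trials(n: int = 3, settle_s: float = 30.0) -> list[float]:
    vals = []
    for _ in range(n):
        # fresh process per trial: no allocator-state carryover between cells
        out = subprocess.run(CMD, capture_output=True, text=True, check=True)
        vals.append(parse_decode_tps(out.stdout))
        time.sleep(settle_s)             # let driver/allocator state quiesce
    return vals

v = cold_trials()
print(f"mean {statistics.mean(v):.2f} tok/s, sigma {statistics.stdev(v):.2f}")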

Dense 31B — VRAM-bound above 32K; MTP γ=2 takes over

Context | Recommended config | Decode tok/s | J/tok | VRAM (GB)
4K | Q8/Q8 + pflash + DFlash dm=16 (creative) | 81.7 | 8.82 | 22.08
32K | Q8/Q8 + pflash (no drafter — DFlash hurts here) | 18.33 | 35.04 | 21.36
64K | TQ3/TQ3 + pflash + MTP γ=2 | 10.07 | 33.79 | ~22
128K | TQ3/TQ3 + pflash + MTP γ=2 | 10.18 | 34.50 | ~23
≥256K | Infeasible on 24 GB — Dense 31B model + KV + drafter exceeds VRAM. Untested. | n/a | n/a | n/a

Dense 31B is much larger than MoE 26B-A4B's active footprint. By 32K, Q8 KV already pushes VRAM to 21 GB; adding DFlash's draft KV cache pushes it over 24 GB and decode regresses. MTP γ=2 cross-attends to the target KV (no extra cache), so it fits where DFlash doesn't. Accept rate 0.78 at 128K (code) — the MTP head transfers cleanly to long context.

Per-component contribution at MoE 64K code (OVAT)

Change from naive baseline | Decode tok/s | Δ vs Q8 naive | Verdict
Naive: Q8/Q8 + no pflash + no drafter | 23.25 | baseline |
Q8 → TQ3 KV (alone) | 20.09 | −14% | TQ3 is a decode cost when ctx fits in Q8
+ pflash (on Q8) | 23.65 | +2% | pflash is decode-neutral at 64K (huge prefill win)
+ DFlash dm=8 on TQ3+pflash | 34.66 | +49% | DFlash is the headline speedup AND best J/tok (12.02)
+ MTP γ=2 on TQ3+pflash | 23.10 | −0.6% | γ=2 ties the naive baseline — TQ3 penalty cancels MTP gain

One-variable-at-a-time ablation at MoE 26B-A4B × ctx=64K × code prompt. Earlier framing "γ=2 at 64K = +61 % over no-MTP" was vs the TQ3-handicapped baseline. Vs the proper naive Q8 baseline, MTP γ=2 essentially ties. Logs at mtp-gamma/ovat-moe-64k-code/.

Bottom line Q8 KV + pflash + DFlash is the speed king on MoE up to 512K (and on Dense at 4K). TurboQuant earns its place only when Q8 doesn't fit — MoE 1M, and as a fallback for Dense at 64K+ where adding DFlash would breach 24 GB. MTP γ>1 is implemented (Phases 1–3.5 today) and is the right drafter specifically for Dense above 32K where DFlash is infeasible — elsewhere DFlash dominates.

Practical Guide

What this means if you run a 3090

A $700–1 000 used RTX 3090 now runs Gemma 4 on real long-context workloads — coding, document analysis, agent loops — without offloading weights or KV data. The exact config to use lives in §Recipes; this section is about what those numbers actually buy you. Bottom-line decode rates: MoE 26B-A4B 60–67 tok/s at 128K–512K, ~26 tok/s at 1M. Dense 31B ~10 tok/s through 128K. Everything in 24 GB VRAM.

Real workloads now possible on 24 GB

  • Whole-codebase Q&A — load a 200 K-token monorepo into context once, ask interactively at ~64 tok/s (Q8/Q8 + pflash + DFlash dm=4). Prefill runs at ~5 000 tok/s so the 256 K initial load takes roughly 50 seconds, then questions are answered without reloading.
  • Long-document summarization — entire 500-page PDFs (~400 K tokens) in a single shot at ~62 tok/s decode (Q8/Q8 + DFlash dm=4 at 512K, VRAM saturated near 24 GB). No chunking, no retrieval hacks.
  • Agent loops with deep history — 1 M-token rolling context for tool-use agents that never need to re-summarise after every step. 19.8 tok/s decode (no drafter, TQ3/TQ3 + pflash, 22.3 GB / 1.7 GB headroom) or 23.7 tok/s with MTP γ=2 at 23.9 GB (sub-GB headroom, edge of the wall).
  • Local privacy-first AI — none of this requires offload to RAM or disk; all weights and KV state remain in VRAM. Your data never leaves the machine.
Bottom line If you have a 3090 and you're running an LLM locally, you no longer have to choose between context and speed below 64 K — and you can finally have ≥256 K context above it. The TQ3/TQ3 KV path closes the gap to datacenter inference for long-context use cases on a single $1 000 card.

At a glance

Four charts

Drawn directly from scientific/results.csv (24 cells, RTX 3090 24 GB, Q8/Q8 KV, 4 K ctx, n_predict=256, temp=0 seed=0, pflash on). Power integrated trapezoidally from nvidia-smi at ~5 Hz; decode energy apportioned by phase fraction.
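The energy accounting is simple enough to reproduce. A minimal stand-in with synthetic samples, assuming a (timestamp, watts) trace polled from nvidia-smi at ~5 Hz; real traces and phase timings come from the harness, and the variable names here are ours:

import numpy as np

t = np.linspace(0.0, 30.0, 151)          # ~5 Hz timestamps over a 30 s run (stand-in)
p = np.full_like(t, 130.0)               # board power samples in W (stand-in: flat 130 W)

# trapezoidal integral of P dt — total energy over the active window, in joules
total_energy_j = float(np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(t)))

# phase split by time fraction, as described above
wall_s, decode_s = 30.0, 27.26           # wall time and decode-phase duration (example)
decode_energy_j = total_energy_j * (decode_s / wall_s)

decoded_tokens = 256                     # n_predict in the sweep
print(decode_energy_j / decoded_tokens)  # decode J/token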

Decode J/token — efficiency

Sorted ascending. Lower is better. Bars solid = code, hatched = creative.

[Bar chart: decode J/token per config, from moe·creative·dm1 at 5.70 ★ up to dense·code·dm1 at 27.72. Legend: MoE code, MoE creative, Dense code, Dense creative. Per-config values appear in the 24-cell table in §Scientific Benchmark.]

MoE+creative configurations dominate the efficient end. dense+code is the most expensive across all dm budgets.

Decode tok/s — speed

Sorted descending. Higher is better. dm=32 ≈ dm=16 — the budget plateau.

[Bar chart: decode tok/s per config, from moe·code·dm32 at 132.5 and moe·code·dm16 at 132.1 ★ down to dense·code·dm1 at 30.4. Per-config values appear in the 24-cell table in §Scientific Benchmark.]

Top three rows are MoE+code at dm=8/16/32 — speeds within 1 % of each other. dm=16 is the practical winner; dm=32 wastes draft compute.

Acceptance length × draft_max — OOD gap

Lines plateau by dm=16. MoE drafter collapses on creative; Dense drafter does not.

[Line chart: acceptance length (AL) vs draft_max budget on a log₂ scale, plateauing by dm=16. dm=32 plateau values: MoE code 5.22, MoE creative 2.49, Dense code 4.20, Dense creative 5.12.]

Dense lines are tightly clustered. MoE creative diverges sharply downward — the drafter doesn't generalise out-of-distribution. Numbers in legend are dm=32 plateau values.
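A toy throughput model shows why the lines going flat makes dm=32 pointless. Assume (our simplification, not the harness's model) that the block-diffusion drafter emits dm tokens in one parallel pass of roughly fixed cost, and that each draft+verify round commits AL tokens on average; the time constants below are picked for illustration, not measured:

def decode_tps(al: float, t_verify_s: float, t_draft_s: float) -> float:
    # one speculation round: one parallel draft pass + one target verify pass,
    # committing AL accepted tokens on average
    return al / (t_verify_s + t_draft_s)

# Chart 3's plateau: MoE code AL is 5.22 at both dm=16 and dm=32, so the
# rounds are identical and dm=32 only drafts more tokens to throw away.
for dm, al in [(16, 5.22), (32, 5.22)]:
    print(f"dm={dm}: {decode_tps(al, t_verify_s=0.035, t_draft_s=0.004):.1f} tok/s")

Once AL stops growing with dm, the round time is unchanged and the extra drafted tokens are simply rejected: that is the dm=16 ≈ dm=32 plateau in the speed chart, and the worse J/tok at dm=32.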

Pareto — speed × efficiency

Each point a config. Up + right = win. Pareto frontier in coral.

[Scatter: decode tok/s vs decode J/token, one point per config (MoE/Dense × high-spec dm≥4 / low-spec dm≤2 × code/creative), with the Pareto frontier drawn in coral.]

The Pareto frontier is entirely MoE. Dense points sit strictly above (more J/tok) for any given speed. The frontier inflects sharply at moe_creative_dm2 — the most efficient real-speculation operating point.

Practical Recipes

Run it yourself — top configurations for RTX 3090 (24 GB)

Five copy-paste-ready server launches. Each starts the OpenAI-compatible endpoint on port 8080. Replace /path/to/… with your actual GGUF paths.

Pre-flight requirements

  • RTX 3090 (24 GB) or equivalent VRAM; CUDA 12.x toolkit
  • Build the test binary: cmake --build dflash/build -j 4 --target test_gemma4_dflash
  • For 24 GiB cards, build with cmake -DGGML_CUDA_NO_VMM=ON .. for cleaner long-context prefill (auto-set by server.py when it detects ≤24 GiB)
  • Pull GGUFs from Hugging Face: gemma-4-26B-A4B-it, gemma-4-31B-it, plus matching draft GGUFs. To quantize a draft from safetensors: python3 dflash/scripts/quantize_draft_q8.py --arch gemma4 ...

Config 1 — MoE Speed King MoE 26B-A4B

Use case: code completion, agent prompts under 4K tokens  ·  Expected throughput: ~132 tok/s decode, AL ≈ 5.22

python3 dflash/scripts/server.py \
  --target /path/to/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf \
  --draft  /path/to/gemma4-26b-a4b-dflash/draft-q8_0.gguf \
  --max-ctx 4096 \
  --ctk q8_0 --ctv q8_0 \
  --budget 64 \
  --pflash \
  --port 8080

--budget 64 maps to effective dm=16 via server's budget→dm mapping.
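The real mapping lives in dflash/scripts/server.py. Both published pairs on this page (--budget 64 → dm=16 here; --budget 16 → dm=4 in Config 2) are consistent with a divide-by-four rule, so a plausible reconstruction, explicitly a guess, is:

def budget_to_dm(budget: int) -> int:
    # HYPOTHETICAL reconstruction: both published pairs (64 -> 16, 16 -> 4)
    # fit budget // 4. Check dflash/scripts/server.py for the real rule.
    return max(1, budget // 4)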

Config 2 — MoE Long-Context Ship MoE 26B-A4B

Use case: 256K-context coding agents, document analysis  ·  Expected throughput: 63.82 tok/s decode (peak observed 67.95), 21.75 GB VRAM

python3 dflash/scripts/server.py \
  --target /path/to/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf \
  --draft  /path/to/gemma4-26b-a4b-dflash/draft-q8_0.gguf \
  --max-ctx 262144 \
  --ctk q8_0 --ctv q8_0 \
  --budget 16 \
  --pflash \
  --lazy-draft \
  --port 8080

Config 3 — MoE 1M Frontier MoE 26B-A4B

Use case: massive document analysis, multi-file repository inspection  ·  Expected throughput: 19.8 tok/s decode (23.7 with MTP γ=2; 1.7 GB headroom no-drafter, 120 MB headroom with MTP)

python3 dflash/scripts/server.py \
  --target /path/to/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf \
  --max-ctx 1048576 \
  --ctk tq3_0 --ctv tq3_0 \
  --pflash \
  --port 8080

No drafter — DFlash speculation hurts at 1M with VRAM-saturated TQ3 KV; this config is pure target-only decode. MTP γ=2, which adds no draft KV cache, is the exception: 23.7 tok/s at 23.9 GB (sub-GB headroom).

Config 4 — Dense 31B Quality Dense 31B

Use case: creative writing, OOD prompts where MoE drafter under-performs  ·  Expected throughput: 81.7 tok/s @ 4K, AL=5.12 (creative dm=16, from the 24-cell scientific sweep). HumanEval-class prompts have shown peaks up to ~98 tok/s in single-prompt checks; we list the rigorously-measured number as the headline.

python3 dflash/scripts/server.py \
  --target /path/to/gemma-4-31B-it-Q4_K_M.gguf \
  --draft  /path/to/gemma4-31b-dflash/draft-q8_0.gguf \
  --max-ctx 4096 \
  --ctk q8_0 --ctv q8_0 \
  --budget 64 \
  --pflash \
  --port 8080

Config 5 — Dense 31B Long-Context Min-VRAM Dense 31B

Use case: 64K dense context on 24 GB card with TQ3 compression  ·  Expected throughput: 2.54 tok/s decode (post-rebase fix; was 1.78 tok/s anomaly)

python3 dflash/scripts/server.py \
  --target /path/to/gemma-4-31B-it-Q4_K_M.gguf \
  --max-ctx 65536 \
  --ctk tq3_0 --ctv tq3_0 \
  --pflash \
  --port 8080

Common notes — OpenAI-compatible client

  • Drop-in OpenAI client: OPENAI_API_BASE=http://localhost:8080/v1 and OPENAI_API_KEY=sk-any (Python example below the curl test)
  • Test endpoint:
    curl http://localhost:8080/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{"model":"luce-dflash","messages":[{"role":"user","content":"hi"}],"stream":true}'

Benchmarks

What Worked

§1A — Dense 31B

Gemma-4-31B-it (dense). Drafter: Gemma-4-DFlash 5-layer block-diffusion.

§1.1 Matrix-v2 — 4K, target-only baseline

Matrix-v2 (4K context, target-only baseline)

Run | Draft | K / V | TTFT (ms) | Decode tok/s | Accept
M1_none_tq3 | none | TQ3_0 / TQ3_0 | 40.47 | 26.69 | n/a
M2_none_q8 | none | Q8_0 / Q8_0 | 32.22 | 33.86 | n/a
M3_mtp_tq3 | mtp | TQ3_0 / TQ3_0 | 48.81 | 25.28 | 39%
M4_mtp_q8 | mtp | Q8_0 / Q8_0 | n/a | crashed @ step 208 | 68% pre-crash

M4 crash: fattn.cu:652. Latent kernel-side issue with MTP + Q8 V-cache.

Pre-rebase numbers (before daef232a6 + 4b0c158). See §TQ3 Frontier for current TQ3 baselines.

§1.2 Matrix-v3 — 4K, K=Q8_0 fixed, V varied

Matrix-v3 (4K context, K=Q8_0 fixed, V-cache varied)

Run | V-cache | TTFT (ms) | Decode tok/s
N1_none_q8_tq3 | TQ3_0 | 38.67 | 30.21
N2_none_q8_q8 | Q8_0 | 31.21 | 34.79 (+15%)
N3_mtp_q8_tq3 | TQ3_0 | n/a | crashed @ step 216 (fattn.cu:652) — 5% accept

Pre-rebase numbers (before daef232a6 + 4b0c158). See §TQ3 Frontier for current TQ3 baselines.

§1.3 Matrix-64K-v2 — 50K-token prompt, post-fix rerun

Pre-rebase numbers (before daef232a6 + 4b0c158). See §TQ3 Frontier for current TQ3 baselines.

Matrix-64K-v2 (50K-token prompt, post-fix rerun)

Run | Draft | K / V | Prefill (s) | TTFT (ms) | Decode tok/s | Avg accept
V1_none | none | TQ3_0/TQ3_0 | 585.2 | 145.88 | 6.90 | n/a
V2_mtp (γ=1, pre-fix) | mtp | TQ3_0/TQ3_0 | 585.8 | 164.94 | 6.33 (regression) | 0.02 (5/256) — superseded by γ>1, see §Recipes
V3_dflash_dm8 | dflash | TQ3_0/TQ3_0 | 585.8 | 257.06 | 9.22 (+34%) | 2.29
§1.6 Dense-31B + DFlash + Q8/Q8 + dm=8 Ladder — pre-rebase 2026-05-10 (1.78 tok/s 64K anomaly resolved post-daef232a6; see §Recipes for current Dense recipes)

Dense-31B + DFlash + Q8/Q8 + dm=8 Ladder

ctx | Prefill tok/s | Decode tok/s | AL / 8 | VRAM
64K | 319 | 1.78 ← anomaly | 1.94 (24%) | 24 / 24 GB
128K | 256 | 24.89 | 7.11 (89%) | 24 / 24 GB
256K | 236 | 23.87 | 7.11 (89%) | 24 / 24 GB
AL caveat — the 128K and 256K AL=7.11 (89%) is partly inflated by a degenerate token-715 repetition loop in the output (token 715 dominates from step 5 onward). The drafter trivially predicts the loop. Real-quality acceptance under non-degenerate generation is unknown until a fresh prompt is run.
pFlash is active. Logs show [chunked+pflash, chunk_size=1024] in all three dense runs. The 15× prefill gap vs MoE (4912 vs 319 tok/s at 64K) is the dense-vs-MoE compute ratio (Dense 31B over 60 layers vs MoE ~4B active params over 30 layers ≈ 15× compute), not a pflash failure. pflash skips attention; FFN compute is unavoidable.

Surprise: Dense-31B is fine at 128K and 256K (~24 tok/s, healthy AL 7.11) but collapses specifically at 64K with AL crashing to 1.94. All three cells report 24.00/24.00 GB VRAM — likely a VRAM-allocator edge case at exactly 64K. Headline: dense 31B is viable at long context (≥128K) on a 24 GB GPU. MoE remains ~50% faster but dense is now an option.

§1B — MoE 26B-A4B

Gemma-4-26B-A4B (MoE, ~4B active). Drafter: Gemma-4-DFlash 5-layer block-diffusion. (No purpose-trained MoE MTP drafter exists yet — see open question P1 and discovery #17.)

§1.4 Scaling Sweep — MoE DFlash Q8/Q8, 16K → 262K (pre-pflash baseline 2026-05-10; current ship configs in §Recipes give 60+ tok/s across this range)

Scaling Sweep — MoE + DFlash (Q8_0/Q8_0)

Context | Prefill tok/s | TTFT (ms) | Decode tok/s | Avg accept
16K | 3 711 | 41.7 | 72.32 | 1.66
32K | 3 833 | 38.5 | 70.54 | 1.66
64K (dense) | 1 402 | 158 | 7.96 | n/a
65K (dflash) | 4 878 | 65.2 | 28.92 | 1.45
131K (dflash) | 4 888 | 62.3 | 29.90 | 1.45
262K (dflash) | 4 894 | 62.1 | 29.40 | 1.45

DFlash holds throughput near-constant from 65K to 262K — prefill tok/s rises slightly as batch efficiency improves. The "64K (dense)" row is the plain non-speculative path, which collapses without DFlash.

§1.5 dm-sweep — draft_max tuning, 50K code prompt, MoE

dm-sweep — draft_max tuning (50K code prompt, MoE)

ctx | dm | TTFT (ms) | Decode tok/s | Spec accept | Avg
64K | 1 | 69.5 | 23.01 | 256/256 | 1.00 (no wins)
64K | 2 | 62.8 | 33.81 | 169/256 | 1.51
64K | 4 | 64.0 | 36.57 | 143/256 | 1.79 ★
64K | 8 | 81.7 | 29.45 | 138/256 | 1.86
256K | 4 | 61.1 | 36.63 | 143/256 | 1.79 ✓
256K | 8 | 78.2 | 29.36 | 138/256 | 1.86 (degrade)

dm=4 is the global optimum for long code prompts. dm=16 wins only on short prompts (HumanEval): TTFT 88.8 ms, decode 97.81 tok/s, accept 6.56.

§1C — Scientific 24-cell sweep (energy-instrumented)

2 archs × 2 distributions × 6 dm budgets. RTX 3090 24 GB, Q8/Q8 KV, 4K ctx, n_predict=256, temp=0 seed=0 --ignore-eos, pflash on. Power sampled ~5 Hz via nvidia-smi; total energy is trapezoidal ∫ P dt; phase-split by reported phase fraction. See scientific/results.csv.

  • Peak decode: MoE+code+dm=16 → 132.1 tok/s, AL 5.22, 11.82 J/tok
  • Peak efficiency (real-spec): MoE+creative+dm=2 → 6.64 J/tok at 82.1 tok/s, AL 1.63
  • dm=32 wastes draft compute: identical to dm=16 in all four distribution × arch cells (e.g. moe·code 132.1 vs 132.5; dense·creative 81.7 vs 82.5).
  • OOD generalisation gap: Dense AL stable across distributions (4.20 code → 5.12 creative); MoE drafter collapses 5.22 → 2.49 — code-distribution-trained.
config | arch | dist | dm | decode tok/s | AL | decode J/tok | avg W | VRAM GB | wall s
moe_code_dm16 | moe | code | 16 | 132.11 ★ | 5.22 | 11.82 | 129.6 | 18.90 | 27.26
moe_code_dm32 | moe | code | 32 | 132.45 | 5.22 | 13.93 | 129.3 | 18.90 | 27.70
moe_code_dm8 | moe | code | 8 | 131.76 | 3.88 | 14.39 | 137.8 | 18.89 | 27.70
moe_code_dm4 | moe | code | 4 | 118.30 | 2.61 | 12.19 | 131.9 | 18.90 | 27.17
moe_code_dm2 | moe | code | 2 | 95.52 | 1.90 | 11.04 | 132.3 | 18.92 | 23.54
moe_creative_dm4 | moe | creative | 4 | 94.62 | 2.12 | 8.33 | 147.6 | 18.88 | 16.02
dense_creative_dm32 | dense | creative | 32 | 82.54 | 5.12 | 10.75 | 169.6 | 22.08 | 18.75
moe_creative_dm2 | moe | creative | 2 | 82.12 | 1.63 | 6.64 ★ | 205.4 | 18.90 | 10.17
dense_creative_dm16 | dense | creative | 16 | 81.68 | 5.12 | 8.82 | 176.9 | 22.08 | 14.47
moe_creative_dm8 | moe | creative | 8 | 68.87 | 2.03 | 7.85 | 196.0 | 18.88 | 11.41
dense_code_dm16 | dense | code | 16 | 68.00 | 4.20 | 17.33 | 147.6 | 22.11 | 33.79
dense_code_dm32 | dense | code | 32 | 67.60 | 4.20 | 17.87 | 146.5 | 22.11 | 34.69
moe_creative_dm16 | moe | creative | 16 | 61.91 | 2.49 | 7.15 | 175.7 | 18.89 | 11.71
moe_creative_dm32 | moe | creative | 32 | 61.35 | 2.49 | 9.55 | 154.4 | 18.89 | 17.34
dense_creative_dm4 | dense | creative | 4 | 58.45 | 3.05 | 12.43 | 182.9 | 22.08 | 18.55
moe_creative_dm1 | moe | creative | 1 | 58.04 | 1.00 | 5.70 | 209.5 | 18.87 | 7.70
moe_code_dm1 | moe | code | 1 | 56.67 | 1.00 | 14.46 | 139.7 | 18.89 | 29.18
dense_code_dm4 | dense | code | 4 | 55.73 | 2.91 | 22.63 | 159.6 | 22.10 | 33.64
dense_creative_dm8 | dense | creative | 8 | 52.34 | 4.41 | 12.25 | 230.2 | 22.07 | 16.63
dense_code_dm8 | dense | code | 8 | 47.93 | 4.06 | 20.28 | 163.9 | 22.10 | 34.52
dense_creative_dm2 | dense | creative | 2 | 47.49 | 1.87 | 18.15 | 171.6 | 22.10 | 29.30
dense_code_dm2 | dense | code | 2 | 46.49 | 1.82 | 20.08 | 163.6 | 22.09 | 34.19
dense_creative_dm1 | dense | creative | 1 | 30.75 | 1.00 | 13.84 | 317.5 | 22.07 | 12.04
dense_code_dm1 | dense | code | 1 | 30.42 | 1.00 | 27.72 | 180.2 | 22.10 | 36.32

Sorted by decode tok/s descending. ★ = peak speed (dm=16 — recommended over dm=32 because dm=32 wastes draft compute) and peak speculation-on efficiency. Baseline (dm=1) is more efficient still but slower.

Failures & Regressions

What Did Not Work

Hard Failures

  • Repeated fattn.cu:652 crash with MTP + Q8 V-cache: matrix-v2/M4_mtp_q8.log:60, matrix-v3/N3_mtp_q8_tq3.log:61, matrix-64k/MTP_humaneval.log:48 — 96% accept pre-crash, kernel-side latent issue.
  • T3_dflash.log rc=143 in matrix-64k — fixed by 5b6ba1b ("SWA mask coordinate frame").
  • dm-sweep 4K dm{4,8} — test misconfiguration: the prompt (50,003 tokens) exceeded ctx_size (4096).

Regressions (historical — most now resolved)

  • MTP @ 64K (V2_mtp): 6.33 vs 6.90 baseline (−8.2%), accept 0.02. Resolved: post-rebase γ=1 at 64K shows 0.69 accept; γ=2 chain shows 0.73 accept and +61% over the TQ3+pflash no-MTP baseline (roughly a tie vs the naive Q8 baseline; see the OVAT table in §Recipes). The 0.02 figure was pre-Bug-2 fix.
  • MTP @ 4K (M3_mtp_tq3): 25.28 vs 26.69 even at 39% accept. Resolved: γ=1 is structurally a "correctness gate" with zero amortisation. γ>1 + approach B (multi-row h_prev) gives γ=4 at 4K = 25.6 tok/s vs no-MTP 18.4 (+39%).
  • Sanity smoke: 0/8 accepted, garbled output. (early-stage stochastic-sampling bug; pre-greedy gate)
  • Plain decode @ 64K: 7.96 tok/s (MoE, §1.4); Dense @ 64K: 1.78 tok/s. (resolved per §1.6 64K cliff fix)

Quality Issues

  • V3_dflash_dm8 leaked <mask>, <unused94> tokens early (matrix-64k-v2/SUMMARY.md:55).
  • T2_mtp emitted token id 236772 repeatedly (vocab=262144). (pre-Bug-2 / pre-γ>1 chain; not reproducible on current binary)
  • MTP semantic mode collapse: tokens oscillated between id 109 / 49. Resolved: γ=1 single-step accept of 0.66 produces coherent output; γ=2 chain at 64K accepts both drafts most of the time.

Abandoned Approaches

  • Uniform TQ3_0 across all layers → narrow asymmetric (Q8 on captured full-attn, TQ3 on SWA), commit ce4da35.
  • Monotonic SWA full alloc → non-monotonic ring + correct mask (d68e7c4 / 19def9c / 5b6ba1b).
  • TQ3_0 preserved through ring-wrap → still F32 concat on wrap (workaround).
  • Standalone ggml_turbo_wht calls → fused into FA (c15f93a).

Key Intellectual Property

What We Discovered — 20 Findings

Each entry below is a non-obvious finding with a concrete fix location. These findings do not exist in upstream llama.cpp.

1. SWA ring geometry: The mask must map the absolute query position to a ring slot, (qpos − abs_win_start) % ring_size; without this, chunks 2+ silently read stale KV. (Toy sketch after this list.) Where: gemma4_target_graph.cpp:205-222; commit 5b6ba1b.
2. TQ3_0 KV alignment: TQ3_0 + CUDA FA needs ne[1] % 256 == 0. FATTN_KQ_STRIDE=256 is hardcoded in CUDA — no public ggml API exposes it. Where: gemma4_target_graph.cpp:99-101, 350-355; commits c56879c, 41e4848.
3. Non-monotonic ring restoration: Disabling the ring optimisation cost 50%+ VRAM on SWA layers; re-enabled with correct mask geometry. Where: commits 19def9c, d68e7c4.
4. Narrow asymmetric KV: The pFlash block-sparse path does not support TQ3_0 → forced Q8 on captured full-attn layers; TQ3 stays on SWA. Where: internal.h:593-597; commit ce4da35.
5. TQ3_0 type-tag preservation through FA: Casting K/V to F32/F16 before FA breaks the kernel's TQ3 → FWHT routing. Pass tensors native; the CHUNKED kernel handles FWHT internally. Where: gemma4_mtp_graph.cpp:401-414; commit c56879c.
6. GQA block-broadcast for MTP: MTP's n_head_norm ≠ target head_dim_kv; an explicit reshape is required before FA. Where: gemma4_mtp_graph.cpp:120-122, 225-336; commit 30b2b50.
7. MTP h_prev capture: The last-committed token's post-block hidden state from the last full-attn layer. KQ scale = 1.0 (not 1/√d). Assistant's own rope_freqs.weight. Where: gemma4_mtp_graph.cpp:266-285; commit 138de4d.
8. FA mask mandatory for head_dim ≥ 512: The CUDA MMA dispatcher requires gqa_opt_applies, K->ne[1] % 256 == 0, AND mask != nullptr; provide an all-zero mask when logically all positions are admitted. Where: gemma4_mtp_graph.cpp:468-481; commit f1f811e.
9. n_tokens=1 decode + SWA: Single-token decode also needs an allocated and filled SWA mask sized to ring_size. Where: commit 7b62c07.
10. Shared FA mask across layers: A single mask buffer is reused across all need-mask layers (256-align or head_dim ≥ 512). Where: internal.h:847-852.
11. DFlash decode correctness: BOS/EOS handling + per-layer SWA mask on the draft (4 SWA + 1 full); prefix-direct KV semantics. Where: commits 1386690, 9588c97.
12. TQ3 K dequant intercept: The MMA path silently strips the FWHT rotation when reading the TQ3_0 K-cache; forced chunked-path dispatch for TQ3 K avoids the intercept. Where: submodule daef232a6 (force-chunked dispatcher; drops the broken MMA intercept) + outer-repo 4b0c158 (graph-level FWHT rotation via ggml_cont wrap).
13. Drafter distribution coverage: The Dense DFlash drafter generalises out-of-distribution; the MoE drafter is code-distribution-bound. AL plateau dense·code 4.20 vs dense·creative 5.12 (uniform); MoE·code 5.22 vs MoE·creative 2.49 (collapse). Discovered empirically via the 24-cell scientific sweep. Where: scientific/results.csv; gemma4-journey.md §11.
14. TQ3 long-context unlocked: Post-daef232a6, MoE TQ3 sustains 5.6–6.6 tok/s flat from 4K to 1M with sub-linear VRAM growth (~0.34 GiB per 64K ctx); 1M context on 24 GB confirmed. Update: with pflash on (current ship config) this band shifts up to 19.8–20.1 tok/s on code prompts (prose 19.12) across 64K–1M — see Headline. Where: submodule daef232a6; logs: tq3-frontier/.
15. DFlash 1M run-to-run variance: The same CLI / seed / prompt yields tok/s ∈ [20, 30] on MoE 1M cold, and can degrade to 4.86 after ~19 prior cells in the same shell. WSL2/CUDA allocator state matters; sweep-harness numbers at saturated-VRAM contexts need a cold-start protocol. Where: mtp-gamma/triple-falsify/trial_{1,2,3}.log.
16. γ>1 MTP chain (first public implementation): Approach B (multi-row mtp_h_prev_batch) eliminates per-chain re-capture overhead; mirrors the Google HF reference (post-final-RMSNorm capture, Q-only cross-attn, weight-tied lm_head). γ=2 produces accept 0.55–0.78 at long ctx on Dense and 0.57 on MoE. Where: commits d8ebd12 (A), 4bcb972 (B); gemma4_target_graph.cpp:1106-1123.
17. MoE MTP assistant exists: AtomicChat/gemma-4-26B-A4B-it-assistant-GGUF on Hugging Face loads cleanly against the MoE 30-layer target — the loader auto-remaps donor layer indices written for the Dense 60-layer target. Accept rate 0.81 at 4K MoE. Where: models/gemma4-mtp-26b-a4b/.
18. MoE Q8 ceiling at 512K, not earlier: Q8 + pflash decode is flat at 24–25 tok/s from 64K through 512K with 0.86 GB headroom at 512K; it pages only at 1M. The earlier framing "TQ3 wins at long context" was wrong below 1M. Where: mtp-gamma/q8-ceiling/; mtp-gamma/closing/moe_*_q8_pf_*.log.
19. Dense + DFlash above 32K is anti-economical: The DFlash drafter's KV cache pushes Dense 31B over 24 GB at 64K; decode drops to 5–7 tok/s and prefill collapses to 130 tok/s. MTP γ=2, which cross-attends to the target KV with no cache of its own, is the right drafter for Dense ≥ 32K. Where: mtp-gamma/closing/dense_*.log; mtp-gamma/paired-matrix/dense_mtp_*.log.
20. Q8 + pflash + DFlash dm=4 is the MoE long-ctx winner: At MoE 128K–512K it delivers 60–67 tok/s decode at 6–8 J/tok, 2× faster than the TQ3 + pflash + DFlash stack at the same contexts (32–33 tok/s). The site's previous framing oversold TQ3 below 1M. Where: mtp-gamma/closing/moe_*_q8_pf_dfl4.log.
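Finding #1 is worth a toy illustration. The mapping is the formula quoted in the entry; everything else (names, window values) is ours:

def ring_slot(qpos: int, abs_win_start: int, ring_size: int) -> int:
    # absolute query position -> slot in the SWA ring buffer (finding #1)
    return (qpos - abs_win_start) % ring_size

# Window of 8 slots whose absolute start has advanced to position 21.
# A naive qpos % ring_size lookup points at different (stale) slots.
for qpos in range(21, 29):
    print(qpos, "windowed:", ring_slot(qpos, 21, 8), "naive:", qpos % 8)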

Why none of this is upstream — TQ3_0 is bleeding-edge (Google TurboQuant ~2024); MTP assistants and DFlash drafters are proprietary architectures; pFlash block-sparse paths are research-grade. Upstream llama.cpp ships none of these.

Scientific Benchmark

Super-Bench — scientific/results.csv

Wall-clock-sourced, energy-instrumented benchmark sweep. Generated when the super-bench harness completes; live-reloaded from scientific/results.csv.

Schema (expected CSV columns; a consumer sketch follows the list)
  • config — run identifier (e.g. moe_dflash_q8q8_dm4_64k)
  • arch — dense | moe
  • draft — none | dflash | mtp
  • kv_k / kv_v — tq3_0 | q8_0
  • ctx — context length tokens
  • dm — draft_max
  • wall_s — wall time T_end − T0 (seconds, date +%s.%N)
  • prefill_ms, prefill_tps
  • decode_ms, decode_tps, al, accept_rate, first_tok_ms
  • vram_peak_gb
  • avg_power_w — averaged over full active window
  • total_energy_j — trapezoidal ∫ P dt
  • prefill_energy_j, decode_energy_j — phase-split by time-fraction
  • decode_j_per_tok — efficiency metric (decode_energy_j / decoded_tokens)
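A small consumer of this schema: it reads scientific/results.csv using the columns above and prints the Pareto-efficient configs over speed vs decode J/token. The frontier logic is ours, not part of the harness:

import csv

with open("scientific/results.csv") as f:
    rows = [{"config": r["config"],
             "tps": float(r["decode_tps"]),
             "jtok": float(r["decode_j_per_tok"])}
            for r in csv.DictReader(f)]

def dominated(a, b):
    # b dominates a: at least as fast AND at least as efficient, strictly better in one
    return (b["tps"] >= a["tps"] and b["jtok"] <= a["jtok"]
            and (b["tps"] > a["tps"] or b["jtok"] < a["jtok"]))

frontier = [a for a in rows if not any(dominated(a, b) for b in rows)]
for p in sorted(frontier, key=lambda r: r["tps"], reverse=True):
    print(f"{p['config']:24s} {p['tps']:7.2f} tok/s  {p['jtok']:5.2f} J/tok")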

Chronology

Timeline — May 7–11, 2026

May 7

Plan: Gemma4 31B Dense + 26B-A4B MoE; design TQ3_0 KV cache architecture.

May 7–8

Target + draft baselines, TDD smoke tests; chunked prefill 12–16× speedup (3335ee2).

May 8

Draft KV cache; pFlash sparse SWA layer-by-layer (33b6e9d); first long-ctx output corruption observed.

May 9 AM

Diagnosed SWA mask coordinate frame bug (5b6ba1b); gate pFlash on supported KV types.

May 9 PM

Built MTP from scratch: loader (1115064), graph (d4659ca), wiring (05e36e4); accept rate stayed near zero on 64K tests.

May 9 eve

Five MTP fixes in 24 h: 138de4d, 30b2b50, c56879c, 7b62c07, f1f811e. MTP runs to completion but accept ≤2% at 64K — fixes restored correctness, not yield.

May 9–10

Bench matrices: matrix-v2 / matrix-v3 / matrix-64k / matrix-64k-v2; scaling 16K → 262K; dm-sweep. dm=4 confirmed as global optimum for long code prompts.

May 10

Debugging-journey blog (b441587, gemma4-journey.md). Dense-31B ladder: viable at 128K/256K (~24 tok/s) but decode collapses at 64K (1.78 tok/s, AL=1.94).

May 10 eve

γ>1 MTP plan written, momus-reviewed. Phases 1+2+3 (approach A — re-capture target forward) landed as commit d8ebd12. First γ=4 result at 4K: 20.75 tok/s (+10% over γ=1, but missed 1.5× target due to re-capture overhead).

May 11 AM

Phase 3.5 — approach B (4bcb972): multi-row mtp_h_prev_batch capture eliminates re-capture forward. γ=4 at 4K: 25.6 tok/s (+39% over no-MTP). γ=2 at 64K: 10.1 tok/s (+61% over no-MTP). 16K dead zone resolved. The pre-rebase "MTP 0.02 accept at 64K" claim is officially superseded — current 64K γ=2 accept is 0.73.

Next Steps

Open Questions

Resolved this session

  • Dense-31B 64K decode collapse — 1.78 tok/s, AL=1.94. Resolved post-rebase (commit daef232a6): Dense-31B 64K decode now 2.54 tok/s with TQ3/TQ3 (no drafter, no pflash); 6.26 tok/s with TQ3/TQ3 + pflash baseline; 10.07 tok/s with MTP γ=2 (+61% over the +pflash baseline). The "1.78 anomaly" was the pre-rebase TQ3 dispatcher bug.
  • MTP accept ≤ 2% at 64K. Resolved: post-rebase γ=1 accept = 0.69; γ=2 chain accept = 0.73. The earlier figure was pre-Bug-2 noise on a broken binary.
  • MTP semantic mode collapse (tokens cycle between id 109 / 49). Resolved: same binary fix; current γ=2 at 64K produces coherent output with mean_accept_drafts = 1.46 / 2.
  • fattn.cu:652 MTP+Q8 latent crash. Resolved by Bug-2 (daef232a6) + Bug-3 (f1f811e); MTP + Q8 V-cache now runs cleanly through the chain.

Still open

  • <mask> / <unused94> token leak in DFlash V3 output. TQ3 dequant artifact, or draft-KV mask off-by-one? Worth retesting on the current binary.
  • TQ3_0 ring-wrap path forces F32 concat — loses FWHT benefit on long-ctx MTP. Workaround only. Structural fix (rotate-once-at-cache-write) is the obvious next move.
  • FATTN_KQ_STRIDE=256 is private — if upstream changes it, our 256-pad assumption silently breaks.
  • Per-layer FA masks not implemented — asserted-equal across need-mask layers; long-ctx SWA-cap mismatch would trip it.
  • P1: No purpose-trained MoE MTP drafter exists — the MTP assistant was built and validated against the Dense-31B target; the donor-remapped load (discovery #17, accept 0.81 at 4K) runs against MoE, but MoE speculative decode still leans primarily on DFlash.
  • P1 (γ>1): γ=8 underperforms γ=4 at every context — drafter's autoregressive feedback drifts too far from target's hidden distribution past chain depth 4. Either a retrained MTP head with deeper chain awareness, or tree-mask drafting (llama.cpp PR #22838), would push the sweet spot higher.
  • P2 (γ>1): Stochastic γ>1 sampling is currently fatal — requires Leviathan/SpecInfer-style importance rescaling. Greedy-only for now.

Acknowledgements

Credits — built on the shoulders of others

Models

  • Google DeepMind — Gemma 4 — the underlying language models (Gemma-4-26B-A4B-it MoE and Gemma-4-31B-it Dense); released 2026-04-02.
  • Google — Gemma 4 MTP drafters — official 0.5B speculative drafters; released 2026-05-06. We run them alongside the DFlash drafter via our forked stack.

Inference & Frameworks

  • ggml-org / llama.cpp — the foundation. ggml CUDA backend, GGUF format, FlashAttention kernels. Everything stands on this.
  • z-lab / dflash — the DFlash speculative-decoding block-diffusion drafter architecture that our drafters extend.
  • ggml team — TQ3_0 / TurboQuant — 3.5-bit FWHT-rotated KV quantization. We extended TQ3_0 support for Gemma 4's SWA + full-attention hybrid layout.

Maintainers & Community

  • howard0su — Luce-Org/lucebox-hub maintainer; PR #131 reviewer; SWA truncation port (PR #129) we adapted to Gemma 4.
  • Luce-Org — upstream framework: daemon, server, prefix cache, scientific harness.
  • All llama.cpp contributors past and present — the 20 fixes we made would not have been possible without their infrastructure.

Reviewers & Tools

  • cubic-dev-ai bot — automated P1/P2 review feedback that caught real defensive bugs across the PR stack.

Inspiration

  • Speculative decoding — Leviathan et al., 2023. The theoretical backbone of all drafter experiments.
  • FlashAttention — Dao et al., 2022. The IO-aware attention algorithm that makes pFlash viable at scale.