| # | Finding | Detail | Evidence |
|---|---------|--------|----------|
| 1 | SWA ring geometry | The mask must map the absolute query position to a ring slot via (qpos − abs_win_start) % ring_size; without this, chunks 2+ silently read stale KV (sketch below the table). | gemma4_target_graph.cpp:205-222, commit 5b6ba1b |
| 2 | TQ3_0 KV alignment | TQ3_0 + CUDA FA needs ne[1] % 256 == 0. FATTN_KQ_STRIDE=256 is hardcoded in the CUDA kernels; no public ggml API exposes it (sketch below the table). | gemma4_target_graph.cpp:99-101, gemma4_target_graph.cpp:350-355, commits c56879c, 41e4848 |
| 3 | Non-monotonic ring restoration | Disabling the ring optimisation cost 50%+ extra VRAM on SWA layers; re-enabled once the mask geometry was correct. | commits 19def9c, d68e7c4 |
| 4 | Narrow asymmetric KV | The pFlash block-sparse path does not support TQ3_0, so captured full-attention layers are forced to Q8; TQ3 stays on the SWA layers. | internal.h:593-597, commit ce4da35 |
| 5 | TQ3_0 type-tag preservation through FA | Casting K/V to F32/F16 before FA breaks the kernel's TQ3 → FWHT routing. Pass the tensors in their native type; the CHUNKED kernel handles FWHT internally (sketch below the table). | gemma4_mtp_graph.cpp:401-414, commit c56879c |
| 6 | GQA block-broadcast for MTP | The MTP's n_head_norm ≠ the target's head_dim_kv; an explicit reshape is required before FA. | gemma4_mtp_graph.cpp:120-122, gemma4_mtp_graph.cpp:225-336, commit 30b2b50 |
| 7 | MTP h_prev capture | Capture the last-committed token's post-block hidden state from the last full-attention layer. KQ scale = 1.0 (not 1/√d). Use the assistant's own rope_freqs.weight. | gemma4_mtp_graph.cpp:266-285, commit 138de4d |
| 8 | FA mask mandatory for head_dim ≥ 512 | The CUDA MMA dispatcher requires gqa_opt_applies ⇒ K->ne[1] % 256 == 0 AND mask != nullptr. Provide an all-zero mask when all positions are logically admitted (sketch below the table). | gemma4_mtp_graph.cpp:468-481, commit f1f811e |
| 9 | n_tokens=1 decode + SWA | Single-token decode also needs an allocated and filled SWA mask sized to ring_size. | commit 7b62c07 |
| 10 | Shared FA mask across layers | A single mask buffer is reused across all mask-needing layers (256-alignment or head_dim ≥ 512). | internal.h:847-852 |
| 11 | DFlash decode correctness | BOS/EOS handling plus a per-layer SWA mask on the draft (4 SWA + 1 full). Prefix-direct KV semantics. | commits 1386690, 9588c97 |
| 12 | TQ3 K dequant intercept | The MMA path silently strips the FWHT rotation when reading the TQ3_0 K-cache; chunked-path dispatch is forced for TQ3 K to avoid the intercept. | submodule daef232a6 (force-chunked dispatcher; drops the broken MMA intercept) + outer-repo 4b0c158 (graph-level FWHT rotation via a ggml_cont wrap) |
| 13 | Drafter distribution coverage | The dense DFlash drafter generalises out of distribution; the MoE drafter is code-distribution-bound. AL plateaus: dense·code 4.20 vs dense·creative 5.12 (uniform); MoE·code 5.22 vs MoE·creative 2.49 (collapse). Discovered empirically via the 24-cell scientific sweep. | empirical: scientific/results.csv, gemma4-journey.md §11 |
| 14 | TQ3 long-context unlocked | Post-daef232a6, MoE TQ3 sustains a flat 5.6–6.6 tok/s from 4K to 1M with sub-linear VRAM growth (~0.34 GiB per 64K ctx); 1M context on 24 GB confirmed. Update: with pflash on (the current ship config) the band shifts up to 19.8–20.1 tok/s on code prompts (19.12 on prose) across 64K–1M; see Headline. | submodule daef232a6, logs: tq3-frontier/ |
| 15 | DFlash 1M run-to-run variance | The same CLI, seed, and prompt yields tok/s ∈ [20, 30] on MoE 1M cold, and can degrade to 4.86 after ~19 prior cells in the same shell. WSL2/CUDA allocator state matters; sweep-harness numbers at saturated-VRAM contexts need a cold-start protocol. | mtp-gamma/triple-falsify/trial_{1,2,3}.log |
| 16 | γ>1 MTP chain (first public implementation) | Approach B (a multi-row mtp_h_prev_batch) eliminates per-chain re-capture overhead and mirrors the Google HF reference (post-final-RMSNorm capture, Q-only cross-attention, weight-tied lm_head). γ=2 yields accept rates of 0.55–0.78 at long context on Dense and 0.57 on MoE. | commits d8ebd12 (A), 4bcb972 (B), gemma4_target_graph.cpp:1106-1123 |
| 17 | MoE MTP assistant exists | AtomicChat/gemma-4-26B-A4B-it-assistant-GGUF on Hugging Face loads cleanly against the MoE 30-layer target; the loader auto-remaps donor layer indices written for the Dense 60-layer target. Accept rate 0.81 at 4K on MoE. | models/gemma4-mtp-26b-a4b/ |
| 18 | MoE Q8 ceiling at 512K, not earlier | Q8 + pflash decode is flat at 24–25 tok/s from 64K through 512K, with 0.86 GB headroom at 512K; it pages only at 1M. The earlier framing "TQ3 wins at long context" was wrong below 1M. | mtp-gamma/q8-ceiling/, mtp-gamma/closing/moe_*_q8_pf_*.log |
| 19 | Dense + DFlash above 32K is anti-economical | The DFlash drafter's KV cache pushes Dense 31B over 24 GB at 64K: decode drops to 5–7 tok/s and prefill collapses to 130 tok/s. MTP γ=2, which cross-attends to the target KV with no cache of its own, is the right drafter for Dense ≥ 32K. | mtp-gamma/closing/dense_*.log, mtp-gamma/paired-matrix/dense_mtp_*.log |
| 20 | Q8 + pflash + DFlash dm=4 is the MoE long-ctx winner | At MoE 128K–512K: 60–67 tok/s decode at 6–8 J/tok, 2× faster than the TQ3 + pflash + DFlash stack at the same contexts (32–33 tok/s). The site's previous framing oversold TQ3 below 1M. | mtp-gamma/closing/moe_*_q8_pf_dfl4.log |
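
Sketches for findings 1, 2, 5, and 8 follow. First, the ring-slot mapping from finding 1: a minimal sketch, assuming hypothetical names (ring_slot, fill_swa_mask_row, qpos, abs_win_start, ring_size, window are stand-ins for the real variables in gemma4_target_graph.cpp). The point is that the mask is indexed by ring slot, not absolute position, so every chunk after the first must translate before it writes.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Map an absolute position into its ring slot (finding 1). Names are
// illustrative, not the actual identifiers in gemma4_target_graph.cpp.
static inline int64_t ring_slot(int64_t pos, int64_t abs_win_start, int64_t ring_size) {
    return (pos - abs_win_start) % ring_size;
}

// Fill one query row of the SWA mask: a key position is admitted iff it
// lies inside the sliding window [qpos - window + 1, qpos]. Indexing the
// mask by absolute position instead of ring_slot() is exactly the bug
// that made chunks 2+ read stale KV. The same fill must also run for
// n_tokens=1 decode (finding 9).
static void fill_swa_mask_row(float * mask_row, int64_t qpos,
                              int64_t abs_win_start, int64_t ring_size,
                              int64_t window) {
    for (int64_t slot = 0; slot < ring_size; ++slot) {
        mask_row[slot] = -INFINITY;                      // default: masked out
    }
    const int64_t lo = std::max(abs_win_start, qpos - window + 1);
    for (int64_t kpos = lo; kpos <= qpos; ++kpos) {
        mask_row[ring_slot(kpos, abs_win_start, ring_size)] = 0.0f;  // admitted
    }
}
```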
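The 256-token alignment from finding 2, as a sketch. Since FATTN_KQ_STRIDE is hardcoded inside the CUDA FA kernels and not exported by any public ggml header, the constant has to be mirrored locally; the names below are ours, not ggml's.

```cpp
#include <cstdint>

// FATTN_KQ_STRIDE mirrored locally (assumption: the kernel-side value
// stays 256; there is no public ggml API to query it).
constexpr int64_t kFattnKQStride = 256;

// Round the KV length up so K->ne[1] % 256 == 0 before building the FA
// node. The padded tail must be hidden by -inf mask entries.
constexpr int64_t pad_kv_len(int64_t n_kv) {
    return ((n_kv + kFattnKQStride - 1) / kFattnKQStride) * kFattnKQStride;
}

static_assert(pad_kv_len(1) == 256 && pad_kv_len(256) == 256 && pad_kv_len(257) == 512,
              "KV length rounds up to the next multiple of 256");
```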
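Finding 5, sketched against upstream ggml's ggml_flash_attn_ext signature (ctx, q, k_cache_view, v_cache_view, kq_mask, and kq_scale are placeholder names for graph-build state):

```cpp
// Casting the TQ3_0 cache to F16 erases the type tag the kernel uses to
// route into its FWHT-aware CHUNKED path:
//
//   k = ggml_cast(ctx, k_cache_view, GGML_TYPE_F16);   // WRONG: tag lost
//
// Passing the cache views untouched keeps the tag, and the CHUNKED
// kernel applies the FWHT rotation internally:
ggml_tensor * out = ggml_flash_attn_ext(ctx, q,
        k_cache_view, v_cache_view,        // native TQ3_0, no cast
        kq_mask, kq_scale,
        /*max_bias=*/0.0f, /*logit_softcap=*/0.0f);
```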
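And the all-zero mask from finding 8: the MMA dispatcher's fast path needs both the 256-alignment and a non-null mask, so a logically unmasked FA call still allocates one. A sketch, assuming upstream ggml's GGML_PAD / GGML_KQ_MASK_PAD macros and the placeholder names from the sketches above:

```cpp
// Zeros add nothing to any logit, so an all-zero F16 mask admits every
// position while still satisfying mask != nullptr.
ggml_tensor * kq_mask = ggml_new_tensor_2d(ctx, GGML_TYPE_F16,
        pad_kv_len(n_kv),                          // multiple of 256 (finding 2)
        GGML_PAD(n_tokens, GGML_KQ_MASK_PAD));
// ... the buffer is zero-filled when graph inputs are set (not shown).
ggml_tensor * out = ggml_flash_attn_ext(ctx, q, k, v, kq_mask,
        kq_scale, /*max_bias=*/0.0f, /*logit_softcap=*/0.0f);
```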