
CoDA-GQA-L: Bounded-Memory Differential Attention

preprint

Overview

CoDA-GQA-L is an attention mechanism that compresses the KV cache from O(n) to a fixed budget of W + M_e + M_s slots per layer, independent of sequence length, while retaining selective long-range context through dual memory banks. Applied to Mistral-7B-v0.3, the system achieves a bounded perplexity of 5.94 on WikiText-2 with a fixed 218 KB per-layer cache (+23.5% PPL overhead, 9.5x memory reduction).
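The bound follows directly from the three fixed segment sizes. A minimal sketch of the per-layer budget arithmetic (the segment sizes, head count, and head dimension below are illustrative placeholders, not the paper's actual configuration):

```python
def bounded_kv_bytes(W, M_e, M_s, n_kv_heads, head_dim, dtype_bytes=2):
    """Per-layer KV cache in bytes for a [Recent W | Exact M_e | Summary M_s] buffer.

    Each slot stores one key and one value vector per KV head, so the total
    is constant in sequence length: O(W + M_e + M_s).
    """
    slots = W + M_e + M_s
    per_slot = 2 * n_kv_heads * head_dim * dtype_bytes  # one K and one V vector
    return slots * per_slot

# Illustrative numbers only (not the paper's configuration):
print(bounded_kv_bytes(W=32, M_e=16, M_s=8, n_kv_heads=8, head_dim=128) / 1024, "KiB")
```

Note that sequence length never appears in the computation: the cache size is set entirely by the three segment budgets.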

Key Contributions

  • Constrained Orthogonal Differential Attention (CoDA): Produces the inhibitory query via learnable orthogonal rotation of the signal query, saving D x D parameters compared to a second projection while preserving noise-cancellation properties
  • Bounded dual-bank KV memory: A three-segment buffer [Recent W | Exact M_e | Summary M_s] that provably bounds per-layer cache to O(W+M_e+M_s), independent of sequence length
  • Value-routed semantic matching: Memory bank updates route on values (RoPE-free) rather than keys (RoPE-contaminated), solving the fundamental tension between RoPE-at-write efficiency and position-dependent key similarity
  • Two custom Triton kernels: Fused differential FlashAttention and fused exact-bank routing, each replacing ~15 PyTorch kernel launches with single-pass GPU computation
  • Two-phase training protocol: Phase 1 teaches differential attention with full context; Phase 2 adapts to bounded memory
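To make the first contribution concrete, here is a minimal NumPy sketch of differential attention with a rotated inhibitory query. It assumes the orthogonal rotation is parameterized by the Cayley transform of a skew-symmetric matrix; the function names (`cayley_orthogonal`, `coda_scores`) and the damping weight `lam` are illustrative, not the paper's interface:

```python
import numpy as np

def cayley_orthogonal(A):
    """Map an unconstrained matrix A to an orthogonal matrix via the Cayley transform."""
    I = np.eye(A.shape[0])
    S = A - A.T  # enforce skew-symmetry, so (I + S) is invertible
    return np.linalg.solve(I + S, I - S)  # (I + S)^{-1} (I - S), orthogonal

def coda_scores(q, K, A, lam=0.5):
    """Differential attention scores with an inhibitory query produced by rotation.

    q: (d,) signal query; K: (n, d) keys; A: (d, d) unconstrained rotation parameter.
    Only A's d*d entries are learned -- no second d x d query projection is needed,
    and the rotation preserves the query norm, keeping the noise-cancellation
    branch on the same scale as the signal branch.
    """
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    R = cayley_orthogonal(A)   # orthogonal, so ||q @ R|| == ||q||
    q_inhib = q @ R            # inhibitory query by rotation of the signal query
    d = q.shape[0]
    return softmax(q @ K.T / np.sqrt(d)) - lam * softmax(q_inhib @ K.T / np.sqrt(d))
```

Because each softmax sums to 1, the differential scores sum to 1 - lam, with the inhibitory branch subtracting common-mode attention noise.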

Results on Mistral-7B

| Metric | Value |
| --- | --- |
| Bounded PPL (2K context) | 5.94 (+23.5% vs baseline) |
| KV cache per layer | 218 KB (9.5x reduction) |
| Needle-in-haystack | 100% retention at 16K tokens |
| Context scaling | 5.94 at 2K, 5.95 at 4K (flat) |
| Projected compression at 128K | >1,100x |

Differential Attention Synergy: 2x2 Factorial

A 2x2 factorial ablation with matched training budgets reveals a strong synergy between differential attention and bounded memory:

| Method | Unbounded PPL | Bounded PPL | Bounded Penalty |
| --- | --- | --- | --- |
| Standard GQA | 5.75 | 6.84 | +1.09 |
| CoDA (diff attn) | 5.75 | 5.94 | +0.19 |

Both methods achieve identical unbounded PPL (5.75), confirming that differential attention adds no overhead when the full KV cache is available. Under memory pressure, however, CoDA's bounded penalty is 5.7x smaller (+0.19 vs +1.09). The interaction effect (+0.90 PPL) is larger than either component's individual contribution: a genuine synergy, not an additive improvement.
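The interaction term can be read straight off the 2x2 table: it is the difference between the two bounded penalties.

```python
# 2x2 factorial cells (PPL) from the table above
std_unbounded, std_bounded = 5.75, 6.84
coda_unbounded, coda_bounded = 5.75, 5.94

penalty_std = std_bounded - std_unbounded      # +1.09
penalty_coda = coda_bounded - coda_unbounded   # +0.19
interaction = penalty_std - penalty_coda       # +0.90
print(f"penalties: {penalty_std:+.2f} vs {penalty_coda:+.2f}, interaction: {interaction:+.2f}")
```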

Fused Triton Kernel: 97x Speedup

The exact-bank routing kernel was validated and optimized using Makora's evaluate and expert-generate pipelines. The fused kernel collapses 15 separate CUDA kernel launches (cosine-similarity scoring, novelty classification, and LRU victim selection) into a single Triton kernel in which all routing state lives in SRAM registers. It achieves a confirmed 97x speedup over the PyTorch implementation on H100, so landmark memory-bank updates are no longer the throughput bottleneck, and CoDA-GQA-L's attention is now competitive in performance with standard GQA while maintaining a constant-size KV cache.
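For reference, the routing logic that the fused kernel performs in a single pass can be written in a few lines of NumPy. This is an unfused sketch under stated assumptions: the novelty threshold `tau`, the merge rule, and the function signature are illustrative, not the kernel's actual interface.

```python
import numpy as np

def route_to_exact_bank(v, bank, last_used, step, tau=0.9):
    """Reference (unfused) exact-bank routing on values.

    v: (d,) incoming value vector -- values are RoPE-free, so cosine
       similarity here is position-independent, unlike RoPE-contaminated keys.
    bank: (M, d) stored value slots; last_used: (M,) timestamps for LRU.
    Returns the index of the slot that absorbed v.
    """
    # 1. Cosine-similarity scoring against every bank slot
    sims = (bank @ v) / (np.linalg.norm(bank, axis=1) * np.linalg.norm(v) + 1e-8)
    best = int(np.argmax(sims))
    if sims[best] >= tau:
        # 2. Novelty classification: not novel -> merge into the matching slot
        bank[best] = 0.5 * (bank[best] + v)
        last_used[best] = step
        return best
    # 3. Novel -> LRU victim selection: evict the least-recently-used slot
    victim = int(np.argmin(last_used))
    bank[victim] = v
    last_used[victim] = step
    return victim
```

Each of the three numbered steps corresponds to work that previously required its own batch of PyTorch kernel launches; the Triton version keeps `sims`, `best`, and `victim` in registers across all three.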

Stateful Neural Databases

The bounded state is a fixed-size serializable artifact (48 MB for a 7B model across 32 layers), enabling save/load/query semantics for agentic RAG. For 100 documents at 7B scale, the total state footprint is 4.8 GB, feasible on a single GPU.
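Because every layer's buffer has a fixed shape, save/load reduces to serializing one array per layer. A hypothetical sketch of the round trip (the `.npz` layout, key names, and toy sizes are illustrative, not the system's actual format):

```python
import io
import numpy as np

def save_bounded_state(layer_states, f):
    """Serialize the per-layer bounded KV state to a file-like object.

    layer_states: list of fixed-size (slots, d) arrays, one buffer per layer.
    """
    np.savez(f, **{f"layer_{i}": s for i, s in enumerate(layer_states)})

def load_bounded_state(f, n_layers):
    data = np.load(f)
    return [data[f"layer_{i}"] for i in range(n_layers)]

# Round-trip a toy 2-layer state through an in-memory buffer
states = [np.zeros((8, 4), dtype=np.float16) for _ in range(2)]
buf = io.BytesIO()
save_bounded_state(states, buf)
buf.seek(0)
restored = load_bounded_state(buf, 2)
```

Because the artifact size is constant per document, the footprint of a document store scales linearly and predictably: 100 documents x 48 MB = 4.8 GB.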