CoDA-GQA-L
Preprint · Constrained Orthogonal Differential Attention with Grouped-Query Value-Routed Landmark Banks
Anthony Maio · February 2026
Abstract
CoDA-GQA-L compresses the KV cache from O(n) to a fixed budget of W + Me + Ms slots per layer, independent of sequence length, while retaining selective long-range context through dual memory banks (W sliding-window slots plus Me exact and Ms summary landmark slots). Applied to Mistral-7B-v0.3, the system achieves a bounded perplexity of 5.94 on WikiText-2 at 2,048 context with a fixed 218 KB per-layer cache, compared to >2 MB for the baseline (+23.5% PPL overhead, 9.5× memory reduction). A two-phase training protocol first teaches differential attention with the full KV cache (2,000 steps), then adapts the model to bounded memory (600 steps). A 2×2 factorial ablation shows both methods reach 5.75 PPL unbounded, but Standard GQA loses +1.09 PPL when bounded while CoDA loses only +0.19, a 5.7× reduction in bounded penalty.
Architecture
Bounded KV Buffer — Fixed Size Per Layer
CoDA: Orthogonal Rotation
Produces the inhibitory query via learnable Givens rotation of the signal query. Saves D×D parameters per head (~16.7M for Mistral-7B) while preserving noise cancellation. Initialized near-identity for transparent warm-start.
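A minimal sketch of the idea, assuming Mistral's head_dim D = 128: instead of a second D×D inhibitory query projection, the inhibitory query is produced by rotating the signal query with D/2 learnable Givens angles per head. The function name and shapes below are illustrative, not the released implementation. As a consistency check, 128×128 params × 32 heads × 32 layers ≈ 16.7M, matching the stated saving.

```python
import torch

def givens_rotate(q: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Apply independent 2x2 Givens rotations to paired coordinates
    of the last dim of q.

    q:     (..., D) signal query, D even
    theta: (D // 2,) learnable angles; initialized near zero so the
           rotation starts near-identity (transparent warm start)
    """
    q1, q2 = q[..., 0::2], q[..., 1::2]      # paired coordinates
    cos, sin = torch.cos(theta), torch.sin(theta)
    r1 = q1 * cos - q2 * sin                  # rotate each pair by theta_i
    r2 = q1 * sin + q2 * cos
    # re-interleave pairs back into their original positions
    return torch.stack((r1, r2), dim=-1).flatten(-2)

# near-identity init: zero angles make the inhibitory query a copy
# of the signal query at the start of training
theta = torch.zeros(64, requires_grad=True)   # D = 128 -> 64 angles
q = torch.randn(2, 16, 128)                   # (batch, seq, head_dim)
q_inhib = givens_rotate(q, theta)
```

Because Givens rotations are orthogonal, the inhibitory query keeps the signal query's norm, which is what preserves the noise-cancellation behavior of differential attention.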
Value-Routed Matching
Routes memory updates on values (RoPE-free) instead of keys (RoPE-contaminated). Preserves cos = 1 for identical inputs regardless of position, enabling reliable deduplication and prototype formation.
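A hedged sketch of why routing on values works: keys carry RoPE, so the same token at two positions produces different key vectors, while values are position-free and match exactly. The function below is an illustrative cosine-similarity router, not the fused kernel; `tau` and the slot-merge policy are assumptions.

```python
import torch
import torch.nn.functional as F

def route_to_bank(values: torch.Tensor, bank: torch.Tensor, tau: float = 0.9):
    """Match incoming value vectors against bank prototypes by cosine
    similarity. Identical content yields cos = 1 regardless of position,
    because values never receive RoPE.

    values: (N, D) incoming value vectors
    bank:   (M, D) stored prototypes
    returns (best_idx, best_sim, is_duplicate): nearest slot per value,
            its similarity, and whether it clears the merge threshold
            (merge into the slot) vs. allocating a new one.
    """
    sims = F.normalize(values, dim=-1) @ F.normalize(bank, dim=-1).T  # (N, M)
    best_sim, best_idx = sims.max(dim=-1)
    is_duplicate = best_sim >= tau
    return best_idx, best_sim, is_duplicate
```

Running the same routing on RoPE-rotated keys would give position-dependent similarities for identical inputs, defeating deduplication and prototype formation.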
Fused Triton Kernels
Two custom kernels: fused differential FlashAttention (both streams in one HBM pass) and fused bank routing (replaces ~15 PyTorch launches). Verified on H200 with Triton 3.4.0.
Differential Attention Synergy: 2×2 Factorial
Both methods achieve identical unbounded PPL, confirming zero overhead from differential attention. The benefit is specific to bounded memory—a genuine synergy.
| Method | Unbounded PPL | Bounded PPL | Penalty |
|---|---|---|---|
| Standard GQA | 5.75 | 6.84 | +1.09 |
| CoDA (diff attn) | 5.75 | 5.94 | +0.19 |
| Interaction effect | | | +0.90 |
| Penalty reduction factor | | | 5.7× |
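The interaction and reduction rows follow directly from the two bounded penalties:

```python
# Penalties = bounded PPL minus the shared 5.75 unbounded baseline
gqa_penalty = 6.84 - 5.75    # +1.09 for Standard GQA
coda_penalty = 5.94 - 5.75   # +0.19 for CoDA
interaction = gqa_penalty - coda_penalty   # +0.90 interaction effect
reduction = gqa_penalty / coda_penalty     # ~5.7x penalty reduction
print(round(interaction, 2), round(reduction, 1))
```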
Bounded Penalty Comparison
Bar width proportional to bounded penalty. Both start from identical 5.75 unbounded baseline.
Results on Mistral-7B
Perplexity Across Configurations
WikiText-2, bf16, 2,048 context
| Config | PPL | vs Base | Cache/Layer |
|---|---|---|---|
| Mistral-7B baseline | 4.81 | — | O(L) |
| CoDA unbounded | 5.38 | +11.9% | O(L) |
| CoDA bounded, medium | 5.94 | +23.5% | 217.9 KB |
| CoDA bounded, large | 6.22 | +29.3% | 3.0 MB |
| CoDA bounded, tiny | 6.31 | +31.2% | 108.9 KB |
| Window-only (no banks) | 6.22 | +29.3% | 129.2 KB |
Context-Length Scaling
Bounded medium-cache, trained at 8K
Perplexity is remarkably flat from 1K to 4K context (5.94–5.95). Figure: bars show inverse PPL, so longer is better.
Compression at Scale
| Scenario | Standard KV | CoDA State | Compression |
|---|---|---|---|
| 7B, 2K ctx | 512 MB | 48 MB | 10.7× |
| 7B, 32K ctx | 8 GB | 48 MB | 170× |
| 7B, 128K ctx | 32 GB | 48 MB | 682× |
| 70B, 128K ctx | 160 GB | 120 MB | 1,365× |
At 70B/128K, bounded state saves ~160 GB—the difference between a multi-GPU cluster and a single consumer accelerator.
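The standard-KV column can be reproduced with a simple size formula. The shape below (32 layers, 8 KV heads, head_dim 128) matches Mistral-7B; the 4-byte-per-element assumption is mine, chosen because it makes the arithmetic land on the table's figures, and differs from the bf16 used in evaluation.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 4) -> int:
    """Size of a standard (unbounded) KV cache: one K and one V tensor
    per layer, growing linearly with sequence length."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

MB = 1024 ** 2
print(kv_cache_bytes(32, 8, 128, 2_048) / MB)    # 512.0 MB at 2K context
print(kv_cache_bytes(32, 8, 128, 131_072) / MB)  # 32768.0 MB = 32 GB at 128K
```

The bounded CoDA state, by contrast, is a constant in `seq_len`, which is why the compression ratio grows linearly with context length.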
Two-Phase Training Protocol
Unbounded Training
Teach differential attention with full KV cache.
Bounded Adaptation
Adapt to fixed-size KV cache with memory banks.
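The schedule can be sketched as a plain two-phase loop. `step_fn` and `set_bounded` are placeholders for the real trainer, and the step counts come from the abstract (2,000 unbounded, 600 bounded); everything else here is illustrative.

```python
def train_two_phase(step_fn, set_bounded,
                    unbounded_steps: int = 2_000, bounded_steps: int = 600):
    """Two-phase protocol: step_fn() runs one optimization step;
    set_bounded(flag) toggles the fixed-size KV cache."""
    # Phase 1: differential attention learns with the full KV cache,
    # so gradients are not confounded by eviction noise.
    set_bounded(False)
    for _ in range(unbounded_steps):
        step_fn()
    # Phase 2: switch on the fixed W + Me + Ms cache and adapt the
    # already-trained attention to eviction and bank routing.
    set_bounded(True)
    for _ in range(bounded_steps):
        step_fn()
```

The ordering is the point of the protocol: the differential-attention weights are learned before the model ever sees a truncated cache, which the 2×2 ablation suggests is what keeps the bounded penalty at +0.19 instead of +1.09.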
Needle-in-Haystack Retention
100% retention at all tested lengths. The exact landmark bank preserves needle tokens with near-perfect fidelity, even 16K tokens after injection with only 32 bank slots.
Quick Start
pip install coda-gqa-l
# Swap Mistral-7B attention layers
from coda_gqa_l import LlamaCoDAAdapter
adapters = LlamaCoDAAdapter.swap_llama_layers(
model, bounded=True,
window=256, num_landmarks_exact=64,
num_landmarks_summary=64,
)
# Load trained weights
import torch
state = torch.load("coda_adapters.pt", weights_only=True)
for i, adapter in enumerate(adapters):
adapter.load_state_dict(state[f"layer_{i}"])