
CoDA-GQA-L

preprint

Constrained Orthogonal Differential Attention with Grouped-Query Value-Routed Landmark Banks

Anthony Maio · February 2026

Bounded PPL: 5.94 (+23.5% vs baseline)
KV Compression: 9.5× (218 KB/layer)
NIAH Retention: 100% (up to 16K tokens)
Penalty Reduction: 5.7× (CoDA vs GQA, bounded)

Abstract

CoDA-GQA-L compresses the KV cache from O(n) to a fixed budget of W+Me+Ms slots per layer—independent of sequence length—while retaining selective long-range context through dual memory banks. Applied to Mistral-7B-v0.3, the system achieves bounded perplexity of 5.94 on WikiText-2 at 2,048 context with a fixed 218 KB per-layer cache, compared to >2 MB for the baseline (+23.5% PPL overhead, 9.5× memory reduction). A two-phase training protocol first teaches differential attention with full context (2,000 steps), then adapts the model to bounded memory (600 steps). A 2×2 factorial ablation shows both methods achieve 5.75 PPL unbounded, but GQA loses +1.09 PPL going bounded while CoDA loses only +0.19—a 5.7× reduction in bounded penalty.

Architecture

Bounded KV Buffer — Fixed Size Per Layer

Recent Window: W = 256 slots (FIFO ring buffer)
Exact Bank: Me = 64 (novelty-filtered LRU)
Summary Bank: Ms = 64 (EMA prototypes)

Total: 384 slots = 218 KB/layer (bf16) — constant regardless of sequence length
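The fixed budget can be sketched in a few lines. This is an illustrative simplification, not the library's API: class and method names are hypothetical, and the real router applies novelty filtering and EMA merging rather than blindly promoting evicted tokens.

```python
from collections import deque

class BoundedKVBuffer:
    """Illustrative fixed-budget cache: at most W + Me + Ms slots per layer."""
    def __init__(self, window=256, me=64, ms=64):
        self.window = deque(maxlen=window)   # FIFO ring buffer of recent tokens
        self.exact = deque(maxlen=me)        # landmark bank (real system: novelty-filtered LRU)
        self.summary = deque(maxlen=ms)      # prototype bank (real system: EMA prototypes)

    def append(self, kv):
        if len(self.window) == self.window.maxlen:
            # Oldest window entry is about to be overwritten; a real router
            # would choose between exact storage, summary merging, or dropping.
            self.exact.append(self.window[0])
        self.window.append(kv)

    def total_slots(self):
        return len(self.window) + len(self.exact) + len(self.summary)

buf = BoundedKVBuffer()
for t in range(10_000):       # stream far past the budget
    buf.append(t)
print(buf.total_slots())      # never exceeds 256 + 64 + 64 = 384
```

However long the stream, occupancy stays capped at the 384-slot budget, which is what makes the per-layer cache constant in sequence length.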

CoDA: Orthogonal Rotation

Produces the inhibitory query via learnable Givens rotation of the signal query. Saves D×D parameters per head (~16.7M for Mistral-7B) while preserving noise cancellation. Initialized near-identity for transparent warm-start.
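The idea can be sketched as follows; the disjoint-pair pairing scheme and the init scale are assumptions for illustration. D/2 learnable angles stand in for a full D×D inhibitory projection, and because Givens rotations are orthogonal, the inhibitory query preserves the signal query's norm.

```python
import numpy as np

def givens_rotate(q, theta):
    """Rotate disjoint coordinate pairs of q by angles theta (length d/2).
    Needs only d/2 parameters instead of a full d x d projection."""
    c, s = np.cos(theta), np.sin(theta)
    q1, q2 = q[0::2], q[1::2]
    out = np.empty_like(q)
    out[0::2] = c * q1 - s * q2
    out[1::2] = s * q1 + c * q2
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=128)              # signal query (head_dim = 128)
theta = 1e-3 * rng.normal(size=64)    # near-identity init: transparent warm-start
q_inhib = givens_rotate(q, theta)     # inhibitory query

# Rotations are orthogonal, so the inhibitory query keeps the signal norm.
print(np.allclose(np.linalg.norm(q_inhib), np.linalg.norm(q)))  # True
```

With angles near zero the rotation is near-identity, so at the start of training the inhibitory stream almost mirrors the signal stream, which is what makes the warm-start transparent.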


Value-Routed Matching

Routes memory updates on values (RoPE-free) instead of keys (RoPE-contaminated). Preserves cos = 1 for identical inputs regardless of position, enabling reliable deduplication and prototype formation.
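A small numeric sketch of why routing on values works (the minimal RoPE implementation below is a textbook simplification, not the model's exact rotary layer): RoPE-rotated keys for the same token content diverge across positions, while the RoPE-free values match exactly.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Minimal rotary embedding: rotate feature pairs by position-dependent angles."""
    half = x.shape[-1] // 2
    ang = pos * base ** (-np.arange(half) / half)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)])

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
token = rng.normal(size=64)          # same token content at two positions

# Keys carry RoPE: identical content no longer matches across positions.
k_a, k_b = rope(token, pos=10), rope(token, pos=500)
print(cos_sim(k_a, k_b) < 1.0)       # True: key similarity is position-contaminated

# Values are RoPE-free: identical content matches exactly, anywhere.
print(round(cos_sim(token, token), 6))   # 1.0
```

Exact cos = 1 matches are what let the router deduplicate repeated content and fold it into stable prototypes.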


Fused Triton Kernels

Two custom kernels: fused differential FlashAttention (both streams in one HBM pass) and fused bank routing (replaces ~15 PyTorch launches). Verified on H200 with Triton 3.4.0.
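As a plain unfused reference for what the fused kernel computes, differential attention subtracts a weighted inhibitory stream from the signal stream. The fixed scalar lambda here is a simplification; treat the exact lambda parameterization as an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def diff_attention(q1, q2, k, v, lam=0.5):
    """Unfused reference: signal stream minus a lam-weighted inhibitory stream.
    The fused Triton kernel computes both streams in a single HBM pass."""
    scale = 1.0 / np.sqrt(k.shape[-1])
    a1 = softmax(q1 @ k.T * scale)    # signal attention
    a2 = softmax(q2 @ k.T * scale)    # inhibitory (noise) attention
    return (a1 - lam * a2) @ v

rng = np.random.default_rng(0)
q1, q2 = rng.normal(size=(4, 64)), rng.normal(size=(4, 64))
k, v = rng.normal(size=(16, 64)), rng.normal(size=(16, 64))
print(diff_attention(q1, q2, k, v).shape)   # (4, 64)
```

A sanity check on the formula: with identical signal and inhibitory queries and lam = 1, the two streams cancel exactly and the output is zero.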

Differential Attention Synergy: 2×2 Factorial

Both methods achieve identical unbounded PPL, confirming zero overhead from differential attention. The benefit is specific to bounded memory—a genuine synergy.

Method Unbounded Bounded Penalty
Standard GQA 5.75 6.84 +1.09
CoDA (diff attn) 5.75 5.94 +0.19
Interaction effect +0.90
Penalty reduction factor 5.7×
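The derived rows follow directly from the raw PPL entries; a quick arithmetic check:

```python
# Reproducing the table's derived quantities from its raw PPL entries
gqa_pen  = 6.84 - 5.75   # Standard GQA bounded penalty (+1.09)
coda_pen = 5.94 - 5.75   # CoDA bounded penalty (+0.19)
print(round(gqa_pen - coda_pen, 2))   # interaction effect: 0.9
print(round(gqa_pen / coda_pen, 1))   # penalty reduction factor: 5.7
```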

Bounded Penalty Comparison

Standard GQA: +1.09 PPL (bounded PPL 6.84)
CoDA (differential attn): +0.19 PPL (bounded PPL 5.94)

Both start from the identical 5.75 unbounded baseline.

Results on Mistral-7B

Perplexity Across Configurations

WikiText-2, bf16, 2,048 context

Config PPL vs Base Cache/Layer
Mistral-7B baseline 4.81 — O(L)
CoDA unbounded 5.38 +11.9% O(L)
CoDA bounded, medium 5.94 +23.5% 217.9 KB
CoDA bounded, large 6.22 +29.3% 3.0 MB
CoDA bounded, tiny 6.31 +31.2% 108.9 KB
Window-only (no banks) 6.22 +29.3% 129.2 KB

Context-Length Scaling

Bounded medium-cache, trained at 8K

Context PPL
512     6.36
1K      6.09
2K      5.94
4K      5.95
8K      6.87

Remarkably flat between 1K and 4K (5.94–5.95).

Compression at Scale

Scenario Standard KV CoDA State Compression
7B, 2K ctx 512 MB 48 MB 10.7×
7B, 32K ctx 8 GB 48 MB 170×
7B, 128K ctx 32 GB 48 MB 682×
70B, 128K ctx 160 GB 120 MB 1,365×

At 70B/128K, bounded state saves ~160 GB—the difference between a multi-GPU cluster and a single consumer accelerator.
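Because the bounded state is constant, the compression ratio grows linearly with context length. The table's 7B rows can be reproduced from the 2K baseline (small differences are rounding):

```python
# Bounded state is constant, so compression grows linearly with context.
FIXED_MB = 48                    # CoDA bounded state, 7B model (from the table)
BASE_MB, BASE_CTX = 512, 2048    # standard KV cache at 2K context

for ctx in (2048, 32768, 131072):
    standard_mb = BASE_MB * ctx / BASE_CTX
    print(f"{ctx:>6} ctx: {standard_mb / FIXED_MB:.1f}x")
```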

Two-Phase Training Protocol

Phase 1: Unbounded Training

Teach differential attention with full KV cache.

Steps: 2,000 · PPL: 23.50 → 5.75 · Throughput: ~4,950 tok/s · Peak VRAM: 34.2 GB

Phase 2: Bounded Adaptation

Adapt to fixed-size KV cache with memory banks.

Steps: 600 · PPL: 27.88 → 6.31 · Throughput: ~2,000 tok/s · Peak VRAM: 84.3 GB
Why two phases? Direct bounded evaluation of a pretrained model produces PPL 2,464 (cold-swap catastrophe). The untrained memory banks provide no compensation for the ~87.5% context loss. Phase 1 teaches signal/noise decomposition; Phase 2 teaches memory management.
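The protocol reduces to a simple schedule in which Phase 2 flips the bounded flag; the field names below are hypothetical, for illustration only.

```python
# Hypothetical schedule mirroring the two-phase protocol: the bounded flag
# is what Phase 2 flips so the model learns to rely on the memory banks.
PHASES = [
    {"name": "unbounded", "steps": 2000, "bounded": False},  # learn signal/noise split
    {"name": "bounded",   "steps": 600,  "bounded": True},   # learn memory management
]
total = sum(p["steps"] for p in PHASES)
print(total)  # 2600 steps end to end
```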

Needle-in-Haystack Retention

✓ 256 tokens: cos ≥ 0.9999
✓ 1K tokens: cos ≥ 0.9999
✓ 4K tokens: cos ≥ 0.9999
✓ 16K tokens: cos ≥ 0.9999

100% retention at all tested lengths. The exact landmark bank preserves needle tokens with near-perfect fidelity, even 16K tokens after injection with only 32 bank slots.

Quick Start

pip install coda-gqa-l

# Load the base model, then swap its attention layers for CoDA
from transformers import AutoModelForCausalLM
from coda_gqa_l import LlamaCoDAAdapter

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3")

adapters = LlamaCoDAAdapter.swap_llama_layers(
    model, bounded=True,
    window=256, num_landmarks_exact=64,
    num_landmarks_summary=64,
)

# Load trained weights
import torch
state = torch.load("coda_adapters.pt", weights_only=True)
for i, adapter in enumerate(adapters):
    adapter.load_state_dict(state[f"layer_{i}"])