
CoDA-GQA-L

preprint

Constrained Orthogonal Differential Attention with Grouped-Query Value-Routed Landmark Banks

Anthony Maio · February 2026

Bounded PPL: 5.94 (+23.5% vs baseline)
KV Compression: 9.5× (218 KB/layer)
NIAH Retention: 100% (up to 16K tokens)
Penalty Reduction: 5.7× (CoDA vs GQA, bounded)

Abstract

CoDA-GQA-L compresses the KV cache from O(n) to a fixed budget of W+Me+Ms slots per layer—independent of sequence length—while retaining selective long-range context through dual memory banks. Applied to Mistral-7B-v0.3, the system achieves bounded perplexity of 5.94 on WikiText-2 at 2,048 context with a fixed 218 KB per-layer cache, compared to >2 MB for the baseline (+23.5% PPL overhead, 9.5× memory reduction). A two-phase training protocol first teaches differential attention with full context (2,000 steps), then adapts the model to bounded memory (600 steps). A 2×2 factorial ablation shows both methods achieve 5.75 PPL unbounded, but GQA loses +1.09 PPL going bounded while CoDA loses only +0.19—a 5.7× reduction in bounded penalty.

Architecture

Bounded KV Buffer — Fixed Size Per Layer

Recent Window: W = 256 slots (FIFO ring buffer)
Exact Bank: Me = 64 (novelty-filtered LRU)
Summary Bank: Ms = 64 (EMA prototypes)

Total: 384 slots = 218 KB/layer (bf16) — constant regardless of sequence length
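The fixed budget can be sketched in a few lines. This is an illustrative simplification, not the library's API: class and method names are hypothetical, and the real router applies novelty filtering and EMA merging rather than blindly promoting evicted tokens.

```python
from collections import deque

class BoundedKVBuffer:
    """Illustrative fixed-budget cache: at most W + Me + Ms slots per layer."""
    def __init__(self, window=256, me=64, ms=64):
        self.window = deque(maxlen=window)   # FIFO ring buffer of recent tokens
        self.exact = deque(maxlen=me)        # landmark bank (real system: novelty-filtered LRU)
        self.summary = deque(maxlen=ms)      # prototype bank (real system: EMA prototypes)

    def append(self, kv):
        if len(self.window) == self.window.maxlen:
            # Oldest window entry is about to be overwritten; a real router
            # would choose between exact storage, summary merging, or dropping.
            self.exact.append(self.window[0])
        self.window.append(kv)

    def total_slots(self):
        return len(self.window) + len(self.exact) + len(self.summary)

buf = BoundedKVBuffer()
for t in range(10_000):       # stream far past the budget
    buf.append(t)
print(buf.total_slots())      # never exceeds 256 + 64 + 64 = 384
```

However long the stream, occupancy stays capped at the 384-slot budget, which is what makes the per-layer cache constant in sequence length.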

CoDA: Orthogonal Rotation

Produces the inhibitory query via learnable Givens rotation of the signal query. Saves D×D parameters per head (~16.7M for Mistral-7B) while preserving noise cancellation. Initialized near-identity for transparent warm-start.
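The idea can be sketched as follows; the disjoint-pair pairing scheme and the init scale are assumptions for illustration. D/2 learnable angles stand in for a full D×D inhibitory projection, and because Givens rotations are orthogonal, the inhibitory query preserves the signal query's norm.

```python
import numpy as np

def givens_rotate(q, theta):
    """Rotate disjoint coordinate pairs of q by angles theta (length d/2).
    Needs only d/2 parameters instead of a full d x d projection."""
    c, s = np.cos(theta), np.sin(theta)
    q1, q2 = q[0::2], q[1::2]
    out = np.empty_like(q)
    out[0::2] = c * q1 - s * q2
    out[1::2] = s * q1 + c * q2
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=128)              # signal query (head_dim = 128)
theta = 1e-3 * rng.normal(size=64)    # near-identity init: transparent warm-start
q_inhib = givens_rotate(q, theta)     # inhibitory query

# Rotations are orthogonal, so the inhibitory query keeps the signal norm.
print(np.allclose(np.linalg.norm(q_inhib), np.linalg.norm(q)))  # True
```

With angles near zero the rotation is near-identity, so at the start of training the inhibitory stream almost mirrors the signal stream, which is what makes the warm-start transparent.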


Value-Routed Matching

Routes memory updates on values (RoPE-free) instead of keys (RoPE-contaminated). Preserves cos = 1 for identical inputs regardless of position, enabling reliable deduplication and prototype formation.
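A small numeric sketch of why routing on values works (the minimal RoPE implementation below is a textbook simplification, not the model's exact rotary layer): RoPE-rotated keys for the same token content diverge across positions, while the RoPE-free values match exactly.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Minimal rotary embedding: rotate feature pairs by position-dependent angles."""
    half = x.shape[-1] // 2
    ang = pos * base ** (-np.arange(half) / half)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)])

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
token = rng.normal(size=64)          # same token content at two positions

# Keys carry RoPE: identical content no longer matches across positions.
k_a, k_b = rope(token, pos=10), rope(token, pos=500)
print(cos_sim(k_a, k_b) < 1.0)       # True: key similarity is position-contaminated

# Values are RoPE-free: identical content matches exactly, anywhere.
print(round(cos_sim(token, token), 6))   # 1.0
```

Exact cos = 1 matches are what let the router deduplicate repeated content and fold it into stable prototypes.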


Fused Triton Kernels

Two custom kernels: fused differential FlashAttention (both streams in one HBM pass) and fused bank routing (replaces ~15 PyTorch launches). Verified on H200 with Triton 3.4.0.
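As a plain unfused reference for what the fused kernel computes, differential attention subtracts a weighted inhibitory stream from the signal stream. The fixed scalar lambda here is a simplification; treat the exact lambda parameterization as an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def diff_attention(q1, q2, k, v, lam=0.5):
    """Unfused reference: signal stream minus a lam-weighted inhibitory stream.
    The fused Triton kernel computes both streams in a single HBM pass."""
    scale = 1.0 / np.sqrt(k.shape[-1])
    a1 = softmax(q1 @ k.T * scale)    # signal attention
    a2 = softmax(q2 @ k.T * scale)    # inhibitory (noise) attention
    return (a1 - lam * a2) @ v

rng = np.random.default_rng(0)
q1, q2 = rng.normal(size=(4, 64)), rng.normal(size=(4, 64))
k, v = rng.normal(size=(16, 64)), rng.normal(size=(16, 64))
print(diff_attention(q1, q2, k, v).shape)   # (4, 64)
```

A sanity check on the formula: with identical signal and inhibitory queries and lam = 1, the two streams cancel exactly and the output is zero.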

Differential Attention Synergy: 2×2 Factorial

Both methods achieve identical unbounded PPL, confirming zero overhead from differential attention. The benefit is specific to bounded memory—a genuine synergy.

Method Unbounded Bounded Penalty
Standard GQA 5.75 6.84 +1.09
CoDA (diff attn) 5.75 5.94 +0.19
Interaction effect +0.90
Penalty reduction factor 5.7×
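The derived rows follow directly from the raw PPL entries; a quick arithmetic check:

```python
# Reproducing the table's derived quantities from its raw PPL entries
gqa_pen  = 6.84 - 5.75   # Standard GQA bounded penalty (+1.09)
coda_pen = 5.94 - 5.75   # CoDA bounded penalty (+0.19)
print(round(gqa_pen - coda_pen, 2))   # interaction effect: 0.9
print(round(gqa_pen / coda_pen, 1))   # penalty reduction factor: 5.7
```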

Bounded Penalty Comparison

Standard GQA: +1.09 PPL (bounded PPL 6.84)
CoDA (differential attn): +0.19 PPL (bounded PPL 5.94)

Both start from the identical 5.75 unbounded baseline.

Results on Mistral-7B

Perplexity Across Configurations

WikiText-2, bf16, 2,048 context

Config PPL vs Base Cache/Layer
Mistral-7B baseline 4.81 — O(L)
CoDA unbounded 5.38 +11.9% O(L)
CoDA bounded, medium 5.94 +23.5% 217.9 KB
CoDA bounded, large 6.22 +29.3% 3.0 MB
CoDA bounded, tiny 6.31 +31.2% 108.9 KB
Window-only (no banks) 6.22 +29.3% 129.2 KB

Context-Length Scaling

Bounded medium-cache, trained at 8K

Context PPL
512     6.36
1K      6.09
2K      5.94
4K      5.95
8K      6.87

Remarkably flat between 1K and 4K (5.94–5.95).

Compression at Scale

Scenario Standard KV CoDA State Compression
7B, 2K ctx 512 MB 48 MB 10.7×
7B, 32K ctx 8 GB 48 MB 170×
7B, 128K ctx 32 GB 48 MB 682×
70B, 128K ctx 160 GB 120 MB 1,365×

At 70B/128K, bounded state saves ~160 GB—the difference between a multi-GPU cluster and a single consumer accelerator.
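Because the bounded state is constant, the compression ratio grows linearly with context length. The table's 7B rows can be reproduced from the 2K baseline (small differences are rounding):

```python
# Bounded state is constant, so compression grows linearly with context.
FIXED_MB = 48                    # CoDA bounded state, 7B model (from the table)
BASE_MB, BASE_CTX = 512, 2048    # standard KV cache at 2K context

for ctx in (2048, 32768, 131072):
    standard_mb = BASE_MB * ctx / BASE_CTX
    print(f"{ctx:>6} ctx: {standard_mb / FIXED_MB:.1f}x")
```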

Two-Phase Training Protocol

Phase 1: Unbounded Training

Teach differential attention with full KV cache.

Steps: 2,000 · PPL: 23.50 → 5.75 · Throughput: ~4,950 tok/s · Peak VRAM: 34.2 GB

Phase 2: Bounded Adaptation

Adapt to fixed-size KV cache with memory banks.

Steps: 600 · PPL: 27.88 → 6.31 · Throughput: ~2,000 tok/s · Peak VRAM: 84.3 GB
Why two phases? Direct bounded evaluation of a pretrained model produces PPL 2,464 (cold-swap catastrophe). The untrained memory banks provide no compensation for the ~87.5% context loss. Phase 1 teaches signal/noise decomposition; Phase 2 teaches memory management.
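The protocol reduces to a simple schedule in which Phase 2 flips the bounded flag; the field names below are hypothetical, for illustration only.

```python
# Hypothetical schedule mirroring the two-phase protocol: the bounded flag
# is what Phase 2 flips so the model learns to rely on the memory banks.
PHASES = [
    {"name": "unbounded", "steps": 2000, "bounded": False},  # learn signal/noise split
    {"name": "bounded",   "steps": 600,  "bounded": True},   # learn memory management
]
total = sum(p["steps"] for p in PHASES)
print(total)  # 2600 steps end to end
```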

Needle-in-Haystack Retention

✓ 256 tokens: cos ≥ 0.9999
✓ 1K tokens: cos ≥ 0.9999
✓ 4K tokens: cos ≥ 0.9999
✓ 16K tokens: cos ≥ 0.9999

100% retention at all tested lengths. The exact landmark bank preserves needle tokens with near-perfect fidelity, even 16K tokens after injection with only 32 bank slots.

Quick Start

pip install coda-gqa-l

# Load the base model, then swap its attention layers for CoDA
from transformers import AutoModelForCausalLM
from coda_gqa_l import LlamaCoDAAdapter

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3")

adapters = LlamaCoDAAdapter.swap_llama_layers(
    model, bounded=True,
    window=256, num_landmarks_exact=64,
    num_landmarks_summary=64,
)

# Load trained weights
import torch
state = torch.load("coda_adapters.pt", weights_only=True)
for i, adapter in enumerate(adapters):
    adapter.load_state_dict(state[f"layer_{i}"])