
CoDA-GQA-L: Bounded-Memory Differential Attention

preprint

Overview

CoDA-GQA-L is an attention mechanism that compresses the KV cache from O(n) to a fixed budget of W + M_e + M_s slots per layer, independent of sequence length, while retaining selective long-range context through dual memory banks. Applied to Mistral-7B-v0.3, the system achieves a bounded perplexity of 5.94 on WikiText-2 with a fixed 218 KB per-layer cache (+23.5% PPL overhead, 9.5x memory reduction).
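The bound follows directly from the three fixed segment sizes. A minimal sketch of the per-layer budget arithmetic (the segment sizes, head count, and head dimension below are illustrative placeholders, not the paper's actual configuration):

```python
def bounded_kv_bytes(W, M_e, M_s, n_kv_heads, head_dim, dtype_bytes=2):
    """Per-layer KV cache in bytes for a [Recent W | Exact M_e | Summary M_s] buffer.

    Each slot stores one key and one value vector per KV head, so the total
    is constant in sequence length: O(W + M_e + M_s).
    """
    slots = W + M_e + M_s
    per_slot = 2 * n_kv_heads * head_dim * dtype_bytes  # one K and one V vector
    return slots * per_slot

# Illustrative numbers only (not the paper's configuration):
print(bounded_kv_bytes(W=32, M_e=16, M_s=8, n_kv_heads=8, head_dim=128) / 1024, "KiB")
```

Note that sequence length never appears in the computation: the cache size is set entirely by the three segment budgets.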

Key Contributions

  • Constrained Orthogonal Differential Attention (CoDA): Produces the inhibitory query via learnable orthogonal rotation of the signal query, saving D x D parameters compared to a second projection while preserving noise-cancellation properties
  • Bounded dual-bank KV memory: A three-segment buffer [Recent W | Exact M_e | Summary M_s] that provably bounds per-layer cache to O(W+M_e+M_s), independent of sequence length
  • Value-routed semantic matching: Memory bank updates route on values (RoPE-free) rather than keys (RoPE-contaminated), solving the fundamental tension between RoPE-at-write efficiency and position-dependent key similarity
  • Two custom Triton kernels: Fused differential FlashAttention and fused exact-bank routing, each replacing ~15 PyTorch kernel launches with single-pass GPU computation
  • Two-phase training protocol: Phase 1 teaches differential attention with full context; Phase 2 adapts to bounded memory
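To make the first contribution concrete, here is a minimal NumPy sketch of differential attention with a rotated inhibitory query. It assumes the orthogonal rotation is parameterized by the Cayley transform of a skew-symmetric matrix; the function names (`cayley_orthogonal`, `coda_scores`) and the damping weight `lam` are illustrative, not the paper's interface:

```python
import numpy as np

def cayley_orthogonal(A):
    """Map an unconstrained matrix A to an orthogonal matrix via the Cayley transform."""
    I = np.eye(A.shape[0])
    S = A - A.T  # enforce skew-symmetry, so (I + S) is invertible
    return np.linalg.solve(I + S, I - S)  # (I + S)^{-1} (I - S), orthogonal

def coda_scores(q, K, A, lam=0.5):
    """Differential attention scores with an inhibitory query produced by rotation.

    q: (d,) signal query; K: (n, d) keys; A: (d, d) unconstrained rotation parameter.
    Only A's d*d entries are learned -- no second d x d query projection is needed,
    and the rotation preserves the query norm, keeping the noise-cancellation
    branch on the same scale as the signal branch.
    """
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    R = cayley_orthogonal(A)   # orthogonal, so ||q @ R|| == ||q||
    q_inhib = q @ R            # inhibitory query by rotation of the signal query
    d = q.shape[0]
    return softmax(q @ K.T / np.sqrt(d)) - lam * softmax(q_inhib @ K.T / np.sqrt(d))
```

Because each softmax sums to 1, the differential scores sum to 1 - lam, with the inhibitory branch subtracting common-mode attention noise.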

Results on Mistral-7B

| Metric | Value |
| --- | --- |
| Bounded PPL (2K context) | 5.94 (+23.5% vs baseline) |
| KV cache per layer | 218 KB (9.5x reduction) |
| Needle-in-haystack | 100% retention at 16K tokens |
| Context scaling | 5.94 at 2K, 5.95 at 4K (flat) |
| Projected compression at 128K | >1,100x |

Differential Attention Synergy: 2x2 Factorial

A 2x2 factorial ablation with matched training budgets reveals a strong synergy between differential attention and bounded memory:

| Method | Unbounded PPL | Bounded PPL | Bounded Penalty |
| --- | --- | --- | --- |
| Standard GQA | 5.75 | 6.84 | +1.09 |
| CoDA (diff attn) | 5.75 | 5.94 | +0.19 |

Both methods achieve identical unbounded PPL (5.75), confirming that differential attention adds no overhead when the full KV cache is available. Under memory pressure, however, CoDA's bounded penalty is 5.7x smaller (+0.19 vs +1.09). The interaction effect (+0.90 PPL) is larger than either component's individual contribution: a genuine synergy, not an additive improvement.
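The interaction term can be read straight off the 2x2 table: it is the difference between the two bounded penalties.

```python
# 2x2 factorial cells (PPL) from the table above
std_unbounded, std_bounded = 5.75, 6.84
coda_unbounded, coda_bounded = 5.75, 5.94

penalty_std = std_bounded - std_unbounded      # +1.09
penalty_coda = coda_bounded - coda_unbounded   # +0.19
interaction = penalty_std - penalty_coda       # +0.90
print(f"penalties: {penalty_std:+.2f} vs {penalty_coda:+.2f}, interaction: {interaction:+.2f}")
```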

Fused Triton Kernel: 97x Speedup

The exact-bank routing kernel was validated and optimized using Makora's evaluate and expert-generate pipelines. The fused kernel collapses 15 separate CUDA kernel launches (cosine-similarity scoring, novelty classification, and LRU victim selection) into a single Triton kernel in which all routing state lives in SRAM registers. It achieves a confirmed 97x speedup over the PyTorch implementation on H100, so landmark memory-bank updates are no longer the throughput bottleneck, and CoDA-GQA-L's attention is now competitive in performance with standard GQA while maintaining a constant-size KV cache.
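For reference, the routing logic that the fused kernel performs in a single pass can be written in a few lines of NumPy. This is an unfused sketch under stated assumptions: the novelty threshold `tau`, the merge rule, and the function signature are illustrative, not the kernel's actual interface.

```python
import numpy as np

def route_to_exact_bank(v, bank, last_used, step, tau=0.9):
    """Reference (unfused) exact-bank routing on values.

    v: (d,) incoming value vector -- values are RoPE-free, so cosine
       similarity here is position-independent, unlike RoPE-contaminated keys.
    bank: (M, d) stored value slots; last_used: (M,) timestamps for LRU.
    Returns the index of the slot that absorbed v.
    """
    # 1. Cosine-similarity scoring against every bank slot
    sims = (bank @ v) / (np.linalg.norm(bank, axis=1) * np.linalg.norm(v) + 1e-8)
    best = int(np.argmax(sims))
    if sims[best] >= tau:
        # 2. Novelty classification: not novel -> merge into the matching slot
        bank[best] = 0.5 * (bank[best] + v)
        last_used[best] = step
        return best
    # 3. Novel -> LRU victim selection: evict the least-recently-used slot
    victim = int(np.argmin(last_used))
    bank[victim] = v
    last_used[victim] = step
    return victim
```

Each of the three numbered steps corresponds to work that previously required its own batch of PyTorch kernel launches; the Triton version keeps `sims`, `best`, and `victim` in registers across all three.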

Stateful Neural Databases

The bounded state is a fixed-size serializable artifact (48 MB for a 7B model across 32 layers), enabling save/load/query semantics for agentic RAG. For 100 documents at 7B scale, the total state footprint is 4.8 GB, feasible on a single GPU.
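Because every layer's buffer has a fixed shape, save/load reduces to serializing one array per layer. A hypothetical sketch of the round trip (the `.npz` layout, key names, and toy sizes are illustrative, not the system's actual format):

```python
import io
import numpy as np

def save_bounded_state(layer_states, f):
    """Serialize the per-layer bounded KV state to a file-like object.

    layer_states: list of fixed-size (slots, d) arrays, one buffer per layer.
    """
    np.savez(f, **{f"layer_{i}": s for i, s in enumerate(layer_states)})

def load_bounded_state(f, n_layers):
    data = np.load(f)
    return [data[f"layer_{i}"] for i in range(n_layers)]

# Round-trip a toy 2-layer state through an in-memory buffer
states = [np.zeros((8, 4), dtype=np.float16) for _ in range(2)]
buf = io.BytesIO()
save_bounded_state(states, buf)
buf.seek(0)
restored = load_bounded_state(buf, 2)
```

Because the artifact size is constant per document, the footprint of a document store scales linearly and predictably: 100 documents x 48 MB = 4.8 GB.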