# Parameter Golf: Competing at OpenAI's Model Craft Challenge

## Overview
OpenAI’s Parameter Golf challenge asks: what is the best language model you can train that fits in 16MB and trains in 10 minutes on 8xH100s? Scored by compression on the FineWeb validation set (bits per byte), it is essentially an L(N) optimization problem: minimize loss at a fixed parameter budget, with data and architecture left open.
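The bits-per-byte metric can be computed from a model's summed cross-entropy over the validation bytes. A minimal sketch (the function name and example numbers are illustrative, not from the competition harness):

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) over a byte stream to bpb."""
    return total_nll_nats / (n_bytes * math.log(2))

# e.g. a model averaging 0.78 nats of loss per byte scores ~1.125 bpb
print(bits_per_byte(0.78 * 10_000, 10_000))
```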
Over 5 days of intense collaboration between a human engineer and an AI pair-programmer, we went from zero to matching the verified SOTA at 1.1234 bpb with a 15.89MB artifact.
## Approach

### The Model Council
The single highest-ROI decision was convening a “model council” — GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Sonar, and Nemotron Super — as strategic advisors. The council correctly identified depth recurrence as net-negative (saving a week of wasted compute), diagnosed the TTT bug mechanism, found the lzma compression insight, and provided ablation-backed technique rankings.
### Technical Stack
The final submission used the competition’s converged stack:
- 11 transformer layers, XSA attention
- Partial RoPE, LN Scale, VE128
- FlashAttention-3 Hopper kernels (built from source — 61,300 lines of build log)
- EMA, Late QAT, GPTQ-lite quantization
- lzma compression (stdlib, 2-5% tighter than zstd)
- LeakyReLU(0.5)² activation, Value Residual Learning
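The lzma packaging step above needs nothing beyond the standard library. A rough sketch with fabricated stand-in weights (the int5 value range and sizes are illustrative; a real artifact would bit-pack the values first):

```python
import lzma
import random

# Stand-in for int5-quantized weights: values 0..31, one per byte
# for simplicity.
random.seed(0)
fake_weights = bytes(random.randrange(32) for _ in range(100_000))

# stdlib lzma at its maximum preset; no third-party zstd dependency
packed = lzma.compress(fake_weights, preset=9 | lzma.PRESET_EXTREME)
print(f"{len(packed) / len(fake_weights):.2%} of raw size")
```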
## Timeline
| Milestone | BPB | Key Change |
|---|---|---|
| First run (depth recurrence) | 1.2956 | 5x4 recurrence, 8xH100 |
| After abandoning recurrence | 1.2015 | Standard 9L, SOTA stack |
| First PR submitted (#376) | 1.1401 | 11L, int5, full stack |
| With FA3 Hopper | 1.1229 | True Hopper attention kernels |
| Final configuration | 1.1234 | + lzma, LeakyReLU², VRL |
## What Worked
- Multi-LLM strategic advising: 5 frontier models as a cabinet, converging on big calls while disagreeing on specifics
- Systematic debugging: 8+ hours isolating the TTT bug through A/B tests — torch.compile interaction with SmearGate + BigramHash
- Intelligence gathering: Reading every top PR’s ablation tables and code was essential — the competition is fundamentally about information
- Rapid iteration: 30+ full training runs in 5 days on RunPod 8xH100 pods
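The A/B debugging loop above can be sketched as a small grid over feature flags. Everything below (the config fields, the fake `run` function, and its numbers) is a hypothetical stand-in for real training runs, purely to illustrate the isolation strategy:

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Ablation:
    torch_compile: bool
    smear_gate: bool
    bigram_hash: bool

def run(cfg: Ablation) -> float:
    # Placeholder for a full training run returning validation bpb.
    # Fabricated numbers: in this toy model, the regression only appears
    # when all three features are enabled together.
    bpb = 1.20
    if cfg.torch_compile and cfg.smear_gate and cfg.bigram_hash:
        bpb += 0.05
    return bpb

# Sweep every on/off combination and compare results
for flags in product([False, True], repeat=3):
    cfg = Ablation(*flags)
    print(cfg, run(cfg))
```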
## What Didn’t Work
- Depth recurrence: Elegant theory, wrong regime. At 5M params you’re compute-bound, not parameter-bound. The 2.7x per-step overhead destroyed any benefit.
- Custom Triton kernels: 8 Makora-generated kernels, all passing validation (1.2-1.75x speedup), but torch.compile already optimizes the same paths on H100
- LoRA test-time training: Broke on every custom architecture due to torch.compile interactions
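The depth-recurrence verdict follows from simple arithmetic: under a fixed wall-clock budget, a 2.7x per-step cost cuts the number of optimizer steps, and hence tokens seen, to roughly 37% of baseline. A back-of-envelope sketch (the baseline step time is an assumed value):

```python
budget_s = 600                        # fixed 10-minute training budget
baseline_step_s = 0.05                # assumed baseline step time
recurrent_step_s = 2.7 * baseline_step_s

baseline_steps = budget_s / baseline_step_s
recurrent_steps = budget_s / recurrent_step_s
print(recurrent_steps / baseline_steps)  # ≈ 1 / 2.7 ≈ 0.37
```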
## Links
- Competition: OpenAI Parameter Golf