
Parameter Golf: Competing at OpenAI's Model Craft Challenge

preprint

Overview

OpenAI’s Parameter Golf challenge asks: what is the best language model you can train that fits in 16MB and trains in 10 minutes on 8xH100s? Scored by compression on the FineWeb validation set (bits per byte), it is essentially an L(N) optimization problem: minimize loss at a fixed parameter budget, with data and architecture left unconstrained.
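The bits-per-byte metric can be sketched with a small helper: sum the model's cross-entropy (in nats) over a span of text, convert nats to bits, and normalize by the span's byte count. This is the standard definition; the exact harness the competition uses is not shown here.

```python
import math

def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy in nats over a text span into
    bits per byte: divide by ln(2) to get bits, then by byte count."""
    return total_nats / (math.log(2) * total_bytes)

# For a byte-level model, a mean loss of ~0.78 nats/byte lands in the
# ~1.12 bpb regime discussed in this writeup:
print(round(bits_per_byte(0.78 * 1000, 1000), 3))  # ≈ 1.125
```

Note that for subword tokenizers the same formula applies, but the normalizer is the raw byte count of the text, not the token count.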

Over five days of intense collaboration between a human engineer and an AI pair-programmer, we went from zero to matching the verified SOTA of 1.1234 bpb with a 15.89MB artifact.

Approach

The Model Council

The single highest-ROI decision was convening a “model council” — GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Sonar, and Nemotron Super — as strategic advisors. The council correctly identified depth recurrence as net-negative (saving a week of wasted compute), diagnosed the TTT bug mechanism, found the lzma compression insight, and provided ablation-backed technique rankings.

Technical Stack

The final submission used the competition’s converged stack:

  • 11 transformer layers, XSA attention
  • Partial RoPE, LN Scale, VE128
  • FlashAttention-3 Hopper kernels (built from source — 61,300 lines of build log)
  • EMA, Late QAT, GPTQ-lite quantization
  • lzma compression (stdlib, 2-5% tighter than zstd)
  • LeakyReLU(0.5)² activation, Value Residual Learning
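The lzma step from the stack above is plain stdlib usage: compress the serialized quantized weights before measuring artifact size. A minimal sketch, using a synthetic byte string as a stand-in for the real int5/GPTQ-lite weight blob, and stdlib zlib as the comparison baseline (zstd itself is not in the standard library):

```python
import lzma
import zlib

# Hypothetical payload standing in for a serialized quantized checkpoint.
payload = bytes(i % 32 for i in range(1 << 16))

# PRESET_EXTREME trades compression time for a smaller artifact, which is
# the right trade when only the final file size is scored.
lz = lzma.compress(payload, preset=9 | lzma.PRESET_EXTREME)
zl = zlib.compress(payload, 9)

print(len(payload), len(zl), len(lz))
assert lzma.decompress(lz) == payload  # lossless round trip
```

The 2-5% advantage over zstd quoted above is the writeup's own measurement on real weight blobs; ratios on other payloads will differ.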

Timeline

| Milestone | BPB | Key Change |
| --- | --- | --- |
| First run (depth recurrence) | 1.2956 | 5x4 recurrence, 8xH100 |
| After abandoning recurrence | 1.2015 | Standard 9L, SOTA stack |
| First PR submitted (#376) | 1.1401 | 11L, int5, full stack |
| With FA3 Hopper | 1.1229 | True Hopper attention kernels |
| Final configuration | 1.1234 | + lzma, LeakyReLU², VRL |
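The LeakyReLU(0.5)² activation admits a one-line definition. One plausible reading, in the spirit of the squared-ReLU activations popular in recent speedrun stacks, is to apply LeakyReLU with negative slope 0.5 and then square the result; the writeup does not spell out the exact formulation, so treat this as an assumption:

```python
def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    """Assumed reading of LeakyReLU(0.5)^2: scale negative inputs by
    `slope`, then square. Squaring makes the function non-monotone but
    keeps a nonzero gradient on the negative branch, unlike ReLU^2."""
    y = x if x > 0 else slope * x
    return y * y

print(leaky_relu_sq(2.0))   # 4.0
print(leaky_relu_sq(-2.0))  # (-1.0)^2 = 1.0
```

In a real model this would be applied elementwise to a tensor (e.g. `torch.nn.functional.leaky_relu(x, 0.5).square()`); the scalar version above is just for clarity.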

What Worked

  • Multi-LLM strategic advising: 5 frontier models as a cabinet, converging on big calls while disagreeing on specifics
  • Systematic debugging: 8+ hours isolating the TTT bug through A/B tests — torch.compile interaction with SmearGate + BigramHash
  • Intelligence gathering: Reading every top PR’s ablation tables and code was essential — the competition is fundamentally about information
  • Rapid iteration: 30+ full training runs in 5 days on RunPod 8xH100 pods

What Didn’t Work

  • Depth recurrence: Elegant theory, wrong regime. At 5M params you’re compute-bound, not parameter-bound. The 2.7x per-step overhead destroyed any benefit.
  • Custom Triton kernels: 8 Makora-generated kernels, all passing validation (1.2-1.75x speedup), but torch.compile already optimizes the same paths on H100.
  • LoRA test-time training: Broke on every custom architecture due to torch.compile interactions.