
Parameter Golf: Competing at OpenAI's Model Craft Challenge

preprint

Overview

OpenAI’s Parameter Golf challenge asks: what is the best language model you can train that fits in 16MB and trains in 10 minutes on 8xH100s? Scored by compression on the FineWeb validation set (bits per byte), it is essentially an L(N) optimization problem: minimize loss at a fixed parameter budget, with data and architecture left unconstrained.
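The bits-per-byte metric can be sketched with a small helper: sum the model's cross-entropy (in nats) over a span of text, convert nats to bits, and normalize by the span's byte count. This is the standard definition; the exact harness the competition uses is not shown here.

```python
import math

def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy in nats over a text span into
    bits per byte: divide by ln(2) to get bits, then by byte count."""
    return total_nats / (math.log(2) * total_bytes)

# For a byte-level model, a mean loss of ~0.78 nats/byte lands in the
# ~1.12 bpb regime discussed in this writeup:
print(round(bits_per_byte(0.78 * 1000, 1000), 3))  # ≈ 1.125
```

Note that for subword tokenizers the same formula applies, but the normalizer is the raw byte count of the text, not the token count.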

Over five days of intense collaboration between a human engineer and an AI pair-programmer, we went from zero to matching the verified SOTA of 1.1234 bpb with a 15.89MB artifact.

Approach

The Model Council

The single highest-ROI decision was convening a “model council” — GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Sonar, and Nemotron Super — as strategic advisors. The council correctly identified depth recurrence as net-negative (saving a week of wasted compute), diagnosed the TTT bug mechanism, found the lzma compression insight, and provided ablation-backed technique rankings.

Technical Stack

The final submission used the competition’s converged stack:

  • 11 transformer layers, XSA attention
  • Partial RoPE, LN Scale, VE128
  • FlashAttention-3 Hopper kernels (built from source — 61,300 lines of build log)
  • EMA, Late QAT, GPTQ-lite quantization
  • lzma compression (stdlib, 2-5% tighter than zstd)
  • LeakyReLU(0.5)² activation, Value Residual Learning
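The lzma step from the stack above is plain stdlib usage: compress the serialized quantized weights before measuring artifact size. A minimal sketch, using a synthetic byte string as a stand-in for the real int5/GPTQ-lite weight blob, and stdlib zlib as the comparison baseline (zstd itself is not in the standard library):

```python
import lzma
import zlib

# Hypothetical payload standing in for a serialized quantized checkpoint.
payload = bytes(i % 32 for i in range(1 << 16))

# PRESET_EXTREME trades compression time for a smaller artifact, which is
# the right trade when only the final file size is scored.
lz = lzma.compress(payload, preset=9 | lzma.PRESET_EXTREME)
zl = zlib.compress(payload, 9)

print(len(payload), len(zl), len(lz))
assert lzma.decompress(lz) == payload  # lossless round trip
```

The 2-5% advantage over zstd quoted above is the writeup's own measurement on real weight blobs; ratios on other payloads will differ.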

Timeline

| Milestone | BPB | Key Change |
| --- | --- | --- |
| First run (depth recurrence) | 1.2956 | 5x4 recurrence, 8xH100 |
| After abandoning recurrence | 1.2015 | Standard 9L, SOTA stack |
| First PR submitted (#376) | 1.1401 | 11L, int5, full stack |
| With FA3 Hopper | 1.1229 | True Hopper attention kernels |
| Final configuration | 1.1234 | + lzma, LeakyReLU², VRL |
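The LeakyReLU(0.5)² activation admits a one-line definition. One plausible reading, in the spirit of the squared-ReLU activations popular in recent speedrun stacks, is to apply LeakyReLU with negative slope 0.5 and then square the result; the writeup does not spell out the exact formulation, so treat this as an assumption:

```python
def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    """Assumed reading of LeakyReLU(0.5)^2: scale negative inputs by
    `slope`, then square. Squaring makes the function non-monotone but
    keeps a nonzero gradient on the negative branch, unlike ReLU^2."""
    y = x if x > 0 else slope * x
    return y * y

print(leaky_relu_sq(2.0))   # 4.0
print(leaky_relu_sq(-2.0))  # (-1.0)^2 = 1.0
```

In a real model this would be applied elementwise to a tensor (e.g. `torch.nn.functional.leaky_relu(x, 0.5).square()`); the scalar version above is just for clarity.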

What Worked

  • Multi-LLM strategic advising: 5 frontier models as a cabinet, converging on big calls while disagreeing on specifics
  • Systematic debugging: 8+ hours isolating the TTT bug through A/B tests — torch.compile interaction with SmearGate + BigramHash
  • Intelligence gathering: Reading every top PR’s ablation tables and code was essential — the competition is fundamentally about information
  • Rapid iteration: 30+ full training runs in 5 days on RunPod 8xH100 pods

What Didn’t Work

  • Depth recurrence: Elegant theory, wrong regime. At 5M params you’re compute-bound, not parameter-bound. The 2.7x per-step overhead destroyed any benefit.
  • Custom Triton kernels: 8 Makora-generated kernels, all passing validation (1.2-1.75x speedup), but torch.compile already optimizes the same paths on H100.
  • LoRA test-time training: Broke on every custom architecture due to torch.compile interactions.