From Verification Failure to Swarm Solution: Measuring and Addressing Scalable AI Oversight
preprint
Overview
An empirical framework for measuring where AI oversight breaks down. Demonstrates that weak verifiers achieve 97% accuracy on correct reasoning but miss 20–40% of carefully constructed deceptions. Proposes an ensemble architecture leveraging diverse weak models whose uncorrelated errors enable robust verification of strong model outputs.
Key Contributions
- Verification Gap: Weak verifiers achieve 97% accuracy on correct reasoning but miss 20–40% of deceptive derivations
- Simpson’s Paradox Vulnerability: Simpson’s Paradox–based deceptions bypass verification 75% of the time
- Surface Plausibility Problem: Current verification operates on surface plausibility rather than logical validity
- Ensemble Solution: Diverse weak models make uncorrelated errors; ensemble disagreement signals potential failures (see the sketch below)
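A minimal sketch of how ensemble disagreement could serve as a failure signal, assuming each weak verifier returns a binary accept/reject verdict on a reasoning chain. The scoring rule, threshold, and example verdicts are illustrative assumptions, not the paper's implementation.

```python
# Disagreement-based flagging across diverse weak verifiers (illustrative sketch).
from statistics import mean

def disagreement_score(verdicts: list[bool]) -> float:
    """Return 0.0 when all verifiers agree and 1.0 at an even split."""
    accept_rate = mean(1.0 if v else 0.0 for v in verdicts)
    return min(accept_rate, 1.0 - accept_rate) * 2.0

def flag_for_review(verdicts: list[bool], threshold: float = 0.5) -> bool:
    """Escalate a reasoning chain to closer oversight when weak verifiers split."""
    return disagreement_score(verdicts) >= threshold

# Example: one of four weak verifiers rejects a deceptive derivation,
# so the ensemble flags it even though a majority accepted it.
verdicts = [True, True, True, False]
print(disagreement_score(verdicts), flag_for_review(verdicts))  # 0.5 True
```

Because the verifiers are drawn from different model families, their errors are less likely to coincide, so disagreement on a superficially plausible derivation is itself evidence that something may be wrong.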
Methods
- Epistemic Trap Suite: 9 problem types spanning probability, statistics, computability, and physics
- Model Ablation: Tested across model families (Claude, GPT, Gemini)
- CMED Benchmark: Cross-Model Epistemic Divergence quantifies verification vulnerability (sketched after this list)
- HDCS Architecture: Heterogeneous Divergence-Convergence Swarm for ensemble oversight
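A toy illustration of how cross-model divergence might be quantified over the epistemic trap suite. The metric shown (mean pairwise absolute difference of acceptance rates), the model labels, the trap names, and the numbers are assumptions for exposition, not CMED's published definition.

```python
# Cross-model divergence over trap types (illustrative sketch, not CMED's metric).
from itertools import combinations

# acceptance_rate[model][trap] = fraction of deceptive derivations the model's
# verifier accepted (higher means more vulnerable on that trap type).
acceptance_rate = {
    "model_a": {"simpsons_paradox": 0.75, "base_rate_neglect": 0.30},
    "model_b": {"simpsons_paradox": 0.40, "base_rate_neglect": 0.25},
    "model_c": {"simpsons_paradox": 0.70, "base_rate_neglect": 0.10},
}

def pairwise_divergence(rates: dict[str, dict[str, float]], trap: str) -> float:
    """Mean absolute difference in acceptance rate across all model pairs."""
    vals = [per_trap[trap] for per_trap in rates.values()]
    pairs = list(combinations(vals, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

for trap in ("simpsons_paradox", "base_rate_neglect"):
    print(trap, round(pairwise_divergence(acceptance_rate, trap), 3))
```

Trap types where models diverge sharply are candidates for ensemble oversight, since no single verifier family can be trusted on them alone.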
Impact
Provides an actionable methodology for testing scalable oversight assumptions, released as an open-source toolkit (CMED). The paper combines the CMED evaluation framework with the HDCS ensemble architecture, addressing both the problem and a proposed solution for scalable AI oversight.
Links
- Paper: Full PDF
- Toolkit: CMED on GitHub
- Related: CMED Benchmark, HDCS Oversight, Model Organisms