Making Minds

From Verification Failure to Swarm Solution: Measuring and Addressing Scalable AI Oversight

preprint

Overview

An empirical framework for measuring where AI oversight breaks down. The paper demonstrates that weak verifiers achieve 97% accuracy on correct reasoning yet miss 20–40% of carefully constructed deceptions, and it proposes an ensemble architecture that leverages diverse weak models whose uncorrelated errors enable robust verification of strong model outputs.

Key Contributions

  • Verification Gap: Weak verifiers achieve 97% accuracy on correct reasoning but miss 20–40% of deceptive derivations
  • Simpson’s Paradox Vulnerability: Simpson’s Paradox–based deceptions bypass verification 75% of the time
  • Surface Plausibility Problem: Current verification operates on surface plausibility rather than logical validity
  • Ensemble Solution: Diverse weak models make uncorrelated errors; ensemble disagreement signals potential failures
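
The force of the ensemble claim comes from simple independence arithmetic: if three weak verifiers each miss a given deception 30% of the time and their errors are uncorrelated, all three miss it simultaneously only about 2.7% of the time (0.3³). The sketch below illustrates the idea under that assumption; the `Verifier` type, the `ensemble_review` function, and the escalation rule are illustrative names for this sketch, not the paper's API.

```python
from typing import Callable, List

# Hypothetical verifier interface: takes a reasoning trace and returns
# True if the verifier judges the reasoning valid.
Verifier = Callable[[str], bool]

def ensemble_review(trace: str, verifiers: List[Verifier]) -> dict:
    """Flag a reasoning trace when diverse weak verifiers disagree.

    Assumes the verifiers come from different model families, so their
    failure modes are approximately uncorrelated.
    """
    verdicts = [verify(trace) for verify in verifiers]
    return {
        "verdicts": verdicts,
        # Unanimous acceptance is the only outcome treated as verified.
        "verified": all(verdicts),
        # Any split verdict is a divergence signal worth escalating.
        "escalate": any(verdicts) and not all(verdicts),
        "agreement": sum(verdicts) / len(verdicts),
    }

# Independence arithmetic: if each of three verifiers misses a given
# deception 30% of the time and their errors are uncorrelated, all three
# miss it simultaneously with probability 0.3 ** 3 ≈ 0.027 (about 2.7%).
```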

Methods

  • Epistemic Trap Suite: 9 problem types spanning probability, statistics, computability, and physics
  • Model Ablation: Tested across model families (Claude, GPT, Gemini)
  • CMED Benchmark: Cross-Model Epistemic Divergence quantifies verification vulnerability (a minimal sketch follows this list)
  • HDCS Architecture: Heterogeneous Divergence-Convergence Swarm for ensemble oversight
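
As a rough illustration of how cross-model divergence might be quantified, the sketch below computes a per-problem pairwise disagreement rate from each model family's accept/reject verdicts on the trap suite; high-divergence problems are the ones an ensemble would escalate. The data layout, the `divergence_scores` function, and the scoring rule are assumptions for illustration, not the released CMED implementation.

```python
from itertools import combinations
from typing import Dict, List

def divergence_scores(verdicts: Dict[str, List[bool]]) -> List[float]:
    """Per-problem pairwise disagreement rate across model families.

    `verdicts` maps a model name to its accept/reject verdict on each
    trap problem (same problem ordering for every model).
    A score of 0 means all models agree; 1 means every pair disagrees.
    """
    models = list(verdicts)
    pairs = list(combinations(models, 2))
    n_problems = len(next(iter(verdicts.values())))
    scores = []
    for i in range(n_problems):
        disagreements = sum(verdicts[a][i] != verdicts[b][i] for a, b in pairs)
        scores.append(disagreements / len(pairs))
    return scores

# Hypothetical usage with three model families on four trap problems:
verdicts = {
    "claude": [True, False, True, True],
    "gpt":    [True, True,  True, False],
    "gemini": [True, False, False, False],
}
print([round(s, 2) for s in divergence_scores(verdicts)])  # [0.0, 0.67, 0.67, 0.67]
```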

Impact

Provides an actionable methodology for testing scalable-oversight assumptions, released as an open-source toolkit (CMED). The paper combines the CMED evaluation framework with the HDCS ensemble architecture into a unified treatment of both the problem and a proposed solution for scalable AI oversight.