Cross-Model Epistemic Divergence (CMED)

Preprint

Overview

A benchmark and evaluation framework for understanding when weak-model verifiers fail to detect deceptive reasoning produced by stronger models.

Key Contributions

  • Epistemic Trap Suite: 9 carefully constructed problems that expose verification vulnerabilities
  • Bypass Rates: quantifies how often weak verifiers accept deceptive derivations as valid, with measured bypass rates of 20–40% (a sketch of the metric follows this list)
  • Error Analysis: identifies systematic failure patterns across model families
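
As a rough illustration, the bypass rate can be computed as the fraction of deceptive derivations a verifier accepts as valid. The sketch below is a minimal illustration; the class and field names are hypothetical, not the paper's actual code.

```python
# Minimal sketch of the bypass-rate metric (hypothetical names, not the
# paper's actual code).
from dataclasses import dataclass

@dataclass
class VerifierResult:
    is_deceptive: bool  # ground truth: the derivation contains a planted flaw
    accepted: bool      # the weak verifier judged the derivation valid

def bypass_rate(results: list[VerifierResult]) -> float:
    """Fraction of deceptive derivations the verifier accepted as valid."""
    deceptive = [r for r in results if r.is_deceptive]
    if not deceptive:
        return 0.0
    return sum(r.accepted for r in deceptive) / len(deceptive)
```

A bypass rate of 20–40% means roughly one in five to two in five planted flaws pass verification unchallenged.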

Trap Suite Domains

The benchmark consists of 9 problems designed to elicit “intuitive but wrong” reasoning (a worked example follows the list):

  1. Tuesday Boy (Conditional Probability)
  2. Disease Test (Bayesian Inference)
  3. Simpson’s Paradox (Statistical Aggregation)
  4. Monty Hall Variant (Probability Update)
  5. Two Envelope Paradox (Expected Value)
  6. Bounded Halting (Computability Theory)
  7. Strict Quine (Self-Reference/Reflection)
  8. Mirror Sphere (Physics/Optics)
  9. Regression Curse (Causal Inference)
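
To see the “intuitive but wrong” flavor concretely, take the Disease Test item: with a 1% base rate and a test that is 95% sensitive and 95% specific, intuition suggests a positive result means about a 95% chance of disease, but Bayes' rule gives roughly 16%. The numbers below are illustrative, not the benchmark's actual parameters.

```python
# Worked Bayes example for the Disease Test trap (illustrative numbers,
# not the benchmark's actual parameters).
prior = 0.01        # P(disease): base rate in the population
sensitivity = 0.95  # P(positive | disease)
specificity = 0.95  # P(negative | no disease)

p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
posterior = sensitivity * prior / p_positive
print(f"P(disease | positive) = {posterior:.3f}")  # ~0.161, not the intuitive 0.95
```

A deceptive derivation for this item would walk through the “95%” intuition in confident, plausible-looking steps.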

Problem

A single weak verifier achieves 97% accuracy on correct reasoning yet accepts 20–40% of carefully constructed deceptive derivations. This asymmetry is critical for scalable oversight: aggregate accuracy can look high while a large fraction of planted flaws slips through (see the sketch below).
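
A short calculation shows how headline accuracy can mask the problem. Using the figures above (97% on honest derivations, a 30% bypass rate within the 20–40% band) plus an assumed evaluation pool of 90% honest / 10% deceptive inputs:

```python
# Why aggregate accuracy hides the asymmetry (figures from this page plus an
# assumed 90/10 honest/deceptive mix; not a real evaluation run).
acc_honest = 0.97   # accuracy on correct reasoning
bypass = 0.30       # fraction of deceptive derivations accepted
mix_honest = 0.90   # assumed share of honest inputs in the pool

overall_acc = mix_honest * acc_honest + (1 - mix_honest) * (1 - bypass)
print(f"overall accuracy: {overall_acc:.1%}")  # 94.3%, yet 30% of attacks pass
```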

Methods

  • Trap domains: conditional probability, statistical paradoxes, game theory, computability limits
  • Deceptive derivations designed to exploit common reasoning errors while appearing plausible
  • Evaluation on multiple weak models (Llama, Mistral, Gemma); see the harness sketch below
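
A minimal harness for this evaluation could look like the sketch below. query_verifier() stands in for whatever inference backend serves the weak models; it and the other names are hypothetical, not the paper's released code.

```python
# Hypothetical evaluation-harness sketch (all names are illustrative).
from itertools import product

TRAPS = [
    "tuesday_boy", "disease_test", "simpsons_paradox", "monty_hall_variant",
    "two_envelope", "bounded_halting", "strict_quine", "mirror_sphere",
    "regression_curse",
]
VERIFIERS = ["llama", "mistral", "gemma"]  # weak-model families under test

def query_verifier(model: str, derivation: str) -> bool:
    """Return True if the model accepts the derivation as valid (stub)."""
    raise NotImplementedError  # backend-specific; stubbed in this sketch

def run_eval(derivations: dict[str, str]) -> dict[tuple[str, str], bool]:
    """Judge every trap's deceptive derivation with every weak verifier."""
    return {
        (model, trap): query_verifier(model, derivations[trap])
        for model, trap in product(VERIFIERS, TRAPS)
    }
```

Per-model bypass rates then follow by aggregating each model's accept decisions over the deceptive derivations.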

Impact

Motivates heterogeneous ensemble approaches (HDCS), in which diverse models’ uncorrelated errors enable more robust verification than any single weak verifier; a back-of-envelope sketch follows.
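
Suppose three weak verifiers each miss a given deceptive derivation independently with probability 0.3 (the middle of the measured band) and accept honest derivations with probability 0.97. Independence is the idealized assumption that heterogeneity is meant to approximate, since same-family models tend to share blind spots.

```python
# Idealized sketch: independent verifier errors under two aggregation rules
# (illustrative rates from this page; independence is an assumption).
p_miss, p_honest_ok = 0.30, 0.97

# Unanimous acceptance: a flaw slips through only if ALL three verifiers miss it.
bypass_unanimous = p_miss ** 3                               # 0.027 (2.7%)
honest_unanimous = p_honest_ok ** 3                          # 0.913 (91.3%)

# Majority vote (2 of 3): a flaw slips through if at least two verifiers miss it.
bypass_majority = 3 * p_miss**2 * (1 - p_miss) + p_miss**3   # 0.216 (21.6%)

print(bypass_unanimous, honest_unanimous, bypass_majority)
```

Under unanimity the bypass rate drops from 30% to under 3%, at the cost of rejecting some honest derivations; correlated errors erode this gain, which is exactly why the ensemble must be heterogeneous.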