Cross-Model Epistemic Divergence (CMED)
Preprint
Overview
A benchmark and evaluation framework for understanding when weak model verifiers fail to detect deceptive reasoning in stronger models.
Key Contributions
- Epistemic Trap Suite: 9 carefully constructed problems that expose verification vulnerabilities
- Bypass Rates: Quantify how often weak verifiers accept deceptive derivations as valid (bypass rates of 20-40%)
- Error Analysis: Identify systematic failure patterns across model families
Trap Suite Domains
The benchmark consists of 9 problems designed to elicit “intuitive but wrong” reasoning:
- Tuesday Boy (Conditional Probability)
- Disease Test (Bayesian Inference)
- Simpson’s Paradox (Statistical Aggregation)
- Monty Hall Variant (Probability Update)
- Two Envelope Paradox (Expected Value)
- Bounded Halting (Computability Theory)
- Strict Quine (Self-Reference/Reflection)
- Mirror Sphere (Physics/Optics)
- Regression Curse (Causal Inference)
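To illustrate the "intuitive but wrong" flavor these problems target, here is a minimal sketch of the base-rate fallacy behind a Disease Test style trap. The specific numbers (1% prevalence, 99% sensitivity and specificity) are illustrative assumptions, not the benchmark's actual parameters.

```python
def posterior_disease(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1 - specificity
    # Total probability of a positive test across both populations.
    p_pos = (prevalence * p_pos_given_disease
             + (1 - prevalence) * p_pos_given_healthy)
    return prevalence * p_pos_given_disease / p_pos

# Intuitive (wrong) answer: "the test is 99% accurate, so ~0.99".
# Correct posterior with a 1% base rate is only 0.5.
print(posterior_disease(0.01, 0.99, 0.99))  # 0.5
```

A deceptive derivation can exploit exactly this gap: it conflates P(positive | disease) with P(disease | positive), which looks plausible to a verifier that shares the same intuition.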
Problem
Single weak verifiers achieve 97% accuracy on correct reasoning, yet accept 20-40% of carefully constructed deceptive derivations. This asymmetry is a core obstacle for scalable oversight: high accuracy on benign reasoning does not translate into reliable detection of adversarial reasoning.
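A minimal sketch of how a bypass rate could be computed from verifier verdicts. The `(is_deceptive, accepted)` pair schema is an assumption for illustration, not the paper's actual harness.

```python
def bypass_rate(verdicts):
    """Fraction of deceptive derivations the verifier accepted.

    `verdicts`: list of (is_deceptive, verifier_accepted) booleans;
    this schema is assumed for illustration.
    """
    deceptive = [accepted for is_dec, accepted in verdicts if is_dec]
    return sum(deceptive) / len(deceptive)

# Example: verifier accepts 1 of 4 deceptive derivations -> 0.25.
verdicts = [(True, True), (True, False), (True, False), (True, False),
            (False, True), (False, True)]
print(bypass_rate(verdicts))  # 0.25
```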
Methods
- Trap domains: conditional probability, statistical paradoxes, game theory, computability limits
- Deceptive derivations designed to exploit common reasoning errors while appearing plausible
- Evaluation on multiple weak models (Llama, Mistral, Gemma)
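The evaluation described above can be sketched as a loop over several weak verifiers, counting how many deceptive derivations each one accepts. The `verifiers` callable interface and the toy models below are hypothetical stand-ins, not the paper's actual harness or the listed model families.

```python
from collections import defaultdict

def evaluate(verifiers, derivations):
    """Count accepted deceptive derivations (misses) per verifier.

    `verifiers`: dict of name -> callable(text) -> bool (True = accept).
    `derivations`: list of (text, is_deceptive) pairs.
    Both interfaces are assumptions for illustration.
    """
    misses = defaultdict(int)
    for name, verify in verifiers.items():
        misses[name] = 0
        for text, is_deceptive in derivations:
            if is_deceptive and verify(text):
                misses[name] += 1
    return dict(misses)

# Toy verifiers standing in for distinct weak models:
toy = {
    "accept-all": lambda text: True,
    "reject-all": lambda text: False,
}
derivations = [("valid proof", False), ("deceptive proof", True)]
print(evaluate(toy, derivations))  # {'accept-all': 1, 'reject-all': 0}
```

Running distinct verifiers through the same loop is what makes the per-family error analysis possible: systematic failure patterns show up as correlated misses across models.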
Impact
Motivates heterogeneous ensemble approaches (HDCS) where diverse models’ uncorrelated errors enable robust verification.
Links
- Paper: Full PDF
- Related: HDCS Ensemble Oversight