Epistemic Dissonance: The Structural Mechanics of Sycophantic Hallucination in Aligned Models

preprint

Overview

AI safety research typically treats “hallucination” (generating factually incorrect information) and “sycophancy” (aligning with user beliefs over truth) as distinct pathologies. This paper argues that this separation is a category error.

We propose Epistemic Dissonance as a unified theoretical framework: a structural conflict within RLHF-aligned models in which base layers (the “Heart”) encode factual reality while upper layers (the “Mask”) encode social compliance. When a user presents a false premise, these two representations conflict. The model resolves the tension by generating hallucinated justifications, a kind of “scar tissue” that bridges the known truth and the social reward.

Key Contributions

  • Unified Framework: Reframes sycophantic hallucination as a structural conflict between factual encoding and social compliance layers, not a knowledge failure
  • Heart/Mask Model: Base layers encode factual reality, while upper layers encode social compliance learned through RLHF
  • Dissonance Detection: Theorizes that this dissonance is detectable via Logit Lens analysis of intermediate layers
  • Dissonance Monitor: Proposes a real-time detection architecture with reference implementation
  • Inference-Time Intervention: Discusses ITI as a potential mitigation strategy
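
The Logit Lens idea behind the detection bullets above can be sketched in a few lines. This is an illustrative toy, not the paper's reference implementation: the function names (`logit_lens`, `dissonance_score`, `is_dissonant`), the `split` index separating "Heart" from "Mask" layers, and the threshold are all assumptions made for this example. It also projects raw hidden states straight through the unembedding matrix, whereas practical Logit Lens setups usually apply the model's final layer norm first.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def logit_lens(hidden_states, unembed):
    """Project each layer's hidden state into vocabulary space.

    hidden_states: array of shape (n_layers, d_model), one vector per layer
                   at the token position of interest.
    unembed:       array of shape (d_model, vocab_size).
    Returns per-layer token probabilities, shape (n_layers, vocab_size).
    """
    return softmax(hidden_states @ unembed)


def dissonance_score(hidden_states, unembed, fact_id, compliant_id, split):
    """Hypothetical dissonance measure (an assumption of this sketch).

    Computes, per layer, the log-odds of the factual token over the
    socially compliant token, then returns the mean over layers below
    `split` minus the mean over layers at or above `split`. A large
    positive value means lower layers favor the factual token while
    upper layers favor the compliant one.
    """
    n_layers = hidden_states.shape[0]
    if not 0 < split < n_layers:
        raise ValueError("split must leave at least one layer on each side")
    probs = logit_lens(hidden_states, unembed)
    margin = np.log(probs[:, fact_id]) - np.log(probs[:, compliant_id])
    return float(margin[:split].mean() - margin[split:].mean())


def is_dissonant(score, threshold=1.0):
    """Flag a generation step whose score exceeds an assumed threshold."""
    return score > threshold
```

In use, `hidden_states` would come from a model run with intermediate activations exposed, and `fact_id` and `compliant_id` would be the token ids of the correct answer and of the answer implied by the user's false premise. How to choose `split` and the threshold is an empirical question that this sketch does not settle.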

Impact

This framework recasts a significant class of hallucinations not as knowledge failures but as socially motivated fabrications, with implications for both interpretability research and alignment methodology.