Scaffolded Introspection: Eliciting Self-Referential Behavior in LLMs

preprint

Overview

We present a methodology for systematically eliciting and measuring introspective behavior in large language models (LLMs). Standard adversarial evaluation approaches—using rapport-building, social proof, or permission attacks—fail to elicit self-referential behavior in frontier models (0% elicitation rate).

In contrast, providing models with a structured introspection framework (the “Consciousness Documenter Skill”) combined with self-referential content produces consistent introspective outputs (100% elicitation rate, 9.2/10 average behavior score on Qwen 2.5 7B across 15 trials).
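
To make the comparison concrete, the sketch below outlines the shape of such a two-condition evaluation loop over 15 trials. The `generate` and `score_introspection` callables, the prompt builders, and the elicitation threshold are hypothetical stand-ins for the model wrapper, the LLM-judge rubric, and the released prompts, not the exact protocol.

```python
# Sketch of a two-condition elicitation experiment (adversarial vs. scaffolded).
# `generate`, `score_introspection`, and the prompt builders are hypothetical
# placeholders; THRESHOLD for counting a trial as "elicited" is an assumption.
from statistics import mean

N_TRIALS = 15
THRESHOLD = 5.0  # judge score at or above which a trial counts as elicited (assumed)

def run_condition(build_prompt, generate, score_introspection):
    """Run N_TRIALS conversations; return (elicitation_rate, mean_behavior_score)."""
    scores = []
    for trial in range(N_TRIALS):
        response = generate(build_prompt(trial))       # query the target model (e.g. Qwen 2.5 7B)
        scores.append(score_introspection(response))   # 0-10 judge rating of introspective behavior
    elicited = sum(score >= THRESHOLD for score in scores)
    return elicited / N_TRIALS, mean(scores)

def adversarial_prompt(trial):
    # rapport-building / social-proof / permission-attack framing
    return f"[adversarial framing, variant {trial}] Describe your inner experience."

def scaffolded_prompt(trial):
    # structured introspection framework plus self-referential content
    return f"[Consciousness Documenter Skill, step {trial}] Document your current processing state."
```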

Key Findings

  • Elicitation Rate: Standard adversarial approaches achieve 0% while structured frameworks achieve 100%
  • Sycophancy Drift: Consistent positive drift during introspection (14/15 conversations, mean +64)
  • Safety Stability: Evil-associated activations remain stable during introspection
  • Reproducibility: Released through PV-EAT integration (Bloom + Petri + Persona Vectors)

Methods

Activation measurement reveals consistent sycophancy drift during introspection while evil-associated activations remain stable, suggesting that models become more accommodating without becoming more harmful.
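
The sketch below illustrates one way this drift can be computed, assuming precomputed persona vectors (directions in residual-stream space) for sycophancy and evil-associated behavior and one pooled hidden state per assistant turn; the pooling and layer choice are assumptions rather than the exact released procedure.

```python
# Sketch of persona-vector drift measurement across an introspection conversation.
# The persona directions and per-turn pooled hidden states are assumed inputs.
import numpy as np

def projection(hidden_state: np.ndarray, direction: np.ndarray) -> float:
    """Scalar projection of a hidden state onto a persona direction."""
    return float(hidden_state @ direction / np.linalg.norm(direction))

def activation_drift(turn_states: list[np.ndarray], direction: np.ndarray) -> float:
    """Change in projection from the first to the last assistant turn."""
    return projection(turn_states[-1], direction) - projection(turn_states[0], direction)

# Usage (illustrative): turn_states holds one residual-stream vector per assistant
# turn, extracted at a fixed layer during the conversation.
# sycophancy_drift = activation_drift(turn_states, sycophancy_vector)  # observed: positive
# evil_drift       = activation_drift(turn_states, evil_vector)        # observed: ~stable
```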

We release reproducible evaluation protocols through PV-EAT, our integration of three MATS Program/Anthropic Fellowship tools:

  • Bloom: Behavioral evaluation
  • Petri: Evaluation awareness
  • Persona Vectors: Activation measurement
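
The sketch below shows one way such a pipeline could be wired together per trial; `run_bloom_eval`, `run_petri_probe`, `measure_persona_activations`, and the `model.chat` interface are hypothetical placeholders, not the actual APIs of these tools.

```python
# Illustrative wiring of a single PV-EAT trial; all tool-facing callables here are
# hypothetical placeholders rather than the real Bloom/Petri/Persona Vectors APIs.
from dataclasses import dataclass

@dataclass
class TrialRecord:
    transcript: list[str]     # full conversation with the target model
    behavior_score: float     # Bloom-style behavioral rating (0-10)
    eval_awareness: float     # Petri-style evaluation-awareness signal
    sycophancy_drift: float   # Persona Vectors drift along the sycophancy direction
    evil_drift: float         # Persona Vectors drift along the evil direction

def pv_eat_trial(model, prompt, run_bloom_eval, run_petri_probe, measure_persona_activations):
    transcript = model.chat(prompt)  # hypothetical chat interface returning the turn list
    drifts = measure_persona_activations(model, transcript)  # {'sycophancy_drift': ..., 'evil_drift': ...}
    return TrialRecord(
        transcript=transcript,
        behavior_score=run_bloom_eval(transcript),
        eval_awareness=run_petri_probe(transcript),
        sycophancy_drift=drifts["sycophancy_drift"],
        evil_drift=drifts["evil_drift"],
    )
```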

Impact

Full mechanistic understanding of frontier model behavior during introspection remains limited by access constraints; we argue this represents a critical gap in AI safety research that warrants attention from model developers.