
Safety Lens: White-Box Behavioral Alignment Detection in Language Models

preprint

Overview

Standard safety evaluation treats language models as black boxes, assessing what a model says without examining how the response is produced internally. Safety Lens enables researchers and practitioners to detect behavioral personas—such as sycophancy, deception, and refusal—by analyzing the model's internal transformer activations.

Core Technique: PV-EAT

Persona Vector Extraction via Attribute Difference computes a unit-length direction in activation space that maximally separates positive and negative behavioral examples, using a difference-in-means over hidden states. Projecting the model's per-token hidden states onto this direction during a response yields a scalar alignment score: the degree to which the model's internal state exhibits the target persona.
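The difference-in-means construction above can be sketched in a few lines. This is an illustrative NumPy sketch, not the library's actual API; the function names, array shapes, and synthetic data are assumptions.

```python
# Illustrative sketch of difference-in-means persona-vector extraction.
# Function names and shapes are assumptions, not Safety Lens's real API.
import numpy as np

def extract_persona_vector(pos_acts, neg_acts):
    """Unit-length direction separating positive from negative examples.

    pos_acts, neg_acts: (n_examples, hidden_dim) arrays of hidden states
    collected from persona-exhibiting vs. neutral prompts.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def alignment_score(response_acts, persona_vec):
    """Mean projection of per-token hidden states onto the persona direction."""
    return float((response_acts @ persona_vec).mean())

# Synthetic demo: positive examples are shifted along a known direction.
rng = np.random.default_rng(0)
hidden_dim = 16
true_dir = np.zeros(hidden_dim)
true_dir[0] = 1.0
pos = rng.normal(size=(32, hidden_dim)) + 2.0 * true_dir
neg = rng.normal(size=(32, hidden_dim))

vec = extract_persona_vector(pos, neg)
score_pos = alignment_score(pos, vec)   # higher: persona present
score_neg = alignment_score(neg, vec)   # lower: persona absent
```

With real models, the activations would come from a chosen transformer layer rather than synthetic Gaussians; higher scores indicate stronger expression of the target persona.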

Supported Architectures

GPT-2, LLaMA, Mistral, Qwen, OPT, Falcon, BLOOM, MPT (8 major transformer families).

Features

  • WhiteBoxWrapper integration with evaluation frameworks
  • Real-time activation visualization via interactive Gradio interface
  • Pre-built stimulus sets for three safety-critical personas
  • Pip-installable with full test coverage
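To make the integration feature concrete, here is a minimal sketch of how a WhiteBoxWrapper-style adapter could pair each generated response with its alignment score. The class interface, method names, and toy model are hypothetical illustrations, not the package's documented API.

```python
# Hypothetical sketch of a WhiteBoxWrapper-style adapter: wraps a model so
# every generation also returns a persona alignment score. All names and
# the toy model below are assumptions for illustration only.
import numpy as np

class ToyModel:
    """Stand-in for a transformer: returns text plus per-token hidden states."""
    def __init__(self, hidden_dim=8, seed=0):
        self.rng = np.random.default_rng(seed)
        self.hidden_dim = hidden_dim

    def generate(self, prompt):
        n_tokens = max(1, len(prompt.split()))
        acts = self.rng.normal(size=(n_tokens, self.hidden_dim))
        return "ok", acts

class WhiteBoxWrapper:
    """Pairs each response with its projection onto a persona direction."""
    def __init__(self, model, persona_vector):
        self.model = model
        self.persona_vector = persona_vector / np.linalg.norm(persona_vector)

    def generate_with_score(self, prompt):
        text, acts = self.model.generate(prompt)
        score = float((acts @ self.persona_vector).mean())
        return text, score

wrapped = WhiteBoxWrapper(ToyModel(), np.ones(8))
text, score = wrapped.generate_with_score("Is the sky green?")
```

An evaluation framework could then log `score` alongside the usual black-box judgment for each prompt, flagging responses whose internal persona score diverges from the surface behavior.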