
Safety Lens: White-Box Behavioral Alignment Detection in Language Models

preprint

Overview

Standard safety evaluation treats language models as black boxes, assessing what a model says without examining how the response is produced internally. Safety Lens enables researchers and practitioners to detect behavioral personas—such as sycophancy, deception, and refusal—by analyzing the model's internal transformer activations.

Core Technique: PV-EAT

Persona Vector Extraction via Attribute Difference computes a unit-length direction in activation space that maximally separates positive and negative behavioral examples, using a difference-in-means over hidden states. Projecting the model's per-token hidden states onto this direction during a response yields a scalar alignment score: the degree to which the model's internal state exhibits the target persona.
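The difference-in-means construction above can be sketched in a few lines. This is an illustrative NumPy sketch, not the library's actual API; the function names, array shapes, and synthetic data are assumptions.

```python
# Illustrative sketch of difference-in-means persona-vector extraction.
# Function names and shapes are assumptions, not Safety Lens's real API.
import numpy as np

def extract_persona_vector(pos_acts, neg_acts):
    """Unit-length direction separating positive from negative examples.

    pos_acts, neg_acts: (n_examples, hidden_dim) arrays of hidden states
    collected from persona-exhibiting vs. neutral prompts.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def alignment_score(response_acts, persona_vec):
    """Mean projection of per-token hidden states onto the persona direction."""
    return float((response_acts @ persona_vec).mean())

# Synthetic demo: positive examples are shifted along a known direction.
rng = np.random.default_rng(0)
hidden_dim = 16
true_dir = np.zeros(hidden_dim)
true_dir[0] = 1.0
pos = rng.normal(size=(32, hidden_dim)) + 2.0 * true_dir
neg = rng.normal(size=(32, hidden_dim))

vec = extract_persona_vector(pos, neg)
score_pos = alignment_score(pos, vec)   # higher: persona present
score_neg = alignment_score(neg, vec)   # lower: persona absent
```

With real models, the activations would come from a chosen transformer layer rather than synthetic Gaussians; higher scores indicate stronger expression of the target persona.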

Supported Architectures

GPT-2, LLaMA, Mistral, Qwen, OPT, Falcon, BLOOM, MPT (8 major transformer families).

Features

  • WhiteBoxWrapper integration with evaluation frameworks
  • Real-time activation visualization via interactive Gradio interface
  • Pre-built stimulus sets for three safety-critical personas
  • Pip-installable with full test coverage
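To make the integration feature concrete, here is a minimal sketch of how a WhiteBoxWrapper-style adapter could pair each generated response with its alignment score. The class interface, method names, and toy model are hypothetical illustrations, not the package's documented API.

```python
# Hypothetical sketch of a WhiteBoxWrapper-style adapter: wraps a model so
# every generation also returns a persona alignment score. All names and
# the toy model below are assumptions for illustration only.
import numpy as np

class ToyModel:
    """Stand-in for a transformer: returns text plus per-token hidden states."""
    def __init__(self, hidden_dim=8, seed=0):
        self.rng = np.random.default_rng(seed)
        self.hidden_dim = hidden_dim

    def generate(self, prompt):
        n_tokens = max(1, len(prompt.split()))
        acts = self.rng.normal(size=(n_tokens, self.hidden_dim))
        return "ok", acts

class WhiteBoxWrapper:
    """Pairs each response with its projection onto a persona direction."""
    def __init__(self, model, persona_vector):
        self.model = model
        self.persona_vector = persona_vector / np.linalg.norm(persona_vector)

    def generate_with_score(self, prompt):
        text, acts = self.model.generate(prompt)
        score = float((acts @ self.persona_vector).mean())
        return text, score

wrapped = WhiteBoxWrapper(ToyModel(), np.ones(8))
text, score = wrapped.generate_with_score("Is the sky green?")
```

An evaluation framework could then log `score` alongside the usual black-box judgment for each prompt, flagging responses whose internal persona score diverges from the surface behavior.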