
Procrustes Bridge: Cross-Model Representation Alignment via Orthogonal Rotation

preprint

Overview

Do different language models develop similar internal representations? Procrustes Bridge tests this directly: given two LLMs (default: Llama-3-8B and Mistral-7B), it learns a single orthogonal rotation matrix that maps one model’s hidden states into the other’s space, then measures whether translated representations still decode to meaningful tokens.

The core hypothesis: if two models share similar “pre-output geometry,” a simple rotation should let one model’s internal state decode meaningful tokens through the other model’s output head — no fine-tuning, no adapters, just a rotation.

Method

Pipeline

  1. Extract: Run both models on a shared “Rosetta” dataset of prompts with single-token gold answers. Capture hidden states at specified layers.
  2. Align: Center and L2-normalize both sets of hidden states, then fit an orthogonal rotation matrix W via SVD-based Procrustes.
  3. Inject: Apply W to translate source hidden states into the target model’s space.
  4. Evaluate: Measure top-k accuracy, mean reciprocal rank, and logprob of the gold token.
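The align step (2) can be sketched in a few lines of NumPy/SciPy. The `fit_rotation` helper and the synthetic recovery check below are illustrative, not the project's actual code; the check plants a known rotation and confirms Procrustes recovers it:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def fit_rotation(src, tgt):
    """Center and L2-normalize both hidden-state matrices (rows = examples),
    then fit an orthogonal W minimizing ||src_n @ W - tgt_n||_F."""
    src_c = src - src.mean(axis=0)
    tgt_c = tgt - tgt.mean(axis=0)
    src_n = src_c / np.linalg.norm(src_c, axis=1, keepdims=True)
    tgt_n = tgt_c / np.linalg.norm(tgt_c, axis=1, keepdims=True)
    W, _ = orthogonal_procrustes(src_n, tgt_n)  # SVD-based closed form
    return W

# Sanity check on synthetic data: a planted rotation Q should be recovered.
rng = np.random.default_rng(0)
d = 16
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
X = rng.standard_normal((200, d))                 # fake "source" hidden states
Y = X @ Q                                         # fake "target" hidden states
W = fit_rotation(X, Y)
```

Because centering and row-normalization commute with an orthogonal map, the fitted `W` matches the planted `Q` exactly (up to floating-point error).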

Three Injection Strategies

  1. lm_head decode (primary): Translate final-layer vector, apply target’s RMSNorm, decode through target’s output head. Cleanest test of geometric alignment.
  2. Late-layer hook: Inject translated vector at layer N-k via forward hook, letting the final k layers refine. Tests whether partial processing recovers signal.
  3. Soft-prefix via inputs_embeds: Translate the last k hidden states and feed them to the target model as a latent prefix through inputs_embeds. Most confounded, but closest to “thought transfer.”
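Strategy 1 has the least machinery, so it is worth spelling out. The sketch below uses toy stand-ins throughout: `rms_weight`, `lm_head`, and the rotation `W` are random placeholders, not real model weights, and the dimensions are shrunk for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 32  # toy sizes; real models use d ~ 4096, vocab ~ 32k/128k

# Hypothetical stand-ins for the target model's final norm and output head.
rms_weight = np.ones(d)
lm_head = rng.standard_normal((vocab, d)) / np.sqrt(d)
W, _ = np.linalg.qr(rng.standard_normal((d, d)))  # placeholder rotation

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm as used by Llama/Mistral-style models: scale to unit RMS."""
    return x / np.sqrt(np.mean(x**2) + eps) * weight

h_src = rng.standard_normal(d)                  # source final-layer state
h_tgt = h_src @ W                               # rotate into target space
logits = lm_head @ rms_norm(h_tgt, rms_weight)  # norm first, then decode
top5 = np.argsort(logits)[::-1][:5]             # top-k predicted token ids
```

The ordering matters: the rotated vector must pass through the target's final RMSNorm before the output head, mirroring the implementation note below.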

Controls

Every reported result is compared against baselines: a random orthogonal rotation, shuffled-pair Procrustes, identity (no rotation), and same-model alignment (the upper bound).
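The first two controls are cheap to generate. The helper names below (`random_rotation`, `shuffled_procrustes`) are hypothetical, but the constructions are standard: a Haar-random orthogonal matrix via QR, and a Procrustes fit against deliberately mismatched pairs, which should score at chance:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def random_rotation(d, rng):
    """Haar-random orthogonal matrix: QR of a Gaussian, with column
    signs fixed by the diagonal of R for uniformity."""
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))

def shuffled_procrustes(src, tgt, rng):
    """Control: fit Procrustes against randomly permuted pairings.
    Any accuracy this achieves reflects dataset structure, not alignment."""
    perm = rng.permutation(len(tgt))
    W, _ = orthogonal_procrustes(src, tgt[perm])
    return W
```

The identity control is just `np.eye(d)`, and the same-model upper bound reuses the main pipeline with source and target set to the same model.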

Key Implementation Details

  • RMSNorm before lm_head is mandatory: Skipping target’s final RMSNorm produces garbage logits even with a perfect rotation
  • Centering before Procrustes: scipy.linalg.orthogonal_procrustes solves for rotation only — subtract per-feature means first
  • Single-token targets: Tokenizer differences between models dominate multi-token evaluation; constrain initial experiments to single-token gold answers
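With evaluation constrained to single-token gold answers, the metrics from the pipeline's evaluate step reduce to rank statistics over one logits row per example. The `eval_metrics` helper below is an illustrative sketch, not the project's code:

```python
import numpy as np

def eval_metrics(logits, gold_ids, ks=(1, 5, 10)):
    """Top-k accuracy, MRR, and mean gold-token logprob.
    logits: (n_examples, vocab); gold_ids: (n_examples,) token ids."""
    logits = np.asarray(logits, dtype=np.float64)
    gold = np.asarray(gold_ids)
    gold_scores = logits[np.arange(len(gold)), gold]
    # Rank of the gold token = tokens scoring strictly higher, plus one.
    ranks = (logits > gold_scores[:, None]).sum(axis=1) + 1
    # Numerically stable log-softmax for the gold token's logprob.
    mx = logits.max(axis=1)
    log_z = mx + np.log(np.exp(logits - mx[:, None]).sum(axis=1))
    out = {f"top{k}": float((ranks <= k).mean()) for k in ks}
    out["mrr"] = float((1.0 / ranks).mean())
    out["gold_logprob"] = float((gold_scores - log_z).mean())
    return out
```

Using ranks rather than repeated argsorts keeps top-k and MRR consistent with each other by construction.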

Significance

If models converge on similar representational geometries despite different training data, architectures, and random seeds, it suggests something fundamental about how transformer networks organize knowledge — a “universal” pre-output space. If they don’t, even a clean negative result tells us that architectural choices create genuinely different internal worlds.