# Procrustes Bridge: Cross-Model Representation Alignment via Orthogonal Rotation
## Overview
Do different language models develop similar internal representations? Procrustes Bridge tests this directly: given two LLMs (default: Llama-3-8B and Mistral-7B), it learns a single orthogonal rotation matrix that maps one model’s hidden states into the other’s space, then measures whether translated representations still decode to meaningful tokens.
The core hypothesis: if two models share similar “pre-output geometry,” a simple rotation should let one model’s internal state decode meaningful tokens through the other model’s output head — no fine-tuning, no adapters, just a rotation.
## Method
### Pipeline
- Extract: Run both models on a shared “Rosetta” dataset of prompts with single-token gold answers. Capture hidden states at specified layers.
- Align: Center and L2-normalize both sets of hidden states, then fit an orthogonal rotation matrix W via SVD-based Procrustes.
- Inject: Apply W to translate source hidden states into the target model’s space.
- Evaluate: Measure top-k accuracy, mean reciprocal rank, and logprob of the gold token.
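The Align step above can be sketched in a few lines of numpy. This is a minimal illustration under the stated recipe (center, L2-normalize, SVD-based Procrustes), not the repo's actual code; `fit_procrustes` and the toy point clouds are hypothetical stand-ins for real hidden states.

```python
import numpy as np

def fit_procrustes(X, Y):
    """Fit an orthogonal W minimizing ||X @ W - Y||_F after centering
    and L2-normalizing each row (SVD-based Procrustes solution)."""
    # Center per feature, then L2-normalize each hidden state.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Xn = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)
    Yn = Yc / np.linalg.norm(Yc, axis=1, keepdims=True)
    # Orthogonal Procrustes: W = U @ V^T from the SVD of X^T Y.
    U, _, Vt = np.linalg.svd(Xn.T @ Yn)
    return U @ Vt

# Toy check: recover a known rotation between two random "hidden state" clouds.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
R, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal ground truth
W = fit_procrustes(X, X @ R)                    # should recover R
```

In the real pipeline, `X` and `Y` would be hidden states from the two models on the shared Rosetta prompts, captured at the layers being aligned.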
### Three Injection Strategies
- lm_head decode (primary): Translate final-layer vector, apply target’s RMSNorm, decode through target’s output head. Cleanest test of geometric alignment.
- Late-layer hook: Inject the translated vector at layer N-k via a forward hook, letting the final k layers refine it. Tests whether partial processing recovers signal.
- Soft-prefix via inputs_embeds: Translate the last k hidden states and feed them as a latent prefix. Most confounded, but closest to "thought transfer."
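The primary strategy (rotate, RMSNorm, decode through the output head) can be sketched with toy numpy weights. The shapes and weight matrices here are random stand-ins, not real model parameters; `lm_head_decode` is a hypothetical helper, not the repo's API.

```python
import numpy as np

def rmsnorm(h, weight, eps=1e-6):
    # RMSNorm as applied before the output head in Llama/Mistral-style models.
    return h / np.sqrt(np.mean(h ** 2, axis=-1, keepdims=True) + eps) * weight

def lm_head_decode(h_src, W, norm_weight, lm_head):
    """Rotate a source hidden state into the target space, apply the
    target's final RMSNorm, then decode through its lm_head (vocab x d)."""
    h_tgt = h_src @ W                  # rotate into the target's space
    h_tgt = rmsnorm(h_tgt, norm_weight)
    logits = h_tgt @ lm_head.T         # one logit per target vocab token
    return np.argsort(logits)[::-1]    # token ids ranked by logit

# Toy shapes: d_model=8, vocab=32; identity rotation as a placeholder for W.
rng = np.random.default_rng(1)
d, v = 8, 32
ranked = lm_head_decode(rng.normal(size=d), np.eye(d),
                        np.ones(d), rng.normal(size=(v, d)))
```

Top-k accuracy then asks whether the gold token id appears in `ranked[:k]`.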
### Controls
Every result is compared against four baselines: a random orthogonal rotation, shuffled-pair Procrustes, the identity (no rotation), and same-model alignment (the upper bound).
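The first two controls can be constructed as follows; this is a hedged sketch with random stand-in data, showing only how the control rotations are built.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16

# Control 1: random orthogonal rotation (QR decomposition of a Gaussian matrix).
Q_random, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Control 2: shuffled-pair Procrustes -- fit on mismatched (x_i, y_pi(i)) pairs,
# which destroys the correspondence the real rotation exploits.
X = rng.normal(size=(100, d))
Y = X @ Q_random
perm = rng.permutation(len(X))
U, _, Vt = np.linalg.svd(X.T @ Y[perm])
W_shuffled = U @ Vt

# Control 3: identity (no rotation) is simply np.eye(d).
```

Both controls are still orthogonal matrices, so any accuracy gap versus the fitted rotation is attributable to learned geometric correspondence rather than to the rotation's norm-preserving properties.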
## Key Implementation Details
- RMSNorm before lm_head is mandatory: skipping the target's final RMSNorm produces garbage logits even with a perfect rotation.
- Centering before Procrustes: `scipy.linalg.orthogonal_procrustes` solves for rotation only; subtract per-feature means first.
- Single-token targets: tokenizer differences between models dominate multi-token evaluation, so constrain initial experiments to single-token gold answers.
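The centering pitfall can be demonstrated directly: `orthogonal_procrustes` cannot absorb a translation, so any shared offset must be removed before fitting. The offset and toy data below are illustrative.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 8))
R_true, _ = np.linalg.qr(rng.normal(size=(8, 8)))
Y = (X @ R_true) + 5.0  # a constant offset the rotation alone cannot absorb

# orthogonal_procrustes fits rotation only, so center both sides first;
# centering removes the offset and leaves Yc = Xc @ R_true exactly.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)
W, _ = orthogonal_procrustes(Xc, Yc)
```

Without the centering step, the fitted matrix would be pulled toward the offset direction and fail to recover `R_true`.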
## Significance
If models converge on similar representational geometries despite different training data, architectures, and random seeds, it suggests something fundamental about how transformer networks organize knowledge — a “universal” pre-output space. If they don’t, even a clean negative result tells us that architectural choices create genuinely different internal worlds.
## Links
- Code: GitHub