
Eve: From Scratch Transformer Models with Novel Cognitive Architectures

preprint

Overview

Eve is a family of transformer language models trained from scratch to explore novel architectural ideas at small-to-medium scale. Rather than fine-tuning existing models, the Eve series builds from the ground up — giving full control over architecture, training data, and optimization.

Eve-2: 272M Mixture-of-Experts

Eve-2 is a 272M-parameter Mixture-of-Experts model pretrained on ~10.5B tokens from FineWeb-edu using PyTorch DDP. The MoE architecture routes each token through a subset of expert FFN blocks, increasing model capacity without proportionally increasing compute per forward pass.

Key details:

  • 272M total parameters, MoE routing
  • Pretrained on ~10.5B tokens (FineWeb-edu)
  • Instruction-tuned and task-specialist derivatives
  • Optimized for CPU/edge inference
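The routing idea can be sketched in PyTorch. This is a minimal, generic top-k MoE layer, not Eve-2's actual router: the expert count, k, and gating rule below are assumptions for illustration (the text specifies only that each token is routed through a subset of expert FFN blocks).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Generic top-k mixture-of-experts FFN (illustrative, not Eve-2's exact router)."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        logits = self.router(x)                    # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1) # each token picks k experts
        weights = F.softmax(weights, dim=-1)       # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e           # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

Only k of the n_experts FFNs run per token, which is how total parameter count grows without a proportional increase in per-token compute.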

Models available on Hugging Face: anthonym21

Eve-3: SABER Architecture (1B)

Eve-3 introduces the SABER architecture — Slip-Anchors, Experience Streams, and Re-entry — three novel components that add ~73M parameters (~7.3% overhead) to a standard 1B transformer:

Slip-Anchors (~7.2M params)

A learnable codebook of 64 prototypes that biases Key and Value projections after RoPE in all 24 layers. Slip-anchors provide persistent error-correction signals — if the model starts drifting toward a known failure mode, the anchors steer attention back. Named for the analogy to slip-anchors in rock climbing: invisible until you need them.
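A rough sketch of how a prototype codebook could bias post-RoPE keys and values follows. The soft softmax lookup and the two bias projections are assumptions; the text specifies only a 64-prototype learnable codebook applied to K and V after RoPE in all 24 layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlipAnchors(nn.Module):
    """Illustrative sketch: soft codebook lookup that biases K/V (not Eve-3's exact mechanism)."""

    def __init__(self, d_model=2048, n_prototypes=64):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(n_prototypes, d_model) * 0.02)
        self.to_k_bias = nn.Linear(d_model, d_model, bias=False)
        self.to_v_bias = nn.Linear(d_model, d_model, bias=False)

    def forward(self, h, k, v):
        # h: (batch, seq, d_model); k, v: post-RoPE keys/values, same shape
        sim = F.softmax(h @ self.codebook.t(), dim=-1)  # (batch, seq, 64) prototype weights
        anchor = sim @ self.codebook                    # soft lookup into the codebook
        return k + self.to_k_bias(anchor), v + self.to_v_bias(anchor)
```

The additive form matters for the "invisible until you need them" behavior: when the hidden state matches no failure-mode prototype strongly, the bias is diffuse and small; a strong match produces a directed correction.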

Experience Stream (~15.8M params)

A per-token state vector (d=256) that flows layer-to-layer through the network, accumulating context as representations deepen. Includes a curiosity auxiliary loss with stop-gradient to encourage exploration of uncertain regions. This is not an RNN — there is no token-to-token recurrence — but it gives each layer access to a running summary of what earlier layers found interesting.
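One way such a layer-to-layer (not token-to-token) stream could look in code: the gated update rule and the curiosity target below are assumptions; the text specifies only d=256, layer-wise flow, and a curiosity auxiliary loss with stop-gradient.

```python
import torch
import torch.nn as nn

class ExperienceStream(nn.Module):
    """Sketch of a per-token state updated at each layer (illustrative update rule)."""

    def __init__(self, d_model=2048, d_stream=256):
        super().__init__()
        self.down = nn.Linear(d_model, d_stream)
        self.gate = nn.Linear(d_model + d_stream, d_stream)
        self.predict = nn.Linear(d_stream, d_stream)  # predictor for the curiosity loss

    def forward(self, h, stream):
        # h: (batch, seq, d_model) layer output; stream: (batch, seq, d_stream)
        g = torch.sigmoid(self.gate(torch.cat([h, stream], dim=-1)))
        new_stream = g * torch.tanh(self.down(h)) + (1 - g) * stream
        # Curiosity loss: predict the updated state from the old one; detach the
        # target (stop-gradient) so the loss trains only the predictor, and its
        # magnitude flags states the predictor finds surprising.
        curiosity = (self.predict(stream) - new_stream.detach()).pow(2).mean()
        return new_stream, curiosity
```

Because the stream indexes per token and updates per layer, each layer reads a running summary of what earlier layers computed for that same token, with no recurrence across the sequence dimension.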

Resonant FFN (~50.4M params)

Sinusoidal modulation of the SwiGLU output with a learned alpha blend (initialized at ~0.95, i.e. near-pure SwiGLU). Applied in the 12 even-indexed layers. The resonance creates periodic emphasis patterns that can capture rhythmic structure in language — repetition, parallelism, rhetorical patterns — without explicit positional engineering.
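A sketch of the blend, assuming the sinusoid runs over sequence position with learned per-channel frequency and phase; the exact parameterization of Eve-3's resonance is not specified in the text, so treat everything beyond "SwiGLU output blended with a sinusoidally modulated copy via learned alpha" as an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResonantSwiGLU(nn.Module):
    """SwiGLU FFN with an illustrative sinusoidal modulation branch."""

    def __init__(self, d_model=2048, d_ff=2855):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)
        self.freq = nn.Parameter(torch.randn(d_model) * 0.1)   # per-channel frequency
        self.phase = nn.Parameter(torch.zeros(d_model))
        self.alpha = nn.Parameter(torch.tensor(0.95))          # ~0.95 => near-pure SwiGLU at init

    def forward(self, x):
        # x: (batch, seq, d_model)
        y = self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))       # standard SwiGLU
        pos = torch.arange(x.size(1), device=x.device, dtype=x.dtype)
        wave = torch.sin(pos[:, None] * self.freq + self.phase)      # (seq, d_model)
        return self.alpha * y + (1 - self.alpha) * (y * wave)
```

Initializing alpha near 1.0 means the module starts as an ordinary SwiGLU FFN and only learns to lean on the periodic branch where it helps, which keeps early training stable.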

Architecture

Parameter      Value
d_model        2048
Layers         24
Heads          16
Head dim       128
d_ff           2855
Total params   1,000,001,548

Design Philosophy

Both Eve models share a core philosophy: train from scratch, not from checkpoints. Fine-tuning teaches models new behaviors on top of existing representations; from-scratch training lets you test whether architectural innovations actually change how representations form in the first place.