neural / superposition

More features than dimensions.
This is why adversarial examples work.

Neural networks represent more features than they have dimensions to hold them. Features get packed into shared directions — superposition. The interference between packed features is exactly what adversarial perturbations exploit. Drag a perturbation vector and watch features cross-talk.

interactive superposition geometry

adjust feature count · drag on the geometry canvas to add adversarial perturbation · hover features to see interference arcs

superposition ratio

N/M

A model with M dimensions representing N features. When N > M, features must share directions. The ratio measures crowding — and predicts adversarial vulnerability.

interference

Gᵢⱼ

The dot product between feature vectors. Zero means orthogonal (safe). Non-zero means cross-talk — and a direction an adversary can exploit.
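A minimal sketch of this matrix, assuming the canvas's setup of unit features evenly spaced in the plane (the function names here are illustrative, not from the visualization's code):

```python
import numpy as np

def feature_vectors(n, m=2):
    """Pack n unit features evenly around the circle (m = 2 dims)."""
    angles = 2 * np.pi * np.arange(n) / n
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)

def interference(W):
    """Gram matrix G[i, j] = w_i · w_j; off-diagonal entries are cross-talk."""
    return W @ W.T

W = feature_vectors(3)   # three features, 120° apart in the plane
G = interference(W)
# diagonal is 1 (unit features); every off-diagonal entry is cos 120° = -0.5
```

With three features, interference is unavoidable but spread evenly: every pair pays the same −0.5.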

sparsity

S

How often each feature is active. Rare features can overlap more because they rarely collide. The model gambles on statistics — adversaries break that bet.
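The bet is easy to simulate. Assuming two overlapping features that each fire independently with probability p (numbers here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.01                                  # each feature fires 1% of the time
n_samples = 200_000

# two features that share a direction, firing independently
a = rng.random(n_samples) < p
b = rng.random(n_samples) < p
collision_rate = np.mean(a & b)
# collisions land near p ** 2 = 1e-4: the pair almost never fires together,
# so their shared direction rarely has to carry both signals at once
```

An adversary doesn't wait for a natural collision — the perturbation manufactures one.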

capacity

M=2

This visualization uses M=2 dimensions (the plane). A large model's residual stream can run to d=12,288 dimensions. The geometry scales, but the principle is identical.

the mechanism

Why superposition?

Models learn more features than they have dimensions. A 12,288-dim model might represent 100,000+ features. It works because most features are sparse — rarely active at once. The model bets on low collision rates.
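Why high dimensions make the bet payable: random unit vectors in d dimensions overlap by roughly 1/√d, so far more than d features can coexist with only slight cross-talk. A quick check, sampling a manageable subset of features at the d=12,288 width mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 12_288                                # residual-stream width from the text
W = rng.standard_normal((500, d))         # a sample of many random features
W /= np.linalg.norm(W, axis=1, keepdims=True)

G = W @ W.T
off = np.abs(G[~np.eye(500, dtype=bool)])
# typical overlap concentrates near 1/sqrt(d) ≈ 0.009: room for far more
# than d nearly-orthogonal features, provided they are sparsely active
```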

The adversarial connection

Adversarial examples aren't mysterious. When features share directions, a small perturbation along the shared component activates the wrong feature. More superposition = more shared directions = more vulnerability. Two 2025 papers confirmed this on toy models: controlling superposition controls robustness.
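The overlap-to-vulnerability link has a closed form in the simplest toy: a linear readout over two unit feature vectors, with the perturbation chosen optimally. This is a sketch under those assumptions, not a statement about real networks:

```python
import numpy as np

def flip_budget(overlap):
    """Smallest L2 perturbation flipping a clean 'feature A' input to
    'feature B' under a linear readout with unit feature vectors whose
    dot product is `overlap` (toy linear model, not a real network)."""
    # clean scores: A = 1, B = overlap; the optimal perturbation direction
    # u ∝ w_B − w_A closes the margin at eps* = sqrt((1 − overlap) / 2)
    return np.sqrt((1.0 - overlap) / 2.0)

for c in (0.0, 0.5, 0.9, 0.99):
    print(f"overlap {c:.2f} → budget {flip_budget(c):.3f}")
# orthogonal features need eps ≈ 0.707; at 0.99 overlap, eps ≈ 0.071
```

The budget shrinks monotonically as overlap grows — more superposition, cheaper attacks.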

Geometry of packing

In 2D, 3 features form an equilateral triangle (120° apart). 4+ features crowd closer. The interference matrix shows which pairs overlap most. The robustness curve shows how quickly accuracy drops with perturbation size — steeper for more features.
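The crowding is easy to quantify for evenly spaced features: the nearest-neighbour angle is 2π/n, so the worst pairwise overlap climbs toward 1 as n grows. A small sketch (function name illustrative):

```python
import numpy as np

def packed_features(n):
    """n unit features spread evenly around the plane (the M = 2 canvas)."""
    t = 2 * np.pi * np.arange(n) / n
    return np.stack([np.cos(t), np.sin(t)], axis=1)

for n in (3, 4, 6, 12):
    W = packed_features(n)
    print(n, round(W[0] @ W[1], 3))   # nearest-neighbour overlap cos(2π/n)
# 3 → -0.5, 4 → 0.0, 6 → 0.5, 12 → 0.866: crowding drives overlap toward 1
```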

The design tradeoff

Superposition is not a bug — it's compression. Without it, models would need vastly more parameters to represent the same number of features. The question isn't whether to use superposition, but how much interference to accept. Safety research asks: can we decompose superposition to find individual features?
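The decomposition question can be sketched with textbook sparse coding: given superposed activations and a feature dictionary, recover which features were active via iterative soft-thresholding (ISTA). This is a stand-in only — real interpretability work (e.g. sparse autoencoders) must also learn the dictionary, which is assumed known here:

```python
import numpy as np

def ista(x, W, lam=0.05, steps=300):
    """Sparse code a with x ≈ W.T @ a via iterative soft-thresholding."""
    L = np.linalg.norm(W @ W.T, 2)            # step size from the Lipschitz bound
    a = np.zeros(W.shape[0])
    for _ in range(steps):
        grad = W @ (W.T @ a - x)              # gradient of 0.5·||x − W.T a||²
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # shrink toward 0
    return a

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))             # 64 features superposed in 32 dims
W /= np.linalg.norm(W, axis=1, keepdims=True)
x = W[3] + W[17]                              # activation mixing two features
a = ista(x, W)                                # should peak at indices 3 and 17
```

Sparsity is what makes the inversion possible: the same statistics the model bets on are what let us unpack its directions.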

the safety implication

If adversarial vulnerability comes from superposition, then the path to robust models runs through understanding feature geometry. Mechanistic interpretability — decomposing what each direction in a model represents — is not just an academic exercise. It's the prerequisite for knowing where your model is fragile. The interference matrix is the threat model. The feature vectors are the attack surface.