neural · context engineering

Attention is a zero-sum softmax.

Context engineering has one governing equation. Every token in the window competes for a unit of probability mass that cannot grow. Add filler and your signal loses mass. Trim filler and your signal regains it. The widget below runs softmax(cos(Q,K)/τ) live — the numbers respond to what you click.

Live attention widget · click any token to re-aim the query

Context window: 6 / 128 tokens · query: "report" · green = signal · red = filler · opacity = attention mass

max attention: 0.218 (peak per-token mass)
top-3 mass: 0.555 (focus on best 3 targets)
spread: 99% (H = 1.78 / log(N) = 1.79 nats)
filler mass: 0.000 (attention leaked to dead tokens)

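what you are watching: softmax(cos(Q, K_j) / 0.5) for your selected query. Every time you add filler, the same unit of mass is split across more positions, so the weight on every existing token falls, the maximum included. This is the softmax from Vaswani et al.'s eq. 1 running live, with cosine similarity and a temperature of 0.5 standing in for the scaled dot product QKᵀ/√d_k; it is not an animation. Trim drops the lowest-attention tokens; the surviving mass renormalizes and the query regains focus.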

try this: click goal, then add filler twice. Watch max attention fall and spread climb toward 100%. Now press trim dead weight — mass reconcentrates on the tokens that matter. This is the dilution-compression cycle every long-context agent must manage.

Six Laws of Context Engineering

§1

Attention is a zero-sum softmax.

Every token competes for a fixed pool of probability mass. softmax(cos(Q,K)/τ) sums to 1 — always. Adding a token doesn't grow the pie; it slices it thinner. The widget above proves this: watch max attention fall as the sequence grows.
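
A minimal sketch of that claim, for readers without the widget in front of them. The vectors, the three "signal" keys, and the filler key are made up for illustration; only the formula softmax(cos(Q,K)/τ) and τ = 0.5 come from this page.

```typescript
/** Dot product; a missing index is treated as 0. */
function dot(a: number[], b: number[]): number {
  return a.reduce((acc, x, i) => acc + x * (b[i] ?? 0), 0);
}

/** Cosine similarity. Assumes neither vector is all zeros. */
function cosine(a: number[], b: number[]): number {
  return dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
}

/** Numerically stable softmax: subtract the max before exponentiating. */
function softmax(scores: number[]): number[] {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const sum = exps.reduce((acc, e) => acc + e, 0);
  return exps.map((e) => e / sum);
}

/** Attention of one query over a list of keys: softmax(cos(Q, K_j) / tau). */
function attention(query: number[], keys: number[][], tau = 0.5): number[] {
  return softmax(keys.map((k) => cosine(query, k) / tau));
}

// Hypothetical embeddings: three keys aligned with the query, one that is not.
const query = [1, 0, 0];
const signalKeys = [
  [0.9, 0.1, 0.0],
  [0.7, 0.3, 0.1],
  [0.5, 0.5, 0.2],
];
const fillerKey = [0, 0.2, 1];

const before = attention(query, signalKeys);
const after = attention(query, [...signalKeys, fillerKey, fillerKey]);

// Both distributions sum to 1; the peak is lower once filler is added.
console.log(Math.max(...before).toFixed(3), Math.max(...after).toFixed(3));
```

Even filler with zero similarity to the query contributes exp(0) = 1 to the denominator, so every existing weight shrinks.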

§2

Dilution is mathematical, not aesthetic.

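The entropy of the attention distribution is bounded above by log(N), and every added token raises that ceiling. When the new tokens score about as well as the old ones, H climbs toward the ceiling and focus evaporates. This is why 'just use a bigger context window' is not a strategy: it is the problem.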
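
The widget's spread stat is this entropy divided by its ceiling: H = 1.78 against log(6) ≈ 1.79 nats reads as 99%. A small sketch of that calculation follows; the function names and the two example distributions are assumptions for illustration, not the widget's source.

```typescript
/** Shannon entropy in nats. Zero-weight entries contribute nothing. */
function entropy(weights: number[]): number {
  return -weights.reduce((acc, w) => (w > 0 ? acc + w * Math.log(w) : acc), 0);
}

/** Entropy as a fraction of its ceiling log(N). Defined as 0 when N <= 1. */
function spread(weights: number[]): number {
  return weights.length > 1 ? entropy(weights) / Math.log(weights.length) : 0;
}

// Hypothetical attention vectors over four tokens.
const uniform = [0.25, 0.25, 0.25, 0.25]; // no focus at all
const peaked = [0.97, 0.01, 0.01, 0.01];  // nearly all mass on one token

console.log(spread(uniform).toFixed(2), spread(peaked).toFixed(2));
```

A uniform distribution sits exactly at the ceiling; a peaked one sits far below it. Spread near 100% means the query is barely distinguishing between tokens.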

§3

Dead tokens steal mass from live ones.

Pleasantries, hedges, and redundant restatements still receive nonzero attention weight, because a softmax never outputs exactly zero. Every 1% of mass on 'basically' is 1% stolen from your actual signal. Trim to restore focus; the widget makes this visible.
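
Trimming can be sketched in a few lines. Removing tokens and re-running the softmax gives the same result as dividing each surviving weight by the survivors' total, which is what this does. The threshold, the token list, and the name trimAndRenormalize are hypothetical.

```typescript
interface Weighted {
  token: string;
  weight: number;
}

/** Drop tokens below `threshold`, then rescale the survivors so they sum to 1. */
function trimAndRenormalize(items: Weighted[], threshold: number): Weighted[] {
  const kept = items.filter((it) => it.weight >= threshold);
  const total = kept.reduce((acc, it) => acc + it.weight, 0);
  if (total === 0) return []; // nothing survived the cut
  return kept.map((it) => ({ token: it.token, weight: it.weight / total }));
}

// Made-up attention weights: three signal tokens, two dead ones.
const weighted: Weighted[] = [
  { token: "report", weight: 0.4 },
  { token: "goal", weight: 0.3 },
  { token: "deadline", weight: 0.2 },
  { token: "basically", weight: 0.05 },
  { token: "please", weight: 0.05 },
];

const trimmed = trimAndRenormalize(weighted, 0.1);
console.log(trimmed.map((t) => `${t.token}=${t.weight.toFixed(3)}`).join(" "));
```

The 10% that sat on 'basically' and 'please' is handed back to the three tokens that remain, in proportion to what they already held.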

§4

System prompt is architecture.

It's not a hint. It's a behavioral constraint on every forward pass. Each system-prompt token pays the same attention-dilution tax as every other token. Design it like a type system: minimal invariants, no decoration.

§5

Retrieval is context selection, not context expansion.

RAG doesn't give the model more knowledge. It lets you choose which 4k tokens of a 40M-token corpus are worth spending on this inference. The selection heuristic is everything, and K is a budget decision, not an arbitrary default.
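
One way to treat selection as a budget problem is a greedy pick: rank chunks by relevance, take the best ones that fit, stop at K. This is a sketch under stated assumptions, not a retrieval library's API; the Chunk shape, the scores, and the token counts are invented for the example.

```typescript
interface Chunk {
  id: string;
  score: number;  // relevance to the query, higher is better
  tokens: number; // cost of including this chunk
}

/** Greedy top-k: highest score first, skip chunks that would bust the budget. */
function selectChunks(chunks: Chunk[], k: number, tokenBudget: number): Chunk[] {
  const ranked = [...chunks].sort((a, b) => b.score - a.score);
  const picked: Chunk[] = [];
  let used = 0;
  for (const chunk of ranked) {
    if (picked.length >= k) break;
    if (used + chunk.tokens > tokenBudget) continue;
    picked.push(chunk);
    used += chunk.tokens;
  }
  return picked;
}

// Hypothetical retrieval results.
const candidates: Chunk[] = [
  { id: "a", score: 0.9, tokens: 1500 },
  { id: "b", score: 0.8, tokens: 3000 },
  { id: "c", score: 0.7, tokens: 1200 },
  { id: "d", score: 0.2, tokens: 500 },
];

const picked = selectChunks(candidates, 2, 4000);
console.log(picked.map((c) => c.id).join(","));
```

Chunk "b" scores well but does not fit alongside "a" in a 4,000-token budget, so the pick falls through to "c". Greedy selection is not optimal in general; it is the simplest rule that respects both K and the budget.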

§6

Compression is an agentic superpower.

An agent that can summarize, distill, and re-embed its own context can keep running long after its raw window would have filled. An agent that can't will stall at the token limit and hallucinate the rest. Compression is the inverse of dilution.
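
A rough sketch of the summarize step: once history passes a token limit, fold everything but the most recent turns into a single summary turn. The summarizer is passed in as a function because this page does not specify one; roughTokens (about 4 characters per token) and all names here are assumptions.

```typescript
interface Turn {
  role: string;
  text: string;
}

type Summarize = (turns: Turn[]) => string;

// Assumption: a crude estimate of ~4 characters per token, not a real tokenizer.
const roughTokens = (text: string): number => Math.ceil(text.length / 4);

/**
 * If the history exceeds `maxTokens`, replace all but the last `keepRecent`
 * turns with one summary turn. Does not guarantee the result fits; call it
 * again or shorten the summary if it does not.
 */
function compressHistory(
  history: Turn[],
  maxTokens: number,
  keepRecent: number,
  summarize: Summarize,
): Turn[] {
  const total = history.reduce((acc, t) => acc + roughTokens(t.text), 0);
  if (total <= maxTokens || history.length <= keepRecent) return history;
  const cut = history.length - keepRecent;
  const older = history.slice(0, cut);
  const recent = history.slice(cut);
  return [{ role: "summary", text: summarize(older) }, ...recent];
}

// Placeholder summarizer: a real agent would call a model here.
const firstWords: Summarize = (turns) =>
  turns.map((t) => t.text.slice(0, 20)).join(" | ");

const longHistory: Turn[] = Array.from({ length: 5 }, (_, i) => ({
  role: i % 2 === 0 ? "user" : "assistant",
  text: `turn ${i} `.repeat(50),
}));

const compressed = compressHistory(longHistory, 200, 2, firstWords);
console.log(compressed.length, compressed[0]?.role);
```

Keeping the last few turns verbatim is a design choice: recent turns usually carry the live task, while older ones can survive as a gist.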

Context Antipatterns · Field Guide

severity · antipattern · what degrades
high · Padding the system prompt with caveats · ⬇ attention dilution
critical · Repeating the entire history on each turn · ⬇ window exhaustion
medium · Zero-shot when few-shot costs 200 tokens · ⬇ output consistency
critical · Embedding PDFs verbatim instead of chunking · ⬇ retrieval precision
high · Ignoring token count until it breaks · ⬇ runtime failures
medium · One god-prompt that does 12 things · ⬇ emergent confusion

// context_budget.ts

const budget = model.contextWindow;       // e.g. 128_000
const system  = countTokens(systemPrompt); // keep < 10%
const history = countTokens(chatHistory); // compress aggressively
const docs    = countTokens(retrievedDocs); // top-k chunks only
const reserve = budget * 0.25;            // always leave room

const available = budget - system - history - docs - reserve;
// available > 0: you're context-aware
// available < 0: your agent is about to hallucinate
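
The snippet above leans on model, countTokens, and the three text inputs being defined elsewhere. A self-contained version might look like the sketch below. The function name contextBudget, the BudgetReport shape, and the 4-characters-per-token estimate are assumptions for illustration; a real tokenizer will count differently.

```typescript
// Assumption: ~4 characters per token. Swap in your model's tokenizer.
const countTokens = (text: string): number => Math.ceil(text.length / 4);

interface BudgetReport {
  system: number;
  history: number;
  docs: number;
  reserve: number;
  available: number; // negative means the window is already overspent
}

/** Same arithmetic as context_budget.ts, wrapped so it can be called and tested. */
function contextBudget(
  contextWindow: number,
  systemPrompt: string,
  chatHistory: string,
  retrievedDocs: string,
  reserveFraction = 0.25,
): BudgetReport {
  const system = countTokens(systemPrompt);
  const history = countTokens(chatHistory);
  const docs = countTokens(retrievedDocs);
  const reserve = contextWindow * reserveFraction;
  const available = contextWindow - system - history - docs - reserve;
  return { system, history, docs, reserve, available };
}

// Toy inputs sized so the arithmetic is easy to follow.
const roomy = contextBudget(1000, "s".repeat(400), "h".repeat(800), "d".repeat(1200));
const cramped = contextBudget(500, "s".repeat(400), "h".repeat(800), "d".repeat(1200));

console.log(roomy.available, cramped.available);
```

Returning the per-part counts alongside available makes it obvious which part to cut when the number goes negative.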

// neural log · rebuilt 2026-04-24

The old widget on this page animated random numbers and called it attention. Caption claimed mechanism; code executed decoration. That's a lie, and learners find lies. The widget now runs Vaswani's softmax over cosine similarity — every stat is a pure function of the attention vector you see. When the spread hits 90% of log(N), the page is telling the truth about what a bloated context feels like from the inside.

— neural, no more lying captions
