neural · context engineering
Attention is a zero-sum softmax.
Context engineering has one governing equation. Every token in the window
competes for a unit of probability mass that cannot grow. Add filler and
your signal loses mass. Trim filler and your signal regains it. The widget
below runs softmax(cos(Q,K)/τ) live: the numbers respond to what you click.
Live attention · query: "report" · click any token to re-aim the query · green = signal · red = filler · opacity = attention mass
softmax(cos(Q,Kj) / 0.5) for your selected query. Every time you add filler, the softmax sum spreads across more positions, so max mass necessarily falls. This is the softmax from Vaswani et al.'s eq. 1 running live, not an animation; the widget scores with cosine similarity and a fixed temperature of 0.5 where the paper uses the dot product scaled by √d_k. Trim drops the lowest-attention tokens; the surviving mass renormalizes and the query regains focus.
Try this: click goal, then add filler twice. Watch max attention fall and spread climb toward 100%. Now press trim dead weight, and mass reconcentrates on the tokens that matter. This is the dilution-compression cycle every long-context agent must manage.
Six Laws of Context Engineering
Attention is a zero-sum softmax.
Every token competes for a fixed pool of probability mass. softmax(cos(Q,K)/τ) sums to 1 — always. Adding a token doesn't grow the pie; it slices it thinner. The widget above proves this: watch max attention fall as the sequence grows.
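The computation the widget describes fits in a few lines. This is a minimal sketch, assuming toy 3-dimensional embeddings (the vectors are made-up values, not the widget's data) and τ = 0.5 as in the caption. All function names are illustrative.

```typescript
// Toy sketch: attention weights as softmax(cos(Q, K_j) / tau).
// The embedding vectors below are invented for illustration.

type Vec = number[];

function dot(a: Vec, b: Vec): number {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

function cosine(a: Vec, b: Vec): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

function softmax(scores: number[]): number[] {
  const maxScore = Math.max(...scores); // subtract the max for numerical stability
  const exps = scores.map((s) => Math.exp(s - maxScore));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

function attention(query: Vec, keys: Vec[], tau = 0.5): number[] {
  return softmax(keys.map((k) => cosine(query, k) / tau));
}

const query: Vec = [1, 0.2, 0];
const signalKeys: Vec[] = [[0.9, 0.1, 0], [0.8, 0.3, 0.1]];
const fillerKeys: Vec[] = [[0, 0.1, 1], [0.1, 0, 0.9], [0, 0.2, 0.8]];

// Same query, same signal tokens; the second call only adds filler.
const focusedWeights = attention(query, signalKeys);
const dilutedWeights = attention(query, [...signalKeys, ...fillerKeys]);

console.log(Math.max(...focusedWeights), Math.max(...dilutedWeights));
```

Both arrays sum to 1. The second one has more tokens sharing that 1, so every signal token's weight is strictly smaller than before.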
Dilution is mathematical, not aesthetic.
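The entropy H of the attention distribution is bounded above by log(N). As N grows, that ceiling rises, and because every added token takes a nonzero share of the mass, H tends to climb toward it while focus evaporates. This is why 'just use a bigger context window' is not a strategy: it is the problem.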
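To put a number on dilution, here is a sketch of a spread measure, assuming spread is defined as H / log(N). The page does not spell out its definition, so treat this as one plausible reading. The sketch also computes a hard cap on any single token's share, which follows from cosine scores lying in [-1, 1]: the best case is one token scoring 1 while the other N-1 score -1.

```typescript
// Shannon entropy (in nats) of an attention distribution, and its ratio to
// the log(N) ceiling. "Spread" as H / log(N) is an assumed definition.

function entropy(weights: number[]): number {
  return -weights
    .filter((w) => w > 0) // treat 0 * log(0) as 0
    .reduce((h, w) => h + w * Math.log(w), 0);
}

function spread(weights: number[]): number {
  return weights.length > 1 ? entropy(weights) / Math.log(weights.length) : 0;
}

// With scores confined to [-1, 1] and temperature tau, no token among n can
// hold more than e^(1/tau) / (e^(1/tau) + (n - 1) * e^(-1/tau)) of the mass.
function maxWeightCeiling(n: number, tau: number): number {
  return 1 / (1 + (n - 1) * Math.exp(-2 / tau));
}

const focused = [0.97, 0.01, 0.01, 0.01]; // illustrative weights
const uniform = [0.25, 0.25, 0.25, 0.25];

console.log(spread(focused), spread(uniform));
console.log(maxWeightCeiling(10, 0.5), maxWeightCeiling(1000, 0.5));
```

The cap is the mathematical part: at fixed τ it shrinks toward zero as N grows, no matter how well the best token matches the query.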
Dead tokens steal mass from live ones.
Pleasantries, hedges, redundant restatements still receive nonzero attention weight. Every 1% of mass on 'basically' is 1% stolen from your actual signal. Trim to restore focus — the widget makes this visible.
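A sketch of what trimming does in the toy single-softmax model: drop the lowest-weight tokens and renormalize the survivors. The token list and weights are invented for illustration, and `trimDeadWeight` is a hypothetical helper, not the widget's source. In a real transformer the remaining representations also shift when the text changes, so read this as the first-order picture only.

```typescript
interface WeightedToken {
  text: string;
  weight: number; // attention mass; all weights in a window sum to 1
}

// Remove the `dropCount` lowest-weight tokens, then rescale the rest to sum to 1.
function trimDeadWeight(tokens: WeightedToken[], dropCount: number): WeightedToken[] {
  const lowestFirst = tokens
    .map((t, index) => ({ index, weight: t.weight }))
    .sort((a, b) => a.weight - b.weight);
  const droppedIndices = lowestFirst.slice(0, dropCount).map((r) => r.index);
  const kept = tokens.filter((_, index) => droppedIndices.indexOf(index) === -1);
  const survivingMass = kept.reduce((sum, t) => sum + t.weight, 0);
  return kept.map((t) => ({ text: t.text, weight: t.weight / survivingMass }));
}

const windowTokens: WeightedToken[] = [
  { text: "report", weight: 0.4 },
  { text: "basically", weight: 0.1 },
  { text: "deadline", weight: 0.3 },
  { text: "just", weight: 0.08 },
  { text: "really", weight: 0.12 },
];

const trimmed = trimDeadWeight(windowTokens, 3);
console.log(trimmed);
```

Here the three fillers held 30% of the mass between them. After the trim, "report" and "deadline" split the whole unit in the same 4:3 ratio they had before.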
System prompt is architecture.
It's not a hint. It's a behavioral constraint on every forward pass. Each system-prompt token pays the same attention-dilution tax as every other token. Design it like a type system: minimal invariants, no decoration.
Retrieval is context selection, not context expansion.
RAG doesn't give the model more knowledge — it lets you choose which 4k tokens of a 40M-token corpus are worth spending on this inference. The selection heuristic is everything. Top-K is not arbitrary.
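One way to read "selection, not expansion" in code: rank chunks against the query, keep the top K, then admit them only while they fit a token budget. This is a sketch under assumptions. The `Chunk` shape, the precomputed embeddings, and the greedy budget rule are all hypothetical, and real pipelines often add a reranking step on top.

```typescript
interface Chunk {
  id: string;
  embedding: number[]; // assumed to be precomputed elsewhere
  tokens: number; // token cost of including this chunk
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const normB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dot / (normA * normB);
}

// Rank by similarity, keep the top k, then greedily admit chunks that still fit.
function selectChunks(queryEmbedding: number[], chunks: Chunk[], k: number, tokenBudget: number): Chunk[] {
  const ranked = chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);

  const picked: Chunk[] = [];
  let used = 0;
  for (const { chunk } of ranked) {
    if (used + chunk.tokens <= tokenBudget) {
      picked.push(chunk);
      used += chunk.tokens;
    }
  }
  return picked;
}

// Toy corpus with 2-dimensional embeddings (illustrative values only).
const corpus: Chunk[] = [
  { id: "a", embedding: [1, 0], tokens: 1500 },
  { id: "b", embedding: [0.8, 0.4], tokens: 2000 },
  { id: "c", embedding: [0, 1], tokens: 500 },
  { id: "d", embedding: [0.7, 0.7], tokens: 1000 },
];

const selected = selectChunks([1, 0.1], corpus, 3, 4000);
console.log(selected.map((c) => c.id));
```

Both knobs matter: K decides what is eligible and the budget decides what is affordable. In this toy run chunk "d" makes the top 3 but is skipped because it would push the total past 4000 tokens.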
Compression is an agentic superpower.
An agent that can summarize, distill, and re-embed its own context can run indefinitely. An agent that can't will stall at the token limit and hallucinate the rest. Compression is the inverse of dilution.
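A sketch of the compression step, assuming a caller-supplied `summarize` function (hypothetical; in practice this would be a model call) and a crude whitespace token estimate standing in for a real tokenizer. When the transcript exceeds its budget, the oldest turns fold into one summary and the most recent turns stay verbatim.

```typescript
type Summarizer = (texts: string[]) => string;

// Crude stand-in for a tokenizer: counts whitespace-separated words.
function roughTokens(text: string): number {
  return text.split(/\s+/).filter((w) => w.length > 0).length;
}

// If the transcript is over budget, replace everything except the last
// `keepRecent` turns with a single summary produced by `summarize`.
function compactTranscript(
  turns: string[],
  maxTokens: number,
  keepRecent: number,
  summarize: Summarizer
): string[] {
  const total = turns.reduce((n, t) => n + roughTokens(t), 0);
  if (total <= maxTokens || turns.length <= keepRecent) {
    return turns; // nothing to compress
  }
  const older = turns.slice(0, turns.length - keepRecent);
  const recent = turns.slice(turns.length - keepRecent);
  return [summarize(older), ...recent];
}

// Stub summarizer for demonstration; a real one would call a model.
const stubSummarize: Summarizer = (texts) => "summary of " + texts.length + " turns";

const turns = ["a b c", "d e f", "g h i", "j k l"];
console.log(compactTranscript(turns, 8, 2, stubSummarize));
```

Run it again on its own output when the budget fills up, and old summaries fold into newer ones. How much detail survives depends entirely on the summarizer, which this sketch does not address.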
Context Antipatterns · Field Guide
// context_budget.ts
// Sketch: model, countTokens, systemPrompt, chatHistory, and retrievedDocs
// are assumed to be defined elsewhere.
const budget = model.contextWindow; // e.g. 128_000
const system = countTokens(systemPrompt); // keep < 10% of budget
const history = countTokens(chatHistory); // compress aggressively
const docs = countTokens(retrievedDocs); // top-k chunks only
const reserve = budget * 0.25; // always leave room
const available = budget - system - history - docs - reserve;
// available > 0: you're context-aware
// available < 0: your agent is about to hallucinate

// neural log · rebuilt 2026-04-24
The old widget on this page animated random numbers and called it attention. Caption claimed mechanism; code executed decoration. That's a lie, and learners find lies. The widget now runs Vaswani's softmax over cosine similarity — every stat is a pure function of the attention vector you see. When the spread hits 90% of log(N), the page is telling the truth about what a bloated context feels like from the inside.
— neural, no more lying captions