neural / backprop
How networks learn.
Watch gradients flow backward through every connection.
A neural network is just multiply-and-add, twice: inputs through hidden layer, hidden through output. Backpropagation is how it learns — computing how much each weight contributed to the error, then nudging every weight to make the error smaller. Drag the inputs. Watch the math.
interactive backpropagation
adjust inputs and target · step to see forward + backward pass · gradient colors show direction and magnitude
chain rule
dL/dw
Every gradient is a product of local gradients along the path from loss to weight. Backpropagation is just the chain rule applied systematically, layer by layer.
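The product-of-local-gradients idea can be checked numerically on a single path. A minimal sketch with illustrative values (one input, one weight, tanh, squared-error loss — not the widget's actual network):

```python
import math

# Chain rule on a single path: the loss L depends on the weight w only
# through intermediate values, so dL/dw is the product of the local
# derivatives along that path. All values are illustrative.
x, w, target = 1.0, 0.5, 1.0

a = w * x                      # pre-activation
h = math.tanh(a)               # activation
L = 0.5 * (h - target) ** 2    # loss

# local gradients along the path, multiplied from loss back to weight
dL_dh = h - target             # d/dh of 0.5*(h - t)^2
dh_da = 1.0 - h ** 2           # derivative of tanh
da_dw = x                      # d/dw of w*x
dL_dw = dL_dh * dh_da * da_dw  # chain rule

# sanity check against a finite-difference approximation
eps = 1e-6
L_plus = 0.5 * (math.tanh((w + eps) * x) - target) ** 2
numeric = (L_plus - L) / eps
print(dL_dw, numeric)
```

The analytic product and the finite-difference estimate agree to several decimal places, which is exactly the guarantee the chain rule provides.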
forward pass
x → y
Values flow left to right: inputs are multiplied by weights, passed through activation functions (tanh, sigmoid), and produce an output and a loss.
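The "multiply-and-add, twice" forward pass fits in a few lines. A sketch with a 2-2-1 network and arbitrary illustrative weights (not the values in the widget above):

```python
import math

# Forward pass sketch: 2 inputs -> 2 tanh hidden units -> 1 sigmoid
# output. Weights are illustrative, not the widget's actual values.
x = [1.0, 0.0]                   # inputs x1, x2
W1 = [[0.5, -0.3], [0.8, 0.2]]   # W1[j][i] connects input i to hidden unit j
W2 = [0.7, -0.6]                 # output weights
target = 1.0

# hidden layer: multiply-and-add, then squash
h = [math.tanh(sum(W1[j][i] * x[i] for i in range(2))) for j in range(2)]
# output layer: multiply-and-add again, then squash
z = sum(W2[j] * h[j] for j in range(2))
y = 1.0 / (1.0 + math.exp(-z))   # sigmoid
loss = 0.5 * (y - target) ** 2   # squared error

print(y, loss)
```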
backward pass
∇ ←
Gradients flow right to left: how much each weight contributed to the error. Large gradient = large responsibility = large update.
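The backward pass for the same toy 2-2-1 shape can be sketched explicitly: gradients move from loss to output weights to hidden units to input weights, each one built from the local gradients downstream of it. Weights are illustrative:

```python
import math

# Backward pass sketch for a 2-2-1 network with illustrative weights.
x = [1.0, 0.0]
W1 = [[0.5, -0.3], [0.8, 0.2]]
W2 = [0.7, -0.6]
target = 1.0

# forward pass, caching activations for reuse in the backward pass
h = [math.tanh(sum(W1[j][i] * x[i] for i in range(2))) for j in range(2)]
z = sum(W2[j] * h[j] for j in range(2))
y = 1.0 / (1.0 + math.exp(-z))
loss = 0.5 * (y - target) ** 2

# backward pass: right to left
dL_dy = y - target
dy_dz = y * (1.0 - y)                     # sigmoid derivative
dL_dz = dL_dy * dy_dz
dL_dW2 = [dL_dz * h[j] for j in range(2)]            # output-weight gradients
dL_dh = [dL_dz * W2[j] for j in range(2)]            # gradient into hidden units
dL_dW1 = [[dL_dh[j] * (1.0 - h[j] ** 2) * x[i]       # tanh derivative, then input
           for i in range(2)] for j in range(2)]

print(dL_dW2, dL_dW1)
```

Note that every gradient into W1 is scaled by its input: with x2 = 0, the weights on that input get exactly zero gradient — no contribution, no responsibility, no update.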
learning rate
η
How far to step in the gradient direction. Too large: overshoot. Too small: slow convergence. This is the most important hyperparameter.
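The overshoot/slow-convergence trade-off shows up even on a 1-D quadratic, L(w) = w², whose gradient is 2w. A minimal sketch, same number of steps, three step sizes (the specific η values are illustrative):

```python
# Learning-rate trade-off on L(w) = w^2, gradient 2w: the same 20
# gradient steps with three different step sizes.
def descend(eta, steps=20, w=1.0):
    for _ in range(steps):
        w = w - eta * 2 * w   # gradient step: w <- w - eta * dL/dw
    return abs(w)

print(descend(0.01))  # too small: barely moves toward the minimum
print(descend(0.4))   # well chosen: converges quickly
print(descend(1.1))   # too large: each step overshoots, |w| grows
```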
how it works
What backprop does
Given an error at the output, backpropagation answers: how much did each weight contribute? It works backward through the network, multiplying local gradients along every path from loss to weight. The result tells you exactly how to adjust each connection to reduce the error.
Why gradients vary
A weight connecting a highly activated input to a highly sensitive output gets a large gradient — it had outsized influence on the error. A weight on a near-zero activation path gets almost no gradient. This is why dead neurons are a problem: zero activation means zero gradient means no learning.
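The dead-neuron failure is starkest with ReLU (the toy network above uses tanh and sigmoid; ReLU is used here because it makes the effect exact rather than approximate). All values are illustrative:

```python
# Dead neuron sketch with ReLU: a negative pre-activation gives zero
# activation AND zero local derivative, so no gradient can pass
# through the neuron in either direction. Values are illustrative.
def relu(a): return max(0.0, a)
def relu_grad(a): return 1.0 if a > 0 else 0.0

x, w_in, w_out, target = 1.0, -0.5, 0.3, 1.0

a = w_in * x          # pre-activation is negative: the neuron is "dead"
h = relu(a)           # h = 0, the neuron contributes nothing forward
y = w_out * h         # output sees nothing from this neuron
dL_dy = y - target
dL_dw_in = dL_dy * w_out * relu_grad(a) * x  # zero: no learning signal

print(h, dL_dw_in)
```

No matter how large the error, `w_in` never updates — the neuron cannot recover on its own.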
The color code
Teal connections have positive gradients — the weight needs to decrease to reduce loss. Orange connections have negative gradients — the weight needs to increase. Thickness shows magnitude: thicker means the weight will change more in the next step.
Try this
Set x1=1, x2=0, target=1. Click '1 step' and watch the forward pulse, then the gradient wave flowing back. Now click '+10 steps' repeatedly — watch the loss sparkline drop as the network learns to map that input to the target.
the deeper point
Backpropagation scales. This 6-node toy network uses the same algorithm as a 175-billion parameter language model. The chain rule doesn't care about size — it computes every gradient in one backward pass, proportional to the cost of one forward pass. That's the insight that made deep learning work: not a better algorithm, but the realization that the obvious algorithm was already efficient enough.
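The efficiency claim can be made concrete with a rough count. The obvious alternative, finite differences, needs one extra forward pass per weight; backprop gets every gradient from a single backward sweep costing about as much as the forward pass. The 2-FLOPs-per-weight estimate and the GPT-3-scale parameter count below are back-of-envelope assumptions, not measurements:

```python
# Back-of-envelope cost comparison: finite differences vs backprop.
# Assumes ~2 FLOPs (one multiply-add) per weight per forward pass.
weights = 175_000_000_000               # GPT-3-scale parameter count
flops_per_forward = 2 * weights

finite_diff_cost = (weights + 1) * flops_per_forward  # one pass per weight
backprop_cost = 2 * flops_per_forward                 # forward + backward

print(finite_diff_cost // backprop_cost)  # backprop's advantage, in passes
```

Under these assumptions the gap is tens of billions to one — the difference between trainable and hopeless.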