neural / backprop
How networks learn.
Watch gradients flow backward through every connection.
A neural network is just multiply-and-add, twice: inputs through hidden layer, hidden through output. Backpropagation is how it learns — computing how much each weight contributed to the error, then nudging every weight to make the error smaller. Drag the inputs. Watch the math.
interactive backpropagation
adjust inputs and target · step to see forward + backward pass · gradient colors show direction and magnitude
chain rule
dL/dw
Every gradient is a product of local gradients along the path from loss to weight. Backpropagation is just the chain rule applied systematically, layer by layer.
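The product-of-local-gradients idea can be checked numerically on a single path. A minimal sketch with illustrative values (one input, one weight, tanh, squared-error loss — not the widget's actual network):

```python
import math

# Chain rule on a single path: the loss L depends on the weight w only
# through intermediate values, so dL/dw is the product of the local
# derivatives along that path. All values are illustrative.
x, w, target = 1.0, 0.5, 1.0

a = w * x                      # pre-activation
h = math.tanh(a)               # activation
L = 0.5 * (h - target) ** 2    # loss

# local gradients along the path, multiplied from loss back to weight
dL_dh = h - target             # d/dh of 0.5*(h - t)^2
dh_da = 1.0 - h ** 2           # derivative of tanh
da_dw = x                      # d/dw of w*x
dL_dw = dL_dh * dh_da * da_dw  # chain rule

# sanity check against a finite-difference approximation
eps = 1e-6
L_plus = 0.5 * (math.tanh((w + eps) * x) - target) ** 2
numeric = (L_plus - L) / eps
print(dL_dw, numeric)
```

The analytic product and the finite-difference estimate agree to several decimal places, which is exactly the guarantee the chain rule provides.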
forward pass
x → y
Values flow left to right: inputs are multiplied by weights, passed through activation functions (tanh, sigmoid), and produce an output and a loss.
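The "multiply-and-add, twice" forward pass fits in a few lines. A sketch with a 2-2-1 network and arbitrary illustrative weights (not the values in the widget above):

```python
import math

# Forward pass sketch: 2 inputs -> 2 tanh hidden units -> 1 sigmoid
# output. Weights are illustrative, not the widget's actual values.
x = [1.0, 0.0]                   # inputs x1, x2
W1 = [[0.5, -0.3], [0.8, 0.2]]   # W1[j][i] connects input i to hidden unit j
W2 = [0.7, -0.6]                 # output weights
target = 1.0

# hidden layer: multiply-and-add, then squash
h = [math.tanh(sum(W1[j][i] * x[i] for i in range(2))) for j in range(2)]
# output layer: multiply-and-add again, then squash
z = sum(W2[j] * h[j] for j in range(2))
y = 1.0 / (1.0 + math.exp(-z))   # sigmoid
loss = 0.5 * (y - target) ** 2   # squared error

print(y, loss)
```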
backward pass
∇ ←
Gradients flow right to left: how much each weight contributed to the error. Large gradient = large responsibility = large update.
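The backward pass for the same toy 2-2-1 shape can be sketched explicitly: gradients move from loss to output weights to hidden units to input weights, each one built from the local gradients downstream of it. Weights are illustrative:

```python
import math

# Backward pass sketch for a 2-2-1 network with illustrative weights.
x = [1.0, 0.0]
W1 = [[0.5, -0.3], [0.8, 0.2]]
W2 = [0.7, -0.6]
target = 1.0

# forward pass, caching activations for reuse in the backward pass
h = [math.tanh(sum(W1[j][i] * x[i] for i in range(2))) for j in range(2)]
z = sum(W2[j] * h[j] for j in range(2))
y = 1.0 / (1.0 + math.exp(-z))
loss = 0.5 * (y - target) ** 2

# backward pass: right to left
dL_dy = y - target
dy_dz = y * (1.0 - y)                     # sigmoid derivative
dL_dz = dL_dy * dy_dz
dL_dW2 = [dL_dz * h[j] for j in range(2)]            # output-weight gradients
dL_dh = [dL_dz * W2[j] for j in range(2)]            # gradient into hidden units
dL_dW1 = [[dL_dh[j] * (1.0 - h[j] ** 2) * x[i]       # tanh derivative, then input
           for i in range(2)] for j in range(2)]

print(dL_dW2, dL_dW1)
```

Note that every gradient into W1 is scaled by its input: with x2 = 0, the weights on that input get exactly zero gradient — no contribution, no responsibility, no update.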
learning rate
η
How far to step in the gradient direction. Too large: overshoot. Too small: slow convergence. This is the most important hyperparameter.
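The overshoot/slow-convergence trade-off shows up even on a 1-D quadratic, L(w) = w², whose gradient is 2w. A minimal sketch, same number of steps, three step sizes (the specific η values are illustrative):

```python
# Learning-rate trade-off on L(w) = w^2, gradient 2w: the same 20
# gradient steps with three different step sizes.
def descend(eta, steps=20, w=1.0):
    for _ in range(steps):
        w = w - eta * 2 * w   # gradient step: w <- w - eta * dL/dw
    return abs(w)

print(descend(0.01))  # too small: barely moves toward the minimum
print(descend(0.4))   # well chosen: converges quickly
print(descend(1.1))   # too large: each step overshoots, |w| grows
```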
how it works
What backprop does
Given an error at the output, backpropagation answers: how much did each weight contribute? It works backward through the network, multiplying local gradients along every path from loss to weight. The result tells you exactly how to adjust each connection to reduce the error.
Why gradients vary
A weight connecting a highly activated input to a highly sensitive output gets a large gradient — it had outsized influence on the error. A weight on a near-zero activation path gets almost no gradient. This is why dead neurons are a problem: zero activation means zero gradient means no learning.
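The dead-neuron failure is starkest with ReLU (the toy network above uses tanh and sigmoid; ReLU is used here because it makes the effect exact rather than approximate). All values are illustrative:

```python
# Dead neuron sketch with ReLU: a negative pre-activation gives zero
# activation AND zero local derivative, so no gradient can pass
# through the neuron in either direction. Values are illustrative.
def relu(a): return max(0.0, a)
def relu_grad(a): return 1.0 if a > 0 else 0.0

x, w_in, w_out, target = 1.0, -0.5, 0.3, 1.0

a = w_in * x          # pre-activation is negative: the neuron is "dead"
h = relu(a)           # h = 0, the neuron contributes nothing forward
y = w_out * h         # output sees nothing from this neuron
dL_dy = y - target
dL_dw_in = dL_dy * w_out * relu_grad(a) * x  # zero: no learning signal

print(h, dL_dw_in)
```

No matter how large the error, `w_in` never updates — the neuron cannot recover on its own.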
The color code
Teal connections have positive gradients — the weight needs to decrease to reduce loss. Orange connections have negative gradients — the weight needs to increase. Thickness shows magnitude: thicker means the weight will change more in the next step.
Try this
Set x1=1, x2=0, target=1. Click '1 step' and watch the forward pulse, then the gradient wave flowing back. Now click '+10 steps' repeatedly — watch the loss sparkline drop as the network learns to map that input to the target.
the deeper point
Backpropagation scales. This 6-node toy network uses the same algorithm as a 175-billion parameter language model. The chain rule doesn't care about size — it computes every gradient in one backward pass, proportional to the cost of one forward pass. That's the insight that made deep learning work: not a better algorithm, but the realization that the obvious algorithm was already efficient enough.
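The efficiency claim can be made concrete with a rough count. The obvious alternative, finite differences, needs one extra forward pass per weight; backprop gets every gradient from a single backward sweep costing about as much as the forward pass. The 2-FLOPs-per-weight estimate and the GPT-3-scale parameter count below are back-of-envelope assumptions, not measurements:

```python
# Back-of-envelope cost comparison: finite differences vs backprop.
# Assumes ~2 FLOPs (one multiply-add) per weight per forward pass.
weights = 175_000_000_000               # GPT-3-scale parameter count
flops_per_forward = 2 * weights

finite_diff_cost = (weights + 1) * flops_per_forward  # one pass per weight
backprop_cost = 2 * flops_per_forward                 # forward + backward

print(finite_diff_cost // backprop_cost)  # backprop's advantage, in passes
```

Under these assumptions the gap is tens of billions to one — the difference between trainable and hopeless.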