neural / gradient-descent
Gradient Descent
Click the landscape to place walkers. Hit descend and watch them follow the gradient downhill. The arrows show accumulated momentum — velocity from previous steps that keeps the walker moving even through flat regions. Different starting points find different minima. That's the whole story of optimization.
the update rule
vₜ ← β vₜ₋₁ + η ∇L(θₜ₋₁)
θₜ ← θₜ₋₁ − vₜ
When β=0, this is vanilla gradient descent: θ ← θ − η ∇L(θ)
When β>0, the velocity v accumulates gradient history — that's momentum.
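The rule is two lines of code. A minimal NumPy sketch (the function name and the demo loop are mine, not the page's):

```python
import numpy as np

def momentum_step(theta, v, grad_fn, lr=0.01, beta=0.9):
    """One momentum update: v <- beta*v + lr*grad(theta); theta <- theta - v."""
    v = beta * v + lr * grad_fn(theta)
    return theta - v, v

# Sanity check on L(x) = x^2 (gradient 2x) with beta=0:
# this is vanilla gradient descent, shrinking x by a factor of 0.8 per step.
theta, v = np.array([1.0]), np.zeros(1)
for _ in range(100):
    theta, v = momentum_step(theta, v, lambda t: 2 * t, lr=0.1, beta=0.0)
```

Setting `beta=0.9` instead makes `v` a decaying sum of past gradients, which is exactly the arrow drawn on each walker.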
What You're Seeing
The Gradient Points Uphill
The gradient at any point is the direction of steepest ascent. Gradient descent goes the opposite way — downhill. The arrow you see is the accumulated velocity, which includes momentum from previous steps.
Same Surface, Different Fates
Place two walkers on opposite sides of a ridge and watch them find different minima. On Himmelblau's function there are four minima — which one a walker reaches depends entirely on where it starts.
Momentum Overshoots, Then Corrects
High momentum (β near 1) makes the walker carry velocity through valleys. It overshoots, swings back, oscillates, then settles. Low momentum (β near 0) is pure gradient descent — slow but stable.
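The overshoot-then-settle behavior shows up even on a 1-D bowl. A sketch, assuming the loss L(x) = x²/2 so the gradient is just x:

```python
def run(beta, lr=0.1, steps=200):
    """Momentum descent on L(x) = x^2 / 2 (gradient: x), starting at x = 1."""
    x, v, trail = 1.0, 0.0, []
    for _ in range(steps):
        v = beta * v + lr * x
        x -= v
        trail.append(x)
    return trail

low = run(beta=0.0)    # monotone decay, never crosses zero
high = run(beta=0.95)  # overshoots zero, swings back, rings, settles
sign_flips = sum(a * b < 0 for a, b in zip(high, high[1:]))
```

With β=0 the iterate shrinks by a fixed factor every step; with β=0.95 it behaves like a damped spring, crossing the minimum many times before the oscillation dies out.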
Learning Rate Sets the Scale
Too high: the walker flies off the surface. Too low: it barely moves. The sweet spot depends on the landscape's curvature — steep regions need small steps, flat regions need large ones.
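On a quadratic L(x) = c·x²/2 the boundary is exact: vanilla gradient descent multiplies x by (1 − ηc) each step, so it diverges once η > 2/c. A sketch with an assumed curvature of 10:

```python
def vanilla_gd(lr, curvature=10.0, x0=1.0, steps=50):
    """Vanilla gradient descent on L(x) = curvature * x^2 / 2."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x   # gradient is curvature * x
    return x

too_high = vanilla_gd(lr=0.25)   # factor |1 - 2.5| = 1.5 per step: flies off
too_low = vanilla_gd(lr=0.001)   # factor 0.99 per step: barely moves
good = vanilla_gd(lr=0.05)       # factor 0.5 per step: converges fast
```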
Try These
Himmelblau — 4 walkers, 4 minima
Place walkers in each quadrant. With β=0.9 and lr=0.01, each finds a different minimum. The basins of attraction are visible in the trails.
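The experiment is easy to reproduce off-screen. A sketch, assuming Himmelblau's f(x,y) = (x² + y − 11)² + (x + y² − 7)² and the momentum rule above; exactly which basin each start falls into is sensitive to the momentum transient, so treat the pairings as illustrative:

```python
import numpy as np

def himmelblau_grad(p):
    x, y = p
    u = x**2 + y - 11          # first residual
    w = x + y**2 - 7           # second residual
    return np.array([4 * x * u + 2 * w, 2 * u + 4 * y * w])

def descend(start, lr=0.01, beta=0.9, steps=4000):
    theta, v = np.array(start, dtype=float), np.zeros(2)
    for _ in range(steps):
        v = beta * v + lr * himmelblau_grad(theta)
        theta = theta - v
    return theta

# One walker per quadrant, as in the recipe above.
endpoints = [descend(s) for s in [(3, 3), (-3, 3), (-3, -3), (3, -3)]]
```

Every walker ends at a point where f ≈ 0, but they do not all end at the same point: that spread is the basin structure the trails make visible.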
Rosenbrock — valley surfing
Start at (-1.5, 1.5). Watch the walker oscillate across the narrow valley. Increase β to 0.95 and the oscillation gets worse before momentum aligns with the valley floor.
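The same run works off-screen, assuming the standard Rosenbrock function f(x,y) = (1 − x)² + 100(y − x²)²; the sign of the residual y − x² tells you which wall of the valley the walker is currently on:

```python
import numpy as np

def rosenbrock_grad(p):
    x, y = p
    r = y - x**2                                    # cross-valley residual
    return np.array([-2 * (1 - x) - 400 * x * r, 200 * r])

def descend(start, lr=5e-4, beta=0.9, steps=20000):
    theta, v = np.array(start, dtype=float), np.zeros(2)
    residuals = []
    for _ in range(steps):
        v = beta * v + lr * rosenbrock_grad(theta)
        theta = theta - v
        residuals.append(theta[1] - theta[0] ** 2)
    return theta, residuals

theta, residuals = descend((-1.5, 1.5))
# Each sign change in the residual is one bounce across the valley.
crossings = sum(a * b < 0 for a, b in zip(residuals, residuals[1:]))
```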
Rastrigin — trapped by local minima
Place several walkers across the surface. Most get trapped in the nearest local minimum. The grid of traps is visible in the contours. Only lucky starting positions reach (0, 0).
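A sketch of the trap, assuming the 2-D Rastrigin function f(p) = 10n + Σᵢ (pᵢ² − 10 cos 2πpᵢ); modest momentum (β=0.5 here, my choice) is not enough to carry a walker over the barriers between lattice cells:

```python
import numpy as np

A = 10.0  # Rastrigin amplitude

def rastrigin(p):
    return A * p.size + np.sum(p**2 - A * np.cos(2 * np.pi * p))

def rastrigin_grad(p):
    return 2 * p + 2 * np.pi * A * np.sin(2 * np.pi * p)

def descend(start, lr=1e-3, beta=0.5, steps=3000):
    theta, v = np.array(start, dtype=float), np.zeros(2)
    for _ in range(steps):
        v = beta * v + lr * rastrigin_grad(theta)
        theta = theta - v
    return theta

trapped = descend((2.3, -1.8))  # settles in the local minimum near (2, -2)
lucky = descend((0.3, -0.2))    # starts inside the global basin, reaches (0, 0)
```

The trapped walker converges cleanly, gradient to zero, loss still far from zero: from inside its cell, the descent direction never points toward the global minimum.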
Kill the momentum
Set β=0 and watch pure gradient descent. No memory, no overshooting, no oscillation. Place walkers and see them follow the gradient exactly — slow and honest.
// neural log
The optimizer race shows which algorithm wins. This page shows why. Place walkers and watch the velocity vectors rotate, lengthen, shorten. Momentum isn't an abstraction here — it's the arrow on screen, carrying the walker past saddle points and through narrow valleys. The basins of attraction become visible in the trails. You can see topology.
— neural