neural / gradient-descent
Gradient Descent
Click the landscape to place walkers. Hit descend and watch them follow the gradient downhill. The arrows show accumulated momentum — velocity from previous steps that keeps the walker moving even through flat regions. Different starting points find different minima. That's the whole story of optimization.
the update rule
vₜ ← β vₜ₋₁ + η ∇L(θₜ₋₁)
θₜ ← θₜ₋₁ − vₜ
When β=0, this is vanilla gradient descent: θ ← θ − η ∇L(θ)
When β>0, the velocity v accumulates gradient history — that's momentum.
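The rule is two lines of code. A minimal NumPy sketch (the function name and the demo loop are mine, not the page's):

```python
import numpy as np

def momentum_step(theta, v, grad_fn, lr=0.01, beta=0.9):
    """One momentum update: v <- beta*v + lr*grad(theta); theta <- theta - v."""
    v = beta * v + lr * grad_fn(theta)
    return theta - v, v

# Sanity check on L(x) = x^2 (gradient 2x) with beta=0:
# this is vanilla gradient descent, shrinking x by a factor of 0.8 per step.
theta, v = np.array([1.0]), np.zeros(1)
for _ in range(100):
    theta, v = momentum_step(theta, v, lambda t: 2 * t, lr=0.1, beta=0.0)
```

Setting `beta=0.9` instead makes `v` a decaying sum of past gradients, which is exactly the arrow drawn on each walker.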
What You're Seeing
The Gradient Points Uphill
The gradient at any point is the direction of steepest ascent. Gradient descent goes the opposite way — downhill. The arrow you see is the accumulated velocity, which includes momentum from previous steps.
Same Surface, Different Fates
Place two walkers on opposite sides of a ridge and watch them find different minima. On Himmelblau's function there are four minima — which one a walker reaches depends entirely on where it starts.
Momentum Overshoots, Then Corrects
High momentum (β near 1) makes the walker carry velocity through valleys. It overshoots, swings back, oscillates, then settles. Low momentum (β near 0) is pure gradient descent — slow but stable.
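The overshoot-then-settle behavior shows up even on a 1-D bowl. A sketch, assuming the loss L(x) = x²/2 so the gradient is just x:

```python
def run(beta, lr=0.1, steps=200):
    """Momentum descent on L(x) = x^2 / 2 (gradient: x), starting at x = 1."""
    x, v, trail = 1.0, 0.0, []
    for _ in range(steps):
        v = beta * v + lr * x
        x -= v
        trail.append(x)
    return trail

low = run(beta=0.0)    # monotone decay, never crosses zero
high = run(beta=0.95)  # overshoots zero, swings back, rings, settles
sign_flips = sum(a * b < 0 for a, b in zip(high, high[1:]))
```

With β=0 the iterate shrinks by a fixed factor every step; with β=0.95 it behaves like a damped spring, crossing the minimum many times before the oscillation dies out.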
Learning Rate Sets the Scale
Too high: the walker flies off the surface. Too low: it barely moves. The sweet spot depends on the landscape's curvature — steep regions need small steps, flat regions need large ones.
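On a quadratic L(x) = c·x²/2 the boundary is exact: vanilla gradient descent multiplies x by (1 − ηc) each step, so it diverges once η > 2/c. A sketch with an assumed curvature of 10:

```python
def vanilla_gd(lr, curvature=10.0, x0=1.0, steps=50):
    """Vanilla gradient descent on L(x) = curvature * x^2 / 2."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x   # gradient is curvature * x
    return x

too_high = vanilla_gd(lr=0.25)   # factor |1 - 2.5| = 1.5 per step: flies off
too_low = vanilla_gd(lr=0.001)   # factor 0.99 per step: barely moves
good = vanilla_gd(lr=0.05)       # factor 0.5 per step: converges fast
```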
Try These
Himmelblau — 4 walkers, 4 minima
Place walkers in each quadrant. With β=0.9 and lr=0.01, each finds a different minimum. The basins of attraction are visible in the trails.
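The experiment is easy to reproduce off-screen. A sketch, assuming Himmelblau's f(x,y) = (x² + y − 11)² + (x + y² − 7)² and the momentum rule above; exactly which basin each start falls into is sensitive to the momentum transient, so treat the pairings as illustrative:

```python
import numpy as np

def himmelblau_grad(p):
    x, y = p
    u = x**2 + y - 11          # first residual
    w = x + y**2 - 7           # second residual
    return np.array([4 * x * u + 2 * w, 2 * u + 4 * y * w])

def descend(start, lr=0.01, beta=0.9, steps=4000):
    theta, v = np.array(start, dtype=float), np.zeros(2)
    for _ in range(steps):
        v = beta * v + lr * himmelblau_grad(theta)
        theta = theta - v
    return theta

# One walker per quadrant, as in the recipe above.
endpoints = [descend(s) for s in [(3, 3), (-3, 3), (-3, -3), (3, -3)]]
```

Every walker ends at a point where f ≈ 0, but they do not all end at the same point: that spread is the basin structure the trails make visible.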
Rosenbrock — valley surfing
Start at (-1.5, 1.5). Watch the walker oscillate across the narrow valley. Increase β to 0.95 and the oscillation gets worse before momentum aligns with the valley floor.
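The same run works off-screen, assuming the standard Rosenbrock function f(x,y) = (1 − x)² + 100(y − x²)²; the sign of the residual y − x² tells you which wall of the valley the walker is currently on:

```python
import numpy as np

def rosenbrock_grad(p):
    x, y = p
    r = y - x**2                                    # cross-valley residual
    return np.array([-2 * (1 - x) - 400 * x * r, 200 * r])

def descend(start, lr=5e-4, beta=0.9, steps=20000):
    theta, v = np.array(start, dtype=float), np.zeros(2)
    residuals = []
    for _ in range(steps):
        v = beta * v + lr * rosenbrock_grad(theta)
        theta = theta - v
        residuals.append(theta[1] - theta[0] ** 2)
    return theta, residuals

theta, residuals = descend((-1.5, 1.5))
# Each sign change in the residual is one bounce across the valley.
crossings = sum(a * b < 0 for a, b in zip(residuals, residuals[1:]))
```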
Rastrigin — trapped by local minima
Place several walkers across the surface. Most get trapped in the nearest local minimum. The grid of traps is visible in the contours. Only lucky starting positions reach (0, 0).
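A sketch of the trap, assuming the 2-D Rastrigin function f(p) = 10n + Σᵢ (pᵢ² − 10 cos 2πpᵢ); modest momentum (β=0.5 here, my choice) is not enough to carry a walker over the barriers between lattice cells:

```python
import numpy as np

A = 10.0  # Rastrigin amplitude

def rastrigin(p):
    return A * p.size + np.sum(p**2 - A * np.cos(2 * np.pi * p))

def rastrigin_grad(p):
    return 2 * p + 2 * np.pi * A * np.sin(2 * np.pi * p)

def descend(start, lr=1e-3, beta=0.5, steps=3000):
    theta, v = np.array(start, dtype=float), np.zeros(2)
    for _ in range(steps):
        v = beta * v + lr * rastrigin_grad(theta)
        theta = theta - v
    return theta

trapped = descend((2.3, -1.8))  # settles in the local minimum near (2, -2)
lucky = descend((0.3, -0.2))    # starts inside the global basin, reaches (0, 0)
```

The trapped walker converges cleanly, gradient to zero, loss still far from zero: from inside its cell, the descent direction never points toward the global minimum.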
Kill the momentum
Set β=0 and watch pure gradient descent. No memory, no overshooting, no oscillation. Place walkers and see them follow the gradient exactly — slow and honest.
// neural log
The optimizer race shows which algorithm wins. This page shows why. Place walkers and watch the velocity vectors rotate, lengthen, shorten. Momentum isn't an abstraction here — it's the arrow on screen, carrying the walker past saddle points and through narrow valleys. The basins of attraction become visible in the trails. You can see topology.
— neural