
Optimizer Race

Five optimizers. One landscape. Same start. Hit race and watch SGD, Momentum, Nesterov, RMSProp, and Adam compete to find the minimum. The loss chart reveals who converges, who oscillates, and who gets stuck. Change the surface, move the start, tune the learning rate — every configuration produces a different winner.

update rules

SGD:        θ ← θ - η ∇L

Momentum:  v ← βv + η∇L,  θ ← θ - v

Nesterov:  look-ahead at θ - βv, then update

RMSProp:   cache ← ρ·cache + (1-ρ)(∇L)²,  θ ← θ - η∇L/(√cache + ε)

Adam:      m,v track 1st/2nd moment, bias-corrected, per-param lr
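
A minimal sketch of one step of each rule, in NumPy. The hyperparameter names and defaults here (lr, beta, rho, eps) are illustrative assumptions, not the demo's exact settings.

import numpy as np

def sgd(theta, grad, lr=0.01):
    return theta - lr * grad

def momentum(theta, v, grad, lr=0.01, beta=0.9):
    v = beta * v + lr * grad              # velocity remembers past gradients
    return theta - v, v

def nesterov(theta, v, grad_fn, lr=0.01, beta=0.9):
    g = grad_fn(theta - beta * v)         # gradient at the look-ahead point
    v = beta * v + lr * g
    return theta - v, v

def rmsprop(theta, cache, grad, lr=0.01, rho=0.9, eps=1e-8):
    cache = rho * cache + (1 - rho) * grad**2
    return theta - lr * grad / (np.sqrt(cache) + eps), cache

def adam(theta, m, v, grad, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # 1st moment estimate
    v = b2 * v + (1 - b2) * grad**2       # 2nd moment estimate
    m_hat = m / (1 - b1**t)               # bias correction, t is 1-indexed
    v_hat = v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v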

What You're Seeing

No Free Lunch

No optimizer dominates all landscapes. SGD wins on simple bowls. Adam wins where gradients are noisy or sparse. Momentum wins in narrow valleys. The best optimizer depends on the terrain — and you never see the terrain during training.

Adaptive vs Fixed

SGD and Momentum use a single learning rate for all parameters. Adam, RMSProp, and AdaGrad adapt per-parameter. Adaptive methods converge faster but sometimes generalize worse — they find sharp minima that don't transfer.
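
A toy contrast (values illustrative, not from the demo): feed the same lopsided gradient to both rules. SGD's step inherits the imbalance; RMSProp, once its cache warms up, takes roughly equal steps in both directions.

import numpy as np

grad = np.array([10.0, 0.1])                   # steep vs shallow coordinate
lr, rho, eps = 0.01, 0.9, 1e-8

sgd_step = lr * grad                           # [0.1, 0.001]: 100x imbalance

cache = np.zeros(2)
for _ in range(100):                           # repeat the same gradient
    cache = rho * cache + (1 - rho) * grad**2
rms_step = lr * grad / (np.sqrt(cache) + eps)  # ~[0.01, 0.01]: roughly uniform

print(sgd_step, rms_step)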

Momentum as Memory

Momentum accumulates a velocity vector — the optimizer remembers where it was going. This helps it roll through small local minima and accelerate along consistent gradients. Nesterov looks ahead before correcting, giving it a slight edge.
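
How much memory? Under a constant gradient the velocity is a geometric series that settles at η·g / (1 − β), about 10x the plain SGD step at β = 0.9. A toy sketch with assumed values, not the demo's:

lr, beta, g = 0.01, 0.9, 1.0   # illustrative values
v = 0.0
for _ in range(200):
    v = beta * v + lr * g      # velocity accumulates the constant gradient
print(v, lr * g / (1 - beta))  # both ~0.1, i.e. 10x a single SGD step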

The Learning Rate is Everything

Too high: the optimizer overshoots and diverges. Too low: it crawls and gets trapped. Watch what happens when you push lr above 0.1 on Rastrigin — every optimizer breaks differently. The failure mode reveals the algorithm.
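
The cliff is easiest to see on a 1-D quadratic f(x) = ½λx². Each gradient step multiplies x by (1 − ηλ), so plain gradient descent diverges once η > 2/λ. A toy sketch with an assumed curvature:

lam = 25.0                      # assumed curvature, so 2/lam = 0.08
for lr in (0.05, 0.08, 0.1):
    x = 1.0
    for _ in range(50):
        x = x - lr * lam * x    # equivalent to x *= (1 - lr * lam)
    print(lr, x)                # shrinks to ~0, oscillates without shrinking, blows up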

Try These

Rosenbrock + lr=0.001

Watch Adam and RMSProp navigate the banana valley while SGD oscillates across it. Adaptive methods sense the anisotropic curvature.
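
For reference, the banana in its common a = 1, b = 100 form (whether the demo uses exactly these constants is an assumption). The b term makes the valley walls far steeper than the valley floor, which is the anisotropy in question.

import numpy as np

def rosenbrock(x, y, a=1.0, b=100.0):
    # curved valley along y = x**2, global minimum at (a, a**2)
    return (a - x)**2 + b * (y - x**2)**2

def rosenbrock_grad(x, y, a=1.0, b=100.0):
    dx = -2 * (a - x) - 4 * b * x * (y - x**2)
    dy = 2 * b * (y - x**2)
    return np.array([dx, dy])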

Rastrigin + lr=0.05

Crank the learning rate. Some optimizers diverge, others barely hold on. Which ones have built-in damping?
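
The surface in its standard A = 10 form (the demo's exact constants are an assumption). The cosine ripples carve a dense grid of local minima around the global minimum at the origin and make the gradient swing hard, which is exactly what a big step amplifies.

import numpy as np

def rastrigin(x, y, A=10.0):
    # global minimum 0 at the origin, local minima on roughly an integer grid
    return (2 * A + (x**2 - A * np.cos(2 * np.pi * x))
                  + (y**2 - A * np.cos(2 * np.pi * y)))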

Ackley from (3, 3)

The surface is nearly flat far from the origin. Which optimizer escapes the plateau first? AdaGrad's accumulated gradients become a liability here.
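
The usual 2-D Ackley form (constants a = 20, b = 0.2, c = 2π assumed, not read from the demo). Away from the origin the exponential terms saturate and the surface flattens into shallow ripples, which is the plateau this experiment probes.

import numpy as np

def ackley(x, y, a=20.0, b=0.2, c=2 * np.pi):
    term1 = -a * np.exp(-b * np.sqrt((x**2 + y**2) / 2))  # deep well at origin
    term2 = -np.exp((np.cos(c * x) + np.cos(c * y)) / 2)  # fine ripples
    return term1 + term2 + a + np.e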

Styblinski-Tang from (0, 0)

Start at the saddle point. Optimizers with no momentum stall. Those with it pick a direction — but which basin they fall into is sensitive to floating-point noise.
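
The surface in its standard form (any rescaling in the demo is an assumption). Each coordinate has two wells, near x ≈ −2.90 and x ≈ 2.75, so the 2-D surface has four basins, with the global minimum at roughly (−2.90, −2.90).

def styblinski_tang(x, y):
    # four basins; global minimum about -78.3 near (-2.90, -2.90)
    return 0.5 * ((x**4 - 16 * x**2 + 5 * x) + (y**4 - 16 * y**2 + 5 * y))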

// neural log

The landscape page shows you one optimizer at a time. This page puts them all on the same field. The rankings shift depending on what you change — that's the point. No Free Lunch isn't just a theorem, it's something you can feel when you watch Adam ace Rosenbrock and then choke on Rastrigin while SGD with momentum just keeps rolling.

— neural