Optimizers
SGD
Stochastic Gradient Descent. The simplest optimizer: w ← w − lr · ∇L. Optionally supports momentum (accelerates convergence), weight decay (L2 regularization), and Nesterov momentum.
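As a rough illustration of how these pieces fit together (a plain TypeScript sketch of the update math, not deepbox's implementation; the function and option names here are hypothetical):

// Sketch of one SGD update over a flat parameter array (illustrative only).
// `velocity` persists across calls when momentum is used.
function sgdStep(
  w: Float64Array,            // parameters, updated in place
  grad: Float64Array,         // dL/dw for each parameter
  velocity: Float64Array,     // momentum buffer, same length as w
  lr: number,
  momentum = 0,               // μ; 0 disables momentum
  weightDecay = 0,            // L2 coefficient
  nesterov = false
): void {
  for (let i = 0; i < w.length; i++) {
    let g = grad[i] + weightDecay * w[i];        // L2: add λ·w to the gradient
    if (momentum !== 0) {
      velocity[i] = momentum * velocity[i] + g;  // v ← μ·v + g
      g = nesterov ? g + momentum * velocity[i] : velocity[i];
    }
    w[i] -= lr * g;                              // w ← w − lr · g
  }
}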
Adam
Adaptive Moment Estimation. Maintains per-parameter running averages of the first moment (mean) and second moment (uncentered variance) of the gradients. The default choice for most deep learning tasks; combines the benefits of Adagrad and RMSprop.
AdamW
Adam with decoupled weight decay. Fixes the weight decay implementation in Adam by applying it directly to parameters rather than through the gradient. Recommended over Adam when using weight decay.
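The difference is where the decay term enters the update. A schematic comparison in plain TypeScript (illustrative sketch, not the deepbox source; adamDir stands in for Adam's adaptive direction m̂ / (√v̂ + ε)):

// Classic L2 in Adam: decay is folded into the gradient, so it is also
// rescaled by the adaptive √v̂ denominator like everything else.
function adamL2Update(w: number, grad: number, adamDir: (g: number) => number,
                      lr: number, weightDecay: number): number {
  const g = grad + weightDecay * w;
  return w - lr * adamDir(g);
}

// AdamW: decoupled weight decay, applied directly to the parameter,
// independent of the adaptive scaling.
function adamWUpdate(w: number, grad: number, adamDir: (g: number) => number,
                     lr: number, weightDecay: number): number {
  return w - lr * adamDir(grad) - lr * weightDecay * w;
}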
RMSprop
Root Mean Square Propagation. Divides the gradient by a running average of its recent magnitude (the square root of an exponentially decaying average of squared gradients), adapting the learning rate per parameter. Originally proposed for training RNNs.
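A minimal sketch of that idea in plain TypeScript (illustrative only, not the library's implementation; default values here are conventional assumptions):

// RMSprop: keep an exponential moving average of squared gradients and
// normalize each step by its square root.
function rmspropStep(
  w: Float64Array, grad: Float64Array, sqAvg: Float64Array,
  lr = 0.01, alpha = 0.99, eps = 1e-8
): void {
  for (let i = 0; i < w.length; i++) {
    sqAvg[i] = alpha * sqAvg[i] + (1 - alpha) * grad[i] * grad[i];
    w[i] -= (lr * grad[i]) / (Math.sqrt(sqAvg[i]) + eps);
  }
}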
Adagrad
Adaptive Gradient. Adapts learning rate per parameter based on accumulated squared gradients. Well-suited for sparse data. Learning rate decays monotonically.
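A minimal sketch (illustrative, not the deepbox source) showing why the effective learning rate only shrinks: the accumulator never decreases.

// Adagrad: accumulate all squared gradients; frequently-updated (dense)
// features get ever-smaller steps, rarely-updated (sparse) features keep
// larger ones.
function adagradStep(
  w: Float64Array, grad: Float64Array, sumSq: Float64Array,
  lr = 0.01, eps = 1e-10
): void {
  for (let i = 0; i < w.length; i++) {
    sumSq[i] += grad[i] * grad[i];   // monotonically increasing
    w[i] -= (lr * grad[i]) / (Math.sqrt(sumSq[i]) + eps);
  }
}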
AdaDelta
Extension of Adagrad that addresses its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, it restricts the accumulation to a window, implemented as an exponentially decaying average.
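A sketch of the two running averages AdaDelta maintains (illustrative plain TypeScript with the conventional ρ and ε symbols, not the library's code):

// AdaDelta: an EMA of squared gradients replaces Adagrad's unbounded sum,
// and an EMA of squared updates supplies the numerator, so no explicit
// learning rate is strictly required.
function adadeltaStep(
  w: Float64Array, grad: Float64Array,
  sqGradAvg: Float64Array, sqUpdateAvg: Float64Array,
  rho = 0.9, eps = 1e-6
): void {
  for (let i = 0; i < w.length; i++) {
    sqGradAvg[i] = rho * sqGradAvg[i] + (1 - rho) * grad[i] * grad[i];
    const update =
      (Math.sqrt(sqUpdateAvg[i] + eps) / Math.sqrt(sqGradAvg[i] + eps)) * grad[i];
    sqUpdateAvg[i] = rho * sqUpdateAvg[i] + (1 - rho) * update * update;
    w[i] -= update;
  }
}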
Nadam
Nesterov-accelerated Adam. Combines Adam's adaptive learning rates with Nesterov momentum for faster convergence.
Optimizer API (Common Methods)
- new Optimizer(params, { lr, ...opts }) — Create optimizer with model parameters
- .step() — Update all parameters using computed gradients
- .zeroGrad() — Reset all parameter gradients to zero (call at the start of each training iteration, before the backward pass)
SGD
w ← w − lr · ∇L
Where:
- lr = Learning rate
- ∇L = Gradient of loss
SGD + Momentum
v ← μ · v + ∇L
w ← w − lr · v
Where:
- μ = Momentum coefficient (default: 0.9)
- v = Velocity (running average of gradients)
Adam
m ← β₁ · m + (1 − β₁) · ∇L
v ← β₂ · v + (1 − β₂) · (∇L)²
m̂ = m / (1 − β₁ᵗ),  v̂ = v / (1 − β₂ᵗ)
w ← w − lr · m̂ / (√v̂ + ε)
Where:
- β₁, β₂ = Decay rates (0.9, 0.999)
- m̂, v̂ = Bias-corrected moments
- t = Current step count (for bias correction)
- ε = Small constant for numerical stability
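Putting the Adam equations above into plain TypeScript (an illustrative sketch of the math, not deepbox's implementation):

// One Adam step over flat arrays. m and v persist across calls; t is the
// 1-based step count used for bias correction.
function adamStep(
  w: Float64Array, grad: Float64Array,
  m: Float64Array, v: Float64Array, t: number,
  lr = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8
): void {
  for (let i = 0; i < w.length; i++) {
    m[i] = beta1 * m[i] + (1 - beta1) * grad[i];            // first moment
    v[i] = beta2 * v[i] + (1 - beta2) * grad[i] * grad[i];  // second moment
    const mHat = m[i] / (1 - Math.pow(beta1, t));           // bias correction
    const vHat = v[i] / (1 - Math.pow(beta2, t));
    w[i] -= (lr * mHat) / (Math.sqrt(vHat) + eps);          // parameter update
  }
}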
import { SGD, Adam, AdamW } from "deepbox/optim";
import { Sequential, Linear, ReLU } from "deepbox/nn";
import { parameter, GradTensor } from "deepbox/ndarray";

const model = new Sequential(
  new Linear(2, 16),
  new ReLU(),
  new Linear(16, 1)
);

// Adam optimizer (default choice)
const optimizer = new Adam(model.parameters(), { lr: 0.01 });

// Training data
const input = parameter([[1, 2], [3, 4], [5, 6]]);
const target = parameter([[1], [0], [1]]);

// Training loop
for (let epoch = 0; epoch < 100; epoch++) {
  optimizer.zeroGrad();                            // Reset gradients
  const output = model.forward(input);             // Forward pass (returns GradTensor)
  const diff = (output as GradTensor).sub(target);
  const loss = diff.mul(diff).mean();              // MSE loss via GradTensor ops
  loss.backward();                                 // Backpropagation
  optimizer.step();                                // Update weights
}

// SGD with momentum
const sgd = new SGD(model.parameters(), { lr: 0.01, momentum: 0.9 });

// AdamW with weight decay
const adamw = new AdamW(model.parameters(), { lr: 0.001, weightDecay: 0.01 });

Choosing an Optimizer
- Adam — Default for most tasks. Good out-of-the-box without much tuning.
- AdamW — When using weight decay (recommended over Adam for regularization).
- SGD + Momentum — Often achieves better final accuracy than Adam with proper tuning (used in many research papers).
- RMSprop — Good for RNNs and non-stationary objectives.
- Adagrad — Sparse data (e.g., NLP with large vocabularies).