- Goal: Find weights $W$ that minimize a loss function while ensuring good generalization to unseen data.
- Problem without regularization: If a linear classifier achieves $L=0$ (e.g., under a margin-based hinge loss), scaling $W \to 2W$ also gives $L=0$. The solution is therefore not unique, and nothing stops the weights from growing arbitrarily large and fitting noise. Regularization resolves this.
Regularization
Concept
- Full Loss: $\mathcal{L}_{\text{total}}(W) = \mathcal{L}_{\text{data}}(W) + \lambda R(W)$
- $\mathcal{L}_{\text{data}}$: Measures prediction error on training data.
- $R(W)$: Penalty term for model complexity.
- $\lambda$: Regularization strength (hyperparameter).
- Philosophy: Occam’s Razor → Prefer simpler models. Regularization prevents fitting noise and adds curvature to the loss landscape, aiding optimization.
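
As a rough illustration of the full loss above, the sketch below evaluates $\mathcal{L}_{\text{data}} + \lambda R(W)$ for a tiny linear classifier. The multiclass hinge loss, the toy data, and all function names are assumptions chosen for the example; the notes do not fix a specific data loss. It shows that $W$ and $2W$ tie at zero data loss, and that an L2 penalty breaks the tie in favor of the smaller weights.

```python
def scores(W, x):
    """Linear classifier scores: one dot product per class row of W."""
    return [sum(w_j * x_j for w_j, x_j in zip(row, x)) for row in W]

def hinge_loss(W, x, y, margin=1.0):
    """Multiclass hinge loss for one example whose correct class is y (assumed loss)."""
    s = scores(W, x)
    return sum(max(0.0, s[j] - s[y] + margin) for j in range(len(s)) if j != y)

def l2_penalty(W):
    """R(W): sum of squared weights."""
    return sum(w * w for row in W for w in row)

def total_loss(W, data, lam):
    """L_total = mean data loss + lambda * R(W)."""
    data_loss = sum(hinge_loss(W, x, y) for x, y in data) / len(data)
    return data_loss + lam * l2_penalty(W)

# Hypothetical toy data: two classes, two features.
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
W = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[2.0 * w for w in row] for row in W]

# Without regularization both weight settings reach zero loss.
print(total_loss(W, data, lam=0.0), total_loss(W2, data, lam=0.0))
# With lambda > 0 the smaller weights are preferred.
print(total_loss(W, data, lam=0.1), total_loss(W2, data, lam=0.1))
```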
Common Types
| Type | Formula | Behavior |
|---|---|---|
| L2 (Ridge) | $\sum_{k,l} W_{k,l}^2$ | Spreads weights evenly; differentiable everywhere; most common |
| L1 (Lasso) | $\sum_{k,l} \lvert W_{k,l} \rvert$ | Drives weights to exactly zero → sparse models / feature selection |
| Elastic Net | $\alpha \text{L1} + (1-\alpha)\text{L2}$ | Combines sparsity & stability |
| Implicit | Dropout, BatchNorm, Stochastic Depth | Structural/algorithmic regularization |
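
The explicit penalties in the table can be sketched in a few lines. The function names are illustrative. The second half contrasts how the two penalties shrink a single weight: a plain gradient step for L2, and a soft-thresholding (proximal) step for L1. Soft-thresholding is one common way to handle the non-differentiable point at zero, used here as an assumption rather than something the notes prescribe.

```python
def l1(W):
    """L1 penalty: sum of absolute weights."""
    return sum(abs(w) for row in W for w in row)

def l2(W):
    """L2 penalty: sum of squared weights."""
    return sum(w * w for row in W for w in row)

def elastic_net(W, alpha):
    """alpha * L1 + (1 - alpha) * L2, as in the table."""
    return alpha * l1(W) + (1.0 - alpha) * l2(W)

def l2_step(w, lr, lam):
    """Gradient step on lam * w^2: multiplicative shrinkage, never exactly zero."""
    return w - lr * 2.0 * lam * w

def l1_prox_step(w, lr, lam):
    """Soft-thresholding step for lam * |w|: subtracts a constant, clamps at zero."""
    shrunk = max(abs(w) - lr * lam, 0.0)
    return shrunk if w >= 0 else -shrunk

W = [[1.0, -2.0], [0.0, 3.0]]
print(l1(W), l2(W), elastic_net(W, alpha=0.5))

# Shrink one weight under each penalty alone (no data loss).
w_l2 = w_l1 = 0.3
for _ in range(50):
    w_l2 = l2_step(w_l2, lr=0.1, lam=0.5)
    w_l1 = l1_prox_step(w_l1, lr=0.1, lam=0.5)
print(w_l2, w_l1)  # L2 leaves a small nonzero weight; L1 reaches exactly 0.0
```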
Optimization Fundamentals
Computing Gradients
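- Numerical Gradient: $\frac{\partial f}{\partial x_i} \approx \frac{f(x+h e_i) - f(x)}{h}$ for a small step $h$
- Pros: easy to implement. Cons: slow (one extra evaluation of $f$ per parameter, so $O(N)$ evaluations per gradient) and only approximate.
- Analytic Gradient: Exact derivative via calculus
- Pros: fast and exact. Cons: error-prone to derive and code.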
- Best Practice: Always code the analytic gradient, then verify with a gradient check using the numerical approximation.
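
A gradient check along these lines might look like the sketch below. The test function `f`, its hand-derived gradient, and the tolerance are made up for illustration. Note that the code uses a centered difference, $(f(x+h e_i) - f(x-h e_i)) / 2h$, instead of the one-sided formula above, because its truncation error is $O(h^2)$ rather than $O(h)$.

```python
def f(x):
    """Example function (hypothetical): sum of squares plus a cross term."""
    return sum(v * v for v in x) + x[0] * x[1]

def analytic_grad(x):
    """Hand-derived gradient of f."""
    g = [2.0 * v for v in x]
    g[0] += x[1]
    g[1] += x[0]
    return g

def numerical_grad(func, x, h=1e-5):
    """Centered-difference gradient: two function evaluations per coordinate."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((func(x_plus) - func(x_minus)) / (2.0 * h))
    return grad

def max_relative_error(a, b):
    """Largest per-coordinate relative difference between two gradients."""
    return max(abs(ai - bi) / max(abs(ai) + abs(bi), 1e-12) for ai, bi in zip(a, b))

x0 = [1.0, 2.0, 0.5]
err = max_relative_error(analytic_grad(x0), numerical_grad(f, x0))
print("max relative error:", err)
assert err < 1e-6, "analytic gradient disagrees with numerical estimate"
```

A relative rather than absolute error is used so the check is not dominated by the overall scale of the gradient.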
Gradient Descent (GD)
- Update: $W \leftarrow W - \eta \nabla_W \mathcal{L}$
- Stochastic Gradient Descent (SGD): Approximates the full gradient using mini-batches (commonly 32, 64, or 128 examples). This relieves the memory/speed bottlenecks of full-batch GD, but plain SGD still struggles with:
- Poor Conditioning: Oscillates on steep axes, crawls on flat axes
- Saddle Points / Local Minima: Gradient vanishes; saddle points dominate in high dimensions
- Gradient Noise: Mini-batch variance causes jittery updates
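
The mini-batch update can be sketched on a one-parameter regression problem. Everything here (the noiseless data $y = 3x$, the squared-error loss, the batch size of 4, and the function names) is an assumption made to keep the example small. It also prints one mini-batch gradient next to the full gradient to show how much a single batch estimate can differ.

```python
import random

def grad_mse(w, batch):
    """Gradient w.r.t. w of the mean of 0.5 * (w*x - y)^2 over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def sgd(data, lr=0.1, batch_size=4, epochs=50, seed=0):
    """Plain mini-batch SGD: reshuffle each epoch, one update per mini-batch."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            w -= lr * grad_mse(w, data[i:i + batch_size])
    return w

# Hypothetical noiseless data from y = 3x.
data = [(i / 10, 3.0 * i / 10) for i in range(1, 21)]

full_g = grad_mse(0.0, data)       # gradient over the whole dataset
batch_g = grad_mse(0.0, data[:4])  # gradient from a single mini-batch
print("full vs mini-batch gradient at w=0:", full_g, batch_g)

w_fit = sgd(data)
print("fitted w:", w_fit)
```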
Advanced Optimizers
| Optimizer | Mechanism | Solves | Notes |
|---|---|---|---|
| SGD + Momentum | Accumulates velocity: $v \leftarrow \rho v - \eta \nabla\mathcal{L}$, then $W \leftarrow W + v$ | Oscillations, slow convergence on flat regions | $\rho \approx 0.9$; builds inertia in consistent directions |
| RMSProp | Scales LR per-parameter: $\frac{\nabla\mathcal{L}}{\sqrt{E[g^2] + \epsilon}}$ | Poor conditioning | Dampens steep dims, accelerates flat dims |
| Adam | Momentum (1st moment) + RMSProp (2nd moment) + Bias correction | Most of the SGD issues above | Default: $\beta_1=0.9, \beta_2=0.999, \eta=10^{-3}$ |
| AdamW | Decouples weight decay from adaptive gradient update | L2 interacts poorly with Adam's moment estimates | Weight decay applied directly to the weights, outside the adaptive step; often generalizes better |
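
The momentum and RMSProp rows can be written out as single update steps. This is a minimal sketch on plain Python lists; the function names are illustrative and the RMSProp averaging constant `decay=0.9` is an assumed value, since the table does not give one.

```python
import math

def momentum_step(w, v, grad, lr=0.01, rho=0.9):
    """v <- rho*v - lr*grad, then w <- w + v (elementwise)."""
    v = [rho * vi - lr * gi for vi, gi in zip(v, grad)]
    w = [wi + vi for wi, vi in zip(w, v)]
    return w, v

def rmsprop_step(w, sq_avg, grad, lr=0.01, decay=0.9, eps=1e-8):
    """Track a running average of g^2 and divide each step by its square root."""
    sq_avg = [decay * s + (1.0 - decay) * g * g for s, g in zip(sq_avg, grad)]
    w = [wi - lr * g / math.sqrt(s + eps) for wi, g, s in zip(w, grad, sq_avg)]
    return w, sq_avg

# Momentum: with a constant gradient the second step is larger than the first.
w_m, vel = [0.0], [0.0]
for _ in range(2):
    w_m, vel = momentum_step(w_m, vel, grad=[1.0], lr=0.1, rho=0.9)
print("momentum after two steps:", w_m)

# RMSProp: a steep (100) and a flat (1) gradient produce nearly equal steps.
w_r, sq = rmsprop_step([0.0, 0.0], [0.0, 0.0], grad=[100.0, 1.0], lr=0.01)
print("rmsprop after one step:", w_r)
```

The RMSProp result shows the per-parameter normalization: the steep and flat coordinates move by almost the same amount, which is what counters poor conditioning.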
Adam vs AdamW: Standard Adam folds L2 into the gradient/moment calculation. AdamW applies weight decay directly to parameters, preserving the adaptive learning rate's behavior.
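
The difference can be made concrete with a scalar sketch. The function below is an illustrative single-parameter Adam step, not a library API: `l2` adds the penalty gradient before the moment estimates (Adam + L2), while `weight_decay` shrinks the weight outside the adaptive term (AdamW-style). The data gradient is set to zero so that only the regularizer acts.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, weight_decay=0.0):
    """One scalar Adam step with optional L2 (coupled) or weight decay (decoupled)."""
    g = g + l2 * w                      # Adam + L2: penalty enters the moments
    m = b1 * m + (1.0 - b1) * g         # first moment
    v = b2 * v + (1.0 - b2) * g * g     # second moment
    m_hat = m / (1.0 - b1 ** t)         # bias correction
    v_hat = v / (1.0 - b2 ** t)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * weight_decay * w
    return w_new, m, v

def first_step_size(w0, **kwargs):
    """How far the first step moves w0 when the data gradient is zero."""
    w1, _, _ = adam_step(w0, g=0.0, m=0.0, v=0.0, t=1, **kwargs)
    return w0 - w1

l2_small = first_step_size(1.0, l2=0.01)
l2_large = first_step_size(10.0, l2=0.01)
wd_small = first_step_size(1.0, weight_decay=0.01)
wd_large = first_step_size(10.0, weight_decay=0.01)

print("Adam + L2 shrinkage:", l2_small, l2_large)  # about the same for both weights
print("AdamW shrinkage:   ", wd_small, wd_large)   # proportional to the weight
```

With L2 folded into the gradient, the adaptive normalization rescales the penalty, so a weight of 10 is shrunk about as much as a weight of 1. With decoupled decay the shrinkage stays proportional to the weight.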
Learning Rate Strategies
- LR is the most critical hyperparameter. Too high → loss explodes. Too low → slow/stuck convergence.
- Schedules:
- Step Decay: Drop by factor (e.g., ×0.1) at fixed epochs
- Cosine Annealing: $\eta_t = \frac{\eta_{\text{init}}}{2}\left(1 + \cos\frac{\pi t}{T}\right)$
- Linear / Inverse Sqrt: Common in Transformers & large-scale vision
- Warmup: Linearly ramp LR from 0 over ~5k iterations. Prevents early instability, especially with large batches.
- Rule of Thumb: If the batch size increases by a factor of $N$, scale the initial LR by $N$ as well (the linear scaling rule).
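
The schedules and rules above reduce to short formulas. The sketch below is one possible way to write them; the function names and the 30-epoch drop interval are hypothetical, while the ×0.1 factor and the 5k-iteration warmup follow the values quoted above.

```python
import math

def step_decay(lr0, epoch, drop=0.1, every=30):
    """Multiply the LR by `drop` once every `every` epochs (interval is assumed)."""
    return lr0 * drop ** (epoch // every)

def cosine(lr0, t, T):
    """Cosine annealing: lr0 at t=0, decaying to 0 at t=T."""
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * t / T))

def warmup_then_cosine(lr0, t, T, warmup=5000):
    """Linear ramp from 0 to lr0 over `warmup` steps, then cosine decay to step T."""
    if t < warmup:
        return lr0 * t / warmup
    return cosine(lr0, t - warmup, T - warmup)

def scale_lr(base_lr, base_batch, batch):
    """Linear scaling rule: LR grows in proportion to the batch size."""
    return base_lr * batch / base_batch

print(step_decay(0.1, epoch=30))             # one drop: about 0.01
print(cosine(1.0, t=50, T=100))              # halfway: about 0.5
print(warmup_then_cosine(1.0, 2500, 10000))  # mid-warmup: 0.5
print(scale_lr(0.1, base_batch=256, batch=1024))
```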
First-Order vs. Second-Order Optimization
- First-Order: Uses gradient (linear approximation of loss). Foundation of all modern DL optimizers.
- Second-Order (Newton's Method): Uses Hessian matrix $H$ for quadratic approximation.
- Update: $W \leftarrow W - H^{-1} \nabla\mathcal{L}$
- Why not in DL? Hessian has $O(N^2)$ elements; inversion costs $O(N^3)$. Infeasible for $N \sim 10^6$ to $10^9$.
L-BFGS: Quasi-Newton method that approximates $H^{-1}$ efficiently from recent gradient history. Works well for full-batch deterministic optimization but tends to perform poorly with noisy mini-batch gradients.
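
To see what the Hessian buys, the sketch below compares one Newton step with 100 gradient descent steps on a poorly conditioned 2D quadratic $f(w) = \tfrac{1}{2} w^\top A w - b^\top w$, whose Hessian is simply $A$. The matrix, the starting point, and the function names are assumptions for illustration; a 2×2 solve stands in for $H^{-1}\nabla\mathcal{L}$.

```python
def grad(A, b, w):
    """Gradient of 0.5 * w^T A w - b^T w, i.e. A w - b, for a 2x2 A."""
    return [A[0][0] * w[0] + A[0][1] * w[1] - b[0],
            A[1][0] * w[0] + A[1][1] * w[1] - b[1]]

def solve_2x2(A, g):
    """Solve A d = g with Cramer's rule (stands in for applying H^{-1})."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(g[0] * A[1][1] - A[0][1] * g[1]) / det,
            (A[0][0] * g[1] - A[1][0] * g[0]) / det]

def newton_step(A, b, w):
    """w <- w - H^{-1} grad, with H = A for this quadratic."""
    d = solve_2x2(A, grad(A, b, w))
    return [w[0] - d[0], w[1] - d[1]]

def gd_step(A, b, w, lr):
    """w <- w - lr * grad."""
    g = grad(A, b, w)
    return [w[0] - lr * g[0], w[1] - lr * g[1]]

A = [[1.0, 0.0], [0.0, 100.0]]  # curvature 1 along one axis, 100 along the other
b = [1.0, 100.0]                # minimizer is w = [1, 1]
w0 = [0.0, 0.0]

w_newton = newton_step(A, b, w0)
w_gd = w0
for _ in range(100):
    w_gd = gd_step(A, b, w_gd, lr=0.01)

print("Newton, 1 step:", w_newton)
print("GD, 100 steps: ", w_gd)
```

On a quadratic, Newton's method lands on the minimizer in one step, while gradient descent with a step size limited by the steep axis is still far from the optimum along the flat axis.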
Takeaways
- Start with AdamW ($\eta \approx 10^{-3} \text{ to } 5\times10^{-4}$). Often works well out-of-the-box.
- SGD + Momentum can generalize better and beat Adam on large datasets, but requires careful LR tuning & scheduling.
- Always perform a gradient check when implementing a new loss or network layer.
- Regularization isn't just for generalization; it improves optimization curvature.
- Use learning rate warmup + decay for stable training, especially with large batch sizes or Transformers.
- Avoid L2 inside Adam; use AdamW for proper weight decay.