
Regularization & Optimization

Regularization prevents overfitting by penalizing model complexity, while advanced optimizers like AdamW and learning rate schedules improve convergence and generalization in neural network training.

  • Goal: Find weights $W$ that minimize a loss function while ensuring good generalization to unseen data.
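  • Problem without regularization: If a linear classifier achieves $L=0$ (for example, under a hinge loss with every margin satisfied), scaling $W \to 2W$ also gives $L=0$. The solution is non-unique, and nothing in the data loss discourages arbitrarily large weights, which easily overfit noise. Regularization resolves this by preferring one solution over the others.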
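
The non-uniqueness is easy to check numerically. The sketch below assumes a multiclass hinge loss with margin 1 (the notes do not name the loss), and the toy weights and example are made up for illustration.

```python
def hinge_loss(W, x, y, margin=1.0):
    """Multiclass hinge loss for one example; W holds one weight row per class."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return sum(max(0.0, s - scores[y] + margin)
               for j, s in enumerate(scores) if j != y)


def l2_penalty(W):
    """Sum of squared weights."""
    return sum(w * w for row in W for w in row)


W = [[2.0, 0.0], [0.0, 2.0]]   # toy weights (hypothetical)
x, y = [1.0, 0.0], 0           # one training example with label 0
W2 = [[2.0 * w for w in row] for row in W]

loss_W, loss_2W = hinge_loss(W, x, y), hinge_loss(W2, x, y)
print(loss_W, loss_2W)                 # the data loss cannot tell W and 2W apart
print(l2_penalty(W), l2_penalty(W2))   # the L2 penalty prefers the smaller W
```

Both data losses are zero, so only the penalty term distinguishes the two solutions.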

Regularization

Concept

  • Full Loss: $\mathcal{L}_{\text{total}}(W) = \mathcal{L}_{\text{data}}(W) + \lambda R(W)$
    • $\mathcal{L}_{\text{data}}$: Measures prediction error on training data.
    • $R(W)$: Penalty term for model complexity.
    • $\lambda$: Regularization strength (hyperparameter).
  • Philosophy: Occam’s Razor → Prefer simpler models. Regularization prevents fitting noise and adds curvature to the loss landscape, aiding optimization.

Common Types

| Type | Formula | Behavior |
| --- | --- | --- |
| L2 (Ridge) | $\sum_{k,l} W_{k,l}^2$ | Spreads weights evenly; differentiable everywhere; most common |
| L1 (Lasso) | $\sum_{k,l} \lvert W_{k,l} \rvert$ | Drives weights to exactly zero → sparse models / feature selection |
| Elastic Net | $\alpha \cdot \text{L1} + (1-\alpha) \cdot \text{L2}$ | Combines sparsity & stability |
| Implicit | Dropout, BatchNorm, Stochastic Depth | Structural/algorithmic regularization |
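
To see the "spreads weights evenly" behavior, here is a small sketch of the three explicit penalties applied to two weight vectors that produce the same score on the same input. The vectors and the helper names (`l1`, `l2`, `elastic_net`, `total_loss`) are illustrative choices, not part of the notes.

```python
def l2(w):
    """L2 (ridge) penalty: sum of squared weights."""
    return sum(wi * wi for wi in w)


def l1(w):
    """L1 (lasso) penalty: sum of absolute weights."""
    return sum(abs(wi) for wi in w)


def elastic_net(w, alpha=0.5):
    """Convex mix of L1 and L2, weighted by alpha."""
    return alpha * l1(w) + (1.0 - alpha) * l2(w)


def total_loss(data_loss, w, lam, penalty=l2):
    """Full loss: data term plus lambda times the chosen penalty."""
    return data_loss + lam * penalty(w)


x = [1.0, 1.0, 1.0, 1.0]
w_peaked = [1.0, 0.0, 0.0, 0.0]      # all weight on one feature
w_spread = [0.25, 0.25, 0.25, 0.25]  # same score on x, spread over all features

print(l2(w_peaked), l2(w_spread))    # L2 is lower for the spread-out vector
print(l1(w_peaked), l1(w_spread))    # L1 does not distinguish the two
print(total_loss(0.0, w_peaked, lam=0.1), total_loss(0.0, w_spread, lam=0.1))
```

With equal data loss, L2 favors `w_spread`, while L1 is indifferent between the two vectors here.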

Optimization Fundamentals

Computing Gradients

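  • Numerical Gradient: $\frac{\partial f}{\partial x_i} \approx \frac{f(x+h e_i) - f(x)}{h}$ for a small step $h$
    • Pros: easy to implement. Cons: approximate, and slow, since it needs one extra function evaluation per parameter ($O(N)$ evaluations per gradient).
  • Analytic Gradient: Exact derivative via calculus
    • Pros: fast and exact. Cons: error-prone to derive and code.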
  • Best Practice: Always code the analytic gradient, then verify with a gradient check using the numerical approximation.
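
A gradient check can be sketched in a few lines. The test function `f`, the step size `h = 1e-5`, and the relative-error measure are assumptions made for this example; the finite difference is the forward difference from the formula above.

```python
def f(x):
    """Toy objective: sum of squares plus one cross term."""
    return sum(xi * xi for xi in x) + x[0] * x[1]


def analytic_grad(x):
    """Hand-derived gradient of f."""
    g = [2.0 * xi for xi in x]
    g[0] += x[1]
    g[1] += x[0]
    return g


def numerical_grad(func, x, h=1e-5):
    """Forward-difference approximation, one extra evaluation per coordinate."""
    fx = func(x)
    grad = []
    for i in range(len(x)):
        x_step = list(x)
        x_step[i] += h
        grad.append((func(x_step) - fx) / h)
    return grad


def max_relative_error(a, b):
    """Largest per-coordinate relative difference between two gradients."""
    return max(abs(ai - bi) / max(abs(ai), abs(bi), 1e-12)
               for ai, bi in zip(a, b))


x0 = [1.0, 2.0, -3.0]
err = max_relative_error(analytic_grad(x0), numerical_grad(f, x0))
print(err)  # should be small if the analytic gradient is correct
```

A centered difference, $(f(x+h e_i) - f(x-h e_i)) / 2h$, is usually more accurate at the cost of one more evaluation per coordinate.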

Gradient Descent (GD)

  • Update: $W \leftarrow W - \eta \nabla_W \mathcal{L}$
  • Stochastic Gradient Descent (SGD): Approximates the full gradient using mini-batches (typically 32/64/128 examples). This solves the memory/speed bottlenecks of full-batch GD, but training still runs into:
    1. Poor Conditioning: Oscillates on steep axes, crawls on flat axes
    2. Saddle Points / Local Minima: Gradient vanishes; saddle points dominate in high dimensions
    3. Gradient Noise: Mini-batch variance causes jittery updates
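
The update rule with mini-batches can be sketched on a toy problem. Everything specific here is an assumption for illustration: a 1-D linear regression on synthetic, noise-free data, a mean-squared-error loss, batch size 32, and $\eta = 0.1$.

```python
import random

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 3.0 * x + 1.0) for x in xs]   # synthetic targets: y = 3x + 1


def mse(w, b, batch):
    """Mean squared error of the line w*x + b on a batch."""
    return sum((w * x + b - y) ** 2 for x, y in batch) / len(batch)


def grad_mse(w, b, batch):
    """Gradient of the batch MSE with respect to w and b."""
    n = len(batch)
    dw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    db = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return dw, db


w, b, lr, batch_size = 0.0, 0.0, 0.1, 32
initial_loss = mse(w, b, data)
for epoch in range(50):
    random.shuffle(data)                   # new mini-batch split every epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        dw, db = grad_mse(w, b, batch)     # gradient estimate from the batch only
        w, b = w - lr * dw, b - lr * db    # W <- W - eta * grad
final_loss = mse(w, b, data)
print(w, b, initial_loss, final_loss)
```

Each step only touches one batch, so the cost per update does not grow with the dataset size.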

Advanced Optimizers

| Optimizer | Mechanism | Solves | Notes |
| --- | --- | --- | --- |
| SGD + Momentum | Accumulates velocity: $v \leftarrow \rho v - \eta \nabla\mathcal{L}$, then $W \leftarrow W + v$ | Oscillations, slow convergence on flat regions | $\rho \approx 0.9$; builds inertia in consistent directions |
| RMSProp | Scales LR per-parameter: $\frac{\nabla\mathcal{L}}{\sqrt{E[g^2] + \epsilon}}$ | Poor conditioning | Dampens steep dims, accelerates flat dims |
| Adam | Momentum (1st moment) + RMSProp (2nd moment) + Bias correction | Most of the SGD issues above | Default: $\beta_1=0.9, \beta_2=0.999, \eta=10^{-3}$ |
| AdamW | Decouples weight decay from the adaptive gradient update | L2 interacts poorly with Adam's moment estimates | Weight decay applied directly to the weights, outside the adaptive step; often generalizes better than Adam + L2 |
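
The momentum and RMSProp rows can be sketched on an ill-conditioned quadratic. The objective $f(w) = \frac{1}{2}(100 w_0^2 + w_1^2)$, the step count, and the RMSProp decay rate of 0.99 are assumptions chosen for this illustration.

```python
import math


def f(w):
    """Ill-conditioned quadratic: steep along w[0], flat along w[1]."""
    return 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)


def grad(w):
    return [100.0 * w[0], w[1]]


def sgd_step(w, g, lr):
    return [wi - lr * gi for wi, gi in zip(w, g)]


def momentum_step(w, v, g, lr, rho=0.9):
    """v <- rho * v - lr * g, then w <- w + v."""
    v = [rho * vi - lr * gi for vi, gi in zip(v, g)]
    return [wi + vi for wi, vi in zip(w, v)], v


def rmsprop_step(w, sq_avg, g, lr, decay=0.99, eps=1e-8):
    """Divide each gradient entry by a running RMS of its own history."""
    sq_avg = [decay * s + (1.0 - decay) * gi * gi for s, gi in zip(sq_avg, g)]
    w = [wi - lr * gi / math.sqrt(s + eps) for wi, gi, s in zip(w, g, sq_avg)]
    return w, sq_avg


w_gd, w_mom, v = [1.0, 1.0], [1.0, 1.0], [0.0, 0.0]
for _ in range(200):
    w_gd = sgd_step(w_gd, grad(w_gd), lr=0.01)
    w_mom, v = momentum_step(w_mom, v, grad(w_mom), lr=0.01)
print(f(w_gd), f(w_mom))   # momentum gets much further along the flat axis

# One RMSProp step: raw gradients differ by 100x, step sizes nearly match.
w_rms, _ = rmsprop_step([1.0, 1.0], [0.0, 0.0], grad([1.0, 1.0]), lr=0.01)
rms_steps = [1.0 - wi for wi in w_rms]
print(rms_steps)
```

Plain gradient descent is limited by the steep axis when choosing $\eta$ and then crawls along the flat one; momentum speeds up the flat direction, and RMSProp equalizes the per-parameter step sizes.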

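Adam vs AdamW: Adam with L2 regularization adds $\lambda W$ to the gradient, so the penalty passes through the moment estimates and is rescaled per parameter. AdamW instead applies weight decay directly to the parameters, so the decay strength is independent of the adaptive learning rate.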
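
A single-parameter sketch makes the difference concrete. The function `adam_step` and its `decoupled` flag are hypothetical names for this example, and the scenario (zero data gradient, $W = 1$, $\lambda = 0.01$) is chosen to isolate the effect of the decay term.

```python
import math


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One Adam update for a scalar weight.

    decoupled=False: L2 penalty folded into the gradient (Adam + L2).
    decoupled=True:  decay applied directly to the weight (AdamW-style).
    """
    if weight_decay and not decoupled:
        g = g + weight_decay * w                 # penalty enters the moments
    m = beta1 * m + (1.0 - beta1) * g            # 1st moment (momentum)
    v = beta2 * v + (1.0 - beta2) * g * g        # 2nd moment (RMSProp-like)
    m_hat = m / (1.0 - beta1 ** t)               # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if weight_decay and decoupled:
        w_new -= lr * weight_decay * w           # decay bypasses the moments
    return w_new, m, v


# Zero data gradient, so any movement comes from the regularizer alone.
w_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.01)
w_adamw, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.01,
                          decoupled=True)
print(1.0 - w_l2, 1.0 - w_adamw)
```

With L2 folded into the gradient, the adaptive normalization turns the tiny penalty gradient into a step of roughly $\eta$, regardless of $\lambda$. The decoupled version shrinks the weight by $\eta \lambda W$, which is what weight decay is meant to do.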


Learning Rate Strategies

  • LR is usually the most important hyperparameter to tune. Too high → loss explodes. Too low → slow/stuck convergence.
  • Schedules:
    • Step Decay: Drop by factor (e.g., ×0.1) at fixed epochs
    • Cosine Annealing: $\eta_t = \frac{\eta_{\text{init}}}{2}\left(1 + \cos\frac{\pi t}{T}\right)$
    • Linear / Inverse Sqrt: Common in Transformers & large-scale vision
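  • Warmup: Linearly ramp LR from 0 to its initial value over the first few thousand iterations (e.g., ~5k). Prevents early instability, especially with large batches.
    • Rule of Thumb (linear scaling): If the batch size is multiplied by $N$, multiply the initial LR by $N$ as well.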
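
The schedules above can be sketched as plain functions of the step or epoch. The function names and the example numbers are illustrative, and measuring the cosine from the end of warmup (rather than from step 0) is an assumption about how the two are combined.

```python
import math


def step_decay(base_lr, epoch, drop_every=30, factor=0.1):
    """Multiply the LR by `factor` every `drop_every` epochs."""
    return base_lr * factor ** (epoch // drop_every)


def warmup_cosine(step, total_steps, base_lr, warmup_steps):
    """Linear warmup from 0, then cosine annealing down to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: LR grows in proportion to the batch size."""
    return base_lr * new_batch / base_batch


for s in (0, 500, 1000, 5500, 10000):
    print(s, warmup_cosine(s, total_steps=10000, base_lr=1e-3, warmup_steps=1000))
print(step_decay(0.1, epoch=65))
print(scaled_lr(0.1, base_batch=256, new_batch=1024))
```

In a training loop, the returned value would be written into the optimizer's learning rate before each step.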

First-Order vs. Second-Order Optimization

  • First-Order: Uses the gradient (linear approximation of the loss). Foundation of nearly all modern DL optimizers.
  • Second-Order (Newton's Method): Uses Hessian matrix $H$ for quadratic approximation.
    • Update: $W \leftarrow W - H^{-1} \nabla\mathcal{L}$
    • Why not in DL? The Hessian has $O(N^2)$ elements, and inverting it costs $O(N^3)$. Infeasible for $N \sim 10^6$ to $10^9$ parameters.
    • L-BFGS: Quasi-Newton method that approximates $H^{-1}$ efficiently from recent gradients. Works well for full-batch deterministic optimization but degrades badly with mini-batch noise.
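
On a quadratic loss the Newton update lands on the minimum in a single step, which is what the quadratic approximation promises. The sketch below assumes a 2-D quadratic $f(W) = \frac{1}{2} W^\top A W - b^\top W$ with a hand-picked $A$ and $b$, so the Hessian is simply $A$ and can be inverted by hand.

```python
A = [[3.0, 1.0], [1.0, 2.0]]   # Hessian of the toy quadratic (hypothetical)
b = [1.0, 1.0]
w_star = [0.2, 0.4]            # exact minimizer, solves A w = b


def grad(w):
    """Gradient of 0.5 * w^T A w - b^T w."""
    return [A[0][0] * w[0] + A[0][1] * w[1] - b[0],
            A[1][0] * w[0] + A[1][1] * w[1] - b[1]]


def solve_2x2(H, g):
    """Return H^{-1} g for a 2x2 matrix via the closed-form inverse."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(H[1][1] * g[0] - H[0][1] * g[1]) / det,
            (H[0][0] * g[1] - H[1][0] * g[0]) / det]


def distance(u, v):
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v)) ** 0.5


w0 = [5.0, -3.0]
g0 = grad(w0)

step = solve_2x2(A, g0)
w_newton = [wi - si for wi, si in zip(w0, step)]    # W <- W - H^{-1} grad
w_gd = [wi - 0.1 * gi for wi, gi in zip(w0, g0)]    # W <- W - eta * grad

print(w_newton, distance(w_newton, w_star))
print(w_gd, distance(w_gd, w_star))
```

The 2x2 inverse is trivial here. The cost argument above is about what happens when the same solve involves millions of parameters.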

Takeaways

  1. Start with AdamW ($\eta \approx 5\times10^{-4}$ to $10^{-3}$). Often works well out-of-the-box.
  2. SGD + Momentum can generalize better and beat Adam on large datasets, but requires careful LR tuning & scheduling.
  3. Always perform a gradient check when implementing a new loss or network layer.
  4. Regularization isn't just for generalization; it improves optimization curvature.
  5. Use learning rate warmup + decay for stable training, especially with large batch sizes or Transformers.
  6. Avoid L2 inside Adam; use AdamW for proper weight decay.
