Goal: Find weights $W$ that minimize a loss function while ensuring good generalization to unseen data.
Problem without regularization: If a linear classifier achieves $L=0$ under a margin-based loss such as the hinge loss, scaling $W \to 2W$ also gives $L=0$. The solution is non-unique, and arbitrarily large weights easily overfit noise. Regularization breaks the tie by penalizing large weights.

Regularization

Concept

Full Loss: $\mathcal{L}_{\text{total}}(W) = \mathcal{L}_{\text{data}}(W) + \lambda R(W)$
- $\mathcal{L}_{\text{data}}$: Measures prediction error on training data.
- $R(W)$: Penalty term for model complexity.
- $\lambda$: Regularization strength (hyperparameter).
Philosophy: Occam’s Razor → Prefer simpler models. Regularization prevents fitting noise and adds curvature to the loss landscape, aiding optimization.
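To make the scaling argument concrete, here is a minimal sketch that assumes a multiclass hinge loss for the data term and an L2 penalty for $R(W)$. Both are illustrative choices, and the helper names `hinge_data_loss` and `total_loss` are hypothetical. It shows that $W$ and $2W$ reach the same zero data loss, while the total loss prefers the smaller weights.

```python
import numpy as np

def hinge_data_loss(W, X, y, margin=1.0):
    """Mean multiclass hinge loss for scores X @ W (assumed data loss)."""
    scores = X @ W
    rows = np.arange(len(y))
    correct = scores[rows, y][:, None]
    margins = np.maximum(0.0, scores - correct + margin)
    margins[rows, y] = 0.0  # the correct class contributes no loss
    return margins.sum(axis=1).mean()

def total_loss(W, X, y, lam):
    """Data loss plus lambda times an L2 penalty R(W) = sum of squares."""
    return hinge_data_loss(W, X, y) + lam * np.sum(W ** 2)

# Toy, linearly separable data: two samples, two classes.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 1])
W = np.array([[2.0, 0.0], [0.0, 2.0]])

print(hinge_data_loss(W, X, y), hinge_data_loss(2 * W, X, y))
print(total_loss(W, X, y, lam=0.1), total_loss(2 * W, X, y, lam=0.1))
```

With $\lambda = 0$ the two solutions are indistinguishable; any $\lambda > 0$ ranks $W$ ahead of $2W$.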

Common Types

Type	Formula	Behavior
L2 (Ridge)	$\sum_{k,l} W_{k,l}^2$	Spreads weights evenly; differentiable everywhere; most common
L1 (Lasso)	$\sum_{k,l} |W_{k,l}|$	Drives weights to exactly zero → sparse models / feature selection
Elastic Net	$\alpha \text{L1} + (1-\alpha)\text{L2}$	Combines sparsity & stability
Implicit	Dropout, BatchNorm, Stochastic Depth	Structural/algorithmic regularization
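The explicit penalties in the table can be sketched in a few lines of NumPy. The function names are hypothetical, and `alpha` is the Elastic Net mixing weight from the table. The gradient helpers hint at why L1 produces sparsity: its (sub)gradient has constant magnitude, while the L2 gradient shrinks as a weight approaches zero.

```python
import numpy as np

def l2_penalty(W):
    return np.sum(W ** 2)

def l1_penalty(W):
    return np.sum(np.abs(W))

def elastic_net_penalty(W, alpha):
    """alpha * L1 + (1 - alpha) * L2, following the table's formula."""
    return alpha * l1_penalty(W) + (1.0 - alpha) * l2_penalty(W)

def l2_grad(W):
    return 2.0 * W  # pull toward zero is proportional to the weight

def l1_subgrad(W):
    return np.sign(W)  # constant-size pull; 0 is a valid subgradient at W = 0

W = np.array([[0.5, -2.0], [0.0, 1.0]])
print(l2_penalty(W), l1_penalty(W), elastic_net_penalty(W, alpha=0.5))
print(l2_grad(W))
print(l1_subgrad(W))
```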

Optimization Fundamentals

Computing Gradients

Numerical Gradient: $\frac{\partial f}{\partial x_i} \approx \frac{f(x+h e_i) - f(x)}{h}$
- Easy to implement | Slow ($O(N)$ loss evaluations per gradient for $N$ parameters), approximate
Analytic Gradient: Exact derivative via calculus
- Fast, exact | Error-prone to derive/code
Best Practice: Always code the analytic gradient, then verify with a gradient check using the numerical approximation.
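A gradient check along these lines might look like the sketch below. Note that it swaps in a centered difference, $(f(x+h e_i) - f(x-h e_i)) / 2h$, rather than the one-sided formula above, because it is usually more accurate for the same $h$. The test function (L2-regularized least squares), the step size, and all helper names are assumptions made for illustration.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference estimate of the gradient of f at x."""
    x = x.astype(float)  # astype returns a copy, so the caller's array is untouched
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        f_plus = f(x)
        x.flat[i] = old - h
        f_minus = f(x)
        x.flat[i] = old  # restore before moving to the next coordinate
        grad.flat[i] = (f_plus - f_minus) / (2.0 * h)
    return grad

def relative_error(a, b):
    return np.max(np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b)))

# Illustrative loss: least squares plus an L2 penalty.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)
lam = 0.1

def loss(w):
    return 0.5 * np.sum((A @ w - b) ** 2) + lam * np.sum(w ** 2)

def analytic_grad(w):
    return A.T @ (A @ w - b) + 2.0 * lam * w

w = rng.normal(size=3)
err = relative_error(analytic_grad(w), numerical_gradient(loss, w))
print(f"max relative error: {err:.2e}")
```

A small relative error suggests the analytic gradient is right; a large one usually points to a bug in the derivation or the code.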

Gradient Descent (GD)

Update: $W \leftarrow W - \eta \nabla_W \mathcal{L}$
Stochastic Gradient Descent (SGD): Approximates the full gradient using mini-batches (32/64/128). Solves memory/speed bottlenecks, but plain SGD still struggles with:
1. Poor Conditioning: Oscillates on steep axes, crawls on flat axes
2. Saddle Points / Local Minima: Gradient vanishes; saddle points dominate in high dimensions
3. Gradient Noise: Mini-batch variance causes jittery updates
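As a minimal sketch of the update rule with mini-batches, the function below runs SGD on a linear least-squares problem. The problem, the synthetic data, and the name `sgd_linear_regression` are assumptions chosen to keep the example short; the learning rate and batch size are illustrative, not tuned.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.1, batch_size=32, epochs=50, seed=0):
    """Mini-batch SGD on the mean squared error 0.5 * mean((X @ w - y)^2)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)  # mini-batch gradient estimate
            w -= lr * grad  # W <- W - eta * grad
    return w

# Synthetic data with known weights (hypothetical values).
rng = np.random.default_rng(1)
true_w = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(512, 3))
y = X @ true_w + 0.01 * rng.normal(size=512)

w_hat = sgd_linear_regression(X, y)
print(w_hat)
```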

Advanced Optimizers

Optimizer	Mechanism	Solves	Notes
SGD + Momentum	Accumulates velocity: $v = \rho v - \eta \nabla\mathcal{L}$	Oscillations, slow convergence on flat regions	$\rho \approx 0.9$; builds inertia in consistent directions
RMSProp	Scales LR per-parameter: $\frac{\nabla\mathcal{L}}{\sqrt{E[g^2] + \epsilon}}$	Poor conditioning	Dampens steep dims, accelerates flat dims
Adam	Momentum (1st moment) + RMSProp (2nd moment) + Bias correction	Oscillations, poor conditioning, and gradient noise in one method	Default: $\beta_1=0.9, \beta_2=0.999, \eta=10^{-3}$
AdamW	Decouples weight decay from adaptive gradient update	L2 interacts poorly with Adam's moment estimates	Weight decay applied directly to the weights, outside the adaptive step; often generalizes better
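The first three rows of the table can be sketched as single-step update functions. This is a simplified illustration, not any library's implementation: the function names are hypothetical, the RMSProp decay rate of 0.99 is an assumed value, and Adam places $\epsilon$ outside the square root, which is one common convention.

```python
import numpy as np

def momentum_step(w, g, v, lr=0.01, rho=0.9):
    """SGD + Momentum: v = rho * v - lr * g, then w = w + v."""
    v = rho * v - lr * g
    return w + v, v

def rmsprop_step(w, g, s, lr=0.01, decay=0.99, eps=1e-8):
    """RMSProp: divide by the root of a running average of squared gradients."""
    s = decay * s + (1.0 - decay) * g ** 2
    return w - lr * g / np.sqrt(s + eps), s

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum + RMSProp-style scaling + bias correction (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# One step on a badly conditioned gradient: steep in dim 0, flat in dim 1.
w = np.array([1.0, 1.0])
g = np.array([10.0, 0.1])
zeros = np.zeros_like(w)

w_mom, _ = momentum_step(w, g, zeros)
w_rms, _ = rmsprop_step(w, g, zeros)
w_adam, _, _ = adam_step(w, g, zeros, zeros, t=1)
print(w - w_mom)   # step sizes differ by 100x across dimensions
print(w - w_rms)   # roughly equal step sizes
print(w - w_adam)  # roughly lr in each dimension
```

The momentum step inherits the 100x imbalance of the raw gradient, while the two adaptive methods take nearly equal steps in both dimensions.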

Adam vs AdamW: Standard Adam folds L2 into the gradient/moment calculation. AdamW applies weight decay directly to parameters, preserving the adaptive learning rate's behavior.
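The difference shows up clearly in a toy step where the data gradient is zero, so only the regularizer acts. The sketch below is an illustration under that assumption; the function names and the values of `lr` and `wd` are hypothetical, and real implementations differ in details.

```python
import numpy as np

def adam_l2_step(w, g, m, v, t, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2 folded into the gradient before the moment estimates."""
    g = g + wd * w
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(w, g, m, v, t, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: moments see only the data gradient; decay acts on w directly."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v

w = np.array([1.0, 100.0])   # one small and one large weight
g = np.zeros(2)              # assume the data gradient is zero
zeros = np.zeros(2)
lr, wd = 1e-3, 0.1

w_l2, _, _ = adam_l2_step(w, g, zeros, zeros, t=1, lr=lr, wd=wd)
w_dec, _, _ = adamw_step(w, g, zeros, zeros, t=1, lr=lr, wd=wd)
print(w - w_l2)   # about lr for both weights: the penalty is normalized away
print(w - w_dec)  # lr * wd * w: shrinkage proportional to each weight
```

With L2 inside Adam, the adaptive scaling cancels the size of the penalty, so the large weight is shrunk no harder than the small one. The decoupled form keeps the shrinkage proportional to the weight.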

Learning Rate Strategies

LR is the most critical hyperparameter. Too high → loss explodes. Too low → slow/stuck convergence.
Schedules:
- Step Decay: Drop by factor (e.g., ×0.1) at fixed epochs
- Cosine Annealing: $\eta_t = \frac{\eta_{\text{init}}}{2}\left(1 + \cos\frac{\pi t}{T}\right)$
- Linear / Inverse Sqrt: Common in Transformers & large-scale vision
Warmup: Linearly ramp LR from 0 over ~5k iterations. Prevents early instability, especially with large batches.
- Rule of Thumb (linear scaling): If the batch size increases by a factor of $N$, scale the initial LR by $N$ as well.
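The schedules and the warmup can be combined into one function of the step count. The sketch below is one possible arrangement, assuming linear warmup followed by cosine annealing to zero over the remaining steps; the names `lr_at`, `step_decay`, and `scaled_lr` and all numeric values are hypothetical.

```python
import math

def lr_at(step, total_steps, base_lr, warmup_steps):
    """Linear warmup from 0 to base_lr, then cosine annealing down to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def step_decay(epoch, base_lr, drop=0.1, every=30):
    """Multiply the LR by `drop` once every `every` epochs."""
    return base_lr * drop ** (epoch // every)

def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: the LR grows in proportion to the batch size."""
    return base_lr * batch / base_batch

for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000, base_lr=1e-3, warmup_steps=100))
print(step_decay(epoch=65, base_lr=0.1))
print(scaled_lr(base_lr=0.1, base_batch=256, batch=1024))
```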

First-Order vs. Second-Order Optimization

First-Order: Uses gradient (linear approximation of loss). Foundation of all modern DL optimizers.
Second-Order (Newton's Method): Uses Hessian matrix $H$ for quadratic approximation.
- Update: $W \leftarrow W - H^{-1} \nabla\mathcal{L}$
- Why not in DL? The Hessian has $O(N^2)$ elements and inverting it costs $O(N^3)$. Infeasible for $N \sim 10^6$ to $10^9$ parameters.
- L-BFGS: Quasi-Newton method that approximates $H^{-1}$ efficiently. Works well for full-batch deterministic optimization but tends to work poorly with noisy mini-batch gradients.
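On a small quadratic the contrast between the two families is easy to see. The sketch below assumes a two-parameter loss $\frac{1}{2} w^\top A w - b^\top w$ with a badly conditioned, hand-picked $A$; the function names are hypothetical. Newton's method solves a linear system with the Hessian rather than forming $H^{-1}$ explicitly.

```python
import numpy as np

# Quadratic loss 0.5 * w^T A w - b^T w; its Hessian is A (condition number 100).
A = np.array([[100.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

def grad(w):
    return A @ w - b

def newton_step(w):
    """W <- W - H^{-1} grad, computed via a linear solve."""
    return w - np.linalg.solve(A, grad(w))

def gradient_descent(w, lr, steps):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

w_star = np.linalg.solve(A, b)  # exact minimizer
w0 = np.zeros(2)

w_newton = newton_step(w0)
w_gd = gradient_descent(w0, lr=0.01, steps=100)  # lr must stay below 2/100

print(np.abs(w_newton - w_star))  # one step lands on the minimum
print(np.abs(w_gd - w_star))      # the flat direction is still far off
```

The solve is cheap here only because $N = 2$; the same step at deep-learning scale is what the cost argument above rules out.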

Takeaways

Start with AdamW ($\eta \approx 5\times10^{-4} \text{ to } 10^{-3}$). It often works well out of the box.
SGD + Momentum can generalize better and beat Adam on large datasets, but requires careful LR tuning & scheduling.
Always perform a gradient check when implementing a new loss or network layer.
Regularization isn't just for generalization; it improves optimization curvature.
Use learning rate warmup + decay for stable training, especially with large batch sizes or Transformers.
Avoid L2 inside Adam; use AdamW for proper weight decay.
