Toolbox

This page is a collection of tools that are widely used in model construction and experimentation, including

  • Regularization
  • Optimization
  • Activation

 

Regularization

Regularization reduces overfitting by adding our prior belief onto loss minimization (i.e., MAP estimate). Such prior belief is associated with our expectation of test error.

  • Regularization makes a model inconsistent, biased, and scale variant.
  • Regularization requires extra hyperparameter tuning.
  • Regularization has a weaker effect on the model (params) as the sample size increases, because more info become available from the data.

Early Stopping

Idea: stop the training process when no visible improvement is observed on validation data but on training data.

Pros:

  • faster training & higher computational efficiency

Cons:

  • premature stopping -> underfitting
  • sensitivity to hyperparams
  • inconsistency in results if stopped at different points

Penalty

Idea: add penalty terms in the loss function to force NN weights to be small. (see Penalty)

Pros:

  • help keep NN simple
  • improve model robustness

Cons:

  • oversimplification

Data Augmentation

Idea: expand training data (mostly used in CV)

CV:

  • Position: rotate, flip, zoom, translate/shift, shear, scale, cut, erase, etc.
  • Color: brightness, contrast, jittering, noise injection, channel shuffle, grid distortion, etc.

Pros:

  • improve translation invariance
  • improve model robustness
  • improve generalization

Cons:

  • high computational cost
  • may include artifacts

Dropout

Idea: randomly drop out some neurons during training.

Pros:

  • ensemble learning effect
  • improve training computational efficiency
  • handle correlated neurons
  • reduce sensitivity to weight initialization

Cons:

  • interference with learning -> slower convergence rate
  • sensitive to hyperparam (dropout prob)
  • unnecessary if sufficient training data on simpler NNs

Normalization

Idea: normalize inputs to a layer with zero mean and unit variance (across samples or features) (see Normalization)

Pros:

  • improve convergence & stabilize learning
  • allow higher learning rates
  • sequence independence (layer norm)
  • batch size independence (layer norm)

Cons:

  • less computational efficiency
  • sequence dependency (batch norm)
  • batch size dependency (batch norm)

 

Optimization

Optimization means the adjustment of params to minimize/maximize an objective function. In DL, it involves 5 key components:

  • Loss Function: Difference between predicted output mathematical expression or equation .
  • Gradient Descent: Iteratively use loss gradient to update params to reduce loss. While other optimization methods exist, GD and its variations are the best.
  • Learning Rate: Step size taken during each iteration, controlling convergence and stability of GD.
  • Epochs: #times to go through the entire dataset.
  • Batch Size: #samples in a batch, which impacts how often params are updated.

Gradient Descent

mathematical expression or equation

Notations:

  • mathematical expression or equation : param
  • mathematical expression or equation : learning rate
  • mathematical expression or equation : gradient
  • mathematical expression or equation : loss
  • mathematical expression or equation : L2 penalty weight

Types:

  • Stochastic GD: update params after each sample
  • Mini-Batch GD: update params after each mini-batch of samples
  • Batch GD: update params after the entire dataset

Pros:

  • simple

Cons:

  • stuck in local minima or saddle points
  • sensitive to learning rate

Momentum

mathematical expression or equation

Notations:

  • mathematical expression or equation : momentum weight
    • larger mathematical expression or equation smoother updates due to more past gradients involved
    • typical values: 0.8, 0.9, 0.999

Idea: moving average of past gradients

Pros:

  • accelerate convergence
  • reduce oscillations & noises
  • escape local minima & saddle points

Cons:

  • sensitive to hyperparams
  • overshooting: the weight update jumps over the global minimum

NAG

mathematical expression or equation

Name: Nesterov Accelerated Gradient

Idea: momentum but look ahead to make an informed update

Pros:

  • further accelerate convergence, espeically near minima
  • further reduce overshooting
  • more accurate weight updates in rapidly changing regions
  • improve robustness to hyperparams

Cons:

  • implementation complexity
  • low computational efficiency
  • still sensitive to learning rate

AdaGrad

mathematical expression or equation

Notations:

  • mathematical expression or equation : small number to ensure no division by 0.

Name: Adaptive Gradient Algorithm

Idea: adapt learning rate for each param

Pros:

  • adaptive learning rate -> improve robustness
  • efficient for sparse data (where some features have larger gradients than others)

Cons:

  • small learning rate for frequently occurring features -> slow convergence or premature stopping

Adadelta

mathematical expression or equation

Idea: address small learning rate in AdaGrad by using a window of past gradients to normalize updates

Pros:

  • introduced moving average in adagrad to adapt to changes more effectively
  • no need for learning rate initialization
  • robust to varying gradients

Cons:

  • Far too complicated

RMSProp

mathematical expression or equation

Name: Root Mean Square Propagation

Idea: Momentum + AdaGrad

Pros:

  • simple implementation
  • no accumulation of update history

Cons:

  • worse than Adam

Adam

mathematical expression or equation

Notations:

  • mathematical expression or equation : first moment (adaptive gradient)
  • mathematical expression or equation : second moment (adaptive learning rate)
  • mathematical expression or equation : bias-corrected moments

Name: Adaptive Moment Estimation

Idea: adaptive learning rates for both momentum & gradient

Pros:

  • SOTA
  • bias correction
  • extreme adaptivity
  • extreme convergence speed
  • extremely robust to noisy or sparse gradients

Cons:

  • sensitive to hyperparams (3 hyperparams to tune)

AdamW

$$ w_t=w_{t-1}-\frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t-\lambda w_{t-1} $$

Idea: Adam + Weight Decay

Pros:

  • Well, weight decay so regularization