Toolbox
This page is a collection of tools that are widely used in model construction and experimentation, including
- Regularization
- Optimization
- Activation
Regularization
Regularization reduces overfitting by adding our prior belief to loss minimization (i.e., the MAP estimate). Such prior belief encodes our expectation of what keeps test error low.
- Regularization makes a model inconsistent, biased, and scale variant.
- Regularization requires extra hyperparameter tuning.
- Regularization has a weaker effect on the model (params) as the sample size increases, because more info becomes available from the data.
Early Stopping
Idea: stop the training process when validation performance stops improving, even if training loss is still decreasing.
Pros:
- faster training & higher computational efficiency
Cons:
- premature stopping -> underfitting
- sensitivity to hyperparams
- inconsistency in results if stopped at different points
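A minimal sketch of the idea in plain Python. The `step`/`val_loss` callbacks and the `patience` threshold are illustrative assumptions, not a fixed API:

```python
def train_with_early_stopping(step, val_loss, max_epochs=100, patience=5):
    """step(): run one training epoch; val_loss(): current validation loss.

    Stops once validation loss has not improved for `patience` epochs,
    and returns the number of epochs actually run.
    """
    best, wait = float("inf"), 0
    for epoch in range(max_epochs):
        step()
        loss = val_loss()
        if loss < best - 1e-8:        # improvement: reset the patience counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:      # no improvement for `patience` epochs
                return epoch + 1
    return max_epochs
```

The `patience` hyperparam is exactly the sensitivity mentioned above: too small risks premature stopping, too large wastes compute.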
Penalty
Idea: add penalty terms to the loss function to force NN weights to be small. (see Penalty)
Pros:
- help keep NN simple
- improve model robustness
Cons:
- oversimplification
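As a sketch, an L2 penalty simply adds the squared weight norm to the data loss; the `lam` weight here is a hypothetical choice:

```python
import numpy as np

def l2_penalized_loss(pred, target, w, lam=0.01):
    """MSE data loss plus an L2 penalty that shrinks weights toward 0."""
    mse = np.mean((pred - target) ** 2)
    return mse + lam * np.sum(w ** 2)
```

Larger `lam` expresses a stronger prior that weights should stay near zero, at the risk of oversimplification.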
Data Augmentation
Idea: expand training data (mostly used in CV)
CV:
- Position: rotate, flip, zoom, translate/shift, shear, scale, cut, erase, etc.
- Color: brightness, contrast, jittering, noise injection, channel shuffle, grid distortion, etc.
Pros:
- improve translation invariance
- improve model robustness
- improve generalization
Cons:
- high computational cost
- may include artifacts
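A minimal sketch of one position transform and one color transform in numpy; the flip probability and brightness range are illustrative assumptions:

```python
import numpy as np

def augment(img, rng):
    """Random horizontal flip (position) + brightness scaling (color)
    on an H x W x C image with values in [0, 1]."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]              # flip left-right
    img = img * rng.uniform(0.8, 1.2)      # jitter brightness
    return np.clip(img, 0.0, 1.0)          # keep values in valid range
```

Applying a fresh random transform every epoch is what expands the effective training set.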
Dropout
Idea: randomly drop out some neurons during training.
Pros:
- ensemble learning effect
- improve training computational efficiency
- handle correlated neurons
- reduce sensitivity to weight initialization
Cons:
- interference with learning -> slower convergence rate
- sensitive to hyperparam (dropout prob)
- unnecessary for simpler NNs when training data is sufficient
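A sketch of inverted dropout, the common formulation where survivors are rescaled at training time so inference needs no change (some variants rescale at test time instead):

```python
import numpy as np

def dropout_forward(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with prob p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return x                            # inference: no-op, thanks to rescaling
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(x.shape) >= p         # keep each unit with prob 1 - p
    return x * mask / (1.0 - p)             # rescale to preserve expected activation
```

The dropout prob `p` is the hyperparam the cons list warns about.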
Normalization
Idea: normalize inputs to a layer with zero mean and unit variance (across samples or features) (see Normalization)
Pros:
- improve convergence & stabilize learning
- allow higher learning rates
- sequence independence (layer norm)
- batch size independence (layer norm)
Cons:
- less computational efficiency
- sequence dependency (batch norm)
- batch size dependency (batch norm)
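A sketch of the two normalization axes on a (batch, features) array, without the learnable scale/shift params that real layers add:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature across the batch (axis 0) -- batch-size dependent."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layer_norm(x, eps=1e-5):
    """Normalize each sample across its features (axis 1) -- batch-size independent."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)
```

The choice of axis is exactly what drives the batch-size (in)dependence listed above.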
Optimization
Optimization means the adjustment of params to minimize/maximize an objective function. In DL, it involves 5 key components:
- Loss Function: measures the difference between predicted outputs and true targets.
- Gradient Descent: iteratively use the loss gradient to update params to reduce loss. While other optimization methods exist, GD and its variants dominate in DL.
- Learning Rate: Step size taken during each iteration, controlling convergence and stability of GD.
- Epochs: #times to go through the entire dataset.
- Batch Size: #samples in a batch, which impacts how often params are updated.
Gradient Descent
Notations:
$w$: param
$\eta$: learning rate
$g$: gradient
$L$: loss
$\lambda$: L2 penalty weight
Types:
- Stochastic GD: update params after each sample
- Mini-Batch GD: update params after each mini-batch of samples
- Batch GD: update params after the entire dataset
Pros:
- simple
Cons:
- stuck in local minima or saddle points
- sensitive to learning rate
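A minimal GD loop on a toy 1-D problem; the quadratic objective is an illustrative assumption:

```python
def gd(grad, w0, lr=0.1, steps=100):
    """Plain gradient descent: w <- w - eta * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# toy problem: minimize L(w) = (w - 3)^2, whose gradient is 2(w - 3)
w_star = gd(lambda w: 2 * (w - 3), w0=0.0)
```

Stochastic, mini-batch, and batch GD all use this same update; they differ only in how much data `grad` sees per call.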
Momentum
Notations:
$\beta$: momentum weight
- larger $\beta$ -> smoother updates due to more past gradients involved
- typical values: 0.8, 0.9, 0.999
Idea: moving average of past gradients
Pros:
- accelerate convergence
- reduce oscillations & noises
- escape local minima & saddle points
Cons:
- sensitive to hyperparams
- overshooting: the weight update jumps over the global minimum
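One common formulation keeps a velocity term $v$; this is the convention without the $(1-\beta)$ factor, and both appear in practice:

```python
def momentum_gd(grad, w0, lr=0.01, beta=0.9, steps=500):
    """GD with momentum: velocity accumulates past gradients."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v + grad(w)   # exponentially weighted sum of past gradients
        w = w - lr * v
    return w

# same toy quadratic as before: L(w) = (w - 3)^2
w_star = momentum_gd(lambda w: 2 * (w - 3), w0=0.0)
```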
NAG
Name: Nesterov Accelerated Gradient
Idea: momentum but look ahead to make an informed update
Pros:
- further accelerate convergence, especially near minima
- further reduce overshooting
- more accurate weight updates in rapidly changing regions
- improve robustness to hyperparams
Cons:
- implementation complexity
- low computational efficiency
- still sensitive to learning rate
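A sketch of the look-ahead idea: the gradient is evaluated at the anticipated position $w - \eta\beta v$ rather than at $w$ (this is one of several equivalent NAG formulations):

```python
def nag(grad, w0, lr=0.01, beta=0.9, steps=500):
    """Nesterov Accelerated Gradient: momentum with a look-ahead gradient."""
    w, v = w0, 0.0
    for _ in range(steps):
        g = grad(w - lr * beta * v)  # gradient at the look-ahead point
        v = beta * v + g
        w = w - lr * v
    return w

# toy quadratic: L(w) = (w - 3)^2
w_star = nag(lambda w: 2 * (w - 3), w0=0.0)
```

Evaluating the gradient ahead of the current momentum step is what corrects overshooting before it happens.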
AdaGrad
Notations:
$\epsilon$: small number to ensure no division by 0.
Name: Adaptive Gradient Algorithm
Idea: adapt learning rate for each param
Pros:
- adaptive learning rate -> improve robustness
- efficient for sparse data (where some features have larger gradients than others)
Cons:
- small learning rate for frequently occurring features -> slow convergence or premature stopping
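A sketch of per-param scaling with the accumulated sum of squared gradients (1-D here for clarity; real implementations keep one accumulator per param):

```python
def adagrad(grad, w0, lr=0.5, eps=1e-8, steps=500):
    """AdaGrad: divide each update by the root of accumulated squared gradients."""
    w, s = w0, 0.0
    for _ in range(steps):
        g = grad(w)
        s += g * g                          # squared gradients accumulate forever
        w = w - lr * g / (s ** 0.5 + eps)   # effective learning rate only shrinks
    return w

# toy quadratic: L(w) = (w - 3)^2
w_star = adagrad(lambda w: 2 * (w - 3), w0=0.0)
```

Because `s` never decays, the effective step size shrinks monotonically, which is exactly the con noted above.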
Adadelta
Idea: address small learning rate in AdaGrad by using a window of past gradients to normalize updates
Pros:
- introduces a moving average into AdaGrad to adapt to changes more effectively
- no need for learning rate initialization
- robust to varying gradients
Cons:
- complicated update rule & implementation
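A sketch following the two running averages in the update rule; note how slowly it moves at first, since the update-magnitude average starts at zero (hyperparams here are illustrative):

```python
def adadelta(grad, w0, rho=0.9, eps=1e-6, steps=2000):
    """Adadelta: step size set by the ratio of two running RMS estimates."""
    w, s, d = w0, 0.0, 0.0
    for _ in range(steps):
        g = grad(w)
        s = rho * s + (1 - rho) * g * g               # running avg of squared gradients
        step = ((d + eps) ** 0.5) / ((s + eps) ** 0.5) * g
        w = w - step
        d = rho * d + (1 - rho) * step * step         # running avg of squared updates
    return w

# toy quadratic: L(w) = (w - 3)^2
w_star = adadelta(lambda w: 2 * (w - 3), w0=0.0)
```

No learning rate appears anywhere: the `d`/`s` ratio sets the step size on its own, which is the "no learning rate initialization" pro.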
RMSProp
Name: Root Mean Square Propagation
Idea: AdaGrad with an exponential moving average of squared gradients (a decaying average instead of a full sum)
Pros:
- simple implementation
- no accumulation of update history
Cons:
- often outperformed by Adam (no momentum or bias correction)
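A sketch: identical to AdaGrad except the accumulator `s` is a decaying average rather than a growing sum, so the effective learning rate no longer vanishes:

```python
def rmsprop(grad, w0, lr=0.01, rho=0.9, eps=1e-8, steps=1000):
    """RMSProp: normalize updates by a decaying average of squared gradients."""
    w, s = w0, 0.0
    for _ in range(steps):
        g = grad(w)
        s = rho * s + (1 - rho) * g * g      # decaying average, unlike AdaGrad's sum
        w = w - lr * g / (s ** 0.5 + eps)
    return w

# toy quadratic: L(w) = (w - 3)^2
w_star = rmsprop(lambda w: 2 * (w - 3), w0=0.0)
```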
Adam
Notations:
$m_t$: first moment (adaptive gradient)
$v_t$: second moment (adaptive learning rate)
$\hat{m}_t, \hat{v}_t$: bias-corrected moments
Name: Adaptive Moment Estimation
Idea: adaptive learning rates for both momentum & gradient
Pros:
- de facto default optimizer in practice
- bias correction
- highly adaptive
- fast convergence
- robust to noisy or sparse gradients
Cons:
- sensitive to hyperparams (3 hyperparams to tune)
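A sketch of the two moments and the bias correction; $\beta_1$, $\beta_2$ are the commonly cited defaults, the learning rate is a toy choice:

```python
def adam(grad, w0, lr=0.01, b1=0.9, b2=0.999, eps=1e-8, steps=2000):
    """Adam: momentum (first moment) + adaptive scale (second moment)."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g        # first moment: momentum-like average
        v = b2 * v + (1 - b2) * g * g    # second moment: per-param scale
        m_hat = m / (1 - b1 ** t)        # bias correction for zero initialization
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w

# toy quadratic: L(w) = (w - 3)^2
w_star = adam(lambda w: 2 * (w - 3), w0=0.0)
```

The bias correction matters early on: without it, the zero-initialized moments understate the true gradient statistics for the first few steps.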
AdamW
$$ w_t=w_{t-1}-\eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}+\lambda w_{t-1}\right) $$
Idea: Adam + Weight Decay
Pros:
- decoupled weight decay -> more effective regularization than adding an L2 penalty to Adam's gradient (the decay is not rescaled by the adaptive learning rate)
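A sketch highlighting the one-line difference from Adam: the decay term acts on $w$ directly instead of being added to the gradient (hyperparam values are illustrative):

```python
def adamw(grad, w0, lr=0.01, b1=0.9, b2=0.999, eps=1e-8, wd=0.01, steps=2000):
    """AdamW: Adam with decoupled weight decay."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)                      # data-loss gradient only, no penalty term
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        # decoupled decay: w shrinks directly, not through the adaptive step
        w = w - lr * (m_hat / (v_hat ** 0.5 + eps) + wd * w)
    return w

# toy quadratic: L(w) = (w - 3)^2
w_star = adamw(lambda w: 2 * (w - 3), w0=0.0)
```

If the penalty were folded into `grad` instead, the second moment would rescale it per param, weakening the decay exactly where gradients are large; decoupling avoids that.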