Supervised Learning

Linear Models

Naive Bayes

Why: provide quick & fairly accurate predictions

What: the simplest generative, parametric classifier solely based on Bayes’ Theorem mathematical expression or equation

  • Background (#Params):
    • General: mathematical expression or equation
    • Joint (binary): mathematical expression or equation
    • Independent (binary): mathematical expression or equation
  • Assumption: Conditional Independence of features given label mathematical expression or equation
  • Params:
    • mathematical expression or equation : prior probability for class mathematical expression or equation
    • mathematical expression or equation : conditional density param for class mathematical expression or equation and feature mathematical expression or equation

How:

  • Training:
    • Objective:
      • Loss: 0-1
      • Regularization: Laplace smoothing
    • Optimization:
      • MLE: mathematical expression or equation
        • mathematical expression or equation : #samples of class mathematical expression or equation
        • mathematical expression or equation : #samples of class mathematical expression or equation with feature mathematical expression or equation
      • MAP: mathematical expression or equation
        • mathematical expression or equation : shifted Dirichlet param for mathematical expression or equation (conjugate prior + likelihood)
        • mathematical expression or equation : shifted Dirichlet param for mathematical expression or equation (conjugate prior + likelihood)
        • Laplace smoothing: mathematical expression or equation
  • Inference: mathematical expression or equation

Pros:

  • Easy
  • Scale invariant
  • High computational efficiency (significantly low #params)
  • Robust to outliers and noisy data
  • No overfitting (unless feature value missing in training set)
  • Can handle real-time prediction
  • High accuracy on small datasets (> LogReg)

Cons:

  • Assumption fails in real life
  • Low accuracy on large datasets (< LogReg)
  • Low accuracy in terms of probability estimates (« LogReg)

 

Linear Regression

What: Linear regression fits a linear hyperplane between features and labels.

When:

  • Linearity: The underlying relationship between mathematical expression or equation is linear.
  • Independence: mathematical expression or equation is independent of each other.
  • Normality: mathematical expression or equation follows Gaussian distribution.
  • Non-Collinearity: No/Minimal explanatory variables correlate with each other.
  • Homoskedasticity: The variance of all noises is the same constant mathematical expression or equation .

How: mathematical expression or equation

Training:

  • Hyperparameters:
  • Objective:
    • Loss: MSE
    • Regularization: L1, L2, ElasticNet, L0
  • Optimization: $$\begin{aligned} &\text{OLS/MLE}:\ &&\hat{\mathbf{w}}=(X^TX)^{-1}X^T\mathbf{y} \\ &\text{Ridge}:\ &&\hat{\mathbf{w}}=(X^TX+\lambda I)^{-1}X^T\mathbf{y} \\ &\text{Lasso}:\ &&\frac{\partial \mathcal{L}B}{\partial w_j}=\frac{1}{m}\sum{i=1}^{m}[x_{ij}(\hat{y}i-y_i)]+\lambda\cdot\text{sign}(w_j) \\ &\text{ElasticNet}:\ &&\frac{\partial \mathcal{L}B}{\partial w_j}=\frac{1}{m}\sum{i=1}^{m}[x{ij}(\hat{y}_i-y_i)]+\lambda r_1\cdot\text{sign}(w_j)+\lambda(1-r_1)w_j \end{aligned}$$

Inference: $$ \hat{y}_i=\mathbf{w}^T\mathbf{x}_i $$

Pros:

Cons:

Objective:

Optimization:

Pros:

  • Simple and Interpretable
  • Scale invariant
  • Consistent and Unbiased (OLS/MLE ver.)

Cons:

  • Sensitive to outliers
  • Limited to assumptions (if any assumption fails, LinReg fails)

Time complexity:

  • Train:
    • Exact Solution: mathematical expression or equation
    • Gradient Descent: mathematical expression or equation
  • Test: mathematical expression or equation

 

Logistic Regression

Idea: use sigmoid/softmax as link function for linear regression for classification.

Model: mathematical expression or equation

Prediction: $$ \hat{y}i=\arg\max_k\hat{p}{ik} $$

Objective:

  • Loss: Cross Entropy
  • Regularization: L1, L2, ElasticNet

Optimization: $$ \frac{\partial \mathcal{L}}{\partial w_{jk}}=\frac{1}{m}\sum_{i=1}^{m}[x_{ij}(\hat{p}_{ik}-y_i)] $$

Pros:

  • Scale invariant
  • Easy expansion to multiclass
  • Coefficients resemble feature importance
  • Easy gradient calculation

Cons:

  • bad performance when mathematical expression or equation
  • bad performance for nonlinear cases (assume linearity by log odds)
  • prioritize correctly classifying the more prevalent class, even if it means misclassifying the less prevalent class

Time Complexity: Train: mathematical expression or equation

Space Complexity: mathematical expression or equation

 

Support Vector Machine

Idea: choose a decision boundary that maximizes the soft margin between classes.

Preliminaries:

  • Primal vs Dual
    • Primal operates in the feature space mathematical expression or equation
    • Dual operates in the sample space mathematical expression or equation (i.e., Kernel Matrix)
    • Params in Primal & Dual are transferable: mathematical expression or equation
    • Dual > Primal:
      • = Weighted combination of support vectors
      • Sparsity
      • Kernel Trick
  • Hard margin vs Soft margin
    • Hard margin does NOT accept any misclassification mathematical expression or equation prone to overfitting
    • Soft margin allows some misclassifications mathematical expression or equation regularization
  • Support vectors are 1) on the margin 2) on the wrong side 3) within the margin.
  • In linearly separable case, the decision boundary with the maximal margin is unique.

Model/Prediction: linear classifier $$ \hat{y}_i=\text{sign}(\textbf{w}^T\phi(\textbf{x}_i)) $$

Objective: Hinge Loss + L2 Penalty (can use other losses/penalties but rare)

  • Primal (Linear): mathematical expression or equation
  • Primal (Kernel): mathematical expression or equation
  • Dual (Linear): mathematical expression or equation
  • Dual (Kernel): mathematical expression or equation

Optimization:

  • Param Estimation: Decomposition (quadratic programming), Closed-form, GD, etc.
  • Hyperparam Tuning: mathematical expression or equation , a larger-margin hyperplane will be chosen.

Pros:

  • Good performance on high-dimensional and non-linearly separable data (with Kernel trick)
  • Good generalization to unseen data
  • Guaranteed convexity. Easy to optimize
  • Low memory cost (only support vectors matter)

Cons:

  • High computational cost, especially 1) with Kernels 2) when sample size is too large 3) with Multiclass classification (no native support for it; need 1v1 or 1-v-rest strategies)
  • Low interpretability: No probability estimates
  • Bad performance when mathematical expression or equation .
  • Bad performance on large datasets (i.e., mathematical expression or equation ).
  • Sensitive to outliers and noisy data
  • Sensitive to overlapping classes (i.e., classes which share the same parts of some feature values)

Time Complexity:

  • Train: mathematical expression or equation
  • Test: mathematical expression or equation #support vectors.

 

 

Local Learning

K Nearest Neighbors

Idea: label a sample based on its mathematical expression or equation nearest neighbors.

Model:

  1. Calculate distance between sample point and every training point.
  2. Find the mathematical expression or equation nearest neighbors with minimal distances.
  3. Take the majority vote and output it as the label for the sample point.

Pros:

  • No training: instance-based learning (i.e., lazy learner)
  • Seamless data augmentation at any step
  • Simple and interpretable

Cons:

  • Scale variant
  • High computational cost for large datasets
  • High computational cost + Low variance in distance measure for high-dimensional data
  • Sensitive to noisy data, missing values, and outliers

Time Complexity: Test: mathematical expression or equation

Space Complexity: mathematical expression or equation

 

Kernel Method

Model (Mercer’s Theorem): $$ \forall\phi:\mathbb{R}^n\rightarrow\mathbb{R}^p\ \exists k:\mathbb{R}^{n\times n}\rightarrow\mathbb{R}\ \text{ s.t. }\ k(\mathbf{x},\mathbf{x}’)=\phi(\mathbf{x})^T\phi(\mathbf{x}’) $$

  • mathematical expression or equation : input sample
  • mathematical expression or equation : feature map from one space to another (typically a higher dimensional space)
  • mathematical expression or equation : kernel function

Idea: Allow using linear models for non-linear samples, without transforming data into a higher dimensional space (i.e., without computing mathematical expression or equation ).

Assumptions: The kernel function must satisfy 2 conditions:

  • Symmetry: mathematical expression or equation
  • Positive-Definite: mathematical expression or equation

Types of Kernels:

TypeFormula
Linearmathematical expression or equation
Polynomialmathematical expression or equation
RBF (Radial Basis Function)mathematical expression or equation

Pros:

  • Low computational cost (relative to feature mapping calculation)
  • Applicable to all data type
  • Easy validation of kernel by finding arbitrary 2 points with negative determinant (i.e., negative eigenvalue)
  • Consider all samples as each sample’s neighbors

Cons:

  • Scale variant
  • Overfitting
  • Low interpretability of kernels
  • High difficulty in kernel selection
  • Biased toward closer samples

 

 

Decision Tree

Idea: build a tree where each node is a feature split to classify data points into different leaf outputs.

Preliminaries:

  • Information gain: mathematical expression or equation
    • mathematical expression or equation : Impurity measure
      • Gini: mathematical expression or equation
      • Entropy: mathematical expression or equation
  • Entropy: a measure of uncertainty
    • Conditional entropy (Average): mathematical expression or equation
    • Specific conditional entropy: mathematical expression or equation

Model:

  1. Calculate info gain for each feature.
  2. Select the feature that maximizes info gain as the decision threshold.
  3. Split data based on the decision. Repeat Step 1-2 until stop.

Pros:

  • Scale invariant
  • Low computational cost
  • Can handle multiclass classification
  • Interpretable and Easy to validate (easy model visualization)

Cons:

  • Prone to overfitting (perfect fit on smaller sample size)
  • Bad performance with class imbalance (biased toward most frequently occurring class)
  • Sensitive to noisy data, missing value, and outliers
  • Discrete predictions only
  • Difficult to find globally optimal decisions (only support locally optimal decisions at each node)
  • Bad performance overall

Time complexity:

  • Train: mathematical expression or equation
  • Test: mathematical expression or equation if balanced binary tree)

 

 

Ensemble Methods

  • Bootstrapping: randomly select mathematical expression or equation is the fraction of samples to bootstrap.
  • Bagging (bootstrap aggregating): aggregate a bunch of weak models trained on bootstrapped subsets individually.
  • Boosting: train multiple models sequentially and dependently.
    • Converge exponentially with #iterations.
    • Cons: Highly sensitive to outliers and noisy data.
  • Common Pros of Ensemble Methods:
    • Scale invariant
    • Reduce overfitting
    • Greater performance
    • Can handle high-dimensional data efficiently
  • Common Cons of Ensemble Methods:
    • High computational cost
    • Low interpretability

 

Random Forest

Idea: Bagging with Decision Trees

Model:

  1. Bootstrap.
  2. Create a full decision tree (no pruning). On each node, randomly select mathematical expression or equation features from bootstrapped subset. Find the best split.
  3. Repeat 1-2 to create a random forest till #tree reaches limit.
  4. Use out-of-bag samples to determine the accuracy of each tree.

Prediction: Take a majority/average vote of all trees.

Objective:

  • Loss: any
  • Regularization: increase #trees

Pros:

  • Reduce overfitting in decision tree & much more accurate than a single decision tree
  • Flexible to both categorical & numerical outputs
  • Automatically handle missing values by dropping or filling with median/mode
  • Robust to outliers and noisy data
  • Best used in banking and healthcare

Cons:

  • Worse performance than Boosting in general

Time Complexity:

  • Train: mathematical expression or equation #trees
  • Test: mathematical expression or equation max depth

 

AdaBoost

Idea: train a bunch of stumps (weak learners) sequentially and take a weighted majority vote.

Model:

  1. Init all samples with equal sample weight.
  2. Find optimal feature for first stump. Calculate total error. Calculate amount of say: $$ \text{Amount of Say}=\frac{1}{2}\log{\frac{1-Err_\text{tot}}{Err_\text{tot}}} $$
  3. Modify sample weights for incorrectly and correctly predicted samples as follows: mathematical expression or equation
  4. Normalize all sample weights.
  5. Select new samples based on new sample weights as probabilities with replacement to generate a new set.
  6. Give equal sample weights to all samples in the new training set.
  7. Repeat Steps 1-6 till #stumps reaches limit.

Prediction: Take a weighted majority vote using the Amount of Say from each stump.

Objective: Exponential Loss

Pros:

  • Reduce overfitting more than bagging (because parameters are not optimized jointly but stagewise)
  • Fewer hyperparameters than other models

Cons:

  • Guaranteed perfect fitting (training error=0) if infinite epochs
  • Sensitive to outliers and noisy data
  • Slower and generally worse performance than Gradient Boosting

Random Forest vs AdaBoost:

Random ForestAdaBoost
No tree shape requirementEach tree is a stump (1 node + 2 leaves)
Each tree has equal influenceStumps have weighted influence
Each tree is independentEach stump is dependent of its previous one

 

Gradient Boosting

Idea: train a bunch of fixed-size trees to fit residuals sequentially. take a weighted majority vote.

Model:

  1. Init a constant-value leaf as initial prediction (average for reg; log-odds for cls)
  2. Train a tree (#leaves < mathematical expression or equation ) to predict the negative loss gradient w.r.t. curr ensemble’s predictions for each sample in the training data.
    • Pseudo-residuals: negative gradients represent how far off curr predictions are from actual targets
    • The weak learner aims to capture the patterns in the errors made by curr ensemble.
  3. Combine leaf (+ prev trees) + curr tree (scaled with a learning rate) to make new predictions on the same data. Calculate new negative loss gradients.
  4. Repeat Steps 2-3 till #trees reaches limit.

Objective:

  • Loss: arbitrary
  • Regularization:
    • smaller learning rate (i.e., multiplicative shrinking of the weight on the weak learner)
    • bootstrapping

Optimization (Hyperparams):

  • #stages: mathematical expression or equation
  • bag size (fraction): mathematical expression or equation
  • learning rate: mathematical expression or equation
  • tree depth: mathematical expression or equation

Pros:

  • Great performance in general
  • Fast Prediction
  • High flexibility (multiple hyperparameters to tune, multiple loss functions to use)
  • Can handle missing data
  • Reduce overfitting by using a small learning rate and incorporate bootstrapping

Cons:

  • Sensitive to outliers and noisy data (may cause overfitting by overemphasizing them)
  • Require a large grid search for hyperparameter tuning
  • Longer training due to sequential building

AdaBoost vs Gradient Boosting:

AdaBoostGradient Boosting
Init with a stumpInit with a leaf of mathematical expression or equation
StumpsFixed-size trees
Scale stumps differentlyScale trees equally
Train on mathematical expression or equationTrain on residuals

 

XGBoost

Idea: use a unique tree to make decisions based on similarity scores.

Preliminaries:

  • Pre-sort algorithm: sort samples by the feature value, then split linearly.
  • Histogram-based algorithm: bucket continuous feature values into discrete bins.
  • Level-wise tree growth (BFS)

Model:

  1. Calculate residuals for all samples based on current prediction. Calculate similarity score for the root node of a new tree. $$ \text{Similarity}=\frac{\sum_{i=1}^{m_r}{r_i^2}}{m_r+\lambda} $$
  2. Find similarity gain of each possible split. Choose the split with the max gain at root node. $$ \text{Gain}=\text{Similarity}\text{left}+\text{Similarity}\text{right}-\text{Similarity}_\text{root} $$
  3. Repeat Step 2 till limit. Prune branches bottom-up by checking whether the gain of the branch is higher than a predefined threshold mathematical expression or equation . If it is higher, stop. If it is lower, prune it, move on to the next branch.
  4. Define the output value for each leaf of this tree. $$ \text{Output}=\frac{\sum_{i=1}^{m_r}{r_i}}{m_r+\lambda} $$
  5. Define the predicted value for each sample. $$ \hat{y}_i=\bar{y}+\eta\sum{\text{Tree}(\mathbf{x}_i)} $$
  6. Repeat Steps 1-5 till limit.

Optimization (hyperparams):

  • Regularization: mathematical expression or equation , the lower gain, thus easier to prune.)
  • Prune threshold: mathematical expression or equation prunes negative gains.)
  • Learning rate: mathematical expression or equation

Pros:

  • Perform well when #features is small
  • The Pros of Gradient Boosting

Cons:

  • Cannot handle categorical features (must do encoding)
  • Bad performance on sparse and unstructured data
  • The Cons of Gradient Boosting

 

LightGBM

Idea: Gradient Boosting + GOSS + EFB

Preliminaries:

  • GOSS (Gradient-based One-Side Sampling): focus more on under-trained samples without changing the original data distribution.
    1. Sort all samples based on abs(gradient). Select top mathematical expression or equation % samples as the samples with large gradients. Keep them.
    2. Randomly sample mathematical expression or equation .
  • EFB (Exclusive Feature Bundling): bundle mutually exclusive features together into much fewer dense features by conflict rate (more nonzero values lead to higher probability of conflicts).
    1. Sort features based on conflicts in a descending order. Assign each to an existing bundle with a small conflict or create a new bundle.
    2. Merge exclusive features in the same bundle. If two features have joint ranges, add an offset value so that the two features can be merged into one range.
  • Leaf-wise tree growth: choose the leaf with max delta loss to grow. (DFS)

Pros:

  • Can handle categorical features
  • Much faster and efficient training both in time and space
  • Higher accuracy than most other boosting algorithms, especially on large datasets

Cons:

  • Overfitting (Because of leaf-wise splitting, the tree can be much more complex)
  • Bad performance on small datasets

Gradient Boosting Comparisons:

XGBoostCatBoostLightGBM
Tree growthAsymmetric level-wise tree growthSymmetric growthAsymmetric leaf-wise tree growth
SplitPre-sort + HistogramGreedyGOSS + EFB
Numerical featuresSupportSupportSupport
Categorical featuresNeed external encoding into numerical features
Cannot interpret ordinal category
Support with default encoding methodsSupport with default encoding methods
Can interpret ordinal category
Text featuresNOYES by converting them to numerical features
via bag-of-words, BM-25, or Naive Bayes
NO
Missing valuesInterpret as NaN
Assign to side that reduces loss the most in each split
Interpret as NaN
Process as min/max
Interpret as NaN
Assign to side that reduces loss the most in each split

 

 

Online Learning

In comparison to batch learning which can be extremely expensive for big datasets, online learning is easy in a map-reduce environment.

Online learning is like SGD where we learn only on a single sample each time.

Common Pros:

  • Extreme flexibility
  • Extremely fast in terms of reaching optimum
  • The only door towards continual learning so far

Common Cons:

  • Catastrophic forgetting (The model forgets what it learnt before)
  • Highly noisy convergence (might not converge due to frequent updates)

 

Least Mean Squares

Idea: Fit LinReg on each observation sequentially.

Model: LinReg

Objective: mathematical expression or equation

Optimization: SGD $$ \textbf{w}_{i+1}=\textbf{w}_i-\frac{\eta}{2}\frac{\partial\mathcal{L}}{\partial\textbf{w}_i}=\textbf{w}_i+\eta(y_i-\textbf{w}^T\textbf{x}_i)\textbf{x}_i $$

  • This algorithm is guaranteed to converge for mathematical expression or equation .
  • The convergence rate is proportional to mathematical expression or equation ).

 

Perceptron

Idea: Fit a linear classifier on each observation sequentially.

Model: linear classifier

  • Binary: mathematical expression or equation
  • Multiclass: mathematical expression or equation

Objective: mathematical expression or equation

Optimization: SGD

  • Binary: $$ \textbf{w}_{i+1}=\textbf{w}_i+\frac{1}{2}(y_i-\text{sign}(\textbf{w}^T\textbf{x}_i))\textbf{x}_i=\textbf{w}_i+y_i\textbf{x}_i $$
    • mathematical expression or equation
    • If we get it correct, no update at all because the residual is 0.
    • If we get it wrong, drop weights for negative samples and raise weights for positive samples.
  • Multiclass: Similar to Binary but raise the weight vector for the actual class and reduce the weight vector for the predicted wrong class.

Pros:

  • Guaranteed to converge to a solution if samples are linearly separable
    • #mistakes before convergence is always less than mathematical expression or equation .
    • Numerator: size of the biggest sample.
    • Denominator: margin of the decision boundary ( mathematical expression or equation )

Cons:

  • Highly unstable and bounce around if samples are not linearly separable

Variations:

  • Voted Perceptron: Perceptron but keep track of all the intermediate models and take a majority vote during prediction.
  • Averaged Perceptron: Voted Perceptron but take an average vote during prediction.
  • Pros:
    • Better generalization performance than perceptron
    • Same training time as perceptron
    • Nearly as good as SVM
    • Can use the Kernel Trick to replace dot product with Kernel
  • Cons:
    • Higher memory cost
    • Higher inference cost
  • Further variations: different ways to tune mathematical expression or equation to maximize margin)

 

Passive Aggressive

Idea: Perceptron but minimizing hinge loss (i.e., maximize margin)

Model: Perceptron

Objective: Hinge loss $$ \mathcal{L}=\begin{cases} 0 & \text{ if }y_i\textbf{w}^T\textbf{x}_i\geq1 \\ 1-y_i\textbf{w}^T\textbf{x}_i & \text{ if }y_i\textbf{w}^T\textbf{x}_i<1 \end{cases} $$

Optimization: Margin-Infused Relaxed Algorithm (MIRA)

  • If correct classification with a margin of at least 1, no change.
  • If wrong, $$ \textbf{w}_{i+1}=\textbf{w}_i+\frac{\mathcal{L}}{||\textbf{x}_i||^2}y_i\textbf{x}_i $$
  • MIRA attempts to make the smallest changes to the weight vector(s) by moving the hyperplane to include the new sample point onto the margin, therefore maximizing the margin for the entire dataset.

Pros & Cons Summary

| | Scale Invariance | Robustness | Consistency | Generalization |