Supervised Learning

Linear Models

Naive Bayes

Why: provide quick & fairly accurate predictions

What: the simplest generative, parametric classifier solely based on Bayes’ Theorem

Background (#Params):
- General:
- Joint (binary):
- Independent (binary):
Assumption: Conditional Independence of features given label
Params:
- : prior probability for class
- : conditional density param for class and feature

How:

Training:
- Objective:
  - Loss: 0-1
  - Regularization: Laplace smoothing
- Optimization:
  - MLE:
    - : #samples of class
    - : #samples of class with feature
  - MAP:
    - : shifted Dirichlet param for (conjugate prior + likelihood)
    - : shifted Dirichlet param for (conjugate prior + likelihood)
    - Laplace smoothing:
Inference:

Pros:

Easy
Scale invariant
High computational efficiency (significantly low #params)
Robust to outliers and noisy data
No overfitting (unless feature value missing in training set)
Can handle real-time prediction
High accuracy on small datasets (> LogReg)

Cons:

Assumption fails in real life
Low accuracy on large datasets (< LogReg)
Low accuracy in terms of probability estimates (« LogReg)

Linear Regression

What: Linear regression fits a linear hyperplane between features and labels.

When:

Linearity: The underlying relationship between is linear.
Independence: is independent of each other.
Normality: follows Gaussian distribution.
Non-Collinearity: No/Minimal explanatory variables correlate with each other.
Homoskedasticity: The variance of all noises is the same constant .

How:

Training:

Hyperparameters:
Objective:
- Loss: MSE
- Regularization: L1, L2, ElasticNet, L0
Optimization: $$\begin{aligned} &\text{OLS/MLE}:\ &&\hat{\mathbf{w}}=(X^TX)^{-1}X^T\mathbf{y} \\ &\text{Ridge}:\ &&\hat{\mathbf{w}}=(X^TX+\lambda I)^{-1}X^T\mathbf{y} \\ &\text{Lasso}:\ &&\frac{\partial \mathcal{L}B}{\partial w_j}=\frac{1}{m}\sum{i=1}^{m}[x_{ij}(\hat{y}i-y_i)]+\lambda\cdot\text{sign}(w_j) \\ &\text{ElasticNet}:\ &&\frac{\partial \mathcal{L}B}{\partial w_j}=\frac{1}{m}\sum{i=1}^{m}[x{ij}(\hat{y}_i-y_i)]+\lambda r_1\cdot\text{sign}(w_j)+\lambda(1-r_1)w_j \end{aligned}$$

Inference: $$ \hat{y}_i=\mathbf{w}^T\mathbf{x}_i $$

Pros:

Cons:

Objective:

Optimization:

Pros:

Simple and Interpretable
Scale invariant
Consistent and Unbiased (OLS/MLE ver.)

Cons:

Sensitive to outliers
Limited to assumptions (if any assumption fails, LinReg fails)

Time complexity:

Train:
- Exact Solution:
- Gradient Descent:
Test:

Logistic Regression

Idea: use sigmoid/softmax as link function for linear regression for classification.

Model:

Prediction: $$ \hat{y}i=\arg\max_k\hat{p}{ik} $$

Objective:

Loss: Cross Entropy
Regularization: L1, L2, ElasticNet

Optimization: $$ \frac{\partial \mathcal{L}}{\partial w_{jk}}=\frac{1}{m}\sum_{i=1}^{m}[x_{ij}(\hat{p}_{ik}-y_i)] $$

Pros:

Scale invariant
Easy expansion to multiclass
Coefficients resemble feature importance
Easy gradient calculation

Cons:

bad performance when
bad performance for nonlinear cases (assume linearity by log odds)
prioritize correctly classifying the more prevalent class, even if it means misclassifying the less prevalent class

Time Complexity: Train:

Space Complexity:

Support Vector Machine

Idea: choose a decision boundary that maximizes the soft margin between classes.

Preliminaries:

Primal vs Dual
- Primal operates in the feature space
- Dual operates in the sample space (i.e., Kernel Matrix)
- Params in Primal & Dual are transferable:
- Dual > Primal:
  - = Weighted combination of support vectors
  - Sparsity
  - Kernel Trick
Hard margin vs Soft margin
- Hard margin does NOT accept any misclassification prone to overfitting
- Soft margin allows some misclassifications regularization
Support vectors are 1) on the margin 2) on the wrong side 3) within the margin.
In linearly separable case, the decision boundary with the maximal margin is unique.

Model/Prediction: linear classifier $$ \hat{y}_i=\text{sign}(\textbf{w}^T\phi(\textbf{x}_i)) $$

Objective: Hinge Loss + L2 Penalty (can use other losses/penalties but rare)

Primal (Linear):
Primal (Kernel):
Dual (Linear):
Dual (Kernel):

Optimization:

Param Estimation: Decomposition (quadratic programming), Closed-form, GD, etc.
Hyperparam Tuning: , a larger-margin hyperplane will be chosen.

Pros:

Good performance on high-dimensional and non-linearly separable data (with Kernel trick)
Good generalization to unseen data
Guaranteed convexity. Easy to optimize
Low memory cost (only support vectors matter)

Cons:

High computational cost, especially 1) with Kernels 2) when sample size is too large 3) with Multiclass classification (no native support for it; need 1v1 or 1-v-rest strategies)
Low interpretability: No probability estimates
Bad performance when .
Bad performance on large datasets (i.e., ).
Sensitive to outliers and noisy data
Sensitive to overlapping classes (i.e., classes which share the same parts of some feature values)

Time Complexity:

Train:
Test: #support vectors.

Local Learning

K Nearest Neighbors

Idea: label a sample based on its nearest neighbors.

Model:

Calculate distance between sample point and every training point.
Find the nearest neighbors with minimal distances.
Take the majority vote and output it as the label for the sample point.

Pros:

No training: instance-based learning (i.e., lazy learner)
Seamless data augmentation at any step
Simple and interpretable

Cons:

Scale variant
High computational cost for large datasets
High computational cost + Low variance in distance measure for high-dimensional data
Sensitive to noisy data, missing values, and outliers

Time Complexity: Test:

Space Complexity:

Kernel Method

Model (Mercer’s Theorem): $$ \forall\phi:\mathbb{R}^n\rightarrow\mathbb{R}^p\ \exists k:\mathbb{R}^{n\times n}\rightarrow\mathbb{R}\ \text{ s.t. }\ k(\mathbf{x},\mathbf{x}’)=\phi(\mathbf{x})^T\phi(\mathbf{x}’) $$

: input sample
: feature map from one space to another (typically a higher dimensional space)
: kernel function

Idea: Allow using linear models for non-linear samples, without transforming data into a higher dimensional space (i.e., without computing ).

Assumptions: The kernel function must satisfy 2 conditions:

Symmetry:
Positive-Definite:

Types of Kernels:

Type	Formula
Linear
Polynomial
RBF (Radial Basis Function)

Pros:

Low computational cost (relative to feature mapping calculation)
Applicable to all data type
Easy validation of kernel by finding arbitrary 2 points with negative determinant (i.e., negative eigenvalue)
Consider all samples as each sample’s neighbors

Cons:

Scale variant
Overfitting
Low interpretability of kernels
High difficulty in kernel selection
Biased toward closer samples

Decision Tree

Idea: build a tree where each node is a feature split to classify data points into different leaf outputs.

Preliminaries:

Information gain:
- : Impurity measure
  - Gini:
  - Entropy:
Entropy: a measure of uncertainty
- Conditional entropy (Average):
- Specific conditional entropy:

Model:

Calculate info gain for each feature.
Select the feature that maximizes info gain as the decision threshold.
Split data based on the decision. Repeat Step 1-2 until stop.

Pros:

Scale invariant
Low computational cost
Can handle multiclass classification
Interpretable and Easy to validate (easy model visualization)

Cons:

Prone to overfitting (perfect fit on smaller sample size)
Bad performance with class imbalance (biased toward most frequently occurring class)
Sensitive to noisy data, missing value, and outliers
Discrete predictions only
Difficult to find globally optimal decisions (only support locally optimal decisions at each node)
Bad performance overall

Time complexity:

Train:
Test: if balanced binary tree)

Ensemble Methods

Bootstrapping: randomly select is the fraction of samples to bootstrap.
Bagging (bootstrap aggregating): aggregate a bunch of weak models trained on bootstrapped subsets individually.
Boosting: train multiple models sequentially and dependently.
- Converge exponentially with #iterations.
- Cons: Highly sensitive to outliers and noisy data.
Common Pros of Ensemble Methods:
- Scale invariant
- Reduce overfitting
- Greater performance
- Can handle high-dimensional data efficiently
Common Cons of Ensemble Methods:
- High computational cost
- Low interpretability

Random Forest

Idea: Bagging with Decision Trees

Model:

Bootstrap.
Create a full decision tree (no pruning). On each node, randomly select features from bootstrapped subset. Find the best split.
Repeat 1-2 to create a random forest till #tree reaches limit.
Use out-of-bag samples to determine the accuracy of each tree.

Prediction: Take a majority/average vote of all trees.

Objective:

Loss: any
Regularization: increase #trees

Pros:

Reduce overfitting in decision tree & much more accurate than a single decision tree
Flexible to both categorical & numerical outputs
Automatically handle missing values by dropping or filling with median/mode
Robust to outliers and noisy data
Best used in banking and healthcare

Cons:

Worse performance than Boosting in general

Time Complexity:

Train: #trees
Test: max depth

AdaBoost

Idea: train a bunch of stumps (weak learners) sequentially and take a weighted majority vote.

Model:

Init all samples with equal sample weight.
Find optimal feature for first stump. Calculate total error. Calculate amount of say: $$ \text{Amount of Say}=\frac{1}{2}\log{\frac{1-Err_\text{tot}}{Err_\text{tot}}} $$
Modify sample weights for incorrectly and correctly predicted samples as follows:
Normalize all sample weights.
Select new samples based on new sample weights as probabilities with replacement to generate a new set.
Give equal sample weights to all samples in the new training set.
Repeat Steps 1-6 till #stumps reaches limit.

Prediction: Take a weighted majority vote using the Amount of Say from each stump.

Objective: Exponential Loss

Pros:

Reduce overfitting more than bagging (because parameters are not optimized jointly but stagewise)
Fewer hyperparameters than other models

Cons:

Guaranteed perfect fitting (training error=0) if infinite epochs
Sensitive to outliers and noisy data
Slower and generally worse performance than Gradient Boosting

Random Forest vs AdaBoost:

Random Forest	AdaBoost
No tree shape requirement	Each tree is a stump (1 node + 2 leaves)
Each tree has equal influence	Stumps have weighted influence
Each tree is independent	Each stump is dependent of its previous one

Gradient Boosting

Idea: train a bunch of fixed-size trees to fit residuals sequentially. take a weighted majority vote.

Model:

Init a constant-value leaf as initial prediction (average for reg; log-odds for cls)
Train a tree (#leaves < ) to predict the negative loss gradient w.r.t. curr ensemble’s predictions for each sample in the training data.
- Pseudo-residuals: negative gradients represent how far off curr predictions are from actual targets
- The weak learner aims to capture the patterns in the errors made by curr ensemble.
Combine leaf (+ prev trees) + curr tree (scaled with a learning rate) to make new predictions on the same data. Calculate new negative loss gradients.
Repeat Steps 2-3 till #trees reaches limit.

Objective:

Loss: arbitrary
Regularization:
- smaller learning rate (i.e., multiplicative shrinking of the weight on the weak learner)
- bootstrapping

Optimization (Hyperparams):

#stages:
bag size (fraction):
learning rate:
tree depth:

Pros:

Great performance in general
Fast Prediction
High flexibility (multiple hyperparameters to tune, multiple loss functions to use)
Can handle missing data
Reduce overfitting by using a small learning rate and incorporate bootstrapping

Cons:

Sensitive to outliers and noisy data (may cause overfitting by overemphasizing them)
Require a large grid search for hyperparameter tuning
Longer training due to sequential building

AdaBoost vs Gradient Boosting:

AdaBoost	Gradient Boosting
Init with a stump	Init with a leaf of
Stumps	Fixed-size trees
Scale stumps differently	Scale trees equally
Train on	Train on residuals

XGBoost

Idea: use a unique tree to make decisions based on similarity scores.

Preliminaries:

Pre-sort algorithm: sort samples by the feature value, then split linearly.
Histogram-based algorithm: bucket continuous feature values into discrete bins.
Level-wise tree growth (BFS)

Model:

Calculate residuals for all samples based on current prediction. Calculate similarity score for the root node of a new tree. $$ \text{Similarity}=\frac{\sum_{i=1}^{m_r}{r_i^2}}{m_r+\lambda} $$
Find similarity gain of each possible split. Choose the split with the max gain at root node. $$ \text{Gain}=\text{Similarity}\text{left}+\text{Similarity}\text{right}-\text{Similarity}_\text{root} $$
Repeat Step 2 till limit. Prune branches bottom-up by checking whether the gain of the branch is higher than a predefined threshold . If it is higher, stop. If it is lower, prune it, move on to the next branch.
Define the output value for each leaf of this tree. $$ \text{Output}=\frac{\sum_{i=1}^{m_r}{r_i}}{m_r+\lambda} $$
Define the predicted value for each sample. $$ \hat{y}_i=\bar{y}+\eta\sum{\text{Tree}(\mathbf{x}_i)} $$
Repeat Steps 1-5 till limit.

Optimization (hyperparams):

Regularization: , the lower gain, thus easier to prune.)
Prune threshold: prunes negative gains.)
Learning rate:

Pros:

Perform well when #features is small
The Pros of Gradient Boosting

Cons:

Cannot handle categorical features (must do encoding)
Bad performance on sparse and unstructured data
The Cons of Gradient Boosting

LightGBM

Idea: Gradient Boosting + GOSS + EFB

Preliminaries:

GOSS (Gradient-based One-Side Sampling): focus more on under-trained samples without changing the original data distribution.
1. Sort all samples based on abs(gradient). Select top % samples as the samples with large gradients. Keep them.
2. Randomly sample .
EFB (Exclusive Feature Bundling): bundle mutually exclusive features together into much fewer dense features by conflict rate (more nonzero values lead to higher probability of conflicts).
1. Sort features based on conflicts in a descending order. Assign each to an existing bundle with a small conflict or create a new bundle.
2. Merge exclusive features in the same bundle. If two features have joint ranges, add an offset value so that the two features can be merged into one range.
Leaf-wise tree growth: choose the leaf with max delta loss to grow. (DFS)

Pros:

Can handle categorical features
Much faster and efficient training both in time and space
Higher accuracy than most other boosting algorithms, especially on large datasets

Cons:

Overfitting (Because of leaf-wise splitting, the tree can be much more complex)
Bad performance on small datasets

Gradient Boosting Comparisons:

	XGBoost	CatBoost	LightGBM
Tree growth	Asymmetric level-wise tree growth	Symmetric growth	Asymmetric leaf-wise tree growth
Split	Pre-sort + Histogram	Greedy	GOSS + EFB
Numerical features	Support	Support	Support
Categorical features	Need external encoding into numerical features Cannot interpret ordinal category	Support with default encoding methods	Support with default encoding methods Can interpret ordinal category
Text features	NO	YES by converting them to numerical features via bag-of-words, BM-25, or Naive Bayes	NO
Missing values	Interpret as NaN Assign to side that reduces loss the most in each split	Interpret as NaN Process as min/max	Interpret as NaN Assign to side that reduces loss the most in each split

Online Learning

In comparison to batch learning which can be extremely expensive for big datasets, online learning is easy in a map-reduce environment.

Online learning is like SGD where we learn only on a single sample each time.

Common Pros:

Extreme flexibility
Extremely fast in terms of reaching optimum
The only door towards continual learning so far

Common Cons:

Catastrophic forgetting (The model forgets what it learnt before)
Highly noisy convergence (might not converge due to frequent updates)

Least Mean Squares

Idea: Fit LinReg on each observation sequentially.

Model: LinReg

Objective:

Optimization: SGD $$ \textbf{w}_{i+1}=\textbf{w}_i-\frac{\eta}{2}\frac{\partial\mathcal{L}}{\partial\textbf{w}_i}=\textbf{w}_i+\eta(y_i-\textbf{w}^T\textbf{x}_i)\textbf{x}_i $$

This algorithm is guaranteed to converge for .
The convergence rate is proportional to ).

Perceptron

Idea: Fit a linear classifier on each observation sequentially.

Model: linear classifier

Binary:
Multiclass:

Objective:

Optimization: SGD

Binary: $$ \textbf{w}_{i+1}=\textbf{w}_i+\frac{1}{2}(y_i-\text{sign}(\textbf{w}^T\textbf{x}_i))\textbf{x}_i=\textbf{w}_i+y_i\textbf{x}_i $$
- If we get it correct, no update at all because the residual is 0.
- If we get it wrong, drop weights for negative samples and raise weights for positive samples.
Multiclass: Similar to Binary but raise the weight vector for the actual class and reduce the weight vector for the predicted wrong class.

Pros:

Guaranteed to converge to a solution if samples are linearly separable
- #mistakes before convergence is always less than .
- Numerator: size of the biggest sample.
- Denominator: margin of the decision boundary ( )

Cons:

Highly unstable and bounce around if samples are not linearly separable

Variations:

Voted Perceptron: Perceptron but keep track of all the intermediate models and take a majority vote during prediction.
Averaged Perceptron: Voted Perceptron but take an average vote during prediction.
Pros:
- Better generalization performance than perceptron
- Same training time as perceptron
- Nearly as good as SVM
- Can use the Kernel Trick to replace dot product with Kernel
Cons:
- Higher memory cost
- Higher inference cost
Further variations: different ways to tune to maximize margin)

Passive Aggressive

Idea: Perceptron but minimizing hinge loss (i.e., maximize margin)

Model: Perceptron

Objective: Hinge loss $$ \mathcal{L}=\begin{cases} 0 & \text{ if }y_i\textbf{w}^T\textbf{x}_i\geq1 \\ 1-y_i\textbf{w}^T\textbf{x}_i & \text{ if }y_i\textbf{w}^T\textbf{x}_i<1 \end{cases} $$

Optimization: Margin-Infused Relaxed Algorithm (MIRA)

If correct classification with a margin of at least 1, no change.
If wrong, $$ \textbf{w}_{i+1}=\textbf{w}_i+\frac{\mathcal{L}}{||\textbf{x}_i||^2}y_i\textbf{x}_i $$
MIRA attempts to make the smallest changes to the weight vector(s) by moving the hyperplane to include the new sample point onto the margin, therefore maximizing the margin for the entire dataset.

Pros & Cons Summary

Unsupervised Learning

Optimization

Docs

Q.Docs

Title here

Supervised Learning

Linear Models

Naive Bayes

Linear Regression

Logistic Regression

Support Vector Machine

Local Learning

K Nearest Neighbors

Kernel Method

Decision Tree

Ensemble Methods

Random Forest

AdaBoost

Gradient Boosting

XGBoost

LightGBM

Online Learning

Least Mean Squares

Perceptron

Passive Aggressive

Pros & Cons Summary

Supervised Learning

Linear Models

Naive Bayes#

Linear Regression#

Logistic Regression#

Support Vector Machine

Local Learning

K Nearest Neighbors#

Kernel Method#

Decision Tree

Ensemble Methods

Random Forest#

AdaBoost#

Gradient Boosting#

XGBoost#

LightGBM#

Online Learning

Least Mean Squares#

Perceptron#

Passive Aggressive#

Pros & Cons Summary

Naive Bayes

Linear Regression

Logistic Regression

K Nearest Neighbors

Kernel Method

Random Forest

AdaBoost

Gradient Boosting

XGBoost

LightGBM

Least Mean Squares

Perceptron

Passive Aggressive