Supervised Learning
On this page
Linear Models
Naive Bayes
Why: provide quick & fairly accurate predictions
What: the simplest generative, parametric classifier solely based on Bayes’ Theorem
- Background (#Params):
- General:
- Joint (binary):
- Independent (binary):
- General:
- Assumption: Conditional Independence of features given label
- Params:
: prior probability for class
: conditional density param for class
and feature
How:
- Training:
- Objective:
- Loss: 0-1
- Regularization: Laplace smoothing
- Optimization:
- MLE:
: #samples of class
: #samples of class
with feature
- MAP:
: shifted Dirichlet param for
(conjugate prior + likelihood)
: shifted Dirichlet param for
(conjugate prior + likelihood)
- Laplace smoothing:
- MLE:
- Objective:
- Inference:
Pros:
- Easy
- Scale invariant
- High computational efficiency (significantly low #params)
- Robust to outliers and noisy data
- No overfitting (unless feature value missing in training set)
- Can handle real-time prediction
- High accuracy on small datasets (> LogReg)
Cons:
- Assumption fails in real life
- Low accuracy on large datasets (< LogReg)
- Low accuracy in terms of probability estimates (« LogReg)
Linear Regression
What: Linear regression fits a linear hyperplane between features and labels.
When:
- Linearity: The underlying relationship between
is linear.
- Independence:
is independent of each other.
- Normality:
follows Gaussian distribution.
- Non-Collinearity: No/Minimal explanatory variables correlate with each other.
- Homoskedasticity: The variance of all noises is the same constant
.
How:
Training:
- Hyperparameters:
- Objective:
- Loss: MSE
- Regularization: L1, L2, ElasticNet, L0
- Optimization: $$\begin{aligned} &\text{OLS/MLE}:\ &&\hat{\mathbf{w}}=(X^TX)^{-1}X^T\mathbf{y} \\ &\text{Ridge}:\ &&\hat{\mathbf{w}}=(X^TX+\lambda I)^{-1}X^T\mathbf{y} \\ &\text{Lasso}:\ &&\frac{\partial \mathcal{L}B}{\partial w_j}=\frac{1}{m}\sum{i=1}^{m}[x_{ij}(\hat{y}i-y_i)]+\lambda\cdot\text{sign}(w_j) \\ &\text{ElasticNet}:\ &&\frac{\partial \mathcal{L}B}{\partial w_j}=\frac{1}{m}\sum{i=1}^{m}[x{ij}(\hat{y}_i-y_i)]+\lambda r_1\cdot\text{sign}(w_j)+\lambda(1-r_1)w_j \end{aligned}$$
Inference: $$ \hat{y}_i=\mathbf{w}^T\mathbf{x}_i $$
Pros:
Cons:
Objective:
Optimization:
Pros:
- Simple and Interpretable
- Scale invariant
- Consistent and Unbiased (OLS/MLE ver.)
Cons:
- Sensitive to outliers
- Limited to assumptions (if any assumption fails, LinReg fails)
Time complexity:
- Train:
- Exact Solution:
- Gradient Descent:
- Exact Solution:
- Test:
Logistic Regression
Idea: use sigmoid/softmax as link function for linear regression for classification.
Model:
Prediction: $$ \hat{y}i=\arg\max_k\hat{p}{ik} $$
Objective:
- Loss: Cross Entropy
- Regularization: L1, L2, ElasticNet
Optimization: $$ \frac{\partial \mathcal{L}}{\partial w_{jk}}=\frac{1}{m}\sum_{i=1}^{m}[x_{ij}(\hat{p}_{ik}-y_i)] $$
Pros:
- Scale invariant
- Easy expansion to multiclass
- Coefficients resemble feature importance
- Easy gradient calculation
Cons:
- bad performance when
- bad performance for nonlinear cases (assume linearity by log odds)
- prioritize correctly classifying the more prevalent class, even if it means misclassifying the less prevalent class
Time Complexity: Train:
Space Complexity:
Support Vector Machine
Idea: choose a decision boundary that maximizes the soft margin between classes.
Preliminaries:
- Primal vs Dual
- Primal operates in the feature space
- Dual operates in the sample space
(i.e., Kernel Matrix)
- Params in Primal & Dual are transferable:
- Dual > Primal:
- = Weighted combination of support vectors
- Sparsity
- Kernel Trick
- Primal operates in the feature space
- Hard margin vs Soft margin
- Hard margin does NOT accept any misclassification
prone to overfitting
- Soft margin allows some misclassifications
regularization
- Hard margin does NOT accept any misclassification
- Support vectors are 1) on the margin 2) on the wrong side 3) within the margin.
- In linearly separable case, the decision boundary with the maximal margin is unique.
Model/Prediction: linear classifier $$ \hat{y}_i=\text{sign}(\textbf{w}^T\phi(\textbf{x}_i)) $$
Objective: Hinge Loss + L2 Penalty (can use other losses/penalties but rare)
- Primal (Linear):
- Primal (Kernel):
- Dual (Linear):
- Dual (Kernel):
Optimization:
- Param Estimation: Decomposition (quadratic programming), Closed-form, GD, etc.
- Hyperparam Tuning:
, a larger-margin hyperplane will be chosen.
Pros:
- Good performance on high-dimensional and non-linearly separable data (with Kernel trick)
- Good generalization to unseen data
- Guaranteed convexity. Easy to optimize
- Low memory cost (only support vectors matter)
Cons:
- High computational cost, especially 1) with Kernels 2) when sample size is too large 3) with Multiclass classification (no native support for it; need 1v1 or 1-v-rest strategies)
- Low interpretability: No probability estimates
- Bad performance when
.
- Bad performance on large datasets (i.e.,
).
- Sensitive to outliers and noisy data
- Sensitive to overlapping classes (i.e., classes which share the same parts of some feature values)
Time Complexity:
- Train:
- Test:
#support vectors.
Local Learning
K Nearest Neighbors
Idea: label a sample based on its
nearest neighbors.
Model:
- Calculate distance between sample point and every training point.
- Find the
nearest neighbors with minimal distances.
- Take the majority vote and output it as the label for the sample point.
Pros:
- No training: instance-based learning (i.e., lazy learner)
- Seamless data augmentation at any step
- Simple and interpretable
Cons:
- Scale variant
- High computational cost for large datasets
- High computational cost + Low variance in distance measure for high-dimensional data
- Sensitive to noisy data, missing values, and outliers
Time Complexity: Test:
Space Complexity:
Kernel Method
Model (Mercer’s Theorem): $$ \forall\phi:\mathbb{R}^n\rightarrow\mathbb{R}^p\ \exists k:\mathbb{R}^{n\times n}\rightarrow\mathbb{R}\ \text{ s.t. }\ k(\mathbf{x},\mathbf{x}’)=\phi(\mathbf{x})^T\phi(\mathbf{x}’) $$
: input sample
: feature map from one space to another (typically a higher dimensional space)
: kernel function
Idea: Allow using linear models for non-linear samples, without transforming data into a higher dimensional space (i.e., without computing
).
Assumptions: The kernel function must satisfy 2 conditions:
- Symmetry:
- Positive-Definite:
Types of Kernels:
Type | Formula |
---|---|
Linear | |
Polynomial | |
RBF (Radial Basis Function) |
Pros:
- Low computational cost (relative to feature mapping calculation)
- Applicable to all data type
- Easy validation of kernel by finding arbitrary 2 points with negative determinant (i.e., negative eigenvalue)
- Consider all samples as each sample’s neighbors
Cons:
- Scale variant
- Overfitting
- Low interpretability of kernels
- High difficulty in kernel selection
- Biased toward closer samples
Decision Tree
Idea: build a tree where each node is a feature split to classify data points into different leaf outputs.
Preliminaries:
- Information gain:
: Impurity measure
- Gini:
- Entropy:
- Gini:
- Entropy: a measure of uncertainty
- Conditional entropy (Average):
- Specific conditional entropy:
- Conditional entropy (Average):
Model:
- Calculate info gain for each feature.
- Select the feature that maximizes info gain as the decision threshold.
- Split data based on the decision. Repeat Step 1-2 until stop.
Pros:
- Scale invariant
- Low computational cost
- Can handle multiclass classification
- Interpretable and Easy to validate (easy model visualization)
Cons:
- Prone to overfitting (perfect fit on smaller sample size)
- Bad performance with class imbalance (biased toward most frequently occurring class)
- Sensitive to noisy data, missing value, and outliers
- Discrete predictions only
- Difficult to find globally optimal decisions (only support locally optimal decisions at each node)
- Bad performance overall
Time complexity:
- Train:
- Test:
if balanced binary tree)
Ensemble Methods
- Bootstrapping: randomly select
is the fraction of samples to bootstrap.
- Bagging (bootstrap aggregating): aggregate a bunch of weak models trained on bootstrapped subsets individually.
- Boosting: train multiple models sequentially and dependently.
- Converge exponentially with #iterations.
- Cons: Highly sensitive to outliers and noisy data.
- Common Pros of Ensemble Methods:
- Scale invariant
- Reduce overfitting
- Greater performance
- Can handle high-dimensional data efficiently
- Common Cons of Ensemble Methods:
- High computational cost
- Low interpretability
Random Forest
Idea: Bagging with Decision Trees
Model:
- Bootstrap.
- Create a full decision tree (no pruning). On each node, randomly select
features from bootstrapped subset. Find the best split.
- Repeat 1-2 to create a random forest till #tree reaches limit.
- Use out-of-bag samples to determine the accuracy of each tree.
Prediction: Take a majority/average vote of all trees.
Objective:
- Loss: any
- Regularization: increase #trees
Pros:
- Reduce overfitting in decision tree & much more accurate than a single decision tree
- Flexible to both categorical & numerical outputs
- Automatically handle missing values by dropping or filling with median/mode
- Robust to outliers and noisy data
- Best used in banking and healthcare
Cons:
- Worse performance than Boosting in general
Time Complexity:
- Train:
#trees
- Test:
max depth
AdaBoost
Idea: train a bunch of stumps (weak learners) sequentially and take a weighted majority vote.
Model:
- Init all samples with equal sample weight.
- Find optimal feature for first stump. Calculate total error. Calculate amount of say: $$ \text{Amount of Say}=\frac{1}{2}\log{\frac{1-Err_\text{tot}}{Err_\text{tot}}} $$
- Modify sample weights for incorrectly and correctly predicted samples as follows:
- Normalize all sample weights.
- Select new samples based on new sample weights as probabilities with replacement to generate a new set.
- Give equal sample weights to all samples in the new training set.
- Repeat Steps 1-6 till #stumps reaches limit.
Prediction: Take a weighted majority vote using the Amount of Say from each stump.
Objective: Exponential Loss
Pros:
- Reduce overfitting more than bagging (because parameters are not optimized jointly but stagewise)
- Fewer hyperparameters than other models
Cons:
- Guaranteed perfect fitting (training error=0) if infinite epochs
- Sensitive to outliers and noisy data
- Slower and generally worse performance than Gradient Boosting
Random Forest vs AdaBoost:
Random Forest | AdaBoost |
---|---|
No tree shape requirement | Each tree is a stump (1 node + 2 leaves) |
Each tree has equal influence | Stumps have weighted influence |
Each tree is independent | Each stump is dependent of its previous one |
Gradient Boosting
Idea: train a bunch of fixed-size trees to fit residuals sequentially. take a weighted majority vote.
Model:
- Init a constant-value leaf as initial prediction (average for reg; log-odds for cls)
- Train a tree (#leaves <
) to predict the negative loss gradient w.r.t. curr ensemble’s predictions for each sample in the training data.
- Pseudo-residuals: negative gradients represent how far off curr predictions are from actual targets
- The weak learner aims to capture the patterns in the errors made by curr ensemble.
- Combine leaf (+ prev trees) + curr tree (scaled with a learning rate) to make new predictions on the same data. Calculate new negative loss gradients.
- Repeat Steps 2-3 till #trees reaches limit.
Objective:
- Loss: arbitrary
- Regularization:
- smaller learning rate (i.e., multiplicative shrinking of the weight on the weak learner)
- bootstrapping
Optimization (Hyperparams):
- #stages:
- bag size (fraction):
- learning rate:
- tree depth:
Pros:
- Great performance in general
- Fast Prediction
- High flexibility (multiple hyperparameters to tune, multiple loss functions to use)
- Can handle missing data
- Reduce overfitting by using a small learning rate and incorporate bootstrapping
Cons:
- Sensitive to outliers and noisy data (may cause overfitting by overemphasizing them)
- Require a large grid search for hyperparameter tuning
- Longer training due to sequential building
AdaBoost vs Gradient Boosting:
AdaBoost | Gradient Boosting |
---|---|
Init with a stump | Init with a leaf of
|
Stumps | Fixed-size trees |
Scale stumps differently | Scale trees equally |
Train on
| Train on residuals |
XGBoost
Idea: use a unique tree to make decisions based on similarity scores.
Preliminaries:
- Pre-sort algorithm: sort samples by the feature value, then split linearly.
- Histogram-based algorithm: bucket continuous feature values into discrete bins.
- Level-wise tree growth (BFS)
Model:
- Calculate residuals for all samples based on current prediction. Calculate similarity score for the root node of a new tree. $$ \text{Similarity}=\frac{\sum_{i=1}^{m_r}{r_i^2}}{m_r+\lambda} $$
- Find similarity gain of each possible split. Choose the split with the max gain at root node. $$ \text{Gain}=\text{Similarity}\text{left}+\text{Similarity}\text{right}-\text{Similarity}_\text{root} $$
- Repeat Step 2 till limit. Prune branches bottom-up by checking whether the gain of the branch is higher than a predefined threshold
. If it is higher, stop. If it is lower, prune it, move on to the next branch.
- Define the output value for each leaf of this tree. $$ \text{Output}=\frac{\sum_{i=1}^{m_r}{r_i}}{m_r+\lambda} $$
- Define the predicted value for each sample. $$ \hat{y}_i=\bar{y}+\eta\sum{\text{Tree}(\mathbf{x}_i)} $$
- Repeat Steps 1-5 till limit.
Optimization (hyperparams):
- Regularization:
, the lower gain, thus easier to prune.)
- Prune threshold:
prunes negative gains.)
- Learning rate:
Pros:
- Perform well when #features is small
- The Pros of Gradient Boosting
Cons:
- Cannot handle categorical features (must do encoding)
- Bad performance on sparse and unstructured data
- The Cons of Gradient Boosting
LightGBM
Idea: Gradient Boosting + GOSS + EFB
Preliminaries:
- GOSS (Gradient-based One-Side Sampling): focus more on under-trained samples without changing the original data distribution.
- Sort all samples based on abs(gradient). Select top
% samples as the samples with large gradients. Keep them.
- Randomly sample
.
- Sort all samples based on abs(gradient). Select top
- EFB (Exclusive Feature Bundling): bundle mutually exclusive features together into much fewer dense features by conflict rate (more nonzero values lead to higher probability of conflicts).
- Sort features based on conflicts in a descending order. Assign each to an existing bundle with a small conflict or create a new bundle.
- Merge exclusive features in the same bundle. If two features have joint ranges, add an offset value so that the two features can be merged into one range.
- Leaf-wise tree growth: choose the leaf with max delta loss to grow. (DFS)
Pros:
- Can handle categorical features
- Much faster and efficient training both in time and space
- Higher accuracy than most other boosting algorithms, especially on large datasets
Cons:
- Overfitting (Because of leaf-wise splitting, the tree can be much more complex)
- Bad performance on small datasets
Gradient Boosting Comparisons:
XGBoost | CatBoost | LightGBM | |
---|---|---|---|
Tree growth | Asymmetric level-wise tree growth | Symmetric growth | Asymmetric leaf-wise tree growth |
Split | Pre-sort + Histogram | Greedy | GOSS + EFB |
Numerical features | Support | Support | Support |
Categorical features | Need external encoding into numerical features Cannot interpret ordinal category | Support with default encoding methods | Support with default encoding methods Can interpret ordinal category |
Text features | NO | YES by converting them to numerical features via bag-of-words, BM-25, or Naive Bayes | NO |
Missing values | Interpret as NaN Assign to side that reduces loss the most in each split | Interpret as NaN Process as min/max | Interpret as NaN Assign to side that reduces loss the most in each split |
Online Learning
In comparison to batch learning which can be extremely expensive for big datasets, online learning is easy in a map-reduce environment.
Online learning is like SGD where we learn only on a single sample each time.
Common Pros:
- Extreme flexibility
- Extremely fast in terms of reaching optimum
- The only door towards continual learning so far
Common Cons:
- Catastrophic forgetting (The model forgets what it learnt before)
- Highly noisy convergence (might not converge due to frequent updates)
Least Mean Squares
Idea: Fit LinReg on each observation sequentially.
Model: LinReg
Objective:
Optimization: SGD $$ \textbf{w}_{i+1}=\textbf{w}_i-\frac{\eta}{2}\frac{\partial\mathcal{L}}{\partial\textbf{w}_i}=\textbf{w}_i+\eta(y_i-\textbf{w}^T\textbf{x}_i)\textbf{x}_i $$
- This algorithm is guaranteed to converge for
.
- The convergence rate is proportional to
).
Perceptron
Idea: Fit a linear classifier on each observation sequentially.
Model: linear classifier
- Binary:
- Multiclass:
Objective:
Optimization: SGD
- Binary:
$$
\textbf{w}_{i+1}=\textbf{w}_i+\frac{1}{2}(y_i-\text{sign}(\textbf{w}^T\textbf{x}_i))\textbf{x}_i=\textbf{w}_i+y_i\textbf{x}_i
$$
- If we get it correct, no update at all because the residual is 0.
- If we get it wrong, drop weights for negative samples and raise weights for positive samples.
- Multiclass: Similar to Binary but raise the weight vector for the actual class and reduce the weight vector for the predicted wrong class.
Pros:
- Guaranteed to converge to a solution if samples are linearly separable
- #mistakes before convergence is always less than
.
- Numerator: size of the biggest sample.
- Denominator: margin of the decision boundary (
)
- #mistakes before convergence is always less than
Cons:
- Highly unstable and bounce around if samples are not linearly separable
Variations:
- Voted Perceptron: Perceptron but keep track of all the intermediate models and take a majority vote during prediction.
- Averaged Perceptron: Voted Perceptron but take an average vote during prediction.
- Pros:
- Better generalization performance than perceptron
- Same training time as perceptron
- Nearly as good as SVM
- Can use the Kernel Trick to replace dot product with Kernel
- Cons:
- Higher memory cost
- Higher inference cost
- Further variations: different ways to tune
to maximize margin)
Passive Aggressive
Idea: Perceptron but minimizing hinge loss (i.e., maximize margin)
Model: Perceptron
Objective: Hinge loss $$ \mathcal{L}=\begin{cases} 0 & \text{ if }y_i\textbf{w}^T\textbf{x}_i\geq1 \\ 1-y_i\textbf{w}^T\textbf{x}_i & \text{ if }y_i\textbf{w}^T\textbf{x}_i<1 \end{cases} $$
Optimization: Margin-Infused Relaxed Algorithm (MIRA)
- If correct classification with a margin of at least 1, no change.
- If wrong, $$ \textbf{w}_{i+1}=\textbf{w}_i+\frac{\mathcal{L}}{||\textbf{x}_i||^2}y_i\textbf{x}_i $$
- MIRA attempts to make the smallest changes to the weight vector(s) by moving the hyperplane to include the new sample point onto the margin, therefore maximizing the margin for the entire dataset.
Pros & Cons Summary
| | Scale Invariance | Robustness | Consistency | Generalization |