Unsupervised Learning
Clustering
Clustering groups similar samples together with no prior knowledge.
K-Means
Idea: Hard clustering (each sample is assigned to exactly one cluster)
Model/Algorithm:
- Init centroids (e.g., uniformly at random within the data range).
- Find cluster for each data point: $$ c_i=\arg\min_k||\mathbf{x}_i-\boldsymbol{\mu}_k||^2\quad(r_{ik}=\textbf{1}\{c_i=k\}) $$
- Update centroids as the mean of all data points in the current cluster: $$ \boldsymbol{\mu}_k=\frac{\sum_{i=1}^{m}r_{ik}\mathbf{x}_i}{\sum_{i=1}^{m}r_{ik}} $$
- Repeat Steps 2-3 until convergence (i.e., cluster assignments remain unchanged).
Objective: reconstruction error: $$ J=\sum_{i=1}^{m}\sum_{k=1}^{K}r_{ik}||\textbf{x}_i-\boldsymbol{\mu}_k||^2 $$
Optimization:
- Objective minimization: Greedy (because this objective is NP-hard to optimize)
- Hyperparameter tuning: choose the elbow point in the “reconstruction error vs. #clusters” graph.
Pros:
- Simple & interpretable
- Guarantee convergence in a finite number of iterations
- Flexible re-training
- Generalize to any type/shape/size of clusters
- Suitable for large datasets
- Time complexity: $O(t\cdot K\cdot m\cdot n)$ for $t$ iterations, $K$ clusters, $m$ samples, $n$ features
Cons:
- Scale variant
- Numerical features only
- Manual hyperparameter choice: #clusters $K$
- Inconsistent: Sensitive to centroid initialization
- Sensitive to outliers and noisy data by including them
- Bad performance on high-dimensional data (distance metric works poorly)
- Hard clustering (assumes each sample belongs 100% to its designated cluster)
Code:
import numpy as np

def distance(v1, v2, metric_type='L2'):
    if metric_type == "L0":
        return np.count_nonzero(v1 - v2)
    if metric_type == "L1":
        return np.sum(np.abs(v1 - v2))
    if metric_type == "L2":
        return np.sqrt(np.sum(np.square(v1 - v2)))
    if metric_type == "Linf":
        return np.max(np.abs(v1 - v2))

class KMeans:
    def __init__(self, n_clusters=8, max_iter=100):
        self.n_clusters = n_clusters
        self.max_iter = max_iter

    def fit(self, X_train):
        # 1) Init centroids uniformly within the bounding box of the data
        x_min, x_max = np.min(X_train, axis=0), np.max(X_train, axis=0)
        self.centroids = [np.random.uniform(x_min, x_max) for _ in range(self.n_clusters)]
        # 4) Repeat Steps 2-3 until centroids stop moving or max_iter is reached
        for _ in range(self.max_iter):
            # 2) Assign each data point to its nearest centroid
            clustered_points = [[] for _ in range(self.n_clusters)]
            for x in X_train:
                distances = [distance(x, centroid) for centroid in self.centroids]
                clustered_points[np.argmin(distances)].append(x)
            # 3) Update each centroid as the mean of its cluster
            #    (an empty cluster keeps its previous centroid)
            new_centroids = [np.mean(cluster, axis=0) if cluster else self.centroids[k]
                             for k, cluster in enumerate(clustered_points)]
            if np.allclose(new_centroids, self.centroids):
                break
            self.centroids = new_centroids
Gaussian Mixture Model
Idea: clustering, but probabilistic (soft clustering: each sample belongs to every cluster with some probability).
Background: Shapes of clusters are determined by the properties of their covariance matrices ($n$ = #features, $K$ = #clusters).
- Shared Spherical: #params=1, $\Sigma_k=\sigma^2I$ same for all clusters.
  - K-means = GMM with shared spherical covariances in the limit $\sigma^2\to0$ (responsibilities become hard assignments).
- Spherical: #params=$K$, one $\Sigma_k=\sigma_k^2I$ per cluster.
- Shared Diagonal: #params=$n$, same diagonal $\Sigma$ for all clusters.
- Diagonal: #params=$Kn$, one diagonal $\Sigma_k$ per cluster.
- Shared Full Covariance: #params=$\frac{n(n+1)}{2}$, same full $\Sigma$ for all clusters.
- Full Covariance: #params=$K\frac{n(n+1)}{2}$, one full $\Sigma_k$ per cluster.
Assumptions:
- There exists a latent variable $z_i\in\{1,\dots,K\}$ representing the index of the Gaussian distribution in the mixture.
Model:
$$ P(\textbf{x})=\sum_{k=1}^{K}\pi_k\mathcal{N}(\textbf{x}|\boldsymbol{\mu}_k,\Sigma_k) $$, where $\pi_k=P(z=k)$ are the mixture weights with $\sum_{k=1}^{K}\pi_k=1$.
#params: $(K-1)$ weights $+$ $Kn$ means $+$ covariance params (see Background).
Optimization: EM
- Init mixture weights $\pi_k$ and the parameters $\boldsymbol{\mu}_k,\Sigma_k$ of each Gaussian.
- E-step: estimate responsibilities $r_{ik}$: $$ r_{ik}=P(z_i=k|\textbf{x}_i)=\frac{P(\textbf{x}_i|z_i=k)P(z_i=k)}{\sum_{j=1}^{K}P(\textbf{x}_i|z_i=j)P(z_i=j)} $$
- M-step: estimate params via MLE given $r_{ik}$: $$ \pi_k=\frac{1}{m}\sum_{i=1}^{m}r_{ik},\quad\boldsymbol{\mu}_k=\frac{\sum_{i=1}^{m}r_{ik}\textbf{x}_i}{\sum_{i=1}^{m}r_{ik}},\quad\Sigma_k=\frac{\sum_{i=1}^{m}r_{ik}(\textbf{x}_i-\boldsymbol{\mu}_k)(\textbf{x}_i-\boldsymbol{\mu}_k)^T}{\sum_{i=1}^{m}r_{ik}} $$
- Repeat Step 2-3 until convergence.
Pros:
- More robust to outliers and noisy data
- Flexible to a great variety of shapes of data points
- Soft clustering
- Weighted distance instead of absolute distance in k-means
Cons:
- High computational cost
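Code: a minimal NumPy sketch of the EM loop above, using full covariances. The deterministic farthest-point initialization of the means and the small ridge added to each covariance for numerical stability are illustrative choices, not part of these notes.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density evaluated at each row of X."""
    n = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm

def gmm_em(X, K, n_iter=50):
    m, n = X.shape
    # 1) Init: uniform weights, farthest-point means, identity covariances
    pi = np.full(K, 1.0 / K)
    mu = [X[0]]
    for _ in range(1, K):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in mu], axis=0)
        mu.append(X[np.argmax(d)])
    mu = np.array(mu)
    Sigma = np.array([np.eye(n) for _ in range(K)])
    for _ in range(n_iter):
        # 2) E-step: responsibilities r_ik = P(z_i = k | x_i)
        r = np.column_stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k])
                             for k in range(K)])
        r /= r.sum(axis=1, keepdims=True)
        # 3) M-step: MLE of pi, mu, Sigma weighted by responsibilities
        Nk = r.sum(axis=0)
        pi = Nk / m
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            # Small ridge keeps Sigma_k invertible (numerical-stability choice)
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(n)
    return pi, mu, Sigma, r
```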
Dimensionality Reduction
Dimensionality reduction reduces dimensionality (i.e., #features) of input data.
We need dimensionality reduction because:
- Curse of Dimensionality: high-dimensional data has high sparsity.
- Computational Efficiency: high-dimensional data requires more computational resources (time & space).
- Overfitting: high-dimensional data are prone to overfitting.
- Visualization: we cannot visualize any data beyond 3D.
- Performance: high-dimensional data tend to contain more noise, which can be reduced by selecting the most important features.
- Interpretability: only the most relevant features matter.
Singular Value Decomposition
Model: $$ X=UDV^T=\sum_{i=1}^{\text{rank}(X)}D_{ii}\textbf{u}_i\textbf{v}_i^T $$
- Input:
  - $X\in\mathbb{R}^{m\times n}$: arbitrary input matrix
- Output (Matrix ver.):
  - $U\in\mathbb{R}^{m\times r}$: left singular vectors (i.e., final rotation)
  - $D\in\mathbb{R}^{r\times r}$: singular values (i.e., scaling)
  - $V\in\mathbb{R}^{n\times r}$: right singular vectors (i.e., initial rotation)
  - $r=\text{rank}(X)$: rank
- Output (Vector ver.):
  - $\textbf{u}_i,\textbf{v}_i$: $i$th column unit vectors of $U,V$
  - $\textbf{u}_i\textbf{v}_i^T$: $i$th outer product matrix
- Note:
  - rotation matrices are orthonormal: $U^TU=V^TV=I$
  - scaling matrix is diagonal: $D=\text{diag}(D_{11},\dots,D_{rr})$ with $D_{11}\ge\cdots\ge D_{rr}>0$
Idea: An arbitrary matrix = “initial rotation → scaling → final rotation” applied to the unit ball.
- Think of an arbitrary $2\times2$ matrix acting on a 2D plane. We can reconstruct this matrix's action on a unit disk (i.e., the 2 unit vectors along the coordinates) via the following steps:
  - Rotate the unit vectors by $V^T$.
  - Scale the rotated unit disk into an ellipse by $D$.
  - Rotate the ellipse by $U$.
Properties:
- $X^TX=VD^2V^T$ (i.e., right singular vectors = eigenvectors of covariance matrix)
- $XX^T=UD^2U^T$ (i.e., left singular vectors = eigenvectors of outer product matrix)
- Pseudo-inverse: $X^+=VD^{-1}U^T$
- If $X$ is symmetric positive semi-definite, SVD = eigendecomposition (i.e., singular value = eigenvalue).
  - Positive Semi-Definite: a symmetric matrix $X$ such that $\textbf{v}^TX\textbf{v}\ge0$ for all $\textbf{v}$.
Applications:
- Simplify OLS in regression: when calculating the “inverse” of a rectangular matrix, $\hat{\boldsymbol{\beta}}=X^+\textbf{y}=VD^{-1}U^T\textbf{y}$.
- Low-rank matrix approximation: define a lower rank $k<\text{rank}(X)$ and keep only the top-$k$ singular values: $X_k=\sum_{i=1}^{k}D_{ii}\textbf{u}_i\textbf{v}_i^T$.
- Eigenword: project high-dimensional context to low-dimensional space, assuming distributional similarity.
- Distributional similarity: words with similar contexts have similar meanings.
- Distance-based similarity measure: similar words are close in this low-dimensional space.
- Eigenwords: left singular vectors (i.e., word embeddings).
- Eigentokens: right singular vectors of the context matrix (i.e., contextual embeddings).
- Word sense disambiguation: estimate contextual embedding for a word with right singular vectors.
- PCA: see next.
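Code: a NumPy check of the low-rank-approximation and pseudo-inverse properties above; the matrix sizes are arbitrary.

```python
import numpy as np

# Rank-2 approximation of a random matrix via truncated SVD.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U @ diag(s) @ Vt

k = 2  # keep only the top-k singular values
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius error of the rank-k approximation equals the
# root-sum-square of the discarded singular values.
err = np.linalg.norm(X - X_k, ord='fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))

# Pseudo-inverse from SVD: X^+ = V D^{-1} U^T
X_pinv = Vt.T @ np.diag(1 / s) @ U.T
assert np.allclose(X_pinv, np.linalg.pinv(X))
```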
Principal Component Analysis
Model (Original ver.):
- Calculate the covariance matrix of $X$: $\Sigma=\frac{1}{m}X^TX$ (with $X$ centered).
- Diagonalize the covariance matrix via the Spectral Theorem: $\Sigma=V\Lambda V^T$.
  - $\Lambda=\text{diag}(\lambda_1,\dots,\lambda_n)$: eigenvalues (i.e., PC strength / sample variance of projection)
  - $V$: orthonormal eigenvectors (i.e., PCs)
- Sort eigenvectors in $V$ by eigenvalue in descending order.
- Select & normalize the strongest $k$ eigenvectors, where $k$ is a hyperparameter.
  - Now $V_k=[\textbf{v}_1,\dots,\textbf{v}_k]\in\mathbb{R}^{n\times k}$.
- Project $X$ onto the selected eigenvectors.
  - Each projected point is: $\textbf{z}_i=V_k^T\textbf{x}_i$.
Model (SVD ver.):
- Compute SVD: $X=UDV^T$.
- Select the $k$ columns of $V$ (the right singular matrix) with the largest singular values as PCs.
- Project the original dataset onto these $k$ eigenvectors.
Idea: We can project the input data onto an orthonormal basis
of smaller dimensions while covering maximal variance among features.
- Orthonormal basis: $\textbf{v}_i^T\textbf{v}_j=\textbf{1}\{i=j\}$.
- Ideally, we lose minimal info (represented by minimal variance) while successfully reducing the dimensionality.
Assumptions:
- Input matrix $X$ is centered/standardized on the sample space (optional but strongly recommended, unless data is sparse).
- PCs = linear combinations of original features. (if not, then Kernel PCA)
- Variance = a measure of feature importance.
Optimization: minimize distortion (i.e., maximize variance in new coordinates)
- Distortion (reconstruction error): $$ \text{Distortion}_k=\sum_{i=1}^{m}||\textbf{x}_i-\hat{\textbf{x}}_i||^2=m\sum_{j=k+1}^{n}\lambda_j $$
- Variance (of projected points): $$ \text{Variance}_k=m\sum_{j=1}^{k}\textbf{v}_j^T\Sigma\textbf{v}_j=m\sum_{j=1}^{k}\lambda_j $$
- Minimizing distortion = Maximizing variance: $$ \text{Variance}_k+\text{Distortion}_k=m\sum_{j=1}^{n}\lambda_j=m\cdot\text{trace}(\Sigma) $$
Inference/Reconstruction: Any sample vector can be approximated in terms of coefficients (scores) on eigenvectors (loadings).
- Predict the original point via inverse mapping: $\hat{\textbf{x}}_i=V_k\textbf{z}_i=V_kV_k^T\textbf{x}_i$.
- The projected new points $\textbf{z}_i$ are uncorrelated across PCs (their covariance $V_k^T\Sigma V_k=\Lambda_k$ is diagonal), but their variances equal $\lambda_j$, not 1, even if you standardize the data first.
- The original point can be fully reconstructed if $k=n$: $\hat{\textbf{x}}_i=V_nV_n^T\textbf{x}_i=\textbf{x}_i$ (since $V_n$ is orthonormal, $V_nV_n^T=I$).
Pros:
- Guarantee removal of correlated features (PCs are orthogonal)
- Reduce overfitting for supervised learning
- Improve visualization for high-dimensional data
- Reduce noise by discarding low-variance components
Cons:
- Scale variant (results depend on feature scaling)
- Low interpretability of new features (i.e., PCs)
- Potential info loss if the PCs & #PCs ($k$) are not selected carefully
- Situational (e.g. cannot be applied on NLP because 1) covariance matrix is useless 2) it breaks sparse structure of words)
- (weak) If any assumption fails, PCA fails (solvable by Kernel PCA)
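Code: a NumPy sketch of the original-version PCA above (eigendecomposition of the covariance matrix), also verifying the variance + distortion identity; the synthetic correlated dataset is an illustrative choice.

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the covariance matrix."""
    X = X - X.mean(axis=0)                    # center on the sample space
    m = X.shape[0]
    Sigma = X.T @ X / m                       # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # sort PCs by strength
    eigvals, V = eigvals[order], eigvecs[:, order]
    V_k = V[:, :k]                            # top-k PCs
    Z = X @ V_k                               # projected points z_i = V_k^T x_i
    X_hat = Z @ V_k.T                         # reconstruction
    return Z, X_hat, eigvals, V_k

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated features
m, k = X.shape[0], 2
Z, X_hat, eigvals, V_k = pca(X, k)

Xc = X - X.mean(axis=0)
variance_k = m * eigvals[:k].sum()
distortion_k = np.sum((Xc - X_hat) ** 2)
# Variance_k + Distortion_k = m * trace(Sigma)
assert np.isclose(variance_k + distortion_k, m * np.trace(Xc.T @ Xc / m))

# Projected coordinates are uncorrelated: covariance of Z is diagonal
C = Z.T @ Z / m
assert np.allclose(C - np.diag(np.diag(C)), 0, atol=1e-8)
```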
Autoencoder
Idea: Use unsupervised NNs to learn latent representations via reconstruction. (Self-supervised learning: the input serves as its own target.)
- The goal of AE is NOT to reconstruct the input as accurately as possible but to LEARN major features from it. Reconstruction is only an objective for the learning process.
Model: NN
- Denoising AE: add noise to the input and try to output the original image (to avoid perfect fitting)
Objective: minimize reconstruction error
Optimization: see DL
Pros:
- = Nonlinear PCA (PCA is restricted to a linear manifold)
- Offer embeddings that can be used in supervised learning
- Better performance than PCA in general
Cons:
- High computational cost
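Code: a minimal NumPy autoencoder with one tanh hidden layer trained by plain gradient descent on the reconstruction MSE. Layer sizes, activation, and learning rate are illustrative assumptions, not from these notes.

```python
import numpy as np

def train_autoencoder(X, hidden_dim=2, lr=0.1, n_epochs=500, seed=0):
    """One-hidden-layer autoencoder minimizing reconstruction MSE."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W1 = rng.normal(0, 0.1, (n, hidden_dim))  # encoder weights
    b1 = np.zeros(hidden_dim)
    W2 = rng.normal(0, 0.1, (hidden_dim, n))  # decoder weights
    b2 = np.zeros(n)
    losses = []
    for _ in range(n_epochs):
        # Forward: encode to a latent code, decode to a reconstruction
        H = np.tanh(X @ W1 + b1)              # latent representation
        X_hat = H @ W2 + b2                   # linear decoder
        diff = X_hat - X
        losses.append(np.mean(diff ** 2))     # reconstruction error (MSE)
        # Backward: gradients of the MSE w.r.t. all parameters
        dX_hat = 2 * diff / diff.size
        dW2 = H.T @ dX_hat
        db2 = dX_hat.sum(axis=0)
        dH = dX_hat @ W2.T
        dZ = dH * (1 - H ** 2)                # tanh derivative
        dW1 = X.T @ dZ
        db1 = dZ.sum(axis=0)
        for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
            p -= lr * g                       # gradient-descent update
    return W1, b1, W2, b2, losses
```

With a linear activation this collapses to the PCA subspace; the nonlinearity is what allows the AE to learn a curved manifold.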
Variational Autoencoder
tbd