Unsupervised Learning

Clustering

Clustering groups similar samples together with no prior knowledge.

K-Means

Idea: Hard clustering (each sample is assigned entirely to exactly one cluster)

Model/Algorithm:

  1. Init centroids $\boldsymbol{\mu}_1,\dots,\boldsymbol{\mu}_K$.
  2. Find cluster for each data point: $$ c_i=\arg\min_k||\mathbf{x}_i-\boldsymbol{\mu}_k||^2\quad(r_{ik}=\textbf{1}\{c_i=k\}) $$
  3. Update centroids as the mean of all data points in the current cluster: $$ \boldsymbol{\mu}_k=\frac{\sum_{i=1}^{m}r_{ik}\mathbf{x}_i}{\sum_{i=1}^{m}r_{ik}} $$
  4. Repeat Step 2-3 until convergence (i.e., centroids $\boldsymbol{\mu}_k$ remain unchanged).

Objective: reconstruction error: $$ J=\sum_{i=1}^{m}\sum_{k=1}^{K}r_{ik}||\mathbf{x}_i-\boldsymbol{\mu}_k||^2 $$

Optimization:

  • Objective minimization: Greedy (because this objective is NP-hard to optimize)
  • Hyperparameter tuning: choose the elbow point in the “reconstruction error vs. number of clusters $K$” graph.

Pros:

  • Simple. Interpretable
  • Guaranteed to converge in a finite number of iterations
  • Flexible re-training
  • Generalize to any type/shape/size of clusters
  • Suitable for large datasets
  • Time complexity: $O(tKmn)$ for $t$ iterations, $K$ clusters, $m$ samples, $n$ features

Cons:

  • Scale variant
  • Numerical features only
  • Manual hyperparameter choice: $K$ (number of clusters)
  • Inconsistent: Sensitive to centroid initialization
  • Sensitive to outliers and noisy data, since every point is averaged into its centroid
  • Bad performance on high-dimensional data (distance metric works poorly)
  • Hard clustering (assumes each point belongs 100% to its designated cluster)

Code:

import numpy as np

def distance(v1, v2, metric_type='L2'):
    """Distance between two vectors under a choice of common metrics."""
    if metric_type == "L0":
        return np.count_nonzero(v1 - v2)
    if metric_type == "L1":
        return np.sum(np.abs(v1 - v2))
    if metric_type == "L2":
        return np.sqrt(np.sum(np.square(v1 - v2)))
    if metric_type == "Linf":
        return np.max(np.abs(v1 - v2))
    raise ValueError(f"unknown metric: {metric_type}")

class KMeans:
    def __init__(self, n_clusters=8, max_iter=100):
        self.n_clusters = n_clusters
        self.max_iter = max_iter

    def fit(self, X_train):
        # 1) Init centroids uniformly within the bounding box of the data
        x_min, x_max = np.min(X_train, axis=0), np.max(X_train, axis=0)
        self.centroids = [np.random.uniform(x_min, x_max) for _ in range(self.n_clusters)]

        # 4) Repeat Step 2-3 until centroids stop moving or max_iter is hit
        for _ in range(self.max_iter):
            # 2) Find the nearest centroid for each data point
            clustered_points = [[] for _ in range(self.n_clusters)]
            for x in X_train:
                distances = [distance(x, centroid) for centroid in self.centroids]
                clustered_points[np.argmin(distances)].append(x)

            # 3) Update centroids (keep the old centroid if its cluster is empty)
            new_centroids = [np.mean(cluster, axis=0) if cluster else old
                             for cluster, old in zip(clustered_points, self.centroids)]
            if all(np.allclose(new, old) for new, old in zip(new_centroids, self.centroids)):
                break  # converged: assignments can no longer change
            self.centroids = new_centroids
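The elbow heuristic from the Optimization section can be sketched with a vectorized K-means run. This is an illustrative sketch: the k-means++-style initialization is an addition (a common remedy for the initialization sensitivity listed under Cons), and the three-blob toy data is an assumption.

```python
import numpy as np

def kmeans_pp_init(X, K, rng):
    """k-means++-style init: bias new seeds toward points far from existing seeds."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

def kmeans_error(X, K, n_iter=50, seed=0):
    """Run K-means and return the final reconstruction error J."""
    rng = np.random.default_rng(seed)
    centroids = kmeans_pp_init(X, K, rng)
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (squared L2)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: recompute each centroid as its cluster mean (keep old if empty)
        centroids = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                              else centroids[k] for k in range(K)])
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

# Toy data: three well-separated 2-D blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
errors = {K: min(kmeans_error(X, K, seed=s) for s in range(3)) for K in range(1, 7)}
# The error drops sharply up to K=3 (the true blob count), then flattens: the elbow.
```

Taking the best of a few random restarts, as above, is the usual way to soften the init sensitivity before reading off the elbow.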

Gaussian Mixture Model

Idea: clustering, but soft — each sample belongs to cluster $k$ with probability $r_{ik}=P(z_i=k|\mathbf{x}_i)\in[0,1]$.

Background: Shapes of clusters are determined by the properties of their covariance matrices.

  • Shared Spherical: #params=1, same $\sigma^2I$ for all clusters.
    • K-means = GMM with shared spherical covariance and $\sigma\to0$ (hard assignments)
  • Spherical: #params=$K$, one $\sigma_k^2I$ per cluster.
  • Shared Diagonal: #params=$n$, same diagonal $\Sigma$ for all clusters.
  • Diagonal: #params=$Kn$, one diagonal $\Sigma_k$ per cluster.
  • Shared Full Covariance: #params=$\frac{n(n+1)}{2}$, same full $\Sigma$ for all clusters.
  • Full Covariance: #params=$\frac{Kn(n+1)}{2}$, one full $\Sigma_k$ per cluster.

Assumptions:

  • There exists a latent variable $z_i\in\{1,\dots,K\}$ representing the index of the Gaussian distribution in the mixture.

Model: $P(\mathbf{x}_i)=\sum_{k=1}^{K}P(z_i=k)P(\mathbf{x}_i|z_i=k)$, where

  • Prior: $P(z_i=k)=\pi_k$, with $\sum_{k=1}^{K}\pi_k=1$
  • Likelihood: $P(\mathbf{x}_i|z_i=k)=\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_k,\Sigma_k)$

#params: $(K-1)$ mixture weights $+$ $Kn$ means $+$ covariance params (e.g., $K\frac{n(n+1)}{2}$ for full covariance)

Optimization: EM

  1. Init the prior $\pi_k$ and the Gaussian params $(\boldsymbol{\mu}_k,\Sigma_k)$ for each cluster.
  2. E-step: estimate responsibilities $r_{ik}$: $$ r_{ik}=P(z_i=k|\textbf{x}_i)=\frac{P(\textbf{x}_i|z_i=k)P(z_i=k)}{\sum_{k'=1}^{K}P(\textbf{x}_i|z_i=k')P(z_i=k')} $$
  3. M-step: estimate params via MLE given $r_{ik}$: $$ \pi_k=\frac{\sum_{i=1}^{m}r_{ik}}{m},\quad \boldsymbol{\mu}_k=\frac{\sum_{i=1}^{m}r_{ik}\textbf{x}_i}{\sum_{i=1}^{m}r_{ik}},\quad \Sigma_k=\frac{\sum_{i=1}^{m}r_{ik}(\textbf{x}_i-\boldsymbol{\mu}_k)(\textbf{x}_i-\boldsymbol{\mu}_k)^T}{\sum_{i=1}^{m}r_{ik}} $$
  4. Repeat Step 2-3 until convergence.
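The EM loop above can be sketched in numpy for the full-covariance case. This is a minimal sketch: the farthest-point initialization of the means and the small ridge added to each covariance are pragmatic assumptions, not part of the algorithm proper.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density N(x; mu, Sigma) evaluated at each row of X."""
    n = X.shape[1]
    diff = X - mu
    maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * maha) / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

def gmm_em(X, K, n_iter=50, seed=0):
    m, n = X.shape
    rng = np.random.default_rng(seed)
    # Step 1: uniform priors, identity covariances, means via a farthest-point
    # heuristic (an assumption here, to avoid seeding two means in one cluster)
    pi = np.full(K, 1.0 / K)
    idx = [rng.integers(m)]
    for _ in range(K - 1):
        d2 = np.min([((X - X[i]) ** 2).sum(axis=1) for i in idx], axis=0)
        idx.append(int(np.argmax(d2)))
    mu = X[idx].copy()
    Sigma = np.array([np.eye(n) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities r_ik = P(z_i=k | x_i) via Bayes' rule
        lik = np.stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k])
                        for k in range(K)], axis=1)          # shape (m, K)
        r = lik / lik.sum(axis=1, keepdims=True)
        # M-step: weighted MLE of pi_k, mu_k, Sigma_k given r_ik
        Nk = r.sum(axis=0)
        pi = Nk / m
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
            Sigma[k] += 1e-6 * np.eye(n)  # ridge for numerical stability
    return pi, mu, Sigma, r

# Two well-separated blobs: EM should recover both means and ~equal weights
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.5, size=(100, 2)),
               rng.normal([4, 4], 0.5, size=(100, 2))])
pi, mu, Sigma, r = gmm_em(X, K=2)
```

Note how the M-step's mean update reduces to the K-means centroid update when the responsibilities $r_{ik}$ are hard 0/1 indicators.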

Pros:

  • More robust to outliers and noisy data
  • Flexible to a great variety of shapes of data points
  • Soft clustering
  • Covariance-weighted distance instead of the plain Euclidean distance in k-means

Cons:

  • High computational cost

Dimensionality Reduction

Dimensionality reduction reduces dimensionality (i.e., #features) of input data.

We need dimensionality reduction because:

  • Curse of Dimensionality: high-dimensional data has high sparsity.
  • Computational Efficiency: high-dimensional data requires more computational resources (time & space).
  • Overfitting: high-dimensional data are prone to overfitting.
  • Visualization: we cannot visualize any data beyond 3D.
  • Performance: high-dimensional data tends to contain more noise, which can be reduced by selecting the most important features.
  • Interpretability: only the most relevant features matter.

 

Singular Value Decomposition

Model: $$ X=UDV^T=\sum_{i=1}^{\text{rank}(X)}D_{ii}\textbf{u}_i\textbf{v}_i^T $$

  • Input:
    • $X\in\mathbb{R}^{m\times n}$ : arbitrary input matrix
  • Output (Matrix ver.):
    • $U\in\mathbb{R}^{m\times r}$ : left singular vectors (i.e., final rotation)
    • $D\in\mathbb{R}^{r\times r}$ : singular values (i.e., scaling)
    • $V\in\mathbb{R}^{n\times r}$ : right singular vectors (i.e., initial rotation)
    • $r=\text{rank}(X)$ : rank
  • Output (Vector ver.):
    • $\textbf{u}_i,\textbf{v}_i$ : $i$th column unit vectors of $U,V$
    • $\textbf{u}_i\textbf{v}_i^T$ : $i$th outer product matrix
  • Note:
    • rotation matrices are orthonormal: $U^TU=V^TV=I$
    • scaling matrix is diagonal: $D=\text{diag}(D_{11},\dots,D_{rr})$ with $D_{11}\ge\dots\ge D_{rr}>0$

Idea: An arbitrary matrix = “unit disk $\to$ initial rotation $\to$ scaling $\to$ final rotation”

  • Think of an arbitrary $2\times2$ matrix acting on a 2D plane. We can reconstruct this matrix with a unit disk (i.e., the 2 unit vectors along the coordinates) via the following steps:
    1. Rotate the unit vectors by $V^T$.
    2. Scale the rotated unit disk into an ellipse by $D$.
    3. Rotate the ellipse by $U$.

Properties:

  • $X^TX=VD^2V^T$ (i.e., right singular vectors = eigenvectors of the covariance matrix)
  • $XX^T=UD^2U^T$ (i.e., left singular vectors = eigenvectors of the outer product matrix)
  • Pseudo-inverse: $X^+=VD^{-1}U^T$
  • If $X$ is symmetric positive semi-definite, then $U=V$ and $X=VDV^T$ (i.e., singular value = eigenvalue).
    • Positive Semi-Definite: a symmetric matrix $X$ with $\textbf{v}^TX\textbf{v}\ge0$ for all $\textbf{v}$.

Applications:

  • Simplify OLS in regression: when calculating the “inverse” of a rectangular matrix, $(X^TX)^{-1}X^T=VD^{-1}U^T=X^+$.
  • Low-rank matrix approximation: pick a lower rank $r'<\text{rank}(X)$ and keep only the $r'$ largest singular values: $X\approx\sum_{i=1}^{r'}D_{ii}\textbf{u}_i\textbf{v}_i^T$.
  • Eigenword: project high-dimensional context to low-dimensional space, assuming distributional similarity.
    • Distributional similarity: words with similar contexts have similar meanings.
    • Distance-based similarity measure: similar words are close in this low-dimensional space.
    • Eigenwords: left singular vectors (i.e., word embeddings).
    • Eigentokens: right singular vectors applied to the context (i.e., contextual embeddings).
    • Word sense disambiguation: estimate contextual embedding for a word with right singular vectors.
  • PCA: see next.
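The properties and the low-rank application above can be checked numerically with numpy (the random matrix size is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # arbitrary rectangular matrix, full column rank

# X = U D V^T with orthonormal columns in U, V; singular values d are descending
U, d, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(U @ np.diag(d) @ Vt, X)   # exact reconstruction
assert np.allclose(U.T @ U, np.eye(4))       # left singular vectors orthonormal
assert np.allclose(Vt @ Vt.T, np.eye(4))     # right singular vectors orthonormal

# Right singular vectors are eigenvectors of X^T X, with eigenvalues d^2
assert np.allclose(np.linalg.eigvalsh(X.T @ X)[::-1], d ** 2)

# Pseudo-inverse via SVD: X^+ = V D^{-1} U^T
X_pinv = Vt.T @ np.diag(1 / d) @ U.T
assert np.allclose(X_pinv, np.linalg.pinv(X))

# Low-rank approximation: keep only the 2 largest singular values
X2 = U[:, :2] @ np.diag(d[:2]) @ Vt[:2, :]
assert np.linalg.matrix_rank(X2) == 2
```

The residual of the rank-2 approximation equals the discarded singular values, $||X-X_2||_F=\sqrt{\sum_{i>2}D_{ii}^2}$, which is why truncating the smallest singular values is the best low-rank choice in Frobenius norm.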

 

Principal Component Analysis

Model (Original ver.):

  1. Calculate the covariance matrix of the centered data: $\Sigma=\frac{1}{m}X^TX$.
  2. Diagonalize the covariance matrix via Spectral Theorem: $\Sigma=V\Lambda V^T$.
    • $\Lambda=\text{diag}(\lambda_1,\dots,\lambda_n)$ : eigenvalues (i.e., PC strength / sample variance of projection)
    • $V$ : orthonormal eigenvectors (i.e., PCs)
  3. Sort the eigenvectors in $V$ by eigenvalue in descending order.
  4. Select & normalize the strongest $k$ eigenvectors, where $k$ is a hyperparameter.
    • Now $V_k=[\textbf{v}_1,\dots,\textbf{v}_k]\in\mathbb{R}^{n\times k}$.
  5. Project $X$ onto the selected PCs: $Z=XV_k$.
    • Each projected point is: $\textbf{z}_i=V_k^T\textbf{x}_i$

Model (SVD ver.):

  1. Compute SVD: $X=UDV^T$.
  2. Select the $k$ columns of $V$ (the right singular matrix) with the largest singular values as PCs.
  3. Project the original dataset onto these eigenvectors: $Z=XV_k$.

Idea: We can project the input data onto an orthonormal basis $\{\textbf{v}_1,\dots,\textbf{v}_k\}$ of smaller dimension while covering maximal variance among features.

  • Orthonormal basis: $\textbf{v}_i^T\textbf{v}_j=\textbf{1}\{i=j\}$.
  • Ideally, we lose minimal info (represented by minimal variance) while successfully reducing the dimensionality.

Assumptions:

  • Input matrix $X$ is centered/standardized on the sample space, unless data is sparse.
    • Centering: $\textbf{x}_i\leftarrow\textbf{x}_i-\bar{\textbf{x}}$
    • Scaling each feature to unit variance (optional but strongly recommended)
  • PCs = linear combinations of original features. (if not, then Kernel PCA)
  • Variance = a measure of feature importance.

Optimization: minimize distortion (i.e., maximize variance in new coordinates)

  • Distortion: $\text{Distortion}_k=\sum_{i=1}^{m}||\textbf{x}_i-\hat{\textbf{x}}_i||^2=m\sum_{j=k+1}^{n}\lambda_j$
  • Variance (of projected points): $$ \text{Variance}_k=m\sum_{j=1}^{k}\textbf{v}_j^T\Sigma\textbf{v}_j=m\sum_{j=1}^{k}\lambda_j $$
  • Minimizing distortion = Maximizing variance: $$ \text{Variance}_k+\text{Distortion}_k=m\sum_{j=1}^{n}\lambda_j=m\cdot\text{trace}(\Sigma) $$

Inference/Reconstruction: Any sample vector can be approximated in terms of coefficients (scores) on eigenvectors (loadings).

  1. Predict the original point via inverse mapping: $\hat{\textbf{x}}_i=V_k\textbf{z}_i=V_kV_k^T\textbf{x}_i$
    • The projected coordinates $\textbf{z}_i$ have diagonal covariance $\Lambda_k$: only the basis vectors are orthonormal, so each score's variance is $\lambda_j$, not 1, even if you standardize the input data.
  2. The original point can be fully reconstructed if $k=\text{rank}(X)$: $\textbf{x}_i=V\textbf{z}_i$
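Both versions of the model, the inverse mapping, and the variance/distortion identity fit in a short numpy sketch (the correlated two-feature toy data is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 2 correlated features, so one PC captures most of the variance
z = rng.normal(size=(200, 1))
X = np.hstack([z, 0.5 * z]) + rng.normal(scale=0.1, size=(200, 2))

Xc = X - X.mean(axis=0)                  # assumption: centered sample space
Sigma = Xc.T @ Xc / len(Xc)              # covariance matrix

# Original ver.: spectral decomposition Sigma = V Lambda V^T
lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]           # sort PCs by eigenvalue, descending

# SVD ver.: right singular vectors of Xc are the same PCs (up to sign)
_, d, Vt = np.linalg.svd(Xc, full_matrices=False)
assert np.allclose(np.abs(Vt.T), np.abs(V))
assert np.allclose(d ** 2 / len(Xc), lam)    # eigenvalue = d^2 / m

# Project onto the strongest k=1 PC, then reconstruct via inverse mapping
k = 1
scores = Xc @ V[:, :k]                   # projected coordinates (scores)
X_hat = scores @ V[:, :k].T + X.mean(axis=0)

# Variance_k + Distortion_k = m * trace(Sigma)
variance = len(Xc) * lam[:k].sum()
distortion = ((X - X_hat) ** 2).sum()
assert np.isclose(variance + distortion, len(Xc) * np.trace(Sigma))
```

Because the two features are nearly collinear, almost all the trace sits in $\lambda_1$ and the single-PC reconstruction loses very little.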

Pros:

  • Guarantee removal of correlated features (PCs are orthogonal)
  • Reduce overfitting for supervised learning
  • Improve visualization for high-dimensional data
  • Robust to outliers & noisy data

Cons:

  • Scale variant (sensitive to feature scaling, hence the standardization assumption)
  • Low interpretability of new features (i.e., PCs)
  • Potential info loss if the number of PCs $k$ and the PCs themselves are not selected carefully
  • Situational (e.g. cannot be applied on NLP because 1) covariance matrix is useless 2) it breaks sparse structure of words)
  • (weak) If any assumption fails, PCA fails (solvable by Kernel PCA)

 

Autoencoder

Idea: Use unsupervised NNs to learn latent representations via reconstruction (self-supervised learning: the input serves as its own target).

  • The goal of AE is NOT to reconstruct the input as accurately as possible but to LEARN major features from it. Reconstruction is only an objective for the learning process.

Model: NN

  • Denoising AE: add noise to the input and try to output the original image (to avoid perfect fitting)

Objective: minimize reconstruction error

Optimization: see DL
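A minimal numpy sketch of the idea: a one-hidden-layer linear autoencoder (no nonlinearity, so it can at best match PCA's subspace) trained by full-batch gradient descent on the reconstruction error. The data, layer sizes, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data near a 2-D linear subspace of R^5, standardized feature-wise
Z = rng.normal(size=(500, 2))
X = Z @ rng.normal(size=(2, 5)) + rng.normal(scale=0.05, size=(500, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)

lr, n_hid = 0.1, 2
W_enc = rng.normal(scale=0.1, size=(5, n_hid))  # encoder: R^5 -> R^2 code
W_dec = rng.normal(scale=0.1, size=(n_hid, 5))  # decoder: R^2 -> R^5

def recon_loss(W_enc, W_dec):
    return ((X @ W_enc @ W_dec - X) ** 2).mean()

loss_start = recon_loss(W_enc, W_dec)
for _ in range(1000):
    H = X @ W_enc                            # latent code (bottleneck)
    G = 2 * (H @ W_dec - X) / X.size         # d(loss)/d(reconstruction)
    grad_dec = H.T @ G                       # backprop into decoder weights
    grad_enc = X.T @ (G @ W_dec.T)           # backprop into encoder weights
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
loss_end = recon_loss(W_enc, W_dec)
# The 2-unit bottleneck learns the data's 2-D structure, so the error falls sharply
```

Swapping in a nonlinear activation (and more layers) turns this into the nonlinear-PCA behavior listed under Pros; the training loop itself is unchanged apart from the extra backprop terms.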

Pros:

  • = Nonlinear PCA (PCA is restricted to a linear manifold)
  • Offer embeddings that can be used in supervised learning
  • Better performance than PCA in general

Cons:

  • High computational cost

Variational Autoencoder

tbd