Unsupervised Learning
Clustering
Clustering groups similar samples together with no prior knowledge.
K-Means
Idea: Hard clustering (each sample is assigned to exactly one cluster)
Model/Algorithm:
- Init centroids (e.g., uniformly at random within the data range).
- Find cluster for each data point: $$ c_i=\arg\min_k||\mathbf{x}_i-\boldsymbol{\mu}_k||^2\quad(r_{ik}=\textbf{1}\{c_i=k\}) $$
- Update centroids as the mean of all data points in the current cluster: $$ \boldsymbol{\mu}_k=\frac{\sum_{i=1}^{m}r_{ik}\mathbf{x}_i}{\sum_{i=1}^{m}r_{ik}} $$
- Repeat Steps 2-3 until convergence (i.e., cluster assignments remain unchanged).
Objective: reconstruction error: $$ J=\sum_{i=1}^{m}\sum_{k=1}^{K}r_{ik}||\textbf{x}_i-\boldsymbol{\mu}_k||^2 $$
Optimization:
- Objective minimization: Greedy (because this objective is NP-hard to optimize)
- Hyperparameter tuning: choose the elbow point in the “reconstruction error vs. #clusters” graph.
Pros:
- Simple & interpretable
- Guarantee convergence in a finite number of iterations
- Flexible re-training
- Generalize to any type/shape/size of clusters
- Suitable for large datasets
- Time complexity: $O(t\cdot K\cdot m\cdot n)$ for $t$ iterations, $K$ clusters, $m$ samples, $n$ features
Cons:
- Scale variant
- Numerical features only
- Manual hyperparameter choice: #clusters $K$
- Inconsistent: Sensitive to centroid initialization
- Sensitive to outliers and noisy data by including them
- Bad performance on high-dimensional data (distance metric works poorly)
- Hard clustering (assumes each sample belongs 100% to its designated cluster)
Code:
import numpy as np

def distance(v1, v2, metric_type='L2'):
    if metric_type == "L0":
        return np.count_nonzero(v1 - v2)
    if metric_type == "L1":
        return np.sum(np.abs(v1 - v2))
    if metric_type == "L2":
        return np.sqrt(np.sum(np.square(v1 - v2)))
    if metric_type == "Linf":
        return np.max(np.abs(v1 - v2))

class KMeans:
    def __init__(self, n_clusters=8, max_iter=100):
        self.n_clusters = n_clusters
        self.max_iter = max_iter

    def fit(self, X_train):
        # 1) Init centroids uniformly within the bounding box of the data
        x_min, x_max = np.min(X_train, axis=0), np.max(X_train, axis=0)
        self.centroids = [np.random.uniform(x_min, x_max) for _ in range(self.n_clusters)]
        # 4) Repeat Steps 2-3 until centroids stop moving or max_iter is reached
        for _ in range(self.max_iter):
            # 2) Assign each data point to its nearest centroid
            clustered_points = [[] for _ in range(self.n_clusters)]
            for x in X_train:
                distances = [distance(x, centroid) for centroid in self.centroids]
                clustered_points[np.argmin(distances)].append(x)
            # 3) Update each centroid as the mean of its cluster
            #    (an empty cluster keeps its previous centroid)
            new_centroids = [np.mean(cluster, axis=0) if cluster else self.centroids[k]
                             for k, cluster in enumerate(clustered_points)]
            if np.allclose(new_centroids, self.centroids):
                break
            self.centroids = new_centroids
Gaussian Mixture Model
Idea: clustering, but probabilistic (soft clustering: each sample belongs to every cluster with some probability).
Background: Shapes of clusters are determined by the properties of their covariance matrices ($n$ = #features, $K$ = #clusters).
- Shared Spherical: #params=1, $\Sigma_k=\sigma^2I$ same for all clusters.
  - K-means = GMM with shared spherical covariances in the limit $\sigma^2\to0$ (responsibilities become hard assignments).
- Spherical: #params=$K$, one $\Sigma_k=\sigma_k^2I$ per cluster.
- Shared Diagonal: #params=$n$, same diagonal $\Sigma$ for all clusters.
- Diagonal: #params=$Kn$, one diagonal $\Sigma_k$ per cluster.
- Shared Full Covariance: #params=$\frac{n(n+1)}{2}$, same full $\Sigma$ for all clusters.
- Full Covariance: #params=$K\frac{n(n+1)}{2}$, one full $\Sigma_k$ per cluster.
Assumptions:
- There exists a latent variable $z_i\in\{1,\dots,K\}$ representing the index of the Gaussian distribution in the mixture.
Model:
$$ P(\textbf{x})=\sum_{k=1}^{K}\pi_k\mathcal{N}(\textbf{x}|\boldsymbol{\mu}_k,\Sigma_k) $$, where $\pi_k=P(z=k)$ are the mixture weights with $\sum_{k=1}^{K}\pi_k=1$.
#params: $(K-1)$ weights $+$ $Kn$ means $+$ covariance params (see Background).
Optimization: EM
- Init mixture weights $\pi_k$ and the parameters $\boldsymbol{\mu}_k,\Sigma_k$ of each Gaussian.
- E-step: estimate responsibilities $r_{ik}$: $$ r_{ik}=P(z_i=k|\textbf{x}_i)=\frac{P(\textbf{x}_i|z_i=k)P(z_i=k)}{\sum_{j=1}^{K}P(\textbf{x}_i|z_i=j)P(z_i=j)} $$
- M-step: estimate params via MLE given $r_{ik}$: $$ \pi_k=\frac{1}{m}\sum_{i=1}^{m}r_{ik},\quad\boldsymbol{\mu}_k=\frac{\sum_{i=1}^{m}r_{ik}\textbf{x}_i}{\sum_{i=1}^{m}r_{ik}},\quad\Sigma_k=\frac{\sum_{i=1}^{m}r_{ik}(\textbf{x}_i-\boldsymbol{\mu}_k)(\textbf{x}_i-\boldsymbol{\mu}_k)^T}{\sum_{i=1}^{m}r_{ik}} $$
- Repeat Step 2-3 until convergence.
Pros:
- More robust to outliers and noisy data
- Flexible to a great variety of shapes of data points
- Soft clustering
- Weighted distance instead of absolute distance in k-means
Cons:
- High computational cost
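Code: a minimal NumPy sketch of the EM loop above, using full covariances. The deterministic farthest-point initialization of the means and the small ridge added to each covariance for numerical stability are illustrative choices, not part of these notes.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density evaluated at each row of X."""
    n = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm

def gmm_em(X, K, n_iter=50):
    m, n = X.shape
    # 1) Init: uniform weights, farthest-point means, identity covariances
    pi = np.full(K, 1.0 / K)
    mu = [X[0]]
    for _ in range(1, K):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in mu], axis=0)
        mu.append(X[np.argmax(d)])
    mu = np.array(mu)
    Sigma = np.array([np.eye(n) for _ in range(K)])
    for _ in range(n_iter):
        # 2) E-step: responsibilities r_ik = P(z_i = k | x_i)
        r = np.column_stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k])
                             for k in range(K)])
        r /= r.sum(axis=1, keepdims=True)
        # 3) M-step: MLE of pi, mu, Sigma weighted by responsibilities
        Nk = r.sum(axis=0)
        pi = Nk / m
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            # Small ridge keeps Sigma_k invertible (numerical-stability choice)
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(n)
    return pi, mu, Sigma, r
```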
Dimensionality Reduction
Dimensionality reduction reduces dimensionality (i.e., #features) of input data.
We need dimensionality reduction because:
- Curse of Dimensionality: high-dimensional data has high sparsity.
- Computational Efficiency: high-dimensional data requires more computational resources (time & space).
- Overfitting: high-dimensional data are prone to overfitting.
- Visualization: we cannot visualize any data beyond 3D.
- Performance: high-dimensional data tend to contain more noise, which can be reduced by selecting the most important features.
- Interpretability: only the most relevant features matter.
Singular Value Decomposition
Model: $$ X=UDV^T=\sum_{i=1}^{\text{rank}(X)}D_{ii}\textbf{u}_i\textbf{v}_i^T $$
- Input:
  - $X\in\mathbb{R}^{m\times n}$: arbitrary input matrix
- Output (Matrix ver.):
  - $U\in\mathbb{R}^{m\times r}$: left singular vectors (i.e., final rotation)
  - $D\in\mathbb{R}^{r\times r}$: singular values (i.e., scaling)
  - $V\in\mathbb{R}^{n\times r}$: right singular vectors (i.e., initial rotation)
  - $r=\text{rank}(X)$: rank
- Output (Vector ver.):
  - $\textbf{u}_i,\textbf{v}_i$: $i$th column unit vectors of $U,V$
  - $\textbf{u}_i\textbf{v}_i^T$: $i$th outer product matrix
- Note:
  - rotation matrices are orthonormal: $U^TU=V^TV=I$
  - scaling matrix is diagonal: $D=\text{diag}(D_{11},\dots,D_{rr})$ with $D_{11}\ge\cdots\ge D_{rr}>0$
Idea: An arbitrary matrix = “initial rotation → scaling → final rotation” applied to the unit ball.
- Think of an arbitrary $2\times2$ matrix acting on a 2D plane. We can reconstruct this matrix's action on a unit disk (i.e., the 2 unit vectors along the coordinates) via the following steps:
  - Rotate the unit vectors by $V^T$.
  - Scale the rotated unit disk into an ellipse by $D$.
  - Rotate the ellipse by $U$.
Properties:
- $X^TX=VD^2V^T$ (i.e., right singular vectors = eigenvectors of covariance matrix)
- $XX^T=UD^2U^T$ (i.e., left singular vectors = eigenvectors of outer product matrix)
- Pseudo-inverse: $X^+=VD^{-1}U^T$
- If $X$ is symmetric positive semi-definite, SVD = eigendecomposition (i.e., singular value = eigenvalue).
  - Positive Semi-Definite: a symmetric matrix $X$ such that $\textbf{v}^TX\textbf{v}\ge0$ for all $\textbf{v}$.
Applications:
- Simplify OLS in regression: when calculating the “inverse” of a rectangular matrix, $\hat{\boldsymbol{\beta}}=X^+\textbf{y}=VD^{-1}U^T\textbf{y}$.
- Low-rank matrix approximation: define a lower rank $k<\text{rank}(X)$ and keep only the top-$k$ singular values: $X_k=\sum_{i=1}^{k}D_{ii}\textbf{u}_i\textbf{v}_i^T$.
- Eigenword: project high-dimensional context to low-dimensional space, assuming distributional similarity.
- Distributional similarity: words with similar contexts have similar meanings.
- Distance-based similarity measure: similar words are close in this low-dimensional space.
- Eigenwords: left singular vectors (i.e., word embeddings).
- Eigentokens: right singular vectors of the context matrix (i.e., contextual embeddings).
- Word sense disambiguation: estimate contextual embedding for a word with right singular vectors.
- PCA: see next.
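Code: a NumPy check of the low-rank-approximation and pseudo-inverse properties above; the matrix sizes are arbitrary.

```python
import numpy as np

# Rank-2 approximation of a random matrix via truncated SVD.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U @ diag(s) @ Vt

k = 2  # keep only the top-k singular values
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius error of the rank-k approximation equals the
# root-sum-square of the discarded singular values.
err = np.linalg.norm(X - X_k, ord='fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))

# Pseudo-inverse from SVD: X^+ = V D^{-1} U^T
X_pinv = Vt.T @ np.diag(1 / s) @ U.T
assert np.allclose(X_pinv, np.linalg.pinv(X))
```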
Principal Component Analysis
Model (Original ver.):
- Calculate the covariance matrix of $X$: $\Sigma=\frac{1}{m}X^TX$ (with $X$ centered).
- Diagonalize the covariance matrix via the Spectral Theorem: $\Sigma=V\Lambda V^T$.
  - $\Lambda=\text{diag}(\lambda_1,\dots,\lambda_n)$: eigenvalues (i.e., PC strength / sample variance of projection)
  - $V$: orthonormal eigenvectors (i.e., PCs)
- Sort eigenvectors in $V$ by eigenvalue in descending order.
- Select & normalize the strongest $k$ eigenvectors, where $k$ is a hyperparameter.
  - Now $V_k=[\textbf{v}_1,\dots,\textbf{v}_k]\in\mathbb{R}^{n\times k}$.
- Project $X$ onto the selected eigenvectors.
  - Each projected point is: $\textbf{z}_i=V_k^T\textbf{x}_i$.
Model (SVD ver.):
- Compute SVD: $X=UDV^T$.
- Select the $k$ columns of $V$ (the right singular matrix) with the largest singular values as PCs.
- Project the original dataset onto these $k$ eigenvectors.
Idea: We can project the input data onto an orthonormal basis
of smaller dimensions while covering maximal variance among features.
- Orthonormal basis: $\textbf{v}_i^T\textbf{v}_j=\textbf{1}\{i=j\}$.
- Ideally, we lose minimal info (represented by minimal variance) while successfully reducing the dimensionality.
Assumptions:
- Input matrix $X$ is centered/standardized on the sample space (optional but strongly recommended, unless data is sparse).
- PCs = linear combinations of original features. (if not, then Kernel PCA)
- Variance = a measure of feature importance.
Optimization: minimize distortion (i.e., maximize variance in new coordinates)
- Distortion (reconstruction error): $$ \text{Distortion}_k=\sum_{i=1}^{m}||\textbf{x}_i-\hat{\textbf{x}}_i||^2=m\sum_{j=k+1}^{n}\lambda_j $$
- Variance (of projected points): $$ \text{Variance}_k=m\sum_{j=1}^{k}\textbf{v}_j^T\Sigma\textbf{v}_j=m\sum_{j=1}^{k}\lambda_j $$
- Minimizing distortion = Maximizing variance: $$ \text{Variance}_k+\text{Distortion}_k=m\sum_{j=1}^{n}\lambda_j=m\cdot\text{trace}(\Sigma) $$
Inference/Reconstruction: Any sample vector can be approximated in terms of coefficients (scores) on eigenvectors (loadings).
- Predict the original point via inverse mapping: $\hat{\textbf{x}}_i=V_k\textbf{z}_i=V_kV_k^T\textbf{x}_i$.
- The projected new points $\textbf{z}_i$ are uncorrelated across PCs (their covariance $V_k^T\Sigma V_k=\Lambda_k$ is diagonal), but their variances equal $\lambda_j$, not 1, even if you standardize the data first.
- The original point can be fully reconstructed if $k=n$: $\hat{\textbf{x}}_i=V_nV_n^T\textbf{x}_i=\textbf{x}_i$ (since $V_n$ is orthonormal, $V_nV_n^T=I$).
Pros:
- Guarantee removal of correlated features (PCs are orthogonal)
- Reduce overfitting for supervised learning
- Improve visualization for high-dimensional data
- Reduce noise by discarding low-variance components
Cons:
- Scale variant (results depend on feature scaling)
- Low interpretability of new features (i.e., PCs)
- Potential info loss if the PCs & #PCs ($k$) are not selected carefully
- Situational (e.g. cannot be applied on NLP because 1) covariance matrix is useless 2) it breaks sparse structure of words)
- (weak) If any assumption fails, PCA fails (solvable by Kernel PCA)
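Code: a NumPy sketch of the original-version PCA above (eigendecomposition of the covariance matrix), also verifying the variance + distortion identity; the synthetic correlated dataset is an illustrative choice.

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the covariance matrix."""
    X = X - X.mean(axis=0)                    # center on the sample space
    m = X.shape[0]
    Sigma = X.T @ X / m                       # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # sort PCs by strength
    eigvals, V = eigvals[order], eigvecs[:, order]
    V_k = V[:, :k]                            # top-k PCs
    Z = X @ V_k                               # projected points z_i = V_k^T x_i
    X_hat = Z @ V_k.T                         # reconstruction
    return Z, X_hat, eigvals, V_k

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated features
m, k = X.shape[0], 2
Z, X_hat, eigvals, V_k = pca(X, k)

Xc = X - X.mean(axis=0)
variance_k = m * eigvals[:k].sum()
distortion_k = np.sum((Xc - X_hat) ** 2)
# Variance_k + Distortion_k = m * trace(Sigma)
assert np.isclose(variance_k + distortion_k, m * np.trace(Xc.T @ Xc / m))

# Projected coordinates are uncorrelated: covariance of Z is diagonal
C = Z.T @ Z / m
assert np.allclose(C - np.diag(np.diag(C)), 0, atol=1e-8)
```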
Autoencoder
Idea: Use unsupervised NNs to learn latent representations via reconstruction. (Self-supervised learning: the input serves as its own target.)
- The goal of AE is NOT to reconstruct the input as accurately as possible but to LEARN major features from it. Reconstruction is only an objective for the learning process.
Model: NN
- Denoising AE: add noise to the input and try to output the original image (to avoid perfect fitting)
Objective: minimize reconstruction error
Optimization: see DL
Pros:
- = Nonlinear PCA (PCA is restricted to a linear manifold)
- Offer embeddings that can be used in supervised learning
- Better performance than PCA in general
Cons:
- High computational cost
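Code: a minimal NumPy autoencoder with one tanh hidden layer trained by plain gradient descent on the reconstruction MSE. Layer sizes, activation, and learning rate are illustrative assumptions, not from these notes.

```python
import numpy as np

def train_autoencoder(X, hidden_dim=2, lr=0.1, n_epochs=500, seed=0):
    """One-hidden-layer autoencoder minimizing reconstruction MSE."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W1 = rng.normal(0, 0.1, (n, hidden_dim))  # encoder weights
    b1 = np.zeros(hidden_dim)
    W2 = rng.normal(0, 0.1, (hidden_dim, n))  # decoder weights
    b2 = np.zeros(n)
    losses = []
    for _ in range(n_epochs):
        # Forward: encode to a latent code, decode to a reconstruction
        H = np.tanh(X @ W1 + b1)              # latent representation
        X_hat = H @ W2 + b2                   # linear decoder
        diff = X_hat - X
        losses.append(np.mean(diff ** 2))     # reconstruction error (MSE)
        # Backward: gradients of the MSE w.r.t. all parameters
        dX_hat = 2 * diff / diff.size
        dW2 = H.T @ dX_hat
        db2 = dX_hat.sum(axis=0)
        dH = dX_hat @ W2.T
        dZ = dH * (1 - H ** 2)                # tanh derivative
        dW1 = X.T @ dZ
        db1 = dZ.sum(axis=0)
        for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
            p -= lr * g                       # gradient-descent update
    return W1, b1, W2, b2, losses
```

With a linear activation this collapses to the PCA subspace; the nonlinearity is what allows the AE to learn a curved manifold.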
Variational Autoencoder
tbd