Unsupervised Learning: K-means Clustering
Table of Contents
from IPython.display import YouTubeVideo
YouTubeVideo('SjPrNKYcpC8', width = "560", height = "315")
In machine learning, algorithms are broadly categorized into two main types: Supervised Learning and Unsupervised Learning. These categories differ in terms of the data used for training, the learning process, and the outcomes they aim to achieve.
Supervised learning involves training a model using labeled data, where each input is paired with the correct output. The model learns to map inputs to their corresponding outputs by minimizing the error between predicted and actual values.
Key Characteristics
Labeled Data: Each training example has both input features and a known target output.
Learning Process: The algorithm iteratively adjusts its internal parameters to minimize prediction errors.
Goal: To generalize from training data and make accurate predictions on unseen data.
Common Algorithms
Regression: Linear Regression, Ridge Regression, Lasso Regression, $\cdots$
Classification: Logistic Regression, Decision Trees, Support Vector Machines (SVM), Neural Networks, $\cdots$
Unsupervised learning involves training a model on unlabeled data. The algorithm must discover patterns, relationships, or structures without explicit guidance on what to predict.
Key Characteristics
Unlabeled Data: The dataset contains only input features without predefined labels.
Learning Process: The model identifies underlying patterns, clusters, or associations in the data.
Goal: To find meaningful insights, group similar data points, or reduce data complexity.
Common Algorithms
Clustering: K-Means, Hierarchical Clustering, DBSCAN, $\cdots$
Dimensionality Reduction: PCA (Principal Component Analysis), t-SNE, UMAP, $\cdots$
Clustering is a fundamental technique in unsupervised machine learning that aims to group similar data points together based on their characteristics. Unlike supervised learning, clustering algorithms do not require labeled data. Instead, they attempt to uncover hidden patterns or natural groupings within the data.
Clustering is the process of dividing a dataset into distinct groups, or clusters, such that:
Data points within the same cluster are highly similar.
Data points in different clusters are significantly dissimilar.
The key goal in clustering is to identify meaningful structures in data without predefined labels.
Clustering is useful when:
You want to explore data to identify patterns or trends.
There are no clear labels for grouping the data.
You need to segment data for better analysis or decision-making.
You aim to reduce data complexity by grouping similar points.
K-Means is a widely used clustering algorithm designed to partition a dataset into $k$ distinct clusters. It is known for its simplicity, efficiency, and effectiveness in grouping data points based on similarity.
Given: a set of unlabeled data points $x^{(1)}, x^{(2)}, \cdots, x^{(m)} \in \mathbb{R}^n$ and the number of clusters $k$.
Goal: The objective is to group the given data into $k$ partitions such that:
Data points in the same cluster are as similar as possible.
Data points in different clusters are as dissimilar as possible.
Core Principle of Clustering
The only information clustering relies on is the similarity between data points.
Clustering algorithms group points by maximizing similarity within clusters while minimizing similarity between clusters.
High within-cluster similarity: Points within the same cluster are closely packed.
Low inter-cluster similarity: Points in different clusters are well-separated.
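To make "similarity" concrete, here is a minimal sketch (the points below are made up for illustration and are not part of the original notes) that measures dissimilarity with the Euclidean distance, the same metric K-Means uses: a small distance means high similarity.

import numpy as np

# three illustrative 2-D points: a and b are close together, c is far away
a = np.array([1.0, 1.0])
b = np.array([1.2, 0.9])
c = np.array([9.0, 9.0])

# Euclidean distance as a dissimilarity measure:
# small distance -> high similarity, large distance -> low similarity
print(np.linalg.norm(a - b))    # small: a and b belong in the same cluster
print(np.linalg.norm(a - c))    # large: a and c belong in different clusters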
The "Chicken-and-Egg" Dilemma
Clustering inherently faces a circular dependency: the cluster centers and the cluster memberships each depend on the other.
Question 1: If we knew the cluster centers ($\mu_j$), how would we assign points to clusters?
Answer 1: For each point $x^{(i)}$, assign it to the cluster with the closest center.
Question 2: If we knew the cluster memberships, how would we determine the cluster centers?
Answer 2: For each cluster, choose the cluster center $\mu_j$ to be the mean of all points in that cluster.
The Circular Dependency
Each step depends on the other:
Knowing the centers helps determine the assignments, and knowing the assignments helps determine the centers.
Since neither is known at the beginning, the system appears to be stuck.
Breaking the Dilemma
The resolution is to simply start by performing any step arbitrarily:
Start by guessing the cluster centers.
Surprisingly, despite starting with potentially poor initial conditions, this iterative process allows the system to gradually refine both the cluster assignments and the cluster centers. With each iteration, the solution improves.
The Power of Iterative Improvement: Lessons from K-Means Clustering
The K-Means clustering algorithm demonstrates a powerful lesson in problem-solving: when faced with uncertainty or a dilemma, start taking action - even if your initial steps are imperfect - and refine as you go.
This philosophy of "start, iterate, improve" is a powerful concept not only in clustering but also in broader problem-solving strategies.
In Decision-Making: When facing uncertainty, take the first step rather than waiting for perfect conditions.
In Learning: Start by attempting, then refine your understanding through experience.
In Personal Growth: Progress often begins with small, imperfect actions that improve over time.
Step 1: Initialization
Randomly select $k$ initial cluster centroids $\mu_1, \mu_2, \cdots, \mu_k \in \mathbb{R}^n$ (for example, by picking $k$ data points at random).
Step 2: Iterative Process
The algorithm alternates between two key steps:
(1) Assignment Step:
For each point $x^{(i)}$, assign it to the nearest centroid:
$ c_i := \arg\min_{j} \left\lVert x^{(i)} - \mu_j \right\rVert^2 $
(2) Update Step:
For each cluster $j$, update the centroid to be the mean of the points assigned to that cluster:
$ \mu_j := \dfrac{\sum_{i=1}^{m} \mathbf{1}\{c_i = j\}\, x^{(i)}}{\sum_{i=1}^{m} \mathbf{1}\{c_i = j\}} $
Where $\mathbf{1}\{c_i = j\}$ equals 1 if point $x^{(i)}$ is currently assigned to cluster $j$ and 0 otherwise.
Step 3: Convergence Check
Repeat the Assignment and Update steps until:
Cluster assignments no longer change, or
The movement of centroids is below a small threshold.
Output: the final cluster centroids $\mu_1, \mu_2, \cdots, \mu_k$ and the cluster assignment $c_i$ of each data point $x^{(i)}$.
Summary: K-means Algorithm
$ \,\text{Randomly initialize } k \,\text{cluster centroids } \mu_1,\mu_2,\cdots,\mu_k \in \mathbb{R}^n$
$ \begin{align*} \text{Repeat}\;&\{ \\ &\text{for $i=1$ to $m$} \\ &\quad \text{$c_i$ := index (from 1 to $k$) of cluster centroid closest to $x^{(i)}$} \\ &\text{for $j=1$ to $k$} \\ &\quad \text{$\mu_j$ := average (mean) of points assigned to cluster $j$} \\ &\} \end{align*} $
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# data generation
G0 = np.random.multivariate_normal([1, 1], np.eye(2), 100)
G1 = np.random.multivariate_normal([3, 5], np.eye(2), 100)
G2 = np.random.multivariate_normal([9, 9], np.eye(2), 100)
X = np.vstack([G0, G1, G2])
X = np.asmatrix(X)
plt.figure(figsize = (6, 4))
plt.plot(X[:,0], X[:,1], 'b.', alpha = 0.3)
plt.axis('equal')
plt.show()
# The number of clusters and data
k = 3
m = X.shape[0]
# randomly initialize mean points (choose k distinct data points as initial centroids)
mu = X[np.random.choice(m, k, replace = False), :]
pre_mu = mu.copy()
plt.figure(figsize = (6, 4))
plt.plot(X[:,0], X[:,1], 'b.', alpha = 0.3)
plt.plot(mu[:,0], mu[:,1], 'rx', markersize = 15)
plt.axis('equal')
plt.show()
y = np.empty([m,1])
# Run K-means
for n_iter in range(500):
    # Assignment step: assign each point to its nearest centroid
    for i in range(m):
        d0 = np.linalg.norm(X[i,:] - mu[0,:], 2)
        d1 = np.linalg.norm(X[i,:] - mu[1,:], 2)
        d2 = np.linalg.norm(X[i,:] - mu[2,:], 2)
        y[i] = np.argmin([d0, d1, d2])

    # Update step: move each centroid to the mean of its assigned points
    err = 0
    for i in range(k):
        mu[i,:] = np.mean(X[np.where(y == i)[0]], axis = 0)
        err += np.linalg.norm(pre_mu[i,:] - mu[i,:], 2)

    pre_mu = mu.copy()

    # stop when the centroids no longer move
    if err < 1e-10:
        print("Iteration:", n_iter)
        break
X0 = X[np.where(y==0)[0]]
X1 = X[np.where(y==1)[0]]
X2 = X[np.where(y==2)[0]]
plt.figure(figsize = (6, 4))
plt.plot(X0[:,0], X0[:,1], 'b.', label = 'C0')
plt.plot(X1[:,0], X1[:,1], 'g.', label = 'C1')
plt.plot(X2[:,0], X2[:,1], 'r.', label = 'C2')
plt.axis('equal')
plt.legend()
plt.show()
# use kmeans from the scikit-learn module
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 3, n_init = 10, random_state = 0)
kmeans.fit(np.array(X))
plt.figure(figsize = (6, 4))
plt.plot(X[kmeans.labels_ == 0,0],X[kmeans.labels_ == 0,1], 'b.', label = 'C0')
plt.plot(X[kmeans.labels_ == 1,0],X[kmeans.labels_ == 1,1], 'g.', label = 'C1')
plt.plot(X[kmeans.labels_ == 2,0],X[kmeans.labels_ == 2,1], 'r.', label = 'C2')
plt.axis('equal')
plt.legend()
plt.show()
Initialization plays a crucial role in the performance of the K-Means algorithm. Poor initialization can significantly affect both the algorithm's convergence and the quality of the resulting clusters.
Sensitivity to Initialization
K-Means is highly sensitive to the choice of initial cluster centers.
Inappropriate initialization can lead to:
Convergence to a poor local optimum rather than the best possible clustering.
Slow convergence, requiring many more iterations.
Empty or unbalanced clusters when initial centroids start too close together.
To mitigate the risks associated with poor initialization, several techniques can be employed:
Strategic Selection of Initial Centroids:
Choose the first centroid randomly from the data points.
Select subsequent centroids by choosing the point farthest from the previously chosen centroids. This method helps ensure well-separated starting points.
Multiple Initializations (Random Restarts):
Run the K-Means algorithm multiple times with different random initializations.
Select the result that minimizes the total within-cluster variance (i.e., the best objective value).
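The sketch below illustrates both ideas on synthetic data (the helper name farthest_point_init and the demo variables are illustrative, not part of the original code). Note that scikit-learn's default init = 'k-means++' is a randomized refinement of the farthest-point idea, and n_init controls the number of random restarts.

import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)
X_demo = np.vstack([np.random.multivariate_normal(c, np.eye(2), 100)
                    for c in ([1, 1], [3, 5], [9, 9])])

def farthest_point_init(X, k):
    # pick the first centroid at random, then repeatedly pick the point
    # farthest from all centroids chosen so far (well-separated starting points)
    centroids = [X[np.random.randint(len(X))]]
    for _ in range(k - 1):
        dist = np.min([np.linalg.norm(X - c, axis = 1) for c in centroids], axis = 0)
        centroids.append(X[np.argmax(dist)])
    return np.array(centroids)

print(farthest_point_init(X_demo, 3))

# random restarts: run K-Means n_init times with different random initializations
# and keep the solution with the smallest within-cluster variance (inertia)
best = KMeans(n_clusters = 3, init = 'random', n_init = 10, random_state = 0).fit(X_demo)
print(best.inertia_)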
In K-Means clustering, the number of clusters $ k $ must be specified in advance. However, determining the value of $ k $ is often a challenging task, particularly in real-world applications where the true number of underlying clusters is not immediately evident.
The Elbow Method
The Elbow Method is a popular technique for selecting the optimal number of clusters in K-Means.
The Elbow Method is based on the idea that increasing $k$ will generally decrease the K-Means objective, the within-cluster sum of squares (WCSS):
$ J = \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c_i} \right\rVert^2 $
Where $\mu_{c_i}$ is the centroid of the cluster to which point $x^{(i)}$ is assigned.
However, adding more clusters will eventually yield diminishing returns - additional clusters no longer provide significant improvement.
The optimal $k$ is typically found at the "elbow point," where the rate of decrease in the objective function slows noticeably.
# data generation
G0 = np.random.multivariate_normal([1, 1], np.eye(2), 100)
G1 = np.random.multivariate_normal([3, 5], np.eye(2), 100)
G2 = np.random.multivariate_normal([9, 9], np.eye(2), 100)
X = np.vstack([G0, G1, G2])
wcss = []
for i in range(1,11):
    kmeans = KMeans(n_clusters = i, n_init = 10, random_state = 0).fit(X)
    wcss.append(kmeans.inertia_)    # within-cluster sum of squares for this k
plt.figure(figsize = (6, 4))
plt.plot(range(1,11), wcss, 'o-')
plt.plot(3, wcss[2], 'ro')    # highlight the elbow point at k = 3
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.xticks(np.arange(11))
plt.xlim([0.5, 10.5])
plt.grid(alpha = 0.3)
plt.show()
From the plot, we can observe that the elbow point occurs at $k=3$.
While K-Means is a widely used and effective clustering algorithm, it has several limitations that can affect its performance in certain scenarios.
Hard Assignments
K-Means makes hard assignments, meaning each point is:
Fully assigned to exactly one cluster, and
Completely excluded from all other clusters.
$\rightarrow$ Alternative Solution: Soft Assignment
K-Means does not provide a probability distribution over multiple clusters for each data point.
Algorithms such as the Gaussian Mixture Model (GMM) and Fuzzy K-Means provide soft assignments, allowing points to have a probability of belonging to multiple clusters. These approaches are often more suitable for datasets with overlapping clusters.
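As a brief illustration of soft assignment (a minimal sketch on synthetic, overlapping data; the variable names are illustrative), scikit-learn's GaussianMixture exposes predict_proba, which returns a probability for each cluster instead of a single hard label:

import numpy as np
from sklearn.mixture import GaussianMixture

np.random.seed(0)
# two overlapping Gaussian blobs
X_soft = np.vstack([np.random.multivariate_normal([0, 0], np.eye(2), 100),
                    np.random.multivariate_normal([2, 0], np.eye(2), 100)])

gmm = GaussianMixture(n_components = 2, random_state = 0).fit(X_soft)

# each row sums to 1: a point near the overlap receives split probabilities
# instead of being forced entirely into one cluster
print(gmm.predict_proba(X_soft[:5]))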
Sensitivity to Outliers
K-Means uses the mean to calculate cluster centers, which makes it highly sensitive to outliers.
Even a single extreme data point can significantly shift the centroid, degrading the quality of clustering.
$\rightarrow$ Alternative Solution: K-Medians
K-Medians replaces the mean with the coordinate-wise median in the update step. Because the median is far less affected by extreme values, the resulting cluster centers are more robust to outliers.
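A minimal numeric sketch of the difference (the values below are made up purely for illustration):

import numpy as np

# a small cluster with one extreme outlier
cluster_points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [50.0, 50.0]])

# K-Means update: the mean is dragged toward the outlier
print(cluster_points.mean(axis = 0))        # about [13.3, 13.2]

# K-Medians update: the coordinate-wise median stays near the bulk of the points
print(np.median(cluster_points, axis = 0))  # about [1.1, 1.05]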
Non-Convex Cluster Shapes
A key limitation of the K-Means algorithm is its inability to effectively handle non-convex cluster shapes. K-Means relies on minimizing the Euclidean distance between points and centroids, which assumes that clusters are convex (spherical) and well-separated. Consequently, it performs poorly when the data exhibits complex, non-convex structures, as the sketch below illustrates.
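One common way to see this limitation (a sketch using scikit-learn's make_moons generator, which is not part of the original example) is to run K-Means on two interleaving half-circles; the algorithm cuts straight across both moons because it can only carve out convex, centroid-centered regions:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# two interleaving half-circles: clearly two groups, but not convex
X_moons, _ = make_moons(n_samples = 300, noise = 0.05, random_state = 0)

labels = KMeans(n_clusters = 2, n_init = 10, random_state = 0).fit_predict(X_moons)

# K-Means splits the data with a straight boundary, mixing the two moons
plt.figure(figsize = (6, 4))
plt.scatter(X_moons[:, 0], X_moons[:, 1], c = labels, s = 10)
plt.axis('equal')
plt.show()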
Imbalanced Cluster Sizes (or Clusters with Different Densities)
One significant limitation of the K-Means algorithm is its poor performance when clusters have imbalanced sizes. Since K-Means assigns points to the nearest centroid based on Euclidean distance, it often struggles to correctly group data points in situations where some clusters are substantially larger or denser than others.
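The following sketch (with synthetic data chosen for illustration) generates one large, spread-out cluster next to a much smaller, tighter one; points on the edge of the large cluster can end up closer to the small cluster's centroid and get mis-assigned:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

np.random.seed(0)
# one large, spread-out cluster and one small, tight cluster close to it
big   = np.random.multivariate_normal([0, 0], 4 * np.eye(2), 1000)
small = np.random.multivariate_normal([5, 0], 0.2 * np.eye(2), 30)
X_imb = np.vstack([big, small])

labels = KMeans(n_clusters = 2, n_init = 10, random_state = 0).fit_predict(X_imb)

# points on the right edge of the large cluster can be pulled into the
# small cluster, since assignment depends only on distance to the centroids
plt.figure(figsize = (6, 4))
plt.scatter(X_imb[:, 0], X_imb[:, 1], c = labels, s = 10)
plt.axis('equal')
plt.show()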
This example demonstrates how the K-Means algorithm can be effectively used for image compression by reducing the number of colors in an image.
Understanding RGB Representation
Each pixel in a 24-bit image is represented by 3 values (Red, Green, and Blue) ranging from 0 to 255.
This creates a 3-dimensional RGB encoding (8 bits per channel), giving $256^3 \approx 16.7$ million possible colors per pixel.
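A quick check of the arithmetic behind this encoding:

# 8 bits per channel -> 256 intensity levels each for red, green, and blue
levels_per_channel = 2 ** 8
total_colors = levels_per_channel ** 3
print(total_colors)    # 16,777,216 possible colors per pixel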
from google.colab import drive
drive.mount('/content/drive')
import cv2
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
img = cv2.imread('/content/drive/MyDrive/ML/ML_data/bird.bmp')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)    # OpenCV loads images as BGR; convert to RGB for matplotlib
print(img.shape)
plt.figure(figsize = (5, 5))
plt.imshow(img)
plt.axis('off')
plt.show()
Compression Goal
The objective is to reduce the number of colors in the image to $k$ distinct colors using K-Means clustering.
By grouping similar pixel colors into clusters, we can efficiently represent the image using just $k$ color centroids.
from sklearn.cluster import KMeans
m = img.shape[0]           # image height; the example image is assumed to be square (m x m)
X = img.reshape(m*m, 3)    # flatten to one RGB row per pixel
k = 16
kmeans = KMeans(n_clusters = k, n_init = 10, random_state = 0)
kmeans.fit(X)
kmeans.cluster_centers_
kmeans.labels_
Assign Pixels to Centroids
For each pixel:
Assign it to its nearest centroid.
Store only the index (0 to $k-1$) of the assigned centroid.
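This is where the compression comes from. As a back-of-the-envelope estimate (the image size below is illustrative and file-format overhead is ignored), each pixel originally needs 24 bits, while the compressed representation needs only enough bits to index one of the $k$ colors plus a small shared palette of $k$ RGB centroids:

import numpy as np

k = 16                       # number of colors kept after clustering
n_pixels = 128 * 128         # illustrative size; the real count is img.shape[0] * img.shape[1]

original_bits   = n_pixels * 24                    # 24 bits per pixel (8 per channel)
compressed_bits = n_pixels * np.log2(k) + k * 24   # 4-bit index per pixel + color palette

print(original_bits / compressed_bits)             # roughly a 6x reduction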
X_clustered = np.zeros([m*m, 3])
for i in range(k):
    X_clustered[np.where(kmeans.labels_ == i)] = kmeans.cluster_centers_[i]    # replace each pixel with its centroid color
plt.figure(figsize = (5, 5))
plt.imshow(X_clustered.reshape(m, m, 3).astype('uint8'))
plt.axis('off')
plt.show()
The compressed image retains most of the original image's features.
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')