Unsupervised Learning: K-means Clustering
Table of Contents
from IPython.display import YouTubeVideo
YouTubeVideo('SjPrNKYcpC8', width = "560", height = "315")
In machine learning, algorithms are broadly categorized into two main types: Supervised Learning and Unsupervised Learning. These categories differ in terms of the data used for training, the learning process, and the outcomes they aim to achieve.
Supervised learning involves training a model using labeled data, where each input is paired with the correct output. The model learns to map inputs to their corresponding outputs by minimizing the error between predicted and actual values.
Key Characteristics
Labeled Data: Each training example has both input features and a known target output.
Learning Process: The algorithm iteratively adjusts its internal parameters to minimize prediction errors.
Goal: To generalize from training data and make accurate predictions on unseen data.
Common Algorithms
Regression: Linear Regression, Ridge Regression, Lasso Regression, $\cdots$
Classification: Logistic Regression, Decision Trees, Support Vector Machines (SVM), Neural Networks, $\cdots$
Unsupervised learning involves training a model on unlabeled data. The algorithm must discover patterns, relationships, or structures without explicit guidance on what to predict.
Key Characteristics
Unlabeled Data: The dataset contains only input features without predefined labels.
Learning Process: The model identifies underlying patterns, clusters, or associations in the data.
Goal: To find meaningful insights, group similar data points, or reduce data complexity.
Common Algorithms
Clustering: K-Means, Hierarchical Clustering, DBSCAN, $\cdots$
Dimensionality Reduction: PCA (Principal Component Analysis), t-SNE, UMAP, $\cdots$
Clustering is a fundamental technique in unsupervised machine learning that aims to group similar data points together based on their characteristics. Unlike supervised learning, clustering algorithms do not require labeled data. Instead, they attempt to uncover hidden patterns or natural groupings within the data.
Clustering is the process of dividing a dataset into distinct groups, or clusters, such that:
Data points within the same cluster are highly similar.
Data points in different clusters are significantly dissimilar.
The key goal in clustering is to identify meaningful structures in data without predefined labels.
Clustering is useful when:
You want to explore data to identify patterns or trends.
There are no clear labels for grouping the data.
You need to segment data for better analysis or decision-making.
You aim to reduce data complexity by grouping similar points.
K-Means is a widely used clustering algorithm designed to partition a dataset into $k$ distinct clusters. It is known for its simplicity, efficiency, and effectiveness in grouping data points based on similarity.
Given: a set of unlabeled data points $x^{(1)}, x^{(2)}, \cdots, x^{(m)} \in \mathbb{R}^n$ and the number of clusters $k$.
Goal: The objective is to group the given data into $k$ partitions such that:
Data points in the same cluster are as similar as possible.
Data points in different clusters are as dissimilar as possible.
Core Principle of Clustering
The only information clustering relies on is the similarity between data points.
Clustering algorithms group points by maximizing similarity within clusters while minimizing similarity between clusters.
High within-cluster similarity: Points within the same cluster are closely packed.
Low inter-cluster similarity: Points in different clusters are well-separated.
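To make "similarity" concrete, here is a minimal sketch (the points below are made up for illustration and are not part of the original notes) that measures dissimilarity with the Euclidean distance, the same metric K-Means uses: a small distance means high similarity.

import numpy as np

# three illustrative 2-D points: a and b are close together, c is far away
a = np.array([1.0, 1.0])
b = np.array([1.2, 0.9])
c = np.array([9.0, 9.0])

# Euclidean distance as a dissimilarity measure:
# small distance -> high similarity, large distance -> low similarity
print(np.linalg.norm(a - b))    # small: a and b belong in the same cluster
print(np.linalg.norm(a - c))    # large: a and c belong in different clusters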
The "Chicken-and-Egg" Dilemma
Clustering inherently faces a circular dependency: the cluster centers and the cluster memberships each depend on the other.
Question 1: If we knew the cluster centers ($\mu_j$), how would we assign points to clusters?
Answer 1: For each point $x^{(i)}$, assign it to the cluster with the closest center.
Question 2: If we knew the cluster memberships, how would we determine the cluster centers?
Answer 2: For each cluster, choose the cluster center $\mu_j$ to be the mean of all points in that cluster.
The Circular Dependency
Each step depends on the other:
Knowing the centers helps determine the assignments, and knowing the assignments helps determine the centers.
Since neither is known at the beginning, the system appears to be stuck.
Breaking the Dilemma
The resolution is to simply start by performing any step arbitrarily:
Start by guessing the cluster centers.
Surprisingly, despite starting with potentially poor initial conditions, this iterative process allows the system to gradually refine both the cluster assignments and the cluster centers. With each iteration, the solution improves.
The Power of Iterative Improvement: Lessons from K-Means Clustering
The K-Means clustering algorithm demonstrates a powerful lesson in problem-solving: when faced with uncertainty or a dilemma, start taking action - even if your initial steps are imperfect - and refine as you go.
This philosophy of "start, iterate, improve" is a powerful concept not only in clustering but also in broader problem-solving strategies.
In Decision-Making: When facing uncertainty, take the first step rather than waiting for perfect conditions.
In Learning: Start by attempting, then refine your understanding through experience.
In Personal Growth: Progress often begins with small, imperfect actions that improve over time.
Step 1: Initialization
Randomly select $k$ initial cluster centroids $\mu_1, \mu_2, \cdots, \mu_k \in \mathbb{R}^n$ (for example, by picking $k$ data points at random).
Step 2: Iterative Process
The algorithm alternates between two key steps:
(1) Assignment Step:
For each point $x^{(i)}$, assign it to the nearest centroid:
$ c_i := \arg\min_{j} \left\lVert x^{(i)} - \mu_j \right\rVert^2 $
(2) Update Step:
For each cluster $j$, update the centroid to be the mean of the points assigned to that cluster:
$ \mu_j := \dfrac{\sum_{i=1}^{m} \mathbf{1}\{c_i = j\}\, x^{(i)}}{\sum_{i=1}^{m} \mathbf{1}\{c_i = j\}} $
Where $\mathbf{1}\{c_i = j\}$ equals 1 if point $x^{(i)}$ is currently assigned to cluster $j$ and 0 otherwise.
Step 3: Convergence Check
Repeat the Assignment and Update steps until:
Cluster assignments no longer change, or
The movement of centroids is below a small threshold.
Output: the final cluster centroids $\mu_1, \mu_2, \cdots, \mu_k$ and the cluster assignment $c_i$ of each data point $x^{(i)}$.
Summary: K-means Algorithm
$ \,\text{Randomly initialize } k \,\text{cluster centroids } \mu_1,\mu_2,\cdots,\mu_k \in \mathbb{R}^n$
$ \begin{align*} \text{Repeat}\;&\{ \\ &\text{for $i=1$ to $m$} \\ &\quad \text{$c_i$ := index (from 1 to $k$) of cluster centroid closest to $x^{(i)}$} \\ &\text{for $j=1$ to $k$} \\ &\quad \text{$\mu_j$ := average (mean) of points assigned to cluster $j$} \\ &\} \end{align*} $
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# data generation
G0 = np.random.multivariate_normal([1, 1], np.eye(2), 100)
G1 = np.random.multivariate_normal([3, 5], np.eye(2), 100)
G2 = np.random.multivariate_normal([9, 9], np.eye(2), 100)
X = np.vstack([G0, G1, G2])
X = np.asmatrix(X)
plt.figure(figsize = (6, 4))
plt.plot(X[:,0], X[:,1], 'b.', alpha = 0.3)
plt.axis('equal')
plt.show()
# The number of clusters and data
k = 3
m = X.shape[0]
# randomly initialize mean points (choose k distinct data points as initial centroids)
mu = X[np.random.choice(m, k, replace = False), :]
pre_mu = mu.copy()
plt.figure(figsize = (6, 4))
plt.plot(X[:,0], X[:,1], 'b.', alpha = 0.3)
plt.plot(mu[:,0], mu[:,1], 'rx', markersize = 15)
plt.axis('equal')
plt.show()
y = np.empty([m,1])
# Run K-means
for n_iter in range(500):
    # Assignment step: assign each point to its nearest centroid
    for i in range(m):
        d0 = np.linalg.norm(X[i,:] - mu[0,:], 2)
        d1 = np.linalg.norm(X[i,:] - mu[1,:], 2)
        d2 = np.linalg.norm(X[i,:] - mu[2,:], 2)
        y[i] = np.argmin([d0, d1, d2])

    # Update step: move each centroid to the mean of its assigned points
    err = 0
    for i in range(k):
        mu[i,:] = np.mean(X[np.where(y == i)[0]], axis = 0)
        err += np.linalg.norm(pre_mu[i,:] - mu[i,:], 2)

    pre_mu = mu.copy()

    # stop when the centroids no longer move
    if err < 1e-10:
        print("Iteration:", n_iter)
        break
X0 = X[np.where(y==0)[0]]
X1 = X[np.where(y==1)[0]]
X2 = X[np.where(y==2)[0]]
plt.figure(figsize = (6, 4))
plt.plot(X0[:,0], X0[:,1], 'b.', label = 'C0')
plt.plot(X1[:,0], X1[:,1], 'g.', label = 'C1')
plt.plot(X2[:,0], X2[:,1], 'r.', label = 'C2')
plt.axis('equal')
plt.legend()
plt.show()
# use kmeans from the scikit-learn module
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 3, n_init = 10, random_state = 0)
kmeans.fit(np.array(X))
plt.figure(figsize = (6, 4))
plt.plot(X[kmeans.labels_ == 0,0],X[kmeans.labels_ == 0,1], 'b.', label = 'C0')
plt.plot(X[kmeans.labels_ == 1,0],X[kmeans.labels_ == 1,1], 'g.', label = 'C1')
plt.plot(X[kmeans.labels_ == 2,0],X[kmeans.labels_ == 2,1], 'r.', label = 'C2')
plt.axis('equal')
plt.legend()
plt.show()
Initialization plays a crucial role in the performance of the K-Means algorithm. Poor initialization can significantly affect both the algorithm's convergence and the quality of the resulting clusters.
Sensitivity to Initialization
K-Means is highly sensitive to the choice of initial cluster centers.
Inappropriate initialization can lead to:
Convergence to a poor local optimum rather than the best possible clustering.
Slow convergence, requiring many more iterations.
Empty or unbalanced clusters when initial centroids start too close together.
To mitigate the risks associated with poor initialization, several techniques can be employed:
Strategic Selection of Initial Centroids:
Choose the first centroid randomly from the data points.
Select subsequent centroids by choosing the point farthest from the previously chosen centroids. This method helps ensure well-separated starting points.
Multiple Initializations (Random Restarts):
Run the K-Means algorithm multiple times with different random initializations.
Select the result that minimizes the total within-cluster variance (i.e., the best objective value).
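The sketch below illustrates both ideas on synthetic data (the helper name farthest_point_init and the demo variables are illustrative, not part of the original code). Note that scikit-learn's default init = 'k-means++' is a randomized refinement of the farthest-point idea, and n_init controls the number of random restarts.

import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)
X_demo = np.vstack([np.random.multivariate_normal(c, np.eye(2), 100)
                    for c in ([1, 1], [3, 5], [9, 9])])

def farthest_point_init(X, k):
    # pick the first centroid at random, then repeatedly pick the point
    # farthest from all centroids chosen so far (well-separated starting points)
    centroids = [X[np.random.randint(len(X))]]
    for _ in range(k - 1):
        dist = np.min([np.linalg.norm(X - c, axis = 1) for c in centroids], axis = 0)
        centroids.append(X[np.argmax(dist)])
    return np.array(centroids)

print(farthest_point_init(X_demo, 3))

# random restarts: run K-Means n_init times with different random initializations
# and keep the solution with the smallest within-cluster variance (inertia)
best = KMeans(n_clusters = 3, init = 'random', n_init = 10, random_state = 0).fit(X_demo)
print(best.inertia_)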
In K-Means clustering, the number of clusters $ k $ must be specified in advance. However, determining the value of $ k $ is often a challenging task, particularly in real-world applications where the true number of underlying clusters is not immediately evident.
The Elbow Method
The Elbow Method is a popular technique for selecting the optimal number of clusters in K-Means.
The Elbow Method is based on the idea that increasing $k$ will generally decrease the K-Means objective, the within-cluster sum of squares (WCSS):
$ J = \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c_i} \right\rVert^2 $
Where $\mu_{c_i}$ is the centroid of the cluster to which point $x^{(i)}$ is assigned.
However, adding more clusters will eventually yield diminishing returns - additional clusters no longer provide significant improvement.
The optimal $k$ is typically found at the "elbow point," where the rate of decrease in the objective function slows noticeably.
# data generation
G0 = np.random.multivariate_normal([1, 1], np.eye(2), 100)
G1 = np.random.multivariate_normal([3, 5], np.eye(2), 100)
G2 = np.random.multivariate_normal([9, 9], np.eye(2), 100)
X = np.vstack([G0, G1, G2])
wcss = []
for i in range(1,11):
    kmeans = KMeans(n_clusters = i, n_init = 10, random_state = 0).fit(X)
    wcss.append(kmeans.inertia_)    # within-cluster sum of squares for this k
plt.figure(figsize = (6, 4))
plt.plot(range(1,11), wcss, 'o-')
plt.plot(3, wcss[2], 'ro')    # highlight the elbow point at k = 3
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.xticks(np.arange(11))
plt.xlim([0.5, 10.5])
plt.grid(alpha = 0.3)
plt.show()
From the plot, we can observe that the elbow point occurs at $k=3$.
While K-Means is a widely used and effective clustering algorithm, it has several limitations that can affect its performance in certain scenarios.
Hard Assignments
K-Means makes hard assignments, meaning each point is:
Fully assigned to exactly one cluster, and
Completely excluded from all other clusters.
$\rightarrow$ Alternative Solution: Soft Assignment
K-Means does not provide a probability distribution over multiple clusters for each data point.
Algorithms such as the Gaussian Mixture Model (GMM) and Fuzzy K-Means provide soft assignments, allowing points to have a probability of belonging to multiple clusters. These approaches are often more suitable for datasets with overlapping clusters.
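As a brief illustration of soft assignment (a minimal sketch on synthetic, overlapping data; the variable names are illustrative), scikit-learn's GaussianMixture exposes predict_proba, which returns a probability for each cluster instead of a single hard label:

import numpy as np
from sklearn.mixture import GaussianMixture

np.random.seed(0)
# two overlapping Gaussian blobs
X_soft = np.vstack([np.random.multivariate_normal([0, 0], np.eye(2), 100),
                    np.random.multivariate_normal([2, 0], np.eye(2), 100)])

gmm = GaussianMixture(n_components = 2, random_state = 0).fit(X_soft)

# each row sums to 1: a point near the overlap receives split probabilities
# instead of being forced entirely into one cluster
print(gmm.predict_proba(X_soft[:5]))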
Sensitivity to Outliers
K-Means uses the mean to calculate cluster centers, which makes it highly sensitive to outliers.
Even a single extreme data point can significantly shift the centroid, degrading the quality of clustering.
$\rightarrow$ Alternative Solution: K-Medians
K-Medians replaces the mean with the coordinate-wise median in the update step. Because the median is far less affected by extreme values, the resulting cluster centers are more robust to outliers.
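A minimal numeric sketch of the difference (the values below are made up purely for illustration):

import numpy as np

# a small cluster with one extreme outlier
cluster_points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [50.0, 50.0]])

# K-Means update: the mean is dragged toward the outlier
print(cluster_points.mean(axis = 0))        # about [13.3, 13.2]

# K-Medians update: the coordinate-wise median stays near the bulk of the points
print(np.median(cluster_points, axis = 0))  # about [1.1, 1.05]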
Non-Convex Cluster Shapes
A key limitation of the K-Means algorithm is its inability to effectively handle non-convex cluster shapes. K-Means relies on minimizing the Euclidean distance between points and centroids, which assumes that clusters are convex (spherical) and well-separated. Consequently, it performs poorly when the data exhibits complex, non-convex structures, as the sketch below illustrates.
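One common way to see this limitation (a sketch using scikit-learn's make_moons generator, which is not part of the original example) is to run K-Means on two interleaving half-circles; the algorithm cuts straight across both moons because it can only carve out convex, centroid-centered regions:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# two interleaving half-circles: clearly two groups, but not convex
X_moons, _ = make_moons(n_samples = 300, noise = 0.05, random_state = 0)

labels = KMeans(n_clusters = 2, n_init = 10, random_state = 0).fit_predict(X_moons)

# K-Means splits the data with a straight boundary, mixing the two moons
plt.figure(figsize = (6, 4))
plt.scatter(X_moons[:, 0], X_moons[:, 1], c = labels, s = 10)
plt.axis('equal')
plt.show()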
Imbalanced Cluster Sizes (or Clusters with Different Densities)
One significant limitation of the K-Means algorithm is its poor performance when clusters have imbalanced sizes. Since K-Means assigns points to the nearest centroid based on Euclidean distance, it often struggles to correctly group data points in situations where some clusters are substantially larger or denser than others.
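The following sketch (with synthetic data chosen for illustration) generates one large, spread-out cluster next to a much smaller, tighter one; points on the edge of the large cluster can end up closer to the small cluster's centroid and get mis-assigned:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

np.random.seed(0)
# one large, spread-out cluster and one small, tight cluster close to it
big   = np.random.multivariate_normal([0, 0], 4 * np.eye(2), 1000)
small = np.random.multivariate_normal([5, 0], 0.2 * np.eye(2), 30)
X_imb = np.vstack([big, small])

labels = KMeans(n_clusters = 2, n_init = 10, random_state = 0).fit_predict(X_imb)

# points on the right edge of the large cluster can be pulled into the
# small cluster, since assignment depends only on distance to the centroids
plt.figure(figsize = (6, 4))
plt.scatter(X_imb[:, 0], X_imb[:, 1], c = labels, s = 10)
plt.axis('equal')
plt.show()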
This example demonstrates how the K-Means algorithm can be effectively used for image compression by reducing the number of colors in an image.
Understanding RGB Representation
Each pixel in a 24-bit image is represented by 3 values (Red, Green, and Blue) ranging from 0 to 255.
This creates a 3-dimensional RGB encoding (8 bits per channel), giving $256^3 \approx 16.7$ million possible colors per pixel.
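A quick check of the arithmetic behind this encoding:

# 8 bits per channel -> 256 intensity levels each for red, green, and blue
levels_per_channel = 2 ** 8
total_colors = levels_per_channel ** 3
print(total_colors)    # 16,777,216 possible colors per pixel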
from google.colab import drive
drive.mount('/content/drive')
import cv2
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
img = cv2.imread('/content/drive/MyDrive/ML/ML_data/bird.bmp')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)    # OpenCV loads images as BGR; convert to RGB for matplotlib
print(img.shape)
plt.figure(figsize = (5, 5))
plt.imshow(img)
plt.axis('off')
plt.show()
Compression Goal
The objective is to reduce the number of colors in the image to $k$ distinct colors using K-Means clustering.
By grouping similar pixel colors into clusters, we can efficiently represent the image using just $k$ color centroids.
from sklearn.cluster import KMeans
m = img.shape[0]           # image height; the example image is assumed to be square (m x m)
X = img.reshape(m*m, 3)    # flatten to one RGB row per pixel
k = 16
kmeans = KMeans(n_clusters = k, n_init = 10, random_state = 0)
kmeans.fit(X)
kmeans.cluster_centers_
kmeans.labels_
Assign Pixels to Centroids
For each pixel:
Assign it to its nearest centroid.
Store only the index (0 to $k-1$) of the assigned centroid.
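This is where the compression comes from. As a back-of-the-envelope estimate (the image size below is illustrative and file-format overhead is ignored), each pixel originally needs 24 bits, while the compressed representation needs only enough bits to index one of the $k$ colors plus a small shared palette of $k$ RGB centroids:

import numpy as np

k = 16                       # number of colors kept after clustering
n_pixels = 128 * 128         # illustrative size; the real count is img.shape[0] * img.shape[1]

original_bits   = n_pixels * 24                    # 24 bits per pixel (8 per channel)
compressed_bits = n_pixels * np.log2(k) + k * 24   # 4-bit index per pixel + color palette

print(original_bits / compressed_bits)             # roughly a 6x reduction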
X_clustered = np.zeros([m*m, 3])
for i in range(k):
    X_clustered[np.where(kmeans.labels_ == i)] = kmeans.cluster_centers_[i]    # replace each pixel with its centroid color
plt.figure(figsize = (5, 5))
plt.imshow(X_clustered.reshape(m, m, 3).astype('uint8'))
plt.axis('off')
plt.show()
The compressed image retains most of the original image's features.
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')