Unsupervised Learning: K-means Clustering

Table of Contents

1. Supervised vs. Unsupervised Learning

  • Supervised: building a model from labeled data
  • Unsupervised: clustering from unlabeled data

Supervised Learning


$$ \begin{array}{Icr} \{x^{(1)},x^{(2)},\cdots,x^{(m)}\}\\ \{y^{(1)},y^{(2)},\cdots,y^{(m)}\} \end{array} \quad \Rightarrow \quad \text{Classification} $$

Unsupervised Learning

  • Data clustering is an unsupervised learning problem

  • Given:

  • $m$ unlabeled examples $\{x^{(1)},x^{(2)}\cdots, x^{(m)}\}$

  • the number of partitions $k$

  • Goal: group the examples into $k$ partitions


$$\{x^{(1)},x^{(2)},\cdots,x^{(m)}\} \quad \Rightarrow \quad \text{Clustering}$$


  • The only information clustering uses is the similarity between examples

  • Clustering groups examples based of their mutual similarities

  • A good clustering is one that achieves:

    • high within-cluster similarity
    • low inter-cluster similarity
  • It is a "chicken and egg" problem (dilemma)

    • Q: if we knew $c_i$s, how would we determine which points to associate with each cluster center?
    • A: for each point $x^{(i)}$, choose closest $c_i$

    • Q: if we knew the cluster memberships, how do we get the centers?
    • A: choose $c_i$ to be the mean of all points in the cluster

2. K-means

2.1. (Iterative) Algorithm


1) Initialization

Input:

  • $k$: the number of clusters
  • Training set $\{x^{(1)},x^{(2)},\cdots,x^{(m)}\}$

Randomly initialized anywhere in $\mathbb{R}^n$




2) Iteration





$$ \begin{align*} c_k &= \{n: k = \arg \min_k \, \lVert x_n - \mu_k \rVert^2\}\\ \mu_k &= \frac{1}{\lvert c_k \rvert} \sum_{n \in c_k}x_n \end{align*} $$


Repeat until convergence

  • A possible convergence criteria: cluster centers do not change anymore

3) Output

Output: model

  • $c$ (label): index (1 to $k$) of cluster centroid $\{c_1,c_2,\cdots,c_k\}$
  • $\mu$ : averages (mean) of points assigned to cluster $\{\mu_1,\mu_2,\cdots,\mu_k\}$


2.2. Summary: K-means Algorithm

$ \,\text{Randomly initialize } k \,\text{cluster centroids } \mu_1,\mu_2,\cdots,\mu_k \in \mathbb{R}^n$

$ \begin{align*} \text{Repeat}\;&\{ \\ &\text{for $i=1$ to $m$} \\ &\quad \text{$c_i$ := index (from 1 to $k$) of cluster centroid closest to $x^{(i)}$} \\ &\text{for $k=1$ to $k$} \\ &\quad \text{$\mu_k$ := average (mean) of points assigned to cluster $k$} \\ &\} \end{align*} $

2.3. K-means Optimization Point of View (optional)

  • $c_i$= index of cluster $(1,2,\cdots,k)$ to which example $x^{(i)}$ is currently assigned

  • $\mu_k$= cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$)

  • $\mu_{c_i}$= cluster centroid of cluster to which example $x^{(i)}$ has been assigned

  • Optimization objective:


$$ J(c_1,\cdots,c_m,\mu_1,\cdots,\mu_k) = \frac{1}{m}\sum\limits_{i=1}^m \lVert x^{(i)}-\mu_{c_i}\rVert^2$$

$$ \min\limits_{c_1,\cdots,c_m, \; \mu_1,\cdots,\mu_k} J(c_1,\cdots,c_m,\mu_1,\cdots,\mu_k)$$


3. Python code

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
# data generation
G0 = np.random.multivariate_normal([1, 1], np.eye(2), 100)
G1 = np.random.multivariate_normal([3, 5], np.eye(2), 100)
G2 = np.random.multivariate_normal([9, 9], np.eye(2), 100)

X = np.vstack([G0, G1, G2])
X = np.asmatrix(X)
print(X.shape)

plt.figure(figsize = (6, 4))
plt.plot(X[:,0], X[:,1], 'b.')
plt.axis('equal')
plt.show()
(300, 2)
No description has been provided for this image
In [ ]:
# The number of clusters and data

k = 3
m = X.shape[0]

# ramdomly initialize mean points
mu = X[np.random.randint(0, m, k), :]
pre_mu = mu.copy()
print(mu)

plt.figure(figsize = (6, 4))
plt.plot(X[:,0], X[:,1], 'b.')
plt.plot(mu[:,0], mu[:,1], 'ko')
plt.axis('equal')
plt.show()
[[ 1.81486329  0.58239813]
 [ 3.92874216  3.73896023]
 [ 1.05318842 -1.02330076]]
No description has been provided for this image
In [ ]:
y = np.empty([m,1])

# Run K-means
for n_iter in range(500):
    for i in range(m):
        d0 = np.linalg.norm(X[i,:] - mu[0,:], 2)
        d1 = np.linalg.norm(X[i,:] - mu[1,:], 2)
        d2 = np.linalg.norm(X[i,:] - mu[2,:], 2)

        y[i] = np.argmin([d0, d1, d2])

    err = 0
    for i in range(k):
        mu[i,:] = np.mean(X[np.where(y == i)[0]], axis = 0)
        err += np.linalg.norm(pre_mu[i,:] - mu[i,:], 2)

    pre_mu = mu.copy()

    if err < 1e-10:
        print("Iteration:", n_iter)
        break
Iteration: 7
In [ ]:
X0 = X[np.where(y==0)[0]]
X1 = X[np.where(y==1)[0]]
X2 = X[np.where(y==2)[0]]

plt.figure(figsize = (6, 4))
plt.plot(X0[:,0], X0[:,1], 'b.', label = 'C0')
plt.plot(X1[:,0], X1[:,1], 'g.', label = 'C1')
plt.plot(X2[:,0], X2[:,1], 'r.', label = 'C2')
plt.axis('equal')
plt.legend(fontsize = 12)
plt.show()
No description has been provided for this image
In [ ]:
# use kmeans from the scikit-learn module

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters = 3, n_init = 10, random_state = 0)
kmeans.fit(np.array(X))

plt.figure(figsize = (6, 4))
plt.plot(X[kmeans.labels_ == 0,0],X[kmeans.labels_ == 0,1], 'b.', label = 'C0')
plt.plot(X[kmeans.labels_ == 1,0],X[kmeans.labels_ == 1,1], 'g.', label = 'C1')
plt.plot(X[kmeans.labels_ == 2,0],X[kmeans.labels_ == 2,1], 'r.', label = 'C2')
plt.axis('equal')
plt.legend(fontsize = 12)
plt.show()
No description has been provided for this image

4. Some Issues in K-means

4.1. K-means: Initialization issues

  • k-means is extremely senstitive to cluster center initialization

  • Bad initialization can lead to

    • Poor convergence speed
    • Bad overall clustering
  • Safeguarding measures:

    • Choose first center as one of the examples, second which is the farthest from the first, third which is the farthest from both, and so on.
    • Try multiple initialization and choose the best result

4.2. Choosing the Number of Clusters

  • Idea: when adding another cluster does not give much better modeling of the data

  • One way to select $k$ for the K-means algorithm is to try different values of $k$, plot the K-means objective versus $k$, and look at the 'elbow-point' in the plot

In [ ]:
# data generation
G0 = np.random.multivariate_normal([1, 1], np.eye(2), 100)
G1 = np.random.multivariate_normal([3, 5], np.eye(2), 100)
G2 = np.random.multivariate_normal([9, 9], np.eye(2), 100)

X = np.vstack([G0, G1, G2])
In [ ]:
cost = []
for i in range(1,11):
    kmeans = KMeans(n_clusters = i, n_init = 10, random_state = 0).fit(X)
    cost.append(abs(kmeans.score(X)))

plt.figure(figsize = (6, 4))
plt.stem(range(1,11), cost)
plt.xticks(np.arange(11))
plt.xlim([0.5, 10.5])
plt.grid(alpha = 0.3)
plt.show()
No description has been provided for this image

4.3. K-means: Limitations

  • Make hard assignments of points to clusters

    • A point either completely belongs to a cluster or not belongs at all
    • No notion of a soft assignment (i.e., probability of being assigned to each cluster)
    • Gaussian mixture model (we will study later) and Fuzzy K-means allow soft assignments
  • Sensitive to outlier examples (such example can affect the mean by a lot)

    • K-medians algorithm is a more robust alternative for data with outliers
  • Works well only for round shaped, and of roughly equal sizes/density cluster

  • Does badly if the cluster have non-convex shapes

    • Spectral clustering (we will study later) and Kernelized K-means can be an alternative
  • Non-convex/non-round-shaped cluster: standard K-means fails !


  • Clusters with different densities

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
import cv2
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

img = cv2.imread('/content/drive/MyDrive/ML_Colab/ML_data/bird.bmp')
print(img.shape)

plt.figure(figsize = (6, 6))
plt.imshow(img)
plt.axis('off')
plt.show()
(538, 538, 3)
No description has been provided for this image
In [ ]:
from sklearn.cluster import KMeans

m = img.shape[0]
X = img.reshape(m*m, 3)

k = 6
kmeans = KMeans(n_clusters = k, n_init = 10, random_state = 0)
kmeans.fit(X)
Out[ ]:
KMeans(n_clusters=6, n_init=10, random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
In [ ]:
kmeans.cluster_centers_
Out[ ]:
array([[ 28.05233436,  30.61784736,  29.60313112],
       [115.36443238, 180.29499531, 220.19269706],
       [ 70.80793196,  81.53310905,  99.27704992],
       [ 66.14371143, 131.1380864 , 179.56029163],
       [195.91452463, 227.07209622, 239.95924399],
       [149.87682175, 146.06898122, 153.96272422]])
In [ ]:
kmeans.labels_
Out[ ]:
array([1, 1, 1, ..., 0, 0, 0], dtype=int32)
In [ ]:
X_clustered = np.zeros([m*m, 3])

for i in range(k):
    X_clustered[np.where(kmeans.labels_ == i)] = kmeans.cluster_centers_[i]
In [ ]:
plt.figure(figsize = (6, 6))
plt.imshow(X_clustered.reshape(m, m, 3).astype('uint8'))
plt.axis('off')
plt.show()
No description has been provided for this image
In [ ]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')