Unsupervised Learning: K-means Clustering


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

1. Supervised vs. Unsupervised Learning

  • Supervised: building a model from labeled data
  • Unsupervised: clustering from unlabeled data

Supervised Learning


$$ \begin{array}{c} \{x^{(1)},x^{(2)},\cdots,x^{(m)}\}\\ \{y^{(1)},y^{(2)},\cdots,y^{(m)}\} \end{array} \quad \Rightarrow \quad \text{Classification} $$


Unsupervised Learning

  • Data clustering is an unsupervised learning problem

  • Given:

    • $m$ unlabeled examples $\{x^{(1)},x^{(2)},\cdots,x^{(m)}\}$
    • the number of partitions $k$
  • Goal: group the examples into $k$ partitions


$$\{x^{(1)},x^{(2)},\cdots,x^{(m)}\} \quad \Rightarrow \quad \text{Clustering}$$


  • The only information clustering uses is the similarity between examples

  • Clustering groups examples based on their mutual similarities

  • A good clustering is one that achieves:

    • high within-cluster similarity
    • low inter-cluster similarity
  • It is a "chicken and egg" problem (dilemma)
    • Q: if we knew $c_i$s, how would we determine which points to associate with each cluster center?
    • A: for each point $x^{(i)}$, choose closest $c_i$

    • Q: if we knew the cluster memberships, how do we get the centers?
    • A: choose $c_i$ to be the mean of all points in the cluster

2. K-means

2.1. (Iterative) Algorithm


1) Initialization

Input:

  • $k$: the number of clusters
  • Training set $\{x^{(1)},x^{(2)},\cdots,x^{(m)}\}$

The $k$ cluster centroids are randomly initialized anywhere in $\mathbb{R}^n$




2) Iteration





$$ \begin{align*} c_k &= \{n: k = \arg \min_j \, \lVert x_n - \mu_j \rVert^2\}\\ \mu_k &= \frac{1}{\lvert c_k \rvert} \sum_{n \in c_k}x_n \end{align*} $$

Repeat until convergence

  • A possible convergence criterion: the cluster centers do not change anymore

3) Output

Output: model

  • $c$ (label): index (1 to $k$) of the closest cluster centroid for each example, $\{c_1,c_2,\cdots,c_m\}$
  • $\mu$: averages (means) of the points assigned to each cluster, $\{\mu_1,\mu_2,\cdots,\mu_k\}$



2.2. Summary: K-means Algorithm

$ \,\text{Randomly initialize } k \,\text{cluster centroids } \mu_1,\mu_2,\cdots,\mu_k \in \mathbb{R}^n$

$ \begin{align*} \text{Repeat}\;&\{ \\ &\text{for $i=1$ to $m$} \\ &\quad \text{$c_i$ := index (from 1 to $k$) of the cluster centroid closest to $x^{(i)}$} \\ &\text{for $j=1$ to $k$} \\ &\quad \text{$\mu_j$ := average (mean) of points assigned to cluster $j$} \\ &\} \end{align*} $
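
Before the full walkthrough in Section 3, the two steps can also be written as a compact, vectorized sketch. This is only an illustration: the function name, the random seed, and the use of np.allclose as a stopping test are choices made here, and it assumes no cluster ever becomes empty.

In [ ]:
import numpy as np

def kmeans_sketch(X, k, n_iter = 100, seed = 0):
    # X: (m, n) array of examples; pick k examples as the initial centroids
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(X.shape[0], k, replace = False)]

    for _ in range(n_iter):
        # assignment step: index of the closest centroid for every example
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis = 2)   # (m, k) distances
        c = np.argmin(d, axis = 1)

        # update step: move each centroid to the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_mu = np.array([X[c == j].mean(axis = 0) for j in range(k)])

        # converged when the centroids no longer move
        if np.allclose(new_mu, mu):
            break
        mu = new_mu

    return c, mu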

2.3. K-means Optimization Point of View (optional)

  • $c_i$= index of cluster $(1,2,\cdots,k)$ to which example $x^{(i)}$ is currently assigned
  • $\mu_k$= cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$)
  • $\mu_{c_i}$= cluster centroid of cluster to which example $x^{(i)}$ has been assigned
  • Optimization objective:
$$ J(c_1,\cdots,c_m,\mu_1,\cdots,\mu_k) = \frac{1}{m}\sum\limits_{i=1}^m \lVert x^{(i)}-\mu_{c_i}\rVert^2$$

$$ \min\limits_{c_1,\cdots,c_m, \; \mu_1,\cdots,\mu_k} J(c_1,\cdots,c_m,\mu_1,\cdots,\mu_k)$$
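
  • Each assignment step and each update step can only decrease $J$, which is why the alternating iteration converges, although possibly to a local minimum

As a small illustration, the snippet below (the function and argument names are only illustrative) evaluates $J$ for given labels c (values 0 to $k-1$) and centroids mu:

In [ ]:
import numpy as np

def kmeans_objective(X, c, mu):
    # mean squared distance of every example to its assigned centroid
    return np.mean(np.sum(np.square(X - mu[c]), axis = 1))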

3. Python code

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# data generation
G0 = np.random.multivariate_normal([1, 1], np.eye(2), 100)
G1 = np.random.multivariate_normal([3, 5], np.eye(2), 100)
G2 = np.random.multivariate_normal([9, 9], np.eye(2), 100)

X = np.vstack([G0, G1, G2])
X = np.asmatrix(X)
print(X.shape)

plt.figure(figsize = (10, 8))
plt.plot(X[:,0], X[:,1], 'b.')
plt.axis('equal')
plt.show()
(300, 2)
In [3]:
# The number of clusters and data
k = 3
m = X.shape[0]

# randomly initialize mean points
mu = X[np.random.randint(0, m, k), :]
pre_mu = mu.copy()
print(mu)

plt.figure(figsize = (10, 8))
plt.plot(X[:,0], X[:,1], 'b.')
plt.plot(mu[:,0], mu[:,1], 'ko')
plt.axis('equal')
plt.show()
[[-0.05160412 -0.93811677]
 [ 3.15244106  5.78743665]
 [ 3.71475946  4.54270318]]
In [4]:
y = np.empty([m,1])

# Run K-means
for n_iter in range(500):
    # assignment step: label each point with the index of the closest centroid
    for i in range(m):
        d0 = np.linalg.norm(X[i,:] - mu[0,:], 2)
        d1 = np.linalg.norm(X[i,:] - mu[1,:], 2)
        d2 = np.linalg.norm(X[i,:] - mu[2,:], 2)

        y[i] = np.argmin([d0, d1, d2])

    # update step: move each centroid to the mean of its assigned points
    err = 0
    for i in range(k):
        mu[i,:] = np.mean(X[np.where(y == i)[0]], axis = 0)
        err += np.linalg.norm(pre_mu[i,:] - mu[i,:], 2)

    pre_mu = mu.copy()

    # stop once the centroids no longer move
    if err < 1e-10:
        print("Iteration:", n_iter)
        break
Iteration: 4
In [5]:
X0 = X[np.where(y==0)[0]]
X1 = X[np.where(y==1)[0]]
X2 = X[np.where(y==2)[0]]

plt.figure(figsize = (10, 8))
plt.plot(X0[:,0], X0[:,1], 'b.', label = 'C0')
plt.plot(X1[:,0], X1[:,1], 'g.', label = 'C1')
plt.plot(X2[:,0], X2[:,1], 'r.', label = 'C2')
plt.axis('equal')
plt.legend(fontsize = 12)
plt.show()
In [6]:
# use kmeans from the scikit-learn module

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters = 3, random_state = 0)
kmeans.fit(X)

plt.figure(figsize = (10,8))
plt.plot(X[kmeans.labels_ == 0,0],X[kmeans.labels_ == 0,1], 'b.', label = 'C0')
plt.plot(X[kmeans.labels_ == 1,0],X[kmeans.labels_ == 1,1], 'g.', label = 'C1')
plt.plot(X[kmeans.labels_ == 2,0],X[kmeans.labels_ == 2,1], 'r.', label = 'C2')
plt.axis('equal')
plt.legend(fontsize = 12)
plt.show()

4. Some Issues in K-means

4.1. K-means: Initialization issues

  • K-means is extremely sensitive to cluster center initialization
  • Bad initialization can lead to
    • Poor convergence speed
    • Bad overall clustering
  • Safeguarding measures:
    • Choose the first center as one of the examples, the second as the example farthest from the first, the third as the example farthest from the first two, and so on (a sketch follows this list)
    • Try multiple initializations and choose the best result

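As an illustration of the first safeguard, here is a hedged sketch of the farthest-point initialization heuristic. It is not taken from the original notebook and assumes X is a plain (m, n) NumPy array. Trying multiple initializations corresponds to the n_init parameter of scikit-learn's KMeans, whose default 'k-means++' scheme is a randomized relative of this heuristic.

In [ ]:
import numpy as np

def farthest_point_init(X, k, seed = 0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(X.shape[0])]]        # first center: a random example

    for _ in range(k - 1):
        # distance from every point to its nearest already-chosen center
        d = np.min([np.linalg.norm(X - c, axis = 1) for c in centers], axis = 0)
        centers.append(X[np.argmax(d)])            # next center: the farthest point

    return np.array(centers)
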
4.2. Choosing the Number of Clusters

  • Idea: stop when adding another cluster does not give a much better model of the data

  • One way to select $k$ for the K-means algorithm is to try different values of $k$, plot the K-means objective versus $k$, and look at the 'elbow-point' in the plot

In [7]:
# data generation
G0 = np.random.multivariate_normal([1, 1], np.eye(2), 100)
G1 = np.random.multivariate_normal([3, 5], np.eye(2), 100)
G2 = np.random.multivariate_normal([9, 9], np.eye(2), 100)

X = np.vstack([G0, G1, G2])
X = np.asmatrix(X)
In [8]:
cost = []
for i in range(1,11):
    kmeans = KMeans(n_clusters = i, random_state = 0).fit(X)
    # score() returns the negative of the K-means objective, so take its absolute value
    cost.append(abs(kmeans.score(X)))

plt.figure(figsize = (10,8))
plt.stem(range(1,11), cost)
plt.xticks(np.arange(11))
plt.xlim([0.5, 10.5])
plt.grid(alpha = 0.3)
plt.show()

4.3. K-means: Limitations

  • Makes hard assignments of points to clusters
    • A point either completely belongs to a cluster or does not belong at all
    • No notion of a soft assignment (i.e., probability of being assigned to each cluster)
    • Gaussian mixture model (we will study later) and Fuzzy K-means allow soft assignments
  • Sensitive to outlier examples (such examples can affect the mean by a lot)
    • The K-medians algorithm is a more robust alternative for data with outliers
  • Works well only for clusters that are round-shaped and of roughly equal size/density
  • Does badly if the clusters have non-convex shapes
    • Spectral clustering (we will study later) and Kernelized K-means can be alternatives
  • Non-convex/non-round-shaped clusters: standard K-means fails! (see the sketch below)


  • Clusters with different densities

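To make the non-convex limitation concrete, here is a small sketch (not part of the original notebook) that runs standard K-means on scikit-learn's two-moons dataset; the resulting partition cuts straight across the two crescents instead of recovering them.

In [ ]:
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

# non-convex, crescent-shaped clusters
X_moons, _ = make_moons(n_samples = 300, noise = 0.05, random_state = 0)
labels = KMeans(n_clusters = 2, random_state = 0).fit_predict(X_moons)

plt.figure(figsize = (10, 8))
plt.plot(X_moons[labels == 0, 0], X_moons[labels == 0, 1], 'b.', label = 'C0')
plt.plot(X_moons[labels == 1, 0], X_moons[labels == 1, 1], 'r.', label = 'C1')
plt.axis('equal')
plt.legend(fontsize = 12)
plt.show()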

5. Video Lectures

In [9]:
%%html
<center><iframe src="https://www.youtube.com/embed/gd8IhnWqjSo?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [10]:
%%html
<center><iframe src="https://www.youtube.com/embed//t-ectMGYe-4?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [11]:
%%html
<center><iframe src="https://www.youtube.com/embed/nBfPsNg6F_k?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [12]:
%%html
<center><iframe src="https://www.youtube.com/embed/8f4-TD8-rPA?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [13]:
%%html
<center><iframe src="https://www.youtube.com/embed/tk9LGtOYVL8?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
In [14]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')