Unsupervised Learning: K-means Clustering
1. Supervised vs. Unsupervised Learning¶
- Supervised: building a model from labeled data
- Unsupervised: clustering from unlabeled data
Supervised Learning
$$
\begin{array}{c}
\{x^{(1)},x^{(2)},\cdots,x^{(m)}\}\\
\{y^{(1)},y^{(2)},\cdots,y^{(m)}\}
\end{array}
\quad \Rightarrow \quad \text{Classification}
$$
Unsupervised Learning
Data clustering is an unsupervised learning problem
Given:
$m$ unlabeled examples $\{x^{(1)},x^{(2)},\cdots,x^{(m)}\}$
the number of partitions $k$
Goal: group the examples into $k$ partitions
$$\{x^{(1)},x^{(2)},\cdots,x^{(m)}\} \quad \Rightarrow \quad \text{Clustering}$$
The only information clustering uses is the similarity between examples
Clustering groups examples based on their mutual similarities
A good clustering is one that achieves:
- high within-cluster similarity
- low inter-cluster similarity
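As a concrete way to quantify these two criteria with Euclidean distances, one can compare the average within-cluster scatter against the spread of the cluster means. The helper below is only an illustrative sketch (the function name and the particular measures are assumptions, not part of these notes); it assumes the labels are integers 0 to $k-1$:
import numpy as np

def within_between(X, labels):
    # cluster means, indexed 0..k-1
    mus = np.array([X[labels == j].mean(axis = 0) for j in np.unique(labels)])
    # within: average squared distance of each point to its own cluster mean (want this small)
    within = np.mean(np.sum((X - mus[labels]) ** 2, axis = 1))
    # between: average squared distance of the cluster means to the overall mean (want this large)
    between = np.mean(np.sum((mus - X.mean(axis = 0)) ** 2, axis = 1))
    return within, between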
It is a "chicken and egg" problem (dilemma)
- Q: if we knew $c_i$s, how would we determine which points to associate with each cluster center?
- A: for each point $x^{(i)}$, choose closest $c_i$
- Q: if we knew the cluster memberships, how do we get the centers?
- A: choose $c_i$ to be the mean of all points in the cluster
2. K-means¶
2.1. (Iterative) Algorithm¶
1) Initialization
Input:
- $k$: the number of clusters
- Training set $\{x^{(1)},x^{(2)},\cdots,x^{(m)}\}$
The $k$ cluster centroids are randomly initialized anywhere in $\mathbb{R}^n$
2) Iteration
$$ \begin{align*} c_k &= \{\, i : k = \arg \min_j \, \lVert x^{(i)} - \mu_j \rVert^2 \,\}\\ \mu_k &= \frac{1}{\lvert c_k \rvert} \sum_{i \in c_k}x^{(i)} \end{align*} $$
where $c_k$ denotes the set of indices of the examples currently assigned to cluster $k$
Repeat until convergence
- A possible convergence criterion: the cluster centers no longer change
3) Output
Output: the learned model
- $c$ (labels): $\{c_1,c_2,\cdots,c_m\}$, where $c_i$ is the index (1 to $k$) of the cluster centroid closest to $x^{(i)}$
- $\mu$ (centroids): $\{\mu_1,\mu_2,\cdots,\mu_k\}$, where $\mu_k$ is the mean of the points assigned to cluster $k$
2.2. Summary: K-means Algorithm¶
$ \,\text{Randomly initialize } k \,\text{cluster centroids } \mu_1,\mu_2,\cdots,\mu_k \in \mathbb{R}^n$
$ \begin{align*} \text{Repeat}\;&\{ \\ &\text{for $i=1$ to $m$} \\ &\quad \text{$c_i$ := index (from 1 to $k$) of the cluster centroid closest to $x^{(i)}$} \\ &\text{for $j=1$ to $k$} \\ &\quad \text{$\mu_j$ := average (mean) of the points assigned to cluster $j$} \\ &\} \end{align*} $
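The summary above translates almost line by line into NumPy. The following is a minimal vectorized sketch (the function name, the random initialization via default_rng, and the convergence test are illustrative choices, not the implementation used in Section 3; empty clusters are not handled):
import numpy as np

def kmeans_sketch(X, k, n_iter = 100, seed = 0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(X.shape[0], k, replace = False)]      # initialize centroids at k random examples
    for _ in range(n_iter):
        # assignment step: c[i] = index of the centroid closest to x^(i)
        dist = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis = 2)
        c = np.argmin(dist, axis = 1)
        # update step: move each centroid to the mean of its assigned points
        new_mu = np.array([X[c == j].mean(axis = 0) for j in range(k)])
        if np.allclose(new_mu, mu):                          # converged: centroids no longer move
            break
        mu = new_mu
    return c, mu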
2.3. K-means Optimization Point of View (optional)¶
$c_i$= index of cluster $(1,2,\cdots,k)$ to which example $x^{(i)}$ is currently assigned
$\mu_k$= cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$)
$\mu_{c_i}$= cluster centroid of cluster to which example $x^{(i)}$ has been assigned
Optimization objective:
$$ J(c_1,\cdots,c_m,\mu_1,\cdots,\mu_k) = \frac{1}{m}\sum\limits_{i=1}^m \lVert x^{(i)}-\mu_{c_i}\rVert^2$$
$$ \min\limits_{c_1,\cdots,c_m, \; \mu_1,\cdots,\mu_k} J(c_1,\cdots,c_m,\mu_1,\cdots,\mu_k)$$
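Given assignments and centroids, the objective $J$ (often called the distortion) can be evaluated directly. A small sketch, assuming X is an (m, n) array, c an integer label vector, and mu a (k, n) array of centroids (the function name is illustrative):
import numpy as np

def kmeans_objective(X, c, mu):
    # J = (1/m) * sum_i || x^(i) - mu_{c_i} ||^2
    return np.mean(np.sum((X - mu[c]) ** 2, axis = 1))
Neither the assignment step nor the update step can increase $J$, which is why the alternating iteration converges, although in general only to a local minimum.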
3. Python code¶
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# data generation
G0 = np.random.multivariate_normal([1, 1], np.eye(2), 100)
G1 = np.random.multivariate_normal([3, 5], np.eye(2), 100)
G2 = np.random.multivariate_normal([9, 9], np.eye(2), 100)
X = np.vstack([G0, G1, G2])
X = np.asmatrix(X)
print(X.shape)
plt.figure(figsize = (6, 4))
plt.plot(X[:,0], X[:,1], 'b.')
plt.axis('equal')
plt.show()
# The number of clusters and data
k = 3
m = X.shape[0]
# randomly initialize the k centroids at k random data points
mu = X[np.random.randint(0, m, k), :]
pre_mu = mu.copy()
print(mu)
plt.figure(figsize = (6, 4))
plt.plot(X[:,0], X[:,1], 'b.')
plt.plot(mu[:,0], mu[:,1], 'ko')
plt.axis('equal')
plt.show()
y = np.empty([m,1])
# Run K-means
for n_iter in range(500):
    # assignment step: label each point with the index of the closest centroid
    for i in range(m):
        d0 = np.linalg.norm(X[i,:] - mu[0,:], 2)
        d1 = np.linalg.norm(X[i,:] - mu[1,:], 2)
        d2 = np.linalg.norm(X[i,:] - mu[2,:], 2)
        y[i] = np.argmin([d0, d1, d2])

    # update step: move each centroid to the mean of its assigned points
    err = 0
    for i in range(k):
        mu[i,:] = np.mean(X[np.where(y == i)[0]], axis = 0)
        err += np.linalg.norm(pre_mu[i,:] - mu[i,:], 2)

    pre_mu = mu.copy()

    # stop when the centroids no longer move
    if err < 1e-10:
        print("Iteration:", n_iter)
        break
X0 = X[np.where(y==0)[0]]
X1 = X[np.where(y==1)[0]]
X2 = X[np.where(y==2)[0]]
plt.figure(figsize = (6, 4))
plt.plot(X0[:,0], X0[:,1], 'b.', label = 'C0')
plt.plot(X1[:,0], X1[:,1], 'g.', label = 'C1')
plt.plot(X2[:,0], X2[:,1], 'r.', label = 'C2')
plt.axis('equal')
plt.legend(fontsize = 12)
plt.show()
# use kmeans from the scikit-learn module
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 3, n_init = 10, random_state = 0)
kmeans.fit(np.array(X))
plt.figure(figsize = (6, 4))
plt.plot(X[kmeans.labels_ == 0,0],X[kmeans.labels_ == 0,1], 'b.', label = 'C0')
plt.plot(X[kmeans.labels_ == 1,0],X[kmeans.labels_ == 1,1], 'g.', label = 'C1')
plt.plot(X[kmeans.labels_ == 2,0],X[kmeans.labels_ == 2,1], 'r.', label = 'C2')
plt.axis('equal')
plt.legend(fontsize = 12)
plt.show()
4. Some Issues in K-means¶
4.1. K-means: Initialization issues¶
K-means is extremely sensitive to cluster center initialization
Bad initialization can lead to
- Poor convergence speed
- Bad overall clustering
Safeguarding measures:
- Choose the first center to be one of the examples, the second to be the example farthest from the first, the third to be the example farthest from both, and so on (a sketch of this heuristic follows this list)
- Try multiple initializations and choose the best result
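A sketch of the farthest-point heuristic from the first bullet above (the function name and arguments are hypothetical, not part of these notes). In practice, scikit-learn's KMeans already guards against bad initialization through its default init = 'k-means++' seeding and the n_init restarts used in the code above.
import numpy as np

def farthest_point_init(X, k, seed = 0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(X.shape[0])]]          # first center: a randomly chosen example
    for _ in range(k - 1):
        # squared distance from every point to its nearest already-chosen center
        dist = np.min([np.sum((X - c) ** 2, axis = 1) for c in centers], axis = 0)
        centers.append(X[np.argmax(dist)])           # next center: the point farthest from all chosen centers
    return np.array(centers)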
4.2. Choosing the Number of Clusters¶
Idea: increase $k$ until adding another cluster no longer gives a much better model of the data
One way to select $k$ for the K-means algorithm is to try different values of $k$, plot the K-means objective versus $k$, and look for the 'elbow point' in the plot
# data generation
G0 = np.random.multivariate_normal([1, 1], np.eye(2), 100)
G1 = np.random.multivariate_normal([3, 5], np.eye(2), 100)
G2 = np.random.multivariate_normal([9, 9], np.eye(2), 100)
X = np.vstack([G0, G1, G2])
cost = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, n_init = 10, random_state = 0).fit(X)
    # kmeans.score returns the negative of the K-means objective, so take the absolute value
    cost.append(abs(kmeans.score(X)))
plt.figure(figsize = (6, 4))
plt.stem(range(1,11), cost)
plt.xticks(np.arange(11))
plt.xlim([0.5, 10.5])
plt.grid(alpha = 0.3)
plt.show()
4.3. K-means: Limitations¶
K-means makes hard assignments of points to clusters
- A point either belongs completely to a cluster or does not belong at all
- There is no notion of a soft assignment (i.e., a probability of belonging to each cluster)
- Gaussian mixture models (we will study later) and Fuzzy K-means allow soft assignments
Sensitive to outlier examples (such examples can shift a cluster mean considerably)
- The K-medians algorithm is a more robust alternative in the presence of outliers
Works well only for clusters that are round-shaped and of roughly equal size/density
Performs badly if the clusters have non-convex shapes
- Spectral clustering (we will study later) and kernelized K-means are possible alternatives
Non-convex / non-round-shaped clusters: standard K-means fails (see the sketch below)
- Clusters with different densities also cause problems
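To see the non-convex failure mode concretely, the sketch below (an illustration assuming scikit-learn's make_moons, KMeans, and SpectralClustering; not part of the original notes) clusters two interleaved half-moons with K-means and with spectral clustering:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons

# two interleaved, non-convex clusters
X_moons, _ = make_moons(n_samples = 400, noise = 0.05, random_state = 0)

# K-means separates the points with an (inappropriate) linear boundary;
# spectral clustering recovers the two moons
labels_km = KMeans(n_clusters = 2, n_init = 10, random_state = 0).fit_predict(X_moons)
labels_sc = SpectralClustering(n_clusters = 2, affinity = 'nearest_neighbors', random_state = 0).fit_predict(X_moons)

fig, axes = plt.subplots(1, 2, figsize = (8, 3))
axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c = labels_km, s = 10)
axes[0].set_title('K-means')
axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c = labels_sc, s = 10)
axes[1].set_title('Spectral clustering')
plt.show()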
from google.colab import drive
drive.mount('/content/drive')
import cv2
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
img = cv2.imread('/content/drive/MyDrive/ML_Colab/ML_data/bird.bmp')
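# note: cv2.imread loads pixels in BGR order, so colors may appear swapped when shown with plt.imshow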
print(img.shape)
plt.figure(figsize = (6, 6))
plt.imshow(img)
plt.axis('off')
plt.show()
from sklearn.cluster import KMeans
m = img.shape[0]
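# reshape the image into an (m*m) x 3 matrix of pixel values (this assumes a square m x m image)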
X = img.reshape(m*m, 3)
k = 6
kmeans = KMeans(n_clusters = k, n_init = 10, random_state = 0)
kmeans.fit(X)
kmeans.cluster_centers_
kmeans.labels_
# replace every pixel with the centroid (representative color) of its cluster
X_clustered = np.zeros([m*m, 3])
for i in range(k):
    X_clustered[np.where(kmeans.labels_ == i)] = kmeans.cluster_centers_[i]
plt.figure(figsize = (6, 6))
plt.imshow(X_clustered.reshape(m, m, 3).astype('uint8'))
plt.axis('off')
plt.show()