K-Nearest Neighbor (KNN) and Decision Tree

1. Supervised Learning¶

• Given training set $\left\{ \left(x^{(1)}, y^{(1)}\right), \left(x^{(2)}, y^{(2)}\right),\cdots,\left(x^{(m)}, y^{(m)}\right) \right\}$

• Want to find a function $f_{\omega}$ with learning parameter, $\omega$
• $f_{\omega}$ desired to be as close as possible to $y$ for future $(x,y)$
• $i.e., f_{\omega}(x) \approx y$

• Define a loss function

$$\ell \left(f_{\omega} \left(x^{(i)}\right), y^{(i)}\right)$$

• Solve the following optimization problem:

\begin{align*} \text{minimize} &\quad \frac{1}{m} \sum_{i=1}^{m} \ell \left(f_{\omega} \left(x^{(i)}\right), y^{(i)}\right)\\ \text{subject to} &\quad \omega \in \boldsymbol{\omega} \end{align*}

• Function approximation between inputs and outputs

• Once it is learned,

2. K-Nearest Neighbor (KNN)¶

2.1. K-Nearest Neighbor (KNN) Regression¶

Non-parametric method

We write our model as

$$y = f(x) + \varepsilon$$

where $\varepsilon$ captures measurement errors and other discrepancies.

Then, with a good $f$ we can make predictions of $y$ at new points $x_{\text{new}}$ . One possible way so called "nearest neighbor method" is:

$$\hat y = \text{avg} \left(y \mid x \in \mathcal{N}(x_{\text{new}}) \right)$$

where $\mathcal{N}(x)$ is some neighborhood of $x$

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

N = 100
w1 = 0.5
w0 = 2
x = np.random.normal(0, 15, N).reshape(-1,1)
y = w1*x + w0 + 5*np.random.normal(0, 1, N).reshape(-1,1)

plt.figure(figsize = (6, 4))
plt.title('Data Set', fontsize = 12)
plt.plot(x, y, '.', label = 'Data')
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.legend()
plt.axis('equal')
plt.axis([-40, 40, -30, 30])
plt.grid(alpha = 0.3)
plt.show()

In [ ]:
from sklearn import neighbors

reg = neighbors.KNeighborsRegressor(n_neighbors = 1)
reg.fit(x, y)

Out[ ]:
KNeighborsRegressor(n_neighbors=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
In [ ]:
x_new = np.array([[5]])
pred = reg.predict(x_new)[0,0]
print(pred)

2.3727550981413934

In [ ]:
xp = np.linspace(-30, 30, 100).reshape(-1,1)
yp = reg.predict(xp)

plt.figure(figsize = (6, 4))
plt.title('k-Nearest Neighbor Regression', fontsize = 12)
plt.plot(x, y, '.', label = 'Original Data')
plt.plot(xp, yp, label = 'kNN')
plt.plot(x_new, pred, 'o', label = 'Prediction')
plt.plot([x_new[0,0], x_new[0,0]], [-30, pred], 'k--', alpha = 0.5)
plt.plot([-40, x_new[0,0]], [pred, pred], 'k--', alpha = 0.5)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.legend()
plt.axis('equal')
plt.axis([-40, 40, -30, 30])
plt.grid(alpha = 0.3)
plt.show()

In [ ]:
reg = neighbors.KNeighborsRegressor(n_neighbors = 21)
reg.fit(x, y)

xp = np.linspace(-30, 30, 100).reshape(-1,1)
yp = reg.predict(xp)

plt.figure(figsize = (6, 4))
plt.title('k-Nearest Neighbor Regression', fontsize = 12)
plt.plot(x, y, '.', label='Original Data')
plt.plot(xp, yp, label = 'Regression Result')
plt.plot(x_new, pred, 'o', label='Prediction')
plt.plot([x_new[0,0], x_new[0,0]], [-30, pred], 'k--', alpha = 0.5)
plt.plot([-40, x_new[0,0]], [pred, pred], 'k--', alpha = 0.5)
plt.xlabel('X', fontsize = 12)
plt.ylabel('Y', fontsize = 12)
plt.legend()
plt.axis('equal')
plt.axis([-40, 40, -30, 30])
plt.grid(alpha = 0.3)
plt.show()


2.2. K-Nearest Neighbor (KNN) Classification¶

• Non-parametric method

• In k-NN classification, an object is assigned to the class most common among its $k$ nearest neighbors ($k$ is a positive integer, typically small).

• If $k = 1$, then the object is simply assigned to the class of that single nearest neighbor.

• Zoom in,
In [ ]:
m = 1000
X = -1.5 + 3*np.random.uniform(size = (m,2))

y = np.zeros([m,1])
for i in range(m):
if np.linalg.norm(X[i,:],2) <= 1:
y[i] = 1

C1 = np.where(y == 1)[0]
C0 = np.where(y == 0)[0]

theta = np.linspace(0, 2*np.pi, 100)

plt.figure(figsize = (6, 6))
plt.plot(X[C1,0], X[C1,1], 'o', label = 'C1', markerfacecolor = "k", markeredgecolor = 'k', markersize = 4)
plt.plot(X[C0,0], X[C0,1], 'o', label = 'C0', markerfacecolor = "None", alpha = 0.3, markeredgecolor = 'k', markersize = 4)
plt.plot(np.cos(theta), np.sin(theta), '--', color = 'orange')
plt.axis([-1.5, 1.5, -1.5, 1.5])
plt.axis('equal')
plt.axis('off')
plt.show()

In [ ]:
from sklearn import neighbors

clf = neighbors.KNeighborsClassifier(n_neighbors = 1)
clf.fit(X, np.ravel(y))

Out[ ]:
KNeighborsClassifier(n_neighbors=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
In [ ]:
X_new = np.array([1,1]).reshape(1,-1)
result = clf.predict(X_new)[0]
print(result)

0.0

In [ ]:
res = 0.01
[X1gr, X2gr] = np.meshgrid(np.arange(-1.5, 1.5, res), np.arange(-1.5, 1.5, res))

Xp = np.hstack([X1gr.reshape(-1,1), X2gr.reshape(-1,1)])

inC1 = clf.predict(Xp).reshape(-1,1)
inCircle = np.where(inC1 == 1)[0]

plt.figure(figsize = (6, 6))
plt.plot(X[C1,0], X[C1,1], 'o', label = 'C1', markerfacecolor = "k", alpha = 0.5, markeredgecolor = 'k', markersize = 4)
plt.plot(X[C0,0], X[C0,1], 'o', label = 'C0', markerfacecolor = "None", alpha = 0.3, markeredgecolor='k', markersize = 4)
plt.plot(np.cos(theta), np.sin(theta), '--', color = 'orange')
plt.plot(Xp[inCircle][:,0], Xp[inCircle][:,1], 's', alpha = 0.5, color = 'r', markersize = 1)
plt.axis([-1.5, 1.5, -1.5, 1.5])
plt.axis('equal')
plt.axis('off')
plt.show()


When outliers exist

In [ ]:
m = 1000
X = -1.5 + 3*np.random.uniform(size = (m,2))

y = np.zeros([m,1])
for i in range(m):
if np.linalg.norm(X[i,:], 2) <= 1:
if np.random.uniform() < 0.05:
y[i] = 0
else:
y[i] = 1
else:
if np.random.uniform() < 0.05:
y[i] = 1
else:
y[i] = 0

C1 = np.where(y == 1)[0]
C0 = np.where(y == 0)[0]

theta = np.linspace(0, 2*np.pi, 100)

plt.figure(figsize = (6, 6))
plt.plot(X[C1,0], X[C1,1], 'o', label = 'C1', markerfacecolor = "k", markeredgecolor = 'k', markersize = 4)
plt.plot(X[C0,0], X[C0,1], 'o', label = 'C0', markerfacecolor = "None", alpha = 0.3, markeredgecolor = 'k', markersize = 4)
plt.plot(np.cos(theta), np.sin(theta), '--', color = 'orange')
plt.axis([-1.5, 1.5, -1.5, 1.5])
plt.axis('equal')
plt.axis('off')
plt.show()


$k = 1$

In [ ]:
clf = neighbors.KNeighborsClassifier(n_neighbors = 1)
clf.fit(X, np.ravel(y))

res = 0.01
[X1gr, X2gr] = np.meshgrid(np.arange(-1.5, 1.5, res), np.arange(-1.5, 1.5, res))

Xp = np.hstack([X1gr.reshape(-1,1), X2gr.reshape(-1,1)])

inC1 = clf.predict(Xp).reshape(-1,1)
inCircle = np.where(inC1 == 1)[0]

plt.figure(figsize = (6, 6))
plt.plot(X[C1,0], X[C1,1], 'o', label = 'C1', markerfacecolor = "k", alpha = 0.5, markeredgecolor = 'k', markersize = 4)
plt.plot(X[C0,0], X[C0,1], 'o', label = 'C0', markerfacecolor = "None", alpha = 0.3, markeredgecolor='k', markersize = 4)
plt.plot(np.cos(theta), np.sin(theta), '--', color = 'orange')
plt.plot(Xp[inCircle][:,0], Xp[inCircle][:,1], 's', alpha = 0.5, color = 'r', markersize = 1)
plt.axis([-1.5, 1.5, -1.5, 1.5])
plt.axis('equal')
plt.axis('off')
plt.show()


$k = 11$

In [ ]:
clf = neighbors.KNeighborsClassifier(n_neighbors = 11)
clf.fit(X, np.ravel(y))

res = 0.01
[X1gr, X2gr] = np.meshgrid(np.arange(-1.5, 1.5, res), np.arange(-1.5, 1.5, res))

Xp = np.hstack([X1gr.reshape(-1,1), X2gr.reshape(-1,1)])

inC1 = clf.predict(Xp).reshape(-1,1)
inCircle = np.where(inC1 == 1)[0]

plt.figure(figsize = (6, 6))
plt.plot(Xp[inCircle][:,0], Xp[inCircle][:,1], 's', alpha = 0.5, color = 'r', markersize = 1)
plt.plot(X[C1,0], X[C1,1], 'o', label = 'C1', markerfacecolor = "k", alpha = 0.5, markeredgecolor = 'k', markersize = 4)
plt.plot(X[C0,0], X[C0,1], 'o', label = 'C0', markerfacecolor = "None", alpha = 0.3, markeredgecolor='k', markersize = 4)
plt.plot(np.cos(theta), np.sin(theta), '--', color = 'orange')
# plt.legend(fontsize = 12)
plt.axis([-1.5, 1.5, -1.5, 1.5])
plt.axis('equal')
plt.axis('off')
plt.show()