Machine Learning


By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

Table of Contents

1. Linear Regression

Consider a linear regression problem.


  • $\text{Given} \; \begin{cases} x_{i} \; \text{: inputs} \\ y_{i} \; \text{: outputs} \end{cases}$ , Find $\theta_{0}$ and $\theta_{1}$


$$x= \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{bmatrix}, \qquad y= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix}, \qquad y_{i} \approx \hat{y}_{i} = \theta_{0} + \theta_{1}x_{i}$$


  • $ \hat{y}_{i} $ : predicted output


  • $ \theta = \begin{bmatrix} \theta_{0} \\ \theta_{1} \\ \end{bmatrix} $ : Model parameters


$$ \hat{y}_{i} = f(x_{i}\,; \theta) \; \text{ in general}$$


  • in many cases, a linear model is used to predict $y_{i}$


$$ \hat{y}_{i} = \theta_{0} + \theta_{1}x_{i} \; \quad \text{ such that }\quad \min\limits_{\theta_{0}, \theta_{1}}\sum\limits_{i = 1}^{m} (\hat{y}_{i} - y_{i})^2$$




1.1. Re-cast the Problem as Least Squares

  • For convenience, we define a function that maps inputs to feature vectors, $\phi$


$$\begin{array}{Icr}\begin{align*} \hat{y}_{i} & = \theta_0 + x_i \theta_1 = 1 \cdot \theta_0 + x_i \theta_1 \\ \\ & = \begin{bmatrix}1 & x_{i}\end{bmatrix}\begin{bmatrix}\theta_{0} \\ \theta_{1}\end{bmatrix} \\\\ & =\begin{bmatrix}1 \\ x_{i} \end{bmatrix}^{T}\begin{bmatrix}\theta_{0} \\ \theta_{1}\end{bmatrix} \\\\ & =\phi^{T}(x_{i})\theta \end{align*}\end{array} \begin{array}{Icr} \quad \quad \text{feature vector} \; \phi(x_{i}) = \begin{bmatrix}1 \\ x_{i}\end{bmatrix} \end{array}$$


$$\Phi = \begin{bmatrix}1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\1 & x_{m} \end{bmatrix}=\begin{bmatrix}\phi^T(x_{1}) \\\phi^T(x_{2}) \\\vdots \\\phi^T(x_{m}) \end{bmatrix} \quad \implies \quad \hat{y} = \begin{bmatrix}\hat{y}_{1} \\\hat{y}_{2} \\\vdots \\\hat{y}_{m}\end{bmatrix}=\Phi\theta$$


  • Optimization problem


$$\min\limits_{\theta_{0}, \theta_{1}}\sum\limits_{i = 1}^{m} (\hat{y}_{i} - y_{i})^2 =\min\limits_{\theta}\lVert\Phi\theta-y\rVert^2_2 \qquad \qquad \left(\text{same as} \; \min_{x} \lVert Ax-b \rVert_2^2 \right)$$


$$ \text{solution} \; \theta^* = (\Phi^{T}\Phi)^{-1}\Phi^{T} y $$
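
As a quick sanity check, the same solution can be obtained with NumPy's least-squares solver, which avoids forming $(\Phi^{T}\Phi)^{-1}$ explicitly. A minimal sketch on made-up toy data (not the notebook's data cell below):

import numpy as np

# a minimal sketch: solve min_theta ||Phi theta - y||_2^2 with np.linalg.lstsq
# (toy inputs/outputs assumed for illustration only)
x = np.array([0.1, 0.4, 0.7, 1.2, 1.3]).reshape(-1, 1)
y = np.array([0.5, 0.9, 1.1, 1.5, 1.5]).reshape(-1, 1)

Phi = np.hstack([np.ones_like(x), x])               # rows are phi(x_i)^T = [1, x_i]
theta, *_ = np.linalg.lstsq(Phi, y, rcond = None)   # returns [theta_0, theta_1]
print(theta)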

1.2. Solve using Linear Algebra

  • known as least squares


$$ \theta = (A^TA)^{-1}A^T y $$
In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# data points in column vector [input, output]
x = np.array([0.1, 0.4, 0.7, 1.2, 1.3, 1.7, 2.2, 2.8, 3.0, 4.0, 4.3, 4.4, 4.9]).reshape(-1, 1)
y = np.array([0.5, 0.9, 1.1, 1.5, 1.5, 2.0, 2.2, 2.8, 2.7, 3.0, 3.5, 3.7, 3.9]).reshape(-1, 1)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko')
plt.title('Data', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()
In [3]:
m = y.shape[0]
#A = np.hstack([np.ones([m, 1]), x])
A = np.hstack([x**0, x])
A = np.asmatrix(A)

theta = (A.T*A).I*A.T*y

print('theta:\n', theta)
theta:
 [[0.65306531]
 [0.67129519]]
In [4]:
# to plot
plt.figure(figsize = (6, 4))
plt.title('Regression', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.plot(x, y, 'ko', label = "data")

# to plot a straight line (fitted line)
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp

plt.plot(xp, yp, 'r', linewidth = 2, label = "regression")
plt.legend(fontsize = 15)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()

1.3. Scikit-Learn




  • Machine Learning in Python
  • Simple and efficient tools for data mining and data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
  • Open source, commercially usable - BSD license



In [5]:
from sklearn import linear_model
In [6]:
reg = linear_model.LinearRegression()
reg.fit(x, y)
Out[6]:
LinearRegression()
In [7]:
reg.coef_
Out[7]:
array([[0.67129519]])
In [8]:
reg.intercept_
Out[8]:
array([0.65306531])
In [9]:
# to plot
plt.figure(figsize = (6, 4))
plt.title('Regression', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.plot(x, y, 'ko', label = "data")

# to plot a straight line (fitted line)
plt.plot(xp, reg.predict(xp), 'r', linewidth = 2, label = "regression")
plt.legend(fontsize = 15)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()

2. Classification: Perceptron

2.1. Classification

  • where $y$ is a discrete value
    • develop the classification algorithm to determine which class a new input should fall into
  • start with binary class problems
    • Later, we will look at the multiclass classification problem, although this is just an extension of binary classification
  • We could use linear regression
    • Then, threshold the classifier output (i.e. anything over some value is yes, else no)
    • linear regression with thresholding seems to work
  • We will learn
    • perceptron
    • logistic regression

2.2. Perceptron

  • For input $x = \begin{bmatrix}x_1\\ \vdots\\ x_d \end{bmatrix}\;$ 'attributes of a customer'
  • weights $\omega = \begin{bmatrix}\omega_1\\ \vdots\\ \omega_d \end{bmatrix}$


$$\begin{align*} \text{Approve credit if} \; & \sum\limits_{i=1}^{d}\omega_ix_i > \text{threshold}, \\ \text{Deny credit if} \; & \sum\limits_{i=1}^{d}\omega_ix_i < \text{threshold}. \end{align*}$$


$$h(x) = \text{sign} \left(\left( \sum\limits_{i=1}^{d}\omega_ix_i \right)- \text{threshold} \right) = \text{sign}\left(\left( \sum\limits_{i=1}^{d}\omega_ix_i \right)+ \omega_0\right)$$



  • Introduce an artificial coordinate $x_0 = 1$:


$$h(x) = \text{sign}\left( \sum\limits_{i=0}^{d}\omega_ix_i \right)$$


  • In a vector form, the perceptron implements


$$h(x) = \text{sign}\left( \omega^T x \right)$$


  • sign function


$$ \text{sgn}(x) = \begin{cases} 1, &\text{if }\; x > 0\\ 0, &\text{if }\; x = 0\\ -1, &\text{if }\; x < 0 \end{cases} $$
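
A small sketch of this hypothesis in NumPy, with made-up weights and the artificial coordinate $x_0 = 1$ prepended:

import numpy as np

# perceptron hypothesis h(x) = sign(w^T x); weights are made up for illustration
w = np.array([-1.0, 2.0, 1.0])          # [w_0, w_1, w_2]

def h(x1, x2):
    x = np.array([1.0, x1, x2])         # prepend x_0 = 1
    return np.sign(w @ x)               # +1, 0, or -1

print(h(3.0, 1.0))   # w^T x =  6 -> +1
print(h(0.0, 0.0))   # w^T x = -1 -> -1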




  • Hyperplane

    • Separates a $d$-dimensional space into two half-spaces
    • Defined by an outward pointing normal vector $\omega$
    • $\omega$ is orthogonal to any vector lying on the hyperplane
    • Assume the hyperplane passes through origin, $\omega^T x = 0$ with $x_0 = 1$




  • Sign with respect to a line


$$ \begin{align*} \omega = \begin{bmatrix}\omega_1 \\ \omega_2 \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2\end{bmatrix} &\implies g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = \omega_0 + \omega^T x\\ \omega = \begin{bmatrix}\omega_0 \\ \omega_1 \\ \omega_2 \end{bmatrix}, \quad x = \begin{bmatrix} 1 \\ x_1 \\ x_2\end{bmatrix} &\implies g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = \omega^T x \end{align*} $$




  • Goal: to learn the hyperplane $g_{\omega}(x)=0$ using the training data
  • How to find $\omega$

    • All data in class 1 $$g(x) > 0$$
    • All data in class 0 $$g(x) < 0$$

2.2.1. Perceptron Algorithm

The perceptron implements


$$h(x) = \text{sign}\left( \omega^Tx \right)$$


Given the training set


$$(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N) \quad \text{where } y_i \in \{-1,1\}$$


1) pick a misclassified point


$$ \text{sign}\left(\omega^Tx_n \right) \neq y_n$$


2) and update the weight vector


$$\omega \leftarrow \omega + y_nx_n$$





Why do perceptron updates work?

  • Let's look at a misclassified positive example ($y_n = +1$)

    • perceptron (wrongly) thinks $\omega_{old}^T x_n < 0$
  • updates would be


$$ \begin{align*}\omega_{new} &= \omega_{old} + y_n x_n = \omega_{old} + x_n \\ \\ \omega_{new}^T x_n &= (\omega_{old} + x_n)^T x_n = \omega_{old}^T x_n + x_n^T x_n \end{align*}$$


  • Thus $\omega_{new}^T x_n$ is less negative than $\omega_{old}^T x_n$ since $x_n^T x_n > 0$, i.e., the update moves the prediction toward the correct sign
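
A tiny numeric illustration of this update (numbers made up):

import numpy as np

# one PLA update on a misclassified positive example (made-up numbers)
w_old = np.array([1.0, -2.0])
x_n   = np.array([1.0,  1.5])          # includes the artificial coordinate x_0 = 1
y_n   = 1

print(w_old @ x_n)                     # -2.0: wrongly classified as negative

w_new = w_old + y_n*x_n                # perceptron update
print(w_new @ x_n)                     # -2.0 + ||x_n||^2 = 1.25: now on the correct side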

2.2.2. Iterations of Perceptron

  1. Randomly assign $\omega$

  2. One iteration of the PLA (perceptron learning algorithm) $$\omega \leftarrow \omega + yx$$ where $(x, y)$ is a misclassified training point

  3. At iteration $t = 1, 2, 3, \cdots,$ pick a misclassified point from $$(x_1,y_1),(x_2,y_2),\cdots,(x_N, y_N)$$

  4. and run a PLA iteration on it

  5. That's it!




Summary





2.2.3. Perceptron Loss Function


$$ \mathscr{L}(\omega) = \sum_{n =1}^{m} \max \left\{ 0, -y_n \cdot \left(\omega^T x_n \right)\right\} $$


  • $\text{Loss} = 0$ on examples where perceptron is correct, i.e., $y_n \cdot \left(\omega^T x_n \right) > 0$


  • $\text{Loss} > 0$ on examples where perceptron misclassified, i.e., $y_n \cdot \left(\omega^T x_n \right) < 0$


Note: $\text{sign}\left(\omega^T x_n \right) \neq y_n$ is equivalent to $ y_n \cdot \left(\omega^T x_n \right) < 0$
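
A minimal NumPy sketch of this loss (data and weights made up; rows of X include $x_0 = 1$):

import numpy as np

# perceptron loss L(w) = sum_n max{0, -y_n * (w^T x_n)}
def perceptron_loss(w, X, y):
    margins = y*(X @ w)                         # y_n * w^T x_n for every sample
    return np.sum(np.maximum(0.0, -margins))    # 0 on correct samples, > 0 on mistakes

X = np.array([[1.0, 2.0,  1.0],
              [1.0, 0.5, -1.0]])
y = np.array([1.0, 1.0])
w = np.array([-1.0, 1.0, 1.0])

print(perceptron_loss(w, X, y))   # first sample correct (0), second misclassified (1.5)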


2.3. Perceptron in Python


$$g(x) = \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0$$



$$ \begin{align*} \omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}\\ \\ x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)}\\\vdots & \vdots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)}\end{bmatrix} \\ \\ y &= \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$
In [10]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [11]:
# training data generation
m = 100
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4

g = 0.8*x1 + x2 - 3
In [12]:
C1 = np.where(g >= 1)
C0 = np.where(g < -1)
print(C1)
(array([ 3,  4, 10, 11, 13, 16, 18, 21, 23, 27, 28, 33, 34, 37, 39, 40, 45,
       61, 64, 67, 69, 74, 77, 82, 83, 85, 87, 90, 96]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0]))
In [13]:
C1 = np.where(g >= 1)[0]
C0 = np.where(g < -1)[0]
print(C1.shape)
print(C0.shape)
(29,)
(40,)
In [14]:
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.title('Linearly Separable Classes', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.show()
$$ \begin{align*} x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)}\\\vdots & \vdots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)}\end{bmatrix} \\ y &= \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$
In [15]:
X1 = np.hstack([np.ones([C1.shape[0],1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([C0.shape[0],1]), x1[C0], x2[C0]])
X = np.vstack([X1, X0])

y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])

X = np.asmatrix(X)
y = np.asmatrix(y)
$$\omega = \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}$$


$$\omega \leftarrow \omega + yx$$ where $(x, y)$ is a misclassified training point

In [16]:
w = np.ones([3,1])
w = np.asmatrix(w)

n_iter = y.shape[0]
flag = 0

while flag == 0:
    flag = 1
    for i in range(n_iter):
        if y[i,0] != np.sign(X[i,:]*w)[0,0]:
            w += y[i,0]*X[i,:].T
            flag = 0

print(w)
[[-14.        ]
 [  4.77848286]
 [  8.61057437]]
$$ \begin{align*} g(x) &= \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0 \\\\ \implies x_2 &= -\frac{\omega_1}{\omega_2} x_1 - \frac{\omega_0}{\omega_2} \end{align*} $$
In [17]:
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w[1,0]/w[2,0]*x1p - w[0,0]/w[2,0]

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 3, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.show()

2.4. Perceptron using Scikit-Learn



$$ \begin{align*} x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ x_1^{(3)} & x_2^{(3)}\\ \vdots & \vdots \\ x_1^{(m)} & x_2^{(m)}\end{bmatrix} \\\\ y &= \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$

In [18]:
X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])
X = np.vstack([X1, X0])

y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])
In [19]:
from sklearn import linear_model

clf = linear_model.Perceptron(tol = 1e-3)
clf.fit(X, np.ravel(y))
Out[19]:
Perceptron()
In [20]:
clf.predict([[3, -2]])
Out[20]:
array([-1.])
In [21]:
clf.predict([[6, 2]])
Out[21]:
array([1.])
In [22]:
clf.coef_
Out[22]:
array([[4.30276835, 7.18194009]])
In [23]:
clf.intercept_
Out[23]:
array([-12.])
$$ \begin{align*} g(x) &= \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0 \\\\ \implies x_2 &= -\frac{\omega_1}{\omega_2} x_1 - \frac{\omega_0}{\omega_2} \end{align*} $$
In [24]:
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]
In [25]:
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w1/w2*x1p - w0/w2

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 4, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.show()

2.5. The Best Hyperplane Separator?

  • Perceptron finds one of the many possible hyperplanes separating the data if one exists
  • Of the many possible choices, which one is the best?
  • Utilize distance information from all data samples
    • We will see this formally when we discuss the logistic regression

3. Classification: Logistic Regression

  • Logistic regression is a classification algorithm despite its name - don't be confused

3.1. Using All Distances

  • Perceptron: makes use of only the sign of the data (which side of the boundary each point lies on)

  • We want to use distance information of all data points $\rightarrow$ logistic regression





  • basic idea: find the decision boundary (hyperplane) $g(x)=\omega^T x =0$ that maximizes $\prod_i \lvert h_i \rvert$
    • Inequality of arithmetic and geometric means
      $$ \frac{h_1+h_2}{2} \geq \sqrt{h_1 h_2} $$ with equality if and only if $h_1 = h_2$


  • Roughly speaking, this optimization of $\max \prod_i \lvert h_i \rvert$ tends to position the hyperplane in the middle of the two classes


$$h = \frac{g(x)}{\lVert \omega \rVert} = \frac{\omega^T x}{\lVert \omega \rVert} \sim \omega^T x$$
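
A minimal sketch (made-up numbers) of this signed distance $h = g(x)/\lVert \omega \rVert$ for a few points:

import numpy as np

# signed distance of points to the hyperplane w0 + w^T x = 0 (made-up numbers)
w0 = -3.0
w  = np.array([0.8, 1.0])

X = np.array([[4.0, 1.0],      # a point with g(x) > 0
              [1.0, 0.0]])     # a point with g(x) < 0

g = w0 + X @ w
h = g/np.linalg.norm(w)
print(h)                       # positive on one side of the boundary, negative on the other
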
  • We link or squeeze $(-\infty, +\infty)$ to $(0,1)$ for several reasons:





  • If $\sigma(z)$ is the sigmoid function, or the logistic function

    $$ \sigma(z) = \frac{1}{1+e^{-z}} \implies \sigma \left(\omega^T x \right) = \frac{1}{1+e^{-\omega^T x}}$$

    • Logistic function always generates a value between 0 and 1
    • Crosses 0.5 at the origin, then flattens out
    • The derivative of the sigmoid function satisfies

      $$\sigma'(z) = \sigma(z)\left( 1 - \sigma(z)\right)$$

In [26]:
# plot a sigmoid function

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

z = np.linspace(-4,4,100)
s = 1/(1 + np.exp(-z))

plt.figure(figsize = (8,2))
plt.plot(z, s)
plt.xlim([-4, 4])
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
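
A quick numerical check (not part of the original notebook) of the derivative identity above:

import numpy as np

# verify sigma'(z) = sigma(z)*(1 - sigma(z)) by finite differences
z = np.linspace(-4, 4, 1000)
s = 1/(1 + np.exp(-z))

ds_numeric  = np.gradient(s, z)        # finite-difference derivative
ds_analytic = s*(1 - s)

print(np.max(np.abs(ds_numeric - ds_analytic)))   # small (discretization error only)
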
  • Benefits of mapping via the logistic function

    • monotonic: the same or a similar optimization solution
    • continuous and differentiable: well suited to gradient descent optimization
    • probability or confidence: the output can be interpreted as a probability

    $$P\left(y = +1 \mid x\,;\omega\right) = \frac{1}{1+e^{-\omega^T x}} \;\; \in \; [0,1]$$

    • Often we do not care about predicting the label $y$ itself

    • Rather, we want to predict the label probabilities $P\left(y \mid x\,;\omega\right)$

      • the probability that the label is $+1$ $$P\left(y = +1 \mid x\,;\omega\right)$$
      • the probability that the label is $0$ $$P\left(y = 0 \mid x\,;\omega\right) = 1 - P\left(y = +1 \mid x\,;\omega\right)$$
  • Goal: we need to fit $\omega$ to our data

For a single data point $(x,y)$ with parameters $\omega$


$$ \begin{align*} P\left(y = +1 \mid x\,;\omega\right) &= h_{\omega}(x) = \sigma \left(\omega^T x \right)\\ P\left(y = 0 \mid x\,;\omega\right) &= 1 - h_{\omega}(x) = 1- \sigma \left(\omega^T x \right) \end{align*} $$


It can be compactly written as


$$P\left(y \mid x\,;\omega\right) = \left(h_{\omega}(x) \right)^y \left(1 - h_{\omega}(x)\right)^{1-y}$$


For $m$ training data points, the likelihood function of the parameters:

$$ \begin{align*} \mathscr{L}(\omega) &= P\left(y^{(1)}, \cdots, y^{(m)} \mid x^{(1)}, \cdots, x^{(m)}\,;\omega\right)\\ &= \prod\limits_{i=1}^{m}P\left(y^{(i)} \mid x^{(i)}\,;\omega\right)\\ &= \prod\limits_{i=1}^{m}\left(h_{\omega}\left(x^{(i)}\right) \right)^{y^{(i)}} \left(1 - h_{\omega}\left(x^{(i)}\right)\right)^{1-y^{(i)}} \qquad \left(\sim \prod_i \lvert h_i \rvert \right) \end{align*} $$

It is easier to work with the log-likelihood.


$$\ell(\omega) = \log \mathscr{L}(\omega) = \sum_{i=1}^{m} y^{(i)} \log h_{\omega} \left(x^{(i)} \right) + \left(1-y^{(i)} \right) \log \left(1-h_{\omega} \left(x^{(i)} \right) \right)$$


The logistic regression problem can then be solved as a (convex) optimization problem:


$$\hat{\omega} = \arg\max_{\omega} \ell(\omega)$$
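
Before using scikit-learn, here is a minimal NumPy sketch of this maximum-likelihood fit by gradient ascent, assuming 0/1 labels and a feature matrix whose rows start with the constant 1 (the gradient of $\ell(\omega)$ is $X^T\left(y - \sigma(X\omega)\right)$):

import numpy as np

# gradient ascent on the log-likelihood (a sketch, not the notebook's method below)
def fit_logistic(X, y, lr = 0.1, n_iter = 5000):
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1/(1 + np.exp(-X @ w))       # h_w(x^{(i)}) for every sample
        w += lr*(X.T @ (y - p))/len(y)   # gradient of l(w), averaged over samples
    return w

# toy usage (hypothetical 1-D data with a bias column)
X = np.array([[1, 0.5], [1, 1.0], [1, 3.0], [1, 4.0]], dtype = float)
y = np.array([0, 0, 1, 1], dtype = float)
print(fit_logistic(X, y))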

3.2. Logistic Regression using Scikit-Learn


$$ \begin{align*} \omega &= \begin{bmatrix} \omega_1 \\ \omega_2\end{bmatrix}, \qquad \omega_0, \qquad x = \begin{bmatrix} x_1 \\ x_2\end{bmatrix}\\ \\ X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots \\\end{bmatrix}\\ \\ y & = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix} \end{align*} $$

In [27]:
X
Out[27]:
array([[ 5.39876541e+00,  1.46063605e+00],
       [ 5.67066894e+00,  2.04340495e+00],
       [ 5.14798156e+00,  1.68338832e-01],
       [ 6.41703851e+00,  4.24331047e-01],
       [ 2.76484878e+00,  2.30685788e+00],
       [ 7.09085307e+00,  1.09244820e+00],
       [ 6.57799977e+00,  2.29612160e+00],
       [ 4.01448000e+00,  1.50781011e+00],
       [ 7.90904831e+00, -4.16032175e-01],
       [ 3.36716955e+00,  2.00811226e+00],
       [ 7.08619530e+00,  7.94156154e-01],
       [ 2.71662173e+00,  2.46155210e+00],
       [ 3.54552314e+00,  1.65905921e+00],
       [ 5.54257280e+00,  2.55494279e+00],
       [ 4.32730115e+00,  1.76266428e+00],
       [ 7.32879365e+00, -1.51653699e+00],
       [ 6.46251862e+00, -4.56948153e-01],
       [ 5.77981650e+00,  2.87100386e+00],
       [ 6.53889358e+00,  2.78474807e+00],
       [ 2.40034795e+00,  2.35773263e+00],
       [ 7.25373066e+00, -1.05602069e+00],
       [ 6.50148326e+00,  9.20715747e-01],
       [ 3.41296328e+00,  1.58627278e+00],
       [ 6.26514907e+00,  9.36900454e-01],
       [ 6.13208666e+00,  2.89444656e+00],
       [ 7.88035321e+00, -7.30331367e-01],
       [ 5.21773076e+00,  1.73906793e-01],
       [ 4.39537288e+00,  2.75800816e+00],
       [ 3.52440328e+00,  1.67392334e+00],
       [ 6.90982210e+00, -3.56760928e+00],
       [ 1.79271158e+00, -6.77349746e-01],
       [ 3.02063150e-03, -1.02477447e+00],
       [ 9.10365616e-02,  4.88502000e-01],
       [ 1.73735764e+00, -5.74834802e-01],
       [ 2.13942449e+00, -6.62268350e-01],
       [ 2.98194119e+00, -1.12744267e+00],
       [ 3.25245652e+00, -1.66472784e+00],
       [ 5.77676894e+00, -2.64132859e+00],
       [ 2.63470714e+00, -2.50145062e+00],
       [ 1.46918721e+00,  6.84560495e-01],
       [ 5.54537437e+00, -3.16274317e+00],
       [ 7.09041992e+00, -3.73411603e+00],
       [ 7.19773940e-01,  1.00944090e+00],
       [ 2.35863704e+00, -2.93060431e+00],
       [ 2.93901627e+00, -3.54943466e+00],
       [ 8.15227503e-01, -3.90692765e+00],
       [ 3.92908235e+00, -1.38055389e+00],
       [ 1.30386169e+00,  2.06540288e-01],
       [ 4.15588404e+00, -2.20170044e+00],
       [ 1.95973444e+00, -3.16122414e+00],
       [ 2.53659322e+00, -3.01127512e-01],
       [ 2.31573572e-01,  1.38496845e+00],
       [ 1.33536242e+00, -2.93688705e+00],
       [ 8.43232873e-02, -6.09709213e-02],
       [ 3.19360577e+00, -1.20601746e+00],
       [ 6.29141092e+00, -3.63708483e+00],
       [ 2.22703878e+00, -1.02304464e+00],
       [ 3.90666147e+00, -3.93133227e+00],
       [ 3.92305178e-01, -3.74731306e+00],
       [ 2.36327523e+00, -2.60689261e+00],
       [ 1.33450693e+00, -3.39210867e+00],
       [ 6.28688997e-01, -2.80856152e-01],
       [ 7.86650117e-01,  8.09147889e-01],
       [ 2.33985817e+00, -2.89769844e-01],
       [ 3.20649702e+00, -3.28685429e+00],
       [ 2.19846807e+00, -2.69772773e+00],
       [ 8.43095804e-01, -7.53622379e-01],
       [ 1.69204385e+00, -6.19324844e-01],
       [ 3.06372114e+00, -3.12419555e+00]])
In [28]:
# data generation

m = 100
w0 = -6
w = np.array([[2], [1]])
X = np.hstack([4*np.random.rand(m,1), 4*np.random.rand(m,1)])

w = np.asmatrix(w)
X = np.asmatrix(X)

y = 1/(1 + np.exp(-w0-X*w)) > 0.5

C1 = np.where(y == True)[0]
C0 = np.where(y == False)[0]

y = np.empty([m,1])
y[C1] = 1
y[C0] = 0

plt.figure(figsize = (6, 4))
plt.plot(X[C1,0], X[C1,1], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,0], X[C0,1], 'bo', alpha = 0.3, label = 'C0')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
In [29]:
from sklearn import linear_model

clf = linear_model.LogisticRegression(solver = 'lbfgs')
clf.fit(np.asarray(X), np.ravel(y))
Out[29]:
LogisticRegression()
In [30]:
clf.coef_
Out[30]:
array([[3.21453711, 1.29943009]])
In [31]:
clf.intercept_
Out[31]:
array([-9.18960726])
In [32]:
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]

xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w1/w2*xp - w0/w2

plt.figure(figsize = (6, 4))
plt.plot(X[C1,0], X[C1,1], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,0], X[C0,1], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()

4. Deep Learning Libraries

Tensorflow

Keras

5. TensorFlow

  • TensorFlow is an open-source software library for deep learning.

It is a framework for performing computations very efficiently, and it can tap into the GPU (Graphics Processing Unit) to speed things up even further. This makes a huge difference, as we shall see shortly. TensorFlow can be controlled by a simple Python API.

TensorFlow is one of the most widely used libraries for implementing machine learning and other algorithms involving a large number of mathematical operations. TensorFlow was developed by Google and is one of the most popular machine learning libraries on GitHub. Google uses TensorFlow to implement machine learning in almost all of its applications.

Tensor

TensorFlow gets its name from tensors, which are arrays of arbitrary dimensionality. A vector is a 1-d array and is known as a 1st-order tensor. A matrix is a 2-d array and a 2nd-order tensor. The "flow" part of the name refers to computation flowing through a graph. Training and inference in a neural network, for example, involves the propagation of matrix computations through many nodes in a computational graph.
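
A small example of tensors of increasing order (an assumed illustration, not tied to a later cell):

import tensorflow as tf

# 0th-, 1st-, and 2nd-order tensors
scalar = tf.constant(3.0)                    # 0-d
vector = tf.constant([1.0, 2.0, 3.0])        # 1-d array: 1st-order tensor
matrix = tf.constant([[1.0, 2.0],
                      [3.0, 4.0]])           # 2-d array: 2nd-order tensor

print(scalar.shape, vector.shape, matrix.shape)   # () (3,) (2, 2)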




5.1. TensorFlow with Gradient Tape

With eager execution enabled, TensorFlow calculates the values of tensors as they occur in your code. This means it does not precompute a static graph whose inputs are fed in through placeholders. To backpropagate errors, you therefore have to keep track of the gradients of your computation and then apply these gradients to an optimizer.

This is very different from running without eager execution, where you would build a graph and then simply use sess.run to evaluate your loss and pass it into an optimizer directly.

Fundamentally, because tensors are evaluated immediately, there is no graph from which to compute gradients, so you need a gradient tape. The tape is not merely for bookkeeping or visualization: you cannot implement gradient descent in eager mode without it.

TensorFlow could, of course, keep track of every gradient of every computation on every tf.Variable, but that would be a huge performance bottleneck. Instead, it exposes a gradient tape so that you can control which parts of your code need gradient information. In graph (non-eager) mode this can be determined statically from the parts of the graph that the loss depends on; in eager mode there is no static graph, so there is no way of knowing in advance.
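
A minimal sketch of recording a computation with tf.GradientTape and extracting a gradient in eager mode (toy numbers, not one of the notebook's numbered cells):

import tensorflow as tf

# record a computation on a tf.Variable and differentiate it
x = tf.Variable(3.0)

with tf.GradientTape() as tape:
    y = x*x + 2*x                      # dy/dx = 2x + 2

print(tape.gradient(y, x).numpy())     # 8.0 at x = 3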



5.2. TensorFlow as an Optimization Solver



$$\min_{\omega}\;\;(\omega - 4)^2$$

In [33]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline
In [34]:
w = tf.Variable(0, dtype = tf.float32)

LR = 0.05

# Training
cost_record = []
for i in range(50):
    with tf.GradientTape() as tape:
        cost = w*w - 8*w + 16
        w_grad = tape.gradient(cost, w)
    cost_record.append(cost)
    w.assign_sub(LR * w_grad)

print("\n optimal w =", w.numpy())

plt.figure(figsize = (6, 4))
plt.plot(cost_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('cost', fontsize = 15)
plt.show()
 optimal w = 3.979385

6. Machine Learning with TensorFlow

6.1. Linear Regression

$$\hat{y} = \omega x + b$$
  • Given $x$ and $y$
  • Want to estimate $\omega$ and $b$

Data generation

In [35]:
# data points in column vector [input, output]
train_x = np.array([0.1, 0.4, 0.7, 1.2, 1.3, 1.7, 2.2, 2.8, 3.0, 4.0, 4.3, 4.4, 4.9]).reshape(-1, 1)
train_y = np.array([0.5, 0.9, 1.1, 1.5, 1.5, 2.0, 2.2, 2.8, 2.7, 3.0, 3.5, 3.7, 3.9]).reshape(-1, 1)

m = train_x.shape[0]

plt.figure(figsize = (6, 4))
plt.plot(train_x, train_y, 'ko')
plt.title('Data', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()
  • Given $(x_i, y_i)$ for $i=1,\cdots, m$
$$ \hat{y}_{i} = \omega x_{i} + b \; \quad \text{ such that }\quad \min\limits_{\omega, b}\sum\limits_{i = 1}^{m} (\hat{y}_{i} - y_{i})^2$$
In [36]:
LR = 0.001
n_iter = 1000

w = tf.Variable([[0]], dtype = tf.float32)
b = tf.Variable([[0]], dtype = tf.float32)

loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        cost = tf.reduce_mean(tf.square(w*train_x + b - train_y))
        w_grad, b_grad = tape.gradient(cost, [w,b])

    loss_record.append(cost)
    w.assign_sub(LR * w_grad)
    b.assign_sub(LR * b_grad)

w_val = w.numpy()
b_val = b.numpy()
print("\n optimal w =", w_val)
print("\n optimal b =", b_val)

plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
 optimal w = [[0.74257565]]

 optimal b = [[0.41717836]]
In [37]:
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp = w_val*xp + b_val

plt.figure(figsize = (6, 4))
plt.plot(train_x, train_y, 'ko')
plt.plot(xp, yp, 'r')
plt.title('Data', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()

6.2. Logistic Regression


$$ \begin{align*} \omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}, \qquad x = \begin{bmatrix} 1 \\ x_1 \\ x_2\end{bmatrix}\\ \\ X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots & \vdots \\\end{bmatrix}, \quad y = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix} \end{align*} $$

In [38]:
# data generation

m = 1000
true_w = np.array([[-6], [2], [1]])
train_X = np.hstack([np.ones([m,1]), 5*np.random.rand(m,1), 4*np.random.rand(m,1)])

true_w = np.asmatrix(true_w)
train_X = np.asmatrix(train_X)

train_y = 1/(1 + np.exp(-train_X*true_w)) > 0.5

C1 = np.where(train_y == True)[0]
C0 = np.where(train_y == False)[0]

train_y = np.empty([m,1])
train_y[C1] = 1
train_y[C0] = 0

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
$$ \begin{align*} \ell(\omega) = \log \mathscr{L}(\omega) &= \sum_{i=1}^{m} y^{(i)} \log h_{\omega} \left(x^{(i)} \right) + \left(1-y^{(i)} \right) \log \left(1-h_{\omega} \left(x^{(i)} \right) \right)\\ &\Rightarrow \frac{1}{m} \sum_{i=1}^{m} y^{(i)} \log h_{\omega} \left(x^{(i)} \right) + \left(1-y^{(i)} \right) \log \left(1-h_{\omega} \left(x^{(i)} \right) \right) \end{align*} $$
In [39]:
LR = 0.05
n_iter = 1500

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_y, dtype = tf.float32)

loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.sigmoid(tf.matmul(train_x, w))
        loss = - train_y*tf.math.log(y_pred) - (1-train_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)
        w_grad = tape.gradient(loss, w)

    loss_record.append(loss)
    w.assign_sub(LR * w_grad)

w_hat = w.numpy()
print(w_hat)

plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
[[-4.096146  ]
 [ 1.7331805 ]
 [ 0.47786987]]
In [40]:
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()

TensorFlow built-in loss functions

In [41]:
LR = 0.05
n_iter = 1500

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_y, dtype = tf.float32)

loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.matmul(train_x,w)
        # note: this loss is a per-sample vector (no tf.reduce_mean is applied),
        # so tape.gradient below effectively differentiates the summed loss
        loss = tf.nn.sigmoid_cross_entropy_with_logits(labels = train_y, logits = y_pred)
        w_grad = tape.gradient(loss, w)

    loss_record.append(loss)
    w.assign_sub(LR * w_grad)

w_hat = w.numpy()
print(w_hat)

xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
[[-132.96732 ]
 [  44.290165]
 [  22.316422]]
In [42]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')