Machine Learning


By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

Table of Contents

1. Linear Regression
2. Classification: Perceptron
3. Classification: Logistic Regression
4. Deep Learning Libraries
5. TensorFlow
6. Machine Learning with TensorFlow

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('IRrGVQV8vZ8', width = "560", height = "315")
Out[ ]:

Consider a linear regression.


  • $\text{Given} \; \begin{cases} x_{i} \; \text{: inputs} \\ y_{i} \; \text{: outputs} \end{cases}$ , Find $\theta_{0}$ and $\theta_{1}$

$$x= \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{bmatrix}, \qquad y= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix}, \qquad y_{i} \approx \hat{y}_{i} = \theta_{0} + \theta_{1}x_{i}$$


  • $ \hat{y}_{i} $ : predicted output

  • $ \theta = \begin{bmatrix} \theta_{0} \\ \theta_{1} \\ \end{bmatrix} $ : Model parameters

$$ \hat{y}_{i} = f(x_{i}\,; \theta) \; \text{ in general}$$


  • in many cases, a linear model is used to predict $y_{i}$

$$ \hat{y}_{i} = \theta_{0} + \theta_{1}x_{i} \; \quad \text{ such that }\quad \min\limits_{\theta_{0}, \theta_{1}}\sum\limits_{i = 1}^{m} (\hat{y}_{i} - y_{i})^2$$




1.1. Re-cast the Problem as Least Squares

  • For convenience, we define a function that maps inputs to feature vectors, $\phi$

$$\begin{array}{Icr}\begin{align*} \hat{y}_{i} & = \theta_0 + x_i \theta_1 = 1 \cdot \theta_0 + x_i \theta_1 \\ \\ & = \begin{bmatrix}1 & x_{i}\end{bmatrix}\begin{bmatrix}\theta_{0} \\ \theta_{1}\end{bmatrix} \\\\ & =\begin{bmatrix}1 \\ x_{i} \end{bmatrix}^{T}\begin{bmatrix}\theta_{0} \\ \theta_{1}\end{bmatrix} \\\\ & =\phi^{T}(x_{i})\theta \end{align*}\end{array} \begin{array}{Icr} \quad \quad \text{feature vector} \; \phi(x_{i}) = \begin{bmatrix}1 \\ x_{i}\end{bmatrix} \end{array}$$


$$\Phi = \begin{bmatrix}1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\ 1 & x_{m} \end{bmatrix}=\begin{bmatrix}\phi^T(x_{1}) \\ \phi^T(x_{2}) \\ \vdots \\ \phi^T(x_{m}) \end{bmatrix} \quad \implies \quad \hat{y} = \begin{bmatrix}\hat{y}_{1} \\ \hat{y}_{2} \\ \vdots \\ \hat{y}_{m}\end{bmatrix}=\Phi\theta$$


  • Optimization problem

$$\min\limits_{\theta_{0}, \theta_{1}}\sum\limits_{i = 1}^{m} (\hat{y}_{i} - y_{i})^2 =\min\limits_{\theta}\lVert\Phi\theta-y\rVert^2_2 \qquad \qquad \left(\text{same as} \; \min_{x} \lVert Ax-b \rVert_2^2 \right)$$


$$ \text{solution} \; \theta^* = (\Phi^{T}\Phi)^{-1}\Phi^{T} y $$

1.2. Solve using Linear Algebra

  • known as the least squares solution

$$ \theta = (A^TA)^{-1}A^T y $$

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
# data points in column vector [input, output]
x = np.array([0.1, 0.4, 0.7, 1.2, 1.3, 1.7, 2.2, 2.8, 3.0, 4.0, 4.3, 4.4, 4.9]).reshape(-1, 1)
y = np.array([0.5, 0.9, 1.1, 1.5, 1.5, 2.0, 2.2, 2.8, 2.7, 3.0, 3.5, 3.7, 3.9]).reshape(-1, 1)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko')
plt.title('Data', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()
In [ ]:
m = y.shape[0]
#A = np.hstack([np.ones([m, 1]), x])
A = np.hstack([x**0, x])
A = np.asmatrix(A)

theta = (A.T*A).I*A.T*y

print('theta:\n', theta)
theta:
 [[0.65306531]
 [0.67129519]]
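
The same estimate could also be computed with `np.linalg.lstsq`, which solves the least-squares problem without forming $(A^TA)^{-1}$ explicitly. A minimal sketch, shown only as a numerically preferable alternative to the normal-equation solution above:

In [ ]:
# alternative: let NumPy solve the least-squares problem directly
theta_ls, residuals, rank, sv = np.linalg.lstsq(np.asarray(A), y, rcond = None)
print('theta (lstsq):\n', theta_ls)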
In [ ]:
# to plot
plt.figure(figsize = (6, 4))
plt.title('Regression', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.plot(x, y, 'ko', label = "data")

# to plot a straight line (fitted line)
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp

plt.plot(xp, yp, 'r', linewidth = 2, label = "regression")
plt.legend(fontsize = 15)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()

1.3. Scikit-Learn




  • Machine Learning in Python

  • Simple and efficient tools for data mining and data analysis

  • Accessible to everybody, and reusable in various contexts

  • Built on NumPy, SciPy, and matplotlib

  • Open source, commercially usable - BSD license

  • https://scikit-learn.org/stable/index.html#



In [ ]:
from sklearn import linear_model
In [ ]:
reg = linear_model.LinearRegression()
reg.fit(x, y)
Out[ ]:
LinearRegression()
In [ ]:
reg.coef_
Out[ ]:
array([[0.67129519]])
In [ ]:
reg.intercept_
Out[ ]:
array([0.65306531])
In [ ]:
# to plot
plt.figure(figsize = (6, 4))
plt.title('Regression', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.plot(x, y, 'ko', label = "data")

# to plot a straight line (fitted line)
plt.plot(xp, reg.predict(xp), 'r', linewidth = 2, label = "regression")
plt.legend(fontsize = 15)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()

2. Classification: Perceptron

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('IRrGVQV8vZ8?si=_hkRnh2jEhJVDXsY&start=931', width = "560", height = "315")
Out[ ]:

2.1. Classification

  • where $y$ is a discrete value

    • develop a classification algorithm to determine which class a new input should fall into
  • start with binary class problems

    • Later, we will look at the multiclass classification problem, which is just an extension of binary classification
  • We could use linear regression

    • Then, threshold the classifier output (i.e. anything over some value is yes, else no)
    • linear regression with thresholding seems to work
  • We will learn

    • perceptron
    • logistic regression

2.2. Perceptron

  • For input $x = \begin{bmatrix}x_1\\ \vdots\\ x_d \end{bmatrix}\;$ 'attributes of a customer'

  • weights $\omega = \begin{bmatrix}\omega_1\\ \vdots\\ \omega_d \end{bmatrix}$


$$\begin{align*} \text{Approve credit if} \; & \sum\limits_{i=1}^{d}\omega_ix_i > \text{threshold}, \\ \text{Deny credit if} \; & \sum\limits_{i=1}^{d}\omega_ix_i < \text{threshold}. \end{align*}$$


$$h(x) = \text{sign} \left(\left( \sum\limits_{i=1}^{d}\omega_ix_i \right)- \text{threshold} \right) = \text{sign}\left(\left( \sum\limits_{i=1}^{d}\omega_ix_i \right)+ \omega_0\right)$$



  • Introduce an artificial coordinate $x_0 = 1$:

$$h(x) = \text{sign}\left( \sum\limits_{i=0}^{d}\omega_ix_i \right)$$


  • In a vector form, the perceptron implements

$$h(x) = \text{sign}\left( \omega^T x \right)$$


  • sign function

$$ \text{sgn}(x) = \begin{cases} 1, &\text{if }\; x > 0\\ 0, &\text{if }\; x = 0\\ -1, &\text{if }\; x < 0 \end{cases} $$




  • Hyperplane

    • Separates a $d$-dimensional space into two half-spaces
    • Defined by an outward pointing normal vector $\omega$
    • $\omega$ is orthogonal to any vector lying on the hyperplane
    • Assume the hyperplane passes through the origin, $\omega^T x = 0$ (with the artificial coordinate $x_0 = 1$)




  • Sign with respect to a line

$$ \begin{align*} \omega = \begin{bmatrix}\omega_1 \\ \omega_2 \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2\end{bmatrix} &\implies g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = \omega_0 + \omega^T x\\ \omega = \begin{bmatrix}\omega_0 \\ \omega_1 \\ \omega_2 \end{bmatrix}, \quad x = \begin{bmatrix} 1 \\ x_1 \\ x_2\end{bmatrix} &\implies g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = \omega^T x \end{align*} $$




  • Goal: to learn the hyperplane $g_{\omega}(x)=0$ using the training data

  • How to find $\omega$

    • All data in class 1 $$g(x) > 0$$
    • All data in class 0 $$g(x) < 0$$

2.2.1. Perceptron Algorithm

The perceptron implements


$$h(x) = \text{sign}\left( \omega^Tx \right)$$


Given the training set


$$(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N) \quad \text{where } y_i \in \{-1,1\}$$


  1. pick a misclassified point

$$ \text{sign}\left(\omega^Tx_n \right) \neq y_n$$


  2. and update the weight vector

$$\omega \leftarrow \omega + y_nx_n$$





Why do perceptron updates work?

  • Let's look at a misclassified positive example ($y_n = +1$)

    • perceptron (wrongly) thinks $\omega_{old}^T x_n < 0$
  • updates would be


$$ \begin{align*}\omega_{new} &= \omega_{old} + y_n x_n = \omega_{old} + x_n \\ \\ \omega_{new}^T x_n &= (\omega_{old} + x_n)^T x_n = \omega_{old}^T x_n + x_n^T x_n \end{align*}$$


  • Thus $\omega_{new}^T x_n$ is less negative than $\omega_{old}^T x_n$, since $x_n^T x_n > 0$; the update moves the output toward the correct (positive) sign
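
A tiny numerical check of this argument (made-up numbers, with the artificial coordinate $x_0 = 1$ included):

In [ ]:
import numpy as np

w_old = np.array([1.0, -2.0])   # current weights
x_n = np.array([1.0, 1.0])      # misclassified positive example (y_n = +1)

print(w_old @ x_n)              # -1.0: wrongly negative

w_new = w_old + 1*x_n           # PLA update with y_n = +1
print(w_new @ x_n)              # 1.0 = -1.0 + x_n^T x_n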

2.2.2. Iterations of Perceptron

  1. Randomly assign $\omega$

  2. One iteration of the PLA (perceptron learning algorithm) $$\omega \leftarrow \omega + yx$$ where $(x, y)$ is a misclassified training point

  3. At iteration $t = 1, 2, 3, \cdots,$ pick a misclassified point from $$(x_1,y_1),(x_2,y_2),\cdots,(x_N, y_N)$$

  4. and run a PLA iteration on it

  5. That's it!



Summary





2.2.3. Perceptron Loss Function


$$ \mathscr{L}(\omega) = \sum_{n =1}^{m} \max \left\{ 0, -y_n \cdot \left(\omega^T x_n \right)\right\} $$


  • $\text{Loss} = 0$ on examples where perceptron is correct, i.e., $y_n \cdot \left(\omega^T x_n \right) > 0$

  • $\text{Loss} > 0$ on examples where perceptron misclassified, i.e., $y_n \cdot \left(\omega^T x_n \right) < 0$

Note: $\text{sign}\left(\omega^T x_n \right) \neq y_n$ is equivalent to $ y_n \cdot \left(\omega^T x_n \right) < 0$
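
A direct NumPy sketch of this loss (an illustrative helper, assuming the rows of `X` are the augmented inputs $x_n^T$ and `y` holds $\pm 1$ labels):

In [ ]:
import numpy as np

def perceptron_loss(w, X, y):
    # margins y_n * (w^T x_n): positive when correct, negative when misclassified
    margins = np.asarray(y).ravel() * np.asarray(X @ w).ravel()
    return np.sum(np.maximum(0, -margins))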


2.3. Perceptron in Python


$$g(x) = \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0$$



$$ \begin{align*} \omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}\\ \\ x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)}\\\vdots & \vdots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)}\end{bmatrix} \\ \\ y &= \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
# training data generation
m = 100
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4

g = 0.8*x1 + x2 - 3
In [ ]:
C1 = np.where(g >= 1)
C0 = np.where(g < -1)
print(C1)
(array([ 3,  4, 10, 11, 13, 16, 18, 21, 23, 27, 28, 33, 34, 37, 39, 40, 45,
       61, 64, 67, 69, 74, 77, 82, 83, 85, 87, 90, 96]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0]))
In [ ]:
C1 = np.where(g >= 1)[0]
C0 = np.where(g < -1)[0]
print(C1.shape)
print(C0.shape)
(29,)
(40,)
In [ ]:
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.title('Linearly Separable Classes', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.show()

$$ \begin{align*} x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)}\\\vdots & \vdots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)}\end{bmatrix} \\ y &= \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$

In [ ]:
X1 = np.hstack([np.ones([C1.shape[0],1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([C0.shape[0],1]), x1[C0], x2[C0]])
X = np.vstack([X1, X0])

y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])

X = np.asmatrix(X)
y = np.asmatrix(y)

$$\omega = \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}$$


$$\omega \leftarrow \omega + yx$$ where $(x, y)$ is a misclassified training point
In [ ]:
w = np.ones([3,1])
w = np.asmatrix(w)

n_iter = y.shape[0]
flag = 0

while flag == 0:
    flag = 1
    for i in range(n_iter):
        if y[i,0] != np.sign(X[i,:]*w)[0,0]:
            w += y[i,0]*X[i,:].T
            flag = 0

print(w)
[[-14.        ]
 [  4.77848286]
 [  8.61057437]]

$$ \begin{align*} g(x) &= \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0 \\\\ \implies x_2 &= -\frac{\omega_1}{\omega_2} x_1 - \frac{\omega_0}{\omega_2} \end{align*} $$

In [ ]:
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w[1,0]/w[2,0]*x1p - w[0,0]/w[2,0]

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 3, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.show()

2.4. Perceptron using Scikit-Learn



$$ \begin{align*} x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ x_1^{(3)} & x_2^{(3)}\\ \vdots & \vdots \\ x_1^{(m)} & x_2^{(m)}\end{bmatrix} \\\\ y &= \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$

In [ ]:
X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])
X = np.vstack([X1, X0])

y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])
In [ ]:
from sklearn import linear_model

clf = linear_model.Perceptron(tol = 1e-3)
clf.fit(X, np.ravel(y))
Out[ ]:
Perceptron()
In [ ]:
clf.predict([[3, -2]])
Out[ ]:
array([-1.])
In [ ]:
clf.predict([[6, 2]])
Out[ ]:
array([1.])
In [ ]:
clf.coef_
Out[ ]:
array([[4.30276835, 7.18194009]])
In [ ]:
clf.intercept_
Out[ ]:
array([-12.])

$$ \begin{align*} g(x) &= \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0 \\\\ \implies x_2 &= -\frac{\omega_1}{\omega_2} x_1 - \frac{\omega_0}{\omega_2} \end{align*} $$

In [ ]:
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]
In [ ]:
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w1/w2*x1p - w0/w2

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 4, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.show()

2.5. The best hyperplane separator?

  • Perceptron finds one of the many possible hyperplanes separating the data if one exists

  • Of the many possible choices, which one is the best?

  • Utilize distance information from all data samples

    • We will see this formally when we discuss logistic regression

3. Classification: Logistic Regression

  • Logistic regression is a classification algorithm despite its name - don't be confused
In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('IRrGVQV8vZ8?si=8H3cmkAUyNIu2NDb&amp;start=2750', width = "560", height = "315")
Out[ ]:

3.1. Using all Distances

  • Perceptron: makes use of only the sign of the data (which side of the boundary each point is on)

  • We want to use distance information of all data points $\rightarrow$ logistic regression





  • basic idea: find the decision boundary (hyperplane) $g(x)=\omega^T x = 0$ that maximizes $\prod_i \lvert h_i \rvert$, where $h_i$ is the distance from each data point to the boundary
    • Inequality of arithmetic and geometric means
      $$ \frac{h_1+h_2}{2} \geq \sqrt{h_1 h_2} $$ with equality if and only if $h_1 = h_2$

  • Roughly speaking, this optimization of $\max \prod_i \lvert h_i \rvert$ tends to position the hyperplane in the middle of the two classes

$$h = \frac{g(x)}{\lVert \omega \rVert} = \frac{\omega^T x}{\lVert \omega \rVert} \sim \omega^T x$$
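
As a small sketch, these quantities can be computed for all samples in one line, following the formula above (the hyperplane and the augmented sample points below are made-up, illustrative values):

In [ ]:
import numpy as np

# hypothetical hyperplane w and augmented samples [1, x1, x2]
w = np.array([[-3.0], [0.8], [1.0]])
X = np.array([[1.0, 4.0,  2.0],
              [1.0, 1.0, -1.0]])

h = (X @ w) / np.linalg.norm(w)    # h = w^T x / ||w||, as in the formula above
print(h)                           # sign tells the side, magnitude the distance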

  • We link or squeeze $(-\infty, +\infty)$ to $(0,1)$ for several reasons:





  • If $\sigma(z)$ is the sigmoid function, or the logistic function

    $$ \sigma(z) = \frac{1}{1+e^{-z}} \implies \sigma \left(\omega^T x \right) = \frac{1}{1+e^{-\omega^T x}}$$

  • Logistic function always generates a value between 0 and 1

  • Crosses 0.5 at the origin, then flattens out

  • The derivative of the sigmoid function satisfies

    $$\sigma'(z) = \sigma(z)\left( 1 - \sigma(z)\right)$$
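
A quick numerical check of this identity (a small finite-difference sketch):

In [ ]:
import numpy as np

sigma = lambda z: 1/(1 + np.exp(-z))

z, eps = 0.7, 1e-6
numerical = (sigma(z + eps) - sigma(z - eps))/(2*eps)   # finite-difference derivative
analytic = sigma(z)*(1 - sigma(z))                      # closed form sigma'(z)
print(numerical, analytic)                              # the two values agree closely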

In [ ]:
# plot a sigmoid function

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

z = np.linspace(-4,4,100)
s = 1/(1 + np.exp(-z))

plt.figure(figsize = (8,2))
plt.plot(z, s)
plt.xlim([-4, 4])
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
  • Benefits of mapping via the logistic function

  • monotonic: same or similar optimization solution

  • continuous and differentiable: good for gradient descent optimization

  • probability or confidence: can be interpreted as a probability

$$P\left(y = +1 \mid x\,;\omega\right) = \frac{1}{1+e^{-\omega^T x}} \;\; \in \; [0,1]$$

  • Often we do not care about predicting the label $y$ itself

  • Rather, we want to predict the label probabilities $P\left(y \mid x\,;\omega\right)$

    • the probability that the label is $+1$ $$P\left(y = +1 \mid x\,;\omega\right)$$
    • the probability that the label is $0$ $$P\left(y = 0 \mid x\,;\omega\right) = 1 - P\left(y = +1 \mid x\,;\omega\right)$$
  • Goal: we need to fit $\omega$ to our data

For a single data point $(x,y)$ with parameters $\omega$


$$ \begin{align*} P\left(y = +1 \mid x\,;\omega\right) &= h_{\omega}(x) = \sigma \left(\omega^T x \right)\\ P\left(y = 0 \mid x\,;\omega\right) &= 1 - h_{\omega}(x) = 1- \sigma \left(\omega^T x \right) \end{align*} $$


It can be compactly written as


$$P\left(y \mid x\,;\omega\right) = \left(h_{\omega}(x) \right)^y \left(1 - h_{\omega}(x)\right)^{1-y}$$


For $m$ training data points, the likelihood function of the parameters:

$$ \begin{align*} \mathscr{L}(\omega) &= P\left(y^{(1)}, \cdots, y^{(m)} \mid x^{(1)}, \cdots, x^{(m)}\,;\omega\right)\\ &= \prod\limits_{i=1}^{m}P\left(y^{(i)} \mid x^{(i)}\,;\omega\right)\\ &= \prod\limits_{i=1}^{m}\left(h_{\omega}\left(x^{(i)}\right) \right)^{y^{(i)}} \left(1 - h_{\omega}\left(x^{(i)}\right)\right)^{1-y^{(i)}} \qquad \left(\sim \prod_i \lvert h_i \rvert \right) \end{align*} $$

It is easier to work with the log-likelihood.


$$\ell(\omega) = \log \mathscr{L}(\omega) = \sum_{i=1}^{m} y^{(i)} \log h_{\omega} \left(x^{(i)} \right) + \left(1-y^{(i)} \right) \log \left(1-h_{\omega} \left(x^{(i)} \right) \right)$$

The logistic regression problem can then be solved as a (convex) optimization problem:


$$\hat{\omega} = \arg\max_{\omega} \ell(\omega)$$
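
Before turning to scikit-learn, here is a minimal NumPy gradient-ascent sketch of this maximization; the synthetic data, step size, and iteration count are illustrative, and the gradient used is the standard $\nabla_\omega \ell = \sum_i \left(y^{(i)} - h_\omega(x^{(i)})\right)x^{(i)}$:

In [ ]:
import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

# illustrative synthetic data: rows of X are [1, x1, x2], y in {0, 1}
np.random.seed(0)
m = 200
X = np.hstack([np.ones([m, 1]), 4*np.random.rand(m, 2)])
y = (X @ np.array([[-4.0], [1.0], [1.0]]) > 0).astype(float)

w = np.zeros([3, 1])
alpha = 0.1                                    # step size (illustrative)
for _ in range(2000):
    grad = X.T @ (y - sigmoid(X @ w)) / m      # gradient of the (average) log-likelihood
    w += alpha*grad                            # gradient ascent on l(w)

print(w)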

3.2. Logistic Regression using Scikit-Learn


$$ \begin{align*} \omega &= \begin{bmatrix} \omega_1 \\ \omega_2\end{bmatrix}, \qquad \omega_0, \qquad x = \begin{bmatrix} x_1 \\ x_2\end{bmatrix}\\ \\ X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots \\\end{bmatrix}\\ \\ y & = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix} \end{align*} $$
In [ ]:
X
Out[ ]:
array([[ 5.39876541e+00,  1.46063605e+00],
       [ 5.67066894e+00,  2.04340495e+00],
       [ 5.14798156e+00,  1.68338832e-01],
       [ 6.41703851e+00,  4.24331047e-01],
       [ 2.76484878e+00,  2.30685788e+00],
       [ 7.09085307e+00,  1.09244820e+00],
       [ 6.57799977e+00,  2.29612160e+00],
       [ 4.01448000e+00,  1.50781011e+00],
       [ 7.90904831e+00, -4.16032175e-01],
       [ 3.36716955e+00,  2.00811226e+00],
       [ 7.08619530e+00,  7.94156154e-01],
       [ 2.71662173e+00,  2.46155210e+00],
       [ 3.54552314e+00,  1.65905921e+00],
       [ 5.54257280e+00,  2.55494279e+00],
       [ 4.32730115e+00,  1.76266428e+00],
       [ 7.32879365e+00, -1.51653699e+00],
       [ 6.46251862e+00, -4.56948153e-01],
       [ 5.77981650e+00,  2.87100386e+00],
       [ 6.53889358e+00,  2.78474807e+00],
       [ 2.40034795e+00,  2.35773263e+00],
       [ 7.25373066e+00, -1.05602069e+00],
       [ 6.50148326e+00,  9.20715747e-01],
       [ 3.41296328e+00,  1.58627278e+00],
       [ 6.26514907e+00,  9.36900454e-01],
       [ 6.13208666e+00,  2.89444656e+00],
       [ 7.88035321e+00, -7.30331367e-01],
       [ 5.21773076e+00,  1.73906793e-01],
       [ 4.39537288e+00,  2.75800816e+00],
       [ 3.52440328e+00,  1.67392334e+00],
       [ 6.90982210e+00, -3.56760928e+00],
       [ 1.79271158e+00, -6.77349746e-01],
       [ 3.02063150e-03, -1.02477447e+00],
       [ 9.10365616e-02,  4.88502000e-01],
       [ 1.73735764e+00, -5.74834802e-01],
       [ 2.13942449e+00, -6.62268350e-01],
       [ 2.98194119e+00, -1.12744267e+00],
       [ 3.25245652e+00, -1.66472784e+00],
       [ 5.77676894e+00, -2.64132859e+00],
       [ 2.63470714e+00, -2.50145062e+00],
       [ 1.46918721e+00,  6.84560495e-01],
       [ 5.54537437e+00, -3.16274317e+00],
       [ 7.09041992e+00, -3.73411603e+00],
       [ 7.19773940e-01,  1.00944090e+00],
       [ 2.35863704e+00, -2.93060431e+00],
       [ 2.93901627e+00, -3.54943466e+00],
       [ 8.15227503e-01, -3.90692765e+00],
       [ 3.92908235e+00, -1.38055389e+00],
       [ 1.30386169e+00,  2.06540288e-01],
       [ 4.15588404e+00, -2.20170044e+00],
       [ 1.95973444e+00, -3.16122414e+00],
       [ 2.53659322e+00, -3.01127512e-01],
       [ 2.31573572e-01,  1.38496845e+00],
       [ 1.33536242e+00, -2.93688705e+00],
       [ 8.43232873e-02, -6.09709213e-02],
       [ 3.19360577e+00, -1.20601746e+00],
       [ 6.29141092e+00, -3.63708483e+00],
       [ 2.22703878e+00, -1.02304464e+00],
       [ 3.90666147e+00, -3.93133227e+00],
       [ 3.92305178e-01, -3.74731306e+00],
       [ 2.36327523e+00, -2.60689261e+00],
       [ 1.33450693e+00, -3.39210867e+00],
       [ 6.28688997e-01, -2.80856152e-01],
       [ 7.86650117e-01,  8.09147889e-01],
       [ 2.33985817e+00, -2.89769844e-01],
       [ 3.20649702e+00, -3.28685429e+00],
       [ 2.19846807e+00, -2.69772773e+00],
       [ 8.43095804e-01, -7.53622379e-01],
       [ 1.69204385e+00, -6.19324844e-01],
       [ 3.06372114e+00, -3.12419555e+00]])
In [ ]:
# data generation

m = 100
w0 = -6
w = np.array([[2], [1]])
X = np.hstack([4*np.random.rand(m,1), 4*np.random.rand(m,1)])

w = np.asmatrix(w)
X = np.asmatrix(X)

y = 1/(1 + np.exp(-w0-X*w)) > 0.5

C1 = np.where(y == True)[0]
C0 = np.where(y == False)[0]

y = np.empty([m,1])
y[C1] = 1
y[C0] = 0

plt.figure(figsize = (6, 4))
plt.plot(X[C1,0], X[C1,1], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,0], X[C0,1], 'bo', alpha = 0.3, label = 'C0')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
In [ ]:
from sklearn import linear_model

clf = linear_model.LogisticRegression(solver = 'lbfgs')
clf.fit(np.asarray(X), np.ravel(y))
Out[ ]:
LogisticRegression()
In [ ]:
clf.coef_
Out[ ]:
array([[3.21453711, 1.29943009]])
In [ ]:
clf.intercept_
Out[ ]:
array([-9.18960726])
In [ ]:
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]

xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w1/w2*xp - w0/w2

plt.figure(figsize = (6, 4))
plt.plot(X[C1,0], X[C1,1], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,0], X[C0,1], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()

4. Deep Learning Libraries

Tensorflow

Keras

5. TensorFlow

  • TensorFlow is an open-source software library for deep learning.

It’s a framework to perform computation very efficiently, and it can tap into the GPU (Graphics Processing Unit) to speed things up even further. This makes a huge difference, as we shall see shortly. TensorFlow can be controlled by a simple Python API.

TensorFlow is one of the most widely used libraries for implementing machine learning and other algorithms that involve a large number of mathematical operations. It was developed by Google and is one of the most popular machine learning libraries on GitHub; Google uses TensorFlow for machine learning in many of its applications.

Tensor

TensorFlow gets its name from tensors, which are arrays of arbitrary dimensionality. A vector is a 1-d array and is known as a 1st-order tensor. A matrix is a 2-d array and a 2nd-order tensor. The "flow" part of the name refers to computation flowing through a graph. Training and inference in a neural network, for example, involves the propagation of matrix computations through many nodes in a computational graph.
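
For example, tensor order (rank) can be inspected directly; a small sketch:

In [ ]:
import tensorflow as tf

scalar = tf.constant(3.0)                        # 0th-order tensor
vector = tf.constant([1.0, 2.0, 3.0])            # 1st-order tensor
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # 2nd-order tensor

print(tf.rank(scalar).numpy(), tf.rank(vector).numpy(), tf.rank(matrix).numpy())  # 0 1 2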




5.1. TensorFlow with Gradient Tape

With eager execution enabled, TensorFlow calculates the values of tensors as they occur in your code, rather than precomputing a static graph whose inputs are fed in through placeholders. As a result, to backpropagate errors you have to keep track of the gradients of your computation yourself and then apply these gradients to an optimizer.

This is very different from running without eager execution, where you would build a graph and then simply use sess.run to evaluate the loss and pass it to an optimizer directly.

Fundamentally, because tensors are evaluated immediately, there is no graph from which to calculate gradients, so you need a gradient tape. The tape is not merely a tool for visualization; you cannot implement gradient descent in eager mode without it.

TensorFlow could, in principle, keep track of every gradient of every computation on every tf.Variable, but that would be a huge performance bottleneck. Instead, it exposes a gradient tape so that you can control which parts of your code need gradient information. In non-eager mode this is determined statically from the part of the graph that your loss depends on; in eager mode there is no static graph, so there is no way of knowing in advance.
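
As a minimal illustration of the tape before it is used as an optimization solver below, record $y = x^2$ and ask for $dy/dx$:

In [ ]:
import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape() as tape:
    y = x*x                     # computation recorded on the tape

dy_dx = tape.gradient(y, x)     # dy/dx = 2x = 6.0
print(dy_dx.numpy())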



5.2. TensorFlow as an Optimization Solver



$$\min_{\omega}\;\;(\omega - 4)^2$$

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline
In [ ]:
w = tf.Variable(0, dtype = tf.float32)

LR = 0.05

# Training
cost_record = []
for i in range(50):
    with tf.GradientTape() as tape:
        cost = w*w - 8*w + 16
        w_grad = tape.gradient(cost, w)
    cost_record.append(cost)
    w.assign_sub(LR * w_grad)

print("\n optimal w =", w.numpy())

plt.figure(figsize = (6, 4))
plt.plot(cost_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('cost', fontsize = 15)
plt.show()
 optimal w = 3.979385
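
The manual `assign_sub` update above could equivalently be handled by a built-in optimizer. A sketch using `tf.keras.optimizers.SGD`, shown only as an alternative to the hand-written update:

In [ ]:
import tensorflow as tf

w = tf.Variable(0, dtype = tf.float32)
optimizer = tf.keras.optimizers.SGD(learning_rate = 0.05)

for i in range(50):
    with tf.GradientTape() as tape:
        cost = w*w - 8*w + 16
    # apply the gradient step with the optimizer instead of w.assign_sub
    optimizer.apply_gradients([(tape.gradient(cost, w), w)])

print(w.numpy())   # approaches 4 (about 3.98 after 50 steps)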

6. Machine Learning with TensorFlow

6.1. Linear Regression

$$\hat{y} = \omega x + b$$

  • Given $x$ and $y$
  • Want to estimate $\omega$ and $b$

Data generation

In [ ]:
# data points in column vector [input, output]
train_x = np.array([0.1, 0.4, 0.7, 1.2, 1.3, 1.7, 2.2, 2.8, 3.0, 4.0, 4.3, 4.4, 4.9]).reshape(-1, 1)
train_y = np.array([0.5, 0.9, 1.1, 1.5, 1.5, 2.0, 2.2, 2.8, 2.7, 3.0, 3.5, 3.7, 3.9]).reshape(-1, 1)

m = train_x.shape[0]

plt.figure(figsize = (6, 4))
plt.plot(train_x, train_y, 'ko')
plt.title('Data', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()
  • Given $(x_i, y_i)$ for $i=1,\cdots, m$

$$ \hat{y}_{i} = \omega x_{i} + b \; \quad \text{ such that }\quad \min\limits_{\omega, b}\sum\limits_{i = 1}^{m} (\hat{y}_{i} - y_{i})^2$$

In [ ]:
LR = 0.001
n_iter = 1000

w = tf.Variable([[0]], dtype = tf.float32)
b = tf.Variable([[0]], dtype = tf.float32)

loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        cost = tf.reduce_mean(tf.square(w*train_x + b - train_y))
        w_grad, b_grad = tape.gradient(cost, [w,b])

    loss_record.append(cost)
    w.assign_sub(LR * w_grad)
    b.assign_sub(LR * b_grad)

w_val = w.numpy()
b_val = b.numpy()
print("\n optimal w =", w_val)
print("\n optimal b =", b_val)

plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
 optimal w = [[0.74257565]]

 optimal b = [[0.41717836]]
In [ ]:
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp = w_val*xp + b_val

plt.figure(figsize = (6, 4))
plt.plot(train_x, train_y, 'ko')
plt.plot(xp, yp, 'r')
plt.title('Data', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()

6.2. Logistic Regression


$$ \begin{align*} \omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}, \qquad x = \begin{bmatrix} 1 \\ x_1 \\ x_2\end{bmatrix}\\ \\ X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots & \vdots \\\end{bmatrix}, \quad y = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix} \end{align*} $$
In [ ]:
# data generation

m = 1000
true_w = np.array([[-6], [2], [1]])
train_X = np.hstack([np.ones([m,1]), 5*np.random.rand(m,1), 4*np.random.rand(m,1)])

true_w = np.asmatrix(true_w)
train_X = np.asmatrix(train_X)

train_y = 1/(1 + np.exp(-train_X*true_w)) > 0.5

C1 = np.where(train_y == True)[0]
C0 = np.where(train_y == False)[0]

train_y = np.empty([m,1])
train_y[C1] = 1
train_y[C0] = 0

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()

$$ \begin{align*} \ell(\omega) = \log \mathscr{L}(\omega) &= \sum_{i=1}^{m} y^{(i)} \log h_{\omega} \left(x^{(i)} \right) + \left(1-y^{(i)} \right) \log \left(1-h_{\omega} \left(x^{(i)} \right) \right)\\ &\Rightarrow \frac{1}{m} \sum_{i=1}^{m} y^{(i)} \log h_{\omega} \left(x^{(i)} \right) + \left(1-y^{(i)} \right) \log \left(1-h_{\omega} \left(x^{(i)} \right) \right) \end{align*} $$

In [ ]:
LR = 0.05
n_iter = 1500

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_y, dtype = tf.float32)

loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.sigmoid(tf.matmul(train_x, w))
        loss = - train_y*tf.math.log(y_pred) - (1-train_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)
        w_grad = tape.gradient(loss, w)

    loss_record.append(loss)
    w.assign_sub(LR * w_grad)

w_hat = w.numpy()
print(w_hat)

plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
[[-4.096146  ]
 [ 1.7331805 ]
 [ 0.47786987]]
In [ ]:
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()

TensorFlow built-in functions

In [ ]:
LR = 0.05
n_iter = 1500

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_y, dtype = tf.float32)

loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.matmul(train_x,w)
        loss = tf.nn.sigmoid_cross_entropy_with_logits(labels = train_y, logits = y_pred)
        w_grad = tape.gradient(loss, w)

    loss_record.append(loss)
    w.assign_sub(LR * w_grad)

w_hat = w.numpy()
print(w_hat)

xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
[[-132.96732 ]
 [  44.290165]
 [  22.316422]]
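
One detail worth noting: in the cell above, the loss passed to `tape.gradient` is a vector of per-example losses, and `GradientTape.gradient` differentiates the sum of its elements, so the update effectively uses the summed rather than averaged loss, scaling the gradient by roughly $m$. The learned weights therefore end up on a different scale, although the decision boundary depends only on their ratios. A sketch of the averaged variant (reusing `train_x`, `train_y`, `LR`, and `n_iter` from above):

In [ ]:
w = tf.Variable([[0],[0],[0]], dtype = tf.float32)

for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.matmul(train_x, w)
        # reduce the per-example losses to a scalar mean so the gradient scale
        # matches the hand-written cross-entropy cell earlier
        loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels = train_y, logits = y_pred))
        w_grad = tape.gradient(loss, w)

    w.assign_sub(LR * w_grad)

print(w.numpy())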
In [ ]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')