Machine Learning


By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

Table of Contents

1. Linear Regression

Consider a linear regression problem.


  • $\text{Given} \; \begin{cases} x_{i} \; \text{: inputs} \\ y_{i} \; \text{: outputs} \end{cases}$ , Find $\theta_{0}$ and $\theta_{1}$


$$x= \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{bmatrix}, \qquad y= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix}, \qquad y_{i} \approx \hat{y}_{i} = \theta_{0} + \theta_{1}x_{i}$$


  • $ \hat{y}_{i} $ : predicted output


  • $ \theta = \begin{bmatrix} \theta_{0} \\ \theta_{1} \\ \end{bmatrix} $ : Model parameters


$$ \hat{y}_{i} = f(x_{i}\,; \theta) \; \text{ in general}$$


  • in many cases, a linear model is used to predict $y_{i}$


$$ \hat{y}_{i} = \theta_{0} + \theta_{1}x_{i} \; \quad \text{ such that }\quad \min\limits_{\theta_{0}, \theta_{1}}\sum\limits_{i = 1}^{m} (\hat{y}_{i} - y_{i})^2$$




1.1. Re-cast the Problem as Least Squares

  • For convenience, we define a function that maps inputs to feature vectors, $\phi$


$$\begin{array}{Icr}\begin{align*} \hat{y}_{i} & = \theta_0 + x_i \theta_1 = 1 \cdot \theta_0 + x_i \theta_1 \\ \\ & = \begin{bmatrix}1 & x_{i}\end{bmatrix}\begin{bmatrix}\theta_{0} \\ \theta_{1}\end{bmatrix} \\\\ & =\begin{bmatrix}1 \\ x_{i} \end{bmatrix}^{T}\begin{bmatrix}\theta_{0} \\ \theta_{1}\end{bmatrix} \\\\ & =\phi^{T}(x_{i})\theta \end{align*}\end{array} \begin{array}{Icr} \quad \quad \text{feature vector} \; \phi(x_{i}) = \begin{bmatrix}1 \\ x_{i}\end{bmatrix} \end{array}$$


$$\Phi = \begin{bmatrix}1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\1 & x_{m} \end{bmatrix}=\begin{bmatrix}\phi^T(x_{1}) \\\phi^T(x_{2}) \\\vdots \\\phi^T(x_{m}) \end{bmatrix} \quad \implies \quad \hat{y} = \begin{bmatrix}\hat{y}_{1} \\\hat{y}_{2} \\\vdots \\\hat{y}_{m}\end{bmatrix}=\Phi\theta$$


  • Optimization problem


$$\min\limits_{\theta_{0}, \theta_{1}}\sum\limits_{i = 1}^{m} (\hat{y}_{i} - y_{i})^2 =\min\limits_{\theta}\lVert\Phi\theta-y\rVert^2_2 \qquad \qquad \left(\text{same as} \; \min_{x} \lVert Ax-b \rVert_2^2 \right)$$


$$ \text{solution} \; \theta^* = (\Phi^{T}\Phi)^{-1}\Phi^{T} y $$
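
As a quick sanity check, the same solution can be obtained with NumPy's least-squares solver, which avoids forming $(\Phi^{T}\Phi)^{-1}$ explicitly. A minimal sketch on made-up toy data (not the notebook's data cell below):

import numpy as np

# a minimal sketch: solve min_theta ||Phi theta - y||_2^2 with np.linalg.lstsq
# (toy inputs/outputs assumed for illustration only)
x = np.array([0.1, 0.4, 0.7, 1.2, 1.3]).reshape(-1, 1)
y = np.array([0.5, 0.9, 1.1, 1.5, 1.5]).reshape(-1, 1)

Phi = np.hstack([np.ones_like(x), x])               # rows are phi(x_i)^T = [1, x_i]
theta, *_ = np.linalg.lstsq(Phi, y, rcond = None)   # returns [theta_0, theta_1]
print(theta)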

1.2. Solve using Linear Algebra

  • known as least squares


$$ \theta = (A^TA)^{-1}A^T y $$
In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# data points in column vector [input, output]
x = np.array([0.1, 0.4, 0.7, 1.2, 1.3, 1.7, 2.2, 2.8, 3.0, 4.0, 4.3, 4.4, 4.9]).reshape(-1, 1)
y = np.array([0.5, 0.9, 1.1, 1.5, 1.5, 2.0, 2.2, 2.8, 2.7, 3.0, 3.5, 3.7, 3.9]).reshape(-1, 1)

plt.figure(figsize = (6, 4))
plt.plot(x, y, 'ko')
plt.title('Data', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()
In [3]:
m = y.shape[0]
#A = np.hstack([np.ones([m, 1]), x])
A = np.hstack([x**0, x])
A = np.asmatrix(A)

theta = (A.T*A).I*A.T*y

print('theta:\n', theta)
theta:
 [[0.65306531]
 [0.67129519]]
In [4]:
# to plot
plt.figure(figsize = (6, 4))
plt.title('Regression', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.plot(x, y, 'ko', label = "data")

# to plot a straight line (fitted line)
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp = theta[0,0] + theta[1,0]*xp

plt.plot(xp, yp, 'r', linewidth = 2, label = "regression")
plt.legend(fontsize = 15)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()

1.3. Scikit-Learn




  • Machine Learning in Python
  • Simple and efficient tools for data mining and data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
  • Open source, commercially usable - BSD license



In [5]:
from sklearn import linear_model
In [6]:
reg = linear_model.LinearRegression()
reg.fit(x, y)
Out[6]:
LinearRegression()
In [7]:
reg.coef_
Out[7]:
array([[0.67129519]])
In [8]:
reg.intercept_
Out[8]:
array([0.65306531])
In [9]:
# to plot
plt.figure(figsize = (6, 4))
plt.title('Regression', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.plot(x, y, 'ko', label = "data")

# to plot a straight line (fitted line)
plt.plot(xp, reg.predict(xp), 'r', linewidth = 2, label = "regression")
plt.legend(fontsize = 15)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()

2. Classification: Perceptron

2.1. Classification

  • where $y$ is a discrete value
    • develop the classification algorithm to determine which class a new input should fall into
  • start with binary class problems
    • Later, we will look at the multiclass classification problem, although this is just an extension of binary classification
  • We could use linear regression
    • Then, threshold the classifier output (i.e. anything over some value is yes, else no)
    • linear regression with thresholding seems to work
  • We will learn
    • perceptron
    • logistic regression

2.2. Perceptron

  • For input $x = \begin{bmatrix}x_1\\ \vdots\\ x_d \end{bmatrix}\;$ 'attributes of a customer'
  • weights $\omega = \begin{bmatrix}\omega_1\\ \vdots\\ \omega_d \end{bmatrix}$


$$\begin{align*} \text{Approve credit if} \; & \sum\limits_{i=1}^{d}\omega_ix_i > \text{threshold}, \\ \text{Deny credit if} \; & \sum\limits_{i=1}^{d}\omega_ix_i < \text{threshold}. \end{align*}$$


$$h(x) = \text{sign} \left(\left( \sum\limits_{i=1}^{d}\omega_ix_i \right)- \text{threshold} \right) = \text{sign}\left(\left( \sum\limits_{i=1}^{d}\omega_ix_i \right)+ \omega_0\right)$$



  • Introduce an artificial coordinate $x_0 = 1$:


$$h(x) = \text{sign}\left( \sum\limits_{i=0}^{d}\omega_ix_i \right)$$


  • In a vector form, the perceptron implements


$$h(x) = \text{sign}\left( \omega^T x \right)$$


  • sign function


$$ \text{sgn}(x) = \begin{cases} 1, &\text{if }\; x > 0\\ 0, &\text{if }\; x = 0\\ -1, &\text{if }\; x < 0 \end{cases} $$
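
A small sketch of this hypothesis in NumPy, with made-up weights and the artificial coordinate $x_0 = 1$ prepended:

import numpy as np

# perceptron hypothesis h(x) = sign(w^T x); weights are made up for illustration
w = np.array([-1.0, 2.0, 1.0])          # [w_0, w_1, w_2]

def h(x1, x2):
    x = np.array([1.0, x1, x2])         # prepend x_0 = 1
    return np.sign(w @ x)               # +1, 0, or -1

print(h(3.0, 1.0))   # w^T x =  6 -> +1
print(h(0.0, 0.0))   # w^T x = -1 -> -1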




  • Hyperplane

    • Separates a $d$-dimensional space into two half-spaces
    • Defined by an outward pointing normal vector $\omega$
    • $\omega$ is orthogonal to any vector lying on the hyperplane
    • Assume the hyperplane passes through origin, $\omega^T x = 0$ with $x_0 = 1$




  • Sign with respect to a line


$$ \begin{align*} \omega = \begin{bmatrix}\omega_1 \\ \omega_2 \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2\end{bmatrix} &\implies g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = \omega_0 + \omega^T x\\ \omega = \begin{bmatrix}\omega_0 \\ \omega_1 \\ \omega_2 \end{bmatrix}, \quad x = \begin{bmatrix} 1 \\ x_1 \\ x_2\end{bmatrix} &\implies g(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 = \omega^T x \end{align*} $$




  • Goal: to learn the hyperplane $g_{\omega}(x)=0$ using the training data
  • How to find $\omega$

    • All data in class 1 $$g(x) > 0$$
    • All data in class 0 $$g(x) < 0$$

2.2.1. Perceptron Algorithm

The perceptron implements


$$h(x) = \text{sign}\left( \omega^Tx \right)$$


Given the training set


$$(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N) \quad \text{where } y_i \in \{-1,1\}$$


1) pick a misclassified point


$$ \text{sign}\left(\omega^Tx_n \right) \neq y_n$$


2) and update the weight vector


$$\omega \leftarrow \omega + y_nx_n$$





Why do perceptron updates work?

  • Let's look at a misclassified positive example ($y_n = +1$)

    • perceptron (wrongly) thinks $\omega_{old}^T x_n < 0$
  • updates would be


$$ \begin{align*}\omega_{new} &= \omega_{old} + y_n x_n = \omega_{old} + x_n \\ \\ \omega_{new}^T x_n &= (\omega_{old} + x_n)^T x_n = \omega_{old}^T x_n + x_n^T x_n \end{align*}$$


  • Thus $\omega_{new}^T x_n$ is less negative than $\omega_{old}^T x_n$ since $x_n^T x_n > 0$, i.e., the update moves the prediction toward the correct sign
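
A tiny numeric illustration of this update (numbers made up):

import numpy as np

# one PLA update on a misclassified positive example (made-up numbers)
w_old = np.array([1.0, -2.0])
x_n   = np.array([1.0,  1.5])          # includes the artificial coordinate x_0 = 1
y_n   = 1

print(w_old @ x_n)                     # -2.0: wrongly classified as negative

w_new = w_old + y_n*x_n                # perceptron update
print(w_new @ x_n)                     # -2.0 + ||x_n||^2 = 1.25: now on the correct side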

2.2.2. Iterations of Perceptron

  1. Randomly assign $\omega$

  2. One iteration of the PLA (perceptron learning algorithm) $$\omega \leftarrow \omega + yx$$ where $(x, y)$ is a misclassified training point

  3. At iteration $t = 1, 2, 3, \cdots,$ pick a misclassified point from $$(x_1,y_1),(x_2,y_2),\cdots,(x_N, y_N)$$

  4. and run a PLA iteration on it

  5. That's it!




Summary





2.2.3. Perceptron Loss Function


$$ \mathscr{L}(\omega) = \sum_{n =1}^{m} \max \left\{ 0, -y_n \cdot \left(\omega^T x_n \right)\right\} $$


  • $\text{Loss} = 0$ on examples where perceptron is correct, i.e., $y_n \cdot \left(\omega^T x_n \right) > 0$


  • $\text{Loss} > 0$ on examples where perceptron misclassified, i.e., $y_n \cdot \left(\omega^T x_n \right) < 0$


Note: $\text{sign}\left(\omega^T x_n \right) \neq y_n$ is equivalent to $ y_n \cdot \left(\omega^T x_n \right) < 0$
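
A minimal NumPy sketch of this loss (data and weights made up; rows of X include $x_0 = 1$):

import numpy as np

# perceptron loss L(w) = sum_n max{0, -y_n * (w^T x_n)}
def perceptron_loss(w, X, y):
    margins = y*(X @ w)                         # y_n * w^T x_n for every sample
    return np.sum(np.maximum(0.0, -margins))    # 0 on correct samples, > 0 on mistakes

X = np.array([[1.0, 2.0,  1.0],
              [1.0, 0.5, -1.0]])
y = np.array([1.0, 1.0])
w = np.array([-1.0, 1.0, 1.0])

print(perceptron_loss(w, X, y))   # first sample correct (0), second misclassified (1.5)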


2.3. Perceptron in Python


$$g(x) = \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0$$



$$ \begin{align*} \omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}\\ \\ x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)}\\\vdots & \vdots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)}\end{bmatrix} \\ \\ y &= \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$
In [10]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [11]:
# training data generation
m = 100
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4

g = 0.8*x1 + x2 - 3
In [12]:
C1 = np.where(g >= 1)
C0 = np.where(g < -1)
print(C1)
(array([ 3,  4, 10, 11, 13, 16, 18, 21, 23, 27, 28, 33, 34, 37, 39, 40, 45,
       61, 64, 67, 69, 74, 77, 82, 83, 85, 87, 90, 96]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0]))
In [13]:
C1 = np.where(g >= 1)[0]
C0 = np.where(g < -1)[0]
print(C1.shape)
print(C0.shape)
(29,)
(40,)
In [14]:
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.title('Linearly Separable Classes', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.show()
$$ \begin{align*} x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)}\\\vdots & \vdots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)}\end{bmatrix} \\ y &= \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$
In [15]:
X1 = np.hstack([np.ones([C1.shape[0],1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([C0.shape[0],1]), x1[C0], x2[C0]])
X = np.vstack([X1, X0])

y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])

X = np.asmatrix(X)
y = np.asmatrix(y)
$$\omega = \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}$$


$$\omega \leftarrow \omega + yx$$ where $(x, y)$ is a misclassified training point

In [16]:
w = np.ones([3,1])
w = np.asmatrix(w)

n_iter = y.shape[0]
flag = 0

while flag == 0:
    flag = 1
    for i in range(n_iter):
        if y[i,0] != np.sign(X[i,:]*w)[0,0]:
            w += y[i,0]*X[i,:].T
            flag = 0

print(w)
[[-14.        ]
 [  4.77848286]
 [  8.61057437]]
$$ \begin{align*} g(x) &= \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0 \\\\ \implies x_2 &= -\frac{\omega_1}{\omega_2} x_1 - \frac{\omega_0}{\omega_2} \end{align*} $$
In [17]:
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w[1,0]/w[2,0]*x1p - w[0,0]/w[2,0]

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 3, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.show()

2.4. Perceptron using Scikit-Learn



$$ \begin{align*} x &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T\\ \vdots \\ \left(x^{(m)}\right)^T \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ x_1^{(3)} & x_2^{(3)}\\ \vdots & \vdots \\ x_1^{(m)} & x_2^{(m)}\end{bmatrix} \\\\ y &= \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)}\\ \vdots \\ y^{(m)} \end{bmatrix} \end{align*}$$

In [18]:
X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])
X = np.vstack([X1, X0])

y = np.vstack([np.ones([C1.shape[0],1]), -np.ones([C0.shape[0],1])])
In [19]:
from sklearn import linear_model

clf = linear_model.Perceptron(tol = 1e-3)
clf.fit(X, np.ravel(y))
Out[19]:
Perceptron()
In [20]:
clf.predict([[3, -2]])
Out[20]:
array([-1.])
In [21]:
clf.predict([[6, 2]])
Out[21]:
array([1.])
In [22]:
clf.coef_
Out[22]:
array([[4.30276835, 7.18194009]])
In [23]:
clf.intercept_
Out[23]:
array([-12.])
$$ \begin{align*} g(x) &= \omega_0 + \omega^Tx = \omega_0 + \omega_1x_1 + \omega_2x_2 = 0 \\\\ \implies x_2 &= -\frac{\omega_1}{\omega_2} x_1 - \frac{\omega_0}{\omega_2} \end{align*} $$
In [24]:
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]
In [25]:
x1p = np.linspace(0,8,100).reshape(-1,1)
x2p = - w1/w2*x1p - w0/w2

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, c = 'k', linewidth = 4, label = 'perceptron')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.show()

2.5. The Best Hyperplane Separator?

  • Perceptron finds one of the many possible hyperplanes separating the data if one exists
  • Of the many possible choices, which one is the best?
  • Utilize distance information from all data samples
    • We will see this formally when we discuss the logistic regression

3. Classification: Logistic Regression

  • Logistic regression is a classification algorithm despite its name - don't be confused

3.1. Using All Distances

  • Perceptron: makes use of only the sign of the data (which side of the boundary each point lies on)

  • We want to use distance information of all data points $\rightarrow$ logistic regression





  • basic idea: find the decision boundary (hyperplane) $g(x)=\omega^T x =0$ that maximizes $\prod_i \lvert h_i \rvert$
    • Inequality of arithmetic and geometric means
      $$ \frac{h_1+h_2}{2} \geq \sqrt{h_1 h_2} $$ with equality if and only if $h_1 = h_2$


  • Roughly speaking, this optimization of $\max \prod_i \lvert h_i \rvert$ tends to position the hyperplane in the middle of the two classes


$$h = \frac{g(x)}{\lVert \omega \rVert} = \frac{\omega^T x}{\lVert \omega \rVert} \sim \omega^T x$$
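
A minimal sketch (made-up numbers) of this signed distance $h = g(x)/\lVert \omega \rVert$ for a few points:

import numpy as np

# signed distance of points to the hyperplane w0 + w^T x = 0 (made-up numbers)
w0 = -3.0
w  = np.array([0.8, 1.0])

X = np.array([[4.0, 1.0],      # a point with g(x) > 0
              [1.0, 0.0]])     # a point with g(x) < 0

g = w0 + X @ w
h = g/np.linalg.norm(w)
print(h)                       # positive on one side of the boundary, negative on the other
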
  • We link or squeeze $(-\infty, +\infty)$ to $(0,1)$ for several reasons:





  • If $\sigma(z)$ is the sigmoid function, or the logistic function

    $$ \sigma(z) = \frac{1}{1+e^{-z}} \implies \sigma \left(\omega^T x \right) = \frac{1}{1+e^{-\omega^T x}}$$

    • Logistic function always generates a value between 0 and 1
    • Crosses 0.5 at the origin, then flattens out
    • The derivative of the sigmoid function satisfies

      $$\sigma'(z) = \sigma(z)\left( 1 - \sigma(z)\right)$$

In [26]:
# plot a sigmoid function

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

z = np.linspace(-4,4,100)
s = 1/(1 + np.exp(-z))

plt.figure(figsize = (8,2))
plt.plot(z, s)
plt.xlim([-4, 4])
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.show()
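
A quick numerical check (not part of the original notebook) of the derivative identity above:

import numpy as np

# verify sigma'(z) = sigma(z)*(1 - sigma(z)) by finite differences
z = np.linspace(-4, 4, 1000)
s = 1/(1 + np.exp(-z))

ds_numeric  = np.gradient(s, z)        # finite-difference derivative
ds_analytic = s*(1 - s)

print(np.max(np.abs(ds_numeric - ds_analytic)))   # small (discretization error only)
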
  • Benefits of mapping via the logistic function

    • monotonic: the same or a similar optimization solution
    • continuous and differentiable: well suited to gradient descent optimization
    • probability or confidence: the output can be interpreted as a probability

    $$P\left(y = +1 \mid x\,;\omega\right) = \frac{1}{1+e^{-\omega^T x}} \;\; \in \; [0,1]$$

    • Often we do not care about predicting the label $y$ itself

    • Rather, we want to predict the label probabilities $P\left(y \mid x\,;\omega\right)$

      • the probability that the label is $+1$ $$P\left(y = +1 \mid x\,;\omega\right)$$
      • the probability that the label is $0$ $$P\left(y = 0 \mid x\,;\omega\right) = 1 - P\left(y = +1 \mid x\,;\omega\right)$$
  • Goal: we need to fit $\omega$ to our data

For a single data point $(x,y)$ with parameters $\omega$


$$ \begin{align*} P\left(y = +1 \mid x\,;\omega\right) &= h_{\omega}(x) = \sigma \left(\omega^T x \right)\\ P\left(y = 0 \mid x\,;\omega\right) &= 1 - h_{\omega}(x) = 1- \sigma \left(\omega^T x \right) \end{align*} $$


It can be compactly written as


$$P\left(y \mid x\,;\omega\right) = \left(h_{\omega}(x) \right)^y \left(1 - h_{\omega}(x)\right)^{1-y}$$


For $m$ training data points, the likelihood function of the parameters:

$$ \begin{align*} \mathscr{L}(\omega) &= P\left(y^{(1)}, \cdots, y^{(m)} \mid x^{(1)}, \cdots, x^{(m)}\,;\omega\right)\\ &= \prod\limits_{i=1}^{m}P\left(y^{(i)} \mid x^{(i)}\,;\omega\right)\\ &= \prod\limits_{i=1}^{m}\left(h_{\omega}\left(x^{(i)}\right) \right)^{y^{(i)}} \left(1 - h_{\omega}\left(x^{(i)}\right)\right)^{1-y^{(i)}} \qquad \left(\sim \prod_i \lvert h_i \rvert \right) \end{align*} $$

It is easier to work with the log-likelihood.


$$\ell(\omega) = \log \mathscr{L}(\omega) = \sum_{i=1}^{m} y^{(i)} \log h_{\omega} \left(x^{(i)} \right) + \left(1-y^{(i)} \right) \log \left(1-h_{\omega} \left(x^{(i)} \right) \right)$$


The logistic regression problem can then be solved as a (convex) optimization problem:


$$\hat{\omega} = \arg\max_{\omega} \ell(\omega)$$
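
Before using scikit-learn, here is a minimal NumPy sketch of this maximum-likelihood fit by gradient ascent, assuming 0/1 labels and a feature matrix whose rows start with the constant 1 (the gradient of $\ell(\omega)$ is $X^T\left(y - \sigma(X\omega)\right)$):

import numpy as np

# gradient ascent on the log-likelihood (a sketch, not the notebook's method below)
def fit_logistic(X, y, lr = 0.1, n_iter = 5000):
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1/(1 + np.exp(-X @ w))       # h_w(x^{(i)}) for every sample
        w += lr*(X.T @ (y - p))/len(y)   # gradient of l(w), averaged over samples
    return w

# toy usage (hypothetical 1-D data with a bias column)
X = np.array([[1, 0.5], [1, 1.0], [1, 3.0], [1, 4.0]], dtype = float)
y = np.array([0, 0, 1, 1], dtype = float)
print(fit_logistic(X, y))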

3.2. Logistic Regression using Scikit-Learn


$$ \begin{align*} \omega &= \begin{bmatrix} \omega_1 \\ \omega_2\end{bmatrix}, \qquad \omega_0, \qquad x = \begin{bmatrix} x_1 \\ x_2\end{bmatrix}\\ \\ X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots \\\end{bmatrix}\\ \\ y & = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix} \end{align*} $$

In [27]:
X
Out[27]:
array([[ 5.39876541e+00,  1.46063605e+00],
       [ 5.67066894e+00,  2.04340495e+00],
       [ 5.14798156e+00,  1.68338832e-01],
       [ 6.41703851e+00,  4.24331047e-01],
       [ 2.76484878e+00,  2.30685788e+00],
       [ 7.09085307e+00,  1.09244820e+00],
       [ 6.57799977e+00,  2.29612160e+00],
       [ 4.01448000e+00,  1.50781011e+00],
       [ 7.90904831e+00, -4.16032175e-01],
       [ 3.36716955e+00,  2.00811226e+00],
       [ 7.08619530e+00,  7.94156154e-01],
       [ 2.71662173e+00,  2.46155210e+00],
       [ 3.54552314e+00,  1.65905921e+00],
       [ 5.54257280e+00,  2.55494279e+00],
       [ 4.32730115e+00,  1.76266428e+00],
       [ 7.32879365e+00, -1.51653699e+00],
       [ 6.46251862e+00, -4.56948153e-01],
       [ 5.77981650e+00,  2.87100386e+00],
       [ 6.53889358e+00,  2.78474807e+00],
       [ 2.40034795e+00,  2.35773263e+00],
       [ 7.25373066e+00, -1.05602069e+00],
       [ 6.50148326e+00,  9.20715747e-01],
       [ 3.41296328e+00,  1.58627278e+00],
       [ 6.26514907e+00,  9.36900454e-01],
       [ 6.13208666e+00,  2.89444656e+00],
       [ 7.88035321e+00, -7.30331367e-01],
       [ 5.21773076e+00,  1.73906793e-01],
       [ 4.39537288e+00,  2.75800816e+00],
       [ 3.52440328e+00,  1.67392334e+00],
       [ 6.90982210e+00, -3.56760928e+00],
       [ 1.79271158e+00, -6.77349746e-01],
       [ 3.02063150e-03, -1.02477447e+00],
       [ 9.10365616e-02,  4.88502000e-01],
       [ 1.73735764e+00, -5.74834802e-01],
       [ 2.13942449e+00, -6.62268350e-01],
       [ 2.98194119e+00, -1.12744267e+00],
       [ 3.25245652e+00, -1.66472784e+00],
       [ 5.77676894e+00, -2.64132859e+00],
       [ 2.63470714e+00, -2.50145062e+00],
       [ 1.46918721e+00,  6.84560495e-01],
       [ 5.54537437e+00, -3.16274317e+00],
       [ 7.09041992e+00, -3.73411603e+00],
       [ 7.19773940e-01,  1.00944090e+00],
       [ 2.35863704e+00, -2.93060431e+00],
       [ 2.93901627e+00, -3.54943466e+00],
       [ 8.15227503e-01, -3.90692765e+00],
       [ 3.92908235e+00, -1.38055389e+00],
       [ 1.30386169e+00,  2.06540288e-01],
       [ 4.15588404e+00, -2.20170044e+00],
       [ 1.95973444e+00, -3.16122414e+00],
       [ 2.53659322e+00, -3.01127512e-01],
       [ 2.31573572e-01,  1.38496845e+00],
       [ 1.33536242e+00, -2.93688705e+00],
       [ 8.43232873e-02, -6.09709213e-02],
       [ 3.19360577e+00, -1.20601746e+00],
       [ 6.29141092e+00, -3.63708483e+00],
       [ 2.22703878e+00, -1.02304464e+00],
       [ 3.90666147e+00, -3.93133227e+00],
       [ 3.92305178e-01, -3.74731306e+00],
       [ 2.36327523e+00, -2.60689261e+00],
       [ 1.33450693e+00, -3.39210867e+00],
       [ 6.28688997e-01, -2.80856152e-01],
       [ 7.86650117e-01,  8.09147889e-01],
       [ 2.33985817e+00, -2.89769844e-01],
       [ 3.20649702e+00, -3.28685429e+00],
       [ 2.19846807e+00, -2.69772773e+00],
       [ 8.43095804e-01, -7.53622379e-01],
       [ 1.69204385e+00, -6.19324844e-01],
       [ 3.06372114e+00, -3.12419555e+00]])
In [28]:
# data generation

m = 100
w0 = -6
w = np.array([[2], [1]])
X = np.hstack([4*np.random.rand(m,1), 4*np.random.rand(m,1)])

w = np.asmatrix(w)
X = np.asmatrix(X)

y = 1/(1 + np.exp(-w0-X*w)) > 0.5

C1 = np.where(y == True)[0]
C0 = np.where(y == False)[0]

y = np.empty([m,1])
y[C1] = 1
y[C0] = 0

plt.figure(figsize = (6, 4))
plt.plot(X[C1,0], X[C1,1], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,0], X[C0,1], 'bo', alpha = 0.3, label = 'C0')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()
In [29]:
from sklearn import linear_model

clf = linear_model.LogisticRegression(solver = 'lbfgs')
clf.fit(np.asarray(X), np.ravel(y))
Out[29]:
LogisticRegression()
In [30]:
clf.coef_
Out[30]:
array([[3.21453711, 1.29943009]])
In [31]:
clf.intercept_
Out[31]:
array([-9.18960726])
In [32]:
w0 = clf.intercept_[0]
w1 = clf.coef_[0,0]
w2 = clf.coef_[0,1]

xp = np.linspace(0,4,100).reshape(-1,1)
yp = - w1/w2*xp - w0/w2

plt.figure(figsize = (6, 4))
plt.plot(X[C1,0], X[C1,1], 'ro', alpha = 0.3, label = 'C1')
plt.plot(X[C0,0], X[C0,1], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 4, label = 'Logistic Regression')
plt.title('Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.xlim([0,4])
plt.ylim([0,4])
plt.show()

4. Deep Learning Libraries

Tensorflow

Keras

5. TensorFlow

  • TensorFlow is an open-source software library for deep learning.

It is a framework for performing computations very efficiently, and it can tap into the GPU (Graphics Processing Unit) to speed things up even further. This makes a huge difference, as we shall see shortly. TensorFlow can be controlled by a simple Python API.

TensorFlow is one of the most widely used libraries for implementing machine learning and other algorithms involving a large number of mathematical operations. TensorFlow was developed by Google and is one of the most popular machine learning libraries on GitHub. Google uses TensorFlow to implement machine learning in almost all of its applications.

Tensor

TensorFlow gets its name from tensors, which are arrays of arbitrary dimensionality. A vector is a 1-d array and is known as a 1st-order tensor. A matrix is a 2-d array and a 2nd-order tensor. The "flow" part of the name refers to computation flowing through a graph. Training and inference in a neural network, for example, involves the propagation of matrix computations through many nodes in a computational graph.
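
A small example of tensors of increasing order (an assumed illustration, not tied to a later cell):

import tensorflow as tf

# 0th-, 1st-, and 2nd-order tensors
scalar = tf.constant(3.0)                    # 0-d
vector = tf.constant([1.0, 2.0, 3.0])        # 1-d array: 1st-order tensor
matrix = tf.constant([[1.0, 2.0],
                      [3.0, 4.0]])           # 2-d array: 2nd-order tensor

print(scalar.shape, vector.shape, matrix.shape)   # () (3,) (2, 2)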




5.1. TensorFlow with Gradient Tape

With eager execution enabled, TensorFlow calculates the values of tensors as they occur in your code. This means it does not precompute a static graph whose inputs are fed in through placeholders. To backpropagate errors, you therefore have to keep track of the gradients of your computation and then apply these gradients to an optimizer.

This is very different from running without eager execution, where you would build a graph and then simply use sess.run to evaluate your loss and pass it into an optimizer directly.

Fundamentally, because tensors are evaluated immediately, there is no graph from which to compute gradients, so you need a gradient tape. The tape is not merely for bookkeeping or visualization: you cannot implement gradient descent in eager mode without it.

TensorFlow could, of course, keep track of every gradient of every computation on every tf.Variable, but that would be a huge performance bottleneck. Instead, it exposes a gradient tape so that you can control which parts of your code need gradient information. In graph (non-eager) mode this can be determined statically from the parts of the graph that the loss depends on; in eager mode there is no static graph, so there is no way of knowing in advance.
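
A minimal sketch of recording a computation with tf.GradientTape and extracting a gradient in eager mode (toy numbers, not one of the notebook's numbered cells):

import tensorflow as tf

# record a computation on a tf.Variable and differentiate it
x = tf.Variable(3.0)

with tf.GradientTape() as tape:
    y = x*x + 2*x                      # dy/dx = 2x + 2

print(tape.gradient(y, x).numpy())     # 8.0 at x = 3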



5.2. TensorFlow as an Optimization Solver



$$\min_{\omega}\;\;(\omega - 4)^2$$

In [33]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline
In [34]:
w = tf.Variable(0, dtype = tf.float32)

LR = 0.05

# Training
cost_record = []
for i in range(50):
    with tf.GradientTape() as tape:
        cost = w*w - 8*w + 16
        w_grad = tape.gradient(cost, w)
    cost_record.append(cost)
    w.assign_sub(LR * w_grad)

print("\n optimal w =", w.numpy())

plt.figure(figsize = (6, 4))
plt.plot(cost_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('cost', fontsize = 15)
plt.show()
 optimal w = 3.979385

6. Machine Learning with TensorFlow

6.1. Linear Regression

$$\hat{y} = \omega x + b$$
  • Given $x$ and $y$
  • Want to estimate $\omega$ and $b$

Data generation

In [35]:
# data points in column vector [input, output]
train_x = np.array([0.1, 0.4, 0.7, 1.2, 1.3, 1.7, 2.2, 2.8, 3.0, 4.0, 4.3, 4.4, 4.9]).reshape(-1, 1)
train_y = np.array([0.5, 0.9, 1.1, 1.5, 1.5, 2.0, 2.2, 2.8, 2.7, 3.0, 3.5, 3.7, 3.9]).reshape(-1, 1)

m = train_x.shape[0]

plt.figure(figsize = (6, 4))
plt.plot(train_x, train_y, 'ko')
plt.title('Data', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()
  • Given $(x_i, y_i)$ for $i=1,\cdots, m$
$$ \hat{y}_{i} = \omega x_{i} + b \; \quad \text{ such that }\quad \min\limits_{\omega, b}\sum\limits_{i = 1}^{m} (\hat{y}_{i} - y_{i})^2$$
In [36]:
LR = 0.001
n_iter = 1000

w = tf.Variable([[0]], dtype = tf.float32)
b = tf.Variable([[0]], dtype = tf.float32)

loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        cost = tf.reduce_mean(tf.square(w*train_x + b - train_y))
        w_grad, b_grad = tape.gradient(cost, [w,b])

    loss_record.append(cost)
    w.assign_sub(LR * w_grad)
    b.assign_sub(LR * b_grad)

w_val = w.numpy()
b_val = b.numpy()
print("\n optimal w =", w_val)
print("\n optimal b =", b_val)

plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
 optimal w = [[0.74257565]]

 optimal b = [[0.41717836]]
In [37]:
xp = np.arange(0, 5, 0.01).reshape(-1, 1)
yp = w_val*xp + b_val

plt.figure(figsize = (6, 4))
plt.plot(train_x, train_y, 'ko')
plt.plot(xp, yp, 'r')
plt.title('Data', fontsize = 15)
plt.xlabel('X', fontsize = 15)
plt.ylabel('Y', fontsize = 15)
plt.axis('equal')
plt.grid(alpha = 0.3)
plt.xlim([0, 5])
plt.show()

6.2. Logistic Regression


$$ \begin{align*} \omega &= \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2\end{bmatrix}, \qquad x = \begin{bmatrix} 1 \\ x_1 \\ x_2\end{bmatrix}\\ \\ X &= \begin{bmatrix} \left(x^{(1)}\right)^T \\ \left(x^{(2)}\right)^T \\ \left(x^{(3)}\right)^T \\ \vdots\end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)} \\ \vdots & \vdots & \vdots \\\end{bmatrix}, \quad y = \begin{bmatrix} y^{(1)}\\ y^{(2)} \\y^{(3)} \\ \vdots \end{bmatrix} \end{align*} $$

In [38]:
# data generation

m = 1000
true_w = np.array([[-6], [2], [1]])
train_X = np.hstack([np.ones([m,1]), 5*np.random.rand(m,1), 4*np.random.rand(m,1)])

true_w = np.asmatrix(true_w)
train_X = np.asmatrix(train_X)

train_y = 1/(1 + np.exp(-train_X*true_w)) > 0.5

C1 = np.where(train_y == True)[0]
C0 = np.where(train_y == False)[0]

train_y = np.empty([m,1])
train_y[C1] = 1
train_y[C0] = 0

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
$$ \begin{align*} \ell(\omega) = \log \mathscr{L}(\omega) &= \sum_{i=1}^{m} y^{(i)} \log h_{\omega} \left(x^{(i)} \right) + \left(1-y^{(i)} \right) \log \left(1-h_{\omega} \left(x^{(i)} \right) \right)\\ &\Rightarrow \frac{1}{m} \sum_{i=1}^{m} y^{(i)} \log h_{\omega} \left(x^{(i)} \right) + \left(1-y^{(i)} \right) \log \left(1-h_{\omega} \left(x^{(i)} \right) \right) \end{align*} $$
In [39]:
LR = 0.05
n_iter = 1500

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_y, dtype = tf.float32)

loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.sigmoid(tf.matmul(train_x, w))
        loss = - train_y*tf.math.log(y_pred) - (1-train_y)*tf.math.log(1-y_pred)
        loss = tf.reduce_mean(loss)
        w_grad = tape.gradient(loss, w)

    loss_record.append(loss)
    w.assign_sub(LR * w_grad)

w_hat = w.numpy()
print(w_hat)

plt.figure(figsize = (6, 4))
plt.plot(loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
[[-4.096146  ]
 [ 1.7331805 ]
 [ 0.47786987]]
In [40]:
xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()

TensorFlow built-in loss functions

In [41]:
LR = 0.05
n_iter = 1500

w = tf.Variable([[0],[0],[0]], dtype = tf.float32)
train_x = tf.constant(train_X, dtype = tf.float32)
train_y = tf.constant(train_y, dtype = tf.float32)

loss_record = []
for i in range(n_iter):
    with tf.GradientTape() as tape:
        y_pred = tf.matmul(train_x,w)
        # note: this loss is a per-sample vector (no tf.reduce_mean is applied),
        # so tape.gradient below effectively differentiates the summed loss
        loss = tf.nn.sigmoid_cross_entropy_with_logits(labels = train_y, logits = y_pred)
        w_grad = tape.gradient(loss, w)

    loss_record.append(loss)
    w.assign_sub(LR * w_grad)

w_hat = w.numpy()
print(w_hat)

xp = np.arange(0, 4, 0.01).reshape(-1, 1)
yp = - w_hat[1,0]/w_hat[2,0]*xp - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize = (6, 4))
plt.plot(train_X[C1,1], train_X[C1,2], 'ro', alpha = 0.3, label = 'C1')
plt.plot(train_X[C0,1], train_X[C0,2], 'bo', alpha = 0.3, label = 'C0')
plt.plot(xp, yp, 'g', linewidth = 3, label = 'Logistic Regression')
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.axis('equal')
plt.ylim([0,4])
plt.show()
[[-132.96732 ]
 [  44.290165]
 [  22.316422]]
In [42]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')