Artificial Neural Networks (ANN)

Table of Contents

  1. Recall Perceptron
  2. From Perceptron to Multi-Layer Perceptron (MLP)
  3. Logistic Regression in a Form of Neural Network
  4. Looking at Parameters
  5. Artificial Neural Networks
  6. ANN Learning
  7. ANN with MNIST

1. Recall Perceptron

Perceptron


XOR Problem

  • Minsky-Papert Controversy on XOR
    • XOR is not linearly separable
    • This exposed a fundamental limitation of the (single-layer) perceptron





2. From Perceptron to Multi-Layer Perceptron (MLP)

2.1. Perceptron for $h_{\omega}(x)$

  • Neurons compute the weighted sum of their inputs

  • A neuron is activated or fired when the sum $a$ is positive


$$ \begin{align*} a &= \omega_0 + \omega_1 x_1 + \omega_2 x_2 \\ \\ \hat{y} &= g(a) = \begin{cases} 1 & a > 0\\ 0 & \text{otherwise} \end{cases} \end{align*} $$
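A minimal NumPy sketch of this computation (the input and weight values below are made up for illustration):

In [ ]:
import numpy as np

def perceptron(x, w):
    # x = [x1, x2], w = [w0, w1, w2] with w0 acting as the bias
    a = w[0] + w[1]*x[0] + w[2]*x[1]   # weighted sum of the inputs
    return 1 if a > 0 else 0           # step activation g(a)

# fires (returns 1) only when the weighted sum is positive
print(perceptron([1.0, 2.0], [-3.0, 0.8, 1.0]))   # a = -3 + 0.8 + 2.0 = -0.2, so output 0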




  • A step function is not differentiable


  • One layer is often not enough
    • A single layer provides only one hyperplane (one linear decision boundary)

2.2. Multi-layer Perceptron = Artificial Neural Networks (ANN)

Multiple neurons





Differentiable activation function
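For example, the sigmoid used throughout this notebook is smooth and has a simple derivative:

$$ g(a) = \sigma(a) = \frac{1}{1+e^{-a}}, \qquad \sigma'(a) = \sigma(a)\left(1-\sigma(a)\right) $$

(Other common choices include $\tanh$ and ReLU.)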






In a compact representation
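As a sketch of what this compact form typically looks like for a single hidden layer with sigmoid activations (using $\omega^{(1)}, b^{(1)}$ for the hidden layer and $\omega^{(2)}, b^{(2)}$ for the output layer, matching the two-layer models used later):

$$ z = \sigma\left(\omega^{(1)\top} x + b^{(1)}\right), \qquad \hat{y} = \sigma\left(\omega^{(2)\top} z + b^{(2)}\right) $$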





Multi-layer perceptron



2.3. Another Perspective: ANN as Kernel Learning

We can represent this “neuron” as follows:

  • The main weakness of linear predictors is their lack of capacity. For classification, the populations have to be linearly separable.

  • The XOR example can be solved by pre-processing the data to make the two populations linearly separable (a small sketch of one such pre-processing step follows below).
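As an illustrative sketch (not part of the original notes), one such pre-processing is to append the product feature $x_1 x_2$; in the lifted space the XOR classes become linearly separable:

In [ ]:
import numpy as np

# XOR data: the label is 1 when exactly one input is 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# pre-processing: append the product feature x1*x2
Phi = np.hstack([X, (X[:, 0]*X[:, 1]).reshape(-1, 1)])

# in the lifted space a single hyperplane separates the classes,
# e.g. w = [1, 1, -2] with bias -0.5 (hand-picked for illustration)
w = np.array([1.0, 1.0, -2.0])
b = -0.5
print((Phi @ w + b > 0).astype(int))   # [0 1 1 0], matching y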





3. Logistic Regression in a Form of Neural Network


$$y^{(i)} \in \{1,0\}$$

$$y = \sigma \,(\omega_0 + \omega_1 x_1 + \omega_2 x_2)$$




In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

%matplotlib inline
In [ ]:
# training data generation
m = 1000
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4

g = 0.8*x1 + x2 - 3

C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

train_X = np.vstack([X1, X0])
train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])

train_X = np.asmatrix(train_X)
train_y = np.asmatrix(train_y)

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.show()
[Output: scatter plot of the generated training data, class C1 (red) and class C0 (blue)]
In [ ]:
LogisticRegression = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_dim = 2,
                          units = 1,
                          activation = 'sigmoid')
])
In [ ]:
LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1),
                           loss = 'binary_crossentropy')
In [ ]:
loss = LogisticRegression.fit(train_X, train_y, epochs = 10)
Epoch 1/10
32/32 [==============================] - 1s 2ms/step - loss: 0.5763
Epoch 2/10
32/32 [==============================] - 0s 2ms/step - loss: 0.2164
Epoch 3/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1702
Epoch 4/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1468
Epoch 5/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1293
Epoch 6/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1165
Epoch 7/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1085
Epoch 8/10
32/32 [==============================] - 0s 2ms/step - loss: 0.0994
Epoch 9/10
32/32 [==============================] - 0s 2ms/step - loss: 0.0966
Epoch 10/10
32/32 [==============================] - 0s 2ms/step - loss: 0.0919
In [ ]:
w = LogisticRegression.layers[0].get_weights()[0]
b = LogisticRegression.layers[0].get_weights()[1]
print(w)
print(b)
[[1.8633064]
 [2.4650373]]
[-6.445767]
In [ ]:
x1p = np.arange(0, 8, 0.01).reshape(-1, 1)
x2p = - w[0,0]/w[1,0]*x1p - b[0]/w[1,0]

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'g', linewidth = 3, label = '')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.show()
[Output: training data with the learned linear decision boundary (green line)]

4. Looking at Parameters

  • To understand the network's behavior

$$y = \sigma \,(b + \omega_1 x_1 + \omega_2 x_2)$$




In [ ]:
# training data generation

m = 1000
x1 = 10*np.random.rand(m, 1) - 5
x2 = 8*np.random.rand(m, 1) - 4

g = - 0.5*(x1-1)**2 + 2*x2 + 5

C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

train_X = np.vstack([X1, X0])
train_X = np.asmatrix(train_X)

train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()
[Output: scatter plot of the nonlinearly separable training data, class C1 (red) and class C0 (blue)]




In [ ]:
LogisticRegression = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_dim = 2, units = 2, activation = 'sigmoid'),
    tf.keras.layers.Dense(units = 1, activation = 'sigmoid')
])
In [ ]:
LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1),
                           loss = 'binary_crossentropy')
In [ ]:
loss = LogisticRegression.fit(train_X, train_y, epochs = 10)
Epoch 1/10
32/32 [==============================] - 1s 2ms/step - loss: 0.4869
Epoch 2/10
32/32 [==============================] - 0s 2ms/step - loss: 0.4197
Epoch 3/10
32/32 [==============================] - 0s 2ms/step - loss: 0.4166
Epoch 4/10
32/32 [==============================] - 0s 2ms/step - loss: 0.3905
Epoch 5/10
32/32 [==============================] - 0s 2ms/step - loss: 0.3501
Epoch 6/10
32/32 [==============================] - 0s 2ms/step - loss: 0.2803
Epoch 7/10
32/32 [==============================] - 0s 2ms/step - loss: 0.2313
Epoch 8/10
32/32 [==============================] - 0s 2ms/step - loss: 0.2005
Epoch 9/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1761
Epoch 10/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1684
In [ ]:
w1 = LogisticRegression.layers[0].get_weights()[0]
b1 = LogisticRegression.layers[0].get_weights()[1]

w2 = LogisticRegression.layers[1].get_weights()[0]
b2 = LogisticRegression.layers[1].get_weights()[1]
In [ ]:
H = train_X*w1 + b1
H = 1/(1 + np.exp(-H))

plt.figure(figsize = (6, 4))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()
[Output: training data mapped to the hidden-layer feature space $(z_1, z_2)$]
In [ ]:
x1p = np.arange(0, 1, 0.01).reshape(-1, 1)
x2p = - w2[0,0]/w2[1,0]*x1p - b2[0]/w2[1,0]

plt.figure(figsize = (6, 4))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()
[Output: hidden-layer feature space with the output layer's linear decision boundary]
In [ ]:
x1p = np.arange(-5, 5, 0.01).reshape(-1, 1)
x2p = - w1[0,0]/w1[1,0]*x1p - b1[0]/w1[1,0]
x3p = - w1[0,1]/w1[1,1]*x1p - b1[1]/w1[1,1]

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.plot(x1p, x3p, 'g', linewidth = 3, label = '')
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()
[Output: input space with the two hyperplanes learned by the hidden units]

5. Artificial Neural Networks

  • Complex/Nonlinear universal function approximator

    • Linearly connected networks
    • Simple nonlinear neurons
  • Hidden layers

    • Autonomous feature learning


6. ANN Learning

6.1. Recursive Algorithm

  • One of the central ideas of computer science

  • Depends on solutions to smaller instances of the same problem ( = subproblem)

  • A function calls itself (something that is impossible in the real, physical world)



In [ ]:
%%html
<center><iframe src="https://www.youtube.com/embed/t4MSwiqfLaY?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
  • Factorial example

$$n ! = n \cdot (n-1) \cdots 2 \cdot 1$$
In [ ]:
n = 5

m = 1
for i in range(n):
    m = m*(i+1)

print(m)
120
In [ ]:
def fac(n):
    if n <= 1:      # base case (also guards against n = 0)
        return 1
    else:
        return n*fac(n-1)
In [ ]:
# recursive

fac(5)
Out[ ]:
120

6.2. Dynamic Programming

  • Dynamic Programming: general, powerful algorithm design technique

  • Fibonacci numbers:

In [ ]:
# naive Fibonacci

def fib(n):
    if n <= 2:
        return 1
    else:
        return fib(n-1) + fib(n-2)
In [ ]:
fib(10)
Out[ ]:
55
In [ ]:
# Memoized DP Fibonacci

def mfib(n):
    global memo

    if memo[n-1] != 0:
        return memo[n-1]
    elif n <= 2:
        memo[n-1] = 1
        return memo[n-1]
    else:
        memo[n-1] = mfib(n-1) + mfib(n-2)
        return memo[n-1]
In [ ]:
import numpy as np

n = 10
memo = np.zeros(n)
mfib(n)
Out[ ]:
55.0
In [ ]:
n = 30
%timeit fib(30)
332 ms ± 110 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [ ]:
memo = np.zeros(n)
%timeit mfib(30)
389 ns ± 5.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

6.3. Training Neural Networks

$=$ Learning or estimating the weights and biases of a multi-layer perceptron from training data

6.3.1. Optimization

3 key components

  1. objective function $f(\cdot)$
  2. decision variable or unknown $\omega$
  3. constraints $g(\cdot)$

In mathematical expression


$$\begin{align*} \min_{\omega} \quad &f(\omega) \end{align*} $$

6.3.2. Loss Function

  • Measures error between target values and predictions

$$ \min_{\omega} \sum_{i=1}^{m}\ell\left( h_{\omega}\left(x^{(i)}\right),y^{(i)}\right)$$
  • Example (a small numerical sketch follows below)
    • Squared loss (for regression): $$ \frac{1}{m} \sum_{i=1}^{m} \left(h_{\omega}\left(x^{(i)}\right) - y^{(i)}\right)^2 $$
    • Cross entropy (for classification): $$ -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_{\omega}\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_{\omega}\left(x^{(i)}\right)\right)\right]$$
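A small NumPy sketch of both losses on made-up predictions (the targets and predictions below are illustrative only):

In [ ]:
import numpy as np

y    = np.array([1, 0, 1, 1])           # targets y^(i)
yhat = np.array([0.9, 0.2, 0.7, 0.6])   # hypothetical predictions h_w(x^(i))

squared_loss  = np.mean((yhat - y)**2)
cross_entropy = -np.mean(y*np.log(yhat) + (1 - y)*np.log(1 - yhat))

print(squared_loss, cross_entropy)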

6.3.3. Learning

Learning weights and biases from data using gradient descent


$$\omega \leftarrow \omega - \alpha \nabla_{\omega} \ell \left( h_{\omega}\left(x^{(i)}\right), y^{(i)} \right)$$

  • $\frac{\partial \ell}{\partial \omega}$: computing this naively for every weight $\omega$ requires too many computations

  • Structural constraints of NN:

    • Composition of functions
    • Chain rule
    • Dynamic programming

Backpropagation

  • Forward propagation

    • the initial information propagates up to the hidden units at each layer and finally produces output
  • Backpropagation

    • allows the information from the cost to flow backwards through the network in order to compute the gradients
  • Chain Rule

    • Computing the derivative of the composition of functions

      • $\left(f(g(x))\right)' = f'(g(x))\,g'(x)$

      • ${dz \over dx} = {dz \over dy} \cdot {dy \over dx}$

      • ${dz \over dw} = \left({dz \over dy} \cdot {dy \over dx}\right) \cdot {dx \over dw}$

      • ${dz \over du} = \left({dz \over dy} \cdot {dy \over dx} \cdot {dx \over dw}\right) \cdot {dw \over du}$

  • Backpropagation

    • Update weights recursively with memory



Optimization procedure


  • In general, it is not easy to compute the gradients of a network numerically by hand.
    • The good news: people have already done all the "hard work" of developing numerical solvers (or libraries) with automatic differentiation.
    • There is a wide range of tools; here we use TensorFlow (see the short sketch below).
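A minimal sketch of one gradient-descent step with TensorFlow's automatic differentiation (`tf.GradientTape`); the toy data, initial weights, and learning rate below are illustrative, not from the notes:

In [ ]:
import tensorflow as tf

# toy data: two 2-D points with binary labels
x = tf.constant([[1.0, 2.0], [3.0, -1.0]])
y = tf.constant([[1.0], [0.0]])

w = tf.Variable(tf.zeros([2, 1]))
b = tf.Variable(tf.zeros([1]))
alpha = 0.1

with tf.GradientTape() as tape:
    yhat = tf.sigmoid(tf.matmul(x, w) + b)     # forward propagation
    loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y, yhat))

grads = tape.gradient(loss, [w, b])            # backpropagation (chain rule)
w.assign_sub(alpha*grads[0])                   # gradient descent update
b.assign_sub(alpha*grads[1])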

Summary

  • Learning weights and biases from data using gradient descent

6.4. Other Tutorials

In [ ]:
%%html
<center><iframe src="https://www.youtube.com/embed/aircAruvnKk?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [ ]:
%%html
<center><iframe src="https://www.youtube.com/embed/IHZwWFHWa-w?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [ ]:
%%html
<center><iframe src="https://www.youtube.com/embed/Ilg3gGewQ5U?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [ ]:
%%html
<center><iframe src="https://www.youtube.com/embed/tIeHLnjs5U8?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

7. ANN with MNIST

7.1. What's an MNIST?

From Wikipedia

  • The MNIST database (Mixed National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, NIST's complete dataset was too hard.

  • MNIST (Mixed National Institute of Standards and Technology database) database

    • Handwritten digit database
    • $28 \times 28$ gray scaled image
    • The $28 \times 28$ matrix can be flattened into a vector of $28 \times 28 = 784$ entries $\rightarrow$ this manual flattening is not needed in TensorFlow 2 (a `Flatten` layer handles it)



More here

We will be using MNIST to create a multinomial classifier that can detect whether a given MNIST image belongs to class 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. Succinctly, we are teaching a computer to recognize handwritten digits.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline

Let's download and load the dataset.

In [ ]:
mnist = tf.keras.datasets.mnist

(train_x, train_y), (test_x, test_y) = mnist.load_data()

train_x, test_x = train_x/255.0, test_x/255.0
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 0s 0us/step
In [ ]:
print ("The training data set is:\n")
print (train_x.shape)
print (train_y.shape)
The training data set is:

(60000, 28, 28)
(60000,)
In [ ]:
print ("The test data set is:")
print (test_x.shape)
print (test_y.shape)
The test data set is:
(10000, 28, 28)
(10000,)

Take a look at a sample from it:

In [ ]:
# So now we have a 28x28 matrix, where each element is an intensity level from 0 to 1.

img = train_x[5]
img.shape
Out[ ]:
(28, 28)

Let's visualize what some of these images and their corresponding training labels look like.

In [ ]:
plt.figure(figsize = (4, 4))
plt.imshow(img, 'gray')
plt.xticks([])
plt.yticks([])
plt.show()
[Output: image of the sample digit train_x[5]]
In [ ]:
train_y[5]
Out[ ]:
2

7.2. ANN with TensorFlow

  • Feed a grayscale image to the ANN (a minimal model sketch follows below)
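A minimal sketch of such a model (the hidden-layer size and optimizer below are one reasonable choice, not necessarily the ones used in the rest of the notes):

In [ ]:
import tensorflow as tf

# flatten the 28x28 grayscale image, pass it through one hidden layer,
# and output a probability for each of the 10 digit classes
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape = (28, 28)),
    tf.keras.layers.Dense(units = 100, activation = 'relu'),
    tf.keras.layers.Dense(units = 10, activation = 'softmax')
])

model.compile(optimizer = 'adam',
              loss = 'sparse_categorical_crossentropy',
              metrics = ['accuracy'])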