Artificial Neural Networks (ANN)

By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

1. Recall Perceptron¶

Perceptron

XOR Problem

Minsky-Papert Controversy on XOR
- not linearly separable
- limitation of perceptron

$x_1$	$x_2$	$x_1$ XOR $x_2$
0	0	0
0	1	1
1	0	1
1	1	0

2. From Perceptron to Multi-Layer Perceptron (MLP)¶

2.1. Perceptron for $h_{\omega}(x)$¶

Neurons compute the weighted sum of their inputs
A neuron is activated or fired when the sum $a$ is positive

$$ \begin{align*} a &= \omega_0 + \omega_1 x_1 + \omega_2 x_2 \\ \\ \hat{y} &= g(a) = \begin{cases} 1 & a > 0\\ 0 & \text{otherwise} \end{cases} \end{align*} $$

A step function is not differentiable

One layer is often not enough
- One hyperplane

2.2. Multi-layer Perceptron = Artificial Neural Networks (ANN)¶

Multi-neurons

Differentiable activation function

In a compact representation

Multi-layer perceptron

2.3. Another Perspective: ANN as Kernel Learning¶

We can represent this “neuron” as follows:

The main weakness of linear predictors is their lack of capacity. For classiﬁcation, the populations have to be linearly separable.
The XOR example can be solved by pre-processing the data to make the two populations linearly separable.

3. Logistic Regression in a Form of Neural Network¶

$$y^{(i)} \in \{1,0\}$$$$y = \sigma \,(\omega_0 + \omega_1 x_1 + \omega_2 x_2)$$

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

%matplotlib inline

#training data gerneration
m = 1000
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4

g = 0.8*x1 + x2 - 3

C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

train_X = np.vstack([X1, X0])
train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])

train_X = np.asmatrix(train_X)
train_y = np.asmatrix(train_y)

plt.figure(figsize = (10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.show()

LogisticRegression = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_dim = 2, 
                          units = 1, 
                          activation = 'sigmoid')
])

LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1),
                           loss = 'binary_crossentropy')

loss = LogisticRegression.fit(train_X, train_y, epochs = 10)

Epoch 1/10
32/32 [==============================] - 0s 772us/step - loss: 0.4029
Epoch 2/10
32/32 [==============================] - 0s 611us/step - loss: 0.2164
Epoch 3/10
32/32 [==============================] - 0s 611us/step - loss: 0.1679
Epoch 4/10
32/32 [==============================] - 0s 611us/step - loss: 0.1425
Epoch 5/10
32/32 [==============================] - 0s 579us/step - loss: 0.1286
Epoch 6/10
32/32 [==============================] - 0s 611us/step - loss: 0.1171
Epoch 7/10
32/32 [==============================] - 0s 643us/step - loss: 0.1086
Epoch 8/10
32/32 [==============================] - 0s 611us/step - loss: 0.1024
Epoch 9/10
32/32 [==============================] - 0s 1ms/step - loss: 0.1024
Epoch 10/10
32/32 [==============================] - 0s 1ms/step - loss: 0.0933

w = LogisticRegression.layers[0].get_weights()[0]
b = LogisticRegression.layers[0].get_weights()[1]
print(w)
print(b)

[[2.2291846]
 [2.8796294]]
[-8.040295]

x1p = np.arange(0, 8, 0.01).reshape(-1, 1)
x2p = - w[0,0]/w[1,0]*x1p - b[0]/w[1,0]

plt.figure(figsize = (10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'g', linewidth = 3, label = '')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.show()

4. Looking at Parameters¶

To understand network's behavior

$$y = \sigma \,(b + \omega_1 x_1 + \omega_2 x_2)$$

# training data gerneration

m = 1000
x1 = 10*np.random.rand(m, 1) - 5
x2 = 8*np.random.rand(m, 1) - 4

g = - 0.5*(x1-1)**2 + 2*x2 + 5

C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

train_X = np.vstack([X1, X0])
train_X = np.asmatrix(train_X)

train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])

plt.figure(figsize = (10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()

LogisticRegression = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_dim = 2, units = 2, activation = 'sigmoid'),
    tf.keras.layers.Dense(units = 1, activation = 'sigmoid')
])

LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1),
                           loss = 'binary_crossentropy')

loss = LogisticRegression.fit(train_X, train_y, epochs = 10)

Epoch 1/10
32/32 [==============================] - 3s 2ms/step - loss: 0.5628
Epoch 2/10
32/32 [==============================] - 0s 3ms/step - loss: 0.3563
Epoch 3/10
32/32 [==============================] - 0s 3ms/step - loss: 0.2601
Epoch 4/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1931
Epoch 5/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1578
Epoch 6/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1324
Epoch 7/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1221
Epoch 8/10
32/32 [==============================] - 0s 3ms/step - loss: 0.1043
Epoch 9/10
32/32 [==============================] - 0s 3ms/step - loss: 0.1082
Epoch 10/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1004

w1 = LogisticRegression.layers[0].get_weights()[0]
b1 = LogisticRegression.layers[0].get_weights()[1]

w2 = LogisticRegression.layers[1].get_weights()[0]
b2 = LogisticRegression.layers[1].get_weights()[1]

H = train_X*w1 + b1
H = 1/(1 + np.exp(-H))

plt.figure(figsize = (10, 8))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()

x1p = np.arange(0, 1, 0.01).reshape(-1, 1)
x2p = - w2[0,0]/w2[1,0]*x1p - b2[0]/w2[1,0]

plt.figure(figsize = (10, 8))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()

x1p = np.arange(-5, 5, 0.01).reshape(-1, 1)
x2p = - w1[0,0]/w1[1,0]*x1p - b1[0]/w1[1,0]
x3p = - w1[0,1]/w1[1,1]*x1p - b1[1]/w1[1,1]

plt.figure(figsize = (10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.plot(x1p, x3p, 'g', linewidth = 3, label = '')
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()

5. Artificial Neural Networks¶

Complex/Nonlinear universal function approximator
- Linearly connected networks
- Simple nonlinear neurons

Hidden layers
- Autonomous feature learning

5.1. Training Neural Networks¶

$=$ Learning or estimating weights and biases of multi-layer perceptron from training data

Loss Function

Measures error between target values and predictions

$$ \min_{\omega} \sum_{i=1}^{m}\ell\left( h_{\omega}\left(x^{(i)}\right),y^{(i)}\right)$$

Example
- Squared loss (for regression): $$ \frac{1}{m} \sum_{i=1}^{m} \left(h_{\omega}\left(x^{(i)}\right) - y^{(i)}\right)^2 $$
- Cross entropy (for classification): $$ -\frac{1}{m}\sum_{i=1}^{m}y^{(i)}\log\left(h_{\omega}\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_{\omega}\left(x^{(i)}\right)\right)$$

Learning

Learning weights and biases from data using gradient descent

$$\omega \Leftarrow \omega - \alpha \nabla_{\omega} \ell \left( h_{\omega}\left(x^{(i)}\right), y^{(i)} \right)$$

Backpropagation

Forward propagation
- the initial information propagates up to the hidden units at each layer and finally produces output

Backpropagation
- allows the information from the cost to flow backwards through the network in order to compute the gradients

6. ANN with MNIST¶

MNIST (Mixed National Institute of Standards and Technology database) database
- Handwritten digit database
- $28 \times 28$ gray scaled image

We will be using MNIST to create a Multinomial Classifier that can detect if the MNIST image shown is a member of class 0,1,2,3,4,5,6,7,8 or 9. Susinctly, we're teaching a computer to recognize hand written digets.

Let's download and load the dataset.

mnist = tf.keras.datasets.mnist

(train_x, train_y), (test_x, test_y) = mnist.load_data()

train_x, test_x = train_x/255.0, test_x/255.0

print ("The training data set is:\n")
print (train_x.shape)
print (train_y.shape)

The training data set is:

(60000, 28, 28)
(60000,)

print ("The test data set is:")
print (test_x.shape)
print (test_y.shape)

The test data set is:
(10000, 28, 28)
(10000,)

Let's visualize what some of these images and their corresponding training labels look like.

print('label :', train_y[0])

plt.figure(figsize = (6,6))
plt.imshow(train_x[0], 'gray')
plt.xticks([])
plt.yticks([])
plt.show()

label : 5

Our network model

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape = (28, 28)),
    tf.keras.layers.Dense(units = 100, activation = 'relu'),
    tf.keras.layers.Dense(units = 10, activation = 'softmax')
])

model.compile(optimizer = 'adam',
              loss = 'sparse_categorical_crossentropy',
              metrics = ['accuracy'])

loss = model.fit(train_x, train_y, epochs = 5)

Epoch 1/5
1875/1875 [==============================] - 11s 4ms/step - loss: 0.2779 - accuracy: 0.9214
Epoch 2/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.1262 - accuracy: 0.9632
Epoch 3/5
1875/1875 [==============================] - 8s 4ms/step - loss: 0.0886 - accuracy: 0.9735
Epoch 4/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.0676 - accuracy: 0.9798
Epoch 5/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.0541 - accuracy: 0.9838

test_loss, test_acc = model.evaluate(test_x, test_y)

313/313 [==============================] - 2s 3ms/step - loss: 0.0850 - accuracy: 0.9746

test_img = test_x[[1495]]

predict = model.predict(test_img)
mypred = np.argmax(predict, axis = 1)

plt.figure(figsize = (12,5))

plt.subplot(1,2,1)
plt.imshow(test_img.reshape(28, 28), 'gray')
plt.axis('off')
plt.subplot(1,2,2)
plt.stem(predict[0])
plt.show()

print('Prediction : {}'.format(mypred[0]))

Prediction : 3

7. Video Lectures¶

%%html
<center><iframe src="https://www.youtube.com/embed/blDtzUuJtiE?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>

%%html
<center><iframe src="https://www.youtube.com/embed/6O_WHmBUff4?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>

%%html
<center><iframe src="https://www.youtube.com/embed/DZgihzTgVQ8?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>

%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')