Artificial Neural Networks (ANN)

By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

0. Video Lectures

%%html 
<center><iframe src="https://www.youtube.com/embed/blDtzUuJtiE?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

%%html 
<center><iframe src="https://www.youtube.com/embed/6O_WHmBUff4?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

%%html 
<center><iframe src="https://www.youtube.com/embed/DZgihzTgVQ8?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

1. Recall Perceptron

Perceptron

XOR Problem

Minsky-Papert Controversy on XOR
- not linearly separable
- limitation of perceptron

| $x_1$ | $x_2$ | $x_1$ XOR $x_2$ |
|:-:|:-:|:-:|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
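The non-separability of XOR can be checked numerically. The sketch below is an illustration rather than part of the lecture code: it brute-forces a coarse grid of perceptron weights (the grid range and step are arbitrary assumptions) and records the best accuracy that a single linear threshold unit reaches on the four XOR points.

```python
import itertools
import numpy as np

# XOR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Assumed search grid for the weights (w0, w1, w2)
grid = np.linspace(-2, 2, 21)

best_acc = 0.0
for w0, w1, w2 in itertools.product(grid, repeat=3):
    # Single perceptron: fire when the weighted sum is positive
    y_hat = (w0 + w1 * X[:, 0] + w2 * X[:, 1] > 0).astype(int)
    best_acc = max(best_acc, np.mean(y_hat == y))

print('best accuracy of a single perceptron on XOR:', best_acc)
```

No weight setting on the grid classifies all four points correctly, which is what one expects when the two classes cannot be split by one line.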

2. From Perceptron to Multi-Layer Perceptron (MLP)

2.1. Perceptron for $h_{\omega}(x)$

Neurons compute the weighted sum of their inputs
A neuron is activated or fired when the sum $a$ is positive

$$ \begin{align*} a &= \omega_0 + \omega_1 x_1 + \omega_2 x_2 \\ \\ \hat{y} &= g(a) = \begin{cases} 1 & a > 0\\ 0 & \text{otherwise} \end{cases} \end{align*} $$
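As a minimal sketch of the two equations above, the following NumPy function computes the weighted sum and applies the step activation. The weight vectors `w_and` and `w_or` are hand-picked assumptions used only to show that a single unit can realize linearly separable functions such as AND and OR.

```python
import numpy as np

def perceptron(x, w):
    """Step-activated unit: returns 1 if w0 + w1*x1 + w2*x2 > 0, else 0."""
    a = w[0] + x @ w[1:]
    return (a > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hand-picked (assumed) weights [w0, w1, w2]
w_and = np.array([-1.5, 1.0, 1.0])
w_or = np.array([-0.5, 1.0, 1.0])

print('AND:', perceptron(X, w_and))
print('OR :', perceptron(X, w_or))
```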

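A step function is not differentiable at $a = 0$ and has zero gradient everywhere else, so it gives gradient-based learning no signal to follow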

One layer is often not enough
- One hyperplane

2.2. Multi-layer Perceptron = Artificial Neural Networks (ANN)

Multi-neurons

Differentiable activation function
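One common differentiable choice is the sigmoid, which is also the activation used in the Keras examples later in this notebook. The sketch below assumes that choice and checks the closed-form derivative $\sigma'(a) = \sigma(a)\,(1 - \sigma(a))$ against a central finite difference; the step size `eps` is an arbitrary assumption.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(a):
    # Closed-form derivative of the sigmoid
    s = sigmoid(a)
    return s * (1.0 - s)

a = np.linspace(-5, 5, 11)
eps = 1e-5  # assumed finite-difference step

# Central difference approximation of the derivative
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)

print('max abs difference:', np.max(np.abs(numeric - sigmoid_grad(a))))
```

Unlike the step function, the gradient is nonzero for every finite input, which is what makes gradient-based training possible.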

In a compact representation

Multi-layer perceptron
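In matrix form, a network with one hidden layer can be written as $H = \sigma(XW_1 + b_1)$ followed by $\hat{y} = \sigma(HW_2 + b_2)$. The sketch below is a hedged illustration of that forward pass in NumPy. The weights are hand-picked assumptions (not learned) so that the two hidden units behave roughly like OR and NAND, and the output unit like AND, which together reproduce XOR.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(X, W1, b1, W2, b2):
    H = sigmoid(X @ W1 + b1)      # hidden layer activations
    return sigmoid(H @ W2 + b2)   # output layer activation

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hand-picked weights (assumed for illustration, not learned)
W1 = np.array([[20.0, -20.0],
               [20.0, -20.0]])    # column 0 ~ OR, column 1 ~ NAND
b1 = np.array([-10.0, 30.0])
W2 = np.array([[20.0], [20.0]])   # output ~ AND of the two hidden units
b2 = np.array([-30.0])

y_hat = mlp_forward(X, W1, b1, W2, b2)
print(np.round(y_hat).ravel())
```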

2.3. Another Perspective: ANN as Kernel Learning

We can represent this “neuron” as follows:

The main weakness of linear predictors is their lack of capacity. For classification, the populations have to be linearly separable.
The XOR example can be solved by pre-processing the data to make the two populations linearly separable.
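A small sketch of such pre-processing is given here. The feature map (appending the product $x_1 x_2$) and the linear weights are assumptions chosen for illustration; other feature maps would work as well. In the augmented space a single linear threshold separates the XOR classes.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def feature_map(X):
    # Assumed pre-processing: append the product x1*x2 as a third feature
    return np.hstack([X, X[:, [0]] * X[:, [1]]])

Z = feature_map(X)

# Hand-picked (assumed) linear classifier in the new feature space
w = np.array([1.0, 1.0, -2.0])
b = -0.5
y_hat = (Z @ w + b > 0).astype(int)

print('features:\n', Z)
print('prediction:', y_hat)
```

A neural network can be read as learning a feature map of this kind in its hidden layers instead of having it designed by hand.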

3. Logistic Regression in a Form of Neural Network

$$y^{(i)} \in \{1,0\}$$

$$y = \sigma \,(\omega_0 + \omega_1 x_1 + \omega_2 x_2)$$

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

%matplotlib inline

# training data generation
m = 1000
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4

g = 0.8*x1 + x2 - 3

C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M

# X1 = np.hstack([np.ones([N,1]), x1[C1], x2[C1]])
# X0 = np.hstack([np.ones([M,1]), x1[C0], x2[C0]])

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

train_X = np.vstack([X1, X0])
train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])

train_X = np.asmatrix(train_X)
train_y = np.asmatrix(train_y)

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.show()

LogisticRegression = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units = 1, input_dim = 2, activation = 'sigmoid')
])

LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1), 
                           loss = 'binary_crossentropy')

loss = LogisticRegression.fit(train_X, train_y, epochs = 10)

Epoch 1/10
32/32 [==============================] - 0s 2ms/step - loss: 0.3057
Epoch 2/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1881
Epoch 3/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1445
Epoch 4/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1269
Epoch 5/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1106
Epoch 6/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1010
Epoch 7/10
32/32 [==============================] - 0s 2ms/step - loss: 0.0912
Epoch 8/10
32/32 [==============================] - 0s 2ms/step - loss: 0.0869
Epoch 9/10
32/32 [==============================] - 0s 2ms/step - loss: 0.0799
Epoch 10/10
32/32 [==============================] - 0s 2ms/step - loss: 0.0787

w = LogisticRegression.layers[0].get_weights()[0]
b = LogisticRegression.layers[0].get_weights()[1]
print(w)
print(b)

[[2.4437416]
 [3.1867967]]
[-8.845137]

x1p = np.arange(0, 8, 0.01).reshape(-1, 1)
x2p = - w[0,0]/w[1,0]*x1p - b[0]/w[1,0]

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'g', linewidth = 3, label = '')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.show()

4. Looking at Parameters

To understand the network's behavior, we inspect the learned weights and biases of each layer

$$y = \sigma \,(b + \omega_1 x_1 + \omega_2 x_2)$$

# training data generation

m = 1000
x1 = 10*np.random.rand(m, 1) - 5
x2 = 8*np.random.rand(m, 1) - 4

g = - 0.5*(x1-1)**2 + 2*x2 + 5

C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

train_X = np.vstack([X1, X0])
train_X = np.asmatrix(train_X)

train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])
# ohe = OneHotEncoder(handle_unknown='ignore')
# train_y = ohe.fit_transform(train_y).toarray()

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()

LogisticRegression = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units = 2, input_dim = 2, activation = 'sigmoid'),
    tf.keras.layers.Dense(units = 1, activation = 'sigmoid')
])

LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1), 
                           loss = 'binary_crossentropy')

loss = LogisticRegression.fit(train_X, train_y, epochs = 10)

Epoch 1/10
32/32 [==============================] - 0s 994us/step - loss: 0.4317
Epoch 2/10
32/32 [==============================] - 0s 2ms/step - loss: 0.3193
Epoch 3/10
32/32 [==============================] - 0s 1ms/step - loss: 0.2465
Epoch 4/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1770
Epoch 5/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1389
Epoch 6/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1127
Epoch 7/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1060
Epoch 8/10
32/32 [==============================] - 0s 2ms/step - loss: 0.0926
Epoch 9/10
32/32 [==============================] - 0s 2ms/step - loss: 0.0841
Epoch 10/10
32/32 [==============================] - 0s 2ms/step - loss: 0.0918

w1 = LogisticRegression.layers[0].get_weights()[0]
b1 = LogisticRegression.layers[0].get_weights()[1]
print(w1)
print(b1)

[[-3.0046053 -1.3579749]
 [-1.8645371  1.7840072]]
[-6.007547   5.3945694]

w2 = LogisticRegression.layers[1].get_weights()[0]
b2 = LogisticRegression.layers[1].get_weights()[1]
print(w2)
print(b2)

[[-9.982354]
 [ 8.395883]]
[-2.9274793]

H = train_X*w1 + b1
H = 1/(1 + np.exp(-H))

plt.figure(figsize=(10, 8))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()

x1p = np.arange(0, 1, 0.01).reshape(-1, 1)
x2p = - w2[0,0]/w2[1,0]*x1p - b2[0]/w2[1,0]

plt.figure(figsize=(10, 8))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()

x1p = np.arange(-5, 5, 0.01).reshape(-1, 1)
x2p = - w1[0,0]/w1[1,0]*x1p - b1[0]/w1[1,0]
x3p = - w1[0,1]/w1[1,1]*x1p - b1[1]/w1[1,1]

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.plot(x1p, x3p, 'g', linewidth = 3, label = '')
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()
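To connect the printed parameters to actual predictions, the sketch below repeats the forward pass by hand in NumPy, using the weights and biases copied from the printed output above. The two test points are hypothetical and chosen for illustration; since training is stochastic, a fresh run would produce different weights.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Weights and biases copied from the printed output above
w1 = np.array([[-3.0046053, -1.3579749],
               [-1.8645371,  1.7840072]])
b1 = np.array([-6.007547, 5.3945694])
w2 = np.array([[-9.982354], [8.395883]])
b2 = np.array([-2.9274793])

# Two hypothetical test points (x1, x2)
pts = np.array([[1.0, 0.0], [-4.0, -3.0]])

# Ground truth from the generating rule g >= 0
g = -0.5 * (pts[:, 0] - 1) ** 2 + 2 * pts[:, 1] + 5
true_label = (g >= 0).astype(int)

z = sigmoid(pts @ w1 + b1)       # hidden representation (z1, z2)
y_hat = sigmoid(z @ w2 + b2)     # output probability of class C1
pred = (y_hat.ravel() > 0.5).astype(int)

print('hidden features:\n', z)
print('true:', true_label, 'predicted:', pred)
```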

5. Artificial Neural Networks

Complex/Nonlinear universal function approximator
- Linearly connected networks
- Simple nonlinear neurons

Hidden layers
- Autonomous feature learning

5.1. Training Neural Networks

$=$ Learning or estimating weights and biases of multi-layer perceptron from training data

Loss Function

Measures error between target values and predictions

$$ \min_{\omega} \sum_{i=1}^{m}\ell\left( h_{\omega}\left(x^{(i)}\right),y^{(i)}\right)$$

Example
- Squared loss (for regression): $$ \frac{1}{m} \sum_{i=1}^{m} \left(h_{\omega}\left(x^{(i)}\right) - y^{(i)}\right)^2 $$
- Cross entropy (for classification): $$ -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_{\omega}\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_{\omega}\left(x^{(i)}\right)\right)\right]$$
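Both example losses can be written in a few lines of NumPy. This is a minimal sketch with made-up numbers; the clipping constant `eps` is an assumed safeguard against $\log(0)$ and is not part of the formulas above.

```python
import numpy as np

def squared_loss(y_pred, y_true):
    # Mean of squared differences between predictions and targets
    return np.mean((y_pred - y_true) ** 2)

def binary_cross_entropy(p, y, eps=1e-12):
    # eps is an assumed guard that keeps log() away from zero
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Made-up regression targets and predictions
y_reg = np.array([1.0, 2.0])
pred_reg = np.array([1.5, 2.5])

# Made-up binary labels and predicted probabilities
y_cls = np.array([1.0, 0.0, 1.0])
p_cls = np.array([0.9, 0.1, 0.8])

print('squared loss :', squared_loss(pred_reg, y_reg))
print('cross entropy:', binary_cross_entropy(p_cls, y_cls))
```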

Learning

Learning weights and biases from data using gradient descent

$$\omega \Leftarrow \omega - \alpha \nabla_{\omega} \ell \left( h_{\omega}\left(x^{(i)}\right), y^{(i)} \right)$$
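The update rule can be sketched for the simplest case, logistic regression with a cross-entropy loss, where the gradient has the closed form $\frac{1}{m}X^T(\sigma(X\omega) - y)$. The example below is an illustration under stated assumptions: it uses the full batch rather than a single sample, generates synthetic data with the same rule as Section 3, and the seed, sample size, step size `alpha`, and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, linearly separable data (same generating rule as Section 3)
m = 200
x1 = 8 * rng.random((m, 1))
x2 = 7 * rng.random((m, 1)) - 4
X = np.hstack([np.ones((m, 1)), x1, x2])   # first column carries the bias w0
y = (0.8 * x1 + x2 - 3 >= 0).astype(float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w):
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

w = np.zeros((3, 1))
alpha = 0.1                                 # assumed learning rate
initial_loss = cross_entropy(w)

for _ in range(5000):
    grad = X.T @ (sigmoid(X @ w) - y) / m   # gradient of the mean cross entropy
    w = w - alpha * grad                    # gradient descent update

final_loss = cross_entropy(w)
accuracy = np.mean((sigmoid(X @ w) > 0.5) == (y == 1))

print('loss: {:.4f} -> {:.4f}, accuracy: {:.3f}'.format(initial_loss, final_loss, accuracy))
```

Optimizers such as Adam, used in the Keras cells of this notebook, build on the same gradient but adapt the step size per parameter.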

Backpropagation

Forward propagation
- the initial information propagates up to the hidden units at each layer and finally produces output

Backpropagation
- allows the information from the cost to flow backwards through the network in order to compute the gradients
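The two passes can be sketched by hand for a small network with one sigmoid hidden layer and a sigmoid output trained with cross entropy. This is a hedged illustration, not the procedure Keras runs internally: the data, layer sizes, and random initial weights are assumptions, and a finite-difference check is included to confirm that the backward pass matches the numerical gradient for one weight.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, params):
    W1, b1, W2, b2 = params
    H = sigmoid(X @ W1 + b1)          # hidden activations
    P = sigmoid(H @ W2 + b2)          # output probabilities
    return H, P

def loss_fn(X, y, params):
    _, P = forward(X, params)
    return -np.mean(y * np.log(P) + (1 - y) * np.log(1 - P))

def backward(X, y, params):
    W1, b1, W2, b2 = params
    m = X.shape[0]
    H, P = forward(X, params)
    dA2 = (P - y) / m                 # gradient at the output pre-activation
    dW2 = H.T @ dA2
    db2 = dA2.sum(axis=0)
    dH = dA2 @ W2.T                   # error flowing back into the hidden layer
    dA1 = dH * H * (1 - H)            # chain rule through the hidden sigmoid
    dW1 = X.T @ dA1
    db1 = dA1.sum(axis=0)
    return dW1, db1, dW2, db2

# Assumed toy data and random initial parameters
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = rng.integers(0, 2, size=(5, 1)).astype(float)
params = [rng.normal(size=(2, 2)), np.zeros(2), rng.normal(size=(2, 1)), np.zeros(1)]

grads = backward(X, y, params)

# Finite-difference check on a single weight, W1[0, 0]
eps = 1e-6
plus = [p.copy() for p in params]
minus = [p.copy() for p in params]
plus[0][0, 0] += eps
minus[0][0, 0] -= eps
numeric = (loss_fn(X, y, plus) - loss_fn(X, y, minus)) / (2 * eps)

print('backprop:', grads[0][0, 0], 'numeric:', numeric)
```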

6. ANN with MNIST

MNIST (Modified National Institute of Standards and Technology) database
- Handwritten digit database
- $28 \times 28$ grayscale images

We will use MNIST to build a multinomial classifier that predicts which of the classes 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9 a given image belongs to. Succinctly, we are teaching a computer to recognize handwritten digits.

Let's download and load the dataset.

mnist = tf.keras.datasets.mnist

(train_x, train_y), (test_x, test_y) = mnist.load_data()

train_x, test_x = train_x/255.0, test_x/255.0

print ("The training data set is:\n")
print (train_x.shape)
print (train_y.shape)

The training data set is:

(60000, 28, 28)
(60000,)

print ("The test data set is:")
print (test_x.shape)
print (test_y.shape)

The test data set is:
(10000, 28, 28)
(10000,)

Let's visualize what some of these images and their corresponding training labels look like.

print('label :', train_y[5])

plt.figure(figsize = (6,6))
plt.imshow(train_x[5], 'gray')
plt.xticks([])
plt.yticks([])
plt.show()

label : 2

Our network model

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape = (28, 28)),
    tf.keras.layers.Dense(units = 100, activation = 'relu'),
    tf.keras.layers.Dense(units = 10, activation = 'softmax')
])

model.compile(optimizer = 'adam',
              loss = 'sparse_categorical_crossentropy',
              metrics = ['accuracy'])
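The output layer and loss chosen above can be unpacked with a small NumPy sketch. It illustrates the standard definitions of softmax and of cross entropy with integer class labels, and is not Keras's internal implementation; the logits and labels are made-up values with three classes instead of ten.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(probs, labels):
    # labels are integer class indices, one per row of probs
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(picked))

# Made-up pre-activation outputs for two samples and three classes
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])

probs = softmax(logits)
loss = sparse_categorical_crossentropy(probs, labels)

print('probabilities:\n', probs)
print('predicted classes:', np.argmax(probs, axis=1))
print('loss:', loss)
```

Each row of `probs` sums to one, so the ten softmax outputs of the MNIST model can be read as class probabilities, and the predicted digit is the index of the largest one.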

loss = model.fit(train_x, train_y, epochs = 5)

Epoch 1/5
1875/1875 [==============================] - 5s 3ms/step - accuracy: 0.9199 - loss: 0.2810
Epoch 2/5
1875/1875 [==============================] - 5s 3ms/step - accuracy: 0.9634 - loss: 0.1250
Epoch 3/5
1875/1875 [==============================] - 5s 3ms/step - accuracy: 0.9736 - loss: 0.0877
Epoch 4/5
1875/1875 [==============================] - 5s 3ms/step - accuracy: 0.9798 - loss: 0.0670
Epoch 5/5
1875/1875 [==============================] - 5s 3ms/step - accuracy: 0.9834 - loss: 0.0531

test_loss, test_acc = model.evaluate(test_x, test_y)

313/313 [==============================] - 1s 2ms/step - accuracy: 0.9743 - loss: 0.0821

test_img = test_x[np.random.choice(test_x.shape[0], 1)]

predict = model.predict_on_batch(test_img)
mypred = np.argmax(predict, axis = 1)

plt.figure(figsize = (12,5))

plt.subplot(1,2,1)
plt.imshow(test_img.reshape(28, 28), 'gray')
plt.axis('off')
plt.subplot(1,2,2)
plt.stem(predict[0])
plt.show()

print('Prediction : {}'.format(mypred[0]))

Prediction : 3
