Artificial Neural Networks (ANN)


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

0. Video Lectures

In [1]:
%%html 
<center><iframe src="https://www.youtube.com/embed/Y1zSV8UoV_U?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [2]:
%%html 
<center><iframe src="https://www.youtube.com/embed/ZmRwMpMVQV0?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [3]:
%%html 
<center><iframe src="https://www.youtube.com/embed/cbsdvR8t9W0?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [4]:
%%html 
<center><iframe src="https://www.youtube.com/embed/wzwQE1mGB_c?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [5]:
%%html 
<center><iframe src="https://www.youtube.com/embed/Wkb-4BRhSwE?end=240&rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

1. Recall Perceptron

Perceptron


XOR Problem

  • Minsky-Papert Controversy on XOR
    • not linearly separable
    • limitation of perceptron
$x_1$    $x_2$    $x_1$ XOR $x_2$
  0        0             0
  0        1             1
  1        0             1
  1        1             0



2. From Perceptron to Multi-Layer Perceptron (MLP)

2.1. Perceptron for $h_{\omega}(x)$

  • Neurons compute the weighted sum of their inputs

  • A neuron is activated (it fires) when the sum $a$ is positive


$$ \begin{align*} a &= \omega_0 + \omega_1 x_1 + \omega_2 x_2 \\ \\ \hat{y} &= g(a) = \begin{cases} 1 & a > 0\\ 0 & \text{otherwise} \end{cases} \end{align*} $$



  • A step function is not differentiable


  • One layer is often not enough
    • One hyperplane
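
Before moving to multiple layers, here is a minimal NumPy sketch of the single perceptron defined above; the weights below are hand-picked for illustration, not learned.

import numpy as np

# illustrative (hand-picked) perceptron parameters: w0 is the bias
w0, w1, w2 = -3.0, 0.8, 1.0

def perceptron(x1, x2):
    # weighted sum of the inputs
    a = w0 + w1*x1 + w2*x2
    # step activation: output 1 when a > 0, otherwise 0
    return 1 if a > 0 else 0

print(perceptron(5, 2))   # a = 3.0  -> 1
print(perceptron(1, 1))   # a = -1.2 -> 0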

2.2. Multi-layer Perceptron = Artificial Neural Networks (ANN)

Multi-neurons





Differentiable activation function




In a compact representation



Multi-layer perceptron



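As a concrete illustration of the compact representation, here is a minimal NumPy sketch of a two-layer forward pass with sigmoid activations, $z = \sigma(W_1^T x + b_1)$ and $y = \sigma(W_2^T z + b_2)$; the weight values below are arbitrary placeholders, not trained parameters.

import numpy as np

def sigmoid(a):
    return 1/(1 + np.exp(-a))

# arbitrary placeholder parameters for a 2-input, 2-hidden-unit, 1-output network
W1 = np.array([[1.0, -1.0],
               [1.0,  1.0]])     # input -> hidden weights (one column per hidden unit)
b1 = np.array([0.5, -0.5])       # hidden biases
W2 = np.array([[ 2.0],
               [-2.0]])          # hidden -> output weights
b2 = np.array([0.0])             # output bias

x = np.array([1.0, 2.0])         # a single input example

z = sigmoid(x @ W1 + b1)         # hidden-layer activations
y = sigmoid(z @ W2 + b2)         # network output
print(z, y)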
2.3. Another Perspective: ANN as Kernel Learning

We can represent this “neuron” as follows:

  • The main weakness of linear predictors is their lack of capacity. For classification, the populations have to be linearly separable.

  • The XOR example can be solved by pre-processing the data to make the two populations linearly separable, as sketched below.
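
For instance (this feature map is just one illustrative choice), appending the product feature $x_1 x_2$ makes the XOR populations linearly separable:

import numpy as np

# XOR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# pre-processing: append the product feature x1*x2
X_new = np.hstack([X, (X[:, 0]*X[:, 1]).reshape(-1, 1)])

# a hand-picked hyperplane in the new feature space:
# x1 + x2 - 2*x1*x2 - 0.5 > 0 separates the two classes
w = np.array([1.0, 1.0, -2.0])
b = -0.5
pred = (X_new @ w + b > 0).astype(int)
print(pred)   # [0 1 1 0] matches the XOR labels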



3. Logistic Regression in a Form of Neural Network

$$y^{(i)} \in \{1,0\}$$
$$y = \sigma \,(\omega_0 + \omega_1 x_1 + \omega_2 x_2)$$




In [1]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

%matplotlib inline
In [2]:
# training data generation
m = 1000
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4

g = 0.8*x1 + x2 - 3

C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M

# X1 = np.hstack([np.ones([N,1]), x1[C1], x2[C1]])
# X0 = np.hstack([np.ones([M,1]), x1[C0], x2[C0]])

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

train_X = np.vstack([X1, X0])
train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])

train_X = np.asmatrix(train_X)
train_y = np.asmatrix(train_y)

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.show()
In [3]:
LogisticRegression = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_dim = 2, units = 1, activation = 'sigmoid')
])
In [4]:
LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1), 
                           loss = 'binary_crossentropy')
In [5]:
loss = LogisticRegression.fit(train_X, train_y, epochs = 10)
Epoch 1/10
32/32 [==============================] - 0s 781us/step - loss: 0.4028
Epoch 2/10
32/32 [==============================] - 0s 937us/step - loss: 0.1783
Epoch 3/10
32/32 [==============================] - 0s 859us/step - loss: 0.1421
Epoch 4/10
32/32 [==============================] - 0s 813us/step - loss: 0.1244
Epoch 5/10
32/32 [==============================] - 0s 781us/step - loss: 0.1141
Epoch 6/10
32/32 [==============================] - 0s 844us/step - loss: 0.1027
Epoch 7/10
32/32 [==============================] - 0s 812us/step - loss: 0.0946
Epoch 8/10
32/32 [==============================] - 0s 688us/step - loss: 0.0899
Epoch 9/10
32/32 [==============================] - 0s 844us/step - loss: 0.0856
Epoch 10/10
32/32 [==============================] - 0s 625us/step - loss: 0.0807
In [6]:
w = LogisticRegression.layers[0].get_weights()[0]
b = LogisticRegression.layers[0].get_weights()[1]
print(w)
print(b)
[[2.3075073]
 [3.0700607]]
[-8.437275]
In [7]:
x1p = np.arange(0, 8, 0.01).reshape(-1, 1)
x2p = - w[0,0]/w[1,0]*x1p - b[0]/w[1,0]

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'g', linewidth = 3, label = '')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.show()

4. Looking at Parameters

  • To understand the network's behavior
$$y = \sigma \,(b + \omega_1 x_1 + \omega_2 x_2)$$




In [8]:
# training data generation

m = 1000
x1 = 10*np.random.rand(m, 1) - 5
x2 = 8*np.random.rand(m, 1) - 4

g = - 0.5*(x1-1)**2 + 2*x2 + 5

C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

train_X = np.vstack([X1, X0])
train_X = np.asmatrix(train_X)

train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])
# ohe = OneHotEncoder(handle_unknown='ignore')
# train_y = ohe.fit_transform(train_y).toarray()

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()




In [25]:
LogisticRegression = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_dim = 2, units = 2, activation = 'sigmoid'),
    tf.keras.layers.Dense(input_dim = 2, units = 1, activation = 'sigmoid')
])
In [26]:
LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1), 
                           loss = 'binary_crossentropy')
In [27]:
loss = LogisticRegression.fit(train_X, train_y, epochs = 10)
Epoch 1/10
32/32 [==============================] - 0s 751us/step - loss: 0.5363
Epoch 2/10
32/32 [==============================] - 0s 781us/step - loss: 0.3852
Epoch 3/10
32/32 [==============================] - 0s 781us/step - loss: 0.2949
Epoch 4/10
32/32 [==============================] - 0s 724us/step - loss: 0.2341
Epoch 5/10
32/32 [==============================] - 0s 953us/step - loss: 0.1786
Epoch 6/10
32/32 [==============================] - 0s 817us/step - loss: 0.1511
Epoch 7/10
32/32 [==============================] - 0s 781us/step - loss: 0.1610
Epoch 8/10
32/32 [==============================] - 0s 719us/step - loss: 0.1272
Epoch 9/10
32/32 [==============================] - 0s 781us/step - loss: 0.1121
Epoch 10/10
32/32 [==============================] - 0s 687us/step - loss: 0.1159
In [28]:
w1 = LogisticRegression.layers[0].get_weights()[0]
b1 = LogisticRegression.layers[0].get_weights()[1]
print(w1)
print(b1)
[[ 3.4930046  1.3582925]
 [ 2.1212444 -1.7032708]]
[ 6.409575  -5.0710745]
In [29]:
w2 = LogisticRegression.layers[1].get_weights()[0]
b2 = LogisticRegression.layers[1].get_weights()[1]
print(w2)
print(b2)
[[ 7.996894]
 [-8.171651]]
[-3.446862]
In [30]:
H = train_X*w1 + b1
H = 1/(1 + np.exp(-H))
In [31]:
plt.figure(figsize=(10, 8))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()
In [16]:
x1p = np.arange(0, 1, 0.01).reshape(-1, 1)
x2p = - w2[0,0]/w2[1,0]*x1p - b2[0]/w2[1,0]

plt.figure(figsize=(10, 8))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()
In [32]:
x1p = np.arange(-5, 5, 0.01).reshape(-1, 1)
x2p = - w1[0,0]/w1[1,0]*x1p - b1[0]/w1[1,0]
x3p = - w1[0,1]/w1[1,1]*x1p - b1[1]/w1[1,1]
# x4p = - w1[0,2]/w1[1,2]*x1p - b1[2]/w1[1,2]

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.plot(x1p, x3p, 'g', linewidth = 3, label = '')
# plt.plot(x1p, x4p, 'b', linewidth = 3, label = '')
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()

5. Artificial Neural Networks

  • Complex/Nonlinear universal function approximator
    • Linearly connected networks
    • Simple nonlinear neurons
  • Hidden layers
    • Autonomous feature learning



5.1. Training Neural Networks

$=$ Learning or estimating weights and biases of multi-layer perceptron from training data

Loss Function

  • Measures error between target values and predictions


$$ \min_{\omega} \sum_{i=1}^{m}\ell\left( h_{\omega}\left(x^{(i)}\right),y^{(i)}\right)$$

  • Example (a quick numerical check follows this list)
    • Squared loss (for regression): $$ \frac{1}{m} \sum_{i=1}^{m} \left(h_{\omega}\left(x^{(i)}\right) - y^{(i)}\right)^2 $$
    • Cross entropy (for classification): $$ -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_{\omega}\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_{\omega}\left(x^{(i)}\right)\right)\right]$$
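
As a quick numerical check of these two losses (the predictions and targets below are made-up values, not outputs of any trained model):

import numpy as np

y = np.array([1, 0, 1, 1])            # made-up targets y^(i)
h = np.array([0.9, 0.2, 0.7, 0.6])    # made-up predictions h_w(x^(i))

squared_loss  = np.mean((h - y)**2)
cross_entropy = -np.mean(y*np.log(h) + (1 - y)*np.log(1 - h))

print(squared_loss)     # 0.075
print(cross_entropy)    # approximately 0.299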

Learning

Learning weights and biases from data using gradient descent


$$\omega \Leftarrow \omega - \alpha \nabla_{\omega} \ell \left( h_{\omega}\left(x^{(i)}\right), y^{(i)} \right)$$


Backpropagation

  • Forward propagation
    • the initial information propagates up to the hidden units at each layer and finally produces output
  • Backpropagation
    • allows the information from the cost to flow backwards through the network in order to compute the gradients
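
As a minimal sketch of one gradient-descent step in TensorFlow (toy data and a single sigmoid unit, chosen here only for illustration), tf.GradientTape plays the role of backpropagation: it computes the gradients of the loss with respect to the weights, which are then used in the update $\omega \Leftarrow \omega - \alpha \nabla_{\omega}\ell$.

import tensorflow as tf

# toy data: 4 examples with 2 features and binary labels (illustrative only)
x = tf.constant([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = tf.constant([[0.], [0.], [0.], [1.]])

w = tf.Variable(tf.zeros([2, 1]))   # weights
b = tf.Variable(tf.zeros([1]))      # bias
alpha = 0.1                         # learning rate

with tf.GradientTape() as tape:
    h = tf.sigmoid(x @ w + b)                                        # forward propagation
    loss = -tf.reduce_mean(y*tf.math.log(h) + (1 - y)*tf.math.log(1 - h))   # cross entropy

# backpropagation: gradients of the loss w.r.t. the parameters
dw, db = tape.gradient(loss, [w, b])

# gradient-descent update: w <- w - alpha * gradient
w.assign_sub(alpha*dw)
b.assign_sub(alpha*db)
print(loss.numpy(), w.numpy().ravel(), b.numpy())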

6. ANN with MNIST

  • MNIST (Modified National Institute of Standards and Technology) database
    • Handwritten digit database
    • $28 \times 28$ grayscale images



We will be using MNIST to create a multinomial classifier that can detect whether a given MNIST image belongs to class 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. Succinctly, we are teaching a computer to recognize handwritten digits.

Let's download and load the dataset.

In [33]:
mnist = tf.keras.datasets.mnist

(train_x, train_y), (test_x, test_y) = mnist.load_data()

train_x, test_x = train_x/255.0, test_x/255.0
In [34]:
print ("The training data set is:\n")
print (train_x.shape)
print (train_y.shape)
The training data set is:

(60000, 28, 28)
(60000,)
In [35]:
print ("The test data set is:")
print (test_x.shape)
print (test_y.shape)
The test data set is:
(10000, 28, 28)
(10000,)

Let's visualize what some of these images and their corresponding training labels look like.

In [38]:
print('label :', train_y[0])

plt.figure(figsize = (6,6))
plt.imshow(train_x[0], 'gray')
plt.xticks([])
plt.yticks([])
plt.show()
label : 5
  • Our network model



In [39]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape = (28, 28)),
    tf.keras.layers.Dense(units = 100, activation = 'relu'),
    tf.keras.layers.Dense(units = 10, activation = 'softmax')
])

model.compile(optimizer = 'adam',
              loss = 'sparse_categorical_crossentropy',
              metrics = ['accuracy'])

loss = model.fit(train_x, train_y, epochs = 5)
Epoch 1/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.2706 - accuracy: 0.9232
Epoch 2/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.1214 - accuracy: 0.9636
Epoch 3/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0836 - accuracy: 0.9748
Epoch 4/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0646 - accuracy: 0.9801
Epoch 5/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0508 - accuracy: 0.9840
In [40]:
test_loss, test_acc = model.evaluate(test_x, test_y)
313/313 [==============================] - 0s 866us/step - loss: 0.0765 - accuracy: 0.9776
In [50]:
test_img = test_x[np.random.choice(test_x.shape[0], 1)]

predict = model.predict_on_batch(test_img)
mypred = np.argmax(predict, axis = 1)

plt.figure(figsize = (12,5))

plt.subplot(1,2,1)
plt.imshow(test_img.reshape(28, 28), 'gray')
plt.axis('off')
plt.subplot(1,2,2)
plt.stem(predict[0])
plt.show()

print('Prediction : {}'.format(mypred[0]))
Prediction : 3
In [25]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')