Artificial Neural Networks (ANN)
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST
1. Recall Perceptron
from IPython.display import YouTubeVideo
YouTubeVideo('W5i9OA0bW-A', width="560", height="315", frameborder="0")
Perceptron
XOR Problem
- Minsky-Papert Controversy on XOR
- not linearly separable
- limitation of perceptron
2. From Perceptron to Multi-Layer Perceptron (MLP)
2.1. Perceptron for $h_{\omega}(x)$
Neurons compute the weighted sum of their inputs
A neuron is activated or fired when the sum $a$ is positive
$$ \begin{align*} a &= \omega_0 + \omega_1 x_1 + \omega_2 x_2 \\ \\ \hat{y} &= g(a) = \begin{cases} 1 & a > 0\\ 0 & \text{otherwise} \end{cases} \end{align*} $$
- A step function (or sign function) is not differentiable
- One layer is often not enough
- One hyperplane
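The perceptron equations above can be sketched in a few lines of NumPy. The weights below are picked by hand for illustration (they make the neuron compute a logical AND) and are not taken from the lecture.

```python
import numpy as np

def perceptron(X, w0, w):
    """Step-activated perceptron: 1 where w0 + X @ w > 0, else 0."""
    a = w0 + X @ w                 # weighted sum of the inputs
    return (a > 0).astype(int)     # step activation g(a)

# Hypothetical hand-picked weights: the neuron fires only for the input (1, 1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_hat = perceptron(X, w0=-1.5, w=np.array([1.0, 1.0]))
print(y_hat)
```

A single neuron of this kind draws exactly one hyperplane, which is why AND is reachable while XOR is not.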
2.2. Multi-layer Perceptron = Artificial Neural Networks (ANN)
- Multi-neurons
- Differentiable activation function (for example, sigmoid function)
- In a compact representation
- Multi-layer perceptron
2.3. Another Perspective: ANN as Kernel Learning
The main weakness of linear predictors is their lack of capacity.
For classification, the populations have to be linearly separable.
The XOR example can be solved by pre-processing the data to make the two populations linearly separable.
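As a small illustration of such pre-processing, the sketch below appends the product $x_1 x_2$ as a third feature. This feature map and the separating weights are hand-crafted assumptions for the demo; a neural network would instead learn a comparable mapping in its hidden layer.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                      # XOR labels

# Hand-crafted feature map (an assumption for illustration): append x1*x2
Z = np.column_stack([X, X[:, 0] * X[:, 1]])

# In the 3-D feature space a single hyperplane separates the two classes
w = np.array([1.0, 1.0, -2.0])
b = -0.5
y_hat = (Z @ w + b > 0).astype(int)
print(y_hat)
```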
3. Logistic Regression in a Form of Neural Network
$$ \begin{align*} y^{(i)} &\in \{1,0\}\\\\ y &= \sigma (\omega_0 + \omega_1 x_1 + \omega_2 x_2) \end{align*} $$
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline
# training data generation
m = 1000
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4
g = 0.8*x1 + x2 - 3
C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M
X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])
train_X = np.vstack([X1, X0])
train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])
train_X = np.asmatrix(train_X)
train_y = np.asmatrix(train_y)
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.show()
LogisticRegression = tf.keras.models.Sequential([
tf.keras.layers.Dense(input_dim = 2,
units = 1,
activation = 'sigmoid')
])
LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1),
loss = 'binary_crossentropy')
loss = LogisticRegression.fit(train_X, train_y, epochs = 10)
w = LogisticRegression.layers[0].get_weights()[0]
b = LogisticRegression.layers[0].get_weights()[1]
print(w)
print(b)
x1p = np.arange(0, 8, 0.01).reshape(-1, 1)
x2p = - w[0,0]/w[1,0]*x1p - b[0]/w[1,0]
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'g', linewidth = 3, label = '')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.show()
3.1. Looking at Parameters in Nonlinear Classification
- To understand network's behavior
$$y = \sigma(b + \omega_1 x_1 + \omega_2 x_2)$$
# training data generation
m = 1000
x1 = 10*np.random.rand(m, 1) - 5
x2 = 8*np.random.rand(m, 1) - 4
g = - 0.5*(x1-1)**2 + 2*x2 + 5
C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M
X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])
train_X = np.vstack([X1, X0])
train_X = np.asmatrix(train_X)
train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()
LogisticRegression = tf.keras.models.Sequential([
tf.keras.layers.Dense(input_dim = 2, units = 2, activation = 'sigmoid'),
tf.keras.layers.Dense(units = 1, activation = 'sigmoid')
])
LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1),
loss = 'binary_crossentropy')
loss = LogisticRegression.fit(train_X, train_y, epochs = 10)
w1 = LogisticRegression.layers[0].get_weights()[0]
b1 = LogisticRegression.layers[0].get_weights()[1]
w2 = LogisticRegression.layers[1].get_weights()[0]
b2 = LogisticRegression.layers[1].get_weights()[1]
H = train_X*w1 + b1
H = 1/(1 + np.exp(-H))
plt.figure(figsize = (6, 4))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()
x1p = np.arange(0, 1, 0.01).reshape(-1, 1)
x2p = - w2[0,0]/w2[1,0]*x1p - b2[0]/w2[1,0]
plt.figure(figsize = (6, 4))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()
x1p = np.arange(-5, 5, 0.01).reshape(-1, 1)
x2p = - w1[0,0]/w1[1,0]*x1p - b1[0]/w1[1,0]
x3p = - w1[0,1]/w1[1,1]*x1p - b1[1]/w1[1,1]
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.plot(x1p, x3p, 'g', linewidth = 3, label = '')
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()
4. Regression in a Form of Neural Network
- Rectified linear unit (ReLU activation function)
$$h(x) = \max(0, x)$$
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
def relu(x):
    return np.maximum(0, x)
xp = np.linspace(-1.5, 1.5, 100)
yp = relu(xp)
plt.figure(figsize = (6, 4))
plt.plot(xp, yp, '--', color = 'red', alpha = 0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.axis('equal')
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.grid(alpha = 0.3)
plt.show()
$$h(x-0.5)$$
xp = np.linspace(-1.5, 1.5, 100)
yp = relu(xp - 0.5)
plt.figure(figsize = (6, 4))
plt.plot(xp, yp, '--', color = 'red', alpha = 0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.axis('equal')
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.grid(alpha = 0.3)
plt.show()
$$h(-2x)$$
xp = np.linspace(-1.5, 1.5, 100)
yp = relu(-2*xp)
plt.figure(figsize = (6, 4))
plt.plot(xp, yp, '--', color = 'red', alpha = 0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.axis('equal')
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.grid(alpha = 0.3)
plt.show()
$$- h(-2x-1)$$
xp = np.linspace(-1.5, 1.5, 100)
yp = -relu(-2*xp - 1)
plt.figure(figsize = (6, 4))
plt.plot(xp, yp, '--', color = 'red', alpha = 0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.axis('equal')
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.grid(alpha = 0.3)
plt.show()
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
np.random.seed(0)
tf.random.set_seed(0)
def function(x):
    return x**3 + 0.1*x**2 - x + 0.1
x = np.linspace(-1.5, 1.5, 1000)
y = function(x)
plt.figure(figsize = (6, 4))
plt.plot(x, y, '--', color = 'red', alpha = 0.5)
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()
train_x = np.random.uniform(-1.5, 1.5, 1000)
train_y = function(train_x)
Single Neuron with ReLU
$$\hat{y} = \omega^{(2)} \left( h \left( \omega^{(1)} x + b^{(1)} \right) \right)$$
model = tf.keras.models.Sequential([
tf.keras.layers.Dense(units = 1, activation = 'relu'),
tf.keras.layers.Dense(units = 1, use_bias = False)
])
model.compile(optimizer = 'adam',
loss = 'mse')
train_x = train_x.reshape(-1, 1)
train_y = train_y.reshape(-1, 1)
model.fit(train_x, train_y, epochs = 500, verbose = 0)
weights = model.layers[0].get_weights()[0]
biases = model.layers[0].get_weights()[1]
weights2 = model.layers[1].get_weights()[0]
print("Coefficients (Weights):", weights)
print("Intercepts (Biases):", biases)
print("Coefficients (Weights):", weights2)
real_x = np.linspace(-1.5, 1.5, 100)
real_y = function(real_x)
def relu(x):
    return np.maximum(0, x)
x1p = np.arange(-2, 2, 0.01).reshape(-1, 1)
x2p = weights2*(relu(weights*x1p + biases))
plt.figure(figsize = (6, 4))
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.plot(real_x, real_y, '--', color = 'red', alpha = 0.5)
plt.plot(x1p, x2p, 'c', linewidth = 3)
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()
Two Neurons with ReLU
$$\hat{y} = \omega_1^{(2)} \left( h \left( \omega_1^{(1)} x + b_1^{(1)} \right) \right) + \omega_2^{(2)} \left( h \left( \omega_2^{(1)} x + b_2^{(1)} \right) \right)$$
model_2 = tf.keras.models.Sequential([
tf.keras.layers.Dense(units = 2, activation = 'relu'),
tf.keras.layers.Dense(units = 1, use_bias = False)
])
model_2.compile(optimizer = 'adam',
loss = 'mse')
train_x = train_x.reshape(-1, 1)
train_y = train_y.reshape(-1, 1)
model_2.fit(train_x, train_y, epochs = 3000, verbose = 0)
weights = model_2.layers[0].get_weights()[0]
biases = model_2.layers[0].get_weights()[1]
weights2 = model_2.layers[1].get_weights()[0]
print("Coefficients (Weights):", weights)
print("Intercepts (Biases):", biases)
print("Coefficients (Weights):", weights2)
x1p = np.arange(-2, 2, 0.01).reshape(-1, 1)
x2p = weights2[0]*relu(weights[0][0]*x1p + biases[0])
x3p = weights2[1]*relu(weights[0][1]*x1p + biases[1])
plt.figure(figsize = (6, 4))
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.plot(real_x, real_y, '--', color = 'red', alpha = 0.5)
plt.plot(x1p, x2p, 'b', linewidth = 3, alpha = 0.5)
plt.plot(x1p, x3p, 'k', linewidth = 3, alpha = 0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()
x1p = np.arange(-2, 2, 0.01).reshape(-1, 1)
x2p = weights2[0]*relu(weights[0][0]*x1p + biases[0]) + weights2[1]*relu(weights[0][1]*x1p + biases[1])
plt.figure(figsize = (6, 4))
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.plot(real_x, real_y, '--', color = 'red', alpha = 0.5)
plt.plot(x1p, x2p, 'c', linewidth = 3)
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()
Four Neurons with ReLU
$$\hat{y} = \omega_1^{(2)} \left( h \left( \omega_1^{(1)} x + b_1^{(1)} \right) \right) + \omega_2^{(2)} \left( h \left( \omega_2^{(1)} x + b_2^{(1)} \right) \right) + \omega_3^{(2)} \left( h \left( \omega_3^{(1)} x + b_3^{(1)} \right)\right) + \omega_4^{(2)} \left( h \left( \omega_4^{(1)} x + b_4^{(1)} \right) \right) $$
model_4 = tf.keras.models.Sequential([
tf.keras.layers.Dense(units = 4, activation = 'relu'),
tf.keras.layers.Dense(units = 1, use_bias = False)
])
model_4.compile(optimizer = 'adam',
loss = 'mse')
model_4.fit(train_x, train_y, epochs = 4000, verbose = 0)
weights = model_4.layers[0].get_weights()[0]
biases = model_4.layers[0].get_weights()[1]
weights2 = model_4.layers[1].get_weights()[0]
x1p = np.arange(-2, 2, 0.01).reshape(-1, 1)
x2p = weights2[0]*relu(weights[0][0]*x1p + biases[0])
x3p = weights2[1]*relu(weights[0][1]*x1p + biases[1])
x4p = weights2[2]*relu(weights[0][2]*x1p + biases[2])
x5p = weights2[3]*relu(weights[0][3]*x1p + biases[3])
plt.figure(figsize = (6, 4))
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.plot(real_x, real_y, '--', color = 'red', alpha = 0.5)
plt.plot(x1p, x2p, 'b', linewidth = 3, alpha = 0.5)
plt.plot(x1p, x3p, 'k', linewidth = 3, alpha = 0.5)
plt.plot(x1p, x4p, 'g', linewidth = 3, alpha = 0.5)
plt.plot(x1p, x5p, 'y', linewidth = 3, alpha = 0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()
x1p = np.arange(-2, 2, 0.01).reshape(-1, 1)
x2p = weights2[0]*relu(weights[0][0]*x1p + biases[0]) + weights2[1]*relu(weights[0][1]*x1p + biases[1]) + weights2[2]*relu(weights[0][2]*x1p + biases[2]) + weights2[3]*relu(weights[0][3]*x1p + biases[3])
plt.figure(figsize = (6, 4))
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.plot(real_x, real_y, '--', color = 'red', alpha = 0.5)
plt.plot(x1p, x2p, 'c', linewidth = 3)
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()
100 Neurons with ReLU
model_100 = tf.keras.models.Sequential([
tf.keras.layers.Dense(units = 100, activation = 'relu'),
tf.keras.layers.Dense(units = 1, use_bias = False)
])
model_100.compile(optimizer = 'adam',
loss = 'mse')
model_100.fit(train_x, train_y, epochs = 1000, verbose = 0)
weights = model_100.layers[0].get_weights()[0]
biases = model_100.layers[0].get_weights()[1]
weights2 = model_100.layers[1].get_weights()[0]
x1p = np.arange(-2, 2, 0.01).reshape(-1, 1)
x2p = 0
for i in range(100):
    x2p += weights2[i]*relu(weights[0][i]*x1p + biases[i])
plt.figure(figsize = (6, 4))
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.plot(real_x, real_y, '--', color = 'red', alpha = 0.5)
plt.plot(x1p, x2p, 'c', linewidth = 3)
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()
5. Artificial Neural Networks
Complex/Nonlinear universal function approximator
- Linearly connected networks
- Simple nonlinear neurons
Hidden layers
- Autonomous feature learning
6. ANN Learning
from IPython.display import YouTubeVideo
YouTubeVideo('mGuXwSbantc', width="560", height="315", frameborder="0")
6.1. Recursive Algorithm
One of the central ideas of computer science
Depends on solutions to smaller instances of the same problem (= subproblem)
A function calls itself (which is impossible in the real world)
%%html
<iframe src="https://www.youtube.com/embed/t4MSwiqfLaY?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe>
- Factorial example
$$n ! = n \cdot (n-1) \cdots 2 \cdot 1$$
n = 5
m = 1
for i in range(n):
    m = m*(i+1)
print(m)
def fac(n):
    # base case n <= 1 also covers 0! = 1 and avoids infinite recursion
    if n <= 1:
        return 1
    else:
        return n*fac(n-1)
# recursive
fac(5)
6.2. Dynamic Programming
Dynamic Programming: general, powerful algorithm design technique
Fibonacci numbers:
# naive Fibonacci
def fib(n):
    if n <= 2:
        return 1
    else:
        return fib(n-1) + fib(n-2)
fib(10)
# Memoized DP Fibonacci
def mfib(n):
    global memo
    # reuse the stored value if this subproblem was already solved
    if memo[n-1] != 0:
        return memo[n-1]
    elif n <= 2:
        memo[n-1] = 1
        return memo[n-1]
    else:
        memo[n-1] = mfib(n-1) + mfib(n-2)
        return memo[n-1]
import numpy as np
n = 10
memo = np.zeros(n)
mfib(n)
n = 30
%timeit fib(30)
memo = np.zeros(n)
%timeit mfib(30)
6.3. Training Neural Networks
- Learning or estimating weights and biases of multi-layer perceptron from training data
6.3.1. Optimization
- 3 key components
- objective function $f(\cdot)$
- decision variable or unknown $\omega$
- constraints $g(\cdot)$
- In mathematical expression
$$\begin{align*} \min_{\omega} \quad &f(\omega) \end{align*} $$
6.3.2. Loss Function
- Measures error between target values and predictions
$$ \min_{\omega} \sum_{i=1}^{m}\ell\left( h_{\omega}\left(x^{(i)}\right),y^{(i)}\right)$$
- Example
Squared loss (for regression):
$$ \frac{1}{m} \sum_{i=1}^{m} \left(h_{\omega}\left(x^{(i)}\right) - y^{(i)}\right)^2 $$
Cross-entropy (for classification):
$$ -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_{\omega}\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_{\omega}\left(x^{(i)}\right)\right)\right]$$
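Both example losses can be evaluated directly with NumPy. This is a minimal sketch on made-up predictions. The helper names and the `eps` clipping that guards against $\log(0)$ are assumptions added here, not part of the formulas.

```python
import numpy as np

def squared_loss(y_hat, y):
    """Mean squared error between predictions and targets."""
    return np.mean((y_hat - y) ** 2)

def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy; p holds predicted probabilities of class 1."""
    p = np.clip(p, eps, 1 - eps)   # keep the logarithms finite
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0, 0.0])   # made-up targets
p = np.array([0.9, 0.2, 0.6, 0.4])   # made-up predictions
mse = squared_loss(p, y)
bce = cross_entropy(p, y)
print(mse, bce)
```

Confident and correct predictions push both values toward zero, while a confident wrong prediction is penalized much more heavily by the cross-entropy than by the squared loss.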
6.3.3. Learning
- Learning weights and biases from data using gradient descent
$$\omega \leftarrow \omega - \alpha \nabla_{\omega} \ell \left( h_{\omega}\left(x^{(i)}\right), y^{(i)} \right)$$
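To make the update rule concrete, here is a minimal sketch on a one-dimensional linear model. It uses the full-batch gradient of the mean squared error rather than a single sample $i$, and the synthetic data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0                      # noise-free synthetic target (assumed for the demo)

w, b = 0.0, 0.0                        # parameters to learn
alpha = 0.5                            # learning rate (hypothetical value)
for _ in range(500):
    err = (w * x + b) - y              # prediction error on every sample
    grad_w = 2 * np.mean(err * x)      # d(MSE)/dw
    grad_b = 2 * np.mean(err)          # d(MSE)/db
    w -= alpha * grad_w                # omega <- omega - alpha * gradient
    b -= alpha * grad_b
print(w, b)
```

With only two parameters the gradients can be written by hand. For a network with many layers the same update applies, but the gradients have to be obtained systematically.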
$\frac{\partial \ell}{\partial \omega}$: too many computations are required for all $\omega$
Structural constraints of NN:
- Composition of functions
- Chain rule
- Dynamic programming
Backpropagation
Forward propagation
- the initial information propagates up to the hidden units at each layer and finally produces output
Backpropagation
- allows the information from the cost to flow backwards through the network in order to compute the gradients
- Chain rule
- Computing the derivative of the composition of functions
$$ \begin{align*} f(g(x))' &= f'(g(x))g'(x)\\\\ {dz \over dx} &= {dz \over dy} \cdot {dy \over dx}\\ {dz \over dw} &= \left({dz \over dy} \cdot {dy \over dx}\right) \cdot {dx \over dw}\\ {dz \over du} &= \left({dz \over dy} \cdot {dy \over dx} \cdot {dx \over dw}\right) \cdot {dw \over du} \end{align*} $$
- Backpropagation
- Update weights recursively with memory
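The idea of reusing stored intermediate derivatives can be sketched on a tiny network with one sigmoid hidden unit and a linear output. The network, its parameter values, and the finite-difference check are assumptions chosen for illustration. Libraries such as TensorFlow perform the equivalent bookkeeping automatically.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss_and_grads(params, x, y):
    w1, b1, w2, b2 = params
    # Forward pass: keep the intermediate values for the backward pass
    a = w1 * x + b1
    h = sigmoid(a)
    y_hat = w2 * h + b2
    loss = (y_hat - y) ** 2
    # Backward pass: each derivative reuses the one computed just before it
    d_yhat = 2 * (y_hat - y)           # dL/dy_hat
    d_w2 = d_yhat * h                  # dL/dw2
    d_b2 = d_yhat                      # dL/db2
    d_h = d_yhat * w2                  # dL/dh
    d_a = d_h * h * (1 - h)            # dL/da, using sigmoid'(a) = h(1 - h)
    d_w1 = d_a * x                     # dL/dw1
    d_b1 = d_a                         # dL/db1
    return loss, np.array([d_w1, d_b1, d_w2, d_b2])

params = np.array([0.5, -0.3, 1.2, 0.1])   # arbitrary illustrative values
x, y = 0.8, 1.0
loss, grads = loss_and_grads(params, x, y)

# Central finite differences as an independent check of the chain rule
eps = 1e-6
numeric = np.zeros_like(params)
for k in range(len(params)):
    step = np.zeros_like(params)
    step[k] = eps
    loss_plus, _ = loss_and_grads(params + step, x, y)
    loss_minus, _ = loss_and_grads(params - step, x, y)
    numeric[k] = (loss_plus - loss_minus) / (2 * eps)
print(np.max(np.abs(grads - numeric)))
```

Note that `d_yhat` is computed once and shared by all four gradients, which is the "memory" part of the recursion.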
Optimization procedure
- It is not easy to numerically compute gradients in a network in general.
- The good news: people have already done all the "hard work" of developing numerical solvers (or libraries)
- There is a wide range of tools, such as TensorFlow
Summary
- Learning weights and biases from data using gradient descent
7. ANN with MNIST
from IPython.display import YouTubeVideo
YouTubeVideo('z-ZhKdQpF7I', width="560", height="315", frameborder="0")
7.1. What's an MNIST?
From Wikipedia
The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, NIST's complete dataset was too hard.
MNIST (Modified National Institute of Standards and Technology) database
- Handwritten digit database
- $28 \times 28$ gray scaled image
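- (Older examples flattened each image into a vector of $28 \times 28 = 784$; this is not needed for TensorFlow 2, where the images are loaded as $28 \times 28$ arrays and flattened inside the model)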
We will be using MNIST to create a multinomial classifier that can detect if the MNIST image shown is a member of class 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9. Succinctly, we're teaching a computer to recognize handwritten digits.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline
Let's download and load the dataset.
mnist = tf.keras.datasets.mnist
(train_x, train_y), (test_x, test_y) = mnist.load_data()
train_x, test_x = train_x/255.0, test_x/255.0
print ("The training data set is:\n")
print (train_x.shape)
print (train_y.shape)
print ("The test data set is:")
print (test_x.shape)
print (test_y.shape)
Display a few random samples from it:
# So now we have a 28x28 matrix, where each element is an intensity level from 0 to 1.
img = train_x[5]
img.shape
Let's visualize what some of these images and their corresponding training labels look like.
plt.figure(figsize = (4, 4))
plt.imshow(img, 'gray')
plt.xticks([])
plt.yticks([])
plt.show()
train_y[5]
7.2. ANN with TensorFlow
- Feed a gray image to ANN
- Our network model
- Network training (learning)
$$\omega \leftarrow \omega - \alpha \nabla_{\omega} \ell \left( h_{\omega} \left(x^{(i)}\right),y^{(i)}\right)$$
7.2.1. Import Library
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
7.2.2. Load MNIST Data
- Download the MNIST data using the tf.keras.datasets loader
mnist = tf.keras.datasets.mnist
(train_x, train_y), (test_x, test_y) = mnist.load_data()
train_x, test_x = train_x/255.0, test_x/255.0
7.2.4. Define Weights, Biases, and Placeholder
- Define parameters based on predefined layer size
- Initialize with normal distribution with $\mu = 0$ and $\sigma = 0.1$
7.2.5. Build a Model
First, the layer performs a matrix multiplication to produce a set of linear activations
$$y_j = \left(\sum\limits_i \omega_{ij}x_i\right) + b_j$$
$$y = \omega^T x + b$$
Second, each linear activation is run through a nonlinear activation function
Third, predict values with an affine transformation
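Before the Keras definition, the three steps can be written out in plain NumPy. This is a sketch under assumptions: the weights are random stand-ins drawn with $\sigma = 0.1$ rather than trained values, the inputs are random placeholders instead of MNIST images, and a softmax is applied at the end to turn the final affine outputs into class probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(0, a)

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

# Layer sizes mirror a 784-100-10 network
n_in, n_hidden, n_out = 784, 100, 10
W1 = rng.normal(0, 0.1, size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, size=(n_hidden, n_out))
b2 = np.zeros(n_out)

x = rng.random((5, n_in))         # a batch of 5 flattened placeholder "images"
h = relu(x @ W1 + b1)             # steps 1 and 2: affine map, then nonlinearity
probs = softmax(h @ W2 + b2)      # step 3: affine map, then class probabilities
print(probs.shape)
```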
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape = (28, 28)),
tf.keras.layers.Dense(units = 100, activation = 'relu'),
tf.keras.layers.Dense(units = 10, activation = 'softmax')
])
7.2.6. Define Loss and Optimizer
Loss
- This defines how we measure how accurate the model is during training. As was covered in lecture, during training we want to minimize this function, which will "steer" the model in the right direction.
- Classification: Cross entropy
- Equivalent to applying logistic regression
$$ -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_{\omega}\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_{\omega}\left(x^{(i)}\right)\right)\right] $$
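The formula above is the two-class form. With ten digit classes and integer labels, the analogous quantity is the mean negative log-probability assigned to the correct class. The helper below is a hand-written NumPy sketch of that idea on made-up three-class probabilities. It is not the Keras implementation, and the `eps` clipping is an added assumption.

```python
import numpy as np

def sparse_categorical_crossentropy(probs, labels, eps=1e-12):
    """Mean of -log probs[i, labels[i]] over the batch."""
    picked = probs[np.arange(len(labels)), labels]   # probability of the true class
    return -np.mean(np.log(np.clip(picked, eps, 1.0)))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])   # made-up softmax outputs
labels = np.array([0, 1, 2])          # integer class labels
ce = sparse_categorical_crossentropy(probs, labels)
print(ce)
```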
Optimizer
- This defines how the model is updated based on the data it sees and its loss function.
- Adam: one of the most popular optimizers
7.2.7. Define Optimization Configuration and Then Optimize
Define parameters for training ANN
- n_batch: batch size for mini-batch gradient descent
- n_iter: the number of iteration steps per epoch
- n_epoch: the number of passes over the entire x and y data provided
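These three quantities are tied together by simple arithmetic, sketched below. The helper name is hypothetical. The sample count is the size of the MNIST training set, and 32 is the batch size that `model.fit` uses when none is given.

```python
import math

def iterations_per_epoch(n_samples, n_batch):
    """Number of mini-batch steps needed to see every sample once."""
    return math.ceil(n_samples / n_batch)

n_samples = 60000      # MNIST training images
n_batch = 32           # default batch size of model.fit
n_epoch = 5
n_iter = iterations_per_epoch(n_samples, n_batch)
total_updates = n_iter * n_epoch
print(n_iter, total_updates)
```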
Metrics
- Here we can define metrics used to monitor the training and testing steps. In this example, we'll look at the accuracy, the fraction of the images that are correctly classified.
Initializer
- Initialize all the variables
model.compile(optimizer = 'adam',
loss = 'sparse_categorical_crossentropy',
metrics = ['accuracy'])
# Train Model
loss = model.fit(train_x, train_y, epochs = 5)
# Evaluate Test Data
test_loss, test_acc = model.evaluate(test_x, test_y)
7.2.8. Test or Evaluate
test_img = test_x[np.random.choice(test_x.shape[0], 1)]
predict = model.predict_on_batch(test_img)
mypred = np.argmax(predict, axis = 1)
plt.figure(figsize = (8,4))
plt.subplot(1,2,1)
plt.imshow(test_img.reshape(28, 28), 'gray')
plt.axis('off')
plt.subplot(1,2,2)
plt.stem(predict[0])
plt.show()
print('Prediction : {}'.format(mypred[0]))
You may observe that the accuracy on the test dataset is a little lower than the accuracy on the training dataset. This gap between training accuracy and test accuracy is an example of overfitting, when a machine learning model performs worse on new data than on its training data.
What is the highest accuracy you can achieve with this first fully connected model? Since the handwritten digit classification task is pretty straightforward, you may be wondering how we can do better...
$\Rightarrow$ As we saw in lecture, convolutional neural networks (CNNs) are particularly well-suited for a variety of tasks in computer vision, and have achieved near-perfect accuracies on the MNIST dataset. We will build a CNN and ultimately output a probability distribution over the 10 digit classes (0-9) in the next lectures.
8. More in ANN
from IPython.display import YouTubeVideo
YouTubeVideo('Jz7zrwoesBw', width="560", height="315", frameborder="0")
8.1. Nonlinear Activation Function
The Vanishing Gradient Problem
As more layers using certain activation functions are added to neural networks, the gradients of the loss function approach zero, making the network hard to train.
For example,
$$\frac{dz}{du} = \frac{dz}{dy} \cdot \frac{dy}{dx} \cdot \frac{dx}{d\omega} \cdot \frac{d\omega}{du} $$
- Rectifiers
- The use of the ReLU activation function was a great improvement compared to the historical tanh.
- This can be explained by the derivative of ReLU itself not vanishing, and by the resulting coding being sparse (Glorot et al., 2011).
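A rough numerical illustration of the effect: the sketch multiplies only the activation derivatives along a chain of ten layers and ignores the weight factors, which is a simplifying assumption. Even in the sigmoid's best case, where every pre-activation is 0, each factor is 0.25, whereas an active ReLU contributes a factor of exactly 1.

```python
import numpy as np

def sigmoid_grad(a):
    s = 1.0 / (1.0 + np.exp(-a))
    return s * (1 - s)            # at most 0.25, reached at a = 0

def relu_grad(a):
    return (a > 0).astype(float)  # 1 for active units, 0 otherwise

depth = 10
a = np.zeros(depth)                        # best case for the sigmoid
sig_chain = np.prod(sigmoid_grad(a))       # product of ten factors of 0.25
relu_chain = np.prod(relu_grad(a + 1.0))   # pre-activations of 1, so all units are active
print(sig_chain, relu_chain)
```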
8.2. Batch Normalization
Batch normalization is a technique for improving the performance and stability of artificial neural networks.
It normalizes the inputs to a layer by shifting and scaling the activations.
During training, batch normalization shifts and rescales according to the mean and variance estimated on the batch.
During test, it simply shifts and rescales according to the empirical moments estimated during training.
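The difference between the two modes can be sketched in NumPy. This is an illustrative version, not the Keras layer: the function names, the `running` dictionary, and the `momentum` and `eps` values are assumptions made for the example.

```python
import numpy as np

def batchnorm_train(x, gamma, beta, running, momentum=0.9, eps=1e-5):
    """Normalize with batch statistics and update the running moments."""
    mean, var = x.mean(axis=0), x.var(axis=0)
    running["mean"] = momentum * running["mean"] + (1 - momentum) * mean
    running["var"] = momentum * running["var"] + (1 - momentum) * var
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def batchnorm_test(x, gamma, beta, running, eps=1e-5):
    """Normalize with the moments accumulated during training."""
    x_hat = (x - running["mean"]) / np.sqrt(running["var"] + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=(256, 4))              # synthetic activations
running = {"mean": np.zeros(4), "var": np.ones(4)}
gamma, beta = np.ones(4), np.zeros(4)                # learnable scale and shift
out = batchnorm_train(x, gamma, beta, running)
out_test = batchnorm_test(x[:1], gamma, beta, running)   # works on a single sample
print(out.mean(axis=0), out.std(axis=0))
```

Because the test-time path does not depend on the current batch, it gives the same output for a sample regardless of what else is in the batch.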
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
Overfitting in Regression
N = 10
data_x = np.linspace(-4.5, 4.5, N)
data_y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])
data_x = data_x.reshape(-1,1)
data_y = data_y.reshape(-1,1)
plt.figure(figsize = (6, 4))
plt.plot(data_x, data_y, 'o')
plt.grid(alpha = 0.3)
plt.show()
base_model = tf.keras.models.Sequential([
tf.keras.layers.Dense(input_shape = (1,),
units = 30, activation = 'sigmoid'),
tf.keras.layers.Dense(units = 100, activation = 'sigmoid'),
tf.keras.layers.Dense(units = 100, activation = 'sigmoid'),
tf.keras.layers.Dense(units = 30, activation = 'sigmoid'),
tf.keras.layers.Dense(units = 1, activation = None)
])
base_model.compile(optimizer = tf.keras.optimizers.Adam(0.001),
loss = 'mse',
metrics = ['mse'])
# Train Model & Evaluate Test Data
training = base_model.fit(data_x, data_y, epochs = 5000, verbose = 0)
xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
my_pred = base_model.predict(xp)
plt.figure(figsize = (6, 4))
plt.plot(data_x, data_y, 'o')
plt.plot(xp, my_pred, 'r')
plt.grid(alpha = 0.3)
plt.show()
Batch Normalization Implementation
- This example is not for demonstrating the improvement of the performance and stability of artificial neural networks with the batch normalization, but for demonstrating how to implement the batch normalization in TensorFlow 2.
bn_model = tf.keras.models.Sequential([
tf.keras.layers.Dense(units = 30, activation = None, input_shape = (1,)),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.Activation('sigmoid'),
tf.keras.layers.Dense(units = 100, activation = None),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.Activation('sigmoid'),
tf.keras.layers.Dense(units = 100, activation = None),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.Activation('sigmoid'),
tf.keras.layers.Dense(units = 30, activation = None),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.Activation('sigmoid'),
tf.keras.layers.Dense(units = 1, activation = None)
])
bn_model.compile(optimizer = tf.keras.optimizers.Adam(0.001),
loss = 'mse',
metrics = ['mse'])
training = bn_model.fit(data_x, data_y, epochs = 4000, verbose = 0)
xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
my_pred = bn_model.predict(xp)
plt.figure(figsize = (6, 4))
plt.plot(data_x, data_y, 'o')
plt.plot(xp, my_pred, 'r')
plt.grid(alpha = 0.3)
plt.show()
8.3. Dropout as Regularization
8.3.1. Regularization (Shrinkage Methods)
Overfitting is often associated with very large estimated parameters $\omega$
We want to balance
how well function fits data
magnitude of coefficients
$$ \begin{align*} \text{Total loss } = \;&\underbrace{\text{measure of fit}}_{RSS(\omega)} + \;\lambda \cdot \underbrace{\text{measure of magnitude of coefficients}}_{ \lVert \omega \rVert_d} \\ \\ \implies &\min\; \lVert h_{\omega} (x_i) - y \rVert_2^2 + \lambda \lVert \omega \rVert_d \end{align*} $$
where $ RSS(\omega) = \lVert h_{\omega} (x_i) - y \rVert^2_2 $ (= Residual Sum of Squares) and $\lambda$ is a tuning parameter to be determined separately
the second term, $\lambda \, \lVert \omega \rVert_d$, called a shrinkage penalty, is small when $\omega_1, \cdots,\omega_n$ are close to zero, and so it has the effect of shrinking the estimates of $\omega_j$ towards zero
The tuning parameter $\lambda$ serves to control the relative impact of these two terms on the weights' estimates
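The shrinking effect of $\lambda$ can be observed on a linear model, where the penalized problem has a closed-form solution. The sketch assumes a squared $\ell_2$ penalty (ridge regression) as one concrete choice of the norm, and the data and true coefficients are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = np.array([3.0, -2.0, 0.0, 1.0, 0.5])   # hypothetical ground truth
y = X @ w_true + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 in closed form."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Norm of the estimated coefficients for increasing lambda
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (0.0, 10.0, 1000.0)]
print(norms)
```

The norm of the estimate decreases as $\lambda$ grows, trading some fit for smaller coefficients.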
8.3.2. Different Regularization Techniques
Big Data
Data augmentation
- The simplest way to reduce overfitting is to increase the size of the training data
- Early stopping
- When we see that the performance on the validation set is getting worse, we immediately stop the training on the model
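The early stopping rule can be sketched as a small helper that scans a validation-loss curve. The function, its `patience` parameter, and the loss values are hypothetical and only illustrate the decision logic. In Keras the ready-made counterpart is the `tf.keras.callbacks.EarlyStopping` callback, whose exact stopping rule may differ from this sketch.

```python
def early_stopping_epoch(val_losses, patience=0):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve for more
    than `patience` consecutive epochs; otherwise the last index is returned.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0      # new best: reset the counter
        else:
            waited += 1
            if waited > patience:
                return epoch
    return len(val_losses) - 1

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56]   # made-up validation curve
stop_now = early_stopping_epoch(val_losses)                # stop at the first bad epoch
stop_later = early_stopping_epoch(val_losses, patience=2)  # tolerate two bad epochs
print(stop_now, stop_later)
```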
8.3.3. Dropout
- This is one of the most interesting types of regularization techniques.
- It also produces very good results and is consequently one of the most frequently used regularization techniques in the field of deep learning.
- At every iteration, it randomly selects some nodes and removes them.
- It can also be thought of as an ensemble technique in machine learning.
tf.keras.layers.Dropout(rate = p)
For training
- rate: the probability that each element is dropped. For example, setting rate = 0.1 drops each input element with probability 0.1, so about 10% of the elements are set to 0. Elements that are kept are scaled up by $\frac{1}{1-\text{rate}}$, so that the expected sum is unchanged.
For testing
- All the elements are kept
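The training and testing behavior described above can be sketched in NumPy. This is an illustrative implementation of the scaling rule, not the code of the Keras layer, and the function signature is an assumption.

```python
import numpy as np

def dropout(x, rate, training, rng):
    if not training:
        return x                                # testing: all elements are kept
    keep = rng.random(x.shape) >= rate          # each element is dropped with probability `rate`
    return x * keep / (1.0 - rate)              # survivors are scaled by 1/(1 - rate)

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
out_train = dropout(x, rate=0.2, training=True, rng=rng)
out_test = dropout(x, rate=0.2, training=False, rng=rng)
print(out_train.mean(), out_test.mean())
```

The mean of `out_train` stays close to 1 even though roughly 20% of the entries are zero, which is the purpose of the rescaling.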
Dropout Implementation
dropout_model = tf.keras.models.Sequential([
tf.keras.layers.Dense(input_shape = (1,),
units = 30, activation = 'sigmoid'),
tf.keras.layers.Dropout(rate = 0.2),
tf.keras.layers.Dense(units = 100, activation = 'sigmoid'),
tf.keras.layers.Dropout(rate = 0.2),
tf.keras.layers.Dense(units = 100, activation = 'sigmoid'),
tf.keras.layers.Dropout(rate = 0.2),
tf.keras.layers.Dense(units = 30, activation = 'sigmoid'),
tf.keras.layers.Dropout(rate = 0.2),
tf.keras.layers.Dense(units = 1, activation = None)
])
dropout_model.compile(optimizer = tf.keras.optimizers.Adam(0.001),
loss = 'mse',
metrics = ['mse'])
training = dropout_model.fit(data_x, data_y, epochs = 200, verbose = 0)
xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
my_pred = dropout_model.predict(xp)
plt.figure(figsize = (6, 4))
plt.plot(data_x, data_y, 'o')
plt.plot(xp, my_pred, 'r')
plt.grid(alpha = 0.3)
plt.show()
9. Other Tutorials
%%html
<iframe src="https://www.youtube.com/embed/aircAruvnKk?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe>
%%html
<iframe src="https://www.youtube.com/embed/IHZwWFHWa-w?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe>
%%html
<iframe src="https://www.youtube.com/embed/Ilg3gGewQ5U?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe>
%%html
<iframe src="https://www.youtube.com/embed/tIeHLnjs5U8?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe>