Artificial Neural Networks (ANN)
1. Recall Perceptron
Perceptron
XOR Problem
- Minsky-Papert controversy on XOR
- XOR is not linearly separable
- This reveals a limitation of the (single-layer) perceptron
2. From Perceptron to Multi-Layer Perceptron (MLP)
2.1. Perceptron for $h_{\omega}(x)$
Neurons compute the weighted sum of their inputs
A neuron is activated or fired when the sum $a$ is positive
$$ \begin{align*} a &= \omega_0 + \omega_1 x_1 + \omega_2 x_2 \\ \\ \hat{y} &= g(a) = \begin{cases} 1 & a > 0\\ 0 & \text{otherwise} \end{cases} \end{align*} $$
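A minimal sketch of this computation (the weights below are arbitrary example values, not from the lecture):
# illustrative perceptron forward pass (example weights chosen arbitrarily)
w0, w1, w2 = -1.0, 2.0, 1.0              # bias and weights

def perceptron(x1, x2):
    a = w0 + w1*x1 + w2*x2               # weighted sum of the inputs
    return 1 if a > 0 else 0             # step activation: fire when a > 0

print(perceptron(1, 1), perceptron(0, 0))    # 1 0 with these example weights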
Limitations of a single perceptron:
- A step function is not differentiable
- One layer is often not enough
- A single neuron defines only one separating hyperplane
2.2. Multi-layer Perceptron = Artificial Neural Networks (ANN)
Multiple neurons
Differentiable activation function
In a compact representation
Multi-layer perceptron
2.3. Another Perspective: ANN as Kernel Learning
We can represent this “neuron” as follows:
The main weakness of linear predictors is their lack of capacity. For classification, the populations have to be linearly separable.
The XOR example can be solved by pre-processing the data to make the two populations linearly separable.
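For instance (a sketch added here, not the lecture's code), appending the product feature $x_1 x_2$ makes the four XOR points linearly separable, so a single linear rule classifies them exactly:
import numpy as np

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# pre-processing: append the product feature x1*x2
Phi = np.hstack([X, (X[:, 0]*X[:, 1]).reshape(-1, 1)])

# in this feature space one linear rule reproduces XOR: x1 + x2 - 2*x1*x2
w = np.array([1, 1, -2])
print((Phi @ w > 0.5).astype(int))       # [0 1 1 0], matching y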
3. Logistic Regression in the Form of a Neural Network
$$y^{(i)} \in \{1,0\}$$
$$y = \sigma \,(\omega_0 + \omega_1 x_1 + \omega_2 x_2)$$
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline
# training data generation
m = 1000
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4
g = 0.8*x1 + x2 - 3
C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M
X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])
train_X = np.vstack([X1, X0])
train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])
train_X = np.asmatrix(train_X)
train_y = np.asmatrix(train_y)
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.show()
LogisticRegression = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_dim = 2,
                          units = 1,
                          activation = 'sigmoid')
])

LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1),
                           loss = 'binary_crossentropy')
loss = LogisticRegression.fit(train_X, train_y, epochs = 10)
w = LogisticRegression.layers[0].get_weights()[0]
b = LogisticRegression.layers[0].get_weights()[1]
print(w)
print(b)
x1p = np.arange(0, 8, 0.01).reshape(-1, 1)
x2p = - w[0,0]/w[1,0]*x1p - b[0]/w[1,0]
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'g', linewidth = 3, label = '')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.show()
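As a quick sanity check (an added cell, not part of the original notebook), we can ask the fitted model for the probability of class C1 at two hypothetical points, one on each side of the true boundary:
# predicted probability of C1 at two hypothetical test points
test_points = np.array([[6.0, 2.0], [1.0, -3.0]])
print(LogisticRegression.predict(test_points))   # expect values near 1 and near 0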
4. Looking at Parameters
- To understand the network's behavior
$$y = \sigma \,(b + \omega_1 x_1 + \omega_2 x_2)$$
# training data generation
m = 1000
x1 = 10*np.random.rand(m, 1) - 5
x2 = 8*np.random.rand(m, 1) - 4
g = - 0.5*(x1-1)**2 + 2*x2 + 5
C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M
X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])
train_X = np.vstack([X1, X0])
train_X = np.asmatrix(train_X)
train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()
LogisticRegression = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_dim = 2, units = 2, activation = 'sigmoid'),
    tf.keras.layers.Dense(units = 1, activation = 'sigmoid')
])

LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1),
                           loss = 'binary_crossentropy')
loss = LogisticRegression.fit(train_X, train_y, epochs = 10)
w1 = LogisticRegression.layers[0].get_weights()[0]
b1 = LogisticRegression.layers[0].get_weights()[1]
w2 = LogisticRegression.layers[1].get_weights()[0]
b2 = LogisticRegression.layers[1].get_weights()[1]
H = train_X*w1 + b1
H = 1/(1 + np.exp(-H))
plt.figure(figsize = (6, 4))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()
x1p = np.arange(0, 1, 0.01).reshape(-1, 1)
x2p = - w2[0,0]/w2[1,0]*x1p - b2[0]/w2[1,0]
plt.figure(figsize = (6, 4))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()
x1p = np.arange(-5, 5, 0.01).reshape(-1, 1)
x2p = - w1[0,0]/w1[1,0]*x1p - b1[0]/w1[1,0]
x3p = - w1[0,1]/w1[1,1]*x1p - b1[1]/w1[1,1]
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.plot(x1p, x3p, 'g', linewidth = 3, label = '')
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()
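To see what the two layers accomplish jointly back in the original input space, we can (as an added sketch, not part of the original notebook) evaluate the trained model on a grid of points and shade the region it assigns to C1:
# evaluate the trained two-layer model on a grid and shade the predicted C1 region
x1g, x2g = np.meshgrid(np.arange(-5, 5, 0.1), np.arange(-4, 4, 0.1))
Xg = np.hstack([x1g.reshape(-1, 1), x2g.reshape(-1, 1)])
pg = LogisticRegression.predict(Xg).reshape(x1g.shape)

plt.figure(figsize = (6, 4))
plt.contourf(x1g, x2g, pg, levels = [0.5, 1.0], colors = ['g'], alpha = 0.2)
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()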
5. Artificial Neural Networks
Complex/Nonlinear universal function approximator
- Linearly connected networks
- Simple nonlinear neurons
Hidden layers
- Autonomous feature learning
6. ANN Learning
6.1. Recursive Algorithm
One of the central ideas of computer science
Depends on solutions to smaller instances of the same problem (= subproblems)
A function calls itself (something that is impossible in the physical world)
%%html
<center><iframe src="https://www.youtube.com/embed/t4MSwiqfLaY?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
- Factorial example
$$n ! = n \cdot (n-1) \cdots 2 \cdot 1$$
n = 5
m = 1
for i in range(n):
    m = m*(i+1)

print(m)
def fac(n):
    if n == 1:
        return 1
    else:
        return n*fac(n-1)

# recursive
fac(5)
6.2. Dynamic Programming
Dynamic Programming: general, powerful algorithm design technique
Fibonacci numbers:
# naive Fibonacci
def fib(n):
    if n <= 2:
        return 1
    else:
        return fib(n-1) + fib(n-2)

fib(10)
# Memoized DP Fibonacci
def mfib(n):
    global memo
    if memo[n-1] != 0:
        return memo[n-1]
    elif n <= 2:
        memo[n-1] = 1
        return memo[n-1]
    else:
        memo[n-1] = mfib(n-1) + mfib(n-2)
        return memo[n-1]
import numpy as np
n = 10
memo = np.zeros(n)
mfib(n)
n = 30
%timeit fib(30)
memo = np.zeros(n)
%timeit mfib(30)
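As a side note (not in the original notes), Python's built-in functools.lru_cache gives the same memoized behavior without a global array:
from functools import lru_cache

@lru_cache(maxsize = None)       # cache every previously computed value
def cfib(n):
    if n <= 2:
        return 1
    return cfib(n-1) + cfib(n-2)

cfib(30)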
6.3. Training Neural Networks
$=$ Learning or estimating the weights and biases of a multi-layer perceptron from training data
6.3.1. Optimization
Three key components:
- objective function $f(\cdot)$
- decision variable or unknown $\omega$
- constraints $g(\cdot)$
In mathematical form:
$$\begin{align*} \min_{\omega} \quad &f(\omega) \end{align*} $$
6.3.2. Loss Function
- Measures error between target values and predictions
$$ \min_{\omega} \sum_{i=1}^{m}\ell\left( h_{\omega}\left(x^{(i)}\right),y^{(i)}\right)$$
- Example
- Squared loss (for regression):
$$ \frac{1}{m} \sum_{i=1}^{m} \left(h_{\omega}\left(x^{(i)}\right) - y^{(i)}\right)^2 $$
- Cross entropy (for classification): $$ -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log\left(h_{\omega}\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_{\omega}\left(x^{(i)}\right)\right)\right]$$
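As a quick numerical check of these formulas (an added cell; the targets and predictions below are made-up numbers):
import numpy as np

# made-up targets and model outputs, for illustration only
y    = np.array([1, 0, 1, 1])
yhat = np.array([0.9, 0.2, 0.7, 0.6])

squared_loss  = np.mean((yhat - y)**2)
cross_entropy = -np.mean(y*np.log(yhat) + (1 - y)*np.log(1 - yhat))

print(squared_loss, cross_entropy)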
6.3.3. Learning
Learning weights and biases from data using gradient descent
$$\omega \leftarrow \omega - \alpha \nabla_{\omega} \ell \left( h_{\omega}\left(x^{(i)}\right), y^{(i)} \right)$$
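As a toy illustration of this update rule (an added sketch, not the network's training loop), gradient descent on $f(\omega) = (\omega - 3)^2$ walks $\omega$ toward the minimizer $\omega^* = 3$:
# gradient descent on f(w) = (w - 3)**2, whose gradient is 2*(w - 3)
w = 0.0
alpha = 0.1                  # learning rate

for _ in range(100):
    w = w - alpha*2*(w - 3)

print(w)                     # close to the minimizer w* = 3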
$\frac{\partial \ell}{\partial \omega}$: computing it naively for every $\omega$ requires too many computations
Structural constraints of NN:
- Composition of functions
- Chain rule
- Dynamic programming
Backpropagation
Forward propagation
- the initial information propagates up to the hidden units at each layer and finally produces output
Backpropagation
- allows the information from the cost to flow backwards through the network in order to compute the gradients
Chain Rule
Computing the derivative of the composition of functions
$f(g(x))' = f'(g(x))\,g'(x)$
${dz \over dx} = {dz \over dy} \cdot {dy \over dx}$
${dz \over dw} = \left({dz \over dy} \cdot {dy \over dx}\right) \cdot {dx \over dw}$
${dz \over du} = \left({dz \over dy} \cdot {dy \over dx} \cdot {dx \over dw}\right) \cdot {dw \over du}$
Backpropagation
- Update weights recursively with memory
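Below is a minimal NumPy sketch of backpropagation (an illustration added here, not the lecture's implementation): a small 2-4-1 sigmoid network is trained on XOR with the cross-entropy loss, storing the forward-pass values and reusing them when the chain rule is applied backwards.
import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

# XOR data (illustrative)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype = float)
y = np.array([[0], [1], [1], [0]], dtype = float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size = (2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size = (4, 1)), np.zeros((1, 1))
alpha = 1.0

for _ in range(10000):
    # forward propagation: keep the intermediate values (the "memory")
    Z1 = X @ W1 + b1
    H = sigmoid(Z1)
    Z2 = H @ W2 + b2
    Yhat = sigmoid(Z2)

    # backpropagation: chain rule applied layer by layer, reusing stored values
    dZ2 = Yhat - y                       # gradient of cross entropy w.r.t. Z2 (sigmoid output)
    dW2 = H.T @ dZ2 / len(X)
    db2 = dZ2.mean(axis = 0, keepdims = True)
    dZ1 = (dZ2 @ W2.T) * H * (1 - H)     # propagate the error through the hidden layer
    dW1 = X.T @ dZ1 / len(X)
    db1 = dZ1.mean(axis = 0, keepdims = True)

    # gradient descent update
    W1, b1 = W1 - alpha*dW1, b1 - alpha*db1
    W2, b2 = W2 - alpha*dW2, b2 - alpha*db2

print(np.round(Yhat, 2))                 # should approach [[0], [1], [1], [0]]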
Optimization procedure
- It is not easy to compute gradients numerically for a general network.
- The good news: people have already done all the "hard work" of developing numerical solvers (or libraries)
- There is a wide range of tools: TensorFlow
Summary
- Learning weights and biases from data using gradient descent
6.4. Other Tutorials
%%html
<center><iframe src="https://www.youtube.com/embed/aircAruvnKk?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/IHZwWFHWa-w?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/Ilg3gGewQ5U?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/tIeHLnjs5U8?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
7. ANN with MNIST
7.1. What is MNIST?
From Wikipedia
The MNIST database (Mixed National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, NIST's complete dataset was too hard.
MNIST (Mixed National Institute of Standards and Technology) database
- Handwritten digit database
- $28 \times 28$ gray-scale images
- (Flattening each image matrix into a vector of $28 \times 28 = 784$ values) $\rightarrow$ not needed for TensorFlow 2
We will be using MNIST to create a multinomial classifier that can detect whether the MNIST image shown is a member of class 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. Succinctly, we're teaching a computer to recognize handwritten digits.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline
Let's download and load the dataset.
mnist = tf.keras.datasets.mnist
(train_x, train_y), (test_x, test_y) = mnist.load_data()
train_x, test_x = train_x/255.0, test_x/255.0
print ("The training data set is:\n")
print (train_x.shape)
print (train_y.shape)
print ("The test data set is:")
print (test_x.shape)
print (test_y.shape)
Display a sample from it:
# So now we have a 28x28 matrix, where each element is an intensity level from 0 to 1.
img = train_x[5]
img.shape
Let's visualize what some of these images and their corresponding training labels look like.
plt.figure(figsize = (4, 4))
plt.imshow(img, 'gray')
plt.xticks([])
plt.yticks([])
plt.show()
train_y[5]
7.2. ANN with TensorFlow
- Feed a gray-scale image to the ANN
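A minimal sketch of the kind of fully connected model this section goes on to build (the layer sizes and hyperparameters below are illustrative choices, not necessarily the lecture's):
# a small fully connected network for MNIST: flatten the 28x28 image,
# one hidden layer, and a softmax over the 10 digit classes
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape = (28, 28)),
    tf.keras.layers.Dense(units = 100, activation = 'relu'),
    tf.keras.layers.Dense(units = 10, activation = 'softmax')
])

model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001),
              loss = 'sparse_categorical_crossentropy',
              metrics = ['accuracy'])

model.fit(train_x, train_y, epochs = 5)
model.evaluate(test_x, test_y)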