Artificial Neural Networks (ANN)
1. Recall Perceptron
Perceptron
XOR Problem
- Minsky-Papert controversy on XOR
- XOR is not linearly separable
- This reveals a limitation of the (single-layer) perceptron
2. From Perceptron to Multi-Layer Perceptron (MLP)
2.1. Perceptron for $h_{\omega}(x)$
Neurons compute the weighted sum of their inputs
A neuron is activated or fired when the sum $a$ is positive
$$ \begin{align*} a &= \omega_0 + \omega_1 x_1 + \omega_2 x_2 \\ \\ \hat{y} &= g(a) = \begin{cases} 1 & a > 0\\ 0 & \text{otherwise} \end{cases} \end{align*} $$
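A minimal sketch of this computation (the weights below are arbitrary example values, not from the lecture):
# illustrative perceptron forward pass (example weights chosen arbitrarily)
w0, w1, w2 = -1.0, 2.0, 1.0              # bias and weights

def perceptron(x1, x2):
    a = w0 + w1*x1 + w2*x2               # weighted sum of the inputs
    return 1 if a > 0 else 0             # step activation: fire when a > 0

print(perceptron(1, 1), perceptron(0, 0))    # 1 0 with these example weights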
Limitations of a single perceptron:
- A step function is not differentiable
- One layer is often not enough
- A single neuron defines only one separating hyperplane
2.2. Multi-layer Perceptron = Artificial Neural Networks (ANN)
Multiple neurons
Differentiable activation function
In a compact representation
Multi-layer perceptron
2.3. Another Perspective: ANN as Kernel Learning
We can represent this “neuron” as follows:
The main weakness of linear predictors is their lack of capacity. For classification, the populations have to be linearly separable.
The XOR example can be solved by pre-processing the data to make the two populations linearly separable.
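For instance (a sketch added here, not the lecture's code), appending the product feature $x_1 x_2$ makes the four XOR points linearly separable, so a single linear rule classifies them exactly:
import numpy as np

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# pre-processing: append the product feature x1*x2
Phi = np.hstack([X, (X[:, 0]*X[:, 1]).reshape(-1, 1)])

# in this feature space one linear rule reproduces XOR: x1 + x2 - 2*x1*x2
w = np.array([1, 1, -2])
print((Phi @ w > 0.5).astype(int))       # [0 1 1 0], matching y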
3. Logistic Regression in the Form of a Neural Network
$$y^{(i)} \in \{1,0\}$$
$$y = \sigma \,(\omega_0 + \omega_1 x_1 + \omega_2 x_2)$$
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline
# training data generation
m = 1000
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4
g = 0.8*x1 + x2 - 3
C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M
X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])
train_X = np.vstack([X1, X0])
train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])
train_X = np.asmatrix(train_X)
train_y = np.asmatrix(train_y)
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.show()
LogisticRegression = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_dim = 2,
                          units = 1,
                          activation = 'sigmoid')
])

LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1),
                           loss = 'binary_crossentropy')
loss = LogisticRegression.fit(train_X, train_y, epochs = 10)
w = LogisticRegression.layers[0].get_weights()[0]
b = LogisticRegression.layers[0].get_weights()[1]
print(w)
print(b)
x1p = np.arange(0, 8, 0.01).reshape(-1, 1)
x2p = - w[0,0]/w[1,0]*x1p - b[0]/w[1,0]
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'g', linewidth = 3, label = '')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.show()
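As a quick sanity check (an added cell, not part of the original notebook), we can ask the fitted model for the probability of class C1 at two hypothetical points, one on each side of the true boundary:
# predicted probability of C1 at two hypothetical test points
test_points = np.array([[6.0, 2.0], [1.0, -3.0]])
print(LogisticRegression.predict(test_points))   # expect values near 1 and near 0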
4. Looking at Parameters
- To understand the network's behavior
$$y = \sigma \,(b + \omega_1 x_1 + \omega_2 x_2)$$
# training data generation
m = 1000
x1 = 10*np.random.rand(m, 1) - 5
x2 = 8*np.random.rand(m, 1) - 4
g = - 0.5*(x1-1)**2 + 2*x2 + 5
C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M
X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])
train_X = np.vstack([X1, X0])
train_X = np.asmatrix(train_X)
train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()
LogisticRegression = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_dim = 2, units = 2, activation = 'sigmoid'),
    tf.keras.layers.Dense(units = 1, activation = 'sigmoid')
])

LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1),
                           loss = 'binary_crossentropy')
loss = LogisticRegression.fit(train_X, train_y, epochs = 10)
w1 = LogisticRegression.layers[0].get_weights()[0]
b1 = LogisticRegression.layers[0].get_weights()[1]
w2 = LogisticRegression.layers[1].get_weights()[0]
b2 = LogisticRegression.layers[1].get_weights()[1]
H = train_X*w1 + b1
H = 1/(1 + np.exp(-H))
plt.figure(figsize = (6, 4))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()
x1p = np.arange(0, 1, 0.01).reshape(-1, 1)
x2p = - w2[0,0]/w2[1,0]*x1p - b2[0]/w2[1,0]
plt.figure(figsize = (6, 4))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()
x1p = np.arange(-5, 5, 0.01).reshape(-1, 1)
x2p = - w1[0,0]/w1[1,0]*x1p - b1[0]/w1[1,0]
x3p = - w1[0,1]/w1[1,1]*x1p - b1[1]/w1[1,1]
plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.plot(x1p, x3p, 'g', linewidth = 3, label = '')
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()
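To see what the two layers accomplish jointly back in the original input space, we can (as an added sketch, not part of the original notebook) evaluate the trained model on a grid of points and shade the region it assigns to C1:
# evaluate the trained two-layer model on a grid and shade the predicted C1 region
x1g, x2g = np.meshgrid(np.arange(-5, 5, 0.1), np.arange(-4, 4, 0.1))
Xg = np.hstack([x1g.reshape(-1, 1), x2g.reshape(-1, 1)])
pg = LogisticRegression.predict(Xg).reshape(x1g.shape)

plt.figure(figsize = (6, 4))
plt.contourf(x1g, x2g, pg, levels = [0.5, 1.0], colors = ['g'], alpha = 0.2)
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()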
5. Artificial Neural Networks
Complex/Nonlinear universal function approximator
- Linearly connected networks
- Simple nonlinear neurons
Hidden layers
- Autonomous feature learning
6. ANN Learning
6.1. Recursive Algorithm
One of the central ideas of computer science
Depends on solutions to smaller instances of the same problem (= subproblems)
A function calls itself (something that is impossible in the physical world)
%%html
<center><iframe src="https://www.youtube.com/embed/t4MSwiqfLaY?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
- Factorial example
$$n ! = n \cdot (n-1) \cdots 2 \cdot 1$$
n = 5
m = 1
for i in range(n):
    m = m*(i+1)

print(m)
def fac(n):
    if n == 1:
        return 1
    else:
        return n*fac(n-1)

# recursive
fac(5)
6.2. Dynamic Programming
Dynamic Programming: general, powerful algorithm design technique
Fibonacci numbers:
# naive Fibonacci
def fib(n):
    if n <= 2:
        return 1
    else:
        return fib(n-1) + fib(n-2)

fib(10)
# Memoized DP Fibonacci
def mfib(n):
    global memo
    if memo[n-1] != 0:
        return memo[n-1]
    elif n <= 2:
        memo[n-1] = 1
        return memo[n-1]
    else:
        memo[n-1] = mfib(n-1) + mfib(n-2)
        return memo[n-1]
import numpy as np
n = 10
memo = np.zeros(n)
mfib(n)
n = 30
%timeit fib(30)
memo = np.zeros(n)
%timeit mfib(30)
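As a side note (not in the original notes), Python's built-in functools.lru_cache gives the same memoized behavior without a global array:
from functools import lru_cache

@lru_cache(maxsize = None)       # cache every previously computed value
def cfib(n):
    if n <= 2:
        return 1
    return cfib(n-1) + cfib(n-2)

cfib(30)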
6.3. Training Neural Networks
$=$ Learning or estimating the weights and biases of a multi-layer perceptron from training data
6.3.1. Optimization
Three key components:
- objective function $f(\cdot)$
- decision variable or unknown $\omega$
- constraints $g(\cdot)$
In mathematical form:
$$\begin{align*} \min_{\omega} \quad &f(\omega) \end{align*} $$
6.3.2. Loss Function
- Measures error between target values and predictions
$$ \min_{\omega} \sum_{i=1}^{m}\ell\left( h_{\omega}\left(x^{(i)}\right),y^{(i)}\right)$$
- Example
- Squared loss (for regression):
$$ \frac{1}{m} \sum_{i=1}^{m} \left(h_{\omega}\left(x^{(i)}\right) - y^{(i)}\right)^2 $$
- Cross entropy (for classification): $$ -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log\left(h_{\omega}\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_{\omega}\left(x^{(i)}\right)\right)\right]$$
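As a quick numerical check of these formulas (an added cell; the targets and predictions below are made-up numbers):
import numpy as np

# made-up targets and model outputs, for illustration only
y    = np.array([1, 0, 1, 1])
yhat = np.array([0.9, 0.2, 0.7, 0.6])

squared_loss  = np.mean((yhat - y)**2)
cross_entropy = -np.mean(y*np.log(yhat) + (1 - y)*np.log(1 - yhat))

print(squared_loss, cross_entropy)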
6.3.3. Learning
Learning weights and biases from data using gradient descent
$$\omega \leftarrow \omega - \alpha \nabla_{\omega} \ell \left( h_{\omega}\left(x^{(i)}\right), y^{(i)} \right)$$
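As a toy illustration of this update rule (an added sketch, not the network's training loop), gradient descent on $f(\omega) = (\omega - 3)^2$ walks $\omega$ toward the minimizer $\omega^* = 3$:
# gradient descent on f(w) = (w - 3)**2, whose gradient is 2*(w - 3)
w = 0.0
alpha = 0.1                  # learning rate

for _ in range(100):
    w = w - alpha*2*(w - 3)

print(w)                     # close to the minimizer w* = 3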
$\frac{\partial \ell}{\partial \omega}$: computing it naively for every $\omega$ requires too many computations
Structural constraints of NN:
- Composition of functions
- Chain rule
- Dynamic programming
Backpropagation
Forward propagation
- the initial information propagates up to the hidden units at each layer and finally produces output
Backpropagation
- allows the information from the cost to flow backwards through the network in order to compute the gradients
Chain Rule
Computing the derivative of the composition of functions
$f(g(x))' = f'(g(x))\,g'(x)$
${dz \over dx} = {dz \over dy} \cdot {dy \over dx}$
${dz \over dw} = \left({dz \over dy} \cdot {dy \over dx}\right) \cdot {dx \over dw}$
${dz \over du} = \left({dz \over dy} \cdot {dy \over dx} \cdot {dx \over dw}\right) \cdot {dw \over du}$
Backpropagation
- Update weights recursively with memory
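Below is a minimal NumPy sketch of backpropagation (an illustration added here, not the lecture's implementation): a small 2-4-1 sigmoid network is trained on XOR with the cross-entropy loss, storing the forward-pass values and reusing them when the chain rule is applied backwards.
import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

# XOR data (illustrative)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype = float)
y = np.array([[0], [1], [1], [0]], dtype = float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size = (2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size = (4, 1)), np.zeros((1, 1))
alpha = 1.0

for _ in range(10000):
    # forward propagation: keep the intermediate values (the "memory")
    Z1 = X @ W1 + b1
    H = sigmoid(Z1)
    Z2 = H @ W2 + b2
    Yhat = sigmoid(Z2)

    # backpropagation: chain rule applied layer by layer, reusing stored values
    dZ2 = Yhat - y                       # gradient of cross entropy w.r.t. Z2 (sigmoid output)
    dW2 = H.T @ dZ2 / len(X)
    db2 = dZ2.mean(axis = 0, keepdims = True)
    dZ1 = (dZ2 @ W2.T) * H * (1 - H)     # propagate the error through the hidden layer
    dW1 = X.T @ dZ1 / len(X)
    db1 = dZ1.mean(axis = 0, keepdims = True)

    # gradient descent update
    W1, b1 = W1 - alpha*dW1, b1 - alpha*db1
    W2, b2 = W2 - alpha*dW2, b2 - alpha*db2

print(np.round(Yhat, 2))                 # should approach [[0], [1], [1], [0]]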
Optimization procedure
- It is not easy to compute gradients numerically for a general network.
- The good news: people have already done all the "hard work" of developing numerical solvers (or libraries)
- There is a wide range of tools: TensorFlow
Summary
- Learning weights and biases from data using gradient descent
6.4. Other Tutorials
%%html
<center><iframe src="https://www.youtube.com/embed/aircAruvnKk?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/IHZwWFHWa-w?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/Ilg3gGewQ5U?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/tIeHLnjs5U8?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
7. ANN with MNIST
7.1. What is MNIST?
From Wikipedia
The MNIST database (Mixed National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, NIST's complete dataset was too hard.
MNIST (Mixed National Institute of Standards and Technology) database
- Handwritten digit database
- $28 \times 28$ gray-scale images
- (Flattening each image matrix into a vector of $28 \times 28 = 784$ values) $\rightarrow$ not needed for TensorFlow 2
We will be using MNIST to create a multinomial classifier that can detect whether the MNIST image shown is a member of class 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. Succinctly, we're teaching a computer to recognize handwritten digits.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline
Let's download and load the dataset.
mnist = tf.keras.datasets.mnist
(train_x, train_y), (test_x, test_y) = mnist.load_data()
train_x, test_x = train_x/255.0, test_x/255.0
print ("The training data set is:\n")
print (train_x.shape)
print (train_y.shape)
print ("The test data set is:")
print (test_x.shape)
print (test_y.shape)
Display a sample from it:
# So now we have a 28x28 matrix, where each element is an intensity level from 0 to 1.
img = train_x[5]
img.shape
Let's visualize what some of these images and their corresponding training labels look like.
plt.figure(figsize = (4, 4))
plt.imshow(img, 'gray')
plt.xticks([])
plt.yticks([])
plt.show()
train_y[5]
7.2. ANN with TensorFlow
- Feed a gray-scale image to the ANN
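A minimal sketch of the kind of fully connected model this section goes on to build (the layer sizes and hyperparameters below are illustrative choices, not necessarily the lecture's):
# a small fully connected network for MNIST: flatten the 28x28 image,
# one hidden layer, and a softmax over the 10 digit classes
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape = (28, 28)),
    tf.keras.layers.Dense(units = 100, activation = 'relu'),
    tf.keras.layers.Dense(units = 10, activation = 'softmax')
])

model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001),
              loss = 'sparse_categorical_crossentropy',
              metrics = ['accuracy'])

model.fit(train_x, train_y, epochs = 5)
model.evaluate(test_x, test_y)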