Artificial Neural Networks (ANN)


By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

Table of Contents

1. Recall Perceptron

Perceptron


XOR Problem

  • Minsky-Papert controversy on XOR
    • XOR is not linearly separable
    • This exposed a key limitation of the single-layer perceptron





2. From Perceptron to Multi-Layer Perceptron (MLP)

2.1. Perceptron for $h_{\omega}(x)$

  • Neurons compute the weighted sum of their inputs

  • A neuron is activated or fired when the sum $a$ is positive


$$ \begin{align*} a &= \omega_0 + \omega_1 x_1 + \omega_2 x_2 \\ \\ \hat{y} &= g(a) = \begin{cases} 1 & a > 0\\ 0 & \text{otherwise} \end{cases} \end{align*} $$
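As a minimal sketch (with hand-picked, illustrative weights rather than learned ones), this computation in NumPy:

import numpy as np

# illustrative perceptron weights (assumed, not learned)
w0, w1, w2 = -1.0, 0.5, 0.5

def perceptron(x1, x2):
    a = w0 + w1*x1 + w2*x2          # weighted sum of the inputs
    return 1 if a > 0 else 0        # step activation

print(perceptron(1, 2))             # a = 0.5 > 0, so the neuron fires: 1
print(perceptron(0, 1))             # a = -0.5, so it does not: 0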




  • A step function is not differentiable



  • One layer is often not enough
    • A single layer can carve out only one hyperplane

2.2. Multi-layer Perceptron = Artificial Neural Networks (ANN)

Multi-neurons





Differentiable activation function
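For instance (an illustrative sketch; the original figure is not reproduced here), the sigmoid $\sigma(a) = \frac{1}{1+e^{-a}}$ is a smooth, differentiable replacement for the step function:

import numpy as np
import matplotlib.pyplot as plt

a = np.linspace(-6, 6, 200)
sigmoid = 1/(1 + np.exp(-a))        # smooth and differentiable everywhere

plt.figure(figsize = (6, 4))
plt.plot(a, sigmoid, label = 'sigmoid')
plt.plot(a, a > 0, '--', label = 'step')
plt.legend(fontsize = 12)
plt.xlabel('$a$', fontsize = 15)
plt.show()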






In a compact representation





Multi-layer perceptron



2.3. Another Perspective: ANN as Kernel Learning

We can represent this “neuron” as follows:

  • The main weakness of linear predictors is their lack of capacity. For classification, the populations have to be linearly separable.

  • The XOR example can be solved by pre-processing the data to make the two populations linearly separable, as sketched below.
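A concrete sketch (the product feature and the separating plane are hand-picked for illustration):

import numpy as np

# XOR data: not linearly separable in (x1, x2)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# pre-processing: append the product feature x1*x2
Phi = np.hstack([X, (X[:, 0]*X[:, 1]).reshape(-1, 1)])

# in (x1, x2, x1*x2), the plane x1 + x2 - 2*x1*x2 = 0.5 separates the classes
g = Phi[:, 0] + Phi[:, 1] - 2*Phi[:, 2] - 0.5
print((g > 0).astype(int))          # [0 1 1 0], matching y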





3. Logistic Regression in a Form of Neural Network


$$y^{(i)} \in \{1,0\}$$

$$y = \sigma \,(\omega_0 + \omega_1 x_1 + \omega_2 x_2)$$




In [1]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

%matplotlib inline
In [2]:
# training data generation
m = 1000
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4

g = 0.8*x1 + x2 - 3

C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

train_X = np.vstack([X1, X0])
train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])


plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.show()
In [3]:
LogisticRegression = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_dim = 2,
                          units = 1,
                          activation = 'sigmoid')
])
In [4]:
LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1),
                           loss = 'binary_crossentropy')
In [5]:
loss = LogisticRegression.fit(train_X, train_y, epochs = 10)
Epoch 1/10
32/32 [==============================] - 1s 2ms/step - loss: 0.5763
Epoch 2/10
32/32 [==============================] - 0s 2ms/step - loss: 0.2164
Epoch 3/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1702
Epoch 4/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1468
Epoch 5/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1293
Epoch 6/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1165
Epoch 7/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1085
Epoch 8/10
32/32 [==============================] - 0s 2ms/step - loss: 0.0994
Epoch 9/10
32/32 [==============================] - 0s 2ms/step - loss: 0.0966
Epoch 10/10
32/32 [==============================] - 0s 2ms/step - loss: 0.0919
In [6]:
w = LogisticRegression.layers[0].get_weights()[0]
b = LogisticRegression.layers[0].get_weights()[1]
print(w)
print(b)
[[1.8633064]
 [2.4650373]]
[-6.445767]
In [7]:
x1p = np.arange(0, 8, 0.01).reshape(-1, 1)
x2p = - w[0,0]/w[1,0]*x1p - b[0]/w[1,0]

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'g', linewidth = 3, label = 'decision boundary')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.show()

4. Looking at Parameters

  • To understand the network's behavior
$$y = \sigma \,(b + \omega_1 x_1 + \omega_2 x_2)$$




In [8]:
# training data generation

m = 1000
x1 = 10*np.random.rand(m, 1) - 5
x2 = 8*np.random.rand(m, 1) - 4

g = - 0.5*(x1-1)**2 + 2*x2 + 5

C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

train_X = np.vstack([X1, X0])

train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()




In [9]:
LogisticRegression = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_dim = 2, units = 2, activation = 'sigmoid'),
    tf.keras.layers.Dense(units = 1, activation = 'sigmoid')
])
In [10]:
LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1),
                           loss = 'binary_crossentropy')
In [11]:
loss = LogisticRegression.fit(train_X, train_y, epochs = 10)
Epoch 1/10
32/32 [==============================] - 1s 2ms/step - loss: 0.4869
Epoch 2/10
32/32 [==============================] - 0s 2ms/step - loss: 0.4197
Epoch 3/10
32/32 [==============================] - 0s 2ms/step - loss: 0.4166
Epoch 4/10
32/32 [==============================] - 0s 2ms/step - loss: 0.3905
Epoch 5/10
32/32 [==============================] - 0s 2ms/step - loss: 0.3501
Epoch 6/10
32/32 [==============================] - 0s 2ms/step - loss: 0.2803
Epoch 7/10
32/32 [==============================] - 0s 2ms/step - loss: 0.2313
Epoch 8/10
32/32 [==============================] - 0s 2ms/step - loss: 0.2005
Epoch 9/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1761
Epoch 10/10
32/32 [==============================] - 0s 2ms/step - loss: 0.1684
In [12]:
w1 = LogisticRegression.layers[0].get_weights()[0]
b1 = LogisticRegression.layers[0].get_weights()[1]

w2 = LogisticRegression.layers[1].get_weights()[0]
b2 = LogisticRegression.layers[1].get_weights()[1]
In [13]:
H = train_X @ w1 + b1      # hidden-layer pre-activations
H = 1/(1 + np.exp(-H))     # sigmoid activations z1, z2

plt.figure(figsize = (6, 4))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()
In [14]:
x1p = np.arange(0, 1, 0.01).reshape(-1, 1)
x2p = - w2[0,0]/w2[1,0]*x1p - b2[0]/w2[1,0]

plt.figure(figsize = (6, 4))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = 'boundary')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()
In [15]:
x1p = np.arange(-5, 5, 0.01).reshape(-1, 1)
x2p = - w1[0,0]/w1[1,0]*x1p - b1[0]/w1[1,0]
x3p = - w1[0,1]/w1[1,1]*x1p - b1[1]/w1[1,1]

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = 'hyperplane 1')
plt.plot(x1p, x3p, 'g', linewidth = 3, label = 'hyperplane 2')
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()

5. Artificial Neural Networks

  • Complex/Nonlinear universal function approximator
    • Linearly connected networks
    • Simple nonlinear neurons
  • Hidden layers
    • Autonomous feature learning



6. ANN Learning

6.1. Recursive Algorithm

  • One of the central ideas of computer science

  • Depends on solutions to smaller instances of the same problem (= subproblems)

  • A function calls itself (something that is impossible in the real world)



In [16]:
%%html
<center><iframe src="https://www.youtube.com/embed/t4MSwiqfLaY?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
  • Factorial example


$$n ! = n \cdot (n-1) \cdots 2 \cdot 1$$

In [17]:
n = 5

m = 1
for i in range(n):
    m = m*(i+1)

print(m)
120
In [18]:
def fac(n):
    if n <= 1:        # base case (also covers 0! = 1)
        return 1
    else:
        return n*fac(n-1)
In [19]:
# recursive

fac(5)
Out[19]:
120

6.2. Dynamic Programming

  • Dynamic Programming: general, powerful algorithm design technique

  • Fibonacci numbers:

In [20]:
# naive Fibonacci

def fib(n):
    if n <= 2:
        return 1
    else:
        return fib(n-1) + fib(n-2)
In [21]:
fib(10)
Out[21]:
55
In [22]:
# Memoized DP Fibonacci

def mfib(n):
    global memo

    if memo[n-1] != 0:
        return memo[n-1]
    elif n <= 2:
        memo[n-1] = 1
        return memo[n-1]
    else:
        memo[n-1] = mfib(n-1) + mfib(n-2)
        return memo[n-1]
In [23]:
import numpy as np

n = 10
memo = np.zeros(n)
mfib(n)
Out[23]:
55.0
In [24]:
n = 30
%timeit fib(30)
332 ms ± 110 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [25]:
memo = np.zeros(n)
%timeit mfib(30)
389 ns ± 5.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

6.3. Training Neural Networks

$=$ Learning or estimating weights and biases of multi-layer perceptron from training data

6.3.1. Optimization

3 key components

  1. objective function $f(\cdot)$
  2. decision variable or unknown $\omega$
  3. constraints $g(\cdot)$

In mathematical expression


$$\begin{align*} \min_{\omega} \quad &f(\omega) \\ \text{subject to} \quad &g(\omega) \leq 0 \end{align*} $$

6.3.2. Loss Function

  • Measures error between target values and predictions


$$ \min_{\omega} \sum_{i=1}^{m}\ell\left( h_{\omega}\left(x^{(i)}\right),y^{(i)}\right)$$

  • Examples (both are sketched in code below)
    • Squared loss (for regression): $$ \frac{1}{m} \sum_{i=1}^{m} \left(h_{\omega}\left(x^{(i)}\right) - y^{(i)}\right)^2 $$
    • Cross entropy (for classification): $$ -\frac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log\left(h_{\omega}\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_{\omega}\left(x^{(i)}\right)\right)\right]$$
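A quick NumPy sketch of both losses (the targets and predictions are made up for illustration):

import numpy as np

y    = np.array([1, 0, 1, 1])              # targets
yhat = np.array([0.9, 0.2, 0.7, 0.6])      # model outputs h_w(x) (assumed)

squared_loss  = np.mean((yhat - y)**2)
cross_entropy = -np.mean(y*np.log(yhat) + (1 - y)*np.log(1 - yhat))

print(squared_loss, cross_entropy)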

6.3.3. Learning

Learning weights and biases from data using gradient descent


$$\omega \leftarrow \omega - \alpha \nabla_{\omega} \ell \left( h_{\omega}\left(x^{(i)}\right), y^{(i)} \right)$$
  • $\frac{\partial \ell}{\partial \omega}$: computing it naively for every $\omega$ requires too many computations
  • Structural constraints of NN:
    • Composition of functions
    • Chain rule
    • Dynamic programming


Backpropagation

  • Forward propagation
    • the initial information propagates up to the hidden units at each layer and finally produces output
  • Backpropagation
    • allows the information from the cost to flow backwards through the network in order to compute the gradients
  • Chain Rule

    • Computing the derivative of the composition of functions

      • $\left(f(g(x))\right)' = f'(g(x))\,g'(x)$

      • ${dz \over dx} = {dz \over dy} \cdot {dy \over dx}$

      • ${dz \over dw} = \left({dz \over dy} \cdot {dy \over dx}\right) \cdot {dx \over dw}$

      • ${dz \over du} = \left({dz \over dy} \cdot {dy \over dx} \cdot {dx \over dw}\right) \cdot {dw \over du}$

  • Backpropagation
    • Update weights recursively with memory (see the sketch below)
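To make this concrete, here is a minimal sketch (not the notebook's code) of one gradient-descent step for a tiny 2-2-1 sigmoid network with cross-entropy loss; the intermediate results from the forward pass are stored and reused, exactly as dynamic programming suggests:

import numpy as np

sigma = lambda a: 1/(1 + np.exp(-a))

x = np.array([1.0, 2.0])                     # one training input (assumed)
y = 1.0                                      # its target
W1, b1 = 0.1*np.ones((2, 2)), np.zeros(2)    # illustrative initial parameters
W2, b2 = 0.1*np.ones(2), 0.0
alpha = 0.1                                  # learning rate

# forward propagation: keep the intermediates in memory
z1 = sigma(x @ W1 + b1)                      # hidden activations
yhat = sigma(z1 @ W2 + b2)                   # network output

# backpropagation: chain rule, reusing delta2 to compute delta1
delta2 = yhat - y                            # dl/da at the output (cross entropy + sigmoid)
delta1 = delta2*W2*z1*(1 - z1)               # propagated back through layer 2

# gradient descent update
W2 -= alpha*delta2*z1;           b2 -= alpha*delta2
W1 -= alpha*np.outer(x, delta1); b1 -= alpha*delta1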



Optimization procedure


  • It is not easy to numerically compute the gradients of a network in general.
    • The good news: people have already done all the "hard work" of developing numerical solvers (or libraries)
    • There is a wide range of tools; here we use TensorFlow

Summary

  • Learning weights and biases from data using gradient descent


6.4. Other Tutorials

In [26]:
%%html
<center><iframe src="https://www.youtube.com/embed/aircAruvnKk?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [27]:
%%html
<center><iframe src="https://www.youtube.com/embed/IHZwWFHWa-w?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [28]:
%%html
<center><iframe src="https://www.youtube.com/embed/Ilg3gGewQ5U?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [29]:
%%html
<center><iframe src="https://www.youtube.com/embed/tIeHLnjs5U8?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

7. ANN with MNIST

7.1. What's an MNIST?

From Wikipedia

  • The MNIST database (Mixed National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, NIST's complete dataset was too hard.
  • MNIST (Mixed National Institute of Standards and Technology database) database
    • Handwritten digit database
    • $28 \times 28$ grayscale images
    • Each image can be flattened into a vector of $28 \times 28 = 784$ values (not needed in TensorFlow 2, where a Flatten layer does this inside the model)




We will be using MNIST to create a multinomial classifier that can detect if the MNIST image shown is a member of class 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. Succinctly, we're teaching a computer to recognize handwritten digits.

In [30]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline

Let's download and load the dataset.

In [31]:
mnist = tf.keras.datasets.mnist

(train_x, train_y), (test_x, test_y) = mnist.load_data()

train_x, test_x = train_x/255.0, test_x/255.0
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 0s 0us/step
In [32]:
print ("The training data set is:\n")
print (train_x.shape)
print (train_y.shape)
The training data set is:

(60000, 28, 28)
(60000,)
In [33]:
print ("The test data set is:")
print (test_x.shape)
print (test_y.shape)
The test data set is:
(10000, 28, 28)
(10000,)

Take a sample from it:

In [34]:
# So now we have a 28x28 matrix, where each element is an intensity level from 0 to 1.

img = train_x[5]
img.shape
Out[34]:
(28, 28)

Let's visualize what some of these images and their corresponding training labels look like.

In [35]:
plt.figure(figsize = (4, 4))
plt.imshow(img, 'gray')
plt.xticks([])
plt.yticks([])
plt.show()
In [36]:
train_y[5]
Out[36]:
2

7.2. ANN with TensorFlow

  • Feed a grayscale image to the ANN


  • Our network model



  • Network training (learning)
$$\omega := \omega - \alpha \nabla_{\omega} \ell \left( h_{\omega} \left(x^{(i)}\right),y^{(i)}\right)$$


7.2.1. Import Library

In [37]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

7.2.2. Load MNIST Data

  • Download MNIST data with the built-in tf.keras.datasets loader
In [38]:
mnist = tf.keras.datasets.mnist

(train_x, train_y), (test_x, test_y) = mnist.load_data()
train_x, test_x = train_x/255.0, test_x/255.0

7.2.3. Define an ANN Structure

  • Input size
  • Hidden layer size
  • The number of classes


7.2.4. Define Weights and Biases

  • In TensorFlow 2 (Keras), the weights and biases are created and initialized automatically when the model is built, so TF1-style placeholders are no longer needed
  • By default, Dense layers use the Glorot uniform initializer; to initialize with a normal distribution with $\mu = 0$ and $\sigma = 0.1$ instead, an explicit initializer can be passed, as sketched below
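A minimal sketch of such an explicit initialization (illustrative; the model in this notebook keeps the Keras defaults):

import tensorflow as tf

# normal initializer with mu = 0 and sigma = 0.1
init = tf.keras.initializers.RandomNormal(mean = 0.0, stddev = 0.1)

layer = tf.keras.layers.Dense(units = 100, activation = 'relu',
                              kernel_initializer = init)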

7.2.5. Build a Model

First, the layer performs a matrix multiplication to produce a set of linear activations


$$y_j = \left(\sum\limits_i \omega_{ij}x_i\right) + b_j$$

$$\mathbf{y} = \omega^T \mathbf{x} + \mathbf{b}$$


Second, each linear activation is running through a nonlinear activation function




Third, predict values with an affine transformation



In [39]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape = (28, 28)),
    tf.keras.layers.Dense(units = 100, activation = 'relu'),
    tf.keras.layers.Dense(units = 10, activation = 'softmax')
])

7.2.6. Define Loss and Optimizer

Loss

  • This defines how we measure how accurate the model is during training. As was covered in lecture, during training we want to minimize this function, which will "steer" the model in the right direction.
  • Classification: cross entropy
    • Equivalent to applying logistic regression
$$ -\frac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log\left(h_{\omega}\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_{\omega}\left(x^{(i)}\right)\right)\right] $$

Optimizer

  • This defines how the model is updated based on the data it sees and its loss function.
  • Adam: one of the most popular optimizers

7.2.7. Define Optimization Configuration and Then Optimize




  • Define parameters for training the ANN
    • batch_size: batch size for mini-batch gradient descent (Keras uses 32 by default)
    • epochs: the number of iterations over the entire x and y data provided
    • the number of gradient steps per epoch then follows from the data size and the batch size
  • Metrics
    • Here we can define metrics used to monitor the training and testing steps. In this example, we'll look at the accuracy, the fraction of the images that are correctly classified.

Initialization

  • In TensorFlow 2, variables are initialized automatically when they are created, so no explicit initialization step is needed
In [40]:
model.compile(optimizer = 'adam',
              loss = 'sparse_categorical_crossentropy',
              metrics = ['accuracy'])
In [41]:
# Train Model

loss = model.fit(train_x, train_y, epochs = 5)
Epoch 1/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.2736 - accuracy: 0.9215
Epoch 2/5
1875/1875 [==============================] - 5s 3ms/step - loss: 0.1223 - accuracy: 0.9642
Epoch 3/5
1875/1875 [==============================] - 5s 3ms/step - loss: 0.0848 - accuracy: 0.9751
Epoch 4/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0647 - accuracy: 0.9803
Epoch 5/5
1875/1875 [==============================] - 5s 3ms/step - loss: 0.0510 - accuracy: 0.9842
In [42]:
# Evaluate Test Data

test_loss, test_acc = model.evaluate(test_x, test_y)
313/313 [==============================] - 1s 2ms/step - loss: 0.0837 - accuracy: 0.9749

7.2.8. Test or Evaluate

In [43]:
test_img = test_x[np.random.choice(test_x.shape[0], 1)]

predict = model.predict_on_batch(test_img)
mypred = np.argmax(predict, axis = 1)

plt.figure(figsize = (8,4))

plt.subplot(1,2,1)
plt.imshow(test_img.reshape(28, 28), 'gray')
plt.axis('off')
plt.subplot(1,2,2)
plt.stem(predict[0])
plt.show()

print('Prediction : {}'.format(mypred[0]))
Prediction : 0

You may observe that the accuracy on the test dataset is a little lower than the accuracy on the training dataset. This gap between training accuracy and test accuracy is an example of overfitting, when a machine learning model performs worse on new data than on its training data.

What is the highest accuracy you can achieve with this first fully connected model? Since the handwritten digit classification task is pretty straightforward, you may be wondering how we can do better...

$\Rightarrow$ As we saw in lecture, convolutional neural networks (CNNs) are particularly well-suited for a variety of tasks in computer vision, and have achieved near-perfect accuracies on the MNIST dataset. We will build a CNN and ultimately output a probability distribution over the 10 digit classes (0-9) in the next lectures.

8. More in ANN

8.1. Nonlinear Activation Function

  • The Vanishing Gradient Problem

  • As more layers using certain activation functions are added to a neural network, the gradients of the loss function approach zero, making the network hard to train.

  • For example,

$$\frac{dz}{du} = \frac{dz}{dy} \cdot \frac{dy}{dx} \cdot \frac{dx}{d\omega} \cdot \frac{d\omega}{du} $$




  • Rectifiers
  • The use of the ReLU activation function was a great improvement compared to the historical tanh.




  • This can be explained by the derivative of ReLU itself not vanishing, and by the resulting coding being sparse (Glorot et al., 2011); the sketch below compares the derivatives.
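A short sketch comparing the two derivatives (illustrative): the sigmoid derivative is at most $0.25$ and decays to zero in both tails, while the ReLU derivative is exactly $1$ for all positive inputs:

import numpy as np
import matplotlib.pyplot as plt

a = np.linspace(-5, 5, 200)
dsigmoid = np.exp(-a)/(1 + np.exp(-a))**2    # sigma'(a) = sigma(a)(1 - sigma(a))
drelu = (a > 0).astype(float)                # ReLU'(a) = 1 for a > 0, else 0

plt.figure(figsize = (6, 4))
plt.plot(a, dsigmoid, label = "sigmoid'")
plt.plot(a, drelu, label = "ReLU'")
plt.legend(fontsize = 12)
plt.xlabel('$a$', fontsize = 15)
plt.show()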




8.2. Batch Normalization

Batch normalization is a technique for improving the performance and stability of artificial neural networks.

It is used to normalize the inputs of each layer by adjusting and scaling the activations.




  • During training, batch normalization shifts and rescales according to the mean and variance estimated on the batch (a NumPy sketch follows).

  • At test time, it simply shifts and rescales according to the empirical moments estimated during training.
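The transform itself is short. A minimal NumPy sketch of what one batch-normalization unit does at training time (the scale $\gamma$ and shift $\beta$ are learned; the values here are illustrative):

import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])      # activations of one unit over a batch
gamma, beta, eps = 1.0, 0.0, 1e-5       # learned scale/shift (assumed values here)

mu, var = x.mean(), x.var()             # batch statistics (training time)
x_hat = (x - mu)/np.sqrt(var + eps)     # normalize: zero mean, unit variance
y = gamma*x_hat + beta                  # learned shift and rescale

print(y)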

In [44]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

Overfitting in Regression

In [45]:
N = 10
data_x = np.linspace(-4.5, 4.5, N)
data_y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

data_x = data_x.reshape(-1,1)
data_y = data_y.reshape(-1,1)

plt.figure(figsize = (6, 4))
plt.plot(data_x, data_y, 'o')
plt.grid(alpha = 0.3)
plt.show()
In [46]:
base_model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_shape = (1,),
                          units = 30, activation = 'sigmoid'),
    tf.keras.layers.Dense(units = 100, activation = 'sigmoid'),
    tf.keras.layers.Dense(units = 100, activation = 'sigmoid'),
    tf.keras.layers.Dense(units = 30, activation = 'sigmoid'),
    tf.keras.layers.Dense(units = 1, activation = None)
])
In [47]:
base_model.compile(optimizer = tf.keras.optimizers.Adam(0.001),
                   loss = 'mse',
                   metrics = ['mse'])
In [48]:
# Train Model & Evaluate Test Data

training = base_model.fit(data_x, data_y, epochs = 5000, verbose = 0)

xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
my_pred = base_model.predict(xp)

plt.figure(figsize = (6, 4))
plt.plot(data_x, data_y, 'o')
plt.plot(xp, my_pred, 'r')
plt.grid(alpha = 0.3)
plt.show()
4/4 [==============================] - 0s 3ms/step

Batch Normalization Implementation

  • This example is not meant to demonstrate the improvement in performance and stability that batch normalization brings, but to demonstrate how to implement batch normalization in TensorFlow 2.
In [49]:
bn_model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units = 30, activation = None, input_shape = (1,)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.Dense(units = 100, activation = None),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.Dense(units = 100, activation = None),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.Dense(units = 30, activation = None),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.Dense(units = 1, activation = None)
])
In [50]:
bn_model.compile(optimizer = tf.keras.optimizers.Adam(0.001),
                 loss = 'mse',
                 metrics = ['mse'])
In [51]:
training = bn_model.fit(data_x, data_y, epochs = 4000, verbose = 0)

xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
my_pred = bn_model.predict(xp)

plt.figure(figsize = (6, 4))
plt.plot(data_x, data_y, 'o')
plt.plot(xp, my_pred, 'r')
plt.grid(alpha = 0.3)
plt.show()
4/4 [==============================] - 0s 3ms/step

8.3. Dropout as Regularization

8.3.1. Regularization (Shrinkage Methods)

Overfitting is often associated with very large estimated parameters $\omega$

We want to balance

  • how well the function fits the data

  • the magnitude of the coefficients

    $$ \begin{align*} \text{Total loss } = \;&\underbrace{\text{measure of fit}}_{RSS(\omega)} + \;\lambda \cdot \underbrace{\text{measure of magnitude of coefficients}}_{\lambda \cdot \lVert \omega \rVert_d} \\ \\ \implies &\min\; \lVert h_{\omega} (x_i) - y \rVert_2^2 + \lambda \lVert \omega \rVert_d \end{align*} $$
    where $ RSS(\omega) = \lVert h_{\omega} (x_i) - y \rVert^2_2 $ (= Residual Sum of Squares) and $\lambda$ is a tuning parameter to be determined separately


  • The second term, $\lambda \, \lVert \omega \rVert_d$, called a shrinkage penalty, is small when $\omega_1, \cdots,\omega_n$ are close to zero, and so it has the effect of shrinking the estimates of $\omega_j$ towards zero
  • The tuning parameter $\lambda$ serves to control the relative impact of these two terms on the weight estimates (a Keras sketch follows)
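In Keras, this penalty can be attached per layer through a regularizer. A sketch with an illustrative $\lambda$ (the $d = 2$ ridge/L2 case; none of the models in this notebook use it):

import tensorflow as tf

lam = 0.01    # tuning parameter lambda (assumed value, normally chosen by validation)

# L2 (ridge) shrinkage penalty on the layer's weights
layer = tf.keras.layers.Dense(units = 30, activation = 'sigmoid',
                              kernel_regularizer = tf.keras.regularizers.l2(lam))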

8.3.2. Different Regularization Techniques

  • Big Data
  • Data augmentation
    • The simplest way to reduce overfitting is to increase the size of the training data




  • Early stopping
    • When the performance on the validation set starts getting worse, we immediately stop training the model (a Keras sketch follows)
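Keras provides early stopping as a callback. A minimal sketch (the monitored quantity and patience are illustrative choices):

import tensorflow as tf

# stop when the validation loss has not improved for 5 consecutive epochs
early_stop = tf.keras.callbacks.EarlyStopping(monitor = 'val_loss', patience = 5)

# usage (assuming train_x, train_y and a compiled model):
# model.fit(train_x, train_y, validation_split = 0.2,
#           epochs = 100, callbacks = [early_stop])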




8.3.3. Dropout

  • This is one of the most interesting types of regularization techniques.
  • It also produces very good results and is consequently one of the most frequently used regularization techniques in the field of deep learning.
  • At every training iteration, it randomly selects some nodes and removes them.
  • It can also be thought of as an ensemble technique in machine learning.




  • tf.keras.layers.Dropout(rate = p)
  • For training
    • rate: the probability that each element is dropped. For example, setting rate = 0.1 drops 10% of the input elements at random; dropped elements output 0. Inputs that are kept are scaled up by $\frac{1}{1-\text{rate}}$ so that the expected sum is unchanged.
  • For testing
    • All the elements are kept (a quick check follows)
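A quick check of this train/test behavior (illustrative):

import numpy as np
import tensorflow as tf

drop = tf.keras.layers.Dropout(rate = 0.5)
x = np.ones((1, 6), dtype = np.float32)

print(drop(x, training = True).numpy())    # roughly half the elements zeroed, survivors scaled by 2
print(drop(x, training = False).numpy())   # test mode: all elements kept unchanged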

Dropout Implementation

In [52]:
dropout_model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_shape = (1,),
                          units = 30, activation = 'sigmoid'),
    tf.keras.layers.Dropout(rate = 0.2),
    tf.keras.layers.Dense(units = 100, activation = 'sigmoid'),
    tf.keras.layers.Dropout(rate = 0.2),
    tf.keras.layers.Dense(units = 100, activation = 'sigmoid'),
    tf.keras.layers.Dropout(rate = 0.2),
    tf.keras.layers.Dense(units = 30, activation = 'sigmoid'),
    tf.keras.layers.Dropout(rate = 0.2),
    tf.keras.layers.Dense(units = 1, activation = None)
])
In [53]:
dropout_model.compile(optimizer = tf.keras.optimizers.Adam(0.001),
                      loss = 'mse',
                      metrics = ['mse'])
In [54]:
training = dropout_model.fit(data_x, data_y, epochs = 200, verbose = 0)

xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
my_pred = dropout_model.predict(xp)

plt.figure(figsize = (6, 4))
plt.plot(data_x, data_y, 'o')
plt.plot(xp, my_pred, 'r')
plt.grid(alpha = 0.3)
plt.show()
4/4 [==============================] - 0s 3ms/step
In [55]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')