Artificial Neural Networks (ANN)
= Multi-Layer Perceptron (MLP)



By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST




1. Recall Perceptron

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('W5i9OA0bW-A', width="560", height="315", frameborder="0")
Out[ ]:

Perceptron


A perceptron models the decision boundary as a linear equation:


$$\omega_0 + \omega_1 x_1 + \omega_2 x_2 + \cdots + \omega_d x_d = 0$$


This means that a perceptron can only separate the data with a straight line in two dimensions, or a hyperplane in general. It is useful to think of a single perceptron as one hyperplane in the input space.

XOR Problem


Minsky-Papert Controversy on XOR

  • not linearly separable
  • limitation of perceptron

A perceptron is a simple linear classifier that separates data using a single hyperplane. However, there are some classification tasks that a perceptron cannot solve due to its linear nature. A classic example of this limitation is the XOR (exclusive OR) problem.
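
This limitation can be checked numerically. The following is a minimal sketch that is not part of the original notebook: it assumes the classic perceptron update rule, and the helper names `train_perceptron` and `accuracy` are hypothetical. It trains one perceptron on the AND truth table and one on the XOR truth table.

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=1.0):
    # prepend a column of ones so that w[0] plays the role of omega_0
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            pred = 1 if xi @ w > 0 else 0   # step activation
            w += lr * (yi - pred) * xi      # update only on mistakes
    return w

def accuracy(w, X, y):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    pred = (Xb @ w > 0).astype(int)
    return np.mean(pred == y)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])
y_xor = np.array([0, 1, 1, 0])

acc_and = accuracy(train_perceptron(X, y_and), X, y_and)
acc_xor = accuracy(train_perceptron(X, y_xor), X, y_xor)

print("AND accuracy:", acc_and)
print("XOR accuracy:", acc_xor)
```

AND is linearly separable, so the perceptron reaches full accuracy. For XOR, no single line can classify more than three of the four points correctly, so the accuracy cannot exceed 0.75 however long the training runs.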




2. From Perceptron to Multi-Layer Perceptron (MLP)

To explain the concept of a multi-layer perceptron (MLP), we start with a simple perceptron and establish its connection to a single hyperplane. A perceptron functions as a linear classifier by learning a hyperplane that separates data into distinct classes. However, when the data is non-linearly separable, a single hyperplane is insufficient.

In such cases, two possible approaches arise:

  1. Using multiple hyperplanes (or multiple dividers): This can be achieved by stacking multiple layers of perceptrons, allowing the model to combine hyperplanes and approximate more complex decision boundaries.
  2. Learning non-linear boundaries (or curved surface): By introducing nonlinear activation functions and multiple layers, the model can learn non-linear transformations that capture complex patterns in the data.

This progression from a single hyperplane to non-linear boundaries forms the foundation of multi-layer perceptrons (MLPs), enabling them to handle non-linear separability.


2.1. Perceptron for $h_{\omega}(x)$

  • Neurons compute the weighted sum of their inputs

  • A neuron is activated or fired when the sum $a$ is positive


$$ \begin{align*} a &= \omega_0 + \omega_1 x_1 + \omega_2 x_2 \\ \\ \hat{y} &= g(a) = \begin{cases} 1 & a > 0\\ 0 & \text{otherwise} \end{cases} \end{align*} $$




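  • A step function (or sign function) is not differentiable at zero, and its gradient is zero everywhere else, so gradient-based learning cannot be applied to it directly.
    • To address this limitation, we later replace the step function with differentiable non-linear activation functions such as the sigmoid or tanh functions.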


  • A single layer is insufficient to solve the XOR problem.

2.2. Multi-Layer Perceptron (MLP) = Artificial Neural Networks (ANN)

  • Multi-neurons
    • A single perceptron can only learn one hyperplane. When the data is not linearly separable, one hyperplane cannot capture the boundary between the classes, so more neurons or more layers are needed.
    • For example, two perceptrons applied to the same input represent two hyperplanes.





  • Switch to differentiable activation function (for example, the sigmoid function)






  • In a compact representation
    • Combining the summation and the sigmoid function forms another neuron.





  • Multi-layer perceptron
    • Can be viewed as a sequence of feature extraction layers, where each layer transforms the input into a higher-level, more abstract representation. This process of hierarchical feature extraction allows the MLP to learn complex patterns from data.


Multi-Layer Perceptron as a Sequence of Feature Extraction

A multi-layer perceptron is not merely a collection of neurons but a structured sequence of feature extraction steps. Each layer extracts increasingly abstract and relevant features, allowing the model to handle complex, non-linear tasks that a single-layer perceptron cannot solve.


Intuition Behind Feature Extraction in MLP

  • The first layer extracts basic features from the input (e.g., edges or simple patterns in images).
  • The hidden layers combine these basic features to create more complex representations (e.g., shapes, contours).
  • The final layer maps the abstract features to the output (e.g., predicting a class label or regression value).

By stacking layers, the MLP gradually transforms simple input features into complex features that capture relationships in non-linear data.

Each layer can be thought of as performing a mapping from one feature space to another, progressively refining the representation to make the final decision easier.


This hierarchy of transformations allows the network to move from raw input data to task-specific representations.


2.3. Hidden Layers as Kernel Learning

The XOR example can be solved by pre-processing the data to make the two populations linearly separable.
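
As a concrete illustration of such a pre-processing step, the sketch below uses a hand-picked hidden layer with step activations. The weights are illustrative assumptions, not values learned by any model in this notebook. The first hidden neuron computes OR, the second computes AND, and the output neuron separates the transformed points with a single line.

```python
import numpy as np

def step(a):
    # step activation: 1 if a > 0, otherwise 0
    return (a > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# hidden layer (hand-picked, hypothetical weights); columns = hidden neurons
W1 = np.array([[1, 1],
               [1, 1]])
b1 = np.array([-0.5, -1.5])      # z1 = OR(x1, x2), z2 = AND(x1, x2)
Z = step(X @ W1 + b1)

# output layer: one line in the (z1, z2) space, z1 - z2 - 0.5 = 0
w2 = np.array([1, -1])
b2 = -0.5
y_hat = step(Z @ w2 + b2)

print(Z)        # transformed features
print(y_hat)    # XOR of the two inputs
```

In the $(z_1, z_2)$ space the four inputs collapse onto three points, $(0,0)$, $(1,0)$ and $(1,1)$, and these are linearly separable even though the original inputs are not.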



In machine learning, defining the appropriate kernels for non-linear mapping is crucial for models like support vector machines (SVMs) and kernel-based methods. However, in deep learning, the kernel function is effectively represented by the hidden layers of the neural network, which are learned directly from the data rather than being predefined.

  • In traditional machine learning: Kernels (such as polynomial or RBF kernels) are manually chosen to project the input into a higher-dimensional space for better separation.
  • In deep learning: The hidden layers act as adaptive feature extractors, learning the non-linear transformations (analogous to kernels) automatically during training.

This flexibility allows deep learning models, such as multi-layer perceptrons (MLPs), to discover the best representations and transformations for the data without requiring explicit kernel design.



3. Logistic Regression in a Form of Neural Network


Here, we will demonstrate a multi-layer perceptron (MLP) using a logistic regression example. In this demonstration, we will observe the hyperplanes and the features learned in the hidden layer at the end of the training process.

  • When we extend logistic regression to a multi-layer perceptron, the hidden layer transforms the input features through non-linear activations, creating intermediate feature spaces.
  • At the end of learning, we can visualize how the hidden layer features and multiple hyperplanes evolve to form more complex decision boundaries.

This approach will illustrate how the MLP builds upon logistic regression by stacking multiple transformations to handle non-linear separability.


Let's start with logistic regression for a linearly separable case.


$$ \begin{align*} y^{(i)} &\in \{0, 1\}\\\\ y &= \sigma (\omega_0 + \omega_1 x_1 + \omega_2 x_2) \end{align*} $$


After training, $\omega_0 + \omega_1 x_1 + \omega_2 x_2 = 0$ will represent a linear classification boundary.
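
The reason is that $\sigma(a) > 0.5$ exactly when $a > 0$, so thresholding the predicted probability at 0.5 is the same as checking which side of the line a point lies on. A minimal sketch, assuming illustrative (hypothetical) weight values rather than trained ones:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

# hypothetical parameters, chosen only for illustration
w0, w1, w2 = -3.0, 0.8, 1.0

points = np.array([[1.0, 1.0],
                   [5.0, 0.0],
                   [2.0, 3.0]])

a = w0 + w1 * points[:, 0] + w2 * points[:, 1]   # signed score of each point
prob = sigmoid(a)                                # predicted P(y = 1)
labels = (prob > 0.5).astype(int)

print(a)
print(prob)
print(labels)
```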





In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

%matplotlib inline
In [ ]:
# training data generation
m = 1000
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4

g = 0.8*x1 + x2 - 3

C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

train_X = np.vstack([X1, X0])
train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])

train_X = np.asmatrix(train_X)
train_y = np.asmatrix(train_y)

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1)
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.show()
In [ ]:
LogisticRegression = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_dim = 2,
                          units = 1,
                          activation = 'sigmoid')
])
In [ ]:
LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1),
                           loss = 'binary_crossentropy')
In [ ]:
loss = LogisticRegression.fit(train_X, train_y, epochs = 10)
Epoch 1/10
32/32 [==============================] - 0s 1ms/step - loss: 0.9861
Epoch 2/10
32/32 [==============================] - 0s 1ms/step - loss: 0.2784
Epoch 3/10
32/32 [==============================] - 0s 1ms/step - loss: 0.2044
Epoch 4/10
32/32 [==============================] - 0s 1ms/step - loss: 0.1674
Epoch 5/10
32/32 [==============================] - 0s 1ms/step - loss: 0.1478
Epoch 6/10
32/32 [==============================] - 0s 1ms/step - loss: 0.1305
Epoch 7/10
32/32 [==============================] - 0s 1ms/step - loss: 0.1211
Epoch 8/10
32/32 [==============================] - 0s 1ms/step - loss: 0.1101
Epoch 9/10
32/32 [==============================] - 0s 1ms/step - loss: 0.1029
Epoch 10/10
32/32 [==============================] - 0s 1ms/step - loss: 0.0979
In [ ]:
w = LogisticRegression.layers[0].get_weights()[0]
b = LogisticRegression.layers[0].get_weights()[1]

print(w)
print("\n")
print(b)
[[1.6916244]
 [2.370146 ]]


[-6.223168]
In [ ]:
x1p = np.arange(0, 8, 0.01).reshape(-1, 1)
x2p = - w[0,0]/w[1,0]*x1p - b[0]/w[1,0]

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'g', linewidth = 3, label = '')
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.legend(loc = 1)
plt.show()

3.1. Looking at Parameters in Nonlinear Classification

Now, let's move on to a non-linearly separable case.

In this scenario, the dataset cannot be separated by a single linear hyperplane. Therefore, we use a multi-layer perceptron (MLP), which approximates complex non-linear decision boundaries by learning multiple hyperplanes and combining them using non-linear activation functions.

Let's see how the MLP performs in this case!


Before that, note a change of notation. By neural network convention, the intercept $\omega_0$ is renamed the bias $b$ ($\omega_0 \rightarrow b$):

  • Weights as $\omega$
  • Bias as $b$ (previously denoted as $\omega_0$)

$$y = \sigma(\omega_0 + \omega_1 x_1 + \omega_2 x_2) \quad \longrightarrow \quad y = \sigma(b + \omega_1 x_1 + \omega_2 x_2)$$




In [ ]:
# training data generation

m = 1000
x1 = 10*np.random.rand(m, 1) - 5
x2 = 8*np.random.rand(m, 1) - 4

g = - 0.5*(x1-1)**2 + 2*x2 + 5

C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

train_X = np.vstack([X1, X0])
train_X = np.asmatrix(train_X)

train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()

As illustrated in the figure, the two classes are not linearly separable. To approximate the non-linear decision boundary using two linear boundaries, a hidden layer with two neurons (plus a bias neuron) is intentionally added.



In [ ]:
LogisticRegression = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_dim = 2, units = 2, activation = 'sigmoid'),
    tf.keras.layers.Dense(units = 1, activation = 'sigmoid')
])
In [ ]:
LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1),
                           loss = 'binary_crossentropy')
In [ ]:
loss = LogisticRegression.fit(train_X, train_y, epochs = 10)
Epoch 1/10
32/32 [==============================] - 1s 2ms/step - loss: 0.5916
Epoch 2/10
32/32 [==============================] - 0s 2ms/step - loss: 0.4435
Epoch 3/10
32/32 [==============================] - 0s 2ms/step - loss: 0.3970
Epoch 4/10
32/32 [==============================] - 0s 1ms/step - loss: 0.3365
Epoch 5/10
32/32 [==============================] - 0s 1ms/step - loss: 0.2901
Epoch 6/10
32/32 [==============================] - 0s 1ms/step - loss: 0.2514
Epoch 7/10
32/32 [==============================] - 0s 2ms/step - loss: 0.2084
Epoch 8/10
32/32 [==============================] - 0s 1ms/step - loss: 0.1801
Epoch 9/10
32/32 [==============================] - 0s 1ms/step - loss: 0.1604
Epoch 10/10
32/32 [==============================] - 0s 1ms/step - loss: 0.1428
In [ ]:
w1 = LogisticRegression.layers[0].get_weights()[0]
b1 = LogisticRegression.layers[0].get_weights()[1]

w2 = LogisticRegression.layers[1].get_weights()[0]
b2 = LogisticRegression.layers[1].get_weights()[1]
In [ ]:
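# hidden-layer pre-activation X w1 + b1; np.asarray lets @ work for both matrix and array inputs
H = np.asarray(train_X) @ w1 + b1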
H = 1/(1 + np.exp(-H))

plt.figure(figsize = (6, 4))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()

Here the features $z_1$ and $z_2$ learned in the hidden layer have two notable characteristics:

  • Bounded Output Range: Both features are restricted to the interval $(0,1)$ due to the use of the sigmoid activation function. This non-linear transformation ensures that the outputs remain within a normalized range, regardless of the input values.

  • Feature Redistribution: The hidden layer redistributes the input data into a new feature space where the transformed data becomes approximately linearly separable. This transformation allows the multi-layer perceptron to form a linear decision boundary in the $z$-space (here two-dimensional, like the input space) that corresponds to a non-linear boundary in the original input space.

These learned features (or feature extractions) demonstrate how the hidden layer facilitates the approximation of complex patterns by creating new, informative representations of the input data.


In [ ]:
x1p = np.arange(0, 1, 0.01).reshape(-1, 1)
x2p = - w2[0,0]/w2[1,0]*x1p - b2[0]/w2[1,0]

plt.figure(figsize = (6, 4))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()

The boundary learned by the output layer is a straight line in the $z$-space, indicating that the transformed features $z_1$ and $z_2$ successfully reshape the input data into a feature space where a linear decision boundary can separate the classes. This illustrates the role of the hidden layer: it maps data that is not linearly separable in the input space into a new feature space (here still two-dimensional) where it becomes approximately linearly separable.


In [ ]:
x1p = np.arange(-5, 5, 0.01).reshape(-1, 1)
x2p = - w1[0,0]/w1[1,0]*x1p - b1[0]/w1[1,0]
x3p = - w1[0,1]/w1[1,1]*x1p - b1[1]/w1[1,1]

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.plot(x1p, x3p, 'g', linewidth = 3, label = '')
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()

After training, two linear boundaries in the $x$-space (input space) are learned, corresponding to the linear boundaries formed by the hidden layer neurons. These boundaries result in multiple lines that, when combined, approximate the non-linear classification boundary. This demonstrates how the multi-layer perceptron (MLP) constructs a complex decision boundary in the input space by learning and integrating multiple hyperplanes from the hidden layer.


4. Regression in a Form of Neural Network

A multi-layer perceptron (MLP) can also be applied to nonlinear regression tasks. Unlike linear regression, which fits a straight line, MLPs can model complex, nonlinear relationships between input features and the output due to their use of multiple layers and nonlinear activation functions.

In this example, the Rectified Linear Unit (ReLU) will be used as the non-linear activation function. Therefore, let’s first take a closer look at ReLU before proceeding to the example.


Rectified linear unit (ReLU activation function)


$$h(x) = \max(0, x) = \begin{cases} 0, &\text{if}\;\; x \leq 0 \\ x, &\text{if} \;\; x > 0 \end{cases}$$


Key Characteristics of ReLU:

  • Piecewise Linear: ReLU outputs zero for negative values of $x$ and passes the value of $x$ unchanged for positive values.
  • Sparsity: ReLU introduces sparsity by setting some neuron activations to zero, which can make the network more efficient.
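  • Mitigates Vanishing Gradients: Unlike sigmoid or tanh, ReLU does not saturate for positive inputs, where its gradient is exactly 1. This mitigates, but does not fully eliminate, the vanishing gradient problem in deep networks, since the gradient is 0 for negative inputs.
  • Non-saturating: The gradient remains constant for positive values, which often makes optimization faster in practice.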
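
The gradient argument can be made concrete with a small numerical sketch. It is a simplification that looks only at the activation derivatives along a chain of layers and ignores the weights; the helper names and the depth of 10 are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def sigmoid_grad(a):
    # derivative of the sigmoid: s(a) * (1 - s(a)), at most 0.25 (reached at a = 0)
    s = sigmoid(a)
    return s * (1 - s)

def relu_grad(a):
    # derivative of ReLU: 1 for positive inputs, 0 otherwise
    return (np.asarray(a) > 0).astype(float)

a = np.array([-4.0, 0.0, 4.0])
sig_g = sigmoid_grad(a)
relu_g = relu_grad(a)

depth = 10
sig_chain = float(sigmoid_grad(0.0)) ** depth   # best case for sigmoid: 0.25 per layer
relu_chain = 1.0 ** depth                       # active ReLU units: 1 per layer

print(sig_g, relu_g)
print(sig_chain, relu_chain)
```

Even in the best case, the product of ten sigmoid derivatives is below $10^{-6}$, whereas a chain of active ReLU units passes the gradient through unchanged.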

$$h(x)$$

In [ ]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

def relu(x):
    return np.maximum(0, x)

xp = np.linspace(-1.5, 1.5, 100)
yp = relu(xp)

plt.figure(figsize = (6, 4))
plt.plot(xp, yp, '--', color = 'red', alpha = 0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.axis('equal')
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.grid(alpha = 0.3)
plt.show()

$$h(x-0.5)$$

In [ ]:
xp = np.linspace(-1.5, 1.5, 100)
yp = relu(xp - 0.5)

plt.figure(figsize = (6, 4))
plt.plot(xp, yp, '--', color = 'red', alpha = 0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.axis('equal')
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.grid(alpha = 0.3)
plt.show()

$$h(-2x)$$

In [ ]:
xp = np.linspace(-1.5, 1.5, 100)
yp = relu(-2*xp)

plt.figure(figsize = (6, 4))
plt.plot(xp, yp, '--', color = 'red', alpha = 0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.axis('equal')
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.grid(alpha = 0.3)
plt.show()

$$- h(-2x-1)$$

In [ ]:
xp = np.linspace(-1.5, 1.5, 100)
yp = -relu(-2*xp - 1)

plt.figure(figsize = (6, 4))
plt.plot(xp, yp, '--', color = 'red', alpha = 0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.axis('equal')
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.grid(alpha = 0.3)
plt.show()

The expression $\omega_2 h(\omega_1 x + b_1) + b_2$, where $h(\cdot)$ represents the ReLU activation function, can describe a variety of transformations of the ReLU function, including horizontal and vertical shifts, scaling, and reflections across the $x$-axis or $y$-axis, depending on the values and signs of $\omega_1, \omega_2, b_1$, and $b_2$.
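
As a worked form of this expression (a sketch derived directly from the definition of $h$, assuming $\omega_1 \neq 0$):

$$\omega_2 h(\omega_1 x + b_1) + b_2 = \begin{cases} b_2, &\text{if}\;\; \omega_1 x + b_1 \leq 0 \\ \omega_1 \omega_2 x + \omega_2 b_1 + b_2, &\text{if}\;\; \omega_1 x + b_1 > 0 \end{cases}$$

The kink sits at $x = -b_1/\omega_1$, the slope on the active side is $\omega_1 \omega_2$, and the sign of $\omega_1$ decides whether the active side lies to the right ($\omega_1 > 0$) or to the left ($\omega_1 < 0$) of the kink. For example, $-h(-2x-1)$ has $\omega_1 = -2$, $b_1 = -1$, $\omega_2 = -1$, and $b_2 = 0$: the kink is at $x = -0.5$, and for $x < -0.5$ the output is $2x + 1$.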


Let us consider the task of modeling a multi-layer perceptron (MLP) to approximate the given non-linear function.


In [ ]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

np.random.seed(0)
tf.random.set_seed(0)

def function(x):
    return x**3 + 0.1*x**2 - x + 0.1

x = np.linspace(-1.5, 1.5, 1000)
y = function(x)

plt.figure(figsize = (6, 4))
plt.plot(x, y, '--', color = 'red', alpha = 0.5)
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()
In [ ]:
train_x = np.random.uniform(-1.5, 1.5, 1000)
train_y = function(train_x)

Nonlinear Regression as a Linear Combination of ReLU Functions

In an MLP with ReLU activation, the output of the model can be interpreted as a linear combination of ReLU-transformed features. Each hidden layer neuron applies a ReLU transformation to a weighted sum of the input features, and the final output is a linear combination of these transformed features.


Key Insight:

Nonlinear regression with MLPs can be thought of as creating a piecewise linear approximation of the true function, where the linear segments are defined by the ReLU activations in the hidden layer. The network learns how to place these segments and their slopes to best fit the target function.

For a neural network with one hidden layer, the predicted output $ \hat{y} $ is:


$$ \hat{y} = \sum_{j=1}^{m} \omega_j^{(2)} h\left( \sum_{i=1}^{n} \omega_{ij}^{(1)} x_i + b_j^{(1)} \right) + b^{(2)} $$


$\quad$where:

  • $ \omega_{ij}^{(1)} $ and $ b_j^{(1)} $ are the weights and biases for the hidden layer.
  • $ \omega_j^{(2)} $ and $ b^{(2)} $ are the weights and bias for the output layer.

In the following examples, $b^{(2)}$, the bias term for the output layer, is omitted for simplicity. In other words, the vertical shift transformation is not applied.
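
The formula above can be written in a few lines of NumPy. This is a minimal sketch rather than the notebook's implementation: the function name `mlp_predict` and the weight values are hypothetical, chosen so that the result is easy to verify by hand. With two hidden neurons computing $h(x)$ and $h(-x)$ and output weights of 1, the network reproduces $|x|$.

```python
import numpy as np

def relu(a):
    return np.maximum(0, a)

def mlp_predict(X, W1, b1, w2, b2=0.0):
    # X: (N, n) inputs, W1: (n, m) hidden weights, b1: (m,) hidden biases
    # w2: (m,) output weights, b2: scalar output bias (0 when omitted)
    H = relu(X @ W1 + b1)    # hidden features, shape (N, m)
    return H @ w2 + b2       # linear combination of ReLU features, shape (N,)

# hypothetical weights: n = 1 input, m = 2 hidden neurons
X = np.array([[-1.0], [0.0], [2.0]])
W1 = np.array([[1.0, -1.0]])
b1 = np.array([0.0, 0.0])
w2 = np.array([1.0, 1.0])

y_hat = mlp_predict(X, W1, b1, w2)
print(y_hat)
```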


Single Neuron with ReLU



$$\hat{y} = \omega^{(2)} \left( h \left( \omega^{(1)} x + b^{(1)} \right) \right)$$


In [ ]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units = 1, activation = 'relu'),
    tf.keras.layers.Dense(units = 1, use_bias = False)
])
In [ ]:
model.compile(optimizer = 'adam',
              loss = 'mse')
In [ ]:
train_x = train_x.reshape(-1, 1)
train_y = train_y.reshape(-1, 1)

model.fit(train_x, train_y, epochs = 500, verbose = 0)
Out[ ]:
<keras.src.callbacks.History at 0x7cb22461d150>
In [ ]:
weights = model.layers[0].get_weights()[0]
biases = model.layers[0].get_weights()[1]
weights2 = model.layers[1].get_weights()[0]

print("Coefficients (Weights):", weights)
print("Intercepts (Biases):", biases)
print("Coefficients (Weights):", weights2)
Coefficients (Weights): [[-0.4981401]]
Intercepts (Biases): [-0.7469519]
Coefficients (Weights): [[1.1054615]]
In [ ]:
real_x = np.linspace(-1.5, 1.5, 100)
real_y = function(real_x)
In [ ]:
def relu(x):
    return np.maximum(0, x)
In [ ]:
x1p = np.arange(-2, 2, 0.01).reshape(-1, 1)
x2p = weights2*(relu(weights*x1p + biases))

plt.figure(figsize = (6, 4))
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.plot(real_x, real_y, '--', color = 'red', alpha = 0.5)
plt.plot(x1p, x2p, 'c', linewidth = 3)
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()

Two Neurons with ReLU



$$\hat{y} = \omega_1^{(2)} \left( h \left( \omega_1^{(1)} x + b_1^{(1)} \right) \right) + \omega_2^{(2)} \left( h \left( \omega_2^{(1)} x + b_2^{(1)} \right) \right)$$


In [ ]:
model_2 = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units = 2, activation = 'relu'),
    tf.keras.layers.Dense(units = 1, use_bias = False)
])

model_2.compile(optimizer = 'adam',
                loss = 'mse')

train_x = train_x.reshape(-1, 1)
train_y = train_y.reshape(-1, 1)

model_2.fit(train_x, train_y, epochs = 3000, verbose = 0)
Out[ ]:
<keras.src.callbacks.History at 0x78ea90ed0c70>
In [ ]:
weights = model_2.layers[0].get_weights()[0]
biases = model_2.layers[0].get_weights()[1]
weights2 = model_2.layers[1].get_weights()[0]

print("Coefficients (Weights):", weights)
print("Intercepts (Biases):", biases)
print("Coefficients (Weights):", weights2)
Coefficients (Weights): [[ 1.4730617 -1.5367993]]
Intercepts (Biases): [-1.463359  -1.7338109]
Coefficients (Weights): [[ 2.707407 ]
 [-2.5364404]]
In [ ]:
x1p = np.arange(-2, 2, 0.01).reshape(-1, 1)
x2p = weights2[0]*relu(weights[0][0]*x1p + biases[0])
x3p = weights2[1]*relu(weights[0][1]*x1p + biases[1])

plt.figure(figsize = (6, 4))
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.plot(real_x, real_y, '--', color = 'red', alpha = 0.5)
plt.plot(x1p, x2p, 'b', linewidth = 3, alpha = 0.5)
plt.plot(x1p, x3p, 'k', linewidth = 3, alpha = 0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()
In [ ]:
x1p = np.arange(-2, 2, 0.01).reshape(-1, 1)
x2p = weights2[0]*relu(weights[0][0]*x1p + biases[0]) + weights2[1]*relu(weights[0][1]*x1p + biases[1])

plt.figure(figsize = (6, 4))
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.plot(real_x, real_y, '--', color = 'red', alpha = 0.5)
plt.plot(x1p, x2p, 'c', linewidth = 3)
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()

Four Neurons with ReLU



$$\hat{y} = \omega_1^{(2)} \left( h \left( \omega_1^{(1)} x + b_1^{(1)} \right) \right) + \omega_2^{(2)} \left( h \left( \omega_2^{(1)} x + b_2^{(1)} \right) \right) + \omega_3^{(2)} \left( h \left( \omega_3^{(1)} x + b_3^{(1)} \right)\right) + \omega_4^{(2)} \left( h \left( \omega_4^{(1)} x + b_4^{(1)} \right) \right) $$


In [ ]:
model_4 = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units = 4, activation = 'relu'),
    tf.keras.layers.Dense(units = 1, use_bias = False)
])

model_4.compile(optimizer = 'adam',
              loss = 'mse')

model_4.fit(train_x, train_y, epochs = 4000, verbose = 0)
Out[ ]:
<keras.src.callbacks.History at 0x78ea68189b70>
In [ ]:
weights = model_4.layers[0].get_weights()[0]
biases = model_4.layers[0].get_weights()[1]
weights2 = model_4.layers[1].get_weights()[0]
In [ ]:
x1p = np.arange(-2, 2, 0.01).reshape(-1, 1)
x2p = weights2[0]*relu(weights[0][0]*x1p + biases[0])
x3p = weights2[1]*relu(weights[0][1]*x1p + biases[1])
x4p = weights2[2]*relu(weights[0][2]*x1p + biases[2])
x5p = weights2[3]*relu(weights[0][3]*x1p + biases[3])

plt.figure(figsize = (6, 4))
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.plot(real_x, real_y, '--', color = 'red', alpha = 0.5)
plt.plot(x1p, x2p, 'b', linewidth = 3, alpha = 0.5)
plt.plot(x1p, x3p, 'k', linewidth = 3, alpha = 0.5)
plt.plot(x1p, x4p, 'g', linewidth = 3, alpha = 0.5)
plt.plot(x1p, x5p, 'y', linewidth = 3, alpha = 0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()
In [ ]:
x1p = np.arange(-2, 2, 0.01).reshape(-1, 1)
x2p = weights2[0]*relu(weights[0][0]*x1p + biases[0]) + weights2[1]*relu(weights[0][1]*x1p + biases[1]) + weights2[2]*relu(weights[0][2]*x1p + biases[2]) + weights2[3]*relu(weights[0][3]*x1p + biases[3])

plt.figure(figsize = (6, 4))
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.plot(real_x, real_y, '--', color = 'red', alpha = 0.5)
plt.plot(x1p, x2p, 'c', linewidth = 3)
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()

100 Neurons with ReLU

In [ ]:
model_100 = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units = 100, activation = 'relu'),
    tf.keras.layers.Dense(units = 1, use_bias = False)
])

model_100.compile(optimizer = 'adam',
              loss = 'mse')

model_100.fit(train_x, train_y, epochs = 1000, verbose = 0)
Out[ ]:
<keras.src.callbacks.History at 0x7cba3e13b7f0>
In [ ]:
weights = model_100.layers[0].get_weights()[0]
biases = model_100.layers[0].get_weights()[1]
weights2 = model_100.layers[1].get_weights()[0]
In [ ]:
x1p = np.arange(-2, 2, 0.01).reshape(-1, 1)
x2p = 0

for i in range(100):
    x2p += weights2[i]*relu(weights[0][i]*x1p + biases[i])
In [ ]:
plt.figure(figsize = (6, 4))
plt.xlim([-2, 2])
plt.ylim([-1.7, 2.2])
plt.plot(real_x, real_y, '--', color = 'red', alpha = 0.5)
plt.plot(x1p, x2p, 'c', linewidth = 3)
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()

Key Points:

  • Each curve represents the output of a single ReLU neuron in response to the input feature $x$.
  • The activation values are zero for certain ranges of $x$, indicating that the neuron is "inactive" in those regions.
  • The piecewise linear segments created by the ReLU activations contribute to forming the overall non-linear regression curve.

This illustrates how the hidden layer's neurons, each applying ReLU activation, are active over different regions of the input space and collectively approximate the non-linear function.
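
To make the "inactive regions" concrete, the sketch below takes the weights printed for the two-neuron model earlier in this section and computes where each neuron switches on. The values are copied from that printed output and would differ on another training run; the variable names are illustrative.

```python
import numpy as np

def relu(a):
    return np.maximum(0, a)

# values copied from the printed output of the two-neuron model
w1 = np.array([1.4730617, -1.5367993])   # hidden-layer weights
b1 = np.array([-1.463359, -1.7338109])   # hidden-layer biases
w2 = np.array([2.707407, -2.5364404])    # output-layer weights (no output bias)

kinks = -b1 / w1      # x where each neuron's pre-activation crosses zero
slopes = w1 * w2      # slope contributed by each neuron on its active side

def y_hat(x):
    return float(np.sum(w2 * relu(w1 * x + b1)))

print("kinks :", kinks)
print("slopes:", slopes)
print("y_hat(0) =", y_hat(0.0))
```

The first neuron has a positive weight, so it is active only to the right of its kink (about $x > 0.99$). The second has a negative weight, so it is active only to the left of its kink (about $x < -1.13$). Between the two kinks both neurons are inactive and the prediction is exactly zero, which is why two neurons give at most three linear segments.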


5. Artificial Neural Networks


So far, we have examined how an artificial neural network (ANN) operates at a micro level: individual neurons, layers, and their interactions. With these intuitions in place, we now broaden our perspective and look at the ANN as a whole, considering its overall architecture and the role each component plays in the network's collective decision-making process.


Complex/Nonlinear Universal Function Approximator

  • ANNs are powerful universal function approximators that can model both linear and nonlinear relationships.

  • By stacking layers of neurons and using non-linear activation functions, ANNs can represent highly complex mappings, making them suitable for a wide range of tasks, such as image recognition, speech processing, and time-series predictions.


ANN Architecture

  • ANNs are typically organized as feedforward networks with layers that are fully connected. In these networks:

    • Each neuron in a layer receives inputs from all neurons in the previous layer.
    • The network propagates information forward, from input to output, without loops.
  • Each layer first computes a linear (affine) transformation of its input: a weighted sum plus a bias.

  • Linear connections alone cannot capture complex relationships; this is where non-linear activation functions become essential.

  • Each neuron applies a non-linear activation function to the weighted sum of its inputs

  • These nonlinear neurons allow ANNs to learn non-linear decision boundaries, making them capable of solving complex, non-linearly separable problems.
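
The need for non-linear activations can be verified directly: two stacked layers without an activation collapse into one equivalent linear layer. The following is a minimal sketch with randomly generated (hypothetical) weights and layer sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical layer sizes: 3 inputs -> 4 hidden units -> 2 outputs
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)
X = rng.normal(size=(5, 3))

# two linear layers without any activation
two_layer = (X @ W1 + b1) @ W2 + b2

# the equivalent single linear layer
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = X @ W + b
same = np.allclose(two_layer, one_layer)

# inserting a ReLU between the layers breaks the equivalence
relu_layer = np.maximum(0, X @ W1 + b1) @ W2 + b2
differs = not np.allclose(relu_layer, one_layer)

print("linear stack equals one layer:", same)
print("ReLU stack differs           :", differs)
```

However many purely linear layers are stacked, the result is still a single linear map. The expressive power of a deep network comes from the non-linearity applied between the layers.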


Hidden Layers and Autonomous Feature Learning

  • Hidden layers are the intermediate layers between the input and output layers in a neural network.

  • Each hidden layer learns intermediate representations or features from the input data.

    • Shallow networks have fewer hidden layers and may struggle with highly complex patterns.

    • Deep networks with many hidden layers (Deep Neural Networks) can model highly intricate relationships but require large datasets and longer training times.

  • The neurons in the hidden layers create intermediate transformations that allow the network to construct hierarchical feature representations, leading to improved performance for complex tasks.

  • Unlike traditional machine learning models that rely on manual feature extraction, ANNs can automatically learn features from raw data.

    • Each layer in an ANN learns increasingly abstract features.

    • This autonomous feature learning makes ANNs applicable across diverse domains and reduces the need for domain-specific feature engineering.





6. ANN Learning

Now that we understand how an Artificial Neural Network (ANN) works, the next step is to determine the unknown parameters (the weights and biases) from the data. Once the structure of the ANN is designed, these parameters must be learned through training. In the following, we will discuss how this learning process takes place.


To begin with the conclusion, the learning process is performed using the backpropagation algorithm. Before diving into the details of backpropagation, it is essential to first review "recursive algorithms" and "dynamic programming", as they form the foundational concepts necessary for understanding backpropagation.


In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('mGuXwSbantc', width="560", height="315", frameborder="0")
Out[ ]:

6.1. Recursive Algorithm

  • A fundamental concept in computer science:

    • Recursive algorithms play a pivotal role in solving problems by breaking them down into smaller, manageable subproblems.
  • Subproblem-based approach:

    • The solution to the main problem depends on solving smaller instances of the same problem (subproblems).
  • Self-referential function calls:

    • A recursive function calls itself with a modified input until reaching a base case.
    • This mechanism does not directly occur in the physical world, making recursion a unique and abstract computational concept.



In [ ]:
%%html
<iframe src="https://www.youtube.com/embed/t4MSwiqfLaY?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe>

  • Factorial example

$$n ! = n \cdot (n-1) \cdots 2 \cdot 1$$

In [ ]:
n = 5

m = 1
for i in range(n):
    m = m*(i+1)

print(m)
120
In [ ]:
def fac(n):
    # base case: 0! = 1! = 1 (n <= 1 also stops the recursion for n = 0)
    if n <= 1:
        return 1
    else:
        return n*fac(n-1)
In [ ]:
# recursive

fac(5)
Out[ ]:
120

6.2. Dynamic Programming

Dynamic programming (DP) is an algorithmic technique that solves a problem by breaking it down into "overlapping" subproblems and reusing their solutions. It is particularly effective for problems that exhibit both overlapping subproblems and optimal substructure, where solutions to smaller subproblems can be stored and combined to build the solution to the overall problem.

  • Optimal Substructure

    • A problem exhibits optimal substructure if the optimal solution to the problem can be constructed from the optimal solutions of its subproblems.
    • Example: In the shortest path problem, the shortest path from point A to point C passing through point B consists of the shortest path from A to B combined with the shortest path from B to C.
  • Overlapping Subproblems

    • A problem has overlapping subproblems if the same subproblems are solved multiple times during the computation.
    • Instead of solving the same subproblem repeatedly, dynamic programming stores the results of solved subproblems in a lookup table (memoization) for future reuse.

Let’s discuss dynamic programming using the Fibonacci sequence as an example:


$$ \begin{align*} F_1 &= F_2 = 1 \\ F_n &= F_{n-1} + F_{n-2} \end{align*} $$


In [ ]:
# naive Fibonacci

def fib(n):
    if n <= 2:
        return 1
    else:
        return fib(n-1) + fib(n-2)
In [ ]:
fib(10)
Out[ ]:
55

The naive recursive method performs redundant calculations, repeatedly solving subproblems even when their solutions have already been computed. In contrast, dynamic programming allocates memory to store previously computed results. When a subproblem is encountered, DP first checks whether the result has already been calculated. If so, it retrieves the value from memory, avoiding redundant computations and significantly reducing the overall computational time. This approach exemplifies the trade-off where computational time is saved at the expense of additional memory usage - an illustration of the 'space-for-time' principle.


In [ ]:
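# Memoized DP Fibonacci

def mfib(n):
    # memo is a global array of length >= n, initialized to zeros;
    # memo[k-1] holds F_k once computed (0 means "not yet computed")
    global memo

    if memo[n-1] != 0:
        return memo[n-1]          # reuse the stored result
    elif n <= 2:
        memo[n-1] = 1             # base cases F_1 = F_2 = 1
        return memo[n-1]
    else:
        memo[n-1] = mfib(n-1) + mfib(n-2)
        return memo[n-1]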
In [ ]:
import numpy as np

n = 10
memo = np.zeros(n)
mfib(n)
Out[ ]:
55.0

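Dynamic Programming (DP) is much faster than a naive recursive algorithm because memoization avoids redundant computations. Note that in the timing below, memo is filled during the first call to mfib(30), so the remaining timing loops mostly measure a single table lookup.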


In [ ]:
n = 30
%timeit fib(30)
332 ms ± 110 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [ ]:
memo = np.zeros(n)
%timeit mfib(30)
389 ns ± 5.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Dynamic Programming (DP) $\approx$ Recursion + Memoization

Dynamic Programming can be thought of as an optimization technique that enhances recursion by storing the results of previously computed subproblems (memoization). This avoids redundant calculations and improves efficiency.

  • Recursion: Breaks a problem down into smaller subproblems and solves them recursively.
  • Memoization: Stores the solutions to subproblems in a cache (e.g., a dictionary or array) to reuse them instead of recalculating.

This combination allows DP to solve problems with overlapping subproblems and optimal substructure more efficiently.
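As a complementary sketch (not part of the original notebook), the same idea can be written without a global array. The names fib_cached and fib_bottom_up are hypothetical; the first relies on Python's standard functools.lru_cache for the memo table, and the second fills in the sequence from the bottom up, keeping only the last two values.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib_cached(n):
    # Top-down DP: plain recursion, with results cached automatically
    if n <= 2:
        return 1
    return fib_cached(n - 1) + fib_cached(n - 2)


def fib_bottom_up(n):
    # Bottom-up DP: build F_3, F_4, ... in order from the base cases
    a, b = 1, 1  # F_1, F_2
    for _ in range(n - 2):
        a, b = b, a + b
    return b


print(fib_cached(30), fib_bottom_up(30))
```

Both versions need only on the order of $n$ additions, whereas the naive recursion recomputes the same subproblems many times.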



6.3. Training Neural Networks


Now, let us return to our original problem: learning the weights and biases of a multi-layer perceptron (MLP).


  • Learning or estimating weights and biases of multi-layer perceptron from training data
    1. Forward Pass: Input $X$ passes through the network to generate $\hat Y$.
    2. Loss Calculation: The loss function compares $\hat Y$ and $Y$.
    3. Backward Pass: Gradients of the loss with respect to the weights are computed.
    4. Weight Update: Weights are adjusted based on the gradients.
    5. Repeat: The process continues until the loss converges or the training reaches a predefined number of iterations.
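The five steps above can be sketched with NumPy on a toy linear model. This is only an illustration under assumed settings: the synthetic data, the learning rate alpha = 0.1, and the 500 iterations are hypothetical choices, and a single linear unit stands in for a full MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): y = 2x + 1 plus a little noise
X = rng.uniform(-1, 1, size=(50, 1))
Y = 2 * X + 1 + 0.05 * rng.standard_normal((50, 1))

w = np.zeros((1, 1))   # weight
b = np.zeros(1)        # bias
alpha = 0.1            # learning rate (assumed value)

losses = []
for it in range(500):                      # 5. repeat
    Y_hat = X @ w + b                      # 1. forward pass
    loss = np.mean((Y_hat - Y) ** 2)       # 2. loss calculation
    grad_out = 2 * (Y_hat - Y) / len(X)    # 3. backward pass: dL/dY_hat
    grad_w = X.T @ grad_out                #    dL/dw
    grad_b = grad_out.sum(axis=0)          #    dL/db
    w -= alpha * grad_w                    # 4. weight update
    b -= alpha * grad_b
    losses.append(loss)

print(w.ravel(), b, losses[0], losses[-1])
```

The learned weight and bias should end up close to the values used to generate the data, and the recorded loss should decrease over the iterations.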


Loss Function

  • Measures error between target values and predictions

$$ \min_{\omega} \sum_{i=1}^{m}\ell\left( h_{\omega}\left(x^{(i)}\right),y^{(i)}\right)$$


  • Example
    • Squared loss (for regression):

      $$ \frac{1}{m} \sum_{i=1}^{m} \left(h_{\omega}\left(x^{(i)}\right) - y^{(i)}\right)^2 $$

    • Cross-entropy (for classification):

      $$ -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_{\omega}\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_{\omega}\left(x^{(i)}\right)\right)\right]$$
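As a small sketch, the two example losses can be written in NumPy as follows. The function names and the sample values are hypothetical; the clipping constant eps is an assumption added only to avoid taking log(0). The minus sign in the cross-entropy applies to the sum of both terms.

```python
import numpy as np


def squared_loss(y_hat, y):
    # Mean of squared differences between predictions and targets
    return np.mean((y_hat - y) ** 2)


def binary_cross_entropy(y_hat, y, eps=1e-12):
    # Clip predictions away from 0 and 1 so the logarithms stay finite
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


y = np.array([1.0, 0.0, 1.0, 0.0])        # targets
y_hat = np.array([0.9, 0.1, 0.8, 0.3])    # predicted probabilities

print(squared_loss(y_hat, y))
print(binary_cross_entropy(y_hat, y))
```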


Learning

  • Learning weights and biases from data using gradient descent

$$\omega \leftarrow \omega - \alpha \nabla_{\omega} \ell \left( h_{\omega}\left(x^{(i)}\right), y^{(i)} \right)$$


  • $\frac{\partial \ell}{\partial \omega}$: too many computations are required for all $\omega$

  • Structural constraints of NN:

    • Composition of functions
    • Chain rule
    • Dynamic programming

Backpropagation

Backpropagation is a fundamental algorithm used for training neural networks by adjusting their weights to minimize the loss function. It operates by propagating errors backward through the network, from the output layer to the input layer, using the chain rule of differentiation.


  • Forward propagation

    • the initial information propagates up to the hidden units at each layer and finally produces output
  • Backpropagation

    • allows the information from the cost to flow backwards through the network in order to compute the gradients
  • Chain rule

    • Computing the derivative of the composition of functions

$$ \begin{align*} f(g(x))' &= f'(g(x))g'(x)\\\\ {dz \over dx} &= {dz \over dy} \cdot {dy \over dx}\\ {dz \over dw} &= \left({dz \over dy} \cdot {dy \over dx}\right) \cdot {dx \over dw}\\ {dz \over du} &= \left({dz \over dy} \cdot {dy \over dx} \cdot {dx \over dw}\right) \cdot {dw \over du} \end{align*} $$
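The chain of derivatives above can be checked numerically. In this sketch the four functions are hypothetical choices made only for illustration; the point is that each new derivative reuses the product accumulated so far, and the result agrees with a finite-difference estimate.

```python
import math

# Hypothetical chain of functions: u -> w -> x -> y -> z
def w_of(u): return 3.0 * u
def x_of(w): return w ** 2
def y_of(x): return math.sin(x)
def z_of(y): return math.exp(y)

u = 0.5
w = w_of(u)
x = x_of(w)
y = y_of(x)
z = z_of(y)

# Local derivatives evaluated at the values from the forward computation
dz_dy = math.exp(y)
dy_dx = math.cos(x)
dx_dw = 2.0 * w
dw_du = 3.0

# Accumulate from the output side, reusing the previous product each time
dz_dx = dz_dy * dy_dx
dz_dw = dz_dx * dx_dw
dz_du = dz_dw * dw_du

# Central finite-difference check of dz/du
h = 1e-6
def z_total(u): return z_of(y_of(x_of(w_of(u))))
numeric = (z_total(u + h) - z_total(u - h)) / (2 * h)

print(dz_du, numeric)
```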


  • Backpropagation
    • Compute gradients recursively with memory (intermediate results are stored and reused), then update the weights




Learning the weights and biases in an artificial neural network (ANN) can be viewed as an optimization process using gradient descent or its variants. However, a significant challenge arises due to the large number of parameters (or unknowns) that need to be estimated, as well as the extensive computations of gradients required during each iteration.

The uniqueness of this problem lies in the sequentially stacked structure of ANN layers, which can be interpreted as a composition of functions. As a result, the computation of gradients involves the application of the chain rule, leading to repeated calculations for intermediate layers.

By computing these gradients in a backward manner, previously computed results can be reused in subsequent steps. This is the fundamental concept behind backpropagation, making it analogous to dynamic programming. The term 'back' in 'backpropagation' highlights this backward computation process, where intermediate results are stored and revisited to avoid redundant computations.
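A minimal NumPy sketch of this reuse, assuming a tiny network with one sigmoid hidden layer and a squared loss (the layer sizes and random data are hypothetical). The quantity delta2 computed at the output is stored and reused to obtain delta1 for the earlier layer, and one gradient entry is compared against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network (hypothetical sizes): 2 inputs -> 3 hidden (sigmoid) -> 1 output
X = rng.standard_normal((5, 2))
Y = rng.standard_normal((5, 1))
W1, b1 = rng.standard_normal((2, 3)), np.zeros(3)
W2, b2 = rng.standard_normal((3, 1)), np.zeros(1)


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def forward(W1, b1, W2, b2):
    H = sigmoid(X @ W1 + b1)            # hidden activations
    Y_hat = H @ W2 + b2                 # linear output
    loss = np.mean((Y_hat - Y) ** 2)    # squared loss
    return loss, H, Y_hat


loss, H, Y_hat = forward(W1, b1, W2, b2)

# Backward pass: start at the output and move toward the input
delta2 = 2 * (Y_hat - Y) / Y.size            # dL/dY_hat
grad_W2 = H.T @ delta2
grad_b2 = delta2.sum(axis=0)
delta1 = (delta2 @ W2.T) * H * (1 - H)       # reuses delta2 (chain rule)
grad_W1 = X.T @ delta1
grad_b1 = delta1.sum(axis=0)

# Finite-difference check on a single entry of W1
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (forward(W1_plus, b1, W2, b2)[0]
           - forward(W1_minus, b1, W2, b2)[0]) / (2 * eps)

print(grad_W1[0, 0], numeric)
```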


Summary

  • Learning weights and biases from data using gradient descent $\approx$ backpropagation


  • It is not easy to numerically compute gradients in a network in general.
    • The good news: people have already done all the "hard work" of developing numerical solvers (or libraries)
    • There is a wide range of tools, such as TensorFlow


6.4. Historical Notes

Backpropagation and Geoffrey Hinton

Backpropagation (short for "backward propagation of errors") is a core algorithm used to train artificial neural networks (ANNs). It calculates gradients of the loss function with respect to the network’s weights by applying the chain rule of calculus in a systematic, layer-by-layer manner from the output layer back to the input layer. This enables the efficient updating of weights using optimization methods such as gradient descent.

While the mathematical foundations of backpropagation date back to the 1960s and 1970s (notably Werbos, who in 1974 proposed applying it to neural networks), it was largely unrecognized until David Rumelhart, Geoffrey Hinton, and Ronald J. Williams demonstrated its practical potential for training multi-layer networks in their seminal 1986 paper, "Learning Representations by Back-Propagating Errors".


Why Hinton's Contribution Was Pivotal

Revival of Neural Networks:

  • During the 1980s, symbolic AI dominated the field, and neural networks were seen as outdated and impractical. Hinton’s work with backpropagation demonstrated that neural networks could learn complex representations and outperformed traditional AI approaches on several tasks.

Scalability and Deep Networks:

  • Hinton's research showed that backpropagation could handle multi-layer (deep) networks, allowing networks to automatically learn hierarchical representations of features.

The Legacy of Backpropagation:

  • The 1986 breakthrough laid the groundwork for modern deep learning.
  • Backpropagation became the standard training algorithm for neural networks.
  • Hinton continued to push the field forward, contributing to the resurgence of neural networks in the 2000s and co-developing groundbreaking architectures like deep belief networks (DBNs) and capsule networks.

The Broader Impact on Modern AI

Backpropagation remains essential in modern deep learning frameworks (such as TensorFlow and PyTorch), enabling the training of massive neural networks for applications in language models (e.g., GPT), computer vision, and beyond. Hinton's contributions earned him the nickname "the godfather of deep learning" and prestigious accolades, including the Turing Award (2018) and the Nobel Prize in Physics (2024).



7. ANN with MNIST

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('z-ZhKdQpF7I', width="560", height="315", frameborder="0")
Out[ ]:

7.1. What's an MNIST?

From Wikipedia

  • The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments.

  • MNIST (Modified National Institute of Standards and Technology) database

    • Handwritten digit database
    • $28 \times 28$ grayscale image
    • Often flattened into a vector of $28 \times 28 = 784$; the tf.keras loader used below keeps each image as a $28 \times 28$ array





We will use the MNIST dataset to create a multiclass classifier that assigns each image to one of ten classes: 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. In short, we're teaching a computer to recognize handwritten digits.


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline

Let's download and load the dataset.

In [ ]:
mnist = tf.keras.datasets.mnist

(train_x, train_y), (test_x, test_y) = mnist.load_data()

train_x, test_x = train_x/255.0, test_x/255.0
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 0s 0us/step
In [ ]:
print ("The training data set is:\n")
print (train_x.shape)
print (train_y.shape)
The training data set is:

(60000, 28, 28)
(60000,)
In [ ]:
print ("The test data set is:")
print (test_x.shape)
print (test_y.shape)
The test data set is:
(10000, 28, 28)
(10000,)

Take a single sample from it:

In [ ]:
# So now we have a 28x28 matrix, where each element is an intensity level from 0 to 1.

img = train_x[5]
img.shape
Out[ ]:
(28, 28)

Let's visualize this image and its corresponding training label.

In [ ]:
plt.figure(figsize = (4, 4))
plt.imshow(img, 'gray')
plt.xticks([])
plt.yticks([])
plt.show()
In [ ]:
train_y[5]
Out[ ]:
2

7.2. ANN with TensorFlow

  • Feed a gray image to ANN

  • Our network model
    • The hyperparameters of the network have not been optimized.
    • This example is intended solely to demonstrate how to implement the network in Python based on the provided structure.


  • Network training (learning)

    • Forward Pass: Input $X$ passes through the network to generate $\hat Y$.

    • Loss Calculation: The loss function compares $\hat Y$ and $Y$.

    • Backward Pass: Gradients of the loss with respect to the weights are computed.

    • Weight Update: Weights are adjusted based on the gradients.

    • Repeat: The process continues until the loss converges or the training reaches a predefined number of iterations.


$$\omega:= \omega - \alpha \nabla_{\omega} \ell \left( h_{\omega} \left(x^{(i)}\right),y^{(i)}\right)$$

7.2.1. Import Library

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

7.2.2. Load MNIST Data

  • Download MNIST data from tensorflow tutorial example
In [ ]:
mnist = tf.keras.datasets.mnist

(train_x, train_y), (test_x, test_y) = mnist.load_data()
train_x, test_x = train_x/255.0, test_x/255.0

7.2.3. Define an ANN Structure

  • Input size
  • Hidden layer size
  • The number of classes


In [ ]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape = (28, 28)),
    tf.keras.layers.Dense(units = 100, activation = 'relu'),
    tf.keras.layers.Dense(units = 10, activation = 'softmax')
])
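As a quick sanity check on this structure, the number of trainable parameters can be counted by hand: each Dense layer has one weight per input-output pair plus one bias per unit, and the Flatten layer has none. The variable names below are hypothetical.

```python
n_input, n_hidden, n_classes = 28 * 28, 100, 10

hidden_params = n_input * n_hidden + n_hidden       # weights + biases of the hidden layer
output_params = n_hidden * n_classes + n_classes    # weights + biases of the output layer
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)   # 78500 1010 79510
```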

7.2.4. Define Loss and Optimizer


Optimizer

  • This defines how the model is updated based on the data it sees and its loss function.
  • Adam: one of the most popular optimizers

Loss

  • This defines how we measure how accurate the model is during training. As was covered in lecture, during training we want to minimize this function, which will "steer" the model in the right direction.
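  • Classification: Cross-entropy
    • Equivalent to applying logistic regression (the binary case is shown here; with 10 classes, its softmax generalization is used)

$$ -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_{\omega}\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_{\omega}\left(x^{(i)}\right)\right)\right] $$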


Metrics

  • Here we can define metrics used to monitor the training and testing steps. In this example, we'll look at the accuracy, the fraction of the images that are correctly classified.

In [ ]:
model.compile(optimizer = 'adam',
              loss = 'sparse_categorical_crossentropy',
              metrics = ['accuracy'])
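To make the loss and metric concrete, here is a NumPy sketch of what a sparse categorical cross-entropy and an accuracy computation look like for integer class labels. It is an illustration, not the Keras implementation; the function names, the three-class logits, and the clipping constant eps are hypothetical.

```python
import numpy as np


def softmax(logits):
    # Subtract the row-wise maximum for numerical stability
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def sparse_categorical_crossentropy(probs, labels, eps=1e-12):
    # Pick the predicted probability of the true class for each sample
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(np.clip(picked, eps, 1.0)))


logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.1, 3.0]])
labels = np.array([0, 2])                 # integer class labels, not one-hot

probs = softmax(logits)
loss = sparse_categorical_crossentropy(probs, labels)
accuracy = np.mean(np.argmax(probs, axis=1) == labels)

print(loss, accuracy)
```

"Sparse" here refers only to the label format: the labels are class indices such as train_y, rather than one-hot vectors.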

7.2.5. Train Model




In [ ]:
# Train Model

loss = model.fit(train_x, train_y, epochs = 5)
Epoch 1/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.2736 - accuracy: 0.9215
Epoch 2/5
1875/1875 [==============================] - 5s 3ms/step - loss: 0.1223 - accuracy: 0.9642
Epoch 3/5
1875/1875 [==============================] - 5s 3ms/step - loss: 0.0848 - accuracy: 0.9751
Epoch 4/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0647 - accuracy: 0.9803
Epoch 5/5
1875/1875 [==============================] - 5s 3ms/step - loss: 0.0510 - accuracy: 0.9842

7.2.6. Test or Evaluate

In [ ]:
# Evaluate Test Data

test_loss, test_acc = model.evaluate(test_x, test_y)
313/313 [==============================] - 1s 2ms/step - loss: 0.0837 - accuracy: 0.9749
In [ ]:
test_img = test_x[np.random.choice(test_x.shape[0], 1)]

predict = model.predict_on_batch(test_img)
mypred = np.argmax(predict, axis = 1)

plt.figure(figsize = (8,4))

plt.subplot(1,2,1)
plt.imshow(test_img.reshape(28, 28), 'gray')
plt.axis('off')
plt.subplot(1,2,2)
plt.stem(predict[0])
plt.show()

print('Prediction : {}'.format(mypred[0]))
Prediction : 0

What is the highest accuracy you can achieve with this first fully connected model? Since the handwritten digit classification task is pretty straightforward, you may be wondering how we can do better...

$\Rightarrow$ As we saw in lecture, convolutional neural networks (CNNs) are particularly well-suited for a variety of tasks in computer vision, and have achieved near-perfect accuracies on the MNIST dataset. We will build a CNN and ultimately output a probability distribution over the 10 digit classes (0-9) in the next lectures.



7.3. Historical Notes


The MNIST Dataset and Yann LeCun

The MNIST (Modified National Institute of Standards and Technology) dataset is one of the most famous benchmark datasets in machine learning, consisting of 60,000 training images and 10,000 test images of handwritten digits (0 to 9), each in a 28x28 grayscale pixel format. It is widely used for training and evaluating classification models, especially in the field of neural networks.


Yann LeCun’s Role in the Creation of MNIST

Yann LeCun (Turing Award, 2018), a pioneer in the field of deep learning, created the MNIST dataset in collaboration with Corinna Cortes and Christopher J.C. Burges in 1998. LeCun, known for his groundbreaking work in neural networks, particularly convolutional neural networks (CNNs), developed MNIST as an easily accessible and standardized dataset to evaluate machine learning algorithms.

Before MNIST, many machine learning models struggled with inconsistent and poorly standardized datasets for real-world pattern recognition tasks. MNIST solved this by providing a well-curated, balanced, and preprocessed dataset that became the de facto benchmark for evaluating classification models.


Legacy of MNIST

Despite its simplicity, MNIST remains a crucial stepping stone in the history of artificial intelligence. It demonstrated the power of neural networks in pattern recognition and inspired the development of modern deep learning architectures. Yann LeCun’s creation of MNIST not only provided a foundational benchmark for machine learning but also played a pivotal role in popularizing neural networks and shaping the trajectory of modern AI research.



8. More in ANN

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('Jz7zrwoesBw', width="560", height="315", frameborder="0")
Out[ ]:

8.1. Vanishing Gradient Problem

The vanishing gradient problem is a common issue in deep neural networks, where the gradients of the loss function become increasingly small as they are backpropagated through the network's layers. This leads to extremely small weight updates during training, especially for the earlier layers of the network, making learning inefficient or causing it to stall altogether.


How Nonlinear Activation Functions Contribute to the Vanishing Gradient Problem

Sigmoid and Tanh Activation Functions:

  • The sigmoid function squashes its input into a range between 0 and 1:

$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$


  • The tanh function squashes its input into a range between -1 and 1:

$$ \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} $$


  • In both cases, when the input values are very large (positive or negative), the gradients (derivatives) approach zero due to the flatness of the function at the extremes:

$$ \begin{align*} \sigma'(x) &= \sigma(x)(1 - \sigma(x)) \\\\ \tanh'(x) &= 1 - \tanh^2(x) \end{align*} $$


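  • When the sigmoid output is close to 0 or 1, or the tanh output is close to $\pm 1$, these derivatives become extremely small, which leads to small gradients during backpropagation. Even at its maximum (at $x = 0$), the sigmoid derivative is only $0.25$.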
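The saturation can be seen by evaluating both derivative formulas at a few inputs. This is a small NumPy sketch; the sample points are arbitrary.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


x = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])

d_sigmoid = sigmoid(x) * (1 - sigmoid(x))   # sigma'(x) = sigma(x)(1 - sigma(x))
d_tanh = 1 - np.tanh(x) ** 2                # tanh'(x) = 1 - tanh^2(x)

for xi, ds, dt in zip(x, d_sigmoid, d_tanh):
    print(f"x = {xi:6.1f}   sigmoid' = {ds:.2e}   tanh' = {dt:.2e}")
```

Both derivatives peak at $x = 0$ and decay rapidly as $|x|$ grows.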





Effect on Deep Networks:

  • As gradients propagate backward through the layers, they are multiplied by the derivatives of the activation functions at each layer.

  • If the derivatives are small (close to zero), the product of gradients diminishes exponentially as the number of layers increases.

  • This results in gradients that "vanish," making weight updates negligible, especially for early layers, effectively preventing the network from learning meaningful features.

  • Chain rule


$${dz \over du} = {dz \over dy} \cdot {dy \over dx} \cdot {dx \over d\omega} \cdot {d\omega \over du} $$
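A rough illustration of the exponential shrinkage: since the sigmoid derivative never exceeds 0.25, a product of $n$ such factors is at most $0.25^n$. This sketch deliberately ignores the weight matrices, which also enter the product in a real network, so it is only an upper bound on the activation-derivative part.

```python
# Upper bound on the product of sigmoid derivatives across n layers
max_sigmoid_grad = 0.25

bounds = {}
for n_layers in [1, 5, 10, 20]:
    bounds[n_layers] = max_sigmoid_grad ** n_layers
    print(f"{n_layers:2d} layers: product of activation derivatives <= {bounds[n_layers]:.2e}")
```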


8.2. Solutions to the Vanishing Gradient Problem

(1) ReLU (Rectified Linear Unit)

  • The ReLU activation function:

$$\text{ReLU}(x) = \max(0,x)$$


  • The derivative is 1 for positive inputs and 0 for negative inputs, avoiding the problem of very small gradients in most cases.

  • ReLU allows gradients to flow more effectively in deep networks, improving convergence during training.

  • Replacing tanh (or sigmoid) with ReLU was a major improvement for training deep networks.
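For comparison with the saturating activations, here is a short NumPy sketch of ReLU and its derivative. The function names are hypothetical, and the value of the derivative at exactly $x = 0$ (where it is undefined) is set to 0 here by convention.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def relu_grad(x):
    # 1 for positive inputs, 0 for negative inputs; 0 is chosen at x = 0
    return (x > 0).astype(float)


x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))
print(relu_grad(x))
```

For positive inputs the derivative stays at 1 regardless of how large the input is, so repeated multiplication does not shrink the gradient along active units.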








(2) Batch Normalization

Batch Normalization is a technique used to normalize the inputs to each layer in a neural network during training. By maintaining the mean and variance of the inputs at stable levels, Batch Normalization helps prevent issues like the vanishing gradient problem, which can severely affect the training of deep neural networks.

  • Normalizes the input to each layer, keeping activations within a range that prevents gradients from becoming too small or too large.

  • It does so by shifting and scaling the activations using statistics computed on each mini-batch.


How Batch Normalization Works

Batch Normalization normalizes the input for each mini-batch to have a mean of 0 and a standard deviation of 1. This helps keep the activations within a range where gradients can flow effectively, mitigating the vanishing gradient issue.

  1. Normalization: For each mini-batch, compute the mean $ \mu_B $ and variance $ \sigma_B^2 $ of the batch:

    $$ \mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i \quad \text{and} \quad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 $$
    where $ m $ is the number of examples in the batch.

  2. Standardization: Normalize each input $ x_i $:


    $$ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} $$


    where $ \epsilon $ is a small constant added for numerical stability.

  3. Scale and Shift: Introduce learnable parameters $ \gamma $ (scale) and $ \beta $ (shift) to allow the network to recover flexibility:


    $$ y_i = \gamma \hat{x}_i + \beta $$


    This enables the network to learn the appropriate distribution of the normalized outputs.





  • During training, batch normalization shifts and rescales according to the mean and variance estimated on the batch.

  • At test time, it simply shifts and rescales according to the empirical moments (running averages of the mean and variance) estimated during training.
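The three steps and the training/test distinction can be sketched in NumPy as follows. This is a simplified illustration, not the Keras layer: the function names, eps = 1e-5, and the synthetic batch are assumptions, and the bookkeeping of running averages is left out (the inference function simply takes the stored moments as arguments).

```python
import numpy as np


def batchnorm_train(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                       # batch mean
    var = x.var(axis=0)                       # batch variance (1/m normalization)
    x_hat = (x - mu) / np.sqrt(var + eps)     # standardize
    return gamma * x_hat + beta, mu, var      # scale and shift


def batchnorm_inference(x, gamma, beta, stored_mu, stored_var, eps=1e-5):
    # Uses moments stored during training instead of batch statistics
    x_hat = (x - stored_mu) / np.sqrt(stored_var + eps)
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
batch = 5.0 + 3.0 * rng.standard_normal((64, 4))   # 64 examples, 4 features
gamma, beta = np.ones(4), np.zeros(4)

out, mu, var = batchnorm_train(batch, gamma, beta)
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))
```

With $\gamma = 1$ and $\beta = 0$, each feature of the output has mean close to 0 and standard deviation close to 1.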


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

Overfitting in Regression

In [ ]:
N = 10
data_x = np.linspace(-4.5, 4.5, N)
data_y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

data_x = data_x.reshape(-1,1)
data_y = data_y.reshape(-1,1)

plt.figure(figsize = (6, 4))
plt.plot(data_x, data_y, 'o')
plt.grid(alpha = 0.3)
plt.show()
In [ ]:
base_model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_shape = (1,),
                          units = 30, activation = 'sigmoid'),
    tf.keras.layers.Dense(units = 100, activation = 'sigmoid'),
    tf.keras.layers.Dense(units = 100, activation = 'sigmoid'),
    tf.keras.layers.Dense(units = 30, activation = 'sigmoid'),
    tf.keras.layers.Dense(units = 1, activation = None)
])
In [ ]:
base_model.compile(optimizer = tf.keras.optimizers.Adam(0.001),
                   loss = 'mse',
                   metrics = ['mse'])
In [ ]:
# Train Model & Evaluate Test Data

training = base_model.fit(data_x, data_y, epochs = 5000, verbose = 0)

xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
my_pred = base_model.predict(xp)

plt.figure(figsize = (6, 4))
plt.plot(data_x, data_y, 'o')
plt.plot(xp, my_pred, 'r')
plt.grid(alpha = 0.3)
plt.show()
4/4 [==============================] - 0s 3ms/step

Batch Normalization Implementation

This example is not intended to demonstrate the improvement in performance and stability of artificial neural networks achieved through batch normalization, but rather to illustrate how to implement batch normalization using TensorFlow 2.

In [ ]:
bn_model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units = 30, activation = None, input_shape = (1,)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.Dense(units = 100, activation = None),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.Dense(units = 100, activation = None),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.Dense(units = 30, activation = None),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('sigmoid'),
    tf.keras.layers.Dense(units = 1, activation = None)
])
In [ ]:
bn_model.compile(optimizer = tf.keras.optimizers.Adam(0.001),
                 loss = 'mse',
                 metrics = ['mse'])
In [ ]:
training = bn_model.fit(data_x, data_y, epochs = 4000, verbose = 0)

xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
my_pred = bn_model.predict(xp)

plt.figure(figsize = (6, 4))
plt.plot(data_x, data_y, 'o')
plt.plot(xp, my_pred, 'r')
plt.grid(alpha = 0.3)
plt.show()
4/4 [==============================] - 0s 3ms/step

Batch Normalization is primarily known for stabilizing and accelerating the training of deep neural networks by normalizing the inputs to each layer. Interestingly, Batch Normalization also has an impact on overfitting, a phenomenon where the model performs well on the training data but poorly on unseen data.

  • Since BatchNorm normalizes activations, the network does not need excessively large weights to produce meaningful outputs.

  • This reduces the risk of overfitting caused by excessively large parameter values.


8.3. Dropout as Regularization

We have learned that there are numerous techniques available to prevent overfitting, such as regularization, dropout, data augmentation, and early stopping, which are commonly used strategies in deep learning.

Dropout is a regularization technique used in neural networks to prevent overfitting by randomly "dropping" a fraction of neurons during training. This forces the network to learn more robust and distributed representations, rather than relying too heavily on specific neurons.

  • This is one of the most interesting types of regularization techniques.
  • It also produces very good results and is consequently one of the most frequently used regularization techniques in the field of deep learning.
  • At every iteration, it randomly selects some nodes and removes them.
  • It can also be thought of as an ensemble technique in machine learning.




How Dropout Works

Training Phase:

  • During each training iteration, a random set of neurons in the network is temporarily "dropped" with a probability $p$ (known as the dropout rate).

  • This means that the corresponding connections to those neurons do not participate in forward or backward passes for that iteration.

Testing/Inference Phase:

  • During testing or inference, dropout is turned off, and the full network is used. However, the weights are scaled by the keep probability $1 - p$ to account for the neurons dropped during training:

$$\omega_{\text{test}} = (1 - p) \times \omega_{\text{train}}$$


  • This ensures that the outputs remain consistent even when dropout is no longer applied.
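A minimal NumPy sketch of dropout with dropout rate $p$. It uses the "inverted" variant, in which the surviving activations are scaled by $1/(1-p)$ during training so that nothing needs to be rescaled at inference; this keeps the expected activation the same in both phases and is an equivalent alternative to rescaling the weights at test time. The function names and the all-ones activations are hypothetical.

```python
import numpy as np


def dropout_train(h, p, rng):
    # Each unit is kept with probability 1 - p; survivors are scaled by 1 / (1 - p)
    keep = rng.random(h.shape) >= p
    return h * keep / (1.0 - p)


def dropout_inference(h):
    # Dropout is turned off: all units are used, no extra scaling needed here
    return h


rng = np.random.default_rng(0)
h = np.ones((10000, 10))    # hypothetical activations
p = 0.2                     # dropout rate

h_train = dropout_train(h, p, rng)
print((h_train == 0).mean(), h_train.mean())
```

Roughly a fraction $p$ of the entries are zeroed, while the average activation stays close to its original value.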

Why Dropout Prevents Overfitting

  • Breaks Co-Adaptation:

    • By randomly disabling neurons, dropout prevents the network from becoming overly reliant on specific neurons, forcing the remaining neurons to learn more robust features.
  • Reduces Model Complexity:

    • Effectively, dropout creates an ensemble of different "sub-networks" during training, making it harder for the model to overfit to the training data.
  • Adds Noise to Training:

    • The randomness introduced by dropout acts as noise, which can improve the network's generalization performance.

Dropout Implementation

In [ ]:
dropout_model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(input_shape = (1,),
                          units = 30, activation = 'sigmoid'),
    tf.keras.layers.Dropout(rate = 0.2),
    tf.keras.layers.Dense(units = 100, activation = 'sigmoid'),
    tf.keras.layers.Dropout(rate = 0.2),
    tf.keras.layers.Dense(units = 100, activation = 'sigmoid'),
    tf.keras.layers.Dropout(rate = 0.2),
    tf.keras.layers.Dense(units = 30, activation = 'sigmoid'),
    tf.keras.layers.Dropout(rate = 0.2),
    tf.keras.layers.Dense(units = 1, activation = None)
])
In [ ]:
dropout_model.compile(optimizer = tf.keras.optimizers.Adam(0.001),
                      loss = 'mse',
                      metrics = ['mse'])
In [ ]:
training = dropout_model.fit(data_x, data_y, epochs = 200, verbose = 0)

xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
my_pred = dropout_model.predict(xp)

plt.figure(figsize = (6, 4))
plt.plot(data_x, data_y, 'o')
plt.plot(xp, my_pred, 'r')
plt.grid(alpha = 0.3)
plt.show()
4/4 [==============================] - 0s 3ms/step

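  • In this example, a 20% dropout rate is applied after each hidden layer, meaning that on average 20% of the neurons in each hidden layer are randomly disabled at every training step.

  • During inference, all neurons are utilized. The Keras Dropout layer applies the compensating scaling during training (kept activations are multiplied by $1/(1 - 0.2)$), so no additional scaling is needed at inference.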


9. Other Tutorials

In [ ]:
%%html
<iframe src="https://www.youtube.com/embed/aircAruvnKk?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe>
In [ ]:
%%html
<iframe src="https://www.youtube.com/embed/IHZwWFHWa-w?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe>
In [ ]:
%%html
<iframe src="https://www.youtube.com/embed/Ilg3gGewQ5U?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe>
In [ ]:
%%html
<iframe src="https://www.youtube.com/embed/tIeHLnjs5U8?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe>
In [ ]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')