Artificial Neural Networks (ANN)

Table of Contents

1. Recall Perceptron¶

from IPython.display import YouTubeVideo
YouTubeVideo('W5i9OA0bW-A', width="560", height="315", frameborder="0")

Perceptron

Binary Linear Classification in 2D

Binary Linear Classification in High Dimensions

A perceptron models the decision boundary as a linear equation:

$$\omega_0 + \omega_1 x_1 + \omega_2 x_2 + \cdots + \omega_d x_d = 0$$

This means that a perceptron can only create a straight-line (hyperplane) to separate the data. It is instructive to conceptualize a single perceptron as representing a hyperplane in a geometric space.

2. From Perceptron to Multi-Layer Perceptron (MLP)¶

To explain the concept of a multi-layer perceptron (MLP), we start with a simple perceptron and establish its connection to a single hyperplane. A perceptron functions as a linear classifier by learning a hyperplane that separates data into distinct classes. However, when the data is non-linearly separable, a single hyperplane is insufficient.

In such cases, two possible approaches arise:

Using multiple hyperplanes (or multiple dividers): This can be achieved by stacking multiple layers of perceptrons, allowing the model to combine hyperplanes and approximate more complex decision boundaries.
Learning non-linear boundaries (or curved surface): By introducing nonlinear activation functions and multiple layers, the model can learn non-linear transformations that capture complex patterns in the data.

This progression from a single hyperplane to non-linear boundaries forms the foundation of multi-layer perceptrons (MLPs), enabling them to handle non-linear separability.

2.1. Perceptron¶

Neurons compute the weighted sum of their inputs
A neuron is activated or fired when the sum $a$ is positive

$$ \begin{align*} a &= \omega_0 + \omega_1 x_1 + \omega_2 x_2 \\ \\ \hat{y} &= g(a) = \begin{cases} 1 & a > 0\\ 0 & \text{otherwise} \end{cases} \end{align*} $$

A step function (or sign function) is non-differentiable.
- To address this limitation, we later replace the step function with differentiable non-linear activation functions such as the sigmoid, or tanh functions.

XOR Problem

Minsky-Papert Controversy on XOR

not linearly separable
limitation of perceptron

A perceptron is a simple linear classifier that separates data using a single hyperplane. However, there are some classification tasks that a perceptron cannot solve due to its linear nature. A classic example of this limitation is the XOR (exclusive OR) problem.

Idea: Nonlinear Curve Approximated by Multiple Lines

2.2. Multiple Perceptron¶

A single perceptron is often insufficient for complex tasks, as it can only learn a single hyperplane to separate the data. When the data is not linearly separable, a single hyperplane cannot capture the intricate boundaries or patterns in the feature space, necessitating deeper or more complex architectures to model the underlying relationships effectively.

For example, if two perceptrons are stacked, it represents two hyperplanes.

Switch to differentiable activation function (for example, the sigmoid function)

In a compact representation
- Combining the summation and the sigmoid function forms another neuron.

2.3. Hidden Layers as Kernel Learning¶

Each neuron applies a nonlinear activation function to its inputs, effectively performing a nonlinear transformation of the data. As a result, the hidden layers can be interpreted as a nonlinear mapping between the input and output spaces, similar to the role of kernel functions in classical machine learning methods.

The Second Way of Looking at Multiple Perceptrons

A nonlinear activation function enables a perceptron to model nonlinear relationships between input and output.

Suppose that data is not linearly separable

Nonlinear mapping + neuron

User-defined Kernel
For example,

$$\phi: (x_1, x_2) \rightarrow (x_1, x_2, x_1 x_2)$$

Nonlinear mapping can be represented by another layer (or neurons)

Learnable Kernel
Nonlinear activation functions

In machine learning, defining the appropriate kernels for non-linear mapping is crucial for models like logistic regression and kernel-based methods. However, in deep learning, the kernel function is effectively represented by the hidden layers of the neural network, which are learned directly from the data rather than being predefined.

In traditional machine learning: Kernels (such as polynomial or RBF kernels) are manually chosen to project the input into a higher-dimensional space for better separation.
In deep learning: The hidden layers act as adaptive feature extractors, learning the non-linear transformations (analogous to kernels) automatically during training.

This flexibility allows deep learning models to discover the best representations and transformations for the data without requiring explicit kernel design.

Multi-Layer Perceptron as a Sequence of Feature Extraction

When multiple hidden layers are stacked, the output of one hidden layer becomes the input to the next. This structure can be viewed as a sequence of feature extraction stages, where each layer transforms its input into a higher-level, more abstract representation.
This process of hierarchical feature extraction enables the Multi-Layer Perceptron (MLP) to capture and learn increasingly complex patterns within the data. Lower layers typically extract basic features, while deeper layers combine these to form more sophisticated and semantically meaningful representations.

A multi-layer perceptron is not merely a collection of neurons but a structured sequence of feature extraction steps. Each layer extracts increasingly abstract and relevant features, allowing the model to handle complex, non-linear tasks that a single-layer perceptron cannot solve.

Intuition Behind Feature Extraction in MLP

The first layer extracts basic features from the input (e.g., edges or simple patterns in images).
The hidden layers combine these basic features to create more complex representations (e.g., shapes, contours).
The final layer maps the abstract features to the output (e.g., predicting a class label or regression value).

By stacking layers, the MLP gradually transforms simple input features into complex features that capture relationships in non-linear data.

Each layer can be thought of as performing a mapping from one feature space to another, progressively refining the representation to make the final decision easier.

This hierarchy of transformations allows the network to move from raw input data to task-specific representations.

2.4. Summary: Two Ways of Looking at Artificial Neural Networks¶

(1) Can represent multiple lines

(2) Can represent nonlinear relationship between input and outputs due to nonlinear activation function

3. Logistic Regression in a Form of Neural Network¶

Here, we will demonstrate a multi-layer perceptron (MLP) using a logistic regression example. In this demonstration, we will observe the hyperplanes and the features learned in the hidden layer at the end of the training process.

When we extend logistic regression to a multi-layer perceptron, the hidden layer transforms the input features through non-linear activations, creating intermediate feature spaces.
At the end of learning, we can visualize how the hidden layer features and multiple hyperplanes evolve to form more complex decision boundaries.

This approach will illustrate how the MLP builds upon logistic regression by stacking multiple transformations to handle non-linear separability.

Let's start with logistic regression for a linearly separable case.

$$ \begin{align*} y^{(i)} &\in \{0, 1\}\\\\ y &= \sigma (\omega_0 + \omega_1 x_1 + \omega_2 x_2) \end{align*} $$

After training, $\omega_0 + \omega_1 x_1 + \omega_2 x_2 = 0$ will represent a linear classification boundary.

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

%matplotlib inline

#training data gerneration
m = 1000
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4

g = 0.8*x1 + x2 - 3

C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

train_X = np.vstack([X1, X0])
train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])

train_X = np.asmatrix(train_X)
train_y = np.asmatrix(train_y)

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1)
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.show()

LogisticRegression = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(2,)),
    tf.keras.layers.Dense(units = 1, activation = 'sigmoid')
])

LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1),
                           loss = 'binary_crossentropy')

loss = LogisticRegression.fit(train_X, train_y, epochs = 10, verbose = 0)

w = LogisticRegression.layers[0].get_weights()[0]
b = LogisticRegression.layers[0].get_weights()[1]

print(w)
print("\n")
print(b)

[[1.8908017]
 [2.36824  ]]


[-6.979349]

x1p = np.arange(0, 8, 0.01).reshape(-1, 1)
x2p = - w[0,0]/w[1,0]*x1p - b[0]/w[1,0]

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'g', linewidth = 3, label = '')
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.legend(loc = 1)
plt.show()

3.1. Looking at Parameters in Nonlinear Classification¶

Now, let's move on to a non-linearly separable case.

In this scenario, the dataset cannot be separated by a single linear hyperplane. Therefore, we use a multi-layer perceptron (MLP), which approximates complex non-linear decision boundaries by learning multiple hyperplanes and combining them using non-linear activation functions.

Let's see how the MLP performs in this case!

Before that, let's first mention the notation changes commonly used in neural network conventions. In neural network notation, it is common to represent: $\omega_0 \rightarrow b$

Weights as $\omega$
Bias as $b$ (previously denoted as $\omega_0$)

$$y = \sigma(\omega_0 + \omega_1 x_1 + \omega_2 x_2) \quad \longrightarrow \quad y = \sigma(b + \omega_1 x_1 + \omega_2 x_2)$$

# training data gerneration

m = 1000
x1 = 10*np.random.rand(m, 1) - 5
x2 = 8*np.random.rand(m, 1) - 4

g = - 0.5*(x1-1)**2 + 2*x2 + 5

C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

train_X = np.vstack([X1, X0])
train_X = np.asmatrix(train_X)

train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.axis('equal')
plt.ylim([-4, 4])
plt.show()

As illustrated in the figure, the data is non-linearly distributed. To approximate the non-linear decision boundary using two linear boundaries, a hidden layer with two neurons (plus a bias neuron) is intentionally added.

LogisticRegression = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(2,)),
    tf.keras.layers.Dense(units = 2, activation = 'sigmoid'),
    tf.keras.layers.Dense(units = 1, activation = 'sigmoid')
])

LogisticRegression.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.1),
                           loss = 'binary_crossentropy')

loss = LogisticRegression.fit(train_X, train_y, epochs = 10, verbose = 0)

w1 = LogisticRegression.layers[0].get_weights()[0]
b1 = LogisticRegression.layers[0].get_weights()[1]

w2 = LogisticRegression.layers[1].get_weights()[0]
b2 = LogisticRegression.layers[1].get_weights()[1]

H = train_X*w1 + b1
H = 1/(1 + np.exp(-H))

plt.figure(figsize = (6, 4))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.ylim([0, 1])
plt.show()

Here the features $z_1$ and $z_2$ learned in the hidden layer have two notable characteristics:

Bounded Output Range: Both features are restricted to the interval $(0,1)$ due to the use of the sigmoid activation function. This non-linear transformation ensures that the outputs remain within a normalized range, regardless of the input values.
Feature Redistribution: The hidden layer redistributes the input data into a new feature space where the transformed data becomes approximately linearly separable. This transformation allows the multi-layer perceptron to form a linear decision boundary in the higher-dimensional space that corresponds to a non-linear boundary in the original input space.

These learned features (or feature extractions) demonstrate how the hidden layer facilitates the approximation of complex patterns by creating new, informative representations of the input data.

x1p = np.arange(0, 1, 0.01).reshape(-1, 1)
x2p = - w2[0,0]/w2[1,0]*x1p - b2[0]/w2[1,0]

plt.figure(figsize = (6, 4))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.ylim([0, 1])
plt.show()

The result aligns with the linear boundary in the $z$-space, indicating that the transformed features $z_1$ and $z_2$ successfully reshape the input data into a feature space where a linear decision boundary can separate the classes. This transformation validates the role of the hidden layer in projecting non-linearly distributed data into a higher-dimensional space, making it linearly separable.

x1p = np.arange(-5, 5, 0.01).reshape(-1, 1)
x2p = - w1[0,0]/w1[1,0]*x1p - b1[0]/w1[1,0]
x3p = - w1[0,1]/w1[1,1]*x1p - b1[1]/w1[1,1]

plt.figure(figsize = (6, 4))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.plot(x1p, x3p, 'g', linewidth = 3, label = '')
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.ylim([-4, 4])
plt.show()

After training, two linear boundaries in the $x$-space (input space) are learned, corresponding to the linear boundaries formed by the hidden layer neurons. These boundaries result in multiple lines that, when combined, approximate the non-linear classification boundary. This demonstrates how the multi-layer perceptron (MLP) constructs a complex decision boundary in the input space by learning and integrating multiple hyperplanes from the hidden layer.

4. Regression in a Form of Neural Network¶

A multi-layer perceptron (MLP) can also be applied to nonlinear regression tasks. Unlike linear regression, which fits a straight line, MLPs can model complex, nonlinear relationships between input features and the output due to their use of multiple layers and nonlinear activation functions.

In this example, the Rectified Linear Unit (ReLU) will be used as the non-linear activation function. Therefore, let’s first take a closer look at ReLU before proceeding to the example.

Rectified linear unit (ReLU activation function)

$$h(x) = \max(0, x) = \begin{cases} 0, &\text{if}\;\; x \leq 0 \\ 1, &\text{if} \;\; x > 0 \end{cases}$$

Key Characteristics of ReLU:

Piecewise Linear: ReLU outputs zero for negative values of $x$ and passes the value of $x$ unchanged for positive values.
Sparsity: ReLU introduces sparsity by setting some neuron activations to zero, which can make the network more efficient.
Avoids Vanishing Gradients: Unlike sigmoid or tanh, ReLU does not squash values to a narrow range, thus avoiding the issue of vanishing gradients in deep networks.
Non-saturating: The gradient remains constant for positive values, making optimization faster.

$$h(x)$$

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

def relu(x):
    return np.maximum(0, x)

xp = np.linspace(-1.5, 1.5, 100)
yp = relu(xp)

plt.figure(figsize = (6, 4))
plt.plot(xp, yp, '--', color = 'red', lw = 3, alpha = 0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.xlim([-2, 2])
plt.ylim([-4/3, 4/3])
plt.grid(alpha = 0.3)
plt.show()

$$h(x-0.5)$$

xp = np.linspace(-1.5, 1.5, 100)
yp = relu(xp - 0.5)

plt.figure(figsize = (6, 4))
plt.plot(xp, yp, '--', color = 'red', lw = 3, alpha = 0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.xlim([-2, 2])
plt.ylim([-4/3, 4/3])
plt.grid(alpha = 0.3)
plt.show()

$$h(-2x)$$

xp = np.linspace(-1.5, 1.5, 100)
yp = relu(-2*xp)

plt.figure(figsize = (6, 4))
plt.plot(xp, yp, '--', color = 'red', lw = 3, alpha = 0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.xlim([-2, 2])
plt.ylim([-4/3, 4/3])
plt.grid(alpha = 0.3)
plt.show()

$$- h(-2x-1)$$

xp = np.linspace(-1.5, 1.5, 100)
yp = -relu(-2*xp - 1)

plt.figure(figsize = (6, 4))
plt.plot(xp, yp, '--', color = 'red', lw = 3,  alpha = 0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.xlim([-2, 2])
plt.ylim([-4/3, 4/3])
plt.grid(alpha = 0.3)
plt.show()

The expression $\omega_2 h(\omega_1 x + b_1) + b_2 $, where $h(\cdot)$ represents the ReLU activation function, can describe a variety of transformations of the ReLU function, including horizontal and vertical shifts, as well as reflections across the $x$-axis or $y$-axis, depending on the values and signs of $\omega_1, \omega_2, b_1$, and $b_2$.

Let us consider the task of modeling a multi-layer perceptron (MLP) to approximate the given non-linear functions.

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

np.random.seed(0)
tf.random.set_seed(0)

def function(x):
    return x**3 + 0.1*x**2 - x + 0.1

x = np.linspace(-1.5, 1.5, 1000)
y = function(x)

plt.figure(figsize = (6, 4))
plt.plot(x, y, '--', color = 'red', lw = 3, alpha = 0.5)
plt.xlim([-2, 2])
plt.ylim([-4/3, 4/3])
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()

train_x = np.random.uniform(-1.5, 1.5, 1000)
train_y = function(train_x)

Nonlinear Regression as a Linear Combination of ReLU Functions

In an MLP with ReLU activation, the output of the model can be interpreted as a linear combination of ReLU-transformed features. Each hidden layer neuron applies a ReLU transformation to a weighted sum of the input features, and the final output is a linear combination of these transformed features.

Key Insight:

Nonlinear regression with MLPs can be thought of as creating a piecewise linear approximation of the true function, where the linear segments are defined by the ReLU activations in the hidden layer. The network learns how to place these segments and their slopes to best fit the target function.

For a neural network with one hidden layer, the predicted output $ \hat{y} $ is:

$$ \hat{y} = \sum_{j=1}^{m} \omega_j^{(2)} h\left( \sum_{i=1}^{n} \omega_{ij}^{(1)} x_i + b_j^{(1)} \right) + b^{(2)} $$

$\quad$where:

$ \omega_{ij}^{(1)} $ and $ b_j^{(1)} $ are the weights and biases for the hidden layer.
$ \omega_j^{(2)} $ and $ b^{(2)} $ are the weights and bias for the output layer.

In the following examples, $b^{(2)}$, the bias term for the output layer, is omitted for simplicity. In other words, the vertical shift transformation is not applied.

Single Neuron with ReLU

$$\hat{y} = \omega^{(2)} \left( h \left( \omega^{(1)} x + b^{(1)} \right) \right)$$

model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (1,)),
    tf.keras.layers.Dense(units = 1, activation = 'relu'),
    tf.keras.layers.Dense(units = 1, use_bias = False)
])

model.compile(optimizer = 'adam',
              loss = 'mse')

train_x = train_x.reshape(-1, 1)
train_y = train_y.reshape(-1, 1)

model.fit(train_x, train_y, epochs = 500, verbose = 0)

<keras.src.callbacks.history.History at 0x7824501afdd0>

weights = model.layers[0].get_weights()[0]
biases = model.layers[0].get_weights()[1]
weights2 = model.layers[1].get_weights()[0]

print("Coefficients (Weights):", weights)
print("Intercepts (Biases):", biases)
print("Coefficients (Weights):", weights2)

Coefficients (Weights): [[2.0966315]]
Intercepts (Biases): [-2.079162]
Coefficients (Weights): [[1.9092116]]

real_x = np.linspace(-1.5, 1.5, 100)
real_y = function(real_x)

def relu(x):
    return np.maximum(0, x)

x1p = np.arange(-2, 2, 0.01).reshape(-1, 1)
x2p = weights2*(relu(weights*x1p + biases))

plt.figure(figsize = (6, 4))
plt.xlim([-2, 2])
plt.ylim([-4/3, 4/3])
plt.plot(real_x, real_y, '--', color = 'red', alpha = 0.5)
plt.plot(x1p, x2p, 'c', linewidth = 3)
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()

Two Neurons with ReLU

$$\hat{y} = \omega_1^{(2)} \left( h \left( \omega_1^{(1)} x + b_1^{(1)} \right) \right) + \omega_2^{(2)} \left( h \left( \omega_2^{(1)} x + b_2^{(1)} \right) \right)$$

model_2 = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (1,)),
    tf.keras.layers.Dense(units = 2, activation = 'relu'),
    tf.keras.layers.Dense(units = 1, use_bias = False)
])

model_2.compile(optimizer = 'adam',
                loss = 'mse')

train_x = train_x.reshape(-1, 1)
train_y = train_y.reshape(-1, 1)

model_2.fit(train_x, train_y, epochs = 3000, verbose = 0)

<keras.src.callbacks.history.History at 0x7823bca8fdd0>

weights = model_2.layers[0].get_weights()[0]
biases = model_2.layers[0].get_weights()[1]
weights2 = model_2.layers[1].get_weights()[0]

print("Coefficients (Weights):", weights)
print("Intercepts (Biases):", biases)
print("Coefficients (Weights):", weights2)

Coefficients (Weights): [[-1.1161743  1.8106993]]
Intercepts (Biases): [-1.2588075 -1.7955695]
Coefficients (Weights): [[-3.508285]
 [ 2.210944]]

x1p = np.arange(-2, 2, 0.01).reshape(-1, 1)
x2p = weights2[0]*relu(weights[0][0]*x1p + biases[0])
x3p = weights2[1]*relu(weights[0][1]*x1p + biases[1])

plt.figure(figsize = (6, 4))
plt.xlim([-2, 2])
plt.ylim([-4/3, 4/3])
plt.plot(real_x, real_y, '--', color = 'red', alpha = 0.5)
plt.plot(x1p, x2p, 'b', linewidth = 3, alpha = 0.5)
plt.plot(x1p, x3p, 'k', linewidth = 3, alpha = 0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()

x1p = np.arange(-2, 2, 0.01).reshape(-1, 1)
x2p = weights2[0]*relu(weights[0][0]*x1p + biases[0]) + weights2[1]*relu(weights[0][1]*x1p + biases[1])

plt.figure(figsize = (6, 4))
plt.xlim([-2, 2])
plt.ylim([-4/3, 4/3])
plt.plot(real_x, real_y, '--', color = 'red', alpha = 0.5)
plt.plot(x1p, x2p, 'c', linewidth = 3)
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()

Four Neurons with ReLU

$$\hat{y} = \omega_1^{(2)} \left( h \left( \omega_1^{(1)} x + b_1^{(1)} \right) \right) + \omega_2^{(2)} \left( h \left( \omega_2^{(1)} x + b_2^{(1)} \right) \right) + \omega_3^{(2)} \left( h \left( \omega_3^{(1)} x + b_3^{(1)} \right)\right) + \omega_4^{(2)} \left( h \left( \omega_4^{(1)} x + b_4^{(1)} \right) \right) $$

model_4 = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (1,)),
    tf.keras.layers.Dense(units = 4, activation = 'relu'),
    tf.keras.layers.Dense(units = 1, use_bias = False)
])

model_4.compile(optimizer = 'adam',
              loss = 'mse')

model_4.fit(train_x, train_y, epochs = 4000, verbose = 0)

<keras.src.callbacks.history.History at 0x7823bc7df6d0>

weights = model_4.layers[0].get_weights()[0]
biases = model_4.layers[0].get_weights()[1]
weights2 = model_4.layers[1].get_weights()[0]

x1p = np.arange(-2, 2, 0.01).reshape(-1, 1)
x2p = weights2[0]*relu(weights[0][0]*x1p + biases[0])
x3p = weights2[1]*relu(weights[0][1]*x1p + biases[1])
x4p = weights2[2]*relu(weights[0][2]*x1p + biases[2])
x5p = weights2[3]*relu(weights[0][3]*x1p + biases[3])

plt.figure(figsize = (6, 4))
plt.xlim([-2, 2])
plt.ylim([-4/3, 4/3])
plt.plot(real_x, real_y, '--', color = 'red', alpha = 0.5)
plt.plot(x1p, x2p, 'b', linewidth = 3, alpha = 0.5)
plt.plot(x1p, x3p, 'k', linewidth = 3, alpha = 0.5)
plt.plot(x1p, x4p, 'g', linewidth = 3, alpha = 0.5)
plt.plot(x1p, x5p, 'y', linewidth = 3, alpha = 0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()

x1p = np.arange(-2, 2, 0.01).reshape(-1, 1)
x2p = weights2[0]*relu(weights[0][0]*x1p + biases[0]) + weights2[1]*relu(weights[0][1]*x1p + biases[1]) + weights2[2]*relu(weights[0][2]*x1p + biases[2]) + weights2[3]*relu(weights[0][3]*x1p + biases[3])

plt.figure(figsize = (6, 4))
plt.xlim([-2, 2])
plt.ylim([-4/3, 4/3])
plt.plot(real_x, real_y, '--', color = 'red', alpha = 0.5)
plt.plot(x1p, x2p, 'c', linewidth = 3)
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()

20 Neurons with ReLU

k = 20

model_20 = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (1,)),
    tf.keras.layers.Dense(units = k, activation = 'relu'),
    tf.keras.layers.Dense(units = 1, use_bias = False)
])

model_20.compile(optimizer = 'adam',
              loss = 'mse')

model_20.fit(train_x, train_y, epochs = 1000, verbose = 0)

<keras.src.callbacks.history.History at 0x7823bc134990>

weights = model_20.layers[0].get_weights()[0]
biases = model_20.layers[0].get_weights()[1]
weights2 = model_20.layers[1].get_weights()[0]

x1p = np.arange(-2, 2, 0.01).reshape(-1, 1)
x2p = 0

for i in range(k):
    x2p += weights2[i]*relu(weights[0][i]*x1p + biases[i])

plt.figure(figsize = (6, 4))
plt.xlim([-2, 2])
plt.ylim([-4/3, 4/3])
plt.plot(real_x, real_y, '--', color = 'red', alpha = 0.5)
plt.plot(x1p, x2p, 'c', linewidth = 3)
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha = 0.3)
plt.show()

Key Points:

Each curve represents the output of a single ReLU neuron in response to the input feature $x$.
The activation values are zero for certain ranges of $x$, indicating that the neuron is "inactive" in those regions.
The piecewise linear segments created by the ReLU activations contribute to forming the overall non-linear regression curve.

This illustrates how the hidden layer's neurons, each applying ReLU activation, form different regions of the input space and collectively approximate the non-linear function.

5. Artificial Neural Networks¶

So far, we have conducted an in-depth examination of how an artificial neural network (ANN) operates at a micro level, focusing on the underlying mechanisms of individual neurons, layers, and their interactions. Having developed these foundational insights and intuitions, we are now in a position to broaden our perspective and analyze the ANN as a cohesive system, considering its overall architecture, functionality, and the role each component plays in contributing to the network's collective decision-making process.

Complex/Nonlinear Universal Function Approximator

ANNs are powerful universal function approximators that can model both linear and nonlinear relationships.
By stacking layers of neurons and using non-linear activation functions, ANNs can represent highly complex mappings, making them suitable for a wide range of tasks, such as image recognition, speech processing, and time-series predictions.

ANN Architecture

ANNs are typically organized as feedforward networks with layers that are fully connected. In these networks:
- Each neuron in a layer receives inputs from all neurons in the previous layer.
- The network propagates information forward, from input to output, without loops.
Each layer performs a linear transformation of the input.
Linear connections alone cannot capture complex relationships; this is where non-linear activation functions become essential.
Each neuron applies a non-linear activation function to the weighted sum of its inputs
These nonlinear neurons allow ANNs to learn non-linear decision boundaries, making them capable of solving complex, non-linearly separable problems.

Hidden Layers and Autonomous Feature Learning

Hidden layers are the intermediate layers between the input and output layers in a neural network.
Each hidden layer learns intermediate representations or features from the input data.
- Shallow networks have fewer hidden layers and may struggle with highly complex patterns.
- Deep networks with many hidden layers (Deep Neural Networks) can model highly intricate relationships but require large datasets and longer training times.
The neurons in the hidden layers create intermediate transformations that allow the network to construct hierarchical feature representations, leading to improved performance for complex tasks.
Unlike traditional machine learning models that rely on manual feature extraction, ANNs can automatically learn features from raw data.
- Each layer in an ANN learns increasingly abstract features.
- This autonomous feature learning makes ANNs robust across diverse domains and applications, eliminating the need for domain-specific feature engineering.

6. ANN Learning¶

Now that we understand how an Artificial Neural Network (ANN) works, the next step is to determine the unknown parameters (the weights and biases) from the data. Once the structure of the ANN is designed, these parameters must be learned through training. In the following, we will discuss how this learning process takes place.

To begin with the conclusion, the learning process is performed using the backpropagation algorithm. Before diving into the details of backpropagation, it is essential to first review "recursive algorithms" and "dynamic programming", as they form the foundational concepts necessary for understanding backpropagation.

from IPython.display import YouTubeVideo
YouTubeVideo('mGuXwSbantc', width="560", height="315", frameborder="0")

6.1. Recursive Algorithm¶

A fundamental concept in computer science:
- Recursive algorithms play a pivotal role in solving problems by breaking them down into smaller, manageable subproblems.
Subproblem-based approach:
- The solution to the main problem depends on solving smaller instances of the same problem (subproblems).
Self-referential function calls:
- A recursive function calls itself with a modified input until reaching a base case.
- This mechanism does not directly occur in the physical world, making recursion a unique and abstract computational concept.

%%html
<iframe src="https://www.youtube.com/embed/t4MSwiqfLaY?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe>

Factorial example

$$n ! = n \cdot (n-1) \cdots 2 \cdot 1$$

n = 5

m = 1
for i in range(n):
    m = m*(i+1)

print(m)

120

def fac(n):
    if n == 1:
        return 1
    else:
        return n*fac(n-1)

# recursive

fac(5)

120

6.2. Dynamic Programming¶

Dynamic programming (DP) is a powerful algorithmic technique used to solve optimization problems by breaking them down into "overlapping" subproblems. It is particularly effective for problems with overlapping subproblems and optimal substructure, where solutions to smaller subproblems can be reused to build the solution to the overall problem.

Optimal Substructure
- A problem exhibits optimal substructure if the optimal solution to the problem can be constructed from the optimal solutions of its subproblems.
- Example: In the shortest path problem, the shortest path from point A to point C passing through point B consists of the shortest path from A to B combined with the shortest path from B to C.
Overlapping Subproblems
- A problem has overlapping subproblems if the same subproblems are solved multiple times during the computation.
- Instead of solving the same subproblem repeatedly, dynamic programming stores the results of solved subproblems in a lookup table (memoization) for future reuse.

Let’s discuss dynamic programming using the Fibonacci sequence as an example:

$$ \begin{align*} F_1 &= F_2 = 1 \\ F_n &= F_{n-1} + F_{n-2} \end{align*} $$

# naive Fibonacci

def fib(n):
    if n <= 2:
        return 1
    else:
        return fib(n-1) + fib(n-2)

fib(10)

55

The naive recursive method performs redundant calculations, repeatedly solving subproblems even when their solutions have already been computed. In contrast, dynamic programming allocates memory to store previously computed results. When a subproblem is encountered, DP first checks whether the result has already been calculated. If so, it retrieves the value from memory, avoiding redundant computations and significantly reducing the overall computational time. This approach exemplifies the trade-off where computational time is saved at the expense of additional memory usage - an illustration of the 'space-for-time' principle.

# Memorized DP Fibonacci

def mfib(n):
    global memo

    if memo[n-1] != 0:
        return memo[n-1]
    elif n <= 2:
        memo[n-1] = 1
        return memo[n-1]
    else:
        memo[n-1] = mfib(n-1) + mfib(n-2)
        return memo[n-1]

import numpy as np

n = 10
memo = np.zeros(n)
mfib(n)

np.float64(55.0)

Dynamic Programming (DP) is much faster than a naive recursive algorithm due to its use of memoization to avoid redundant computations.

n = 30
%timeit fib(30)

98.5 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

memo = np.zeros(n)
%timeit mfib(30)

288 ns ± 1.19 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Dynamic Programming (DP) $\approx$ Recursion + Memorization

Dynamic Programming can be thought of as an optimization technique that enhances recursion by storing the results of previously computed subproblems (memorization). This avoids redundant calculations and improves efficiency.

Recursion: Breaks a problem down into smaller subproblems and solves them recursively.
Memorization: Stores the solutions to subproblems in a cache (e.g., a dictionary or array) to reuse them instead of recalculating.

This combination allows DP to solve problems with overlapping subproblems and optimal substructure more efficiently.

6.3. Training Neural Networks¶

Now, let us return to our original problem: learning the weights and biases of a multi-layer perceptron (MLP).

Learning or estimating weights and biases of multi-layer perceptron from training data
1. Forward Pass: Input $X$ passes through the network to generate $\hat Y$.
2. Loss Calculation: The loss function compares $\hat Y$ and $Y$.
3. Backward Pass: The optimizer computes gradients of the loss concerning the weights.
4. Weight Update: Weights are adjusted based on the gradients.
5. Repeat: The process continues until the loss converges or the training reaches a predefined number of iterations.

Loss Function

Measures error between target values and predictions

$$ \min_{\omega} \sum_{i=1}^{m}\ell\left( h_{\omega}\left(x^{(i)}\right),y^{(i)}\right)$$

Example
- Squared loss (for regression):
  
  $$ \frac{1}{m} \sum_{i=1}^{m} \left(h_{\omega}\left(x^{(i)}\right) - y^{(i)}\right)^2 $$
- Cross-entropy (for classification):
  
  $$ -\frac{1}{m}\sum_{i=1}^{m}y^{(i)}\log\left(h_{\omega}\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_{\omega}\left(x^{(i)}\right)\right)$$

Learning

Learning weights and biases from data using gradient descent

$$\omega \leftarrow \omega - \alpha \nabla_{\omega} \ell \left( h_{\omega}\left(x^{(i)}\right), y^{(i)} \right)$$

$\frac{\partial \ell}{\partial \omega}$: too many computations are required for all $\omega$
Structural constraints of NN:
- Composition of functions
- Chain rule
- Dynamic programming

Backpropagation

Backpropagation is a fundamental algorithm used for training neural networks by adjusting their weights to minimize the loss function. It operates by propagating errors backward through the network, from the output layer to the input layer, using the chain rule of differentiation.

Forward propagation
- the initial information propagates up to the hidden units at each layer and finally produces output
Backpropagation
- allows the information from the cost to flow backwards through the network in order to compute the gradients
Chain rule
- Computing the derivative of the composition of functions

$$ \begin{align*} f(g(x))' &= f'(g(x))g'(x)\\\\ {dz \over dx} &= {dz \over dy} \cdot {dy \over dx}\\ {dz \over dw} &= \left({dz \over dy} \cdot {dy \over dx}\right) \cdot {dx \over dw}\\ {dz \over du} &= \left({dz \over dy} \cdot {dy \over dx} \cdot {dx \over dw}\right) \cdot {dw \over du} \end{align*} $$

Backpropagation
- Update weights recursively with memory

Learning the weights and biases in an artificial neural network (ANN) can be viewed as an optimization process using gradient descent or its variants. However, a significant challenge arises due to the large number of parameters (or unknowns) that need to be estimated, as well as the extensive computations of gradients required during each iteration.

The uniqueness of this problem lies in the sequentially stacked structure of ANN layers, which can be interpreted as a composition of functions. As a result, the computation of gradients involves the application of the chain rule, leading to repeated calculations for intermediate layers.

By computing these gradients in a backward manner, previously computed results can be reused in subsequent steps. This is the fundamental concept behind backpropagation, making it analogous to dynamic programming. The term 'back' in 'backpropagation' highlights this backward computation process, where intermediate results are stored and revisited to avoid redundant computations.

Summary

Learning weights and biases from data using gradient descent $\approx$ backpropagation

It is not easy to numerically compute gradients in network in general.
- The good news: people have already done all the "hardwork" of developing numerical solvers (or libraries)
- There are a wide range of tools: TensorFlow

6.4. Historical Notes¶

Backpropagation and Geoffrey Hinton

Backpropagation (short for "backward propagation of errors") is a core algorithm used to train artificial neural networks (ANNs). It calculates gradients of the loss function with respect to the network’s weights by applying the chain rule of calculus in a systematic, layer-by-layer manner from the output layer back to the input layer. This enables the efficient updating of weights using optimization methods such as gradient descent.

While the mathematical foundations of backpropagation date back to the 1960s (notably by Werbos in 1974, who applied it to neural networks), it was largely unrecognized until Geoffrey Hinton, along with David Rumelhart and Ronald J. Williams, demonstrated its practical potential for deep learning in their seminal 1986 paper, "Learning Representations by Back-Propagating Errors".

Why Hinton's Contribution Was Pivotal

Revival of Neural Networks:

During the 1980s, symbolic AI dominated the field, and neural networks were seen as outdated and impractical. Hinton’s work with backpropagation demonstrated that neural networks could learn complex representations and outperformed traditional AI approaches on several tasks.

Scalability and Deep Networks:

Hinton's research showed that backpropagation could handle multi-layer (deep) networks, allowing networks to automatically learn hierarchical representations of features.

The Legacy of Backpropagation:

The 1986 breakthrough laid the groundwork for modern deep learning.
Backpropagation became the standard training algorithm for neural networks.
Hinton continued to push the field forward, contributing to the resurgence of neural networks in the 2000s and co-developing groundbreaking architectures like deep belief networks (DBNs) and capsule networks.

The Broader Impact on Modern AI

Backpropagation remains essential in modern deep learning frameworks (such as TensorFlow and PyTorch), enabling the training of massive neural networks for applications in language models (e.g., GPT), computer vision, and beyond. Hinton's contributions earned him the nickname "the godfather of deep learning" and prestigious accolades, including the Turing Award (2018) and the Nobel Prize in Physics (2024).

7. ANN with MNIST¶

from IPython.display import YouTubeVideo
YouTubeVideo('z-ZhKdQpF7I', width="560", height="315", frameborder="0")

7.1. What's an MNIST?¶

From Wikipedia

The MNIST database (Mixed National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, NIST's complete dataset was too hard.
MNIST (Mixed National Institute of Standards and Technology database) database
- Handwritten digit database
- $28 \times 28$ gray scaled image
- (Flattened matrix into a vector of $28 \times 28 = 784$) $\rightarrow$ not for TensorFlow 2

More here

http://yann.lecun.com/exdb/mnist/

We will use the MNIST dataset to create a multinomial classifier that can detect whether the image belongs to one of the classes: 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. In short, we're teaching a computer to recognize handwritten digits.

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline

Let's download and load the dataset.

mnist = tf.keras.datasets.mnist

(train_x, train_y), (test_x, test_y) = mnist.load_data()

train_x, test_x = train_x/255.0, test_x/255.0

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step

print ("The training data set is:\n")
print (train_x.shape)
print (train_y.shape)

The training data set is:

(60000, 28, 28)
(60000,)

print ("The test data set is:")
print (test_x.shape)
print (test_y.shape)

The test data set is:
(10000, 28, 28)
(10000,)

Display a few random samples from it:

# So now we have a 28x28 matrix, where each element is an intensity level from 0 to 1.

img = train_x[5]
img.shape

(28, 28)

Let's visualize what some of these images and their corresponding training labels look like.

plt.figure(figsize = (4, 4))
plt.imshow(img, 'gray')
plt.xticks([])
plt.yticks([])
plt.show()

train_y[5]

np.uint8(2)

7.2. ANN with TensorFlow¶

Feed a gray image to ANN

Our network model
- The hyperparameters of the network have not been optimized.
- This example is intended solely to demonstrate how to implement the network in Python based on the provided structure.

Network training (learning)
- Forward Pass: Input $X$ passes through the network to generate $\hat Y$.
- Loss Calculation: The loss function compares $\hat Y$ and $Y$.
- Backward Pass: The optimizer computes gradients of the loss concerning the weights.
- Weight Update: Weights are adjusted based on the gradients.
- Repeat: The process continues until the loss converges or the training reaches a predefined number of iterations.

$$\omega:= \omega - \alpha \nabla_{\omega} \left( h_{\omega} \left(x^{(i)}\right),y^{(i)}\right)$$

Import Library

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

Load MNIST Data

Download MNIST data from tensorflow tutorial example

mnist = tf.keras.datasets.mnist

(train_x, train_y), (test_x, test_y) = mnist.load_data()
train_x, test_x = train_x/255.0, test_x/255.0

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 ━━━━━━━━━━━━━━━━━━━━ 2s 0us/step

Define an ANN Structure

Input size
Hidden layer size
The number of classes

model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units = 100, activation = 'relu'),
    tf.keras.layers.Dense(units = 10, activation = 'softmax')
])

Define Loss and Optimizer

Optimizer

This defines how the model is updated based on the data it sees and its loss function.
AdamOptimizer: the most popular optimizer

Loss

This defines how we measure how accurate the model is during training. As was covered in lecture, during training we want to minimize this function, which will "steer" the model in the right direction.
Classification: Cross-entropy
- Equivalent to apply logistic regression

$$ -\frac{1}{m}\sum_{i=1}^{m}y^{(i)}\log(h_{\theta}\left(x^{(i)}\right)) + (1-y^{(i)})\log(1-h_{\theta}\left(x^{(i)}\right)) $$

Metrics

Here we can define metrics used to monitor the training and testing steps. In this example, we'll look at the accuracy, the fraction of the images that are correctly classified.

model.compile(optimizer = 'adam',
              loss = 'sparse_categorical_crossentropy',
              metrics = ['accuracy'])

Train Model

# Train Model

loss = model.fit(train_x, train_y, epochs = 5, verbose = 0)

Test or Evaluate

# Evaluate Test Data

test_loss, test_acc = model.evaluate(test_x, test_y, verbose = 0)

test_img = test_x[np.random.choice(test_x.shape[0], 1)]

predict = model.predict_on_batch(test_img)
mypred = np.argmax(predict, axis = 1)

plt.figure(figsize = (8,4))

plt.subplot(1,2,1)
plt.imshow(test_img.reshape(28, 28), 'gray')
plt.axis('off')
plt.subplot(1,2,2)
plt.stem(predict[0])
plt.show()

print('Prediction : {}'.format(mypred[0]))

Prediction : 8

What is the highest accuracy you can achieve with this first fully connected model? Since the handwritten digit classification task is pretty straightforward, you may be wondering how we can do better...

$\Rightarrow$ As we saw in lecture, convolutional neural networks (CNNs) are particularly well-suited for a variety of tasks in computer vision, and have achieved near-perfect accuracies on the MNIST dataset. We will build a CNN and ultimately output a probability distribution over the 10 digit classes (0-9) in the next lectures.

7.3. Historial Notes¶

The MNIST Dataset and Yann LeCun

The MNIST (Modified National Institute of Standards and Technology) dataset is one of the most famous benchmark datasets in machine learning, consisting of 60,000 training images and 10,000 test images of handwritten digits (0 to 9), each in a 28x28 grayscale pixel format. It is widely used for training and evaluating classification models, especially in the field of neural networks.

Yann LeCun’s Role in the Creation of MNIST

Yann LeCun (Turing Award, 2018), a pioneer in the field of deep learning, created the MNIST dataset in collaboration with Corinna Cortes and Christopher J.C. Burges in 1998. LeCun, known for his groundbreaking work in neural networks, particularly convolutional neural networks (CNNs), developed MNIST as an easily accessible and standardized dataset to evaluate machine learning algorithms.

Before MNIST, many machine learning models struggled with inconsistent and poorly standardized datasets for real-world pattern recognition tasks. MNIST solved this by providing a well-curated, balanced, and preprocessed dataset that became the de facto benchmark for evaluating classification models.

Legacy of MNIST

Despite its simplicity, MNIST remains a crucial stepping stone in the history of artificial intelligence. It demonstrated the power of neural networks in pattern recognition and inspired the development of modern deep learning architectures. Yann LeCun’s creation of MNIST not only provided a foundational benchmark for machine learning but also played a pivotal role in popularizing neural networks and shaping the trajectory of modern AI research.

8. Other Tutorials¶

%%html
<iframe src="https://www.youtube.com/embed/aircAruvnKk?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe>

%%html
<iframe src="https://www.youtube.com/embed/IHZwWFHWa-w?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe>

%%html
<iframe src="https://www.youtube.com/embed/Ilg3gGewQ5U?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe>

%%html
<iframe src="https://www.youtube.com/embed/tIeHLnjs5U8?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe>

%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')