Convolutional Neural Networks (CNN)
Table of Contents
We will explore Convolutional Neural Networks (CNNs); however, before diving into the details, it is essential to first examine the concept of the convolution operation.
from IPython.display import YouTubeVideo
YouTubeVideo('xSuFInvLjBo', width = "560", height = "315")
Convolution is a fundamental mathematical operation in signal processing, widely used in applications such as filtering, audio processing, image processing, and system analysis. The operation involves integrating the product of two functions, where one function is flipped and shifted relative to the other. Convolution can be interpreted as a process that combines two signals to produce a third signal, characterizing how the shape of one signal is altered by the influence of another.
Mathematical Formulation
(1) Continuous-Time Convolution
For two continuous functions $f(t)$ (input signal) and $g(t)$ (impulse response), the convolution is defined as:

$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$
(2) Discrete-Time Convolution
For discrete signals $f[n]$ and $g[n]$, the convolution is expressed as:

$(f * g)[n] = \sum_{k=-\infty}^{\infty} f[k]\, g[n - k]$
Here, $g[n - k]$ represents the flipped and shifted version of $g$, and the summation computes the weighted sum of the overlapping values.
Concept of Convolution in Signal Processing
Input Signal: The function being processed (e.g., an image, an audio waveform).
Impulse Response (Kernel/Filter): The function that defines how the system modifies the input. Common kernels include smoothing filters, edge detection filters, and band-pass filters.
Convolution Output: The result after applying the filter to the input, showing how the original signal is transformed.
Visual Intuition
The convolution operation can be understood as a sliding window process, where one function (usually the filter or kernel) is flipped and shifted across the input. At each position, the product of the overlapping values is summed to compute the output value at that point.
For example:
In 1D audio signals, the sliding window represents the flipped and shifted impulse response passing over the time-series input.
In 2D image convolutions, the kernel slides over different regions of the image, applying the same process of element-wise multiplication and summation for each spatial location.
Convolution and cross-correlation are two similar yet distinct operations commonly used in signal processing and deep learning. Both operations involve a sliding window (filter/kernel) applied to an input signal, but they differ in how the filter is applied.
Mathematical Formulation
(1) Continuous-Time Cross-Correlation
The continuous-time cross-correlation of two signals $f(t)$ and $g(t)$ is defined as:

$(f \star g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t + \tau)\, d\tau$
Here, $g(t + \tau)$ represents the shifted version of $g(t)$. Unlike convolution, the function $g(t)$ is not flipped before integration.
(2) Discrete-Time Cross-Correlation
The discrete 1D cross-correlation of $f[n]$ and $g[n]$ is defined as:

$(f \star g)[n] = \sum_{k=-\infty}^{\infty} f[k]\, g[n + k]$
The kernel $g[n]$ is not flipped; instead, it is directly shifted across the input signal.
Visual Intuition
from IPython.display import YouTubeVideo
YouTubeVideo('vrl1YlCvyQo', width = "560", height = "315")
Intuition Behind the Operations
(1) Convolution Intuition:
The signal $f[n]$ is "processed" by the kernel $g[n]$, which acts as a function that modifies or filters the input.
By flipping the kernel, convolution ensures that the output respects the system's causality (i.e., the output depends only on present and past inputs, not future values).
(2) Cross-Correlation Intuition:
Cross-correlation measures how similar the input signal is to the kernel at different positions.
The larger the output value at a given shift, the more similar the signal segment is to the kernel.
Let's compute the 1D convolution process step-by-step.
(1) Kernel $[1, 3, 0, -1]$ at Position 1:
The kernel overlaps with the first four values of the input signal: $[1, 3, 2, 3]$.
(2) Kernel $[1, 3, 0, -1]$ Shifted by 1 Position:
The kernel now overlaps with the next segment of the input: $[3, 2, 3, 0]$.
(3) This process repeats until the kernel has slid across the entire input signal.
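To make the arithmetic concrete, here is a short NumPy sketch of these steps. It assumes the input signal begins with $[1, 3, 2, 3, 0]$ (the values implied by the two overlaps above); np.correlate slides the kernel without flipping it (the convention used later for CNNs), while np.convolve flips the kernel first (true convolution).

import numpy as np

f = np.array([1, 3, 2, 3, 0])    # assumed input signal (first values taken from the steps above)
g = np.array([1, 3, 0, -1])      # kernel

# sliding dot product without flipping the kernel (cross-correlation)
print(np.correlate(f, g, mode = 'valid'))    # [7 9]

# true convolution: the kernel is flipped before sliding
print(np.convolve(f, g, mode = 'valid'))     # [8 6]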
In computer vision and machine learning, images are represented as numerical data.
Grayscale Image
A grayscale image is essentially a 2D matrix, where each element corresponds to the intensity of a pixel, typically ranging from 0 (black) to 255 (white) for 8-bit images.
Each number in this matrix tells us how bright or dark a specific pixel is. This numerical representation allows us to perform mathematical operations - such as convolution, filtering, and transformation - to extract features or train neural networks.
Colored Image
Unlike grayscale images, which are represented as 2D matrices, colored images are represented as 3D tensors. Each color image is composed of three separate channels: Red (R), Green (G), and Blue (B).
Each channel is itself a 2D matrix containing pixel intensities for that specific color. Together, they form a 3D array.
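As a quick illustration (with arbitrary sizes and random values), the snippet below builds a grayscale image and a color image as NumPy arrays and prints their shapes.

import numpy as np

gray = np.random.randint(0, 256, size = (28, 28), dtype = np.uint8)      # 2D matrix: height x width
color = np.random.randint(0, 256, size = (28, 28, 3), dtype = np.uint8)  # 3D tensor: height x width x 3 (R, G, B)

print(gray.shape)     # (28, 28)
print(color.shape)    # (28, 28, 3)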
Convolution on Image (2D Convolution) refers to the process of applying a 2D filter (kernel) to an image to extract features, such as edges, textures, or patterns, by sliding the filter over the image. Here's a step-by-step explanation:
Key Components of 2D Convolution:
(1) Input Image (Matrix):
A 2D array of pixel values (grayscale) or 3D (for RGB images).
Example: A 28$\times$28 matrix with values representing pixel intensities.
(2) Filter/Kernel:
A smaller matrix (e.g., 3$\times$3) with predefined or learned values.
The values are used to multiply corresponding input pixels during the convolution operation.
(3) Sliding Operation:
The filter slides across the image matrix (from left to right and top to bottom) based on a defined stride.
At each position, the dot product of the overlapping region and the kernel is computed and saved in the output feature map.
Mathematical Representation
If $I$ is the input image and $K$ is the kernel, the value of the output feature map $S$ at position $(i, j)$ is:

$S(i, j) = \sum_{m}\sum_{n} I(i + m,\, j + n)\, K(m,\, n)$

Where:
$I(i + m, j + n)$ is the image pixel under the kernel at offset $(m, n)$,
$K(m, n)$ is the corresponding kernel weight, and
$S(i, j)$ is the resulting feature map value.
(The kernel is written here without flipping, following the deep learning convention; see the note on convolution vs. cross-correlation below.)
Example of Convolution: $5 \times 5 $ image with $3 \times 3$ Kernel
A filter (or kernel) is slid across the input image to produce a feature map. For example, convolving a 5 $\times$ 5 image with a 3 $\times $ 3 kernel produces a 3 $\times$ 3 feature map.
Consider, for example, a 3 $\times$ 3 kernel (filter).
This filter will perform element-wise multiplication with each 3 $\times$ 3 sliding window from the image. At each step:
The filter overlays a 3 $\times$ 3 region of the image.
Each element of the kernel is multiplied element-wise with the corresponding pixel value in the region.
All products are summed to produce a single number in the output (feature map).
The filter slides (convolves) over the entire image, repeating the process for all valid regions.
The illustration below demonstrates how convolution works.
The figure below shows the same concept, but in a more three-dimensional (intuitive) way, making it easier to understand.
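The same computation can be checked numerically. The sketch below uses an arbitrary 5 $\times$ 5 image and a 3 $\times$ 3 kernel (illustrative values, not those in the figures) and applies scipy.signal.correlate2d in 'valid' mode, which slides the kernel without flipping, producing a 3 $\times$ 3 feature map.

import numpy as np
from scipy import signal

I = np.array([[1, 2, 0, 3, 1],
              [4, 1, 1, 0, 2],
              [0, 2, 3, 1, 1],
              [1, 0, 2, 4, 0],
              [2, 1, 0, 1, 3]])    # illustrative 5 x 5 image

K = np.array([[1, 0, -1],
              [1, 0, -1],
              [1, 0, -1]])         # illustrative 3 x 3 kernel

out = signal.correlate2d(I, K, mode = 'valid')   # sliding dot product (no flip)
print(out.shape)    # (3, 3)
print(out)

# signal.convolve2d(I, K, mode = 'valid') would flip the kernel first (true convolution)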
Kernel (or Filter)
A kernel (also known as a filter) is a small, fixed-size matrix that slides over the input image during the convolution operation. The kernel is responsible for detecting specific features such as edges, patterns, and textures.
Modify or enhance an image by filtering
Filter images to emphasize certain features or remove other features
Filtering includes smoothing, sharpening and edge enhancement
At each position, discrete convolution amounts to an element-wise multiplication with the kernel matrix followed by a sum
Common sizes: 3 $\times$ 3, 5 $\times$ 5, or 7 $\times$ 7
Kernels are used to extract various features such as:
Edge detection (using Sobel or Prewitt filters).
Blurring (using Gaussian filters).
Sharpening (using Laplacian filters).
from google.colab import drive
drive.mount('/content/drive')
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
import cv2
img = cv2.imread('/content/drive/MyDrive/ML/ML_data/lena.png', 0)   # flag 0 loads the image in grayscale
print(img.shape)
plt.figure(figsize = (4, 4))
plt.imshow(img, cmap = 'gray')
plt.axis('off')
plt.show()
Horizontal Edge Detector
Mh = [[1, 0, -1],
[1, 0, -1],
[1, 0, -1]]
img_conv = signal.convolve2d(img, Mh, 'valid')
plt.figure(figsize = (4, 4))
plt.imshow(img_conv, cmap = 'gray', vmin = 0, vmax = np.amax(img_conv))
plt.title('Horizontal Edge Detector')
plt.axis('off')
plt.show()
Vertical Edge Detector
Mv = [[1, 1, 1],
[0, 0, 0],
[-1, -1, -1]]
img_conv = signal.convolve2d(img, Mv, 'valid')
plt.figure(figsize = (4, 4))
plt.imshow(img_conv, cmap = 'gray', vmin = 0, vmax = np.amax(img_conv))
plt.title('Vertical Edge Detector')
plt.axis('off')
plt.show()
Blurring
M = np.ones([5, 5])/25   # 5 x 5 averaging (box) filter
img_conv = signal.convolve2d(img, M, 'valid')
plt.figure(figsize = (4, 4))
plt.imshow(img_conv, cmap = 'gray', vmin = 0, vmax = np.amax(img_conv))
plt.title('Blurring')
plt.axis('off')
plt.show()
Sharpening
M = [[-1, -1, -1],
[-1, 9, -1],
[-1, -1, -1]]
img_conv = signal.convolve2d(img, M, 'valid')
plt.figure(figsize = (4, 4))
plt.imshow(img_conv, cmap = 'gray', vmin = 0, vmax = np.amax(img_conv))
plt.title('Sharpening')
plt.axis('off')
plt.show()
How to Find the Right Kernels
We have explored the use of various predefined kernels that produce specific effects on images. Now, let us consider the opposite approach - instead of manually designing the kernels, the network will learn the kernels directly from the data during training. This approach enables the convolution to automatically discover and refine feature extractors that are most relevant for the task at hand, such as detecting edges, textures, or more complex patterns. By learning the kernels from data, the network adapts to the unique characteristics of the dataset, leading to more robust and effective feature extraction.
from IPython.display import YouTubeVideo
YouTubeVideo('u03QN8lJsDg', width = "560", height = "315")
Image Classification is a computer vision task that involves assigning a label or category to an entire image based on its visual content. The goal is to analyze the image as a whole and predict the class to which it belongs from a predefined set of categories (e.g., identifying "bird" in an image).
The goal is to assign a single label (such as "bird") to the entire image.
Since we've primarily learned about Artificial Neural Networks (ANNs) so far, let's first see how ANNs can be applied to image classification by flattening the 2D image into a 1D vector.
ANN Structure for Image Classification
Using a traditional Artificial Neural Network (ANN) for image classification is generally not ideal.
Limitations:
(1) Lacks Spatial Awareness:
Every pixel is treated as an independent input feature, so the network has no built-in notion of which pixels are adjacent to one another.
(2) Flattening Destroys Spatial Relationships:
The 2D image is flattened into a 1D vector before being passed into the network.
This process removes the spatial organization of the pixels (e.g., locality, patterns, edges).
As a result, the network loses information about how nearby pixels form meaningful features (like lines, corners, and textures).
(3) Poor Feature Learning:
Without local feature detection, ANNs often struggle to learn meaningful representations from image data.
This can lead to poor generalization and inefficient learning, especially on complex images.
This observation motivates the development of alternative neural network architectures designed to effectively handle the unique 2D structure of image data. Specifically, this challenge highlights the intersection between the limitations of traditional ANNs and the advantages offered by the convolution operation discussed earlier. By preserving spatial hierarchies and local connectivity, convolutional operations enable networks to learn meaningful features such as edges, textures, and shapes directly from the image data, paving the way for more robust and efficient image classification and recognition models.
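A rough parameter count makes the contrast concrete. The numbers below are a back-of-the-envelope sketch for a 28 $\times$ 28 grayscale input: a single fully connected layer with 128 hidden units already requires about 100,000 weights, while a convolutional layer with 32 filters of size 3 $\times$ 3 needs only a few hundred, independent of the image size.

# fully connected: every pixel connects to every hidden unit
fc_params = 28*28*128 + 128      # weights + biases = 100,480

# convolutional: 32 shared 3 x 3 kernels on a single input channel
conv_params = 3*3*1*32 + 32      # weights + biases = 320

print(fc_params, conv_params)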
Since we believe that a standard artificial neural network (ANN) structure is not well-suited for image inputs, we need to adopt a different architecture that can better handle the spatial structure of images.
Naive Idea
Apply a 2D filter (or kernel) to the entire image - this means the layer is fully, but convolutionally connected.
This approach helps maintain the spatial organization of pixel information.
Local Receptive Fields
Objects in images typically exhibit local spatial support, meaning that an object of interest occupies specific, localized regions rather than the entire image.
This insight motivates the transition from fully, but convolutionally connected layers - which treat every pixel as equally relevant - to locally and convolutionally connected layers, where connections are limited to a neighborhood of pixels.
Local connectivity allows convolutional layers to efficiently capture spatially coherent patterns. This not only preserves the local structure of the input but also drastically reduces the number of learnable parameters compared to fully connected networks.
Translation Invariance
The red and green units represent distinct sets of convolutional filters, each designed to detect specific features. These filters focus on specific locations within the image to identify objects at those locations.
However, since the same object can appear anywhere in an image, its appearance is independent of position. To account for this, a convolutional filter slides across the entire image, applying the same filter at every location to ensure consistent feature detection regardless of spatial position.
This sliding mechanism, inherent to the convolution operation, provides the network with translation invariance, enabling it to recognize features or objects regardless of where they appear in the image.
It further reduces the number of learnable parameters.
Considering the progression from the naive idea to the concepts of Local Receptive Fields and Translation Invariance, the convolution operation - originally developed in mathematics and signal processing - can serve as a perfect candidate for feature extraction in image classification tasks.
Kernel Learning
At this point, it's clear that the convolution operation is well-suited for image classification, thanks to its ability to preserve spatial structure and capture meaningful features across the input image. Naturally, the next step is to focus on learning the optimal weights within the kernel.
Importantly, the kernel is not manually designed; instead, it is learned from data during the training process. The network acts as a visual feature extractor, learning to detect and refine patterns such as edges, textures, and shapes that are relevant to the classification task. This data-driven approach enables the network to adaptively discover the most salient features, improving its ability to generalize to new, unseen inputs.
Again, this is the philosophy of deep learning: learning features from data rather than designing them by hand.
Note on Convolution and Cross-Correlation in CNN:
In practice, cross-correlation is often used instead of true convolution for computational convenience. In many deep learning frameworks (e.g., TensorFlow, PyTorch), the "convolution" operation is technically implemented as cross-correlation - meaning the kernel is shifted but not flipped. However, this distinction generally does not affect the performance of CNNs, as the model can learn filters that behave equivalently to flipped kernels.
Explanation in 2D (Images)
This is similar to viewing the kernel through a mirror placed both vertically and horizontally.
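A minimal check of this point (assuming TensorFlow and SciPy are available): applying tf.nn.conv2d with a fixed kernel reproduces scipy.signal.correlate2d (no flip) rather than convolve2d.

import numpy as np
import tensorflow as tf
from scipy import signal

img = np.random.rand(5, 5).astype(np.float32)
k = np.array([[1, 0, -1],
              [1, 0, -1],
              [1, 0, -1]], dtype = np.float32)

# TensorFlow expects input [batch, H, W, channels] and filter [kh, kw, in, out]
tf_out = tf.nn.conv2d(img.reshape(1, 5, 5, 1),
                      k.reshape(3, 3, 1, 1),
                      strides = 1,
                      padding = 'VALID').numpy().reshape(3, 3)

print(np.allclose(tf_out, signal.correlate2d(img, k, mode = 'valid')))   # True: kernel not flipped
print(np.allclose(tf_out, signal.convolve2d(img, k, mode = 'valid')))    # generally False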
(1) Multiple Channels
Multiple channels in convolution refer to input data with more than one feature channels (e.g., RGB images have three channels corresponding to Red, Green, and Blue).
The convolution operation is extended to process all these channels simultaneously, extracting meaningful features across all of the input channels.
(2) Multiple Kernels
Multiple kernels (filters) are used to extract different types of features from the input image.
Each kernel learns to detect a specific pattern, such as edges, textures, corners, or shapes, and the combination of these kernels allows the network to construct a richer and more comprehensive representation of the input image.
By stacking the outputs of multiple filters, the network captures rich and complementary representations of the input image, enabling more effective learning for tasks like classification or detection. This diversity of learned features is a key to the success of CNNs.
For example, using two kernels on an input will produce two output feature maps (channels).
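For instance (a minimal sketch), a Conv2D layer with 2 filters applied to a 3-channel (RGB) input yields 2 output feature maps, and each filter spans all 3 input channels.

import tensorflow as tf

x = tf.random.normal((1, 32, 32, 3))      # a batch with one 32 x 32 RGB image
conv = tf.keras.layers.Conv2D(filters = 2,
                              kernel_size = (3, 3),
                              padding = 'SAME')

print(conv(x).shape)        # (1, 32, 32, 2): two output feature maps
print(conv.kernel.shape)    # (3, 3, 3, 2): each filter covers all 3 input channels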
from IPython.display import YouTubeVideo
YouTubeVideo('cr2fz5A2MXQ', width = "560", height = "315")
Strides refer to the step size of the convolution operator - specifically, the number of pixels the kernel moves horizontally or vertically after each operation.
(1) Stride of 1:
For a 3 $\times$ 3 kernel and a stride of 1, the kernel moves one pixel at a time, producing a densely overlapping output feature map.
(2) Stride of 2:
For a 3 $\times$ 3 kernel and a stride of 2, the kernel moves two pixels at a time, reducing the spatial dimensions of the output (i.e., downsampling).
Padding refers to the process of adding extra pixels (usually zeros) around the border of the input image before applying the convolution.
The main purpose of padding is to preserve the spatial dimensions of the output, so that the feature map does not shrink after every convolution.
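A standard formula (not derived above, but useful as a quick check) relates these quantities. For an input of size $W$, kernel size $K$, padding $P$, and stride $S$, the output size along each spatial dimension is:

$\text{output size} = \left\lfloor \dfrac{W - K + 2P}{S} \right\rfloor + 1$

For example, a $5 \times 5$ input with a $3 \times 3$ kernel, $P = 0$, and $S = 1$ gives $\lfloor (5 - 3 + 0)/1 \rfloor + 1 = 3$, matching the earlier $5 \times 5$ example; with $P = 1$ the output remains $5 \times 5$ ("same" padding).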
The convolution operation is inherently a linear operation, as it involves computing the weighted sum of input values (pixels) and kernel weights. However, real-world data and tasks like image classification involve complex and nonlinear patterns. Therefore, to make the network capable of learning these complex relationships, nonlinear activation functions are applied after the convolution operation.
The nonlinear activation function is applied individually to each pixel value in the feature map produced by the convolution operation. This pointwise application of nonlinearity is crucial for enabling the network to introduce complexity and flexibility in its learned representations.
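For example (a tiny sketch), applying ReLU to a feature map simply zeroes out its negative entries, one element at a time:

import numpy as np

feature_map = np.array([[ 2., -1.],
                        [-3.,  4.]])

relu = np.maximum(0, feature_map)    # pointwise max(0, x) applied to every entry
print(relu)                          # [[2. 0.]
                                     #  [0. 4.]]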
Pooling is a crucial operation in Convolutional Neural Networks (CNNs) used to downsample feature maps by summarizing the presence of features in small spatial regions. Pooling reduces the spatial dimensions (height and width) of the feature maps while retaining the most important information. This helps make the network more efficient and robust to translations and small distortions in the input image.
Purpose of Pooling
Reduction in Spatial Dimensions: The spatial size (width and height) of the feature map is reduced.
Retention of Key Features: Max pooling retains the most prominent features, while average pooling retains a smooth summary of the region.
Overfitting Prevention: By downsampling, pooling reduces the model's complexity, making it less prone to overfitting.
Types of Pooling Operations
Max Pooling: Frequently used for downsampling in CNNs due to its ability to retain strong activations.
Average Pooling: Used in tasks requiring smoothed outputs, such as image compression.
Global Pooling: Used to reduce the feature map to a single value per channel.
(1) Max Pooling
Max Pooling computes the maximum value within a sliding window over the feature map.
Reduces spatial resolution, leading to faster computation and lower memory usage.
Introduces translation invariance by being robust to small shifts or displacements in the input.
(2) Average Pooling
Average pooling is a down-sampling technique that reduces the spatial dimensions (height and width) of feature maps by computing the average value within a sliding window.
(3) Global Average Pooling (GAP)
Global Average Pooling averages over the entire spatial extent of each feature map, resulting in a single value per channel.
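The sketch below applies the three pooling variants to a single 4 $\times$ 4 feature map with a 2 $\times$ 2 window and stride 2 (a minimal example using Keras layers) and prints the resulting shapes.

import tensorflow as tf

x = tf.random.normal((1, 4, 4, 1))    # one 4 x 4 feature map

print(tf.keras.layers.MaxPool2D((2, 2))(x).shape)            # (1, 2, 2, 1)
print(tf.keras.layers.AveragePooling2D((2, 2))(x).shape)     # (1, 2, 2, 1)
print(tf.keras.layers.GlobalAveragePooling2D()(x).shape)     # (1, 1): one value per channel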
Question: Why Max Pooling is Invariant to Small Deformations
Max pooling offers a form of pseudo-invariance to small deformations in the input, such as local translations, distortions, or noise.
Here's why:
It captures the maximum value within a local region, so minor spatial shifts in the input may still result in the same maximum value being selected.
As a result, the output remains unchanged or only slightly affected, making the network more robust to small spatial variations.
This reduces sensitivity to exact pixel locations, allowing the model to focus on the presence of features rather than their precise positions.
For instance, the output of max pooling often remains unchanged even when the input is shifted by one pixel, because the same maximum value can still fall within the pooling window.
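Here is a small numerical illustration of this claim (an assumed toy feature map, not from the text): the map below has a single strong activation, and shifting it one pixel to the right leaves the 2 $\times$ 2 max-pooled output unchanged because the maximum stays inside the same pooling window.

import numpy as np

def max_pool_2x2(a):
    # non-overlapping 2 x 2 max pooling on a 4 x 4 array
    return a.reshape(2, 2, 2, 2).max(axis = (1, 3))

x = np.zeros((4, 4))
x[0, 0] = 5.                          # strong activation in the top-left pooling window

x_shifted = np.roll(x, 1, axis = 1)   # shift the whole map one pixel to the right

print(max_pool_2x2(x))                # [[5. 0.] [0. 0.]]
print(max_pool_2x2(x_shifted))        # identical output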
We have studied all the necessary components of convolutional layer blocks. Note that convolution is not the only element - nonlinear activation functions and pooling operations are also typically combined to form a complete convolutional layer block.
Convolutional Layer Block
A Convolutional Layer Block is the fundamental unit in Convolutional Neural Networks (CNNs) that processes input data to extract hierarchical features from images. It typically consists of three main components: convolution, nonlinear activation function, and pooling. These components work together to learn meaningful features while preserving spatial structure and reducing the dimensions of the data.
(1) Convolution Operation: Applies a set of learnable kernels to the input to produce feature maps while preserving spatial structure.
(2) Nonlinear Activation Function: Applies a pointwise nonlinearity (e.g., ReLU) to each feature map so the network can model complex, nonlinear patterns.
(3) Pooling Operation: Downsamples the feature maps, reducing their spatial dimensions while retaining the most important information.
Stacking Convolutional Layer Blocks for Deeper Representations
When multiple convolutional blocks are arranged in sequence (e.g., Block 1 $\rightarrow$ Block 2 $\rightarrow$ Block 3), each block operates on the feature maps produced by the previous one.
By stacking these blocks, a CNN is able to progressively learn more complex and abstract representations of the input - starting with low-level features like edges and textures, and moving toward high-level concepts such as shapes or specific object parts.
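The sketch below traces the shapes through two stacked blocks (convolution + ReLU + max pooling) for a 28 $\times$ 28 grayscale input, illustrating how the spatial resolution shrinks while the number of channels grows (an illustrative configuration, not a specific model from the text).

import tensorflow as tf

x = tf.random.normal((1, 28, 28, 1))

block1 = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation = 'relu', padding = 'SAME'),
    tf.keras.layers.MaxPool2D((2, 2))
])
block2 = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, (3, 3), activation = 'relu', padding = 'SAME'),
    tf.keras.layers.MaxPool2D((2, 2))
])

h1 = block1(x)
h2 = block2(h1)
print(h1.shape)    # (1, 14, 14, 32): low-level features
print(h2.shape)    # (1, 7, 7, 64): more abstract features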
Returning to the original image classification problem, all the discussions so far have focused on feature extraction using a sequence of convolutional layer blocks.
However, after extracting and reorganizing these features, we still need to perform classification. This is typically done using fully connected layers, which act as a linear classifier.
(1) Feature Learning (Sequence of Convolutional Layer Blocks)
A sequence of convolutional, activation, and pooling layers is stacked to form multiple convolutional blocks.
These blocks enable the network to learn hierarchical feature representations, from low-level edges to high-level object parts.
(2) Classification (Fully Connected Layers)
Flattening: The final feature maps are flattened into a 1D vector so they can be passed to fully connected layers.
Fully Connected Layers: One or more dense layers combine the extracted features into class scores.
Softmax Output: A softmax layer converts the class scores into a probability distribution over the predefined categories.
Convolutional Neural Networks (CNNs) and Yann LeCun
Convolutional Neural Networks (CNNs) are a class of deep learning models designed specifically for processing data with a grid-like structure, such as images. They have revolutionized the fields of computer vision and pattern recognition by automatically learning spatial hierarchies of features from input data, without requiring handcrafted features.
Yann LeCun's Contributions to CNNs
Yann LeCun is widely recognized as the father of convolutional neural networks. His pioneering research laid the foundation for CNNs, demonstrating their potential in visual recognition tasks, long before deep learning became mainstream.
from IPython.display import YouTubeVideo
YouTubeVideo('FwFduRA_L6Q', width = "560", height = "315")
Impact of CNNs on Modern AI
Computer Vision:
Architectural Evolution:
Recognition of Yann LeCun's Work:
In 2018, Yann LeCun, along with Geoffrey Hinton and Yoshua Bengio, received the Turing Award, often referred to as the "Nobel Prize of Computing," for their contributions to deep learning.
LeCun has remained a prominent voice in the AI community, advocating for advancements in neural networks, including his more recent work on self-supervised learning and energy-based models.
The Legacy of CNNs
Yann LeCun's work on CNNs changed the landscape of artificial intelligence by demonstrating that neural networks could learn to recognize patterns directly from raw data. The success of LeNet-5 proved that CNNs could generalize well across a variety of visual tasks, and this architecture remains the foundation of many deep learning models today. Without LeCun's contributions, the rapid progress in computer vision, autonomous vehicles, and AI-driven image processing would not have been possible.
from IPython.display import YouTubeVideo
YouTubeVideo('cr2fz5A2MXQ', width = "560", height = "315", start = 726)
We will once again use the MNIST dataset to build a multiclass classifier that can identify whether an image belongs to one of the digits: 0 through 9. This time, however, the classifier will be implemented using a Convolutional Neural Network (CNN).
Note:
Load MNIST Data
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
mnist = tf.keras.datasets.mnist
(train_x, train_y), (test_x, test_y) = mnist.load_data()
train_x, test_x = train_x/255.0, test_x/255.0               # scale pixel values to [0, 1]
train_x = train_x.reshape((train_x.shape[0], 28, 28, 1))    # add a channel dimension for Conv2D
test_x = test_x.reshape((test_x.shape[0], 28, 28, 1))
Define a CNN Structure
model = tf.keras.models.Sequential([
tf.keras.layers.Input(shape = (28, 28, 1)),
tf.keras.layers.Conv2D(filters = 32,
kernel_size = (3,3),
activation = 'relu',
padding = 'SAME'),
tf.keras.layers.MaxPool2D((2,2)),
tf.keras.layers.Conv2D(filters = 64,
kernel_size = (3,3),
activation = 'relu',
padding = 'SAME'),
tf.keras.layers.MaxPool2D((2,2)),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(units = 128, activation = 'relu'),
tf.keras.layers.Dense(units = 10, activation = 'softmax')
])
Optimizer and Loss
model.compile(optimizer = 'adam',
loss = 'sparse_categorical_crossentropy',
metrics = ['accuracy'])
Training
model.fit(train_x, train_y, batch_size = 50, epochs = 3)
Testing or Evaluating
test_loss, test_acc = model.evaluate(test_x, test_y)
test_img = test_x[[1495]]    # double brackets keep the batch dimension: shape (1, 28, 28, 1)
predict = model.predict(test_img, verbose = 0)
mypred = np.argmax(predict, axis = 1)
plt.figure(figsize = (9, 4))
plt.subplot(1,2,1)
plt.imshow(test_img.reshape(28, 28), 'gray')
plt.axis('off')
plt.subplot(1,2,2)
plt.stem(predict[0])
plt.show()
print('Prediction : {}'.format(mypred[0]))
Having explored the implementation of a CNN using the MNIST dataset - a relatively simple and well-structured benchmark - we now turn our attention to a more realistic and practical engineering problem. In this case, we will consider the NEU Surface Defect Database (NEU), which contains images of steel surface defects commonly encountered in industrial applications.
NEU steel surface defects
To classify defect images into 6 classes
Download NEU steel surface defects images and labels
Training
from google.colab import drive
drive.mount('/content/drive')
# Change file paths if necessary
train_x = np.load('/content/drive/MyDrive/DL/DL_data/NEU_train_imgs.npy')
train_y = np.load('/content/drive/MyDrive/DL/DL_data/NEU_train_labels.npy')
test_x = np.load('/content/drive/MyDrive/DL/DL_data/NEU_test_imgs.npy')
test_y = np.load('/content/drive/MyDrive/DL/DL_data/NEU_test_labels.npy')
print(train_x.shape)
print(train_y.shape)
print(test_x.shape)
print(test_y.shape)
model = tf.keras.models.Sequential([
tf.keras.layers.Input(shape = (200, 200, 1)),
tf.keras.layers.Conv2D(filters = 32,
kernel_size = (3,3),
activation = 'relu',
padding = 'SAME'),
tf.keras.layers.MaxPool2D((2,2)),
tf.keras.layers.Conv2D(filters = 64,
kernel_size = (3,3),
activation = 'relu',
padding = 'SAME'),
tf.keras.layers.MaxPool2D((2,2)),
tf.keras.layers.Conv2D(filters = 128,
kernel_size = (3,3),
activation = 'relu',
padding = 'SAME'),
tf.keras.layers.MaxPool2D((2,2)),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(units = 128, activation = 'relu'),
tf.keras.layers.Dense(units = 6, activation = 'softmax')
])
model.compile(optimizer = 'adam',
loss = 'sparse_categorical_crossentropy',
metrics = ['accuracy'])
model.fit(train_x, train_y, batch_size = 50, epochs = 10)
Testing or Evaluating
test_loss, test_acc = model.evaluate(test_x, test_y)
name = ['scratches', 'rolled-in scale', 'pitted surface', 'patches', 'inclusion', 'crazing']
idx = np.random.choice(test_x.shape[0], 1)
test_img = test_x[idx]
test_label = test_y[idx]
predict = model.predict(test_img, verbose = 0)
mypred = np.argmax(predict, axis = 1)
plt.figure(figsize = (9, 4))
plt.subplot(1,2,1)
plt.imshow(test_img.reshape(200, 200), 'gray')
plt.axis('off')
plt.subplot(1,2,2)
plt.stem(predict[0])
plt.show()
print('Prediction : {}'.format(name[mypred[0]]))
print('True Label : {}'.format(name[test_label[0]]))
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')