Convolutional Neural Networks (CNN)
Table of Contents
In this section, we will explore Convolutional Neural Networks (CNNs); however, before diving into the details, it is essential to first examine the concept of the convolution operation.
from IPython.display import YouTubeVideo
YouTubeVideo('xSuFInvLjBo', width = "560", height = "315")
Convolution is a fundamental mathematical operation in signal processing, widely used in applications such as filtering, audio processing, image processing, and system analysis. The operation involves integrating the product of two functions, where one function is flipped and shifted relative to the other. Convolution can be interpreted as a process that combines two signals to produce a third signal, characterizing how the shape of one signal is altered by the influence of another.
Mathematical Formulation
Continuous-Time Convolution
For two continuous functions $f(t)$ (input signal) and $g(t)$ (impulse response), the convolution is defined as:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$
Discrete-Time Convolution
For discrete signals $f[n]$ and $g[n]$, the convolution is expressed as:

$$(f * g)[n] = \sum_{k = -\infty}^{\infty} f[k]\, g[n - k]$$
Here, $g[n - k]$ represents the flipped and shifted version of $g$, and the summation computes the weighted sum of the overlapping values.
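NumPy's np.convolve implements exactly this sum. A minimal sketch with made-up signals:

import numpy as np

f = np.array([1, 2, 3])       # input signal f[n]
g = np.array([0, 1, 0.5])     # impulse response g[n]

# 'full' mode returns every shift at which f and the flipped g overlap
y = np.convolve(f, g)
print(y)   # [0.  1.  2.5  4.  1.5]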
Concept of Convolution in Signal Processing
Input Signal: The function being processed (e.g., an image, an audio waveform).
Impulse Response (Kernel/Filter): The function that defines how the system modifies the input. Common kernels include smoothing filters, edge detection filters, and band-pass filters.
Convolution Output: The result after applying the filter to the input, showing how the original signal is transformed.
Visual Intuition
The convolution operation can be understood as a sliding window process, where one function (usually the filter or kernel) is flipped and shifted across the input. At each position, the product of the overlapping values is summed to compute the output value at that point.
For example:
In 1D audio signals, the sliding window represents the flipped and shifted impulse response passing over the time-series input.
In 2D image convolutions, the kernel slides over different regions of the image, applying the same process of element-wise multiplication and summation for each spatial location.
Convolution and cross-correlation are two similar yet distinct operations commonly used in signal processing and deep learning. Both operations involve a sliding window (filter/kernel) applied to an input signal, but they differ in how the filter is applied.
Mathematical Formulation
Continuous-Time Cross-Correlation
The continuous-time cross-correlation of two signals $f(t)$ and $g(t)$ is defined as:

$$(f \star g)(\tau) = \int_{-\infty}^{\infty} f(t)\, g(t + \tau)\, dt$$
Here, $g(t + \tau)$ represents the shifted version of $g(t)$. Unlike convolution, the function $g(t)$ is not flipped before integration.
Discrete-Time Cross-Correlation
The discrete 1D cross-correlation of $f[n]$ and $g[n]$ is defined as:

$$(f \star g)[n] = \sum_{k = -\infty}^{\infty} f[k]\, g[k + n]$$
The kernel $g[n]$ is not flipped; instead, it is directly shifted across the input signal.
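In NumPy, np.correlate computes this directly, and the flipped-kernel relationship to convolution is easy to check (a small sketch with made-up signals):

import numpy as np

f = np.array([1, 2, 3, 4])
g = np.array([1, 0, -1])

corr = np.correlate(f, g, mode = 'valid')        # kernel shifted, not flipped
conv = np.convolve(f, g[::-1], mode = 'valid')   # same result: convolve with the flipped kernel

print(corr)                         # [-2 -2]
print(np.array_equal(corr, conv))   # True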
Visual Intuition
from IPython.display import YouTubeVideo
YouTubeVideo('vrl1YlCvyQo', width = "560", height = "315")
Intuition Behind the Operations
Convolution Intuition:
The signal $f[n]$ is "processed" by the kernel $g[n]$, which acts as a function that modifies or filters the input.
By flipping the kernel, convolution mirrors how a causal system responds to its input: when the impulse response is causal, the output depends only on present and past inputs, not future values.
Cross-Correlation Intuition:
Cross-correlation measures how similar the input signal is to the kernel at different positions.
The larger the output value at a given shift, the more similar the signal segment is to the kernel.
Why the Difference Does Not Matter in Deep Learning (CNNs)
In practice, many frameworks (e.g., TensorFlow, PyTorch) use cross-correlation instead of convolution for convolutional neural networks (CNNs) because flipping the kernel does not significantly affect feature learning. The goal in CNNs is to match patterns rather than strictly respect the mathematical definition of convolution used in classical signal processing.
Kernel $[1, 3, 0, -1]$ at Position 1:
The kernel (applied without flipping, as in CNNs) overlaps with the first four values of the input signal, $[1, 3, 2, 3]$. The output is the element-wise product summed: $1 \cdot 1 + 3 \cdot 3 + 0 \cdot 2 + (-1) \cdot 3 = 7$.
Kernel $[1, 3, 0, -1]$ Shifted by 1 Position:
The kernel now overlaps with the next segment of the input, $[3, 2, 3, 0]$, giving $1 \cdot 3 + 3 \cdot 2 + 0 \cdot 3 + (-1) \cdot 0 = 9$.
This process repeats until the kernel has slid across the entire input signal.
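These overlaps can be verified with NumPy (a minimal sketch using only the five input samples, $[1, 3, 2, 3, 0]$, visible in the example above):

import numpy as np

x = np.array([1, 3, 2, 3, 0])    # input segment from the example
k = np.array([1, 3, 0, -1])      # kernel

# 'valid' mode slides the kernel over the input without flipping it
print(np.correlate(x, k, mode = 'valid'))   # [7 9]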
Convolution on Image (2D Convolution) refers to the process of applying a 2D filter (kernel) to an image to extract features, such as edges, textures, or patterns, by sliding the filter over the image. Here's a step-by-step explanation:
Key Components of 2D Convolution:
Input Image (Matrix):
A 2D array of pixel values (grayscale) or a 3D array (for RGB images).
Example: A 5$\times$5 matrix with values representing pixel intensities.
Filter/Kernel:
A smaller matrix (e.g., 3$\times$3) with predefined or learned values.
The values are used to multiply corresponding input pixels during the convolution operation.
Sliding Operation:
The filter slides across the image matrix (from left to right and top to bottom) based on a defined stride.
At each position, the dot product of the overlapping region and the kernel is computed and saved in the output feature map.
Mathematical Representation
If $I$ is the input image and $K$ is the kernel:

$$S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, K(m, n)$$

Where:
$S(i, j)$ is the value of the output feature map at position $(i, j)$.
$m$ and $n$ index the rows and columns of the kernel $K$.
(As discussed below, deep learning frameworks implement this cross-correlation form, i.e., the kernel is not flipped.)
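The double sum can be written out directly as nested loops. Below is a minimal sketch of the cross-correlation form used in CNN frameworks, with a made-up 5$\times$5 image and a 3$\times$3 kernel:

import numpy as np

I = np.arange(25).reshape(5, 5)     # toy 5x5 "image"
K = np.array([[1, 0, -1],
              [1, 0, -1],
              [1, 0, -1]])          # 3x3 vertical-edge-style kernel

H, W = I.shape
kh, kw = K.shape
S = np.zeros((H - kh + 1, W - kw + 1))   # 'valid' output: 3x3

for i in range(S.shape[0]):
    for j in range(S.shape[1]):
        # element-wise product of the overlapping patch and the kernel, summed
        S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)

print(S)   # every entry is -6.0 for this linear-ramp image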
Kernel (or Filter)
A kernel (also known as a filter) is a small, fixed-size matrix that slides over the input image during the convolution operation. The kernel is responsible for detecting specific features such as edges, patterns, and textures.
Modify or enhance an image by filtering
Filter images to emphasize certain features or remove others
Filtering includes smoothing, sharpening, and edge enhancement
Each output value of a discrete convolution is an element-wise multiplication of the kernel with an image patch, followed by a sum
Common sizes: 3$\times$3, 5$\times$5, or 7$\times$7
Kernels are used to extract various features such as:
Edge detection (using Sobel or Prewitt filters; see the sketch after this list).
Blurring (using Gaussian filters).
Sharpening (using Laplacian filters).
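For instance, a Sobel kernel responds strongly at vertical edges (a sketch using SciPy; the image here is a synthetic step edge rather than a real photo):

import numpy as np
from scipy.signal import convolve2d

# Synthetic image: dark left half, bright right half (a vertical step edge)
img = np.zeros((6, 6))
img[:, 3:] = 1.0

# Sobel kernel for vertical edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

edges = convolve2d(img, sobel_x, mode = 'valid')
print(edges)   # large-magnitude responses along the step, zero in flat regions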
How to Find the Right Kernels
We have explored the use of various predefined kernels that produce specific effects on images. Now, let us consider the opposite approach — instead of manually designing the kernels, the network will learn the kernels directly from the data during training. This approach enables the convolution to automatically discover and refine feature extractors that are most relevant for the task at hand, such as detecting edges, textures, or more complex patterns. By learning the kernels from data, the network adapts to the unique characteristics of the dataset, leading to more robust and effective feature extraction.
from IPython.display import YouTubeVideo
YouTubeVideo('u03QN8lJsDg', width = "560", height = "315")
Image Classification is a computer vision task that involves assigning a label or category to an entire image based on its visual content. The goal is to analyze the image as a whole and predict the class to which it belongs from a predefined set of categories (e.g., identifying "bird" in an image).
Since we've primarily learned about Artificial Neural Networks (ANNs) so far, let's first see how ANNs can be applied to image classification by flattening the 2D image into a 1D vector.
ANN structure for image classification
Flattening, however, discards the spatial relationships between neighboring pixels and forces every hidden unit to connect to every pixel, which quickly inflates the number of parameters. This observation motivates the development of alternative neural network architectures designed to effectively handle the unique 2D structure of image data. Specifically, this challenge highlights the intersection between the limitations of traditional ANNs and the advantages offered by the convolution operation discussed earlier. By preserving spatial hierarchies and local connectivity, convolutional operations enable networks to learn meaningful features such as edges, textures, and shapes directly from the image data, paving the way for more robust and efficient image classification and recognition models.
Local Receptive Fields:
Objects in images typically exhibit local spatial support, meaning that an object of interest is confined to specific, localized regions of the image.
This motivates the transition from fully connected layers — which treat all input pixels as equally relevant — to locally and convolutionally connected layers, where connections are restricted to a neighborhood of pixels.
By leveraging this local connectivity, convolutional layers can efficiently capture spatially coherent patterns, thereby preserving the local structure of the input image and reducing the number of parameters compared to fully connected layers (see the quick parameter count below).
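To make the parameter savings concrete, here is a quick back-of-the-envelope count (a sketch; the layer sizes are borrowed from the MNIST model later in this section):

# Fully connected: every one of 128 hidden units connects to every pixel of a 28x28 image
fc_params = 28 * 28 * 128 + 128       # weights + biases

# Convolutional: 32 filters of size 3x3 on a 1-channel image, shared across all locations
conv_params = 3 * 3 * 1 * 32 + 32     # kernel weights + one bias per filter

print(fc_params)     # 100480
print(conv_params)   # 320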
Translation Invariance:
The same pattern can appear anywhere in the image, so the same kernel weights are applied at every spatial location (weight sharing), making the learned features insensitive to where an object appears.
Kernel Learning
At this point, it is reasonable to conclude that the convolution operation is well-suited for image classification due to its capacity to preserve spatial structure and capture meaningful features across the input image. Naturally, the next step is to focus on learning the optimal weights within the kernel.
Of course, the kernel is not manually designed; rather, it is learned from the data during the training process. Specifically, the network is trained to act as a visual feature extractor by identifying and refining patterns such as edges, textures, and shapes that are relevant for the classification task. This data-driven approach allows the network to adaptively discover the most salient features, thereby enhancing its ability to generalize to new, unseen inputs.
Why CNNs Work for Bird Classification:
Preserves 2D Structure: Unlike ANNs, the convolutional layers maintain the spatial arrangement of pixels.
Feature Learning: Kernels (filters) in the convolutional layers learn to detect local features (like edges, corners, textures).
Translation Invariance: Pooling layers make CNNs robust to small shifts in the input image (e.g., the bird appearing in different positions).
Shared Weights: Kernels (filters) are shared across the image, drastically reducing the number of parameters compared to ANNs.
Hierarchical Learning: Early layers learn simple features (e.g., edges), while deeper layers learn complex features (e.g., bird wings, beak, etc.).
Note on Convolution and Cross-Correlation:
In Convolutional Neural Networks (CNNs), cross-correlation is often used instead of convolution for computational convenience. In many frameworks (e.g., TensorFlow, PyTorch), the "convolution" operation is technically implemented as cross-correlation — meaning the kernel is shifted but not flipped. However, this distinction generally does not affect the performance of CNNs, as the model can learn filters that behave equivalently to flipped kernels.
Explanation in 2D (Images)
In two dimensions, true convolution flips the kernel along both axes (a 180° rotation) before sliding it over the image. This is similar to viewing the kernel through a mirror placed both vertically and horizontally.
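This identity is easy to check numerically: correlating with the 180°-rotated kernel reproduces true convolution exactly (a sketch with random data):

import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)
I = rng.standard_normal((5, 5))
K = rng.standard_normal((3, 3))

conv = convolve2d(I, K)                  # true convolution (kernel flipped internally)
corr = correlate2d(I, K[::-1, ::-1])     # cross-correlation with a pre-flipped kernel

print(np.allclose(conv, corr))   # True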
More on Convolution in CNNs
Multiple channels
Multiple channels in convolution refer to handling input data that contains multiple feature channels (e.g., RGB images have three channels corresponding to Red, Green, and Blue).
The convolution operation is extended to process all these channels simultaneously to extract meaningful features across all of the input channels.
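Concretely, each filter spans every input channel: for an RGB input, a 3$\times$3 filter holds $3 \times 3 \times 3$ weights, and its per-channel responses are summed into a single output map. A quick shape check with Keras (a sketch; the sizes are arbitrary):

import tensorflow as tf

x = tf.random.normal((1, 32, 32, 3))   # one 32x32 RGB image

# 16 filters, each spanning all 3 input channels
conv = tf.keras.layers.Conv2D(filters = 16, kernel_size = (3, 3), padding = 'same')
y = conv(x)

print(y.shape)             # (1, 32, 32, 16)
print(conv.kernel.shape)   # (3, 3, 3, 16): (height, width, in_channels, filters)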
from IPython.display import YouTubeVideo
YouTubeVideo('cr2fz5A2MXQ', width = "560", height = "315")
The convolution operation is inherently a linear operation, as it involves computing the weighted sum of input values (pixels) and kernel weights. However, real-world data and tasks like image classification involve complex and nonlinear patterns. Therefore, to make the network capable of learning these complex relationships, nonlinear activation functions are applied after the convolution operation.
The nonlinear activation function is applied individually to each pixel value in the feature map produced by the convolution operation. This pointwise application of nonlinearity is crucial for enabling the network to introduce complexity and flexibility in its learned representations.
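A minimal illustration of this pointwise application, using ReLU and a made-up feature map:

import numpy as np

feature_map = np.array([[ 2.0, -1.5],
                        [-0.3,  4.0]])

# ReLU applied pointwise: negative responses are zeroed, positives pass through
relu_out = np.maximum(feature_map, 0)
print(relu_out)   # [[2. 0.] [0. 4.]]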
Pooling is a crucial operation in Convolutional Neural Networks (CNNs) used to downsample feature maps by summarizing the presence of features in small spatial regions. Pooling layers reduce the spatial dimensions (height and width) of the feature maps while retaining the most important information. This helps make the network more efficient and robust to translations and small distortions in the input image.
Purpose of Pooling
Reduction in Spatial Dimensions: The spatial size (width and height) of the feature map is reduced.
Retention of Key Features: Max pooling retains the most prominent features, while average pooling retains a smooth summary of the region.
Overfitting Prevention: By downsampling, pooling reduces the model's complexity, making it less prone to overfitting.
Types of Pooling Operations
Max Pooling: Frequently used for downsampling in CNNs due to its ability to retain strong activations.
Average Pooling: Used in tasks requiring smoothed outputs, such as image compression.
Global Pooling: Used in modern CNN architectures (like ResNet) to reduce the feature map to a single value per channel before fully connected layers.
Max Pooling
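A worked example: 2$\times$2 max pooling with stride 2 over a 4$\times$4 feature map keeps the largest activation in each non-overlapping window (a sketch in NumPy):

import numpy as np

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 8, 1],
               [3, 0, 4, 5]])

# Group the map into 2x2 blocks and take the max of each block
pooled = fm.reshape(2, 2, 2, 2).max(axis = (1, 3))
print(pooled)   # [[6 4] [7 8]]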
Why Convolution is Important in CNNs
Convolution preserves spatial structure, shares weights across locations, and learns feature extractors directly from data. However, convolution is not the only component of a CNN: the nonlinear activation function and pooling must be combined with it to form a Convolutional Layer Block.
Convolutional Layer Block
A Convolutional Layer Block is the fundamental unit in Convolutional Neural Networks (CNNs) that processes input data to extract hierarchical features from images. It typically consists of three main components: convolution, nonlinear activation function, and pooling. These components work together to learn meaningful features while preserving spatial structure and reducing the dimensions of the data.
Convolution Operation: extracts local features by sliding learned kernels over the input.
Nonlinear Activation Function: applies a pointwise nonlinearity (e.g., ReLU) to the resulting feature maps.
Pooling Operation: downsamples the feature maps while retaining the most prominent activations.
CNN for Classification
Classification (Fully Connected Layers)
Flattening: the final feature maps are reshaped into a 1D vector.
Fully Connected Layers: combine the extracted features to produce a score for each class.
Softmax Output: converts the class scores into a probability distribution over the predefined categories.
Convolutional Neural Networks (CNNs) and Yann LeCun
Convolutional Neural Networks (CNNs) are a class of deep learning models designed specifically for processing data with a grid-like structure, such as images. They have revolutionized the fields of computer vision and pattern recognition by automatically learning spatial hierarchies of features from input data, without requiring handcrafted features.
Yann LeCun's Contributions to CNNs
Yann LeCun is widely recognized as the father of convolutional neural networks. His pioneering research laid the foundation for CNNs, demonstrating their potential in visual recognition tasks, long before deep learning became mainstream.
from IPython.display import YouTubeVideo
YouTubeVideo('FwFduRA_L6Q', width = "560", height = "315")
Impact of CNNs on Modern AI
Computer Vision: CNNs underpin modern image classification, object detection, and segmentation systems.
Architectural Evolution: LeNet-5 paved the way for deeper successors such as AlexNet, VGG, and ResNet.
Recognition of Yann LeCun's Work:
In 2018, Yann LeCun, along with Geoffrey Hinton and Yoshua Bengio, received the Turing Award, often referred to as the "Nobel Prize of Computing," for their contributions to deep learning.
LeCun has remained a prominent voice in the AI community, advocating for advancements in neural networks, including his more recent work on self-supervised learning and energy-based models.
The Legacy of CNNs
Yann LeCun's work on CNNs changed the landscape of artificial intelligence by demonstrating that neural networks could learn to recognize patterns directly from raw data. The success of LeNet-5 proved that CNNs could generalize well across a variety of visual tasks, and this architecture remains the foundation of many deep learning models today. Without LeCun's contributions, the rapid progress in computer vision, autonomous vehicles, and AI-driven image processing would not have been possible.
from IPython.display import YouTubeVideo
YouTubeVideo('cr2fz5A2MXQ', width = "560", height = "315", start = 726)
Network model
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Load MNIST and scale pixel intensities to [0, 1]
mnist = tf.keras.datasets.mnist
(train_x, train_y), (test_x, test_y) = mnist.load_data()
train_x, test_x = train_x/255.0, test_x/255.0

# Add the channel dimension expected by Conv2D: (N, 28, 28, 1)
train_x = train_x.reshape((train_x.shape[0], 28, 28, 1))
test_x = test_x.reshape((test_x.shape[0], 28, 28, 1))
model = tf.keras.models.Sequential([
    # Block 1: (28, 28, 1) -> (28, 28, 32) -> (14, 14, 32)
    tf.keras.layers.Conv2D(filters = 32,
                           kernel_size = (3,3),
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (28, 28, 1)),
    tf.keras.layers.MaxPool2D((2,2)),

    # Block 2: (14, 14, 32) -> (14, 14, 64) -> (7, 7, 64)
    # (input_shape is only needed on the first layer; Keras infers the rest)
    tf.keras.layers.Conv2D(filters = 64,
                           kernel_size = (3,3),
                           activation = 'relu',
                           padding = 'SAME'),
    tf.keras.layers.MaxPool2D((2,2)),

    # Classifier head
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units = 128, activation = 'relu'),
    tf.keras.layers.Dense(units = 10, activation = 'softmax')
])
model.compile(optimizer = 'adam',
              loss = 'sparse_categorical_crossentropy',
              metrics = ['accuracy'])

model.fit(train_x, train_y, batch_size = 50, epochs = 3)
test_loss, test_acc = model.evaluate(test_x, test_y)
# Predict a single test image ([[1495]] keeps the batch dimension)
test_img = test_x[[1495]]
predict = model.predict(test_img, verbose = 0)
mypred = np.argmax(predict, axis = 1)
plt.figure(figsize = (9, 4))
plt.subplot(1,2,1)
plt.imshow(test_img.reshape(28, 28), 'gray')
plt.axis('off')
plt.subplot(1,2,2)
plt.stem(predict[0])
plt.show()
print('Prediction : {}'.format(mypred[0]))
Having explored the implementation of a CNN using the MNIST dataset — a relatively simple and well-structured benchmark — we now turn our attention to a more realistic and practical engineering problem. In this case, we will consider the NEU Surface Defect Database (NEU), which contains images of steel surface defects commonly encountered in industrial applications.
NEU steel surface defects
To classify defect images into 6 classes
Download the NEU steel surface defect images and labels
from google.colab import drive
drive.mount('/content/drive')
# Change file paths if necessary
train_x = np.load('/content/drive/MyDrive/DL/DL_data/NEU_train_imgs.npy')
train_y = np.load('/content/drive/MyDrive/DL/DL_data/NEU_train_labels.npy')
test_x = np.load('/content/drive/MyDrive/DL/DL_data/NEU_test_imgs.npy')
test_y = np.load('/content/drive/MyDrive/DL/DL_data/NEU_test_labels.npy')
print(train_x.shape)
print(train_y.shape)
print(test_x.shape)
print(test_y.shape)
model = tf.keras.models.Sequential([
    # Block 1: (200, 200, 1) -> (200, 200, 32) -> (100, 100, 32)
    tf.keras.layers.Conv2D(filters = 32,
                           kernel_size = (3,3),
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (200, 200, 1)),
    tf.keras.layers.MaxPool2D((2,2)),

    # Block 2: (100, 100, 32) -> (100, 100, 64) -> (50, 50, 64)
    tf.keras.layers.Conv2D(filters = 64,
                           kernel_size = (3,3),
                           activation = 'relu',
                           padding = 'SAME'),
    tf.keras.layers.MaxPool2D((2,2)),

    # Block 3: (50, 50, 64) -> (50, 50, 128) -> (25, 25, 128)
    tf.keras.layers.Conv2D(filters = 128,
                           kernel_size = (3,3),
                           activation = 'relu',
                           padding = 'SAME'),
    tf.keras.layers.MaxPool2D((2,2)),

    # Classifier head: 6 defect classes
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units = 128, activation = 'relu'),
    tf.keras.layers.Dense(units = 6, activation = 'softmax')
])
model.compile(optimizer = 'adam',
              loss = 'sparse_categorical_crossentropy',
              metrics = ['accuracy'])

model.fit(train_x, train_y, batch_size = 50, epochs = 10)
test_loss, test_acc = model.evaluate(test_x, test_y)
name = ['scratches', 'rolled-in scale', 'pitted surface', 'patches', 'inclusion', 'crazing']

# Pick one random test image and compare the prediction with the true label
idx = np.random.choice(test_x.shape[0], 1)
test_img = test_x[idx]
test_label = test_y[idx]
predict = model.predict(test_img, verbose = 0)
mypred = np.argmax(predict, axis = 1)
plt.figure(figsize = (9, 4))
plt.subplot(1,2,1)
plt.imshow(test_img.reshape(200, 200), 'gray')
plt.axis('off')
plt.subplot(1,2,2)
plt.stem(predict[0])
plt.show()
print('Prediction : {}'.format(name[mypred[0]]))
print('True Label : {}'.format(name[test_label[0]]))
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')