Convolutional Neural Networks (CNN)


By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

Table of Contents



1. Convolution

In this section, we will explore Convolutional Neural Networks (CNNs); however, before diving into the details, it is essential to first examine the concept of the convolution operation.


In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('xSuFInvLjBo', width = "560", height = "315")
Out[ ]:

1.1. Convolution in Signal Processing

Convolution is a fundamental mathematical operation in signal processing, widely used in applications such as filtering, audio processing, image processing, and system analysis. The operation involves integrating the product of two functions, where one function is flipped and shifted relative to the other. Convolution can be interpreted as a process that combines two signals to produce a third signal, characterizing how the shape of one signal is altered by the influence of another.


Mathematical Formulation

  • Continuous-Time Convolution

    For two continuous functions $f(t)$ (input signal) and $g(t)$ (impulse response), the convolution is defined as:


    $$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau) g(t - \tau) d\tau$$
  • Discrete-Time Convolution

    For discrete signals $f[n]$ and $g[n]$, the convolution is expressed as:


    $$(f * g)[n] = \sum_{k=-\infty}^{\infty} f[k] g[n - k]$$

    Here, $g[n - k]$ represents the flipped and shifted version of $g$, and the summation computes the weighted sum of the overlapping values.


Concept of Convolution in Signal Processing

  • Input Signal: The function being processed (e.g., an image, an audio waveform).

  • Impulse Response (Kernel/Filter): The function that defines how the system modifies the input. Common kernels include smoothing filters, edge detection filters, and band-pass filters.

  • Convolution Output: The result after applying the filter to the input, showing how the original signal is transformed.


Visual Intuition

The convolution operation can be understood as a sliding window process, where one function (usually the filter or kernel) is flipped and shifted across the input. At each position, the product of the overlapping values is summed to compute the output value at that point.

For example:

  • In 1D audio signals, the sliding window represents the flipped and shifted impulse response passing over the time-series input.

  • In 2D image convolutions, the kernel slides over different regions of the image, applying the same process of element-wise multiplication and summation for each spatial location.


1.2. Convolution and Cross-Correlation

Convolution and cross-correlation are two similar yet distinct operations commonly used in signal processing and deep learning. Both operations involve a sliding window (filter/kernel) applied to an input signal, but they differ in how the filter is applied.


Mathematical Formulation

  • Continuous-Time Cross-Correlation

    The continuous-time cross-correlation of two signals $f(t)$ and $g(t)$ is defined as:


    $$(f \star g)(t) = \int_{-\infty}^{\infty} f(\tau) g(t + \tau) d\tau$$

    Here, $g(t + \tau)$ represents the shifted version of $g(t)$. Unlike convolution, the function $g(t)$ is not flipped before integration.


  • Discrete-Time Cross-Correlation

    The discrete 1D cross-correlation of $f[n]$ and $g[n]$ is defined as:


    $$(f \star g)[n] = \sum_{k=-\infty}^{\infty} f[k] g[n + k]$$

    The kernel $g[n]$ is not flipped; instead, it is directly shifted across the input signal.


Visual Intuition

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('vrl1YlCvyQo', width = "560", height = "315")
Out[ ]:

Intuition Behind the Operations

Convolution Intuition:

  • The signal $f[n]$ is "processed" by the kernel $g[n]$, which acts as a function that modifies or filters the input.

  • Flipping the kernel is what makes convolution model how a linear time-invariant system responds to its input; for a causal system, the output then depends only on present and past inputs, not future values.

Cross-Correlation Intuition:

  • Cross-correlation measures how similar the input signal is to the kernel at different positions.

  • The larger the output value at a given shift, the more similar the signal segment is to the kernel.


Why the Difference Does Not Matter in Deep Learning (CNNs)

In practice, many frameworks (e.g., TensorFlow, PyTorch) use cross-correlation instead of convolution for convolutional neural networks (CNNs) because flipping the kernel does not significantly affect feature learning. The goal in CNNs is to match patterns rather than strictly respect the mathematical definition of convolution used in classical signal processing.
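
A quick NumPy check (with arbitrary signal and kernel values) makes the relationship concrete: convolving with a kernel gives exactly the cross-correlation with the flipped kernel, so a network that learns an unflipped kernel is equivalent to one that learned its mirror image.

In [ ]:
import numpy as np

f = np.array([2, 1, -1, 3, 0])   # arbitrary input signal
g = np.array([1, 0, -2])         # arbitrary kernel

conv = np.convolve(f, g, mode = 'valid')                # convolution: kernel is flipped
xcorr = np.correlate(f, g, mode = 'valid')              # cross-correlation: kernel is not flipped
xcorr_flip = np.correlate(f, g[::-1], mode = 'valid')   # cross-correlation with the flipped kernel

print(conv)         # [-5  1  2]
print(xcorr)        # [ 4 -5 -1]  (different in general)
print(xcorr_flip)   # [-5  1  2]  (identical to the convolution)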


1.3. 1D Convolution

Let's compute the 1D convolution process step-by-step.




  • Kernel $[1, 3, 0, -1]$ at Position 1:

    The kernel overlaps with the first four values of the input signal: $[1, 3, 2, 3]$.


    • Element-wise multiplication:

    $$1 \times 1 + 3 \times 3 + 0 \times 2 + (-1) \times 3 = 1 + 9 + 0 - 3 = 7 $$
    • The first output value is 7.

  • Kernel $[1, 3, 0, -1]$ Shifted by 1 Position:

    The kernel now overlaps with the next segment of the input: $[3, 2, 3, 0]$.


    • Element-wise multiplication:

    $$ 1 \times 3 + 3 \times 2 + 0 \times 3 + (-1) \times 0 = 3 + 6 + 0 + 0 = 9$$
    • The second output value is 9.

    This process repeats until the kernel has slid across the entire input signal; the short NumPy sketch below reproduces these two steps.
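
The input below is assumed to contain only the five values visible in the two windows above; `np.correlate` performs the sliding-window sum of products without flipping the kernel, exactly as in the worked example.

In [ ]:
import numpy as np

f = np.array([1, 3, 2, 3, 0])    # input signal (only the values visible above; assumed)
g = np.array([1, 3, 0, -1])      # kernel

# sliding-window sum of products, kernel not flipped (as in the example above)
print(np.correlate(f, g, mode = 'valid'))   # [7 9]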



1.4. Convolution on Image (= Convolution in 2D)

Convolution on Image (2D Convolution) refers to the process of applying a 2D filter (kernel) to an image to extract features, such as edges, textures, or patterns, by sliding the filter over the image. Here's a step-by-step explanation:


Key Components of 2D Convolution:

  • Input Image (Matrix):

    • A 2D array of pixel values (grayscale) or 3D (for RGB images).

    • Example: A 5$\times$5 matrix with values representing pixel intensities.

  • Filter/Kernel:

    • A smaller matrix (e.g., 3$\times$3) with predefined or learned values.

    • The values are used to multiply corresponding input pixels during the convolution operation.

  • Sliding Operation:

    • The filter slides across the image matrix (from left to right and top to bottom) based on a defined stride.

    • At each position, the dot product of the overlapping region and the kernel is computed and saved in the output feature map.


Mathematical Representation

If $I$ is the input image and $K$ is the kernel:


$$O(i, j) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} I(i+m, j+n) \times K(m, n)$$

Where:

  • $O(i, j)$ is the value of the output feature map at position $(i, j)$.
  • $M \times N$ is the size of the kernel.
  • $I(i+m, j+n)$ represents the pixel in the input image at position $(i+m, j+n)$.
  • $K(m, n)$ is the kernel value at $(m, n)$.
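
The formula above translates almost line for line into NumPy. The sketch below is a minimal, unoptimized single-channel implementation (stride 1, no padding); as in the formula, the kernel is not flipped, which matches the cross-correlation convention used in CNNs. The image and kernel values are arbitrary illustration data.

In [ ]:
import numpy as np

def conv2d(I, K):
    # Direct implementation of O(i, j) = sum_m sum_n I(i+m, j+n) * K(m, n)
    M, N = K.shape
    H, W = I.shape
    O = np.zeros((H - M + 1, W - N + 1))    # "valid" output size, stride 1
    for i in range(O.shape[0]):
        for j in range(O.shape[1]):
            # element-wise multiplication of the kernel with an image patch, then sum
            O[i, j] = np.sum(I[i:i+M, j:j+N]*K)
    return O

I = np.arange(25, dtype = float).reshape(5, 5)   # arbitrary 5x5 "image"
K = np.array([[1., 0., -1.],
              [1., 0., -1.],
              [1., 0., -1.]])                    # arbitrary 3x3 kernel

print(conv2d(I, K))    # 3x3 output feature map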



Kernel (or Filter)

A kernel (also known as a filter) is a small, fixed-size matrix that slides over the input image during the convolution operation. The kernel is responsible for detecting specific features such as edges, patterns, and textures.

  • Modify or enhance an image by filtering

  • Filter images to emphasize certain features or remove other features

  • Filtering includes smoothing, sharpening and edge enhancement

  • At each position, discrete convolution can be viewed as an element-wise multiplication of the kernel with an image patch, followed by a sum

  • Common sizes: 3$\times$3, 5$\times$5, or 7$\times$7


Kernels are used to extract various features (a short code illustration follows this list), such as:

  • Edge detection (using Sobel or Prewitt filters).

  • Blurring (using Gaussian filters).

  • Sharpening (using Laplacian filters).
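
As a rough illustration (a sketch, not a full image-processing pipeline), the standard 3$\times$3 Sobel kernel responds strongly wherever the image intensity changes horizontally, i.e., at vertical edges. `scipy.signal.correlate2d` applies the kernel without flipping, as CNNs do; the input image below is a made-up step edge.

In [ ]:
import numpy as np
from scipy import signal

# Standard Sobel kernel (horizontal gradient -> detects vertical edges)
sobel_x = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])

# Made-up 6x6 image with a vertical step edge between columns 2 and 3
img = np.zeros((6, 6))
img[:, 3:] = 1.0

print(signal.correlate2d(img, sobel_x, mode = 'valid'))   # strong response along the edge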


How to Find the Right Kernels

We have explored the use of various predefined kernels that produce specific effects on images. Now, let us consider the opposite approach — instead of manually designing the kernels, the network will learn the kernels directly from the data during training. This approach enables the convolution to automatically discover and refine feature extractors that are most relevant for the task at hand, such as detecting edges, textures, or more complex patterns. By learning the kernels from data, the network adapts to the unique characteristics of the dataset, leading to more robust and effective feature extraction.



2. Convolutional Neural Networks (CNN)

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('u03QN8lJsDg', width = "560", height = "315")
Out[ ]:

2.1. Motivation: Learning Visual Features

Image Classification is a computer vision task that involves assigning a label or category to an entire image based on its visual content. The goal is to analyze the image as a whole and predict the class to which it belongs from a predefined set of categories (e.g., identifying "bird" in an image).

  • The goal is to assign a single label (such as "bird") to the entire image




Since we've primarily learned about Artificial Neural Networks (ANNs) so far, let's first see how ANNs can be applied to image classification by flattening the 2D image into a 1D vector.




  • ANN structure for image classification

    • does not seem to be the best choice
    • does not make use of the fact that we are dealing with images
    • the spatial organization of the input is destroyed by flattening
      • Input Image Flattened: The 2D image is flattened into a 1D vector.
      • Problem: Flattening loses all spatial relationships (e.g., how pixels are grouped into shapes).
      • Result: The network struggles to learn meaningful image features.

This observation motivates the development of alternative neural network architectures designed to effectively handle the unique 2D structure of image data. Specifically, this challenge highlights the intersection between the limitations of traditional ANNs and the advantages offered by the convolution operation discussed earlier. By preserving spatial hierarchies and local connectivity, convolutional operations enable networks to learn meaningful features such as edges, textures, and shapes directly from the image data, paving the way for more robust and efficient image classification and recognition models.


2.2. Convolutional Operator


Local Receptive Fields:

  • Objects in images typically exhibit local spatial support, meaning that an object of interest is confined to specific, localized regions of the image.

  • This motivates the transition from fully connected layers — which treat all input pixels as equally relevant — to locally and convolutionally connected layers, where connections are restricted to a neighborhood of pixels.

  • By leveraging this local connectivity, convolutional layers can efficiently capture spatially coherent patterns, thereby preserving the local structure of the input image and reducing the number of parameters compared to fully connected layers.




Translation invariance

  • The green and purple units represent distinct sets of convolutional filters, each designed to detect specific features. These filters focus on specific locations within the image to identify objects within those locations.
  • However, the appearance of an object is independent of its position within the image. Consequently, a single convolutional filter slides across the entire image, ensuring that the same feature is detected regardless of its location.
  • This sliding mechanism, inherent to the convolution operation, provides the network with translation invariance, enabling it to recognize objects even when they appear in different regions of the image.



Kernel Learning

At this point, it is reasonable to conclude that the convolution operation is well-suited for image classification due to its capacity to preserve spatial structure and capture meaningful features across the input image. Naturally, the next step is to focus on learning the optimal weights within the kernel.

Of course, the kernel is not manually designed; rather, it is learned from the data during the training process. Specifically, the network is trained to act as a visual feature extractor by identifying and refining patterns such as edges, textures, and shapes that are relevant for the classification task. This data-driven approach allows the network to adaptively discover the most salient features, thereby enhancing its ability to generalize to new, unseen inputs.


Why CNNs Work for Bird Classification:

  • Spatial Awareness: Unlike ANNs, CNNs maintain the spatial arrangement of the image.

    • Preserves 2D Structure: The convolutional layers maintain the spatial arrangement of pixels.

    • Feature Learning: Kernels (filters) in the convolutional layers learn to detect local features (like edges, corners, textures).

  • Translation Invariance: Pooling layers make CNNs robust to small shifts in the input image (e.g., the bird appearing in different positions).

  • Shared Weights: Kernels (filters) are shared across the image, drastically reducing the number of parameters compared to ANNs.

  • Hierarchical Learning: Early layers learn simple features (e.g., edges), while deeper layers learn complex features (e.g., bird wings, beak, etc.).


Note on Convolution and Cross-Correlation:

In Convolutional Neural Networks (CNNs), cross-correlation is often used instead of convolution for computational convenience. In many frameworks (e.g., TensorFlow, PyTorch), the "convolution" operation is technically implemented as cross-correlation — meaning the kernel is shifted but not flipped. However, this distinction generally does not affect the performance of CNNs, as the model can learn filters that behave equivalently to flipped kernels.


Explanation in 2D (Images)

  • Flipping a 2D kernel is equivalent to reflecting the kernel along both its vertical and horizontal axes.

This is similar to viewing the kernel through a mirror placed both vertically and horizontally.

  • The flipping process transforms the kernel into its mirrored form, but this transformation has minimal impact in deep learning, because the CNN simply learns kernels in whatever orientation is useful (a one-line NumPy illustration follows).
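
For concreteness, flipping a 2D kernel along both axes is a single NumPy call (the kernel values below are arbitrary):

In [ ]:
import numpy as np

K = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])      # arbitrary kernel, for illustration only

# flip along both the vertical and horizontal axes (a 180-degree rotation)
print(np.flip(K))
# [[9 8 7]
#  [6 5 4]
#  [3 2 1]]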

More on Convolution of CNN

  • Multiple channels

    • Multiple channels in convolution refer to handling input data that contains multiple feature channels (e.g., RGB images have three channels corresponding to Red, Green, and Blue).

    • The convolution operation is extended to process all of these channels simultaneously, extracting meaningful features across all input channels.


  • Multiple kernels
    • Multiple kernels (filters) are used to extract different types of features from the input image.
    • Each kernel learns to detect a specific pattern, such as edges, textures, corners, or shapes, and the combination of these kernels allows the network to construct a richer and more comprehensive representation of the input image.
    • This feature diversity is key to the success of CNNs (see the short TensorFlow check below).
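
A minimal TensorFlow check (the input is random dummy data) shows how both ideas appear in a single `Conv2D` layer: each of the 64 kernels spans all 3 input channels, and each kernel produces one output feature map.

In [ ]:
import tensorflow as tf

layer = tf.keras.layers.Conv2D(filters = 64, kernel_size = (3,3))

x = tf.random.normal((1, 32, 32, 3))    # (batch, height, width, channels): dummy RGB input
y = layer(x)                            # calling the layer builds its weights

print(layer.get_weights()[0].shape)   # (3, 3, 3, 64): height, width, input channels, number of kernels
print(y.shape)                        # (1, 30, 30, 64): one feature map per kernel ('valid' padding)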


2.3. Stride and Padding

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('cr2fz5A2MXQ', width = "560", height = "315")
Out[ ]:

  • Stride: the step size of the convolution operator, i.e., the number of pixels the kernel moves at each step

  • Stride example with kernel size 3 $\times$ 3 and a stride of 1

  • Stride example with kernel size 3 $\times$ 3 and a stride of 2

  • Padding: artificially fill the borders of the image
    • Usually filled with 0s (zero-padding)
    • Useful to keep the spatial dimensions constant from layer to layer (see the check below)
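
In general, with input width $W$, kernel size $K$, padding $P$, and stride $S$, the output width is $\lfloor (W - K + 2P)/S \rfloor + 1$. The short TensorFlow check below (dummy input data) shows the two common padding modes with a stride of 2:

In [ ]:
import tensorflow as tf

x = tf.random.normal((1, 28, 28, 1))    # dummy input image

# 'valid': no padding, the output shrinks: floor((28 - 3)/2) + 1 = 13
y_valid = tf.keras.layers.Conv2D(8, (3,3), strides = 2, padding = 'valid')(x)

# 'same': zeros are added at the borders so that the output is ceil(28/2) = 14
y_same = tf.keras.layers.Conv2D(8, (3,3), strides = 2, padding = 'same')(x)

print(y_valid.shape)   # (1, 13, 13, 8)
print(y_same.shape)    # (1, 14, 14, 8)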


2.4. Nonlinear Activation Function

The convolution operation is inherently a linear operation, as it involves computing the weighted sum of input values (pixels) and kernel weights. However, real-world data and tasks like image classification involve complex and nonlinear patterns. Therefore, to make the network capable of learning these complex relationships, nonlinear activation functions are applied after the convolution operation.

The nonlinear activation function is applied individually to each pixel value in the feature map produced by the convolution operation. This pointwise application of nonlinearity is crucial for enabling the network to introduce complexity and flexibility in its learned representations.
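
For example, applying ReLU to a toy feature map (values made up) zeroes out the negative responses one element at a time while leaving positive values unchanged:

In [ ]:
import numpy as np
import tensorflow as tf

feature_map = np.array([[-2.0,  1.0, -0.5],
                        [ 3.0, -1.0,  0.0],
                        [ 0.5, -4.0,  2.0]])   # made-up convolution output

# ReLU is applied pointwise: max(0, x) for every entry of the feature map
print(tf.nn.relu(feature_map).numpy())
# the first row becomes [0., 1., 0.], and so on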



2.5. Pooling

Pooling is a crucial operation in Convolutional Neural Networks (CNNs) used to downsample feature maps by summarizing the presence of features in small spatial regions. Pooling layers reduce the spatial dimensions (height and width) of the feature maps while retaining the most important information. This helps make the network more efficient and robust to translations and small distortions in the input image.

Purpose of Pooling

  • Reduction in Spatial Dimensions: The spatial size (width and height) of the feature map is reduced.

  • Retention of Key Features: Max pooling retains the most prominent features, while average pooling retains a smooth summary of the region.

  • Overfitting Prevention: By downsampling, pooling reduces the model's complexity, making it less prone to overfitting.

Types of Pooling Operations

  • Max Pooling: Frequently used for downsampling in CNNs due to its ability to retain strong activations.

  • Average Pooling: Used in tasks requiring smoothed outputs, such as image compression.

  • Global Pooling: Used in modern CNN architectures (like ResNet) to reduce the feature map to a single value per channel before fully connected layers.

Max Pooling

  • Compute a maximum value in a sliding window
    • Reduce spatial resolution for faster computation
    • It is robust to small shifts in the input, meaning that minor translations or displacements in the input feature map do not significantly affect the output.
    • i.e., the pooling window may still capture the same maximum value, which means the output remains unchanged.


  • Pooling size: $2\times2$, for example (see the NumPy sketch below)
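
A minimal NumPy sketch of $2\times2$ max pooling with stride 2 (the feature map values are made up):

In [ ]:
import numpy as np

fm = np.array([[1, 3, 2, 1],
               [4, 6, 5, 0],
               [7, 2, 9, 8],
               [3, 1, 4, 2]])     # made-up 4x4 feature map

# take the maximum over each non-overlapping 2x2 block
pooled = fm.reshape(2, 2, 2, 2).max(axis = (1, 3))
print(pooled)
# [[6 5]
#  [7 9]]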

2.6. CNN for Classification


Why Convolution is Important in CNNs

  • Feature Extraction: Identifies spatial patterns (like edges or shapes) from the input data.
  • Parameter Sharing: Instead of having a weight for each input pixel, the same filter is applied across the input, reducing the number of parameters and improving efficiency.
  • Translation Invariance: The same filter slides across the image, allowing the network to recognize features regardless of their location.

However, convolution is not the only component of a CNN. The nonlinear activation function and pooling must be combined to form a Convolutional Layer Block.


Convolutional Layer Block

A Convolutional Layer Block is the fundamental unit in Convolutional Neural Networks (CNNs) that processes input data to extract hierarchical features from images. It typically consists of three main components: convolution, nonlinear activation function, and pooling. These components work together to learn meaningful features while preserving spatial structure and reducing the dimensions of the data.

  • Convolution Operation:

    • Applies a set of kernels (filters) to the input.
    • Extracts local features by computing weighted sums of pixel values.
    • Produces multiple feature maps, each corresponding to a different learned feature.
  • Nonlinear Activation Function:

    • Applies a pointwise nonlinear activation function (e.g., ReLU) to each feature map.
    • Introduces nonlinearity, enabling the network to learn complex, nonlinear patterns.
  • Pooling Operation:

    • Performs downsampling by summarizing each region of the feature map (e.g., max pooling).
    • Reduces the spatial dimensions (height and width) while retaining important features.
    • Helps prevent overfitting and introduces translation invariance.

CNN for Classification

  • Feature Learning (Sequence of Convolutional Layer Blocks)
    • Multiple convolution + activation + pooling blocks are stacked, enabling the network to learn hierarchical features

  • Classification (Fully Connected Layers)

    • Flattening:

      • The output of the final convolutional layer (a 3D feature map) is flattened into a 1D vector.
    • Fully Connected Layers:

      • The flattened vector is passed through dense (fully connected) layers that combine the learned features to make a prediction.
    • Softmax Output:

      • The final fully connected layer outputs probabilities for each class
      • The class with the highest probability is selected as the predicted label.




2.7. Historical Notes


Convolutional Neural Networks (CNNs) and Yann LeCun

Convolutional Neural Networks (CNNs) are a class of deep learning models designed specifically for processing data with a grid-like structure, such as images. They have revolutionized the fields of computer vision and pattern recognition by automatically learning spatial hierarchies of features from input data, without requiring handcrafted features.


Yann LeCun's Contributions to CNNs

Yann LeCun is widely recognized as the father of convolutional neural networks. His pioneering research laid the foundation for CNNs, demonstrating their potential in visual recognition tasks, long before deep learning became mainstream.


In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('FwFduRA_L6Q', width = "560", height = "315")
Out[ ]:

Impact of CNNs on Modern AI

Computer Vision:

  • CNNs became the backbone of modern computer vision tasks, including image classification (e.g., ImageNet), object detection, facial recognition, and medical imaging analysis.

Architectural Evolution:

  • LeCun's LeNet-5 architecture inspired the development of more sophisticated CNNs, such as AlexNet (2012), VGGNet, ResNet, and EfficientNet, each contributing to further advances in image recognition and deep learning research.

Recognition of Yann LeCun's Work:

  • Yann LeCun, along with Geoffrey Hinton and Yoshua Bengio, received the 2018 ACM A.M. Turing Award, often referred to as the "Nobel Prize of Computing," for their contributions to deep learning.

  • LeCun has remained a prominent voice in the AI community, advocating for advancements in neural networks, including his more recent work on self-supervised learning and energy-based models.


The Legacy of CNNs

Yann LeCun's work on CNNs changed the landscape of artificial intelligence by demonstrating that neural networks could learn to recognize patterns directly from raw data. The success of LeNet-5 proved that CNNs could generalize well across a variety of visual tasks, and this architecture remains the foundation of many deep learning models today. Without LeCun's contributions, the rapid progress in computer vision, autonomous vehicles, and AI-driven image processing would not have been possible.


3. Lab: CNN with TensorFlow (MNIST)

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('cr2fz5A2MXQ?si=df-ux84Q096QpMIc&start=726', width = "560", height = "315")
Out[ ]:

  • MNIST without flattening
  • To classify handwritten digits


Network model

  • The hyperparameters of the network have not been optimized.
  • This example is intended solely to demonstrate how to implement a CNN in Python based on the provided structure.

3.1. Training

In [ ]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
In [ ]:
mnist = tf.keras.datasets.mnist

(train_x, train_y), (test_x, test_y) = mnist.load_data()

train_x, test_x = train_x/255.0, test_x/255.0
In [ ]:
train_x = train_x.reshape((train_x.shape[0], 28, 28, 1))
test_x = test_x.reshape((test_x.shape[0], 28, 28, 1))
In [ ]:
model = tf.keras.models.Sequential([
    # Convolutional block 1: 32 kernels of size 3x3; 'SAME' padding keeps 28x28
    tf.keras.layers.Conv2D(filters = 32,
                           kernel_size = (3,3),
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (28, 28, 1)),

    # 2x2 max pooling halves the spatial dimensions: 28x28 -> 14x14
    tf.keras.layers.MaxPool2D((2,2)),

    # Convolutional block 2: 64 kernels of size 3x3
    # (input_shape here only documents the expected shape; Keras requires it for the first layer only)
    tf.keras.layers.Conv2D(filters = 64,
                           kernel_size = (3,3),
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (14, 14, 32)),

    # 14x14 -> 7x7
    tf.keras.layers.MaxPool2D((2,2)),

    # Flatten the 7x7x64 feature maps into a 1D vector for the dense layers
    tf.keras.layers.Flatten(),

    tf.keras.layers.Dense(units = 128, activation = 'relu'),

    # Softmax output: one probability per digit class (0-9)
    tf.keras.layers.Dense(units = 10, activation = 'softmax')
])
In [ ]:
model.compile(optimizer = 'adam',
              loss = 'sparse_categorical_crossentropy',
              metrics = ['accuracy'])
In [ ]:
model.fit(train_x, train_y, batch_size = 50, epochs = 3)
Epoch 1/3
1200/1200 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.9059 - loss: 0.3076
Epoch 2/3
1200/1200 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.9866 - loss: 0.0444
Epoch 3/3
1200/1200 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.9920 - loss: 0.0264
Out[ ]:
<keras.src.callbacks.history.History at 0x7932f7fbc7c0>

3.2. Testing or Evaluating

In [ ]:
test_loss, test_acc = model.evaluate(test_x, test_y)
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9837 - loss: 0.0555
In [ ]:
test_img = test_x[[1495]]

predict = model.predict(test_img, verbose = 0)
mypred = np.argmax(predict, axis = 1)

plt.figure(figsize = (9, 4))

plt.subplot(1,2,1)
plt.imshow(test_img.reshape(28, 28), 'gray')
plt.axis('off')
plt.subplot(1,2,2)
plt.stem(predict[0])
plt.show()

print('Prediction : {}'.format(mypred[0]))
Prediction : 3

4. Lab: CNN with Tensorflow (Steel Surface Defects)


Having explored the implementation of a CNN using the MNIST dataset — a relatively simple and well-structured benchmark — we now turn our attention to a more realistic and practical engineering problem. In this case, we will consider the NEU Surface Defect Database (NEU), which contains images of steel surface defects commonly encountered in industrial applications.

  • NEU steel surface defects

  • To classify defect images into 6 classes




Download NEU steel surface defects images and labels


4.1. Training

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
# Change file paths if necessary

train_x = np.load('/content/drive/MyDrive/DL/DL_data/NEU_train_imgs.npy')
train_y = np.load('/content/drive/MyDrive/DL/DL_data/NEU_train_labels.npy')

test_x = np.load('/content/drive/MyDrive/DL/DL_data/NEU_test_imgs.npy')
test_y = np.load('/content/drive/MyDrive/DL/DL_data/NEU_test_labels.npy')
In [ ]:
print(train_x.shape)
print(train_y.shape)
(1500, 200, 200, 1)
(1500,)
In [ ]:
print(test_x.shape)
print(test_y.shape)
(300, 200, 200, 1)
(300,)
In [ ]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(filters = 32,
                           kernel_size = (3,3),
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (200, 200, 1)),

    tf.keras.layers.MaxPool2D((2,2)),

    tf.keras.layers.Conv2D(filters = 64,
                           kernel_size = (3,3),
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (100, 100, 32)),

    tf.keras.layers.MaxPool2D((2,2)),

    tf.keras.layers.Conv2D(filters = 128,
                           kernel_size = (3,3),
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (50, 50, 64)),

    tf.keras.layers.MaxPool2D((2,2)),

    tf.keras.layers.Flatten(),

    tf.keras.layers.Dense(units = 128, activation = 'relu'),

    tf.keras.layers.Dense(units = 6, activation = 'softmax')
])
In [ ]:
model.compile(optimizer = 'adam',
              loss = 'sparse_categorical_crossentropy',
              metrics = ['accuracy'])
In [ ]:
model.fit(train_x, train_y, batch_size = 50, epochs = 10)
Epoch 1/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 4s 58ms/step - accuracy: 0.2180 - loss: 1.7870
Epoch 2/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 2s 56ms/step - accuracy: 0.6771 - loss: 0.8989
Epoch 3/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 2s 56ms/step - accuracy: 0.8274 - loss: 0.4721
Epoch 4/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 2s 56ms/step - accuracy: 0.8898 - loss: 0.3322
Epoch 5/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 2s 57ms/step - accuracy: 0.8632 - loss: 0.3320
Epoch 6/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 3s 59ms/step - accuracy: 0.9026 - loss: 0.2565
Epoch 7/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 2s 56ms/step - accuracy: 0.9475 - loss: 0.1591
Epoch 8/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 2s 57ms/step - accuracy: 0.9362 - loss: 0.1698
Epoch 9/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 3s 56ms/step - accuracy: 0.9437 - loss: 0.1645
Epoch 10/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 3s 56ms/step - accuracy: 0.9547 - loss: 0.1317
Out[ ]:
<keras.src.callbacks.history.History at 0x7932f846d3c0>

4.2. Testing or Evaluating

In [ ]:
test_loss, test_acc = model.evaluate(test_x, test_y)
10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 63ms/step - accuracy: 0.8135 - loss: 0.4512
In [ ]:
name = ['scratches', 'rolled-in scale', 'pitted surface', 'patches', 'inclusion', 'crazing']

idx = np.random.choice(test_x.shape[0], 1)
test_img = test_x[idx]
test_label = test_y[idx]

predict = model.predict(test_img, verbose = 0)
mypred = np.argmax(predict, axis = 1)

plt.figure(figsize = (9, 4))
plt.subplot(1,2,1)
plt.imshow(test_img.reshape(200, 200), 'gray')
plt.axis('off')
plt.subplot(1,2,2)
plt.stem(predict[0])
plt.show()

print('Prediction : {}'.format(name[mypred[0]]))
print('True Label : {}'.format(name[test_label[0]]))
Prediction : pitted surface
True Label : pitted surface
In [ ]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')