Pre-trained Models and Transfer Learning
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST
Table of Contents
1. Pre-trained Models
A pre-trained model is a neural network that has been previously trained on a large dataset for a specific task, such as image recognition or language processing. Instead of training a model from scratch, pre-trained models allow practitioners to leverage pre-existing knowledge, making the training process faster and more efficient.
Advantages of Pre-Trained Models
Efficiency: Reduces the need for extensive computational resources.
Adaptability: Can be adapted to different but related tasks by fine-tuning.
Access to Large-Scale Learning: Utilizes knowledge from large-scale datasets that may be difficult for individual practitioners to curate.
Reduced Training Time: Since the model has already learned basic patterns from large datasets, only minimal adjustments (fine-tuning) may be needed for specific tasks.
from IPython.display import YouTubeVideo
YouTubeVideo('7JcSo0jCLdE?si=d530KtZ2bu7pNTxe&start=23', width = "560", height = "315")
1.1. ImageNet
ImageNet is a large-scale image dataset that has played a pivotal role in the development of modern computer vision and deep learning. Created by Fei-Fei Li, a professor of computer science at Stanford University and a leader in artificial intelligence, ImageNet serves as a comprehensive benchmark for image classification, object detection, and other vision-related tasks.
The Creation of ImageNet
Goal:
- Fei-Fei Li's vision for ImageNet was to create a dataset that could bridge the gap between machine learning algorithms and the human ability to understand and classify images at scale. She believed that the lack of large, labeled datasets was a major obstacle preventing AI systems from achieving human-level perception.
Development:
- The dataset was built by collecting images from the web and crowd-sourcing labels through Amazon Mechanical Turk, where workers annotated the images according to specific categories.
ImageNet Challenge (ILSVRC)
In 2010, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was launched to evaluate the performance of algorithms on large-scale image classification tasks. The ILSVRC became a benchmark for testing the accuracy and scalability of computer vision models.
Key Events in ImageNet's History
2012 - AlexNet Revolution:
- AlexNet, developed by Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever, used a deep convolutional neural network (CNN) to achieve a top-5 error rate of 15.3%, significantly outperforming traditional methods. This victory demonstrated the power of deep learning and popularized the use of GPUs for training deep networks.
Subsequent Breakthroughs:
The years following AlexNet saw the emergence of more sophisticated models, including VGGNet (2014), GoogLeNet (2014), and ResNet (2015), all of which were trained on ImageNet and set new performance records.
In 2015, ResNet (by Kaiming He et al.) achieved a top-5 error rate of 3.6%, surpassing human-level accuracy (the human top-5 error rate is about 5.1%).
1.2. Pre-trained CNN Models
LeNet
- CNN = Convolutional Neural Networks = ConvNet
- LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition.
- All are still the basic components of modern ConvNets!
AlexNet
Simplified version of Krizhevsky, Sutskever, and Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012
LeNet-style backbone, plus:
- ReLU [Nair & Hinton 2010]
- RevoLUtion of deep learning
- Accelerate training
- Dropout [Hinton et al 2012]
- In-network ensembling
- Reduce overfitting
- Data augmentation
- Label-preserving transformation
- Reduce overfitting
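As a rough illustration (the layer sizes, filter counts, and dropout rate below are assumed values, not AlexNet's actual configuration), here is how these three ingredients appear as Keras layers:

import tensorflow as tf

# Toy sketch of AlexNet's three key ingredients; all sizes are illustrative
inputs = tf.keras.Input(shape = (224, 224, 3))

# label-preserving data augmentation (active only during training)
x = tf.keras.layers.RandomFlip("horizontal")(inputs)

# convolution with ReLU activation to accelerate training
x = tf.keras.layers.Conv2D(filters = 64,
                           kernel_size = (3, 3),
                           activation = 'relu')(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)

# dropout as in-network ensembling to reduce overfitting
x = tf.keras.layers.Dropout(rate = 0.5)(x)
outputs = tf.keras.layers.Dense(units = 1000, activation = 'softmax')(x)

toy_model = tf.keras.Model(inputs, outputs)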
VGG-16/19
Simonyan, Karen, and Zisserman. "Very deep convolutional networks for large-scale image recognition." (2014)
Simply "Very Deep"!
- Modularized design
- 3x3 Conv as the module
- Stack the same module
- Same computation for each module
- Stage-wise training
- VGG-11 → VGG-13 → VGG-16
- We need a better initialization...
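A minimal sketch of this modularized design (the filter counts are illustrative): one stage is simply the same 3x3 convolution module stacked several times, followed by pooling.

import tensorflow as tf

def vgg_stage(x, filters, n_convs):
    # stack the same 3x3 conv module n_convs times
    for _ in range(n_convs):
        x = tf.keras.layers.Conv2D(filters = filters,
                                   kernel_size = (3, 3),
                                   padding = "SAME",
                                   activation = 'relu')(x)
    # halve the spatial resolution between stages
    return tf.keras.layers.MaxPool2D(pool_size = (2, 2), strides = 2)(x)

inputs = tf.keras.Input(shape = (224, 224, 3))
x = vgg_stage(inputs, filters = 64, n_convs = 2)    # deeper stages reuse the same module
x = vgg_stage(x, filters = 128, n_convs = 2)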
GoogLeNet/Inception
- Multiple branches
- e.g., 1x1, 3x3, 5x5, pool
- Shortcuts
- stand-alone 1x1, merged by concat.
- Bottleneck
- Reduce dim by 1x1 before expensive 3x3/5x5 conv
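A sketch of one Inception-style block (the filter counts are illustrative assumptions): parallel 1x1/3x3/5x5/pooling branches, with 1x1 bottlenecks placed before the expensive convolutions, merged by concatenation.

import tensorflow as tf

def inception_block(x):
    # branch 1: stand-alone 1x1 convolution
    b1 = tf.keras.layers.Conv2D(32, (1, 1), padding = "SAME", activation = 'relu')(x)
    # branch 2: 1x1 bottleneck to reduce channels before the expensive 3x3 conv
    b2 = tf.keras.layers.Conv2D(16, (1, 1), padding = "SAME", activation = 'relu')(x)
    b2 = tf.keras.layers.Conv2D(32, (3, 3), padding = "SAME", activation = 'relu')(b2)
    # branch 3: 1x1 bottleneck before the 5x5 conv
    b3 = tf.keras.layers.Conv2D(8, (1, 1), padding = "SAME", activation = 'relu')(x)
    b3 = tf.keras.layers.Conv2D(16, (5, 5), padding = "SAME", activation = 'relu')(b3)
    # branch 4: pooling followed by a 1x1 projection
    b4 = tf.keras.layers.MaxPool2D((3, 3), strides = 1, padding = "SAME")(x)
    b4 = tf.keras.layers.Conv2D(16, (1, 1), padding = "SAME", activation = 'relu')(b4)
    # merge all branches by channel-wise concatenation
    return tf.keras.layers.Concatenate()([b1, b2, b3, b4])

inputs = tf.keras.Input(shape = (28, 28, 192))
merged = inception_block(inputs)    # 32 + 32 + 16 + 16 = 96 output channels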
ResNet
- He, Kaiming, et al. "Deep residual learning for image recognition." CVPR. 2016.
- Skip Connection and Residual Net
A direct connection between two non-consecutive layers
Mitigates the vanishing-gradient problem
Parameters are optimized to learn a residual, that is, the difference between the value before the block and the one needed after it.
A skip connection is a connection that bypasses at least one layer.
Here, it is often used to transfer local information by concatenating or summing feature maps from the downsampling path with feature maps from the upsampling path. Merging features from multiple resolution levels helps combine context information with spatial information.
import tensorflow as tf

n_hidden = 128    # example value; set to match your task
n_output = 10     # example value; number of output classes

def residual_net(x):
    # two 3x3 convolutions; "SAME" padding keeps the spatial size so the
    # skip connection below can be applied element-wise
    conv1 = tf.keras.layers.Conv2D(filters = 32,
                                   kernel_size = (3, 3),
                                   padding = "SAME",
                                   activation = 'relu')(x)
    conv2 = tf.keras.layers.Conv2D(filters = 32,
                                   kernel_size = (3, 3),
                                   padding = "SAME",
                                   activation = 'relu')(conv1)
    # skip connection: add the block's input to its output, so the two
    # convolutions only need to learn the residual
    maxp2 = tf.keras.layers.MaxPool2D(pool_size = (2, 2),
                                      strides = 2)(conv2 + x)
    flat = tf.keras.layers.Flatten()(maxp2)
    hidden = tf.keras.layers.Dense(units = n_hidden,
                                   activation = 'relu')(flat)
    output = tf.keras.layers.Dense(units = n_output)(hidden)
    return output
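Note that the element-wise sum conv2 + x requires the input x to have the same shape as conv2, including the 32 channels. When the dimensions differ, ResNet applies a 1x1 convolution on the shortcut path to project the input to the matching size.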
DenseNets
DenseNets (Densely Connected Convolutional Networks) are a type of convolutional neural network (CNN) architecture introduced by Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger in 2017. DenseNets were proposed to address issues related to vanishing gradients, feature reuse, and efficiency in deep learning models.
In DenseNets, each layer is connected to every other layer in a feed-forward manner. Instead of summing feature maps (as in ResNets), DenseNets concatenate the feature maps from previous layers and pass them to subsequent layers.
Traditional CNNs: Each layer passes information only to the next layer.
DenseNets: Each layer receives the concatenated outputs of all preceding layers as input, allowing for maximum feature reuse.
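A sketch of a dense block (the growth rate and depth are illustrative assumptions): each new layer's output is concatenated onto everything computed before it.

import tensorflow as tf

def dense_block(x, n_layers = 3, growth_rate = 12):
    for _ in range(n_layers):
        # each layer sees the concatenation of all preceding feature maps
        new_features = tf.keras.layers.Conv2D(filters = growth_rate,
                                              kernel_size = (3, 3),
                                              padding = "SAME",
                                              activation = 'relu')(x)
        # concatenate (rather than sum, as in ResNets) for maximum feature reuse
        x = tf.keras.layers.Concatenate()([x, new_features])
    return x

inputs = tf.keras.Input(shape = (32, 32, 24))
out = dense_block(inputs)    # channels grow 24 -> 36 -> 48 -> 60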
1.3. Load Pre-trained Models
List of Available Models
- VGG16
- VGG19
- ResNet
- GoogLeNet/Inception
- DenseNet
- MobileNet
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import cv2
%matplotlib inline
Model Selection
# model_type = tf.keras.applications.densenet
# model_type = tf.keras.applications.inception_resnet_v2
# model_type = tf.keras.applications.inception_v3
model_type = tf.keras.applications.mobilenet
# model_type = tf.keras.applications.mobilenet_v2
# model_type = tf.keras.applications.nasnet
# model_type = tf.keras.applications.resnet50
# model_type = tf.keras.applications.vgg16
# model_type = tf.keras.applications.vgg19
Model Summary
model = model_type.MobileNet() # change the constructor to match model_type above
model.summary()
from google.colab import drive
drive.mount('/content/drive')
# img = cv2.imread('/content/drive/MyDrive/DL/DL_data/ILSVRC2017_test_00000005.JPEG')
img = cv2.imread('/content/drive/MyDrive/DL/DL_data/ILSVRC2017_test_00005381.JPEG')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # OpenCV loads images as BGR; convert to RGB
print(img.shape)
plt.figure(figsize = (6, 6))
plt.imshow(img)
plt.axis('off')
plt.show()
Since the input size required for the pre-trained model is $224 \times 224 \times 3$, resizing should be performed as part of the preprocessing step. However, depending on the application, you might prefer cropping instead of resizing to preserve the original aspect ratio and avoid distortion.
resized_img = cv2.resize(img, (224, 224)).reshape(1, 224, 224, 3)
plt.figure(figsize = (6, 6))
plt.imshow(resized_img[0])
plt.axis('off')
plt.show()
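As a sketch of the cropping alternative mentioned above (assuming the image is at least 224 pixels on each side; cropped_img is an illustrative name — the cells below continue with the resized image):

# center crop to 224 x 224 instead of resizing; preserves the aspect
# ratio but discards the borders
h, w, _ = img.shape
top = (h - 224) // 2
left = (w - 224) // 2
cropped_img = img[top:top + 224, left:left + 224].reshape(1, 224, 224, 3)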
input_img = model_type.preprocess_input(resized_img)
pred = model.predict(input_img, verbose = 0)
label = model_type.decode_predictions(pred)[0]
print('\n')
# print the top-5 predicted classes and their probabilities
for _, class_name, prob in label:
    print('%s (%.2f%%)\n' % (class_name, prob * 100))
The pre-trained model correctly predicts the given input image as 'soccer_ball'. Note that the model has never seen this particular image: it comes from the ILSVRC test split, so the correct prediction reflects the model's ability to generalize beyond the examples it was trained on.
2. Transfer Learning
Transfer learning is a machine learning technique where a pre-trained model is reused as the starting point for a different but related task. Instead of training a model from scratch, we "transfer" the learned weights from a large dataset (e.g., ImageNet) and fine-tune them for a specific, typically smaller dataset.
To better understand transfer learning, imagine how students learn new topics in school. Students don't build knowledge entirely from scratch; they learn from teachers, who have already accumulated and organized knowledge over years of study and experience.
For example:
When a student learns physics, the teacher explains concepts like Newton's laws by building on foundational knowledge from mathematics, such as algebra and calculus.
Instead of deriving everything from first principles, the student benefits from the structured, pre-digested knowledge provided by the teacher, making their own learning process faster and more efficient.
Why Use Transfer Learning?
Improved Performance: Pre-trained models already capture general features (like edges, shapes), which improves performance on related tasks.
Faster Training: Since most of the model's parameters are already optimized, training time is reduced.
Less Data Needed: Transfer learning works well with small datasets, as the pre-trained model already has useful feature representations.
Common Approaches to Transfer Learning
Feature Extraction (Frozen Base):
The pre-trained model's weights are frozen (kept constant), and only the final layers are retrained for the new task.
Useful when the new dataset is small or similar to the original dataset.
Fine-Tuning (Updating Weights):
The entire model (or a portion of it) is re-trained, adjusting weights to better fit the new dataset.
The weights from the pre-trained model are used as the initial values instead of random initialization, providing a "head start" for training by leveraging previously learned features.
Useful when the new dataset is significantly different from the pre-training dataset.
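In Keras, the difference between the two approaches comes down to the trainable flag. A minimal sketch, assuming a VGG16 base and a 5-class target task:

import tensorflow as tf

base = tf.keras.applications.vgg16.VGG16(include_top = False,
                                         input_shape = (224, 224, 3))

# approach 1 - feature extraction: freeze every pre-trained layer
base.trainable = False

# approach 2 - fine-tuning: leave the base trainable (or unfreeze only its
# top layers) and train with a small learning rate
# base.trainable = True

x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
output = tf.keras.layers.Dense(units = 5, activation = 'softmax')(x)
model = tf.keras.Model(inputs = base.input, outputs = output)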
from IPython.display import YouTubeVideo
YouTubeVideo('7JcSo0jCLdE?si=7IuLwj5L5lxk6lxI&start=2003', width = "560", height = "315")
2.1. Pre-trained Model (VGG16)
Training a model on ImageNet from scratch takes days or weeks.
Many models trained on ImageNet and their weights are publicly available!
Transfer learning
- Use pre-trained weights, remove last layers to compute representations of images
- The network is used as a generic feature extractor
- Train a classification model from these features on a new classification task
- Pre-trained models can extract general image features that help identify edges, textures, shapes, and object composition
- Better than handcrafted feature extraction on natural images
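For instance (a sketch with placeholder data; the variable names are illustrative), the frozen network can precompute fixed-length feature vectors on which a small classifier is then trained:

import numpy as np
import tensorflow as tf

# pre-trained VGG16 without its classification head, used as a fixed feature extractor
feature_extractor = tf.keras.applications.vgg16.VGG16(include_top = False,
                                                      pooling = 'avg')
feature_extractor.trainable = False

# placeholder batch of preprocessed 224 x 224 x 3 images
images = np.random.rand(4, 224, 224, 3).astype(np.float32)
features = feature_extractor.predict(images, verbose = 0)    # shape: (4, 512)

# a small classifier can now be trained on these 512-dimensional representations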
Import Library
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
Load Data
Download data files
- tranfer_learning_train_images
- tranfer_learning_train_labels
- tranfer_learning_test_images
- tranfer_learning_test_labels
from google.colab import drive
drive.mount('/content/drive')
# Change file paths if necessary
train_imgs = np.load('/content/drive/MyDrive/DL/DL_data/tranfer_learning_train_images.npy')
train_labels = np.load('/content/drive/MyDrive/DL/DL_data/tranfer_learning_train_labels.npy')
test_imgs = np.load('/content/drive/MyDrive/DL/DL_data/tranfer_learning_test_images.npy')
test_labels = np.load('/content/drive/MyDrive/DL/DL_data/tranfer_learning_test_labels.npy')
print(train_imgs.shape)
print(train_labels[0]) # one-hot-encoded 5 classes
# remove one-hot-encoding
train_labels = np.argmax(train_labels, axis = 1)
test_labels = np.argmax(test_labels, axis = 1)
n_train = train_imgs.shape[0]
n_test = test_imgs.shape[0]
# very small dataset
print(n_train)
print(n_test)
Target dataset
- The training dataset consists of only 65 images, with an additional 9 images for testing. This sample size is clearly insufficient for effectively training deep learning models, which typically require a substantial amount of data to achieve robust performance and generalization.
Before applying transfer learning, let's first observe what the pre-trained model, without any training on the target data, predicts for these images.
Dict = ['Hat', 'Cube', 'Card', 'Torch', 'Screw']

# plot a few training examples with their labels
sample_idx = [1, 2, 3, 18, 25]

plt.figure(figsize = (8, 6))
for i, idx in enumerate(sample_idx):
    plt.subplot(2, 3, i + 1)
    plt.imshow(train_imgs[idx])
    plt.title("Label: {}".format(Dict[train_labels[idx]]))
    plt.axis('off')
plt.show()
Load VGG16 Model
'base_model.trainable = False' ensures that all parameters (weights and biases) in the pre-trained base model are 'frozen' and will no longer be updated during training. This means that the layers of the base model will retain their pre-trained values and will not be modified by backpropagation.
model_type = tf.keras.applications.vgg16
base_model = model_type.VGG16()
base_model.trainable = False
print('\n')
base_model.summary()
Testing for Target Data
idx = 1
pred = base_model.predict(test_imgs[idx].reshape(-1, 224, 224, 3), verbose = 0)
label = model_type.decode_predictions(pred)[0]
print('\n')
# print the top-5 predicted classes and their probabilities
for _, class_name, prob in label:
    print('%s (%.2f%%)' % (class_name, prob * 100))
print('\n')
plt.figure(figsize = (4, 4))
plt.imshow(test_imgs[idx])
plt.title("Label : {}".format(Dict[test_labels[idx]]))
plt.axis('off')
plt.show()
All five predictions differ entirely from the ground-truth label, 'Hat'. Moreover, the associated probabilities are low, indicating that the model is not confident in any of these predictions.
2.2. Transfer Learning
- We assume that the pre-trained parameters contain the knowledge learned from the source dataset and that this knowledge will be equally applicable to the target dataset.
- We will train the output layer from scratch, while the parameters of all remaining layers are frozen at (or fine-tuned from) the source model's values.
- Alternatively, initialize all weights from the pre-trained model and then train all of them on the target data (a sketch of this variant appears after the training cell below).
Pre-trained Weights, Biases
vgg16_weights = base_model.get_weights()
Build a Transfer Learning Model
# replace the classifier head with a new, trainable 5-class layer
fc2_layer = base_model.layers[-2].output
output = tf.keras.layers.Dense(units = 5, activation = 'softmax')(fc2_layer)
# define new model
TL_model = tf.keras.Model(inputs = base_model.inputs, outputs = output)
TL_model.summary()
Out of the total model parameters, only 20,485 are trainable, while the remaining 134,260,544 parameters are frozen (non-trainable).
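The trainable count comes from the new softmax head alone: it maps the 4096-dimensional fc2 output to 5 classes, i.e., $4096 \times 5 + 5 = 20{,}485$ weights and biases.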
Define Loss and Optimizer
TL_model.compile(optimizer = 'adam',
loss = 'sparse_categorical_crossentropy',
metrics = ['accuracy'])
Optimize
TL_model.fit(train_imgs, train_labels, epochs = 10)
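To instead fine-tune all weights (the alternative strategy mentioned at the start of this section), one could unfreeze the base and continue training; a sketch, where the learning rate and epoch count are assumed values:

# optional fine-tuning variant: unfreeze the pre-trained layers and keep
# training with a small learning rate so the features are only gently adjusted
base_model.trainable = True
TL_model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 1e-5),
                 loss = 'sparse_categorical_crossentropy',
                 metrics = ['accuracy'])
TL_model.fit(train_imgs, train_labels, epochs = 5)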
Test and Evaluate
test_loss, test_acc = TL_model.evaluate(test_imgs, test_labels)
idx = np.random.randint(n_test)
test_x = test_imgs[idx].reshape(-1, 224, 224, 3)
pred = np.argmax(TL_model.predict(test_x, verbose = 0))
plt.figure(figsize = (4, 4))
plt.imshow(test_x.reshape(224, 224, 3))
plt.title("Label : {}".format(Dict[test_labels[idx]]))
plt.axis('off')
plt.show()
print('\nPrediction: {}'.format(Dict[pred]))
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')