Pre-trained Models and Transfer Learning
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST
Table of Contents
1. Pre-trained Models
A pre-trained model is a neural network that has been previously trained on a large dataset for a specific task, such as image recognition or language processing. Instead of training a model from scratch, pre-trained models allow practitioners to leverage pre-existing knowledge, making the training process faster and more efficient.
Advantages of Pre-Trained Models
Efficiency: Reduces the need for extensive computational resources.
Adaptability: Can be adapted to different but related tasks by fine-tuning.
Access to Large-Scale Learning: Utilizes knowledge from large-scale datasets that may be difficult for individual practitioners to curate.
Reduced Training Time: Since the model has already learned basic patterns from large datasets, only minimal adjustments (fine-tuning) may be needed for specific tasks.
from IPython.display import YouTubeVideo
YouTubeVideo('7JcSo0jCLdE?si=d530KtZ2bu7pNTxe&start=23', width = "560", height = "315")
1.1. ImageNet
ImageNet is a large-scale image dataset that has played a pivotal role in the development of modern computer vision and deep learning. Created by Fei-Fei Li, a professor of computer science at Stanford University and a leader in artificial intelligence, ImageNet serves as a comprehensive benchmark for image classification, object detection, and other vision-related tasks.
The Creation of ImageNet
Goal:
- Fei-Fei Li's vision for ImageNet was to create a dataset that could bridge the gap between machine learning algorithms and the human ability to understand and classify images at scale. She believed that the lack of large, labeled datasets was a major obstacle preventing AI systems from achieving human-level perception.
Development:
- The dataset was built by collecting images from the web and crowd-sourcing labels through Amazon Mechanical Turk, where workers annotated the images according to specific categories.
ImageNet Challenge (ILSVRC)
In 2010, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was launched to evaluate the performance of algorithms on large-scale image classification tasks. The ILSVRC became a benchmark for testing the accuracy and scalability of computer vision models.
Key Events in ImageNet's History
2012 - AlexNet Revolution:
- AlexNet, developed by Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever, used a deep convolutional neural network (CNN) to achieve a top-5 error rate of 15.3%, significantly outperforming traditional methods. This victory demonstrated the power of deep learning and popularized the use of GPUs for training deep networks.
Subsequent Breakthroughs:
The years following AlexNet saw the emergence of more sophisticated models, including VGGNet (2014), GoogLeNet (2014), and ResNet (2015), all of which were trained on ImageNet and set new performance records.
In 2015, ResNet (by Kaiming He et al.) achieved a top-5 error rate of 3.6%, surpassing human-level accuracy (the human top-5 error rate is about 5.1%).
1.2. Pre-trained CNN Models
LeNet
- CNN = Convolutional Neural Networks = ConvNet
- LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition.
- All are still the basic components of modern ConvNets!
AlexNet
Simplified version of Krizhevsky, Sutskever, and Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012
LeNet-style backbone, plus:
- ReLU [Nair & Hinton 2010]
- RevoLUtion of deep learning
- Accelerate training
- Dropout [Hinton et al 2012]
- In-network ensembling
- Reduce overfitting
- Data augmentation
- Label-preserving transformation
- Reduce overfitting
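As a rough illustration (the layer sizes, filter counts, and dropout rate below are assumed values, not AlexNet's actual configuration), here is how these three ingredients appear as Keras layers:

import tensorflow as tf

# Toy sketch of AlexNet's three key ingredients; all sizes are illustrative
inputs = tf.keras.Input(shape = (224, 224, 3))

# label-preserving data augmentation (active only during training)
x = tf.keras.layers.RandomFlip("horizontal")(inputs)

# convolution with ReLU activation to accelerate training
x = tf.keras.layers.Conv2D(filters = 64,
                           kernel_size = (3, 3),
                           activation = 'relu')(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)

# dropout as in-network ensembling to reduce overfitting
x = tf.keras.layers.Dropout(rate = 0.5)(x)
outputs = tf.keras.layers.Dense(units = 1000, activation = 'softmax')(x)

toy_model = tf.keras.Model(inputs, outputs)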
VGG-16/19
Simonyan, Karen, and Zisserman. "Very deep convolutional networks for large-scale image recognition." (2014)
Simply "Very Deep"!
- Modularized design
- 3x3 Conv as the module
- Stack the same module
- Same computation for each module
- Stage-wise training
- VGG-11 → VGG-13 → VGG-16
- We need a better initialization...
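A minimal sketch of this modularized design (the filter counts are illustrative): one stage is simply the same 3x3 convolution module stacked several times, followed by pooling.

import tensorflow as tf

def vgg_stage(x, filters, n_convs):
    # stack the same 3x3 conv module n_convs times
    for _ in range(n_convs):
        x = tf.keras.layers.Conv2D(filters = filters,
                                   kernel_size = (3, 3),
                                   padding = "SAME",
                                   activation = 'relu')(x)
    # halve the spatial resolution between stages
    return tf.keras.layers.MaxPool2D(pool_size = (2, 2), strides = 2)(x)

inputs = tf.keras.Input(shape = (224, 224, 3))
x = vgg_stage(inputs, filters = 64, n_convs = 2)    # deeper stages reuse the same module
x = vgg_stage(x, filters = 128, n_convs = 2)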
GoogLeNet/Inception
- Multiple branches
- e.g., 1x1, 3x3, 5x5, pool
- Shortcuts
- stand-alone 1x1, merged by concat.
- Bottleneck
- Reduce dim by 1x1 before expensive 3x3/5x5 conv
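A sketch of one Inception-style block (the filter counts are illustrative assumptions): parallel 1x1/3x3/5x5/pooling branches, with 1x1 bottlenecks placed before the expensive convolutions, merged by concatenation.

import tensorflow as tf

def inception_block(x):
    # branch 1: stand-alone 1x1 convolution
    b1 = tf.keras.layers.Conv2D(32, (1, 1), padding = "SAME", activation = 'relu')(x)
    # branch 2: 1x1 bottleneck to reduce channels before the expensive 3x3 conv
    b2 = tf.keras.layers.Conv2D(16, (1, 1), padding = "SAME", activation = 'relu')(x)
    b2 = tf.keras.layers.Conv2D(32, (3, 3), padding = "SAME", activation = 'relu')(b2)
    # branch 3: 1x1 bottleneck before the 5x5 conv
    b3 = tf.keras.layers.Conv2D(8, (1, 1), padding = "SAME", activation = 'relu')(x)
    b3 = tf.keras.layers.Conv2D(16, (5, 5), padding = "SAME", activation = 'relu')(b3)
    # branch 4: pooling followed by a 1x1 projection
    b4 = tf.keras.layers.MaxPool2D((3, 3), strides = 1, padding = "SAME")(x)
    b4 = tf.keras.layers.Conv2D(16, (1, 1), padding = "SAME", activation = 'relu')(b4)
    # merge all branches by channel-wise concatenation
    return tf.keras.layers.Concatenate()([b1, b2, b3, b4])

inputs = tf.keras.Input(shape = (28, 28, 192))
merged = inception_block(inputs)    # 32 + 32 + 16 + 16 = 96 output channels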
ResNet
- He, Kaiming, et al. "Deep residual learning for image recognition." CVPR. 2016.
- Skip Connection and Residual Net
A direct connection between two non-consecutive layers
Mitigates the vanishing-gradient problem
Parameters are optimized to learn a residual, that is, the difference between the value before the block and the one needed after it.
A skip connection is a connection that bypasses at least one layer.
Here, it is often used to transfer local information by concatenating or summing feature maps from the downsampling path with feature maps from the upsampling path. Merging features from multiple resolution levels helps combine context information with spatial information.
import tensorflow as tf

n_hidden = 128    # example value; set to match your task
n_output = 10     # example value; number of output classes

def residual_net(x):
    # two 3x3 convolutions; "SAME" padding keeps the spatial size so the
    # skip connection below can be applied element-wise
    conv1 = tf.keras.layers.Conv2D(filters = 32,
                                   kernel_size = (3, 3),
                                   padding = "SAME",
                                   activation = 'relu')(x)
    conv2 = tf.keras.layers.Conv2D(filters = 32,
                                   kernel_size = (3, 3),
                                   padding = "SAME",
                                   activation = 'relu')(conv1)
    # skip connection: add the block's input to its output, so the two
    # convolutions only need to learn the residual
    maxp2 = tf.keras.layers.MaxPool2D(pool_size = (2, 2),
                                      strides = 2)(conv2 + x)
    flat = tf.keras.layers.Flatten()(maxp2)
    hidden = tf.keras.layers.Dense(units = n_hidden,
                                   activation = 'relu')(flat)
    output = tf.keras.layers.Dense(units = n_output)(hidden)
    return output
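Note that the element-wise sum conv2 + x requires the input x to have the same shape as conv2, including the 32 channels. When the dimensions differ, ResNet applies a 1x1 convolution on the shortcut path to project the input to the matching size.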
DenseNets
DenseNets (Densely Connected Convolutional Networks) are a type of convolutional neural network (CNN) architecture introduced by Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger in 2017. DenseNets were proposed to address issues related to vanishing gradients, feature reuse, and efficiency in deep learning models.
In DenseNets, each layer is connected to every other layer in a feed-forward manner. Instead of summing feature maps (as in ResNets), DenseNets concatenate the feature maps from previous layers and pass them to subsequent layers.
Traditional CNNs: Each layer passes information only to the next layer.
DenseNets: Each layer receives the concatenated outputs of all preceding layers as input, allowing for maximum feature reuse.
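A sketch of a dense block (the growth rate and depth are illustrative assumptions): each new layer's output is concatenated onto everything computed before it.

import tensorflow as tf

def dense_block(x, n_layers = 3, growth_rate = 12):
    for _ in range(n_layers):
        # each layer sees the concatenation of all preceding feature maps
        new_features = tf.keras.layers.Conv2D(filters = growth_rate,
                                              kernel_size = (3, 3),
                                              padding = "SAME",
                                              activation = 'relu')(x)
        # concatenate (rather than sum, as in ResNets) for maximum feature reuse
        x = tf.keras.layers.Concatenate()([x, new_features])
    return x

inputs = tf.keras.Input(shape = (32, 32, 24))
out = dense_block(inputs)    # channels grow 24 -> 36 -> 48 -> 60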
1.3. Load Pre-trained Models
List of Available Models
- VGG16
- VGG19
- ResNet
- GoogLeNet/Inception
- DenseNet
- MobileNet
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import cv2
%matplotlib inline
Model Selection
# model_type = tf.keras.applications.densenet
# model_type = tf.keras.applications.inception_resnet_v2
# model_type = tf.keras.applications.inception_v3
model_type = tf.keras.applications.mobilenet
# model_type = tf.keras.applications.mobilenet_v2
# model_type = tf.keras.applications.nasnet
# model_type = tf.keras.applications.resnet50
# model_type = tf.keras.applications.vgg16
# model_type = tf.keras.applications.vgg19
Model Summary
model = model_type.MobileNet() # change the constructor to match model_type above
model.summary()
from google.colab import drive
drive.mount('/content/drive')
# img = cv2.imread('/content/drive/MyDrive/DL/DL_data/ILSVRC2017_test_00000005.JPEG')
img = cv2.imread('/content/drive/MyDrive/DL/DL_data/ILSVRC2017_test_00005381.JPEG')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # OpenCV loads images as BGR; convert to RGB
print(img.shape)
plt.figure(figsize = (6, 6))
plt.imshow(img)
plt.axis('off')
plt.show()
Since the input size required for the pre-trained model is $224 \times 224 \times 3$, resizing should be performed as part of the preprocessing step. However, depending on the application, you might prefer cropping instead of resizing to preserve the original aspect ratio and avoid distortion.
resized_img = cv2.resize(img, (224, 224)).reshape(1, 224, 224, 3)
plt.figure(figsize = (6, 6))
plt.imshow(resized_img[0])
plt.axis('off')
plt.show()
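As a sketch of the cropping alternative mentioned above (assuming the image is at least 224 pixels on each side; cropped_img is an illustrative name — the cells below continue with the resized image):

# center crop to 224 x 224 instead of resizing; preserves the aspect
# ratio but discards the borders
h, w, _ = img.shape
top = (h - 224) // 2
left = (w - 224) // 2
cropped_img = img[top:top + 224, left:left + 224].reshape(1, 224, 224, 3)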
input_img = model_type.preprocess_input(resized_img)
pred = model.predict(input_img, verbose = 0)
label = model_type.decode_predictions(pred)[0]
print('\n')
# print the top-5 predicted classes and their probabilities
for _, class_name, prob in label:
    print('%s (%.2f%%)\n' % (class_name, prob * 100))
The pre-trained model correctly predicts the given input image as 'soccer_ball'. Note that the model has never seen this particular image: it comes from the ILSVRC test split, so the correct prediction reflects the model's ability to generalize beyond the examples it was trained on.
2. Transfer Learning
Transfer learning is a machine learning technique where a pre-trained model is reused as the starting point for a different but related task. Instead of training a model from scratch, we "transfer" the learned weights from a large dataset (e.g., ImageNet) and fine-tune them for a specific, typically smaller dataset.
To better understand transfer learning, imagine how students learn new topics in school. Students don't build knowledge entirely from scratch; they learn from teachers, who have already accumulated and organized knowledge over years of study and experience.
For example:
When a student learns physics, the teacher explains concepts like Newton's laws by building on foundational knowledge from mathematics, such as algebra and calculus.
Instead of deriving everything from first principles, the student benefits from the structured, pre-digested knowledge provided by the teacher, making their own learning process faster and more efficient.
Why Use Transfer Learning?
Improved Performance: Pre-trained models already capture general features (like edges, shapes), which improves performance on related tasks.
Faster Training: Since most of the model's parameters are already optimized, training time is reduced.
Less Data Needed: Transfer learning works well with small datasets, as the pre-trained model already has useful feature representations.
Common Approaches to Transfer Learning
Feature Extraction (Frozen Base):
The pre-trained model's weights are frozen (kept constant), and only the final layers are retrained for the new task.
Useful when the new dataset is small or similar to the original dataset.
Fine-Tuning (Updating Weights):
The entire model (or a portion of it) is re-trained, adjusting weights to better fit the new dataset.
The weights from the pre-trained model are used as the initial values instead of random initialization, providing a "head start" for training by leveraging previously learned features.
Useful when the new dataset is significantly different from the pre-training dataset.
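In Keras, the difference between the two approaches comes down to the trainable flag. A minimal sketch, assuming a VGG16 base and a 5-class target task:

import tensorflow as tf

base = tf.keras.applications.vgg16.VGG16(include_top = False,
                                         input_shape = (224, 224, 3))

# approach 1 - feature extraction: freeze every pre-trained layer
base.trainable = False

# approach 2 - fine-tuning: leave the base trainable (or unfreeze only its
# top layers) and train with a small learning rate
# base.trainable = True

x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
output = tf.keras.layers.Dense(units = 5, activation = 'softmax')(x)
model = tf.keras.Model(inputs = base.input, outputs = output)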
from IPython.display import YouTubeVideo
YouTubeVideo('7JcSo0jCLdE?si=7IuLwj5L5lxk6lxI&start=2003', width = "560", height = "315")
2.1. Pre-trained Model (VGG16)
Training a model on ImageNet from scratch takes days or weeks.
Many models trained on ImageNet and their weights are publicly available!
Transfer learning
- Use pre-trained weights, remove last layers to compute representations of images
- The network is used as a generic feature extractor
- Train a classification model from these features on a new classification task
- Pre-trained models can extract general image features that help identify edges, textures, shapes, and object composition
- Better than handcrafted feature extraction on natural images
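For instance (a sketch with placeholder data; the variable names are illustrative), the frozen network can precompute fixed-length feature vectors on which a small classifier is then trained:

import numpy as np
import tensorflow as tf

# pre-trained VGG16 without its classification head, used as a fixed feature extractor
feature_extractor = tf.keras.applications.vgg16.VGG16(include_top = False,
                                                      pooling = 'avg')
feature_extractor.trainable = False

# placeholder batch of preprocessed 224 x 224 x 3 images
images = np.random.rand(4, 224, 224, 3).astype(np.float32)
features = feature_extractor.predict(images, verbose = 0)    # shape: (4, 512)

# a small classifier can now be trained on these 512-dimensional representations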
Import Library
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
Load Data
Download data files
- tranfer_learning_train_images
- tranfer_learning_train_labels
- tranfer_learning_test_images
- tranfer_learning_test_labels
from google.colab import drive
drive.mount('/content/drive')
# Change file paths if necessary
train_imgs = np.load('/content/drive/MyDrive/DL/DL_data/tranfer_learning_train_images.npy')
train_labels = np.load('/content/drive/MyDrive/DL/DL_data/tranfer_learning_train_labels.npy')
test_imgs = np.load('/content/drive/MyDrive/DL/DL_data/tranfer_learning_test_images.npy')
test_labels = np.load('/content/drive/MyDrive/DL/DL_data/tranfer_learning_test_labels.npy')
print(train_imgs.shape)
print(train_labels[0]) # one-hot-encoded 5 classes
# remove one-hot-encoding
train_labels = np.argmax(train_labels, axis = 1)
test_labels = np.argmax(test_labels, axis = 1)
n_train = train_imgs.shape[0]
n_test = test_imgs.shape[0]
# very small dataset
print(n_train)
print(n_test)
Target dataset
- The training dataset consists of only 65 images, with an additional 9 images for testing. This sample size is clearly insufficient for effectively training deep learning models, which typically require a substantial amount of data to achieve robust performance and generalization.
Before applying transfer learning, let's first observe what the pre-trained model, without any training on the target data, predicts for these images.
Dict = ['Hat', 'Cube', 'Card', 'Torch', 'Screw']

# plot a few training examples with their labels
sample_idx = [1, 2, 3, 18, 25]

plt.figure(figsize = (8, 6))
for i, idx in enumerate(sample_idx):
    plt.subplot(2, 3, i + 1)
    plt.imshow(train_imgs[idx])
    plt.title("Label: {}".format(Dict[train_labels[idx]]))
    plt.axis('off')
plt.show()
Load VGG16 Model
'base_model.trainable = False' ensures that all parameters (weights and biases) in the pre-trained base model are 'frozen' and will no longer be updated during training. This means that the layers of the base model will retain their pre-trained values and will not be modified by backpropagation.
model_type = tf.keras.applications.vgg16
base_model = model_type.VGG16()
base_model.trainable = False
print('\n')
base_model.summary()
Testing for Target Data
idx = 1
pred = base_model.predict(test_imgs[idx].reshape(-1, 224, 224, 3), verbose = 0)
label = model_type.decode_predictions(pred)[0]
print('\n')
# print the top-5 predicted classes and their probabilities
for _, class_name, prob in label:
    print('%s (%.2f%%)' % (class_name, prob * 100))
print('\n')
plt.figure(figsize = (4, 4))
plt.imshow(test_imgs[idx])
plt.title("Label : {}".format(Dict[test_labels[idx]]))
plt.axis('off')
plt.show()
All five predictions differ entirely from the ground-truth label, 'Hat'. Moreover, the associated probabilities are low, indicating that the model is not confident in any of these predictions.
2.2. Transfer Learning
- We assume that the pre-trained parameters contain the knowledge learned from the source dataset and that this knowledge will be equally applicable to the target dataset.
- We will train the output layer from scratch, while the parameters of all remaining layers are frozen at (or fine-tuned from) the source model's values.
- Alternatively, initialize all weights from the pre-trained model and then train all of them on the target data (a sketch of this variant appears after the training cell below).
Pre-trained Weights, Biases
vgg16_weights = base_model.get_weights()
Build a Transfer Learning Model
# replace the classifier head with a new, trainable 5-class layer
fc2_layer = base_model.layers[-2].output
output = tf.keras.layers.Dense(units = 5, activation = 'softmax')(fc2_layer)
# define new model
TL_model = tf.keras.Model(inputs = base_model.inputs, outputs = output)
TL_model.summary()
Out of the total model parameters, only 20,485 are trainable, while the remaining 134,260,544 parameters are frozen (non-trainable).
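The trainable count comes from the new softmax head alone: it maps the 4096-dimensional fc2 output to 5 classes, i.e., $4096 \times 5 + 5 = 20{,}485$ weights and biases.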
Define Loss and Optimizer
TL_model.compile(optimizer = 'adam',
loss = 'sparse_categorical_crossentropy',
metrics = ['accuracy'])
Optimize
TL_model.fit(train_imgs, train_labels, epochs = 10)
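To instead fine-tune all weights (the alternative strategy mentioned at the start of this section), one could unfreeze the base and continue training; a sketch, where the learning rate and epoch count are assumed values:

# optional fine-tuning variant: unfreeze the pre-trained layers and keep
# training with a small learning rate so the features are only gently adjusted
base_model.trainable = True
TL_model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 1e-5),
                 loss = 'sparse_categorical_crossentropy',
                 metrics = ['accuracy'])
TL_model.fit(train_imgs, train_labels, epochs = 5)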
Test and Evaluate
test_loss, test_acc = TL_model.evaluate(test_imgs, test_labels)
idx = np.random.randint(n_test)
test_x = test_imgs[idx].reshape(-1, 224, 224, 3)
pred = np.argmax(TL_model.predict(test_x, verbose = 0))
plt.figure(figsize = (4, 4))
plt.imshow(test_x.reshape(224, 224, 3))
plt.title("Label : {}".format(Dict[test_labels[idx]]))
plt.axis('off')
plt.show()
print('\nPrediction: {}'.format(Dict[pred]))
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')