Self-supervised Learning


By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

Table of Contents

1. Supervised Learning and Transfer Learning

Supervised pretraining on large labeled datasets has led to successful transfer learning

  • ImageNet

  • Pretrain on fine-grained image classification over 1000 classes



  • Use feature representations for downstream tasks, e.g., object detection, image segmentation, and action recognition



But supervised pretraining comes at a cost …

  • Time-consuming and expensive to label datasets for new tasks
  • Domain expertise needed for specialized tasks
    • Radiologists to label medical images
    • Native speakers or language specialists for labeling text in different languages
  • To relieve the burden of labeling,
    • Semi-supervised learning
    • Weakly-supervised learning
    • Unsupervised learning


Self-supervised learning

  • Self-supervised learning (SSL): supervise using labels generated from the data without any manual or weak label sources
    • Sub-class of unsupervised learning
  • Idea: Hide or modify part of the input. Ask model to recover input or classify what changed
    • The self-supervised task, referred to as the pretext task, can be formulated using only unlabeled data
    • The features obtained from pretext tasks are transferred to downstream tasks like classification, object detection, and segmentation



Pretext Tasks

  • Solving the pretext tasks allows the model to learn good features.

  • We can automatically generate labels for the pretext tasks.



2. Pretext Tasks

2.1. Pretext Task - Context Prediction

  • After dividing one input image into 9 patches, a classifier is trained to predict the relative location of a surrounding patch with respect to the center patch
  • A pair consisting of the center patch and one of the eight surrounding patches is given as input to the network (see the sketch below)
  • Method to avoid trivial solutions, such as the network matching low-level cues across patches
    • uneven spacing between patches
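Below is a minimal NumPy sketch of how (patch pair, relative-position label) training examples could be generated. The patch size, gap, and random grid placement are illustrative values, not the exact settings of Doersch et al.

import numpy as np

def make_context_pair(img, patch = 32, gap = 8):
    # patch size and gap are illustrative; the gap gives the uneven spacing noted above
    rng = np.random.default_rng()
    step = patch + gap
    H, W = img.shape[:2]
    # random top-left corner of the 3x3 patch grid inside the image
    y0 = rng.integers(0, H - 3*step + 1)
    x0 = rng.integers(0, W - 3*step + 1)
    grid = [img[y0 + r*step : y0 + r*step + patch,
                x0 + c*step : x0 + c*step + patch]
            for r in range(3) for c in range(3)]
    center = grid[4]
    label = int(rng.integers(0, 8))                      # which of the 8 surrounding positions
    neighbor = grid[label if label < 4 else label + 1]   # skip the center (index 4)
    return center, neighbor, label                       # network sees the pair, predicts label

center, neighbor, label = make_context_pair(np.zeros((256, 256)))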



Carl Doersch, Abhinav Gupta, Alexei A. Efros, 2015, "Unsupervised Visual Representation Learning by Context Prediction," Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1422-1430.

2.2. Pretext Task - Jigsaw Puzzle

  • Generate 9 patches from the input image
  • After shuffling the patches according to a permutation drawn from a fixed set, train a classifier to predict which permutation was applied, i.e., how to return the patches to their original positions (a minimal sketch follows)
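A minimal NumPy sketch of the example generation. Note that Noroozi and Favaro select a fixed set of 64 maximally distinct permutations; a small random set stands in for it here, and the patch size is illustrative.

import numpy as np

rng = np.random.default_rng(0)
# stand-in for the fixed permutation set (Noroozi and Favaro use 64
# maximally distinct permutations of the 9 patch positions)
perm_set = [rng.permutation(9) for _ in range(10)]

def make_jigsaw_example(img, patch = 9):
    # cut a (3*patch x 3*patch) image into a 3x3 grid of patches
    patches = [img[r*patch:(r + 1)*patch, c*patch:(c + 1)*patch]
               for r in range(3) for c in range(3)]
    label = int(rng.integers(len(perm_set)))          # index of the applied permutation
    shuffled = np.stack([patches[i] for i in perm_set[label]])
    return shuffled, label                            # classifier predicts the permutation index

shuffled, label = make_jigsaw_example(np.zeros((27, 27)))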





Noroozi, M., and Favaro, P., 2016, "Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles," Computer Vision – ECCV 2016, 69–84.

2.3. Pretext Task - Image Colorization

  • Given a grayscale photograph as input, image colorization attacks the problem of hallucinating a plausible color version of the photograph
  • Transfer the trained encoder to the downstream task



Zhang, R., Isola, P., and Efros, A. A., 2016, "Colorful Image Colorization," Computer Vision – ECCV 2016, 649–666.

  • Training data generation for self-supervised learning (sketched below)
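A minimal TensorFlow sketch of the pair generation. Note that Zhang et al. actually work in Lab color space and pose color prediction as classification over quantized ab bins; this sketch uses simple (grayscale, RGB) pairs only to illustrate the idea of free supervision.

import tensorflow as tf

def make_colorization_pair(rgb_img):
    # input: grayscale version of the image; target: the original colors
    gray = tf.image.rgb_to_grayscale(rgb_img)
    return gray, rgb_img

x, y = make_colorization_pair(tf.zeros((64, 64, 3)))   # x: (64, 64, 1), y: (64, 64, 3)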



  • Network architecture



2.4. Pretext Task - Image Super-resolution

  • What if we prepared training pairs of (low-resolution, original) images by downsampling the millions of images we have freely available?
  • Training data generation for self-supervised learning (sketched below)
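A minimal TensorFlow sketch of the pair generation; bicubic downsampling and the scale factor of 4 are illustrative choices.

import tensorflow as tf

def make_sr_pair(img, factor = 4):
    # the downsampled copy is the input; the original image is the target
    h, w = img.shape[0] // factor, img.shape[1] // factor
    small = tf.image.resize(img, (h, w), method = 'bicubic')
    return small, img

x, y = make_sr_pair(tf.zeros((128, 128, 3)))   # x: (32, 32, 3), y: (128, 128, 3)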



  • Network architecture



2.5. Pretext Task - Image Inpainting

  • What if we prepared training pairs of (corrupted, original) images by randomly removing part of each image?
  • Training data generation for self-supervised learning (sketched below)
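A minimal NumPy sketch of the pair generation; the single square hole and its size are illustrative choices rather than the masking scheme of any particular paper.

import numpy as np

def make_inpainting_pair(img, hole = 16):
    # zero out one random square region; the intact image is the target
    corrupted = np.array(img, dtype = np.float32)
    r = np.random.randint(0, corrupted.shape[0] - hole)
    c = np.random.randint(0, corrupted.shape[1] - hole)
    corrupted[r:r + hole, c:c + hole] = 0.0
    return corrupted, np.asarray(img, dtype = np.float32)

x, y = make_inpainting_pair(np.ones((64, 64, 3)))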



  • Network architecture



3. Self-supervised Learning

Benefits of Self-supervised Learning

  • Like supervised pretraining, can learn general-purpose feature representations for downstream tasks
  • Reduce expense of hand-labeling large datasets
  • Can leverage nearly unlimited unlabeled data available on the web


Pipeline of Self-supervised Learning

  1. Within the pretext task, a deep neural network learns visual features from unlabeled input data
  2. The learned parameters of the network are then fixed, and the trained network serves as a pre-trained model for downstream tasks
  3. The pre-trained model is transferred to downstream tasks and fine-tuned
  4. The performance on the downstream tasks is used to evaluate how well the pretext-task methodology learns features from unlabeled data (this pipeline is demonstrated concretely in Section 4)



Jing, L., & Tian, Y., 2021, "Self-supervised visual feature learning with Deep Neural Networks: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11), 4037–4058.

Downstream Tasks

  • After transferring the neural network pre-trained by the pretext task, freeze the weights and build additional layers for the downstream tasks

  • Wide variety of downstream tasks

    • Classification
    • Regression
    • Object detection
    • Segmentation



4. Self-supervised Learning with TensorFlow

Pretext Task - Rotation

  • RotNet
  • Hypothesis: a model could recognize the correct rotation of an object only if it has the “visual commonsense” of what the object should look like

    • Self-supervised learning by rotating the entire input images
    • The model learns to predict which rotation is applied (4-way classification)



  • RotNet: supervised vs. self-supervised
    • The accuracy gap between the RotNet-based model and the fully supervised Network-In-Network (NIN) model is very small, only 1.64 percentage points
    • The RotNet-based model needs no data labels for training, yet achieves accuracy similar to that of the model trained with labels

Import Library

In [1]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

Load MNIST Data

In [3]:
(X_train, Y_train), (X_test, Y_test) = tf.keras.datasets.mnist.load_data()
XX_train = X_train[10000:11000]
YY_train = Y_train[10000:11000]
X_train = X_train[:10000]
Y_train = Y_train[:10000]
XX_test = X_test[300:600]
YY_test = Y_test[300:600]
X_test = X_test[:300]
Y_test = Y_test[:300]
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 0s 0us/step
In [4]:
print('shape of x_train:', X_train.shape)
print('shape of y_train:', Y_train.shape)
print('shape of xx_train:', XX_train.shape)
print('shape of yy_train:', YY_train.shape)
print('shape of x_test:', X_test.shape)
print('shape of y_test:', Y_test.shape)
print('shape of xx_test:', XX_test.shape)
print('shape of yy_test:', YY_test.shape)
shape of x_train: (10000, 28, 28)
shape of y_train: (10000,)
shape of xx_train: (1000, 28, 28)
shape of yy_train: (1000,)
shape of x_test: (300, 28, 28)
shape of y_test: (300,)
shape of xx_test: (300, 28, 28)
shape of yy_test: (300,)

4.1. Build RotNet for Pretext Task



Dataset for Pretext Task (Rotation)

Need to generate rotated images and their labels to train the model for pretext task

  • [1, 0, 0, 0]: 0$^\circ$ rotation
  • [0, 1, 0, 0]: 90$^\circ$ rotation
  • [0, 0, 1, 0]: 180$^\circ$ rotation
  • [0, 0, 0, 1]: 270$^\circ$ rotation
In [5]:
n_samples = X_train.shape[0]
X_rotate = np.zeros(shape = (n_samples*4,
                             X_train.shape[1],
                             X_train.shape[2]))
Y_rotate = np.zeros(shape = (n_samples*4, 4))

for i in range(n_samples):
    img = X_train[i]

    # 0 degrees rotation (original image)
    X_rotate[4*i] = img
    Y_rotate[4*i] = tf.one_hot([0], depth = 4)

    # 90 degrees rotation
    X_rotate[4*i + 1] = np.rot90(img, k = 1)
    Y_rotate[4*i + 1] = tf.one_hot([1], depth = 4)

    # 180 degrees rotation
    X_rotate[4*i + 2] = np.rot90(img, k = 2)
    Y_rotate[4*i + 2] = tf.one_hot([2], depth = 4)

    # 270 degrees rotation
    X_rotate[4*i + 3] = np.rot90(img, k = 3)
    Y_rotate[4*i + 3] = tf.one_hot([3], depth = 4)

Plot Dataset for Pretext Task (Rotation)

In [19]:
plt.figure(figsize = (10, 10))

plt.subplot(141)
plt.imshow(X_rotate[12], cmap = 'gray')
plt.axis('off')

plt.subplot(142)
plt.imshow(X_rotate[13], cmap = 'gray')
plt.axis('off')

plt.subplot(143)
plt.imshow(X_rotate[14], cmap = 'gray')
plt.axis('off')

plt.subplot(144)
plt.imshow(X_rotate[15], cmap = 'gray')
plt.axis('off')
Out[19]:
(-0.5, 27.5, 27.5, -0.5)
In [7]:
X_rotate = X_rotate.reshape(-1,28,28,1)

Build Model for Pretext Task (Rotation)

In [8]:
model_pretext = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(filters = 64,
                           kernel_size = (3,3),
                           strides = (2,2),
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (28, 28, 1)),

    tf.keras.layers.MaxPool2D(pool_size = (2, 2),
                              strides = (2, 2)),

    tf.keras.layers.Conv2D(filters = 32,
                           kernel_size = (3,3),
                           strides = (1,1),
                           activation = 'relu',
                           padding = 'SAME'),

    tf.keras.layers.MaxPool2D(pool_size = (2, 2),
                              strides = (2, 2)),

    tf.keras.layers.Conv2D(filters = 16,
                           kernel_size = (3,3),
                           strides = (2,2),
                           activation = 'relu',
                           padding = 'SAME'),

    tf.keras.layers.Flatten(),

    tf.keras.layers.Dense(units = 4, activation = 'softmax')

])
model_pretext.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv2d (Conv2D)             (None, 14, 14, 64)        640       
                                                                 
 max_pooling2d (MaxPooling2  (None, 7, 7, 64)          0         
 D)                                                              
                                                                 
 conv2d_1 (Conv2D)           (None, 7, 7, 32)          18464     
                                                                 
 max_pooling2d_1 (MaxPoolin  (None, 3, 3, 32)          0         
 g2D)                                                            
                                                                 
 conv2d_2 (Conv2D)           (None, 2, 2, 16)          4624      
                                                                 
 flatten (Flatten)           (None, 64)                0         
                                                                 
 dense (Dense)               (None, 4)                 260       
                                                                 
=================================================================
Total params: 23988 (93.70 KB)
Trainable params: 23988 (93.70 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
  • Training the model for the pretext task



In [9]:
model_pretext.compile(optimizer = 'adam',
                      loss = 'categorical_crossentropy',
                      metrics = 'accuracy')

model_pretext.fit(X_rotate,
                  Y_rotate,
                  batch_size = 192,
                  epochs = 50,
                  verbose = 0,
                  shuffle = False)
Out[9]:
<keras.src.callbacks.History at 0x7e2741df5150>

4.2. Build Downstream Task (MNIST Image Classification)

  • Freezing trained parameters to transfer them for the downstream task
In [10]:
model_pretext.trainable = False

Reshape Dataset

In [11]:
XX_train = XX_train.reshape(-1,28,28,1)
XX_test = XX_test.reshape(-1,28,28,1)
YY_train = tf.one_hot(YY_train, 10, on_value = 1.0, off_value = 0.0)
YY_test = tf.one_hot(YY_test, 10, on_value = 1.0, off_value = 0.0)

Build Model

  • Model: two convolution layers and one fully connected layer
    • The two convolution layers (with their pooling layers) are transferred from the pretext-task model
    • Only the single fully connected layer is trained



In [12]:
model_downstream = tf.keras.models.Sequential([
    model_pretext.get_layer(index = 0),
    model_pretext.get_layer(index = 1),
    model_pretext.get_layer(index = 2),
    model_pretext.get_layer(index = 3),

    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units = 10, activation = 'softmax')
])

model_downstream.summary()
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv2d (Conv2D)             (None, 14, 14, 64)        640       
                                                                 
 max_pooling2d (MaxPooling2  (None, 7, 7, 64)          0         
 D)                                                              
                                                                 
 conv2d_1 (Conv2D)           (None, 7, 7, 32)          18464     
                                                                 
 max_pooling2d_1 (MaxPoolin  (None, 3, 3, 32)          0         
 g2D)                                                            
                                                                 
 flatten_1 (Flatten)         (None, 288)               0         
                                                                 
 dense_1 (Dense)             (None, 10)                2890      
                                                                 
=================================================================
Total params: 21994 (85.91 KB)
Trainable params: 2890 (11.29 KB)
Non-trainable params: 19104 (74.62 KB)
_________________________________________________________________
In [13]:
model_downstream.compile(optimizer = tf.keras.optimizers.SGD(learning_rate = 0.001,momentum = 0.9),
                         loss = 'categorical_crossentropy',
                         metrics = 'accuracy')

model_downstream.fit(XX_train,
                     YY_train,
                     batch_size = 64,
                     validation_split = 0.2,
                     epochs = 50,
                     verbose = 0,
                     callbacks = [tf.keras.callbacks.EarlyStopping(monitor = 'accuracy', patience = 7)])
Out[13]:
<keras.src.callbacks.History at 0x7e26f4b98a00>

Downstream Task Trained Result (Image Classification Result)

In [21]:
name = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
idx = 9
img = XX_train[idx].reshape(-1,28,28,1)
label = YY_train[idx]
predict = model_downstream.predict(img)
mypred = np.argmax(predict, axis = 1)

plt.figure(figsize = (8, 4))
plt.subplot(1,2,1)
plt.imshow(img.reshape(28, 28), 'gray')
plt.axis('off')
plt.subplot(1,2,2)
plt.stem(predict[0])
plt.show()

print('Prediction : {}'.format(name[mypred[0]]))
1/1 [==============================] - 0s 18ms/step
Prediction : 2

4.3. Build Supervised Model for Comparison

  • Convolutional neural network for MNIST image classification
    • Model: same architecture as the model for the downstream task
    • The total number of parameters is the same as in the downstream model, but all of them are trainable (zero non-trainable parameters)
In [15]:
model_sup = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(filters = 64,
                           kernel_size = (3,3),
                           strides = (2,2),
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (28, 28, 1)),

    tf.keras.layers.MaxPool2D(pool_size = (2, 2),
                              strides = (2, 2)),

    tf.keras.layers.Conv2D(filters = 32,
                           kernel_size = (3,3),
                           strides = (1,1),
                           activation = 'relu',
                           padding = 'SAME'),

    tf.keras.layers.MaxPool2D(pool_size = (2, 2),
                              strides = (2, 2)),

    tf.keras.layers.Flatten(),

    tf.keras.layers.Dense(units = 10, activation = 'softmax')

])
model_sup.summary()
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv2d_3 (Conv2D)           (None, 14, 14, 64)        640       
                                                                 
 max_pooling2d_2 (MaxPoolin  (None, 7, 7, 64)          0         
 g2D)                                                            
                                                                 
 conv2d_4 (Conv2D)           (None, 7, 7, 32)          18464     
                                                                 
 max_pooling2d_3 (MaxPoolin  (None, 3, 3, 32)          0         
 g2D)                                                            
                                                                 
 flatten_2 (Flatten)         (None, 288)               0         
                                                                 
 dense_2 (Dense)             (None, 10)                2890      
                                                                 
=================================================================
Total params: 21994 (85.91 KB)
Trainable params: 21994 (85.91 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
In [22]:
model_sup.compile(optimizer = tf.keras.optimizers.SGD(learning_rate = 0.001, momentum = 0.9),
                  loss = 'categorical_crossentropy',
                  metrics = 'accuracy')
model_sup.fit(XX_train,
              YY_train,
              batch_size = 32,
              validation_split = 0.2,
              epochs = 50,
              verbose = 0)
Out[22]:
<keras.src.callbacks.History at 0x7e26e27d8bb0>

Compare Self-supervised Learning and Supervised Learning

  • Pretext Task
    • Input data: 10,000 MNIST images without labels
  • Downstream Task and Supervised Learning (for performance comparison)
    • Training data: 1,000 MNIST images with labels
    • Test data: 300 MNIST images with labels
  • Key concepts
    • For transfer learning, networks such as VGG16 have conventionally been pretrained on large labeled image datasets such as ImageNet
    • With self-supervised learning, such networks can instead be pretrained on unlabeled image datasets, which are far larger than any labeled dataset, and then transferred
    • Comparing the performance of the downstream task with that of supervised learning therefore amounts to comparing (self-supervised) transfer learning against purely supervised learning
In [23]:
test_self = model_downstream.evaluate(XX_test, YY_test, batch_size = 64, verbose = 2)

print("")
print('Self-supervised Learning Accuracy on Test Data:  {:.2f}%'.format(test_self[1]*100))
5/5 - 0s - loss: 11.6132 - accuracy: 0.8100 - 42ms/epoch - 8ms/step

Self-supervised Learning Accuracy on Test Data:  81.00%
In [24]:
test_sup = model_sup.evaluate(XX_test, YY_test, batch_size = 64, verbose = 2)

print("")
print('Supervised Learning Accuracy on Test Data:  {:.2f}%'.format(test_sup[1]*100))
5/5 - 0s - loss: 1.6948 - accuracy: 0.7867 - 33ms/epoch - 7ms/step

Supervised Learning Accuracy on Test Data:  78.67%