Self-supervised Learning
Table of Contents
Many images from https://amitness.com/2020/02/illustrated-self-supervised-learning/
1. Supervised Learning and Transfer Learning¶
Supervised learning has been a cornerstone of modern computer vision and natural language processing.
Limitations of Supervised Learning
Despite its effectiveness, supervised learning comes with significant practical limitations:
- High labeling cost: Annotating large datasets is time-consuming and expensive.
- Need for domain expertise:
- Radiologists are required to label medical images.
- Language experts are needed to annotate texts in low-resource languages.
- Limited scalability: For every new task or domain, new annotations are often required.
Toward Label-Efficient Learning
To mitigate the labeling burden, researchers have explored alternatives such as:
- Semi-supervised learning: Leveraging a small amount of labeled data with a large pool of unlabeled data.
- Weakly-supervised learning: Using noisy, imprecise, or indirect supervision.
- Unsupervised learning: Learning without any labeled examples.
2. Self-Supervised Learning (SSL)¶
Self-supervised learning is a powerful subclass of unsupervised learning that generates supervision signals from the data itself, without relying on human labels or external annotation sources.
Key ideas:
- Create pretext tasks by modifying or hiding part of the input.
- Train the model to predict or reconstruct the missing or transformed part.
- Example objectives include:
- Predicting masked pixels or tokens
- Solving jigsaw puzzles
- Rotating images and predicting rotation angle
- Contrastive learning (pulling together augmented views of the same instance)
The representations learned from these pretext tasks can be transferred to downstream tasks, such as classification, detection, or segmentation, often matching or surpassing supervised baselines in low-label regimes.
2.1. Pretext Tasks in Self-Supervised Learning¶
In self-supervised learning, a pretext task is an auxiliary learning objective that enables a model to learn useful representations from unlabeled data. These tasks are designed so that the labels can be automatically derived from the input data itself, eliminating the need for human annotation.
By solving pretext tasks, the model is encouraged to learn generalizable and transferable features that can be applied to downstream tasks such as image classification, object detection, and segmentation.
The following are representative examples of pretext tasks proposed in prior work. For each example, pay close attention to how the labels are automatically constructed from the raw data.
(1) Context Prediction
In the context prediction task, an image region is divided into a 3x3 grid of patches. The model receives a pair of patches: one is the center patch of the grid, and the other is randomly selected from one of the eight neighboring locations. The model is trained to predict the relative spatial position of the second patch with respect to the center patch.
To avoid trivial solutions based on low-level cues such as continuing edges or textures, a gap is left between patches and their positions are randomly jittered.
Reference:
Doersch, C., Gupta, A., and Efros, A. A. (2015). Unsupervised Visual Representation Learning by Context Prediction. In Proceedings of ICCV, pp. 1422-1430.
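As a rough, hypothetical sketch of the data-generation step (the patch size, gap, and function name below are illustrative assumptions, not details from the paper), a training example can be built by cropping the center patch together with one randomly chosen neighbor, using the neighbor's index as the label:
import numpy as np

def sample_context_pair(img, patch = 32, gap = 8):
    # img: (H, W, C) array, assumed large enough to hold a 3x3 grid of patches
    H, W = img.shape[:2]
    assert H >= 3*patch + 2*gap and W >= 3*patch + 2*gap
    step = patch + gap                      # distance between patch centers
    cy, cx = H // 2, W // 2                 # location of the center patch

    # the 8 neighboring grid positions; their index (0-7) is the label
    offsets = [(-1, -1), (-1, 0), (-1, 1),
               ( 0, -1),          ( 0, 1),
               ( 1, -1), ( 1, 0), ( 1, 1)]
    label = np.random.randint(8)
    dy, dx = offsets[label]

    def crop(yc, xc):
        y0, x0 = yc - patch // 2, xc - patch // 2
        return img[y0:y0 + patch, x0:x0 + patch]

    # (center patch, neighbor patch, relative-position label)
    return crop(cy, cx), crop(cy + dy*step, cx + dx*step), label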
(2) Jigsaw Puzzle Solving
In this task, an image is divided into nine square patches. These patches are shuffled according to a predefined permutation, and the model is trained to predict the permutation index that restores the original spatial arrangement.
This encourages the model to reason about object parts, spatial layout, and structure.
Reference:
Noroozi, M., and Favaro, P. (2016). Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. In Proceedings of ECCV, pp. 69-84.
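As an illustration (not the authors' implementation), the sketch below cuts an image into a 3x3 grid of tiles, shuffles them with one permutation from a fixed set, and uses the index of that permutation as the label. The permutation set here is tiny and hypothetical; the paper selects a much larger set of maximally distinct permutations.
import numpy as np

# A small, hypothetical permutation set (the paper uses on the order of 1000)
PERMUTATIONS = [
    (0, 1, 2, 3, 4, 5, 6, 7, 8),
    (8, 7, 6, 5, 4, 3, 2, 1, 0),
    (2, 0, 1, 5, 3, 4, 8, 6, 7),
    (6, 7, 8, 0, 1, 2, 3, 4, 5),
]

def make_jigsaw_example(img):
    # img: (H, W, C) array with H and W divisible by 3
    H, W = img.shape[:2]
    ph, pw = H // 3, W // 3

    # cut the image into 9 tiles in row-major order
    tiles = [img[r*ph:(r+1)*ph, c*pw:(c+1)*pw] for r in range(3) for c in range(3)]

    # pick a permutation; its index is the self-supervised label
    label = np.random.randint(len(PERMUTATIONS))
    shuffled = [tiles[j] for j in PERMUTATIONS[label]]

    return np.stack(shuffled), label   # (9, ph, pw, C), label in {0, ..., K-1}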
(3) Image Colorization
In the image colorization task, the input is a grayscale image, and the model is trained to generate a plausible colorized version of it. This requires the model to understand object semantics, textures, and scene context.
Once trained, the encoder used in the colorization model can be reused for downstream tasks.
Reference:
Zhang, R., Isola, P., and Efros, A. A. (2016). Colorful Image Colorization. In Proceedings of ECCV, pp. 649-666.
Data generation
Network architecture
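A minimal sketch of the data-generation step, assuming a batch of RGB images scaled to [0, 1]: the grayscale version serves as the model input and the original colors as the target (the original paper instead predicts distributions over quantized color bins in Lab space).
import tensorflow as tf

def make_colorization_pair(rgb_batch):
    # rgb_batch: (N, H, W, 3) float tensor in [0, 1]
    gray = tf.image.rgb_to_grayscale(rgb_batch)   # (N, H, W, 1) model input
    target = rgb_batch                            # (N, H, W, 3) color target
    return gray, target

# random data standing in for unlabeled images
images = tf.random.uniform((8, 32, 32, 3))
x, y = make_colorization_pair(images)
print(x.shape, y.shape)   # (8, 32, 32, 1) (8, 32, 32, 3)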
(4) Image Super-Resolution
In super-resolution, the goal is to reconstruct a high-resolution image from its low-resolution version. Training pairs can be generated automatically by downsampling high-resolution images.
This task allows models to learn fine-grained textures and natural image statistics without requiring manual annotations.
Data generation
Network architecture
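The training pairs can be produced automatically with a few lines of code; in the sketch below, the downsampling factor and resize method are arbitrary choices.
import tensorflow as tf

def make_sr_pair(hr_batch, factor = 4):
    # hr_batch: (N, H, W, C) high-resolution images; H and W divisible by factor
    h, w = hr_batch.shape[1], hr_batch.shape[2]
    lr = tf.image.resize(hr_batch, (h // factor, w // factor), method = 'bicubic')
    return lr, hr_batch   # low-resolution input, high-resolution target

hr = tf.random.uniform((8, 64, 64, 3))
lr, hr_target = make_sr_pair(hr)
print(lr.shape, hr_target.shape)   # (8, 16, 16, 3) (8, 64, 64, 3)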
(5) Image Inpainting
Image inpainting involves reconstructing missing or occluded parts of an image. During training, portions of the image are randomly removed, and the model is trained to fill in the missing regions based on the surrounding context.
This task encourages the model to learn both local and global structure within images.
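A minimal sketch of the masking step, assuming square masks at random locations (the mask size is an arbitrary choice): the masked image is the model input and the original image is the reconstruction target.
import numpy as np

def make_inpainting_pair(batch, mask_size = 8):
    # batch: (N, H, W, C) array; a random square region is zeroed out per image
    masked = batch.copy()
    n, h, w = batch.shape[:3]
    for i in range(n):
        y = np.random.randint(0, h - mask_size)
        x = np.random.randint(0, w - mask_size)
        masked[i, y:y + mask_size, x:x + mask_size] = 0.0
    return masked, batch   # (input with hole, reconstruction target)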
These pretext tasks form the foundation of many self-supervised learning methods. They provide a mechanism for extracting high-quality representations from data without requiring human labels.
Data generation
Network architecture
2.2. Pipeline of Self-supervised Learning¶
Self-supervised learning is an emerging paradigm that enables models to learn rich and transferable representations without the need for human-labeled data. By designing proxy tasks (pretext tasks), the model can extract supervision signals directly from the input data itself.
Benefits of self-supervised learning:
- Similar to supervised pretraining, it enables models to acquire general-purpose feature representations useful for a wide range of downstream tasks
- Reduces the cost and effort associated with manually labeling large-scale datasets
- Makes it possible to exploit vast quantities of unlabeled data available from sources such as the web, surveillance systems, and sensor streams
Pipeline of self-supervised learning:
- A deep neural network is trained on unlabeled data using a pretext task to learn visual representations
- Once trained, the parameters of the network are either frozen or partially fine-tuned
- The pretrained model is transferred to downstream tasks, such as classification or detection
- The performance on these downstream tasks serves as an indirect evaluation of the effectiveness of the pretext task and the learned features
Reference: Jing, L., and Tian, Y. (2021). Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11), 4037-4058.
Downstream tasks:
After training on the pretext task, the learned network can be transferred to downstream tasks by adding task-specific layers while keeping the original network frozen or fine-tuned. This enables effective reuse of learned features.
Common downstream tasks include:
- Image classification
- Regression tasks (e.g., depth estimation)
- Object detection
- Semantic segmentation
This modular transfer process makes self-supervised learning a powerful and flexible framework for representation learning in domains where labeled data is scarce or expensive to obtain.
2.3. Self-supervised Learning with TensorFlow¶
Pretext task - Rotation:
One notable example of a pretext task is image rotation prediction, as proposed in RotNet (Spyros Gidaris, Praveer Singh, Nikos Komodakis, 2018, "Unsupervised Representation Learning by Predicting Image Rotations"). The central hypothesis is that a model can accurately predict the correct rotation of an image only if it has acquired a form of visual commonsense - an understanding of what objects should look like in their upright orientation.
In this task, the self-supervised learning process proceeds as follows:
- Input images are rotated by one of four angles: 0°, 90°, 180°, or 270°
- The model is trained to classify which of these four rotations has been applied
- This turns the pretext task into a four-way classification problem that does not require any manual labels
RotNet: supervised vs self-supervised performance
Despite the lack of label supervision, models trained with the RotNet pretext task achieve competitive performance. For example:
- The performance gap between a RotNet-based model and a fully supervised Network-in-Network (NIN) model is only 1.64 percentage points
- This demonstrates that it is possible to learn meaningful representations from unlabeled data, achieving results that are close to those of supervised counterparts
This example highlights the effectiveness of self-supervised learning with simple yet powerful pretext tasks, and shows how TensorFlow or similar frameworks can be used to implement and evaluate such models in practice.
Import Library
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
Load MNIST Data
(X_train, Y_train), (X_test, Y_test) = tf.keras.datasets.mnist.load_data()
XX_train = X_train[10000:11000]
YY_train = Y_train[10000:11000]
X_train = X_train[:10000]
Y_train = Y_train[:10000]
XX_test = X_test[300:600]
YY_test = Y_test[300:600]
X_test = X_test[:300]
Y_test = Y_test[:300]
print('shape of x_train:', X_train.shape)
print('shape of y_train:', Y_train.shape)
print('shape of xx_train:', XX_train.shape)
print('shape of yy_train:', YY_train.shape)
print('shape of x_test:', X_test.shape)
print('shape of y_test:', Y_test.shape)
print('shape of xx_test:', XX_test.shape)
print('shape of yy_test:', YY_test.shape)
2.3.1. Build RotNet for Pretext Task¶
Dataset for Pretext Task (Rotation)
We need to generate rotated images and their rotation labels to train the model on the pretext task:
- [1, 0, 0, 0]: $0^\circ$ rotation
- [0, 1, 0, 0]: $90^\circ$ rotation
- [0, 0, 1, 0]: $180^\circ$ rotation
- [0, 0, 0, 1]: $270^\circ$ rotation
n_samples = X_train.shape[0]

X_rotate = np.zeros(shape = (n_samples*4,
                             X_train.shape[1],
                             X_train.shape[2]))
Y_rotate = np.zeros(shape = (n_samples*4, 4))

for i in range(n_samples):
    img = X_train[i]

    # 0 degrees rotation
    X_rotate[4*i] = img
    Y_rotate[4*i] = tf.one_hot(0, depth = 4)

    # 90 degrees rotation
    X_rotate[4*i + 1] = np.rot90(img, k = 1)
    Y_rotate[4*i + 1] = tf.one_hot(1, depth = 4)

    # 180 degrees rotation
    X_rotate[4*i + 2] = np.rot90(img, k = 2)
    Y_rotate[4*i + 2] = tf.one_hot(2, depth = 4)

    # 270 degrees rotation
    X_rotate[4*i + 3] = np.rot90(img, k = 3)
    Y_rotate[4*i + 3] = tf.one_hot(3, depth = 4)
Plot Dataset for Pretext Task (Rotation)
plt.figure(figsize = (10, 10))
plt.subplot(141)
plt.imshow(X_rotate[12], cmap = 'gray')
plt.axis('off')
plt.subplot(142)
plt.imshow(X_rotate[13], cmap = 'gray')
plt.axis('off')
plt.subplot(143)
plt.imshow(X_rotate[14], cmap = 'gray')
plt.axis('off')
plt.subplot(144)
plt.imshow(X_rotate[15], cmap = 'gray')
plt.axis('off')
X_rotate = X_rotate.reshape(-1,28,28,1)
Build Model for Pretext Task (Rotation)
model_pretext = tf.keras.models.Sequential([
tf.keras.layers.Conv2D(filters = 64,
kernel_size = (3,3),
strides = (2,2),
activation = 'relu',
padding = 'SAME',
input_shape = (28, 28, 1)),
tf.keras.layers.MaxPool2D(pool_size = (2, 2),
strides = (2, 2)),
tf.keras.layers.Conv2D(filters = 32,
kernel_size = (3,3),
strides = (1,1),
activation = 'relu',
                           padding = 'SAME'),   # feature map from previous layer: (7, 7, 64)
tf.keras.layers.MaxPool2D(pool_size = (2, 2),
strides = (2, 2)),
tf.keras.layers.Conv2D(filters = 16,
kernel_size = (3,3),
strides = (2,2),
activation = 'relu',
                           padding = 'SAME'),   # feature map from previous layer: (3, 3, 32)
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(units = 4, activation = 'softmax')
])
model_pretext.summary()
Training the model for the pretext task
model_pretext.compile(optimizer = 'adam',
                      loss = 'categorical_crossentropy',
                      metrics = ['accuracy'])

# shuffle = False keeps the four rotated copies of each image within the same mini-batch
model_pretext.fit(X_rotate,
                  Y_rotate,
                  batch_size = 192,
                  epochs = 50,
                  verbose = 0,
                  shuffle = False)
2.3.2. Build Downstream Task (MNIST Image Classification)¶
Freezing trained parameters to transfer them for the downstream task
model_pretext.trainable = False
Reshape Dataset
XX_train = XX_train.reshape(-1,28,28,1)
XX_test = XX_test.reshape(-1,28,28,1)
YY_train = tf.one_hot(YY_train, 10,on_value = 1.0, off_value = 0.0)
YY_test = tf.one_hot(YY_test, 10,on_value = 1.0, off_value = 0.0)
Build Model
Model: two convolution layers and one fully connected layer
- Two convolution layers are transferred from the model for the pretext task
- Only the single fully connected layer is trained
model_downstream = tf.keras.models.Sequential([
model_pretext.get_layer(index = 0),
model_pretext.get_layer(index = 1),
model_pretext.get_layer(index = 2),
model_pretext.get_layer(index = 3),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(units = 10, activation = 'softmax')
])
model_downstream.summary()
model_downstream.compile(optimizer = tf.keras.optimizers.SGD(learning_rate = 0.001, momentum = 0.9),
                         loss = 'categorical_crossentropy',
                         metrics = ['accuracy'])

model_downstream.fit(XX_train,
                     YY_train,
                     batch_size = 64,
                     validation_split = 0.2,
                     epochs = 50,
                     verbose = 0,
                     callbacks = [tf.keras.callbacks.EarlyStopping(monitor = 'accuracy', patience = 7)])
Downstream Task Trained Result (Image Classification Result)
name = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
idx = 9
img = XX_train[idx].reshape(-1,28,28,1)
label = YY_train[idx]
predict = model_downstream.predict(img)
mypred = np.argmax(predict, axis = 1)
plt.figure(figsize = (8, 4))
plt.subplot(1,2,1)
plt.imshow(img.reshape(28, 28), 'gray')
plt.axis('off')
plt.subplot(1,2,2)
plt.stem(predict[0])
plt.show()
print('Prediction : {}'.format(name[mypred[0]]))
2.3.3. Build Supervised Model for Comparison¶
Convolution Neural Networks for MNIST image classification
- Model: same architecture as the model for the downstream task
- The total number of parameters is the same as in the downstream model, but it has zero non-trainable parameters
model_sup = tf.keras.models.Sequential([
tf.keras.layers.Conv2D(filters = 64,
kernel_size = (3,3),
strides = (2,2),
activation = 'relu',
padding = 'SAME',
input_shape = (28, 28, 1)),
tf.keras.layers.MaxPool2D(pool_size = (2, 2),
strides = (2, 2)),
tf.keras.layers.Conv2D(filters = 32,
kernel_size = (3,3),
strides = (1,1),
activation = 'relu',
                           padding = 'SAME'),   # feature map from previous layer: (7, 7, 64)
tf.keras.layers.MaxPool2D(pool_size = (2, 2),
strides = (2, 2)),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(units = 10, activation = 'softmax')
])
model_sup.summary()
model_sup.compile(optimizer = tf.keras.optimizers.SGD(learning_rate = 0.001, momentum = 0.9),
loss = 'categorical_crossentropy',
metrics = ['accuracy'])
model_sup.fit(XX_train,
YY_train,
batch_size = 32,
validation_split = 0.2,
epochs = 50,
verbose = 0)
Compare Self-supervised Learning and Supervised Learning
(1) Pretext Task
- Input data: 10,000 MNIST images without labels
(2) Downstream Task and Supervised Learning (for performance comparison)
- Training data: 1,000 MNIST images with labels
- Test data: 300 MNIST images with labels
(3) Key concepts
- For conventional transfer learning, networks such as VGG-16 are pretrained on large labeled image datasets such as ImageNet
- With self-supervised learning, we can pretrain such networks on unlabeled image datasets, which are typically far larger than labeled ones, and then perform transfer learning
- Comparing the downstream task performance with that of supervised learning therefore amounts to comparing (self-supervised) transfer learning against purely supervised learning
test_self = model_downstream.evaluate(XX_test, YY_test, batch_size = 64, verbose = 2)
print("")
print('Self-supervised Learning Accuracy on Test Data: {:.2f}%'.format(test_self[1]*100))
test_sup = model_sup.evaluate(XX_test, YY_test, batch_size = 64, verbose = 2)
print("")
print('Supervised Learning Accuracy on Test Data: {:.2f}%'.format(test_sup[1]*100))
3. What is Contrastive Learning¶
Contrastive learning is a self-supervised learning technique that leverages large amounts of unlabeled data to learn useful representations.
The core idea is to train an encoder that maps inputs into a representation space where:
- Positive pairs (e.g., different views or augmentations of the same input) are mapped closely together, and
- Negative pairs (e.g., representations of different inputs) are mapped far apart.
In doing so, the encoder learns to capture semantic similarities and differences, even without explicit labels. Contrastive learning plays a central role in many modern representation learning frameworks such as SimCLR, MoCo, and CLIP.
The image below illustrates contrastive learning:
- The anchor image (top right) is embedded using the network $\theta(\cdot)$.
- A positive example (bottom right, same class) is embedded nearby ($d^+$ small).
- A negative example (left, different class) is embedded far away ($d^-$ large).
- The goal of training is to minimize $d^+$ and maximize $d^-$.
Objective Function
A common loss used in contrastive learning is the contrastive loss:
$$ \mathcal{L} = \mathbb{I}[y = 1] \cdot d^2 + \mathbb{I}[y = 0] \cdot \max(0, m - d)^2 $$
where:
- $d = \lVert \theta(x_i) - \theta(x_j) \rVert_2$ is the Euclidean distance between two embeddings,
- $y = 1$ for a positive pair, and $y = 0$ for a negative pair,
- $m$ is a margin that negative pairs must exceed.
Of course, this contrastive loss is just one of many possible loss functions, which we will study later.
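As a minimal sketch of this pairwise contrastive loss (the embedding dimension and margin below are placeholders), the formula can be computed directly with NumPy:
import numpy as np

def contrastive_loss(z_i, z_j, y, m = 1.0):
    # z_i, z_j: embedding vectors; y = 1 for a positive pair, 0 for a negative pair
    d = np.linalg.norm(z_i - z_j)            # Euclidean distance between embeddings
    if y == 1:
        return d**2                          # pull positive pairs together
    else:
        return max(0.0, m - d)**2            # push negative pairs beyond the margin

z_a, z_b = np.random.randn(128), np.random.randn(128)
print(contrastive_loss(z_a, z_b, y = 1))     # positive-pair loss
print(contrastive_loss(z_a, z_b, y = 0))     # negative-pair loss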
Unsupervised Contrastive Learning
(1) Positive Samples
A positive sample is another view or augmentation of the same underlying instance (anchor). It is assumed to have the same semantic content.
For example, if the anchor is an image of a dog, then:
- A mirrored version
- A grayscale version
- Or any strongly augmented version (e.g., cropped or jittered) can serve as a positive sample.
Common augmentations used to create positive samples include:
- Color jitter
- Rotation
- Horizontal/vertical flipping
- Gaussian noise
- Random affine transformations
These augmentations preserve the class identity while forcing the network to learn invariance to superficial changes.
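As an illustration, the sketch below generates two augmented views of each image with a few of the transformations listed above, using standard tf.image operations; the augmentation strengths are arbitrary choices.
import tensorflow as tf

def augment(images):
    # images: (N, H, W, C) float tensor in [0, 1]
    x = tf.image.random_flip_left_right(images)
    x = tf.image.random_brightness(x, max_delta = 0.3)
    x = tf.image.random_contrast(x, lower = 0.7, upper = 1.3)
    x = x + tf.random.normal(tf.shape(x), stddev = 0.05)   # Gaussian noise
    return tf.clip_by_value(x, 0.0, 1.0)

batch = tf.random.uniform((16, 32, 32, 3))
view_1 = augment(batch)    # two independently augmented views of the same images
view_2 = augment(batch)    # (view_1[i], view_2[i]) form a positive pair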
(2) Negative Samples
A negative sample is any image that is semantically different from the anchor.
- In unsupervised settings, negative samples are typically drawn randomly from the rest of the dataset.
- There is no guarantee that they belong to a different class, but statistically, this assumption holds if the dataset is large and diverse.
Supervised Contrastive Learning
When class labels are available, we can improve contrastive learning by using label information to define more meaningful positives.
- In this setting, positive samples include all examples from the same class as the anchor, not just augmentations.
- Negative samples are drawn from other classes.
This approach encourages embeddings of all instances with the same label to cluster tightly in the representation space, improving performance especially on downstream classification tasks.
3.1.2. Image Subsampling/Patching Method¶
Another approach to constructing positive and negative pairs is to use image patches instead of entire images.
- Positive pairs are formed by extracting different patches from the same image. These patches may capture different regions but still share the same semantic identity.
- Negative pairs are formed by pairing patches from different images.
3.2. Objectives in Contrastive Learning¶
In contrastive learning, the model learns to map inputs into an embedding space where semantically similar examples (positives) are close together, and dissimilar ones (negatives) are far apart. Formally, let us denote:
- A query vector $q = \theta(x)$, where $\theta$ is the embedding function
- A positive key $k^+ = \theta(x^+)$, derived from the same source or class as $q$
- One or more negative keys $k^- = \theta(x^-)$, representing different or unrelated instances
Distance Function
The distance metric used in contrastive learning can be chosen based on the geometry of the embedding space, with common choices including Euclidean distance and cosine similarity.
The learning objective is to minimize the distance between the query and its positive key, while maximizing the distance to negative keys. This objective can be formalized in several distinct loss functions, each suited to different settings.
(1) Max-Margin Contrastive Loss
The max-margin contrastive loss ensures that the positive pair is closer than the negative pair by at least a fixed margin $m > 0$.
$$ \mathcal{L}(q, k^+, k^-) = \max\left(0, m + d(q, k^+) - d(q, k^-)\right) $$
Here, $d(\cdot, \cdot)$ is a distance function, typically Euclidean or cosine distance. This loss becomes zero when $d(q, k^-)$ exceeds $d(q, k^+)$ by at least $m$.
(2) Triplet Loss
The triplet loss defines a triplet: an anchor ($q$), a positive ($k^+$), and a negative ($k^-$), and encourages the positive to be closer to the anchor than the negative by a margin $m$.
$$ \mathcal{L}_{\text{triplet}} = \max\left(0, d(q, k^+) - d(q, k^-) + m\right) $$
This loss focuses on relative distances and is most effective when hard negatives are selected. Mining strategies are often used to improve learning.
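A minimal NumPy sketch of both margin-based losses above (they coincide up to how the terms are grouped), using Euclidean distance and randomly generated embeddings as placeholders:
import numpy as np

def triplet_loss(q, k_pos, k_neg, m = 1.0):
    # hinge on relative distances: the positive must be closer than the negative by margin m
    d_pos = np.linalg.norm(q - k_pos)
    d_neg = np.linalg.norm(q - k_neg)
    return max(0.0, d_pos - d_neg + m)

q, k_pos, k_neg = np.random.randn(3, 128)
print(triplet_loss(q, k_pos, k_neg))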
(3) InfoNCE Loss
The InfoNCE loss frames contrastive learning as a classification task: given a query $q$, the model must identify the correct positive $k^+$ within a candidate set consisting of the positive and $K$ negatives.
$$ \mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(q, k^+)/\tau)}{\exp(\text{sim}(q, k^+)/\tau) + \sum_{j=1}^K \exp(\text{sim}(q, k_j^-)/\tau)} $$
Here, $\text{sim}(\cdot, \cdot)$ is a similarity function (e.g., cosine similarity), and $\tau > 0$ is a temperature parameter controlling sharpness.
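A small sketch of InfoNCE with cosine similarity; the embeddings below are random placeholders, and a single positive with $K$ negatives is assumed.
import numpy as np

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def info_nce(q, k_pos, k_negs, tau = 0.1):
    # q: (D,) query, k_pos: (D,) positive key, k_negs: (K, D) negative keys
    pos = np.exp(cosine_sim(q, k_pos) / tau)
    negs = np.sum([np.exp(cosine_sim(q, k) / tau) for k in k_negs])
    return -np.log(pos / (pos + negs))

q, k_pos = np.random.randn(128), np.random.randn(128)
k_negs = np.random.randn(16, 128)                  # K = 16 negatives
print(info_nce(q, k_pos, k_negs))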
(4) NT-Xent Loss
The Normalized Temperature-scaled Cross Entropy Loss (NT-Xent) is a batch-based loss introduced in SimCLR. Each of the $N$ input samples is augmented twice, yielding $2N$ views. For a positive pair $(z_i, z_j)$, the loss is:
$$ \mathcal{L}_i = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)} $$
This loss uses all other samples in the batch as negatives and promotes a symmetric separation between positive and non-positive pairs. NT-Xent has been shown to be highly effective and is widely used in modern self-supervised learning frameworks such as SimCLR.
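The sketch below is one way to compute NT-Xent over a batch with TensorFlow, assuming z stacks the 2N projected views so that rows i and i+N are the two views of the same image; it is illustrative rather than an exact reproduction of the SimCLR implementation.
import tensorflow as tf

def nt_xent(z, tau = 0.5):
    # z: (2N, D); rows i and i+N are the two augmented views of image i
    z = tf.math.l2_normalize(z, axis = 1)
    sim = tf.matmul(z, z, transpose_b = True) / tau     # (2N, 2N) cosine similarities
    n2 = tf.shape(z)[0]

    # remove self-similarity so each denominator sums over the other 2N-1 views
    logits = sim - tf.eye(n2) * 1e9

    n = n2 // 2
    pos_idx = tf.concat([tf.range(n, n2), tf.range(0, n)], axis = 0)   # positive of i is i+N (mod 2N)

    # cross-entropy with the positive view as the correct "class"
    loss = tf.keras.losses.sparse_categorical_crossentropy(pos_idx, logits, from_logits = True)
    return tf.reduce_mean(loss)

z = tf.random.normal((8, 32))    # 2N = 8 projected views as a stand-in
print(nt_xent(z).numpy())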
3.3. General Scheme for Contrastive Learning¶
SimCLR (A Simple Framework for Contrastive Learning of Visual Representations) is a foundational contrastive learning method introduced by Chen et al. (2020). It demonstrated that carefully designed data augmentations, a suitable contrastive loss, and large batch sizes are sufficient to learn high-quality image representations in a self-supervised setting.
(1) Data Augmentation Strategy
The process begins by applying two independent random augmentations to the same input image. These transformations, denoted by $T$, typically include:
- Random cropping and resizing
- Color jitter
- Gaussian blur
- Horizontal flip
This results in two different views of the same image, $x_i$ and $x_j$, which are treated as a positive pair.
(2) Encoder Network
Each of the transformed images is passed through a shared base encoder network $f(\cdot)$ (typically a ResNet-50). The encoder maps each input view into a representation vector:
$$ h_i = f(x_i), \quad h_j = f(x_j) $$
These embeddings $h_i$ and $h_j$ are high-dimensional and are intended to capture rich semantic features. However, they are not used directly for contrastive loss computation.
(3) Projection Head
The output representations $h_i$ and $h_j$ are further processed by a projection head $g(\cdot)$, typically a 2-layer MLP with ReLU activation in between:
$$ z_i = g(h_i), \quad z_j = g(h_j) $$
The projection head maps the representations into a space where the contrastive loss is applied. Empirically, the authors showed that contrastive learning benefits from applying the loss in this transformed space, and discarding the projection head after pretraining improves downstream performance.
The model is trained to maximize similarity between $z_i$ and $z_j$, while minimizing similarity between $z_i$ and all other negatives in the batch.
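A compact Keras sketch of these two components: a ResNet-50 backbone from keras.applications stands in for $f(\cdot)$ and a 2-layer MLP for $g(\cdot)$; the hidden and output dimensions here are illustrative, not the paper's exact values.
import tensorflow as tf

# Base encoder f(.): ResNet-50 without its classification head
encoder = tf.keras.applications.ResNet50(include_top = False,
                                         weights = None,
                                         pooling = 'avg',
                                         input_shape = (224, 224, 3))

# Projection head g(.): 2-layer MLP used only during contrastive pretraining
projection_head = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation = 'relu'),
    tf.keras.layers.Dense(128)
])

inputs = tf.keras.Input(shape = (224, 224, 3))
h = encoder(inputs)             # representation h, reused for downstream tasks
z = projection_head(h)          # embedding z, used in the NT-Xent loss
simclr_model = tf.keras.Model(inputs, z)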
(4) Contrastive Objective: NT-Xent Loss
The Normalized Temperature-scaled Cross Entropy Loss (NT-Xent) is applied to each positive pair and all other views in the batch are treated as negatives.
Given a batch of $N$ images, resulting in $2N$ augmented views, the loss for a positive pair $(z_i, z_j)$ is:
$$ \mathcal{L}_i = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)} $$
where $\text{sim}(\cdot, \cdot)$ is cosine similarity and $\tau$ is a temperature parameter.
(5) Downstream Usage
Once pretraining is complete, the projection head $g(\cdot)$ is discarded. The base encoder $f(\cdot)$ is retained and used to generate representations for downstream tasks such as image classification, object detection, or segmentation.
In the downstream phase, either:
- The encoder is frozen, and a linear classifier is trained on top (linear probing), or
- The encoder is fine-tuned along with the task-specific head.
SimCLR illustrates that even without labels, strong representation learning is possible through contrastive objectives, heavy data augmentation, and careful architectural design.
3.4. Summary¶
Contrastive learning demonstrates that strong representation learning is achievable even in the absence of labels. This is made possible by the combination of:
- Carefully designed contrastive objectives (e.g., the NT-Xent loss),
- Heavy data augmentation to generate diverse yet semantically consistent views,
- And a thoughtfully constructed architecture, including a projection head and encoder backbone.
By optimizing contrastive loss functions, the model learns to structure the embedding space in a meaningful way:
- Samples from the same class (positive pairs) are pulled closer together,
- Samples from different classes (negative pairs) are pushed farther apart.
This tight intra-class clustering and inter-class separation is what defines strong representation learning - it results in embeddings that are highly discriminative and well-suited for downstream tasks such as classification, detection, or segmentation.
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')