Self-supervised Learning

Table of Contents





1. Supervised Learning and Transfer Learning

Supervised learning has been a cornerstone of modern machine learning, driving breakthroughs in computer vision, natural language processing, speech recognition, and robotics. For over a decade, the prevailing recipe has been simple: collect data, annotate it, and train a deep neural network end-to-end.

Yet this recipe comes with a hidden cost — labels. As models grow deeper and datasets grow larger, the dependency on human-annotated supervision becomes an increasingly serious bottleneck. In this section, we revisit the fundamentals of supervised learning, diagnose its structural limitations, and survey the landscape of label-efficient alternatives that motivate the study of self-supervised learning.


1.1. What is Supervised Learning?

In supervised learning, a parametric model $f_\theta : \mathcal{X} \to \mathcal{Y}$ is optimized on a labeled dataset


$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}, \quad x_i \in \mathcal{X},\ y_i \in \mathcal{Y}$$


where $x_i$ is an input observation and $y_i$ is its corresponding ground-truth annotation. The standard training objective minimizes an empirical risk:


$$\min_\theta \; \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\bigl(f_\theta(x_i),\, y_i\bigr) + \lambda \, \Omega(\theta)$$


where $\mathcal{L}$ is a task-specific loss function and $\Omega(\theta)$ is a regularization term.


The key assumption is that labels $y_i$ encode the relevant semantic or physical signal needed to shape the model's internal representations. When this assumption holds and sufficient labeled data is available, supervised learning is remarkably effective.
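To make the objective concrete, the following is a minimal sketch of empirical risk minimization in Keras, assuming a toy 10-class problem with 20-dimensional inputs; the cross-entropy term plays the role of $\mathcal{L}$ and an L2 weight penalty plays the role of $\lambda\,\Omega(\theta)$. All layer sizes are illustrative.

import tensorflow as tf

# f_theta: a small parametric model; the L2 kernel regularizer implements lambda * Omega(theta)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation = 'relu', input_shape = (20,),
                          kernel_regularizer = tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dense(10, activation = 'softmax',
                          kernel_regularizer = tf.keras.regularizers.l2(1e-4))
])

# L: task-specific loss (cross-entropy for classification); averaging over the batch
# approximates the empirical risk (1/N) * sum_i L(f_theta(x_i), y_i)
model.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])
# model.fit(x_train, y_train, epochs = 10)   # x_train: (N, 20), y_train: (N,) integer labels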


1.2. Limitations of Supervised Learning

Despite its empirical success, supervised learning carries three structural limitations that become increasingly acute as we push toward more complex, data-hungry models.


1.2.1. High Labeling Cost

Annotating large-scale datasets is labor-intensive and expensive.

The cost compounds in two ways. First, volume: modern deep networks are data-hungry, and performance on benchmarks like ImageNet continues to improve log-linearly with dataset size. Second, annotation granularity: coarse image-level tags are cheap, but fine-grained labels (pixel masks, 3D bounding boxes, keypoints) are expensive — and unfortunately, many high-value tasks require the latter.


1.2.2. Requirement for Domain Expertise

Many real-world annotation tasks cannot be crowdsourced — they demand specialized knowledge that is simultaneously rare and expensive:

  • Medical imaging: Delineating tumors, lesions, or organ boundaries in CT/MRI scans requires board-certified radiologists. Inter-annotator agreement is often moderate even among experts (e.g., Dice scores of 0.7-0.85 for brain tumor segmentation).
  • Industrial inspection: Identifying micro-cracks in turbine blades or weld defects in structural components requires trained quality engineers with domain-specific knowledge.
  • Scientific discovery: Labeling particle collision events in high-energy physics, annotating cell morphologies in fluorescence microscopy, or classifying geological formations in seismic data all require PhD-level expertise.

This creates a fundamental scalability wall: expert time is finite, expensive, and cannot be parallelized indefinitely.


1.2.3. Limited Cross-Task and Cross-Domain Generalization

Even when a labeled dataset exists, a model trained on it is typically bound to a narrow task-domain combination. When deployed on a different domain $\mathcal{D}_T$ or task $\mathcal{T}_T$, performance often degrades sharply due to distribution shift:


$$P_S(x, y) \neq P_T(x, y)$$


Concrete examples in mechanical engineering contexts:

  • A fault diagnosis model trained on bearings from Manufacturer A fails on bearings from Manufacturer B (domain shift in vibration spectra).
  • A segmentation model trained on daytime urban scenes degrades at night or in rain (covariate shift).
  • A defect detection model trained on steel surfaces does not transfer to composite materials.

Every new task-domain pair, in principle, demands a new annotated dataset. This approach is not scalable.


1.3. Toward Label-Efficient Learning

To mitigate the labeling burden, the machine learning community has developed a family of paradigms that reduce — or eliminate — the need for human annotations.


1.3.1. Semi-Supervised Learning

Semi-supervised learning (Semi-SL) assumes access to a small labeled set $\mathcal{D}_L = \{(x_i, y_i)\}_{i=1}^{M}$ and a large unlabeled set $\mathcal{D}_U = \{x_j\}_{j=1}^{N}$, where $M \ll N$.


The core idea is to exploit the geometric or statistical structure of the input distribution $P(x)$ — accessible from $\mathcal{D}_U$ — to regularize or augment the supervised signal from $\mathcal{D}_L$. Classical approaches include:

  • Self-training: Train on $\mathcal{D}_L$, generate pseudo-labels for $\mathcal{D}_U$, and retrain iteratively.
  • Label propagation: Propagate labels through a $k$-NN graph over all data points.
  • Consistency regularization (e.g., MixMatch, FixMatch): Enforce that predictions are invariant to stochastic augmentations.

Semi-SL is powerful when labeled and unlabeled data share the same distribution, but it still requires some labeled examples and can be sensitive to the quality of pseudo-labels.
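As a concrete illustration of the self-training recipe above, the sketch below generates pseudo-labels for $\mathcal{D}_U$ and retrains on the confident ones. A compiled Keras classifier with a softmax output and integer class labels is assumed; the confidence threshold and epoch counts are illustrative.

import numpy as np

def self_training_round(model, x_labeled, y_labeled, x_unlabeled, threshold = 0.95):
    # 1) Train on the small labeled set D_L
    model.fit(x_labeled, y_labeled, epochs = 5, verbose = 0)

    # 2) Predict pseudo-labels for the unlabeled set D_U
    probs = model.predict(x_unlabeled, verbose = 0)
    pseudo_labels = np.argmax(probs, axis = 1)
    confident = np.max(probs, axis = 1) >= threshold   # keep only high-confidence predictions

    # 3) Retrain on the union of labeled data and confident pseudo-labeled data
    x_aug = np.concatenate([x_labeled, x_unlabeled[confident]])
    y_aug = np.concatenate([y_labeled, pseudo_labels[confident]])
    model.fit(x_aug, y_aug, epochs = 5, verbose = 0)
    return model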


1.3.2. Weakly-Supervised Learning

Weakly-supervised learning relaxes the annotation requirement by accepting imprecise, incomplete, or indirect supervision.

A classic example is weakly-supervised object detection: instead of expensive bounding box annotations, only image-level tags are provided. Class Activation Mapping (CAM) techniques can localize discriminative regions despite this coarse supervision.


1.3.3. Unsupervised Learning

Unsupervised learning operates entirely without labels, aiming to discover structure in the raw data distribution $P(x)$. Classical methods include clustering ($k$-means, GMM), dimensionality reduction (PCA, t-SNE, UMAP), and generative modeling (VAE, GAN). These can reveal meaningful structure, but learned representations are often not directly suitable for downstream discriminative tasks without additional supervision.


A Note on Terminology

Self-supervised learning is sometimes categorized as a special case of unsupervised learning. In this course, we treat it as a distinct paradigm: it explicitly constructs supervisory signals from the data itself (e.g., predicting one part of the input from another), rather than simply modeling $P(x)$. This distinction matters — self-supervised representations tend to be far more transferable than those learned by classical unsupervised methods.


1.4. Transfer Learning as a Partial Solution

Before self-supervised learning became dominant, transfer learning was the standard strategy for coping with limited labeled data. Understanding its strengths and limitations directly motivates why self-supervised pre-training is so compelling.


1.4.1. The Transfer Learning Paradigm

Transfer learning proceeds in two stages:

  1. Pre-training: Train $f_\theta$ on a large source dataset $\mathcal{D}_{\text{pre}}$ with abundant labels (e.g., ImageNet-1K for vision, BooksCorpus for NLP).
  2. Fine-tuning: Adapt $f_\theta$ to a target task by optimizing on a smaller labeled dataset $\mathcal{D}_{\text{fine}}$, with $\theta$ initialized from the pre-trained weights.

$$\theta^{*} = \arg\min_\theta \sum_{(x_i, y_i) \in \mathcal{D}_{\text{fine}}} \mathcal{L}\bigl(f_\theta(x_i), y_i\bigr), \quad \theta \leftarrow \theta_{\text{pre-trained}}$$


The empirical success of this recipe is striking: a ResNet-50 pre-trained on ImageNet and fine-tuned on a 1,000-sample medical dataset routinely outperforms the same architecture trained from scratch on 100,000 medical samples.


1.4.2. Why Does Transfer Learning Work?

Pre-training on a large, diverse dataset encourages the model to learn general-purpose feature hierarchies:


  • Early layers capture low-level statistics: edges, textures, color gradients — features broadly useful across visual tasks.
  • Middle layers encode mid-level structures: object parts, shapes, semantic segments.
  • Late layers encode task-specific, high-level semantics tied to the source labels.

When fine-tuning, the task-specific layers are re-trained while the general-purpose lower layers are retained or lightly updated. This dramatically reduces the effective sample complexity of learning the target task.
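A minimal Keras sketch of this two-stage recipe, assuming an ImageNet-pre-trained ResNet-50 backbone and a hypothetical 5-class target task; whether the backbone is frozen or lightly updated is controlled by the trainable flag.

import tensorflow as tf

num_classes = 5   # hypothetical target task

# Pre-trained backbone (general-purpose features), classification head removed
backbone = tf.keras.applications.ResNet50(weights = 'imagenet', include_top = False, pooling = 'avg')
backbone.trainable = False   # freeze lower layers; set True (with a small learning rate) to fine-tune

# New task-specific head trained on the small target dataset D_fine
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(num_classes, activation = 'softmax')
])
model.compile(optimizer = tf.keras.optimizers.Adam(1e-4),
              loss = 'sparse_categorical_crossentropy',
              metrics = ['accuracy'])
# model.fit(x_fine, y_fine, epochs = 10)   # small labeled target dataset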




2. Self-Supervised Learning (SSL)

Self-supervised learning is a paradigm in which supervision signals are derived directly from the structure of the data itself, without relying on human-provided labels or external annotation sources. Rather than requiring a labeled dataset $\mathcal{D} = \{(x_i, y_i)\}$, a self-supervised method constructs a surrogate learning problem — called a pretext task — from the unlabeled data $\mathcal{D}_U = \{x_i\}$ alone.

Formally, given an input $x$, a pretext task defines a transformation $\mathcal{T}$ that produces a modified input $\tilde{x}$ and a corresponding pseudo-label $\tilde{y}$, both derived automatically from $x$:


$$(\tilde{x},\, \tilde{y}) = \mathcal{T}(x)$$


The model $f_\theta$ is then trained to solve this surrogate task:


$$\min_\theta \; \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\bigl(f_\theta(\tilde{x}_i),\, \tilde{y}_i\bigr)$$


The underlying hypothesis is that in order to solve a well-designed pretext task, the model must learn internal representations that capture the semantic and structural regularities of the data — representations that generalize to downstream tasks of interest.


Pretext Tasks

A pretext task is a self-defined prediction problem whose solution does not require human annotation. The design of the pretext task is crucial: it should be neither too easy (trivially solved without learning useful structure) nor too hard (unsolvable without additional supervision). Representative examples include:

  • Masked prediction: A portion of the input is hidden, and the model is trained to reconstruct the missing content. In vision, this corresponds to predicting masked image patches (e.g., MAE); in language, to predicting masked tokens (e.g., BERT). The pseudo-label $\tilde{y}$ is the original unmasked content.

  • Geometric transformation prediction: The input is subjected to a geometric transformation — rotation, permutation of patches (jigsaw puzzle), or spatial rearrangement — and the model is trained to predict the transformation parameters. For example, in rotation prediction, $\tilde{y} \in \{0°, 90°, 180°, 270°\}$ is determined entirely by the transformation applied.

  • Contrastive learning: Two augmented views $x^+_1$ and $x^+_2$ are generated from the same input $x$, while views from different inputs serve as negatives. The model is trained to produce similar representations for positive pairs and dissimilar representations for negative pairs. The pseudo-label is implicitly defined by the data augmentation process, requiring no human annotation (a minimal loss sketch follows this list).

  • Temporal or contextual prediction: In video or sequential data, the model predicts future frames, missing frames, or surrounding context from observed content. The temporal structure of the data provides the supervisory signal.
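To make the contrastive objective concrete, below is a minimal sketch of an NT-Xent-style (InfoNCE) loss over a batch of paired embeddings, in the spirit of SimCLR; the temperature value and the L2 normalization are illustrative choices rather than a definitive implementation.

import tensorflow as tf

def nt_xent_loss(z1, z2, temperature = 0.5):
    # z1, z2: (batch, dim) embeddings of two augmented views of the same inputs
    batch_size = tf.shape(z1)[0]
    z1 = tf.math.l2_normalize(z1, axis = 1)
    z2 = tf.math.l2_normalize(z2, axis = 1)
    z = tf.concat([z1, z2], axis = 0)                          # (2B, dim)

    sim = tf.matmul(z, z, transpose_b = True) / temperature    # pairwise cosine similarities
    sim = sim - 1e9 * tf.eye(2 * batch_size)                   # mask out self-similarity

    # The positive for row i is the other view of the same input: i <-> i + B (mod 2B)
    positives = tf.concat([tf.range(batch_size, 2 * batch_size),
                           tf.range(batch_size)], axis = 0)
    loss = tf.keras.losses.sparse_categorical_crossentropy(positives, sim, from_logits = True)
    return tf.reduce_mean(loss)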


From Pretext Tasks to Downstream Tasks

The representations learned through pretext task training are not an end in themselves — they serve as a general-purpose initialization for downstream tasks such as classification, object detection, or semantic segmentation. This transfer typically proceeds in one of two ways:

  1. Linear probing: The encoder $f_\theta$ is frozen, and a lightweight linear classifier is trained on top of the learned representations using a small labeled dataset. Strong linear probing performance indicates that the representations are linearly separable with respect to the target labels — a stringent test of representation quality.

  2. Fine-tuning: The entire network, including the pre-trained encoder, is updated on the labeled downstream dataset. This is analogous to the supervised transfer learning pipeline described in Section 1.4, except that the pre-trained weights were obtained without any labels.

A central empirical finding in the self-supervised learning literature is that, in low-label regimes, self-supervised pre-training followed by fine-tuning frequently matches or surpasses the performance of fully supervised models trained on the same downstream data. In some settings — particularly where unlabeled data is abundant but labeled data is scarce — self-supervised representations have been shown to outperform supervised pre-training even on the pre-training domain itself.



2.1. Pretext Tasks in Self-Supervised Learning

In self-supervised learning, a pretext task is an auxiliary learning objective that enables a model to learn useful representations from unlabeled data. These tasks are designed so that the labels can be automatically derived from the input data itself, eliminating the need for human annotation.

By solving pretext tasks, the model is encouraged to learn generalizable and transferable features that can be applied to downstream tasks such as image classification, object detection, and segmentation.

The following are representative examples of pretext tasks proposed in prior work. For each example, pay close attention to how the labels are automatically constructed from the raw data.


The image below illustrates four canonical pretext tasks applied to the same input image:

  • Image completion: the upper portion of the image is hidden, and the model must reconstruct the missing region from the visible lower patch.
  • Rotation prediction: the image is rotated by an unknown angle $\theta$, and the model is trained to predict $\theta$ from the transformed input.
  • Jigsaw puzzle solving: the image is divided into a grid of patches that are spatially shuffled, and the model must recover the original arrangement. The adjacent unshuffled image shows the correct reference.
  • Colorization: the image is converted to grayscale, and the model is trained to predict the original color values for each pixel. The supervisory signal is the original RGB image, which is available at no annotation cost.

In all four cases, the supervisory signal — the masked region, the rotation angle, the correct patch permutation, or the original color values — is derived automatically from the image itself, with no human annotation required.





(1) Context Prediction

In the context prediction task, an image is divided into nine patches arranged in a $3 \times 3$ grid, numbered 1 through 8 surrounding a central patch. The model receives a pair of patches: one is always taken from the center of the image (blue box), and the other is randomly selected from one of the eight surrounding locations (red dashed boxes). The model is trained to predict the relative spatial position of the second patch with respect to the center patch — a classification problem over 8 possible directions.

The figure below illustrates this pipeline. Given an image $x$, a center-neighbor patch pair is extracted. The model must then identify which of the 8 surrounding positions the neighbor patch was sampled from, as indicated by the $3 \times 3$ grid on the right where the blue box marks the center and the red dashed boxes represent the candidate positions.

To avoid trivial solutions based on low-level cues such as chromatic aberration, edge continuity, or texture statistics, the patches are typically spaced with a gap between them, preventing the model from exploiting simple boundary-matching shortcuts.
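A small NumPy sketch of how such center-neighbor pairs might be sampled; the patch size and gap are illustrative, and the image is assumed large enough to hold a $3 \times 3$ grid of spaced patches.

import numpy as np

def sample_context_pair(image, patch = 96, gap = 16):
    # Returns (center_patch, neighbor_patch, label) with label in {0, ..., 7}
    h, w = image.shape[:2]
    step = patch + gap                                   # spacing prevents boundary-matching shortcuts
    cy, cx = h // 2 - patch // 2, w // 2 - patch // 2    # top-left corner of the center patch

    # 8 neighbor offsets (row, col) in the 3x3 grid, excluding the center
    offsets = [(-1, -1), (-1, 0), (-1, 1),
               ( 0, -1),          ( 0, 1),
               ( 1, -1), ( 1, 0), ( 1, 1)]
    label = np.random.randint(8)                         # pseudo-label: which of 8 positions
    dy, dx = offsets[label]
    ny, nx = cy + dy * step, cx + dx * step

    center   = image[cy:cy + patch, cx:cx + patch]
    neighbor = image[ny:ny + patch, nx:nx + patch]
    return center, neighbor, label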



Reference:
Doersch, C., Gupta, A., and Efros, A. A. (2015). Unsupervised Visual Representation Learning by Context Prediction. In Proceedings of ICCV, pp. 1422-1430.


(2) Jigsaw Puzzle Solving

In this task, an image is divided into a $3 \times 3$ grid of nine square patches. The patches are cropped with random gaps between them (as shown in figure (a)) to prevent the model from exploiting low-level boundary continuity. The patches are then shuffled according to a permutation selected from a predefined permutation set, producing a scrambled arrangement (figure (b) shows the shuffled result; figure (c) shows the original ordering for reference). The model is trained to predict the index of the applied permutation, framing the problem as a classification task over the permutation set.
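The label construction described above can be sketched as follows; the three-entry permutation set is a toy placeholder (the original work selects a larger subset of the 9! permutations with large mutual Hamming distance), and the image height and width are assumed divisible by the grid size.

import numpy as np

# A small, hypothetical permutation set; each entry reorders the 9 patches of a 3x3 grid
PERMUTATIONS = [
    (0, 1, 2, 3, 4, 5, 6, 7, 8),    # identity
    (8, 7, 6, 5, 4, 3, 2, 1, 0),
    (2, 0, 1, 5, 3, 4, 8, 6, 7),
]

def make_jigsaw_example(image, grid = 3):
    # Split the image into grid x grid patches, shuffle them, return (patches, permutation index)
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    patches = [image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
               for r in range(grid) for c in range(grid)]

    label = np.random.randint(len(PERMUTATIONS))         # pseudo-label = permutation index
    shuffled = [patches[i] for i in PERMUTATIONS[label]]
    return np.stack(shuffled), label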





This pretext task encourages the model to reason about object parts, their spatial relationships, and the global structure of a scene — representations that transfer well to downstream recognition tasks.


Reference:
Noroozi, M., and Favaro, P. (2016). Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. In Proceedings of ECCV, pp. 69-84.


(3) Image Colorization

In the image colorization task, the input is a grayscale image and the model is trained to predict the color of each pixel. This requires the model to understand object semantics, textures, and scene context — for instance, recognizing that grass is green, sky is blue, and fur has a particular texture-dependent hue.


Dataset preparation. The first figure illustrates how training data is constructed. Given any collection of color images (top row), a grayscale filter is applied to each image to produce the corresponding grayscale version (bottom row). The grayscale image serves as the input and the original color image serves as the supervision target. This conversion is fully automatic and requires no human annotation whatsoever — any large-scale image dataset such as ImageNet or Places can be used directly, making colorization one of the most naturally scalable pretext tasks.
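A minimal sketch of this automatic pair construction, assuming a batch of RGB images represented as a float tensor in [0, 1]:

import tensorflow as tf

def make_colorization_pair(rgb_images):
    # rgb_images: (batch, H, W, 3) float tensor in [0, 1]
    gray = tf.image.rgb_to_grayscale(rgb_images)   # (batch, H, W, 1) -- model input
    target = rgb_images                            # original colors -- supervision target
    return gray, target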



Network architecture. The second figure illustrates the overall pipeline. The grayscale image is passed through a CNN encoder that compresses the input into a compact latent representation, capturing high-level semantic and structural information. A CNN decoder then upsamples this representation back to the original spatial resolution to produce the predicted color image. The model is optimized by minimizing the loss between the predicted color output and the actual color image.

Once trained, the encoder captures rich semantic and textural representations that transfer effectively to downstream tasks.




Reference:
Zhang, R., Isola, P., and Efros, A. A. (2016). Colorful Image Colorization. In Proceedings of ECCV, pp. 649-666.



(4) Image Super-Resolution

In the super-resolution task, the goal is to reconstruct a high-resolution image from its low-resolution counterpart. The model must learn to recover fine-grained details — sharp edges, textures, and high-frequency patterns — that are lost during downsampling.


Dataset preparation. The first figure illustrates how training pairs are constructed. Given a collection of high-resolution images (top row), downsampling is applied to produce lower-resolution versions of the same images (bottom row). The low-resolution image serves as the input and the original high-resolution image serves as the supervision target. As with colorization, this process is entirely automatic — no human annotation is required, and any large-scale image dataset can be used directly.
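A minimal sketch of this pair construction, assuming statically shaped float image tensors; the scale factor and the bicubic resizing method are illustrative choices.

import tensorflow as tf

def make_sr_pair(hr_images, scale = 2):
    # hr_images: (batch, H, W, 3) float tensor; returns (low-res input, high-res target)
    h, w = hr_images.shape[1], hr_images.shape[2]
    lr = tf.image.resize(hr_images, size = (h // scale, w // scale), method = 'bicubic')
    return lr, hr_images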




Network architecture. The second figure illustrates the SRGAN pipeline. The low-resolution image is passed through a generator network that produces a predicted high-resolution output at $2\times$ the input resolution. The model is trained with a combination of two losses. The L2 loss directly penalizes pixel-wise differences between the predicted and actual high-resolution images, encouraging overall fidelity. The content loss, computed in a deep feature space rather than pixel space, encourages perceptual similarity in terms of textures and structures. In addition, a discriminator network is trained adversarially to distinguish between the generated image (fake, 0) and the real high-resolution image (real, 1), pushing the generator to produce outputs that are not only accurate but also perceptually realistic.

Once trained, the encoder within the generator learns to represent fine-grained image statistics and structural patterns that transfer well to downstream tasks.




Reference: Ledig, C. et al. (2017). Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of CVPR, pp. 4681–4690.


(5) Image Inpainting

In the image inpainting task, a randomly selected rectangular region is removed from an image, and the model is trained to reconstruct the missing content from the surrounding context. To fill in a plausible region, the model must reason about the global scene structure, object boundaries, and local texture patterns — making this a semantically demanding pretext task.


Dataset preparation. The first figure illustrates how training data is constructed. Given a collection of intact images (top row), a random rectangular region is masked out to produce the corrupted input (bottom row). The original unmasked image serves as the supervision target. As with the previous pretext tasks, this process requires no human annotation and scales to any large image dataset.
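A minimal NumPy sketch of this corruption step; the mask size, its placement, and the zero fill value are illustrative assumptions.

import numpy as np

def make_inpainting_pair(image, mask_h = 16, mask_w = 16):
    # Mask a random rectangle; return (corrupted input, original target, binary mask)
    h, w = image.shape[:2]
    top  = np.random.randint(0, h - mask_h + 1)
    left = np.random.randint(0, w - mask_w + 1)

    mask = np.zeros((h, w, 1), dtype = np.float32)
    mask[top:top + mask_h, left:left + mask_w] = 1.0     # 1 inside the missing region

    corrupted = image.copy()
    corrupted[top:top + mask_h, left:left + mask_w] = 0.0
    return corrupted, image, mask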




Network architecture. The second figure illustrates the training pipeline. The corrupted image is passed through a generator network that predicts the missing region, producing a completed image. Two loss signals are used jointly. The reconstruction loss penalizes pixel-wise differences between the predicted and actual image in the masked region, encouraging accurate content recovery. The adversarial loss is provided by a discriminator network trained to distinguish the generator's inpainted output (fake, 0) from the real intact image (real, 1), encouraging the generator to produce completions that are not only accurate but also visually coherent with the surrounding context.





Reference: Pathak, D. et al. (2016). Context Encoders: Feature Learning by Inpainting. In Proceedings of CVPR, pp. 2536–2544.


These pretext tasks — context prediction, jigsaw puzzle solving, colorization, super-resolution, and inpainting — form the foundation of many early self-supervised learning methods. Each provides a mechanism for extracting meaningful visual representations from unlabeled data by requiring the model to solve a structured prediction problem over the image itself.


2.2. Pretext Tasks in Natural Language Processing

Self-supervised learning is not limited to computer vision. In natural language processing (NLP), the same principle applies: supervision signals are constructed automatically from raw text, without any human annotation. In fact, some of the most influential self-supervised learning methods — BERT and GPT — originate from NLP, and their success has directly inspired the development of modern self-supervised vision models.

The key observation is that text has rich internal structure. Words appear in sequences, sentences follow grammatical and semantic rules, and meaning is distributed across context. These properties make it natural to define pretext tasks by withholding part of the text and asking the model to predict it from the remainder.


(1) Masked Language Modeling

Masked language modeling (MLM) is the pretext task used in BERT (Devlin et al., 2019). Given an input sentence, a random subset of tokens — typically 15% — is replaced with a special [MASK] token. The model is trained to predict the original identity of each masked token from the surrounding context.

For example, given the sentence:

"The cat [MASK] on the mat."

the model must predict that the masked token is "sat", using both the left context ("The cat") and the right context ("on the mat") simultaneously. The supervision target is the original token, which is available for free from the raw text corpus.

This task is fundamentally bidirectional: the model attends to both the left and right context of each masked token simultaneously. This encourages the model to build rich, context-aware representations of each word — representations that capture not just the word's identity but its role in the surrounding sentence.
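A minimal sketch of how MLM training examples might be built from tokenized text; the mask token id is a hypothetical placeholder, the tokenizer itself is assumed, and BERT's additional 80/10/10 replacement scheme is omitted for brevity.

import numpy as np

MASK_ID = 103            # hypothetical id of the [MASK] token in the vocabulary

def make_mlm_example(token_ids, mask_prob = 0.15):
    # token_ids: 1-D array of token ids; returns (masked input, targets, mask positions)
    token_ids = np.asarray(token_ids)
    mask = np.random.rand(len(token_ids)) < mask_prob    # choose roughly 15% of positions

    inputs = token_ids.copy()
    inputs[mask] = MASK_ID                   # replace chosen tokens with [MASK]
    targets = token_ids                      # supervision target: the original tokens
    return inputs, targets, mask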


(2) Next Sentence Prediction

Next sentence prediction (NSP) is a second pretext task used in BERT, designed to capture relationships between sentences rather than within a single sentence. Given a pair of sentences (A, B), the model is trained to predict whether sentence B is the actual sentence that follows sentence A in the original document, or a randomly sampled sentence from elsewhere in the corpus.

Training pairs are constructed automatically from any large text corpus: consecutive sentence pairs serve as positive examples, and randomly paired sentences serve as negatives. No human annotation is required.

This task encourages the model to learn discourse-level coherence and inter-sentence relationships, which are important for downstream tasks such as question answering and natural language inference.
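A minimal sketch of this automatic pair construction; the sentence-splitting step and corpus handling are assumed, and label 1 denotes a consecutive (IsNext) pair.

import random

def make_nsp_example(doc_sentences, corpus_sentences):
    # doc_sentences: ordered sentences of one document; corpus_sentences: pool for negatives
    i = random.randrange(len(doc_sentences) - 1)
    if random.random() < 0.5:
        return doc_sentences[i], doc_sentences[i + 1], 1          # consecutive pair (IsNext)
    return doc_sentences[i], random.choice(corpus_sentences), 0   # random pair (NotNext)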


(3) Autoregressive Language Modeling

Autoregressive language modeling is the pretext task underlying the GPT family of models (Radford et al., 2018). Rather than masking tokens at random positions, the model is trained to predict the next token in a sequence given all preceding tokens:


$$\max_\theta \sum_{t} \log P_\theta(x_t \mid x_1, x_2, \ldots, x_{t-1})$$


For example, given the partial sequence "The cat sat on the", the model must predict "mat" as the next token. The supervision target at each position is simply the next word in the original text, making the dataset construction entirely automatic.

This task is unidirectional by design: the model only attends to past context when predicting each token. While this makes autoregressive models less suited for tasks requiring bidirectional understanding, it makes them naturally capable of open-ended text generation — a property that has proven enormously powerful in large language models.
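Constructing training pairs for this objective amounts to shifting the token sequence by one position, as the sketch below illustrates; the token ids are assumed to come from some tokenizer.

import numpy as np

def make_next_token_example(token_ids):
    # token_ids: 1-D array; at each position the target is the next token
    token_ids = np.asarray(token_ids)
    inputs  = token_ids[:-1]     # x_1, ..., x_{T-1}
    targets = token_ids[1:]      # x_2, ..., x_T  (supervision comes for free from the text)
    return inputs, targets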


Comparison of NLP Pretext Tasks

A key observation is that all three tasks require nothing beyond a raw text corpus — Wikipedia, BooksCorpus, or web-crawled data suffice. The scale of available unlabeled text is effectively unlimited, which is why NLP has seen some of the most dramatic gains from self-supervised learning. Pre-trained language models such as BERT and GPT, trained on these pretext tasks, have become the standard initialization for virtually every downstream NLP task, from sentiment analysis and named entity recognition to machine translation and question answering.

This parallel with the vision pipeline is direct: just as rotation prediction or colorization pre-trains a visual encoder that transfers to downstream vision tasks, masked language modeling pre-trains a text encoder that transfers to downstream NLP tasks. The principle is the same — only the modality and the structure of the pretext task differ.


Reference:

Devlin, J., Chang, M. W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL, pp. 4171-4186.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving Language Understanding by Generative Pre-training. OpenAI Blog.


2.3. Pipeline of Self-Supervised Learning

Self-supervised learning is an emerging paradigm that enables models to learn rich and transferable representations without the need for human-labeled data. By designing pretext tasks, the model extracts supervision signals directly from the input data itself, bypassing the annotation bottleneck described in Section 1.

The benefits of this paradigm are threefold. First, similar to supervised pre-training, it enables models to acquire general-purpose feature representations useful for a wide range of downstream tasks. Second, it substantially reduces the cost and effort associated with manually labeling large-scale datasets. Third, it makes it possible to exploit the vast quantities of unlabeled data available from sources such as the web, surveillance systems, and sensor streams — data that supervised learning cannot leverage without annotation.




The first figure illustrates the overall two-stage pipeline. In the first stage (top, red box), a ConvNet is trained on an unlabeled dataset using a pretext task. The network learns to extract visual features by solving the self-defined prediction problem, without any human-provided labels. In the second stage (bottom, yellow box), the pre-trained ConvNet weights are transferred via knowledge transfer to a supervised downstream task setting, where a small labeled dataset is used to train a task-specific head on top of the transferred features.




The second figure provides a more detailed view of the transfer mechanism. During pre-training (left), the model is paired with a pretext task predictor and trained on unlabeled pre-training data. At transfer time, the predictor head used for the pretext task is discarded, and the model weights are carried over to the downstream setting (right). A new predictor is attached and fine-tuned on task-specific labeled data, while the backbone model can be either frozen or jointly fine-tuned depending on the size of the labeled dataset and the degree of domain shift.


The full pipeline therefore proceeds as follows:

  1. A deep neural network is trained on unlabeled data using a pretext task to learn visual representations.
  2. Once trained, the parameters of the network are either frozen or partially fine-tuned depending on the requirements of the target application.
  3. The pre-trained model is transferred to downstream tasks by attaching lightweight task-specific layers.
  4. Performance on these downstream tasks serves as an indirect evaluation of the quality of the pretext task and the usefulness of the learned representations.

Common downstream tasks include image classification, regression tasks such as depth estimation, object detection, and semantic segmentation. This modular transfer process makes self-supervised learning a powerful and flexible framework for representation learning in domains where labeled data is scarce or expensive to obtain.


Reference: Jing, L., and Tian, Y. (2021). Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11), 4037–4058.


3. Self-Supervised Learning with TensorFlow

Pretext task: Rotation Prediction (RotNet)

One notable example of a pretext task is image rotation prediction, as proposed in RotNet (Gidaris et al., 2018). The central hypothesis is that a model can accurately predict the rotation applied to an image only if it has developed a form of visual commonsense — an understanding of what objects look like in their canonical, upright orientation. Recognizing that a bird is upside-down, for instance, requires the model to have learned what a bird looks like in the first place.



The figure above illustrates the data generation process. Given an input image $X$, four rotated versions are produced by applying rotations of $0°$, $90°$, $180°$, and $270°$, yielding $X^0$, $X^1$, $X^2$, and $X^3$ respectively. Each rotated image is paired with its corresponding rotation label $y \in \{0, 1, 2, 3\}$, which is determined entirely by the transformation applied — no human annotation is needed. The pretext task is thus a four-way classification problem over rotation labels.

The self-supervised training process proceeds as follows:

  1. Each input image is rotated by one of four angles: $0°$, $90°$, $180°$, or $270°$.
  2. The rotation label is automatically assigned based on the applied transformation.
  3. The model is trained to predict which of the four rotations has been applied, using standard cross-entropy loss.
  4. Once pre-training converges, the rotation prediction head is discarded and the encoder is transferred to a downstream task.

RotNet: supervised vs. self-supervised performance

Despite the complete absence of label supervision during pre-training, models trained with the RotNet pretext task achieve competitive downstream performance. The performance gap between a RotNet-based model and a fully supervised Network-in-Network (NIN) model is only 1.64 percentage points on the CIFAR-10 benchmark. This result demonstrates that meaningful and transferable representations can be learned from unlabeled data alone, approaching the quality of fully supervised counterparts with a simple and scalable pretext task.


Reference: Gidaris, S., Singh, P., and Komodakis, N. (2018). Unsupervised Representation Learning by Predicting Image Rotations. In Proceedings of ICLR.


Import Library

In [ ]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

Load MNIST Data

In [ ]:
(X_train, Y_train), (X_test, Y_test) = tf.keras.datasets.mnist.load_data()
XX_train = X_train[10000:11000]
YY_train = Y_train[10000:11000]
X_train = X_train[:10000]
Y_train = Y_train[:10000]
XX_test = X_test[300:600]
YY_test = Y_test[300:600]
X_test = X_test[:300]
Y_test = Y_test[:300]
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 0s 0us/step
In [ ]:
print('shape of x_train:', X_train.shape)
print('shape of y_train:', Y_train.shape)
print('shape of xx_train:', XX_train.shape)
print('shape of yy_train:', YY_train.shape)
print('shape of x_test:', X_test.shape)
print('shape of y_test:', Y_test.shape)
print('shape of xx_test:', XX_test.shape)
print('shape of yy_test:', YY_test.shape)
shape of x_train: (10000, 28, 28)
shape of y_train: (10000,)
shape of xx_train: (1000, 28, 28)
shape of yy_train: (1000,)
shape of x_test: (300, 28, 28)
shape of y_test: (300,)
shape of xx_test: (300, 28, 28)
shape of yy_test: (300,)

3.1. Build RotNet for Pretext Task




Dataset for Pretext Task (Rotation)

We need to generate rotated images and their one-hot labels to train the model on the pretext task:

  • [1, 0, 0, 0]: 0$^\circ $ rotation
  • [0, 1, 0, 0]: 90$^\circ $ rotation
  • [0, 0, 1, 0]: 180$^\circ $ rotation
  • [0, 0, 0, 1]: 270$^\circ $ rotation
In [ ]:
n_samples = X_train.shape[0]
X_rotate = np.zeros(shape = (n_samples*4,
                             X_train.shape[1],
                             X_train.shape[2]))
Y_rotate = np.zeros(shape = (n_samples*4, 4))

for i in range(n_samples):
    img = X_train[i]

    # 0 degrees rotation (original image)
    X_rotate[4*i] = img
    Y_rotate[4*i] = tf.one_hot(0, depth = 4)

    # 90 degrees rotation
    X_rotate[4*i+1] = np.rot90(img, k = 1)
    Y_rotate[4*i+1] = tf.one_hot(1, depth = 4)

    # 180 degrees rotation
    X_rotate[4*i+2] = np.rot90(img, k = 2)
    Y_rotate[4*i+2] = tf.one_hot(2, depth = 4)

    # 270 degrees rotation
    X_rotate[4*i+3] = np.rot90(img, k = 3)
    Y_rotate[4*i+3] = tf.one_hot(3, depth = 4)

Plot Dataset for Pretext Task (Rotation)

In [ ]:
plt.figure(figsize = (10, 10))

plt.subplot(141)
plt.imshow(X_rotate[12], cmap = 'gray')
plt.axis('off')

plt.subplot(142)
plt.imshow(X_rotate[13], cmap = 'gray')
plt.axis('off')

plt.subplot(143)
plt.imshow(X_rotate[14], cmap = 'gray')
plt.axis('off')

plt.subplot(144)
plt.imshow(X_rotate[15], cmap = 'gray')
plt.axis('off')
Out[ ]:
(-0.5, 27.5, 27.5, -0.5)
[output image: the four rotated versions of a single MNIST digit]
In [ ]:
X_rotate = X_rotate.reshape(-1,28,28,1)

Build Model for Pretext Task (Rotation)

In [ ]:
model_pretext = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(filters = 64,
                           kernel_size = (3,3),
                           strides = (2,2),
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (28, 28, 1)),

    tf.keras.layers.MaxPool2D(pool_size = (2, 2),
                              strides = (2, 2)),

    tf.keras.layers.Conv2D(filters = 32,
                           kernel_size = (3,3),
                           strides = (1,1),
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (7, 7, 64)),

    tf.keras.layers.MaxPool2D(pool_size = (2, 2),
                              strides = (2, 2)),

    tf.keras.layers.Conv2D(filters = 16,
                           kernel_size = (3,3),
                           strides = (2,2),
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (3, 3, 32)),

    tf.keras.layers.Flatten(),

    tf.keras.layers.Dense(units = 4, activation = 'softmax')

])
model_pretext.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv2d (Conv2D)             (None, 14, 14, 64)        640       
                                                                 
 max_pooling2d (MaxPooling2  (None, 7, 7, 64)          0         
 D)                                                              
                                                                 
 conv2d_1 (Conv2D)           (None, 7, 7, 32)          18464     
                                                                 
 max_pooling2d_1 (MaxPoolin  (None, 3, 3, 32)          0         
 g2D)                                                            
                                                                 
 conv2d_2 (Conv2D)           (None, 2, 2, 16)          4624      
                                                                 
 flatten (Flatten)           (None, 64)                0         
                                                                 
 dense (Dense)               (None, 4)                 260       
                                                                 
=================================================================
Total params: 23988 (93.70 KB)
Trainable params: 23988 (93.70 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Training the model for the pretext task



In [ ]:
model_pretext.compile(optimizer = 'adam',
                      loss = 'categorical_crossentropy',
                      metrics = 'accuracy')

model_pretext.fit(X_rotate,
                  Y_rotate,
                  batch_size = 192,
                  epochs = 50,
                  verbose = 0,
                  shuffle = False)
Out[ ]:
<keras.src.callbacks.History at 0x7e2741df5150>

3.2. Build Downstream Task (MNIST Image Classification)


Freezing trained parameters to transfer them for the downstream task

In [ ]:
model_pretext.trainable = False

Reshape Dataset

In [ ]:
XX_train = XX_train.reshape(-1,28,28,1)
XX_test = XX_test.reshape(-1,28,28,1)
YY_train = tf.one_hot(YY_train, 10,on_value = 1.0, off_value = 0.0)
YY_test = tf.one_hot(YY_test, 10,on_value = 1.0, off_value = 0.0)

Build Model

Model: two convolution layers and one fully connected layer

  • Two convolution layers are transferred from the model for the pretext task
  • Only the single fully connected layer is trained


In [ ]:
model_downstream = tf.keras.models.Sequential([
    model_pretext.get_layer(index = 0),
    model_pretext.get_layer(index = 1),
    model_pretext.get_layer(index = 2),
    model_pretext.get_layer(index = 3),

    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units = 10, activation = 'softmax')
])

model_downstream.summary()
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv2d (Conv2D)             (None, 14, 14, 64)        640       
                                                                 
 max_pooling2d (MaxPooling2  (None, 7, 7, 64)          0         
 D)                                                              
                                                                 
 conv2d_1 (Conv2D)           (None, 7, 7, 32)          18464     
                                                                 
 max_pooling2d_1 (MaxPoolin  (None, 3, 3, 32)          0         
 g2D)                                                            
                                                                 
 flatten_1 (Flatten)         (None, 288)               0         
                                                                 
 dense_1 (Dense)             (None, 10)                2890      
                                                                 
=================================================================
Total params: 21994 (85.91 KB)
Trainable params: 2890 (11.29 KB)
Non-trainable params: 19104 (74.62 KB)
_________________________________________________________________
In [ ]:
model_downstream.compile(optimizer = tf.keras.optimizers.SGD(learning_rate = 0.001,momentum = 0.9),
                         loss = 'categorical_crossentropy',
                         metrics = 'accuracy')

model_downstream.fit(XX_train,
                     YY_train,
                     batch_size = 64,
                     validation_split = 0.2,
                     epochs = 50,
                     verbose = 0,
                     callbacks = tf.keras.callbacks.EarlyStopping(monitor = 'accuracy', patience = 7))
Out[ ]:
<keras.src.callbacks.History at 0x7e26f4b98a00>

Downstream Task Trained Result (Image Classification Result)

In [ ]:
name = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
idx = 9
img = XX_train[idx].reshape(-1,28,28,1)
label = YY_train[idx]
predict = model_downstream.predict(img)
mypred = np.argmax(predict, axis = 1)

plt.figure(figsize = (8, 4))
plt.subplot(1,2,1)
plt.imshow(img.reshape(28, 28), 'gray')
plt.axis('off')
plt.subplot(1,2,2)
plt.stem(predict[0])
plt.show()

print('Prediction : {}'.format(name[mypred[0]]))
1/1 [==============================] - 0s 18ms/step
[output image: the input digit alongside the predicted class probabilities]
Prediction : 2

3.3. Build Supervised Model for Comparison

Convolution Neural Networks for MNIST image classification

  • Model: the same architecture as the model for the downstream task
  • The total number of parameters is the same as in the downstream model, but it has zero non-trainable parameters (every weight is trained from scratch)
In [ ]:
model_sup = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(filters = 64,
                           kernel_size = (3,3),
                           strides = (2,2),
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (28, 28, 1)),

    tf.keras.layers.MaxPool2D(pool_size = (2, 2),
                              strides = (2, 2)),

    tf.keras.layers.Conv2D(filters = 32,
                           kernel_size = (3,3),
                           strides = (1,1),
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (7, 7, 64)),

    tf.keras.layers.MaxPool2D(pool_size = (2, 2),
                              strides = (2, 2)),

    tf.keras.layers.Flatten(),

    tf.keras.layers.Dense(units = 10, activation = 'softmax')

])
model_sup.summary()
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv2d_3 (Conv2D)           (None, 14, 14, 64)        640       
                                                                 
 max_pooling2d_2 (MaxPoolin  (None, 7, 7, 64)          0         
 g2D)                                                            
                                                                 
 conv2d_4 (Conv2D)           (None, 7, 7, 32)          18464     
                                                                 
 max_pooling2d_3 (MaxPoolin  (None, 3, 3, 32)          0         
 g2D)                                                            
                                                                 
 flatten_2 (Flatten)         (None, 288)               0         
                                                                 
 dense_2 (Dense)             (None, 10)                2890      
                                                                 
=================================================================
Total params: 21994 (85.91 KB)
Trainable params: 21994 (85.91 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
In [ ]:
model_sup.compile(optimizer = tf.keras.optimizers.SGD(learning_rate = 0.001, momentum = 0.9),
                  loss = 'categorical_crossentropy',
                  metrics = 'accuracy')
model_sup.fit(XX_train,
              YY_train,
              batch_size = 32,
              validation_split = 0.2,
              epochs = 50,
              verbose = 0)
Out[ ]:
<keras.src.callbacks.History at 0x7e26e27d8bb0>

Experiment: Comparing Self-Supervised and Supervised Learning

In this experiment, we compare the representation quality learned via self-supervised pre-training against a fully supervised baseline, using the MNIST dataset. The experimental setup is as follows.

(1) Pretext task (self-supervised pre-training): A convolutional encoder is trained on 10,000 MNIST images without any labels, using rotation prediction as the pretext task. The model learns to classify which of four rotations ($0°$, $90°$, $180°$, $270°$) has been applied to each image.

(2) Downstream task and supervised baseline: After pre-training, the rotation prediction head is discarded and the encoder is transferred to a digit classification task. A lightweight classifier is attached and trained on only 1,000 labeled MNIST images. Performance is evaluated on 300 labeled test images. The same architecture trained end-to-end with full supervision on the same 1,000 labeled images serves as the baseline for comparison.

(3) Key concepts: In conventional transfer learning, a network such as VGG-16 is pre-trained on a large labeled dataset such as ImageNet, and the learned weights are transferred to a target task. Self-supervised learning replicates this transfer learning pipeline but removes the dependency on labeled pre-training data entirely — the encoder is instead pre-trained on a large pool of unlabeled images using a pretext task. Comparing the downstream classification accuracy of the self-supervised model against the supervised baseline is therefore equivalent to comparing label-free transfer learning against fully supervised learning under the same labeled data budget. This comparison directly quantifies how much useful information the pretext task was able to extract from the unlabeled data.


In [ ]:
test_self = model_downstream.evaluate(XX_test, YY_test, batch_size = 64, verbose = 2)

print("")
print('Self-supervised Learning Accuracy on Test Data:  {:.2f}%'.format(test_self[1]*100))
5/5 - 0s - loss: 11.6132 - accuracy: 0.8100 - 42ms/epoch - 8ms/step

Self-supervised Learning Accuracy on Test Data:  81.00%
In [ ]:
test_sup = model_sup.evaluate(XX_test, YY_test, batch_size = 64, verbose = 2)

print("")
print('Supervised Learning Accuracy on Test Data:  {:.2f}%'.format(test_sup[1]*100))
5/5 - 0s - loss: 1.6948 - accuracy: 0.7867 - 33ms/epoch - 7ms/step

Supervised Learning Accuracy on Test Data:  78.67%

3.4. Conclusion

The experiment above demonstrates that self-supervised pre-training on unlabeled data can yield representations competitive with fully supervised learning, using only a fraction of the labeled data at the downstream stage. This result reinforces the central promise of self-supervised learning: the ability to extract meaningful structure from raw, unannotated data at scale.

However, the results also point to a fundamental design question. The quality of the learned representations is not an intrinsic property of the encoder architecture — it is a direct consequence of the pretext task used during pre-training. A poorly designed pretext task produces representations that are misaligned with the semantics required by downstream tasks. A well-designed pretext task, by contrast, forces the model to internalize the structural regularities of the data that are genuinely useful for a broad range of applications.

This leads to the most important principle in self-supervised learning:


The pretext task is the primary design choice. The richer and more semantically meaningful the self-supervised signal, the more transferable the learned representations will be.


The pretext tasks covered in this section — context prediction, jigsaw puzzle solving, colorization, super-resolution, inpainting, and rotation prediction — each capture a different aspect of visual structure. None of them is universally optimal. Selecting or designing an appropriate pretext task requires understanding both the nature of the data and the requirements of the intended downstream tasks. This challenge of pretext task design remains one of the central research questions in the field.

In [ ]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')