Transformers

By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

Source

CS25: Transformers United V2
- Lectures on Transformers
- By Steven Feng, Karan Singh, Chelsea Zou, Jenny Duan, Div Garg, Christopher Manning at Stanford University
- https://web.stanford.edu/class/cs25/
The Annotated Transformer
- Hands-on explanation and code
- By Harvard NLP Group
- http://nlp.seas.harvard.edu/2018/04/03/attention.html
Transformers from Scratch
- Hugging Face course module on Transformers
- By Hugging Face
- https://huggingface.co/course/chapter1/3
Transformers: State-of-the-Art Natural Language Processing
- Hugging Face Transformers documentation and tutorials
- By Hugging Face
- https://huggingface.co/transformers/
[1] Transformers (Original Paper)
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is All You Need. Advances in Neural Information Processing Systems, 30.

1. From Sequential Modeling to Context Modeling¶

1.1. Motivation¶

Language is often treated as a time series for modeling convenience, but this framing is fundamentally incomplete. A sentence is not a simple left-to-right stream of tokens — it is a network of relationships between words, where meaning emerges from structure, reference, and dependency rather than from temporal proximity alone.

Consider the sentence:

"The trophy didn't fit in the suitcase because it was too big."

The pronoun "it" could refer to either "trophy" or "suitcase", and resolving this ambiguity requires connecting "it" back to "trophy" across six intervening words. A model that processes tokens one by one must somehow preserve this referential relationship through every intermediate step. This illustrates a general principle: meaning is not always determined by neighboring words, and a language model must therefore learn not only the order of tokens but also the relationships between them, which may be local or arbitrarily distant.

1.2. The Bottleneck of Sequential Models¶

RNNs and LSTMs process language sequentially, accumulating past information into a hidden state $h_t$ updated at each time step:

$$h_t = f(x_t, h_{t-1}), \qquad h_t \in \mathbb{R}^d$$

This design forces the model to compress everything it has read — all preceding words, their relationships, and their context — into a single fixed-dimensional vector before processing the next token. As the sequence grows longer, earlier words gradually fade: the gradient signal from an early token must backpropagate through every subsequent time step, attenuating exponentially along the way. This is the vanishing gradient problem.

1.3. Multiple Meanings of a Single Word¶

Natural language is inherently ambiguous — a single word can carry different meanings depending on the context in which it appears.

Consider the following two sentences:

"I swam across the river to get to the other bank."
"I walked across the road to get cash from the bank."

In both sentences, the word "bank" appears, yet its meaning differs entirely based on the surrounding context. In the first sentence, "bank" refers to the edge of a river; in the second, it refers to a financial institution.

How does a model resolve this ambiguity? The key lies in identifying which words in the sentence are most informative. In the first sentence, the word "river" provides the critical clue; in the second, it is "cash".

This example illustrates a fundamental challenge in sequence modeling:

Not all elements of the input sequence contribute equally — some words carry far more contextual weight than others when assigning meaning or making predictions.

This motivates the need for a mechanism that can selectively focus on the most relevant parts of the input. That mechanism is attention.

1.4. From Sequential Processing to Direct Context Modeling¶

The deeper problem is architectural. Sequential hidden states conflate two distinct concerns: the order in which computation proceeds, and the structure of relationships between elements. Natural language requires the latter but does not strictly require the former.

A strong language model should satisfy the following properties:

A word's meaning depends on its full context, not just its position or immediate neighbors
Syntactic and semantic dependencies can span arbitrarily long distances
Multiple words may simultaneously be relevant to interpreting a single token

None of these properties are naturally expressed by a left-to-right recurrence. What is needed is a mechanism that allows any token to directly attend to all other tokens in the sequence and weigh their relevance — regardless of distance. This is the core intuition behind the attention mechanism, and it forms the architectural foundation of the Transformer.

2. The Intuition of Attention¶

2.1. The Core Idea¶

The attention mechanism addresses the bottleneck of sequential models by allowing any token in a sequence to directly interact with any other token, regardless of their distance. Rather than routing information through a chain of hidden states, attention computes a weighted combination of all token representations simultaneously — where the weights reflect how relevant each token is to the one being processed.

Intuitively, attention can be thought of as a soft, differentiable lookup. When interpreting the word "it" in the sentence "The trophy didn't fit in the suitcase because it was too big", a human reader does not replay the entire sentence sequentially — they immediately draw a connection to the most contextually relevant word, "trophy". Attention allows a model to do the same: for each token, it produces a distribution over all other tokens that represents where the model should "look" to gather relevant context.

Visualizing these attention weights shows that the model dynamically focuses on different parts of the input, assigning varying degrees of importance depending on the context. This ability is what makes Transformers highly effective at capturing long-range dependencies and meaning in natural language.

2.2. Attention as Weighted Aggregation¶

Formally, given a sequence of token representations, the attention mechanism computes a new representation for each token as a weighted sum of all token representations in the sequence:

$$\text{output}_i = \sum_{j} \alpha_{ij} \, v_j$$

where $v_j$ is the representation of token $j$, and $\alpha_{ij}$ is the attention weight assigned by token $i$ to token $j$. The weights satisfy:

$$\alpha_{ij} \geq 0, \qquad \sum_{j} \alpha_{ij} = 1$$

so the output is a convex combination of all token representations. A weight $\alpha_{ij}$ close to 1 means token $i$ relies heavily on token $j$ for its contextual representation; a weight close to 0 means token $j$ is largely ignored.
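As a minimal numerical sketch of this weighted sum (the three value vectors and the weights below are made up for illustration, and NumPy is assumed to be available):

```python
import numpy as np

# Three hypothetical token representations v_j, each 2-dimensional
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Hypothetical attention weights alpha_ij for one token i:
# non-negative and summing to 1
alpha = np.array([0.7, 0.2, 0.1])

# output_i = sum_j alpha_ij * v_j
output = alpha @ V
print(output)
```

The result is approximately $(0.8, 0.3)$: it lies inside the triangle spanned by the three value vectors and sits closest to $v_1$, the vector with the largest weight.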

2.3. Analogy: Attention and Class Activation Maps¶

Class Activation Mapping (CAM) is a technique for visualizing which spatial regions of an image a CNN focuses on when making a classification decision. Given the feature map $F^l \in \mathbb{R}^{C \times H \times W}$ at the final convolutional layer and the classification weights $\omega^c \in \mathbb{R}^C$ for class $c$, the class activation map is:

$$\text{CAM}_c(i,j) = \sum_k \omega^c_{k} \cdot F^l_k(i,j)$$

This produces a spatial heatmap that highlights the regions most relevant to the predicted class — a weighted aggregation of feature channels, where the weights reflect class-specific importance.
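The CAM formula is a channel-wise weighted sum and can be sketched in a few lines. The feature map and classifier weights below are random placeholders rather than the output of a trained CNN, and the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 3, 3
F_l = rng.random((C, H, W))   # hypothetical final-layer feature map
w_c = rng.random(C)           # hypothetical classifier weights for class c

# CAM_c(i, j) = sum_k w_c[k] * F_l[k, i, j]
cam = np.einsum("k,kij->ij", w_c, F_l)
print(cam.shape)  # (3, 3)
```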

The structural parallel to attention is direct. In CAM, the model asks: "which spatial locations matter for this classification?" In attention, each token asks: "which other tokens matter for interpreting me?" Both mechanisms answer their respective questions through a weighted sum — CAM over spatial positions, attention over sequence positions. In both cases, the weights are not hand-crafted but emerge from the learned parameters of the model.

There is a deeper conceptual similarity as well. CAM revealed something surprising: a network trained purely on image-level classification labels, with no spatial supervision, nonetheless learned to localize objects. The weights $\omega^{c}$ encode where to look, even though the training signal never explicitly said so. Attention in Transformers exhibits an analogous emergent property — models trained on next-token prediction learn to route information through linguistically meaningful connections, such as coreference and syntactic agreement, without ever being explicitly told to do so.

3. The Concepts of Query, Key, and Value¶

3.1. The Soft Lookup Analogy¶

Before introducing the mathematical formulation of attention, it is helpful to build intuition from a familiar concept in computer science: the dictionary lookup.

In Python, a dictionary maps keys to values and supports exact lookup:

library = {
    "deep learning"   : "Goodfellow et al., 2016",
    "transformers"    : "Vaswani et al., 2017",
    "computer vision" : "Szeliski, 2010",
}

query  = "transformers"
result = library[query]  # returns exactly "Vaswani et al., 2017"

This is a hard lookup — the query must match a key exactly, and only one value is returned. If the query does not match any key, the lookup fails entirely.

3.2. From Hard Lookup to Soft Lookup¶

Now consider a more flexible query:

query = "fast and strongly typed language"

library = {
    "python"  : "interpreted, dynamic typing, readable syntax",
    "c++"     : "compiled, static typing, high performance",
    "java"    : "compiled, static typing, platform independent",
    "haskell" : "functional, static typing, lazy evaluation",
}

keys   = list(library.keys())
values = list(library.values())

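# similarity scores between the query and each key (conceptual, hand-picked)
scores = [0.05, 0.60, 0.30, 0.05]

# output: weighted sum of values (conceptual)
# The values here are strings, so the sum below is pseudocode; it becomes
# executable once each value is a numeric vector.
# output = sum(s * v for s, v in zip(scores, values))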

No key in the dictionary matches the query "fast and strongly typed language" exactly — a hard lookup would fail immediately. Yet intuitively, "c++" and "java" are far more relevant than "python" or "haskell", because their descriptions align closely with the concepts of compilation, static typing, and performance.

A soft lookup handles this gracefully. Rather than demanding an exact match, it:

Computes a similarity score between the query and every key
Converts scores into weights (via softmax, so they sum to 1)
Returns a weighted sum of all values — no entry is excluded entirely

In the example above, the output is dominated by the descriptions of "c++" and "java", while "python" and "haskell" contribute only marginally.

This softening is precisely what makes attention compatible with backpropagation. A hard argmax over keys produces no useful gradient; a differentiable weighted sum does — and can therefore be optimized end-to-end.
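The three steps of the soft lookup can be made executable once keys, values, and the query are numeric vectors. The following is a minimal sketch with made-up 2-dimensional vectors; the `softmax` helper is defined here for illustration and is not part of the original example:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result sums to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical 2-D key and value vectors for three entries
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.7, 0.7]])
values = np.array([[10.0, 0.0],
                   [0.0, 10.0],
                   [5.0, 5.0]])

query = np.array([1.0, 0.2])

scores = keys @ query        # similarity between the query and every key
weights = softmax(scores)    # non-negative weights that sum to 1
output = weights @ values    # weighted sum of all values

print(weights, output)
```

The first key is most similar to the query, so the first value dominates the output, yet every entry contributes a nonzero share.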

4. Word Tokenization and Embedding¶

Before a model can process language, raw text must be converted into a mathematical form amenable to computation. This requires two sequential steps: tokenization, which segments text into discrete units, and embedding, which maps those units into a continuous vector space.

4.1. Tokenization¶

A language model does not process raw text directly. Instead, the input is first decomposed into a sequence of elementary units known as tokens. A token is the minimum unit of input that a language model processes — the model reads, predicts, and generates text at the level of individual tokens.

Depending on the chosen strategy, tokens may correspond to whole words, subword units, or individual characters. The choice of tokenization scheme affects both the vocabulary size and the model's ability to handle rare or unseen words.
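A toy illustration of word-level and character-level tokenization, using only built-in Python. Whitespace splitting is a simplification chosen for this sketch; subword schemes used by practical models require a trained tokenizer and are not shown:

```python
text = "the man bit the dog"

# Word-level tokens: split on whitespace
word_tokens = text.split()

# Character-level tokens: every character, including spaces
char_tokens = list(text)

# Build a vocabulary (token -> integer id) from the word-level tokens
vocab = {tok: idx for idx, tok in enumerate(sorted(set(word_tokens)))}
token_ids = [vocab[tok] for tok in word_tokens]

print(word_tokens)       # ['the', 'man', 'bit', 'the', 'dog']
print(len(char_tokens))  # 19
print(token_ids)         # [3, 2, 0, 3, 1]
```

The word-level vocabulary has 4 entries and yields a sequence of 5 tokens, whereas the character-level version uses a smaller vocabulary but produces a sequence of 19 tokens.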

4.2. Embedding¶

Once the input has been tokenized, each token must be assigned a numerical representation. This is accomplished through an embedding, which maps each token to a real-valued vector in a high-dimensional space.

Formally, let the vocabulary consist of $V$ distinct tokens. Each token is assigned a unique vector in $\mathbb{R}^{d_{\text{model}}}$, where $d_{\text{model}}$ is a user-defined hyperparameter specifying the embedding dimension:

$$\text{token} \xrightarrow{\text{embedding}} \mathbf{x} \in \mathbb{R}^{d_{\text{model}}}$$

The resulting embedding vectors have several important properties. First, they are dense and continuous — in contrast to one-hot encodings, every dimension contributes meaningful information. Second, the vectors are learned parameters, updated during training to minimize the model's loss. Third, and most significantly, the geometry of the embedding space reflects semantic relationships: words with similar meanings tend to occupy nearby regions of the space, while semantically unrelated words are mapped to distant regions.

This geometric structure is not imposed explicitly but emerges naturally from training. As a concrete illustration, consider the embedding vectors for the words "man", "woman", "king", and "queen". Each word is represented as a vector over semantic dimensions such as gender, royalty, and humanness. When these vectors are projected into two-dimensional space, semantically related words appear in close proximity — "king" and "queen" cluster together, as do "man" and "woman" — while the directional offset between "man" and "woman" is approximately preserved in the offset between "king" and "queen". This reflects the model's ability to encode not only similarity but relational structure within the embedding space.
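Mechanically, the embedding step is a table lookup: each token id selects one row of a $V \times d_{\text{model}}$ matrix. The sketch below uses random placeholder values and hypothetical sizes; in a real model the rows are learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 4, 8          # hypothetical sizes

# Embedding table: one row per token id (random stand-ins for learned values)
E = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([3, 2, 0, 3, 1])  # a hypothetical tokenized sentence
X = E[token_ids]                        # embedding lookup = row selection

print(X.shape)  # (5, 8)
```

Positions 0 and 3 hold the same token id, so they receive identical embedding vectors. This is one reason positional information has to be added separately.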

4.3. Positional Encoding¶

One of the key properties of the attention mechanism is that it treats its inputs as a set — it does not inherently account for the order or relative position of elements in a sequence. However, language is fundamentally ordered: "the dog bit the man" and "the man bit the dog" contain identical words but carry entirely different meanings. Positional information must therefore be explicitly injected into the input representations.

This raises a natural question: how should position be encoded and incorporated into the model?

4.3.1. The Binary Analogy¶

Before introducing the mathematical formulation, consider a simple analogy. Suppose we want to assign a unique identifier to each position in a sequence using binary numbers. Position 1 is encoded as 0001, position 2 as 0010, position 3 as 0011, and so on.

Each bit in the binary representation oscillates at a different frequency: the least significant bit flips at every step, the next bit flips every two steps, the next every four steps, and so on. Each position receives a unique code, and the rate of change differs systematically across dimensions — lower-order bits vary rapidly, while higher-order bits vary slowly.
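The different flip rates can be checked directly. This short sketch, in plain Python, lists positions 0 to 7 as 4-bit codes and counts how often each bit changes:

```python
# Positions 0..7 as 4-bit binary codes
codes = [format(pos, "04b") for pos in range(8)]
for pos, code in enumerate(codes):
    print(pos, code)

# Count how often each bit changes between consecutive positions.
# Index 0 is the most significant bit, index 3 the least significant.
flips = [
    sum(codes[p][b] != codes[p + 1][b] for p in range(7))
    for b in range(4)
]
print(flips)  # [0, 1, 3, 7]
```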

4.3.2. Sinusoidal Positional Encoding¶

Instead of binary digits, each dimension of the positional encoding vector oscillates as a sinusoid whose frequency decreases geometrically with the dimension index:

$$\text{PE}_{(pos,\, 2i)} = \sin\left( \frac{pos}{10000^{2i/d_{\text{model}}}} \right), \qquad \text{PE}_{(pos,\, 2i+1)} = \cos\left( \frac{pos}{10000^{2i/d_{\text{model}}}} \right)$$

where $pos$ denotes the position in the sequence, $i \in \{0, 1, \ldots, d_{\text{model}}/2 - 1\}$ indexes the sine-cosine pair, and $d_{\text{model}}$ is the embedding dimensionality. Even-indexed dimensions ($2i$) use the sine function; odd-indexed dimensions ($2i+1$) use the cosine function.

The correspondence to the binary analogy is direct: the highest-frequency sinusoid plays the role of the least significant bit, and the lowest-frequency sinusoid plays the role of the most significant bit.
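The sinusoidal formula translates almost line by line into NumPy. The function name and the sizes below are choices made for this sketch, and $d_{\text{model}}$ is assumed to be even:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal encodings.

    Assumes d_model is even.
    """
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angle = pos / (10000 ** (2 * i / d_model))   # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)   # even dimensions 2i
    pe[:, 1::2] = np.cos(angle)   # odd dimensions 2i + 1
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16)
print(pe[0, :4])  # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

Plotting the columns of `pe` against position shows the first dimensions oscillating quickly and the last ones changing very slowly, mirroring the low-order and high-order bits of the binary code.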

4.3.3. Adding Positional Encodings to Token Embeddings¶

The positional encoding vector $\text{PE}(i) \in \mathbb{R}^{d_{\text{model}}}$ for position $i$ is added element-wise to the embedding $E(\text{token}_i) \in \mathbb{R}^{d_{\text{model}}}$ of the token at that position:

$$X_i = E(\text{token}_i) + \text{PE}(i)$$

This additive operation allows the model to retain semantic information from the token embedding while simultaneously acquiring awareness of sequential order from the positional encoding.

A natural concern arises at this point. If the two vectors are simply added together, the resulting vector $X_i$ cannot be decomposed back into its two components — the semantic content and the positional signal are entangled. Does this mean that information is lost in the process?

The answer is that information is indeed mixed, but not lost in any harmful sense.

A useful analogy is that of a musical chord. When two notes are played simultaneously, the resulting sound cannot be cleanly separated back into its individual tones by a passive listener. Yet the chord as a whole conveys richer information than either note alone, and a trained musician can respond appropriately to it. In the same way, the summed vector $X_i$ encodes both what the token is and where it appears, and the model learns to respond to this joint signal.

Furthermore, in practice, the embedding and positional encoding occupy somewhat different regions of the high-dimensional space, and their sum tends to preserve both signals in a recoverable manner, not by explicit decomposition but through the learned weights of the model. Empirically, models trained with this scheme learn to use positional structure when it matters and to rely on semantic content otherwise, which suggests that neither source of information is effectively suppressed by the addition.

4.4. Computing Similarity Between Token Vectors¶

Having represented each token as a vector in $\mathbb{R}^d$, the attention mechanism must determine how relevant each token is to every other token. This requires a measure of similarity between vectors.

A natural choice is the cosine similarity, which measures the angle between two vectors regardless of their magnitude:

$$\cos\theta = \frac{q \cdot k}{\|q\| \|k\|}$$

In practice, however, when the embedding vectors are normalized or when the magnitudes are controlled through training, the denominator becomes approximately constant. In this case, cosine similarity is proportional to the dot product, which can be used directly as the score:

$$\text{score}(q, k) = q \cdot k = \sum_{i=1}^{d} q_i k_i$$

The intuition is straightforward: if two token representations point in similar directions in the embedding space, their dot product is large — they are semantically related. If they are orthogonal, the dot product is zero — they carry unrelated information.

In matrix form, when computing all query-key similarity pairs simultaneously:

$$\text{score}(Q, K) = Q K^T$$

Scaling the Dot Product

There is one practical issue with the raw dot product. As the dimension $d$ grows, the dot products tend to grow large in magnitude. To see why, consider $q$ and $k$ with i.i.d. entries drawn from $\mathcal{N}(0, 1)$. The dot product $q \cdot k = \sum_{i=1}^d q_i k_i$ has variance $d$, so its standard deviation grows as $\sqrt{d}$. Large dot products push the subsequent softmax into saturation — one score dominates and the remaining weights collapse toward zero, causing gradients to vanish.

To correct for this, the dot product is divided by $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors:

$$\text{score}(Q, K) = \frac{Q K^T}{\sqrt{d_k}}$$

This scaling keeps the variance of the scores at 1 regardless of $d_k$, ensuring stable softmax outputs and well-behaved gradients throughout training. This operation is known as scaled dot-product attention.
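The variance argument can be checked empirically. The sketch below draws random standard-normal queries and keys (the sample count and the dimensions are arbitrary choices) and compares the variance of the raw and scaled dot products:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 20000

results = {}
for d in (4, 64, 256):
    q = rng.standard_normal((n_samples, d))
    k = rng.standard_normal((n_samples, d))
    dots = np.sum(q * k, axis=1)     # raw dot products
    scaled = dots / np.sqrt(d)       # scaled dot products
    results[d] = (dots.var(), scaled.var())
    print(d, round(results[d][0], 1), round(results[d][1], 2))
```

The variance of the raw dot products should come out close to $d$ in each case, while the variance of the scaled version stays close to 1.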

4.5. Attention¶

Visualization of Attention as Matrix Operations

To build intuition for the attention computation, it is instructive to trace the dimensions of each matrix involved. Consider the special case of a single query vector $q \in \mathbb{R}^{1 \times d_k}$, with $m$ key and value vectors:

$$q \in \mathbb{R}^{1 \times d_k}, \qquad K \in \mathbb{R}^{m \times d_k}, \qquad V \in \mathbb{R}^{m \times d_v}$$

The attention computation proceeds in two steps. First, the similarity between the query and every key is computed via the scaled dot product:

$$\frac{q K^T}{\sqrt{d_k}} \in \mathbb{R}^{1 \times m}$$

This produces a row vector of $m$ similarity scores — one for each key. Applying softmax converts these scores into attention weights $\alpha \in \mathbb{R}^{1 \times m}$, which sum to one.

Second, the output is computed as a weighted sum of the value vectors:

$$y = \alpha V \in \mathbb{R}^{1 \times d_v}$$

As illustrated in the figure below, the single query vector (a thin horizontal slice) is multiplied against $K^T$ to produce a thin score vector of length $m$. This score vector is then multiplied against $V$ to produce the output $y$ — again a thin horizontal vector of dimension $d_v$. The shape of the query propagates through the entire computation: a single query in, a single output vector out.

The output $y$ is not any single value vector, but a soft blend of all value vectors:

$$y = \alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_m v_m$$

dominated by the values whose keys were most similar to the query.

The General Case: Multiple Queries

In practice, a sequence of $m$ tokens is processed simultaneously rather than one query at a time. Each token acts as a query and must attend to all other tokens in the sequence. The computation therefore extends naturally to the case of multiple queries, stacked as rows of the query matrix $Q$:

$$Q \in \mathbb{R}^{m \times d_k}, \qquad K \in \mathbb{R}^{m \times d_k}, \qquad V \in \mathbb{R}^{m \times d_v}$$

As illustrated in the figure above, $Q$ and $K$ are now both block matrices rather than thin slices. The scaled dot product between all query-key pairs is computed with a single matrix multiplication followed by a scalar division:

$$\frac{QK^T}{\sqrt{d_k}} \in \mathbb{R}^{m \times m}$$

Each entry $(i, j)$ of this matrix represents the similarity score between the $i$-th query and the $j$-th key. Applying softmax row-wise converts each row into a valid probability distribution over the $m$ keys, yielding the attention weight matrix $A \in \mathbb{R}^{m \times m}$.

The output matrix is then obtained by multiplying the attention weights against the value matrix:

$$Y = AV \in \mathbb{R}^{m \times d_v}$$

Each row $y_i$ of $Y$ is the output representation for the $i$-th token — a weighted sum of all value vectors, where the weights reflect how closely each key matched that token's query:

$$y_i = \sum_{j=1}^{m} \alpha_{ij} v_j$$

The full attention computation for an entire input sequence of $m$ tokens is therefore carried out in three matrix operations: $QK^T$, softmax, and multiplication by $V$. This is computationally efficient and, crucially, fully parallelizable — all tokens are processed simultaneously without the sequential dependencies that characterize recurrent architectures.
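The matrix form can be sketched in NumPy as follows. The helper names `softmax` and `attention` and all sizes are choices made for this illustration, and the inputs are random placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (m_q, m) scaled similarity scores
    A = softmax(scores, axis=-1)      # each row sums to 1
    return A @ V, A                   # output (m_q, d_v) and weights

rng = np.random.default_rng(0)
m, d_k, d_v = 5, 8, 6                 # hypothetical sizes
Q = rng.standard_normal((m, d_k))
K = rng.standard_normal((m, d_k))
V = rng.standard_normal((m, d_v))

Y, A = attention(Q, K, V)
print(A.shape, Y.shape)               # (5, 5) (5, 6)

# A single query (one row of Q) gives the matching row of Y
y0, _ = attention(Q[:1], K, V)
print(np.allclose(y0[0], Y[0]))       # True
```

The last check makes the relationship between the two cases explicit: the multi-query computation is the single-query computation applied to every row of $Q$ at once.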

4.6. Put Them Together¶

(1) Projecting Inputs into Query, Key, and Value Spaces

So far, we have treated $Q$, $K$, and $V$ as given. In practice, however, the model receives a single input: the embedded token matrix $X \in \mathbb{R}^{N \times d_{\text{model}}}$, where $N$ is the number of tokens in the sequence (written $m$ in the previous section) and $d_{\text{model}}$ is the embedding dimension. The question then arises naturally: where do $Q$, $K$, and $V$ come from?

The answer is that $Q$, $K$, and $V$ are each obtained by linearly projecting the same input matrix $X$ through three separate learned weight matrices:

$$Q = X W^Q, \qquad K = X W^K, \qquad V = X W^V$$

where the projection matrices are:

$$W^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}, \qquad W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}, \qquad W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$$

As illustrated in the figure below, the same input $X \in \mathbb{R}^{N \times d_{\text{model}}}$ is multiplied by each of the three weight matrices independently, producing three distinct projections: $Q$ and $K$ of size $N \times d_k$, and $V$ of size $N \times d_v$. Each projection extracts a different aspect of the input: the query projection encodes what each token is looking for, the key projection encodes what each token has to offer, and the value projection encodes the content to be aggregated.

These three weight matrices, $W^Q$, $W^K$, and $W^V$, are learnable parameters of the model. They are initialized randomly and updated through gradient descent during training. The model thereby learns, from data alone, how to project the input into a space where meaningful similarity comparisons can be made. There is no hand-crafted notion of relevance — the attention pattern that emerges is entirely determined by what the task requires.

In other words, for attention to work correctly, the model must learn to project the input into query, key, and value spaces that are appropriate for the task at hand. This is precisely what training achieves: by optimizing $W^Q$, $W^K$, and $W^V$ through gradient descent, the model learns which aspects of each token should be compared, which should be retrieved, and how the resulting outputs should be composed. The quality of attention is therefore inseparable from the quality of these three learned projections.
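A shape-level sketch of the three projections is given below. The weight matrices are random stand-ins for learned parameters, and the sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_model, d_k, d_v = 5, 16, 8, 8     # hypothetical sizes

X = rng.standard_normal((N, d_model))  # embedded tokens (plus positions)

# Randomly initialized stand-ins for the learned projection matrices
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)       # (5, 8) (5, 8) (5, 8)
```

All three matrices are computed from the same $X$; only the weights differ, which is what lets training assign a distinct role to each projection.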

(2) Compute attention weighting and extract features with high attention

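How similar is each key (K) to the query (Q)? The scaled dot products, passed through a row-wise softmax, give the attention weights.

The values (V) are then combined in a weighted sum, so the values with the highest attention weights contribute the most.

This yields the output, which is given by the equation below: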

$$Y = \text{Attention}(Q,K,V) \equiv \text{Softmax} \left[ \dfrac{Q K^T}{\sqrt{d_k}} \right] V $$

4.7. Multi-Head Attention¶

The attention operation described in the previous section constitutes a single self-attention head. A single head projects the input into one query, key, and value space and learns one pattern of relevance across the sequence. While this is already expressive, it is limited to attending to one type of relationship at a time.

In practice, language exhibits multiple types of dependency simultaneously. A single token may be syntactically related to one token, coreferentially linked to another, and semantically associated with a third. A single attention head cannot capture all of these patterns at once, as it produces a single weighted combination of the value vectors.

Multi-head attention addresses this limitation by running $h$ attention heads in parallel, each with its own independent projection matrices $W^Q_i$, $W^K_i$, and $W^V_i$:

$$\text{head}_i = \text{Attention}(X W^Q_i,\ X W^K_i,\ X W^V_i), \qquad i = 1, \ldots, h$$

Each head operates in a lower-dimensional subspace (typically $d_k = d_v = d_{\text{model}} / h$) and learns to attend to a different aspect of the input. The outputs of all heads are then concatenated and passed through a final linear projection $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$

This allows the model to jointly attend to information from multiple representational subspaces, making multi-head attention significantly more expressive than a single head alone. Each attention head serves as a modular building block that can be composed into larger network architectures, as will be discussed in the following section.
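The head-by-head computation can be sketched with an explicit loop. This is for illustration only: practical implementations batch all heads into a single tensor operation, and every weight below is a random placeholder. The per-head size $d_k = d_v = d_{\text{model}} / h$ is an assumption of this sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
N, d_model, h = 5, 16, 4          # hypothetical sizes
d_k = d_v = d_model // h          # per-head dimension (assumed)

X = rng.standard_normal((N, d_model))

heads = []
for _ in range(h):
    # Each head has its own randomly initialized projections
    W_Q = rng.standard_normal((d_model, d_k))
    W_K = rng.standard_normal((d_model, d_k))
    W_V = rng.standard_normal((d_model, d_v))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))   # (N, d_v)

W_O = rng.standard_normal((h * d_v, d_model))
out = np.concatenate(heads, axis=-1) @ W_O               # (N, d_model)
print(out.shape)   # (5, 16)
```

Because the output has the same shape as the input, the layer can be wrapped in a residual connection and stacked.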

4.8. Basic Transformer Block¶

Having introduced tokenization, embedding, positional encoding, and multi-head attention, we are now in a position to assemble these components into a single architectural unit known as the Transformer block.

A Transformer block consists of two sub-layers. The first is a multi-head self-attention layer, which allows each token to gather information from all other tokens in the sequence. The second is a position-wise feed-forward network (MLP), which applies a nonlinear transformation independently to each token representation. The specific configuration of these sub-layers depends on the purpose of the model.

Two additional components are incorporated to improve training stability and optimization efficiency.

The first is a residual connection, which bypasses each sub-layer by adding the input directly to the output:

$$x \leftarrow x + \text{SubLayer}(x)$$

Residual connections allow gradients to flow directly through the network without passing through the sub-layer, alleviating the vanishing gradient problem in deep architectures.

The second is layer normalization, applied after each residual addition:

$$x \leftarrow \text{LayerNorm}(x + \text{SubLayer}(x))$$

Layer normalization stabilizes the distribution of activations across the embedding dimension, leading to faster and more reliable convergence during training.

Together, these components form a single Transformer block. Multiple such blocks are stacked sequentially to form the full model, with each block refining the token representations produced by the previous one.
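The data flow through one block can be sketched in NumPy. Several simplifications are assumed here: a single attention head instead of multiple heads, no bias terms, no dropout, and a layer normalization without learnable scale and shift. All weights are random placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector across the embedding dimension
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x, W_Q, W_K, W_V):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def feed_forward(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2   # ReLU MLP applied per token

rng = np.random.default_rng(0)
N, d_model, d_ff = 5, 16, 32              # hypothetical sizes
x = rng.standard_normal((N, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))
W1 = rng.standard_normal((d_model, d_ff)) * 0.1
W2 = rng.standard_normal((d_ff, d_model)) * 0.1

x = layer_norm(x + self_attention(x, W_Q, W_K, W_V))  # sub-layer 1
x = layer_norm(x + feed_forward(x, W1, W2))           # sub-layer 2
print(x.shape)   # (5, 16)
```

The input and output shapes match, so applying the same two lines repeatedly with fresh weights corresponds to stacking blocks.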

5. Transformers in Practice¶

Encoder only models
- To produce embeddings from input data
- Analyzing text, not generating text
- e.g. BERT
Decoder only models
- Autoregressively generating text with prompts
- Careful with attention: Causal attention
- e.g. GPT
Encoder-decoder models
- Cross attention: Query (Q) from decoder, key (K) & value (V) from encoder
- e.g. BART

5.1. Encoder only transformers¶

Given input sequences, produce fixed-length vectors or class labels
The model is trained to predict the masked (hidden) input tokens at the corresponding output positions.

5.2. Decoder only transformers¶

The encoder might not be necessary if the model simply learns to predict the next token from the preceding ones.
Masked attention, i.e. causal attention
- To ensure that the network does not cheat by looking ahead in the sequence
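A common way to implement causal attention is to set the scores of future positions to $-\infty$ before the softmax, so that their weights become exactly zero. The sketch below uses random queries and keys with hypothetical sizes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d_k = 4, 8                             # hypothetical sizes
Q = rng.standard_normal((T, d_k))
K = rng.standard_normal((T, d_k))

scores = Q @ K.T / np.sqrt(d_k)

# True above the diagonal: position i must not see positions j > i
future = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)

A = softmax(scores)
print(np.round(A, 2))   # upper triangle is exactly 0
```

Each row still sums to 1, but the first token can attend only to itself, the second to the first two tokens, and so on.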

5.3. Encoder-decoder transformers¶

Query (Q) vectors come from the sequence being generated
Key (K) and value (V) vectors come from the representation $\textbf{z}$ (output of encoder)
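Cross attention can be sketched at the level of shapes. The names `z` (encoder output) and `y` (decoder-side representations), the sizes, and the random weights are all assumptions of this illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T_dec, T_enc, d_model, d_k = 3, 6, 16, 8   # hypothetical sizes

z = rng.standard_normal((T_enc, d_model))  # encoder output
y = rng.standard_normal((T_dec, d_model))  # sequence being generated

W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q = y @ W_Q    # queries come from the decoder side
K = z @ W_K    # keys come from the encoder output
V = z @ W_V    # values come from the encoder output

A = softmax(Q @ K.T / np.sqrt(d_k))   # (T_dec, T_enc)
out = A @ V                            # (T_dec, d_k)
print(A.shape, out.shape)              # (3, 6) (3, 8)
```

The attention matrix is rectangular: each of the 3 decoder positions distributes its weight over the 6 encoder positions.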

6. Transformers for Classification¶

6.1. Loading the dataset and defining the dataloader¶

Download data

Train X here

Train y here

import numpy as np
import math
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split
import random

random.seed(42)

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

X = np.load('/content/drive/MyDrive/DL/DL_data/TransformerData/classification_X.npy')
y = np.load('/content/drive/MyDrive/DL/DL_data/TransformerData/classification_y.npy')

print(f"Loaded X {X.shape}, y {y.shape}")

Loaded X (10000, 2000), y (10000,)

class TimeSeriesDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.from_numpy(X).float()
        self.y = torch.from_numpy(y).long()
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

ds = TimeSeriesDataset(X, y)
n_train = int(0.8 * len(ds))
torch.manual_seed(42)
generator = torch.Generator().manual_seed(42)
train_ds, test_ds = random_split(ds, [n_train, len(ds) - n_train], generator=generator)

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
test_loader  = DataLoader(test_ds,  batch_size=64, shuffle=False)

print(f"Train samples: {len(train_ds)}, Test samples: {len(test_ds)}")

Train samples: 8000, Test samples: 2000

Visualize a few samples

num_examples = 3
plt.figure(figsize=(6, 2 * num_examples))
for i in range(num_examples):
    sig, label = ds[i]
    plt.subplot(num_examples, 1, i+1)
    plt.plot(sig.numpy(), label=f"class={label.item()}")
    plt.title(f"Example {i} — class {label.item()}")
plt.tight_layout()
plt.show()

6.2. Define Necessary Modules and Architecture¶

Self attention head and multi-head attention

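class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k, dropout=0.0):
        super().__init__()
        self.d_k = d_k
        self.dropout = nn.Dropout(dropout) if dropout > 0 else None

    def forward(self, Q, K, V):
        # Q, K, V: (B, n_heads, T, d_k)
        # scores: (B, n_heads, T, T), scaled by 1 / sqrt(d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        # softmax over the key dimension so each query's weights sum to 1
        weights = F.softmax(scores, dim=-1)
        if self.dropout is not None:
            weights = self.dropout(weights)
        # weighted sum of the value vectors: (B, n_heads, T, d_k)
        out = torch.matmul(weights, V)
        return out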

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads, dropout=0.0):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        self.q_lin = nn.Linear(d_model, d_model)
        self.k_lin = nn.Linear(d_model, d_model)
        self.v_lin = nn.Linear(d_model, d_model)

        self.attn = ScaledDotProductAttention(self.d_k, dropout)
        self.fc   = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout) if dropout > 0 else None
        self.norm    = nn.LayerNorm(d_model)

    def forward(self, x):
        B, T, _ = x.size()
        Q = self.q_lin(x).view(B, T, self.n_heads, self.d_k).permute(0,2,1,3)
        K = self.k_lin(x).view(B, T, self.n_heads, self.d_k).permute(0,2,1,3)
        V = self.v_lin(x).view(B, T, self.n_heads, self.d_k).permute(0,2,1,3)

        out = self.attn(Q, K, V)

        out = out.permute(0,2,1,3).contiguous().view(B, T, -1)
        out = self.fc(out)
        if self.dropout:
            out = self.dropout(out)

        return self.norm(x + out)
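The reshaping in `MultiHeadAttention.forward` can be checked on a toy tensor. The following is a minimal sketch in NumPy (used only to keep the example dependency-light; `split_heads` and `merge_heads` are hypothetical helper names, not part of the notebook). It suggests that splitting `d_model` into `n_heads` chunks and merging them back loses no information.

```python
import numpy as np

def split_heads(x, n_heads):
    """(B, T, d_model) -> (B, n_heads, T, d_k), mirroring view + permute(0, 2, 1, 3)."""
    B, T, d_model = x.shape
    d_k = d_model // n_heads
    return x.reshape(B, T, n_heads, d_k).transpose(0, 2, 1, 3)

def merge_heads(x):
    """(B, n_heads, T, d_k) -> (B, T, d_model), mirroring permute + contiguous + view."""
    B, H, T, d_k = x.shape
    return x.transpose(0, 2, 1, 3).reshape(B, T, H * d_k)

x = np.arange(2 * 3 * 8, dtype=np.float32).reshape(2, 3, 8)   # B=2, T=3, d_model=8
heads = split_heads(x, n_heads=4)

print(heads.shape)                                    # (2, 4, 3, 2)
print(np.array_equal(merge_heads(heads), x))          # True
# Head h of token t holds features [h*d_k, (h+1)*d_k) of that token
print(np.array_equal(heads[0, 1, 2], x[0, 2, 2:4]))   # True
```

Each head therefore sees a contiguous slice of the projected features, and the final `view(B, T, -1)` simply concatenates the heads again.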

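Design a (single) encoder layer
- For convenience, the layer below uses PyTorch's built-in nn.MultiheadAttention, which performs the same multi-head scaled dot-product attention as the modules above and can also return the attention weights
- The residual connection and layer normalization are applied inside the encoder layer itself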

class EncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=nhead,
            dropout=dropout,
            batch_first=True
        )
        self.linear1 = nn.Linear(d_model, d_ff)
        self.act     = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(d_ff, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop1 = nn.Dropout(dropout)
        self.drop2 = nn.Dropout(dropout)

        self.attn_weights = None

    def forward(self, x):
        attn_out, weights = self.self_attn(
            x, x, x,
            need_weights=True,
            average_attn_weights=False
        )
        self.attn_weights = weights

        x = x + self.drop1(attn_out)
        x = self.norm1(x)

        ff = self.linear2(self.dropout(self.act(self.linear1(x))))
        x = x + self.drop2(ff)
        x = self.norm2(x)

        return x

Stack encoder layers to build an encoder-only Transformer
- An MLP head is added at the end for classification

class Transformer(nn.Module):
    def __init__(self, seq_len=2000, d_model=64, nhead=4, nlayers=3, d_ff=128):
        super().__init__()

        self.proj = nn.Sequential(
            nn.Linear(1, d_model),
            nn.LayerNorm(d_model)
        )
        self.pos = nn.Parameter(torch.randn(1, seq_len, d_model))

        self.layers = nn.ModuleList([
            EncoderLayer(d_model, nhead, d_ff, dropout=0.1)
            for _ in range(nlayers)
        ])

        self.cls = nn.Sequential(
            nn.Linear(d_model, 64),
            nn.ReLU(),
            nn.Linear(64, 2)
        )

        self.attn_maps = []

    def forward(self, x):
        B, T = x.shape
        x = self.proj(x.unsqueeze(-1))
        x = x + 0.1 * self.pos[:, :T]

        self.attn_maps = []

        for lyr in self.layers:
            x = lyr(x)
            self.attn_maps.append(lyr.attn_weights)

        logits = self.cls(x.mean(dim=1))
        return logits
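Because every sample is treated as a sequence of 2000 scalar tokens and each layer keeps its per-head attention weights (`need_weights=True`, `average_attn_weights=False`), the stored maps grow quadratically with `seq_len`. A rough back-of-the-envelope estimate, assuming float32 weights and the settings used above (the helper name `attn_map_bytes` is hypothetical):

```python
def attn_map_bytes(batch, n_heads, seq_len, bytes_per_value=4):
    """Bytes needed to hold one layer's attention weights of shape (B, H, T, T)."""
    return batch * n_heads * seq_len * seq_len * bytes_per_value

per_layer = attn_map_bytes(batch=64, n_heads=4, seq_len=2000)   # training batch size
total     = 3 * per_layer                                        # nlayers = 3
single    = attn_map_bytes(batch=1, n_heads=4, seq_len=2000)     # one signal at inference

print(f"per layer : {per_layer / 1e9:.3f} GB")   # per layer : 4.096 GB
print(f"all layers: {total / 1e9:.3f} GB")       # all layers: 12.288 GB
print(f"one sample: {single / 1e6:.0f} MB")      # one sample: 64 MB
```

This counts only the attention maps themselves; actual memory use depends on the backend and on whether gradients are tracked. If memory is tight, a smaller batch size or collecting the maps only at evaluation time are natural options.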

6.3. Training¶

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model  = Transformer(seq_len=X.shape[1]).to(device)

lr        = 5e-4
epochs    = 5
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

for epoch in range(1, epochs + 1):
    model.train()
    total_loss    = 0.0
    total_correct = 0
    total         = 0

    for Xb, yb in train_loader:
        Xb, yb = Xb.to(device), yb.to(device)

        optimizer.zero_grad()
        logits = model(Xb)
        loss   = criterion(logits, yb)
        loss.backward()
        optimizer.step()

        total_loss    += loss.item() * yb.size(0)
        preds          = logits.argmax(dim=1)
        total_correct += (preds == yb).sum().item()
        total         += yb.size(0)

    train_loss = total_loss / total
    train_acc  = total_correct / total * 100
    print(f"Epoch {epoch}/{epochs} | "
          f"Train loss: {train_loss:.4f} | "
          f"Train acc: {train_acc:5.2f}%")

Epoch 1/5 | Train loss: 0.4441 | Train acc: 75.31%
Epoch 2/5 | Train loss: 0.0445 | Train acc: 98.78%
Epoch 3/5 | Train loss: 0.0339 | Train acc: 98.83%
Epoch 4/5 | Train loss: 0.0296 | Train acc: 98.99%
Epoch 5/5 | Train loss: 0.0250 | Train acc: 99.26%

You can skip training by downloading the pretrained model here

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = Transformer(seq_len=X.shape[1], d_model=64, nhead=4, nlayers=3, d_ff=128)
model = model.to(device)

load_path = '/content/drive/MyDrive/DL/DL_data/TransformerData/transformer_weights.pth'
model.load_state_dict(torch.load(load_path, map_location=device))
model.eval()

print(f"Loaded model weights from {load_path}")

Loaded model weights from /content/drive/MyDrive/DL/DL_data/TransformerData/transformer_weights.pth

6.4. Analysis of Results¶

Validation accuracy and visualization of samples

model.eval()

criterion = nn.CrossEntropyLoss()
test_loss = 0.0
test_correct = 0
total = 0

with torch.no_grad():
    for batch in test_loader:
        Xb, yb = batch
        Xb, yb = Xb.to(device), yb.to(device)

        logits = model(Xb)
        loss = criterion(logits, yb)

        bs = yb.size(0)
        test_loss += loss.item() * bs

        preds = logits.argmax(dim=1)
        correct = (preds == yb).sum().item()
        test_correct += correct

        total += bs

val_loss = test_loss / total
val_acc = test_correct / total * 100
print("Val Loss:", round(val_loss, 4), " Val Acc:", round(val_acc, 2), "%")

all_indices = list(range(len(test_ds)))
sample_indices = random.sample(all_indices, 3)

for i in range(3):
    idx = sample_indices[i]
    sig, true_label = test_ds[idx]
    sig_np = sig.numpy()

    with torch.no_grad():
        out = model(sig.unsqueeze(0).to(device))
        pred_label = out.argmax(dim=1).cpu().item()

    plt.figure(figsize=(6, 2))
    plt.plot(sig_np)
    plt.title("Sample " + str(i+1) +
              " (idx=" + str(idx) + ") — True: " + str(true_label.item()) +
              ", Pred: " + str(pred_label))
    plt.xlabel("Timestep")
    plt.ylabel("Signal")
    plt.show()

Val Loss: 0.0132  Val Acc: 99.65 %

Visualization of attention maps

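sig_idx = 500
sig     = X[test_ds.indices[sig_idx]]          # map the split index back to the raw array
# Cast explicitly to float32 so the input dtype matches the model parameters
sig_t   = torch.tensor(sig, dtype=torch.float32).unsqueeze(0).to(device)

model.eval()
with torch.no_grad():
    _ = model(sig_t)

# Shape: (n_layers, batch, n_heads, T, T)
attn_all = torch.stack(model.attn_maps)

# Average over heads, then over layers; keep the single batch element -> (T, T)
attn_avg = attn_all.mean(dim=2).mean(dim=0)[0]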

plt.figure(figsize=(5,4))
plt.title("Average attention (all layers & heads)")
plt.imshow(attn_avg.cpu(), origin='lower', aspect='auto')
plt.colorbar(label="attention weight")
plt.xlabel("Key timestep"); plt.ylabel("Query timestep")
plt.clim([0,0.015])
plt.tight_layout()
plt.show()

layer_id = 2

maps_per_head = attn_all[layer_id, 0]

n_heads = maps_per_head.shape[0]
cols    = min(n_heads, 4)
rows    = (n_heads + cols - 1) // cols

fig, axes = plt.subplots(rows, cols, figsize=(8, 3))
axes = axes.flatten()

for h in range(n_heads):
    ax = axes[h]
    im = ax.imshow(maps_per_head[h].cpu(), origin='lower', aspect='auto', vmin=0, vmax=0.002)
    ax.set_title(f"Layer {layer_id}, Head {h}")
    ax.set_xlabel("Key timestep")
    ax.set_ylabel("Query timestep")

for ax in axes[n_heads:]:
    ax.axis("off")

plt.tight_layout()
plt.show()

Overlay on top of original signal

sig     = X[test_ds.indices[sig_idx]]
sig_lbl = int(y[test_ds.indices[sig_idx]])     # true class of this sample
# Cast explicitly to float32 so the input dtype matches the model parameters
sig_t   = torch.tensor(sig, dtype=torch.float32).unsqueeze(0).to(device)

model.eval()
with torch.no_grad():
    _ = model(sig_t)

attn_all = torch.stack(model.attn_maps)

# Average over heads and layers, then over queries:
# the result is the attention received by each key timestep, shape (T,)
score = attn_all.mean(dim=2).mean(dim=0).mean(dim=1)[0].cpu().numpy()
min_s = score.min()
max_s = score.max()
score = (score - min_s) / (max_s - min_s + 1e-9)

thr = 0.1
mask = score >= thr

plt.figure(figsize=(6, 3))
plt.plot(sig, label=f"Signal (class={sig_lbl})")
plt.fill_between(
    np.arange(len(sig)),
    sig.min(), sig.max(),
    where=mask,
    color="red", alpha=0.25,
    label=f"Attention ≥ {thr:.2f}"
)
plt.xlabel("Timestep")
plt.ylabel("Signal amplitude")
plt.legend()
plt.tight_layout()
plt.show()
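The overlay relies on two small steps: min-max normalization of the per-timestep score, and a threshold that turns it into a boolean mask. A toy sketch with made-up scores (the numbers are illustrative and not taken from the model):

```python
import numpy as np

score = np.array([0.002, 0.004, 0.010, 0.003, 0.002])   # hypothetical attention per key

# Rescale to [0, 1]; the small epsilon guards against a constant score vector
score = (score - score.min()) / (score.max() - score.min() + 1e-9)

thr  = 0.1
mask = score >= thr

print(mask.tolist())   # [False, True, True, True, False]
```

Only timesteps whose normalized score reaches the threshold are shaded, so the choice of `thr` directly controls how much of the signal is highlighted.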

7. Time-series Forecasting with Transformers¶

7.1. Loading the dataset and defining the dataloader¶

Download data

Raw signal here

import numpy as np
import math
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split
import random

random.seed(42)

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

signal = np.load('/content/drive/MyDrive/DL/DL_data/TransformerData/raw_signal.npy')
print(f"Loaded raw signal: {signal.shape}")

Loaded raw signal: (20000,)

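Because the raw data is a single long sequence, we cut it into overlapping windows
- Each input sample consists of n_step = 25 consecutive segments of n_input = 100 points, so each element at time $t$ is now a vector, not a single scalar
- The target is the next n_output = 100 points that follow the input window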

n_step   = 25
n_input  = 100
n_output = 100
stride   = 5
HOLDOUT  = 5_000

signal = (signal - signal.mean()) / signal.std()

train_signal = signal[:-HOLDOUT]
val_signal   = signal[-HOLDOUT:]

total_in_len  = n_step * n_input
total_out_len = n_output
n_samples = (len(train_signal) - total_in_len - total_out_len) // stride
print(f"Total samples: {n_samples}")

X = np.zeros((n_samples, n_step, n_input), dtype=np.float32)
y = np.zeros((n_samples, n_output),        dtype=np.float32)

for i in range(n_samples):
    start = i * stride
    end   = start + total_in_len
    X[i]  = train_signal[start:end].reshape(n_step, n_input)
    y[i]  = train_signal[end:end + n_output]

print(f"Sliced X {X.shape}, y {y.shape}")

Total samples: 2480
Sliced X (2480, 25, 100), y (2480, 100)
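The slicing arithmetic is easier to follow on a toy signal. The sketch below applies the same indexing to a sequence of 20 integers with much smaller, purely illustrative settings (`make_windows` is a hypothetical helper, not part of the notebook).

```python
import numpy as np

def make_windows(sig, n_step, n_input, n_output, stride):
    """Cut a 1-D signal into (n_step, n_input) inputs and n_output-long targets."""
    in_len = n_step * n_input
    n_samples = (len(sig) - in_len - n_output) // stride
    X = np.zeros((n_samples, n_step, n_input), dtype=sig.dtype)
    y = np.zeros((n_samples, n_output), dtype=sig.dtype)
    for i in range(n_samples):
        start = i * stride
        end = start + in_len
        X[i] = sig[start:end].reshape(n_step, n_input)   # consecutive segments as rows
        y[i] = sig[end:end + n_output]                   # what follows the window
    return X, y

toy = np.arange(20)
X_toy, y_toy = make_windows(toy, n_step=3, n_input=2, n_output=2, stride=4)

print(X_toy.shape, y_toy.shape)   # (3, 3, 2) (3, 2)
print(X_toy[1].tolist())          # [[4, 5], [6, 7], [8, 9]]
print(y_toy[1].tolist())          # [10, 11]
```

With the settings of this section the same formula gives (15000 - 2500 - 100) // 5 = 2480 samples, which matches the printed shape.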

class ForecastDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.from_numpy(X).float()
        self.y = torch.from_numpy(y).float()
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

ds = ForecastDataset(X, y)
n_train = int(0.8 * len(ds))
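# Seed the split so that the train/test partition is reproducible
generator = torch.Generator().manual_seed(42)
train_ds, test_ds = random_split(ds, [n_train, len(ds) - n_train], generator=generator)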

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
test_loader  = DataLoader(test_ds,  batch_size=64, shuffle=False)

print(f"Train samples: {len(train_ds)}, Test samples: {len(test_ds)}")

Train samples: 1984, Test samples: 496

Visualize the raw data and a few slices

plt.figure(figsize=(7, 3))
plt.plot(signal)
plt.title("Full signal")
plt.xlabel("Timestep"); plt.ylabel("Amplitude")
plt.show()

example_x = X[0].reshape(-1)
plt.figure(figsize=(7, 3))
plt.plot(example_x)
plt.title("Example input slice")
plt.xlabel("Timestep in slice"); plt.ylabel("Amplitude")
plt.show()

7.2. Define Necessary Modules and Architecture¶

Masked attention
- Incorporation of causal mask

class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k, dropout=0.0):
        super().__init__()
        self.d_k    = d_k
        self.dropout = nn.Dropout(dropout) if dropout else None

    def forward(self, Q, K, V, mask=None):

        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            scores = scores.masked_fill(~mask, -1e9)

        weights = F.softmax(scores, dim=-1)
        if self.dropout:
            weights = self.dropout(weights)

        out = torch.matmul(weights, V)
        return out, weights

class MaskedMultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads, dropout=0.0):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_model  = d_model
        self.n_heads  = n_heads
        self.head_dim = d_model // n_heads

        self.q_lin = nn.Linear(d_model, d_model)
        self.k_lin = nn.Linear(d_model, d_model)
        self.v_lin = nn.Linear(d_model, d_model)

        self.sdp_attention = ScaledDotProductAttention(self.head_dim, dropout)
        self.out_lin       = nn.Linear(d_model, d_model)
        self.dropout       = nn.Dropout(dropout) if dropout else None

    def forward(self, query, key, value, mask):

        B, T, _ = query.size()

        def _split(x):
            return (x.view(B, T, self.n_heads, self.head_dim)
                     .permute(0, 2, 1, 3)
                     .reshape(-1, T, self.head_dim))

        q = _split(self.q_lin(query))
        k = _split(self.k_lin(key))
        v = _split(self.v_lin(value))

        mask_heads = mask.repeat_interleave(self.n_heads, dim=0)

        attn_out, attn_w = self.sdp_attention(q, k, v, mask_heads)

        attn_out = (attn_out.view(B, self.n_heads, T, self.head_dim)
                              .permute(0, 2, 1, 3)
                              .contiguous()
                              .view(B, T, self.d_model))

        out = self.out_lin(attn_out)
        if self.dropout:
            out = self.dropout(out)

        attn_w = attn_w.view(B, self.n_heads, T, T)
        return out, attn_w

def make_causal_mask(seq: torch.Tensor) -> torch.Tensor:
    B, T, _ = seq.size()
    mask = 1 - torch.triu(torch.ones(T, T, device=seq.device), diagonal=1)
    return mask.bool().unsqueeze(0).expand(B, -1, -1)
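The effect of the causal mask can be verified numerically. The following is a minimal NumPy sketch (NumPy instead of PyTorch only to keep it dependency-light; `causal_softmax` is a hypothetical name) that follows the same steps as `ScaledDotProductAttention`: fill the disallowed positions with a large negative value, then apply softmax row by row.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax where query t may only attend to keys 0..t."""
    T = scores.shape[-1]
    keep = np.tril(np.ones((T, T), dtype=bool))             # True = allowed, as in make_causal_mask
    masked = np.where(keep, scores, -1e9)
    masked = masked - masked.max(axis=-1, keepdims=True)    # numerical stability
    w = np.exp(masked)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
weights = causal_softmax(rng.normal(size=(4, 4)))

print(np.allclose(weights.sum(axis=-1), 1.0))     # True: each row is a distribution
print(np.allclose(np.triu(weights, k=1), 0.0))    # True: no weight on future keys
print(weights[0].tolist())                        # [1.0, 0.0, 0.0, 0.0]
```

The first query can only attend to itself, so its row collapses to a one-hot vector regardless of the raw scores.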

Define a single decoder layer (block)

class DecoderLayer(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MaskedMultiHeadAttention(d_model, n_heads, dropout)
        self.linear1   = nn.Linear(d_model, d_ff)
        self.act       = nn.ReLU()
        self.dropout   = nn.Dropout(dropout)
        self.linear2   = nn.Linear(d_ff, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop1 = nn.Dropout(dropout)
        self.drop2 = nn.Dropout(dropout)

        self.attn_weights = None

    def forward(self, x, mask):
        attn_out, w = self.self_attn(x, x, x, mask)
        self.attn_weights = w

        x = self.norm1(x + self.drop1(attn_out))
        ff = self.linear2(self.dropout(self.act(self.linear1(x))))
        x = self.norm2(x + self.drop2(ff))

        return x

Build a decoder-only Transformer by stacking decoder blocks

class Transformer(nn.Module):
    def __init__(
        self,
        n_step   = 25,
        n_input  = 100,
        n_output = 100,
        d_model  = 128,
        n_heads  = 8,
        n_layers = 2,
        d_ff     = 512
    ):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(n_input, d_model),
            nn.LayerNorm(d_model)
        )
        self.pos = nn.Parameter(torch.randn(1, n_step, d_model))
        self.layers = nn.ModuleList([
            DecoderLayer(d_model, n_heads, d_ff, dropout=0.1)
            for _ in range(n_layers)
        ])

        self.mlp = nn.Sequential(
            nn.Linear(d_model * n_step, 256),
            nn.ReLU(),
            nn.Linear(256, n_output)
        )

        self.attn_maps = []

    def forward(self, x, return_attn=False):
        x = self.proj(x)
        x = x + 0.1 * self.pos[:, : x.size(1)]

        mask = make_causal_mask(x)

        self.attn_maps = []
        for layer in self.layers:
            x = layer(x, mask)
            self.attn_maps.append(layer.attn_weights)
        out = self.mlp(x.flatten(start_dim=1))

        if return_attn:
            return out, self.attn_maps
        return out

7.3. Training¶

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = Transformer(
    n_step   = X.shape[1],
    n_input  = X.shape[2],
    n_output = y.shape[1]
).to(device)

lr = 5e-4
epochs = 10
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

for epoch in range(1, epochs + 1):
    model.train()
    total_loss = 0.0
    total      = 0

    for Xb, yb in train_loader:
        Xb, yb = Xb.to(device), yb.to(device)

        optimizer.zero_grad()
        preds = model(Xb)
        loss  = criterion(preds, yb)
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * yb.size(0)
        total      += yb.size(0)

    train_loss = total_loss / total
    print(f"Epoch {epoch}/{epochs} | Train MSE: {train_loss:.6f}")

Epoch 1/10 | Train MSE: 0.300282
Epoch 2/10 | Train MSE: 0.070095
Epoch 3/10 | Train MSE: 0.049815
Epoch 4/10 | Train MSE: 0.042303
Epoch 5/10 | Train MSE: 0.039264
Epoch 6/10 | Train MSE: 0.034764
Epoch 7/10 | Train MSE: 0.032821
Epoch 8/10 | Train MSE: 0.031844
Epoch 9/10 | Train MSE: 0.031493
Epoch 10/10 | Train MSE: 0.029679

You can skip training by downloading the pretrained model here

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = Transformer(
    n_step   = n_step,
    n_input  = n_input,
    n_output = n_output,
    d_model  = 128,
    n_heads  = 8,
    n_layers = 2,
    d_ff     = 512
).to(device)

load_path = "/content/drive/MyDrive/DL/DL_data/TransformerData/forecast_weights.pth"
model.load_state_dict(torch.load(load_path, map_location=device))
model.eval()

print(f"Loaded model weights from {load_path}")

Loaded model weights from /content/drive/MyDrive/DL/DL_data/TransformerData/forecast_weights.pth

7.4. Analysis of Results¶

Visualize a single prediction
- Given 25 time steps (each a vector of 100 samples), predict the next one

horizon  = 2500
val_seg  = signal[-(n_step * n_input + horizon):]

window_flat = val_seg[: n_step * n_input]
gt_future   = val_seg[n_step * n_input:]

window = (torch.from_numpy(window_flat.astype(np.float32))
              .view(1, n_step, n_input)
              .to(device))

model.eval()
with torch.no_grad():
    one_pred = model(window).cpu().numpy().ravel()

plt.figure(figsize=(4.3, 3))
plt.plot(range(n_step * n_input), window_flat, label="Given")
plt.plot(range(n_step * n_input, n_step * n_input + n_output),
         one_pred, 'r', label="Predicted")
plt.plot(range(n_step * n_input, n_step * n_input + n_output),
         gt_future[:n_output], 'b--', label="Ground truth")
plt.axvline(n_step * n_input, color='k', linestyle='--')
plt.xlabel("Timestep"); plt.ylabel("Amplitude"); plt.legend()
plt.show()

Roll out the predictions 25 times
- Autoregressive prediction (forecasting): each predicted segment is appended to the input window and the oldest segment is dropped

preds = []
roll_window = window.clone()

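with torch.no_grad():
    for _ in range(horizon // n_output):
        out = model(roll_window)                      # shape (1, n_output)
        preds.append(out.cpu().numpy().ravel())
        # Slide the window: drop the oldest step and append the prediction
        # (this relies on n_output == n_input)
        roll_window = torch.cat([roll_window[:, 1:], out.unsqueeze(1)], dim=1)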

preds = np.concatenate(preds)

plt.figure(figsize=(8, 3))
plt.plot(range(n_step * n_input), window_flat, label="Given")
plt.plot(range(n_step * n_input, n_step * n_input + horizon),
         preds, 'r', label="Predicted")
plt.plot(range(n_step * n_input, n_step * n_input + horizon),
         gt_future, 'b--', label="Ground truth")
plt.axvline(n_step * n_input, color='k', linestyle='--')
plt.xlabel("Timestep"); plt.ylabel("Amplitude"); plt.legend()
plt.show()
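The sliding-window update can be illustrated without a trained network. In the sketch below, `toy_model` is a hypothetical stand-in that simply continues a counting sequence; the loop itself has the same structure as the roll-out above.

```python
import numpy as np

n_step, n_input = 3, 2                     # illustrative sizes; n_output == n_input

def toy_model(window):
    """Stand-in for the network: continue the count after the last value in the window."""
    last = window[-1, -1]
    return last + 1 + np.arange(n_input)

window = np.arange(n_step * n_input).reshape(n_step, n_input)   # [[0, 1], [2, 3], [4, 5]]
preds = []
for _ in range(4):
    out = toy_model(window)                          # next segment, shape (n_input,)
    preds.append(out)
    window = np.vstack([window[1:], out[None, :]])   # drop oldest step, append prediction

preds = np.concatenate(preds)
print(preds.tolist())     # [6, 7, 8, 9, 10, 11, 12, 13]
print(window.tolist())    # [[8, 9], [10, 11], [12, 13]]
```

After a few iterations the window consists entirely of the model's own outputs. With a real network this means prediction errors can accumulate over the horizon.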

8. Solving Engineering Problems with Transformers (1)¶

Making a Surrogate Model

Sequence-to-sequence mapping: From load profiles to stress profiles

8.1. Loading the dataset and defining the dataloader¶

Download data

Load profiles here

Stress profiles here

import numpy as np
import math
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split
import random

random.seed(42)

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

loads   = np.load('/content/drive/MyDrive/DL/DL_data/TransformerData/loads.npy')      # shape (N, T_load)
stresses = np.load('/content/drive/MyDrive/DL/DL_data/TransformerData/stresses.npy')  # shape (N, T_stress)
print(f"Loaded loads:   {loads.shape}")
print(f"Loaded stresses:{stresses.shape}")

Loaded loads:   (10000, 200)
Loaded stresses:(10000, 200)

loads    = (loads    - loads.mean())    / loads.std()
stresses = (stresses - stresses.mean()) / stresses.std()
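Note that the normalization above overwrites the raw arrays, so the statistics needed to map predictions back to physical units are no longer available afterwards. One possible pattern, sketched here with hypothetical names (`stress_mean`, `stress_std`, `denormalize`) and made-up values, is to store the statistics first and invert the transform after prediction.

```python
import numpy as np

raw = np.array([[10.0, 20.0, 30.0],
                [40.0, 50.0, 60.0]])               # made-up stress values

stress_mean, stress_std = raw.mean(), raw.std()    # global statistics, as in the cell above
normed = (raw - stress_mean) / stress_std

def denormalize(z, mean, std):
    """Invert z = (x - mean) / std."""
    return z * std + mean

print(np.allclose(denormalize(normed, stress_mean, stress_std), raw))    # True
print(abs(normed.mean()) < 1e-12, abs(normed.std() - 1.0) < 1e-12)       # True True
```

The same helper could be applied to the model output before plotting if stress values in the original units are needed.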

class CantileverDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.from_numpy(X).float()
        self.y = torch.from_numpy(y).float()
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

ds = CantileverDataset(loads, stresses)
n_train = int(0.8 * len(ds))
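# Seed the split so that the train/test partition is reproducible
generator = torch.Generator().manual_seed(42)
train_ds, test_ds = random_split(ds, [n_train, len(ds) - n_train], generator=generator)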

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
test_loader  = DataLoader(test_ds,  batch_size=64, shuffle=False)

print(f"Train samples: {len(train_ds)}, Test samples: {len(test_ds)}")

Train samples: 8000, Test samples: 2000

Visualize a single pair

x0, y0 = ds[0]
plt.figure(figsize=(6,4))
plt.subplot(2,1,1)
plt.plot(x0.numpy())
plt.title("Load sequence (example 0)")
plt.xlabel("Spatial index"); plt.ylabel("Load")

plt.subplot(2,1,2)
plt.plot(y0.numpy())
plt.title("Stress profile (example 0)")
plt.xlabel("Spatial index"); plt.ylabel("Stress")
plt.tight_layout()
plt.show()

8.2. Define Transformer¶

Using PyTorch's built-in nn.TransformerEncoderLayer and nn.TransformerEncoder simplifies the model definition

class TransformerEncoderRegressor(nn.Module):
    def __init__(self, seq_len, d_model=64, nhead=8, num_layers=3, dim_feedforward=256, dropout=0.1):
        super().__init__()
        self.input_proj = nn.Linear(1, d_model)
        pe = torch.zeros(seq_len, d_model)
        pos = torch.arange(seq_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pos_encoding", pe.unsqueeze(0))

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        self.output_head = nn.Linear(d_model, 1)

    def forward(self, x):
        h = self.input_proj(x)
        h = h + self.pos_encoding
        h = self.encoder(h)
        return self.output_head(h)
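The fixed positional encoding built in `__init__` follows the sinusoidal scheme of [1], where even feature indices use a sine and odd indices a cosine of `pos / 10000^(2i / d_model)`. A NumPy sketch of the same construction (`sinusoidal_pe` is a hypothetical helper name) with two sanity checks:

```python
import math
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Return a (seq_len, d_model) table; d_model is assumed to be even."""
    pos = np.arange(seq_len, dtype=np.float64)[:, None]
    div = np.exp(np.arange(0, d_model, 2, dtype=np.float64) * -(math.log(10000.0) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(pos * div)    # even feature indices
    pe[:, 1::2] = np.cos(pos * div)    # odd feature indices
    return pe

pe = sinusoidal_pe(seq_len=200, d_model=64)

print(pe.shape)                # (200, 64)
print(pe[0, :4].tolist())      # [0.0, 1.0, 0.0, 1.0]
# sin^2 + cos^2 = 1 for every (position, frequency) pair
print(np.allclose(pe[:, 0::2] ** 2 + pe[:, 1::2] ** 2, 1.0))   # True
```

Unlike the learned `nn.Parameter` used in the earlier sections, this table has no trainable weights, which is why the model stores it with `register_buffer`.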

8.3. Training¶

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TransformerEncoderRegressor(loads.shape[1]).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

epochs = 10

for epoch in range(1, epochs + 1):
    model.train()
    train_loss = 0.0
    for src, tgt in train_loader:
        src = src.unsqueeze(-1).to(device)
        tgt = tgt.unsqueeze(-1).to(device)

        pred = model(src)
        loss = criterion(pred, tgt)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss += loss.item()

    train_loss /= len(train_loader)

    model.eval()
    test_loss = 0.0
    with torch.no_grad():
        for src, tgt in test_loader:

            src = src.unsqueeze(-1).to(device)
            tgt = tgt.unsqueeze(-1).to(device)

            pred = model(src)
            loss = criterion(pred, tgt)
            test_loss += loss.item()

    test_loss /= len(test_loader)

    print(f"Epoch {epoch}/{epochs} | Train MSE: {train_loss:.6f}")

Epoch 1/10 | Train MSE: 0.209771
Epoch 2/10 | Train MSE: 0.019389
Epoch 3/10 | Train MSE: 0.012241
Epoch 4/10 | Train MSE: 0.009195
Epoch 5/10 | Train MSE: 0.007276
Epoch 6/10 | Train MSE: 0.006162
Epoch 7/10 | Train MSE: 0.005284
Epoch 8/10 | Train MSE: 0.004806
Epoch 9/10 | Train MSE: 0.004059
Epoch 10/10 | Train MSE: 0.003768

8.4. Analysis of Results¶

Visualize ground-truth vs. predicted stress profiles for a few samples

model.eval()

all_preds = []
with torch.no_grad():
    for src, _ in test_loader:
        src = src.unsqueeze(-1).to(device)
        out = model(src).cpu().squeeze(-1)
        all_preds.append(out)
preds = torch.cat(all_preds, dim=0)

n_examples = 5
indices    = random.sample(range(len(test_ds)), n_examples)

fig, axes = plt.subplots(n_examples, 2, figsize=(7, 2*n_examples))
for i, idx in enumerate(indices):
    load, true = test_ds[idx]
    pred       = preds[idx]

    ax = axes[i, 0]
    ax.plot(load.numpy(), linewidth=2)
    ax.set_title(f"Load profile #{idx}")
    ax.set_xlabel("Spatial index")
    ax.set_ylabel("Load")

    ax = axes[i, 1]
    ax.plot(true.numpy(), label="True", linewidth=2)
    ax.plot(pred.numpy(), "--", label="Predicted", linewidth=2)
    ax.set_title(f"Stress profile #{idx}")
    ax.set_xlabel("Spatial index")
    ax.set_ylabel("Stress")
    ax.legend()

plt.tight_layout()
plt.show()

9. Solving Engineering Problems with Transformers (2)¶

9.1. Temporal Flow Field Prediction¶

Sequence prediction with flow field snapshots

Download data

Flow field snapshots here

import numpy as np
import math
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split
import random

random.seed(42)

u = np.load('/content/drive/MyDrive/DL/DL_data/TransformerData/u_field.npy')
print(f"Loaded u_field: {u.shape}")

window_size = 5
validation_frames = 5
n_batch = 16
nx, ny, _ = u.shape

n_snap = u.shape[2]
flat = u.transpose(2, 0, 1).reshape(n_snap, -1)

train_flat = flat[:-validation_frames]
val_flat   = flat[-(validation_frames + window_size):]

class FlowfieldDataset(Dataset):
    def __init__(self, arr, ctx):
        xs, ys = [], []
        for i in range(len(arr) - ctx):
            xs.append(arr[i:i+ctx])
            ys.append(arr[i+ctx])
        self.x = torch.from_numpy(np.stack(xs)).float()
        self.y = torch.from_numpy(np.stack(ys)).float()
    def __len__(self):
        return len(self.x)
    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

train_ds = FlowfieldDataset(train_flat, window_size)
val_ds   = FlowfieldDataset(val_flat,   window_size)

train_loader = DataLoader(train_ds, batch_size=n_batch, shuffle=True)
val_loader   = DataLoader(val_ds,   batch_size=n_batch, shuffle=False)

print(f"Train samples: {len(train_ds)}, Val samples: {len(val_ds)}")

Loaded u_field: (199, 449, 151)
Train samples: 141, Val samples: 5
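The sample counts printed above follow directly from the windowing: a series of `n` frames yields `n - ctx` (context, target) pairs. A short check using the values from this section (`n_pairs` is a hypothetical helper):

```python
def n_pairs(n_frames, ctx):
    """Number of (ctx-frame input, next-frame target) pairs in a series of n_frames."""
    return max(n_frames - ctx, 0)

n_snap, window_size, validation_frames = 151, 5, 5

n_train_frames = n_snap - validation_frames          # flat[:-validation_frames]
n_val_frames   = validation_frames + window_size     # flat[-(validation_frames + window_size):]

print(n_pairs(n_train_frames, window_size))   # 141
print(n_pairs(n_val_frames, window_size))     # 5
print(199 * 449)                              # 89351 features per flattened snapshot
```

The validation slice overlaps the last `window_size` training frames on purpose, so that each of the 5 held-out frames has a full context. Only the targets are unseen during training.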

class TransformerDecoder(nn.Module):
    def __init__(
        self,
        ctx,
        feat_dim,
        d_model=512,
        nhead=8,
        num_layers=4,
        dim_feedforward=2048,
        dropout=0.1
    ):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, d_model)

        pe = torch.zeros(ctx, d_model)
        pos = torch.arange(ctx).unsqueeze(1).float()
        div = torch.exp(
            torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
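        # Buffers follow the module across devices and are saved with it
        self.register_buffer("pos_encoding", pe.unsqueeze(0))

        # True above the diagonal = position is NOT allowed to attend
        mask = torch.triu(torch.ones(ctx, ctx), diagonal=1).bool()
        self.register_buffer("causal_mask", mask)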

        # Decoder-only behavior is obtained from encoder layers plus a causal mask
        layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

        self.out_proj = nn.Linear(d_model, feat_dim)

    def forward(self, x):
        pos = self.pos_encoding.to(x.device)
        mask = self.causal_mask.to(x.device)

        h = self.in_proj(x) + pos
        h = self.encoder(h, mask)
        return self.out_proj(h[:, -1])
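One detail worth checking is the mask convention. The hand-written attention in Section 7 uses `True` for positions that may be attended to, whereas a boolean mask passed to `nn.TransformerEncoder` uses `True` for positions that are blocked. A small NumPy sketch (illustrative only) showing that the two masks are logical complements:

```python
import numpy as np

T = 4
keep  = np.tril(np.ones((T, T), dtype=bool))        # Section 7 style: True = may attend
block = np.triu(np.ones((T, T), dtype=bool), k=1)   # this section: True = masked out

print(np.array_equal(block, ~keep))   # True
print(block.astype(int).tolist())
# [[0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 0, 0]]
```

Passing a mask built with the wrong convention would let each position attend only to the future, so it is worth verifying the mask on a small `T` before training.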

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

feat_dim = train_flat.shape[1]
model = TransformerDecoder(window_size, feat_dim).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
epochs = 200

for epoch in range(1, epochs + 1):
    model.train()
    train_loss = 0.0

    for src, tgt in train_loader:
        src, tgt = src.to(device), tgt.to(device)
        pred = model(src)
        loss = criterion(pred, tgt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    train_loss /= len(train_loader)

    print(f"Epoch {epoch}/{epochs} | Train MSE: {train_loss:.6f}")

Epoch 1/200 | Train MSE: 0.471873
Epoch 2/200 | Train MSE: 0.081464
Epoch 3/200 | Train MSE: 0.049184
Epoch 4/200 | Train MSE: 0.036663
Epoch 5/200 | Train MSE: 0.031704
Epoch 6/200 | Train MSE: 0.028616
Epoch 7/200 | Train MSE: 0.027344
Epoch 8/200 | Train MSE: 0.026692
Epoch 9/200 | Train MSE: 0.026357
Epoch 10/200 | Train MSE: 0.025579
Epoch 11/200 | Train MSE: 0.025638
Epoch 12/200 | Train MSE: 0.025994
Epoch 13/200 | Train MSE: 0.025940
Epoch 14/200 | Train MSE: 0.025399
Epoch 15/200 | Train MSE: 0.024975
Epoch 16/200 | Train MSE: 0.024843
Epoch 17/200 | Train MSE: 0.024615
Epoch 18/200 | Train MSE: 0.024873
Epoch 19/200 | Train MSE: 0.024341
Epoch 20/200 | Train MSE: 0.024350
Epoch 21/200 | Train MSE: 0.024450
Epoch 22/200 | Train MSE: 0.024346
Epoch 23/200 | Train MSE: 0.024502
Epoch 24/200 | Train MSE: 0.024251
Epoch 25/200 | Train MSE: 0.024201
Epoch 26/200 | Train MSE: 0.024217
Epoch 27/200 | Train MSE: 0.023793
Epoch 28/200 | Train MSE: 0.023560
Epoch 29/200 | Train MSE: 0.023303
Epoch 30/200 | Train MSE: 0.023394
Epoch 31/200 | Train MSE: 0.023019
Epoch 32/200 | Train MSE: 0.023095
Epoch 33/200 | Train MSE: 0.023247
Epoch 34/200 | Train MSE: 0.023687
Epoch 35/200 | Train MSE: 0.024170
Epoch 36/200 | Train MSE: 0.022834
Epoch 37/200 | Train MSE: 0.023387
Epoch 38/200 | Train MSE: 0.022983
Epoch 39/200 | Train MSE: 0.022832
Epoch 40/200 | Train MSE: 0.022493
Epoch 41/200 | Train MSE: 0.022938
Epoch 42/200 | Train MSE: 0.022244
Epoch 43/200 | Train MSE: 0.022374
Epoch 44/200 | Train MSE: 0.022242
Epoch 45/200 | Train MSE: 0.022413
Epoch 46/200 | Train MSE: 0.022600
Epoch 47/200 | Train MSE: 0.022384
Epoch 48/200 | Train MSE: 0.022211
Epoch 49/200 | Train MSE: 0.022792
Epoch 50/200 | Train MSE: 0.022294
Epoch 51/200 | Train MSE: 0.022280
Epoch 52/200 | Train MSE: 0.022254
Epoch 53/200 | Train MSE: 0.021764
Epoch 54/200 | Train MSE: 0.022045
Epoch 55/200 | Train MSE: 0.021411
Epoch 56/200 | Train MSE: 0.021193
Epoch 57/200 | Train MSE: 0.021285
Epoch 58/200 | Train MSE: 0.020814
Epoch 59/200 | Train MSE: 0.021403
Epoch 60/200 | Train MSE: 0.021799
Epoch 61/200 | Train MSE: 0.019091
Epoch 62/200 | Train MSE: 0.016746
Epoch 63/200 | Train MSE: 0.016406
Epoch 64/200 | Train MSE: 0.012520
Epoch 65/200 | Train MSE: 0.009914
Epoch 66/200 | Train MSE: 0.008998
Epoch 67/200 | Train MSE: 0.005871
Epoch 68/200 | Train MSE: 0.004704
Epoch 69/200 | Train MSE: 0.004294
Epoch 70/200 | Train MSE: 0.004045
Epoch 71/200 | Train MSE: 0.004090
Epoch 72/200 | Train MSE: 0.003760
Epoch 73/200 | Train MSE: 0.003565
Epoch 74/200 | Train MSE: 0.003584
Epoch 75/200 | Train MSE: 0.003253
Epoch 76/200 | Train MSE: 0.003321
Epoch 77/200 | Train MSE: 0.003248
Epoch 78/200 | Train MSE: 0.003321
Epoch 79/200 | Train MSE: 0.003134
Epoch 80/200 | Train MSE: 0.003625
Epoch 81/200 | Train MSE: 0.004075
Epoch 82/200 | Train MSE: 0.004519
Epoch 83/200 | Train MSE: 0.004351
Epoch 84/200 | Train MSE: 0.003785
Epoch 85/200 | Train MSE: 0.004119
Epoch 86/200 | Train MSE: 0.004066
Epoch 87/200 | Train MSE: 0.003404
Epoch 88/200 | Train MSE: 0.002902
Epoch 89/200 | Train MSE: 0.002734
Epoch 90/200 | Train MSE: 0.002567
Epoch 91/200 | Train MSE: 0.002440
Epoch 92/200 | Train MSE: 0.002355
Epoch 93/200 | Train MSE: 0.002237
Epoch 94/200 | Train MSE: 0.002226
Epoch 95/200 | Train MSE: 0.002230
Epoch 96/200 | Train MSE: 0.002061
Epoch 97/200 | Train MSE: 0.002004
Epoch 98/200 | Train MSE: 0.001957
Epoch 99/200 | Train MSE: 0.001821
Epoch 100/200 | Train MSE: 0.001756
Epoch 101/200 | Train MSE: 0.001713
Epoch 102/200 | Train MSE: 0.001705
Epoch 103/200 | Train MSE: 0.001663
Epoch 104/200 | Train MSE: 0.001434
Epoch 105/200 | Train MSE: 0.001400
Epoch 106/200 | Train MSE: 0.001283
Epoch 107/200 | Train MSE: 0.001279
Epoch 108/200 | Train MSE: 0.001213
Epoch 109/200 | Train MSE: 0.001170
Epoch 110/200 | Train MSE: 0.001132
Epoch 111/200 | Train MSE: 0.001067
Epoch 112/200 | Train MSE: 0.001081
Epoch 113/200 | Train MSE: 0.001164
Epoch 114/200 | Train MSE: 0.001159
Epoch 115/200 | Train MSE: 0.001198
Epoch 116/200 | Train MSE: 0.001116
Epoch 117/200 | Train MSE: 0.000985
Epoch 118/200 | Train MSE: 0.000928
Epoch 119/200 | Train MSE: 0.000962
Epoch 120/200 | Train MSE: 0.001027
Epoch 121/200 | Train MSE: 0.000998
Epoch 122/200 | Train MSE: 0.000987
Epoch 123/200 | Train MSE: 0.000970
Epoch 124/200 | Train MSE: 0.000970
Epoch 125/200 | Train MSE: 0.000860
Epoch 126/200 | Train MSE: 0.000844
Epoch 127/200 | Train MSE: 0.000843
Epoch 128/200 | Train MSE: 0.000798
Epoch 129/200 | Train MSE: 0.000806
Epoch 130/200 | Train MSE: 0.000897
Epoch 131/200 | Train MSE: 0.000897
Epoch 132/200 | Train MSE: 0.000923
Epoch 133/200 | Train MSE: 0.001313
Epoch 134/200 | Train MSE: 0.001454
Epoch 135/200 | Train MSE: 0.020584
Epoch 136/200 | Train MSE: 0.022535
Epoch 137/200 | Train MSE: 0.021590
Epoch 138/200 | Train MSE: 0.020041
Epoch 139/200 | Train MSE: 0.023718
Epoch 140/200 | Train MSE: 0.021091
Epoch 141/200 | Train MSE: 0.021413
Epoch 142/200 | Train MSE: 0.020121
Epoch 143/200 | Train MSE: 0.020144
Epoch 144/200 | Train MSE: 0.020226
Epoch 145/200 | Train MSE: 0.019975
Epoch 146/200 | Train MSE: 0.019779
Epoch 147/200 | Train MSE: 0.019280
Epoch 148/200 | Train MSE: 0.018916
Epoch 149/200 | Train MSE: 0.018796
Epoch 150/200 | Train MSE: 0.018856
Epoch 151/200 | Train MSE: 0.019137
Epoch 152/200 | Train MSE: 0.019022
Epoch 153/200 | Train MSE: 0.018929
Epoch 154/200 | Train MSE: 0.018814
Epoch 155/200 | Train MSE: 0.018956
Epoch 156/200 | Train MSE: 0.018932
Epoch 157/200 | Train MSE: 0.018645
Epoch 158/200 | Train MSE: 0.018575
Epoch 159/200 | Train MSE: 0.018630
Epoch 160/200 | Train MSE: 0.019229
Epoch 161/200 | Train MSE: 0.018582
Epoch 162/200 | Train MSE: 0.018830
Epoch 163/200 | Train MSE: 0.020314
Epoch 164/200 | Train MSE: 0.019554
Epoch 165/200 | Train MSE: 0.019587
Epoch 166/200 | Train MSE: 0.019357
Epoch 167/200 | Train MSE: 0.018144
Epoch 168/200 | Train MSE: 0.018054
Epoch 169/200 | Train MSE: 0.017142
Epoch 170/200 | Train MSE: 0.016989
Epoch 171/200 | Train MSE: 0.017748
Epoch 172/200 | Train MSE: 0.015638
Epoch 173/200 | Train MSE: 0.013608
Epoch 174/200 | Train MSE: 0.015236
Epoch 175/200 | Train MSE: 0.015838
Epoch 176/200 | Train MSE: 0.013995
Epoch 177/200 | Train MSE: 0.014487
Epoch 178/200 | Train MSE: 0.013490
Epoch 179/200 | Train MSE: 0.012588
Epoch 180/200 | Train MSE: 0.011986
Epoch 181/200 | Train MSE: 0.011096
Epoch 182/200 | Train MSE: 0.010133
Epoch 183/200 | Train MSE: 0.009637
Epoch 184/200 | Train MSE: 0.011238
Epoch 185/200 | Train MSE: 0.010846
Epoch 186/200 | Train MSE: 0.014309
Epoch 187/200 | Train MSE: 0.016573
Epoch 188/200 | Train MSE: 0.014186
Epoch 189/200 | Train MSE: 0.010622
Epoch 190/200 | Train MSE: 0.008895
Epoch 191/200 | Train MSE: 0.007585
Epoch 192/200 | Train MSE: 0.011921
Epoch 193/200 | Train MSE: 0.013625
Epoch 194/200 | Train MSE: 0.011989
Epoch 195/200 | Train MSE: 0.011122
Epoch 196/200 | Train MSE: 0.010658
Epoch 197/200 | Train MSE: 0.011197
Epoch 198/200 | Train MSE: 0.011081
Epoch 199/200 | Train MSE: 0.010985
Epoch 200/200 | Train MSE: 0.012367
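
In the portion of the log shown above, the train MSE reaches 0.000798 at epoch 128, jumps to 0.020584 at epoch 135, and ends at 0.012367 at epoch 200, so the weights from the final epoch are not the best ones seen during training. One common safeguard is to keep a copy of the weights from the epoch with the lowest loss. The sketch below is a minimal illustration and not the training loop used in this notebook: `BestCheckpoint` is a hypothetical helper, and a small `nn.Linear` stands in for the model.

```python
import copy

import torch
import torch.nn as nn


class BestCheckpoint:
    """Hypothetical helper: remembers the weights with the lowest loss seen so far."""

    def __init__(self):
        self.best_loss = float("inf")
        self.best_epoch = None
        self.best_state = None

    def update(self, model, loss, epoch):
        # Deep-copy so later optimizer steps do not overwrite the stored tensors
        if loss < self.best_loss:
            self.best_loss = loss
            self.best_epoch = epoch
            self.best_state = copy.deepcopy(model.state_dict())
            return True
        return False


# Stand-in model (assumption) and a few loss values copied from the log above
model = nn.Linear(4, 4)
logged = {128: 0.000798, 134: 0.001454, 135: 0.020584, 200: 0.012367}

ckpt = BestCheckpoint()
for epoch, loss in logged.items():
    # In a real loop, one epoch of training would run here
    ckpt.update(model, loss, epoch)

model.load_state_dict(ckpt.best_state)  # restore the best weights
print(ckpt.best_epoch, ckpt.best_loss)
```

Calling `update` once per epoch adds little overhead for a small model, and the stored `best_state` can be written to disk with `torch.save` once training ends.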

You can skip training by downloading the pretrained model weights here and loading them as shown below.

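# Rebuild the model with the same architecture that was used for training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
feat_dim = train_flat.shape[1]
model = TransformerDecoder(window_size, feat_dim).to(device)

# Load the saved weights; map_location lets the file load on CPU-only machines too
load_path = '/content/drive/MyDrive/DL/DL_data/TransformerData/u_forecast_weights.pth'
state     = torch.load(load_path, map_location=device)
model.load_state_dict(state)
model.eval()  # evaluation mode (affects layers such as dropout)

print(f"Loaded TransformerDecoder weights from {load_path}")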

Loaded TransformerDecoder weights from /content/drive/MyDrive/DL/DL_data/TransformerData/u_forecast_weights.pth
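
For reference, a weights file like the one loaded above is typically produced by saving a model's `state_dict`. The sketch below shows the save and load round trip with a small stand-in network and a temporary file. `TinyNet` and the file name `demo_weights.pth` are illustrative assumptions; they are not the notebook's `TransformerDecoder` or its actual saving code.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Stand-in network (assumption), used only to demonstrate the round trip."""

    def __init__(self, feat_dim):
        super().__init__()
        self.fc = nn.Linear(feat_dim, feat_dim)

    def forward(self, x):
        return self.fc(x)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
trained = TinyNet(8).to(device)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_weights.pth")
    # Save only the parameters, not the whole Python object
    torch.save(trained.state_dict(), path)

    # Rebuild the same architecture, then load the parameters into it
    restored = TinyNet(8).to(device)
    state = torch.load(path, map_location=device)
    restored.load_state_dict(state)

trained.eval()
restored.eval()
x = torch.randn(2, 8, device=device)
with torch.no_grad():
    same = torch.allclose(trained(x), restored(x))
print(same)
```

Because only parameters are stored, the class definition and constructor arguments must match those used when the weights were saved; otherwise `load_state_dict` raises an error about missing or mismatched keys.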

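import random

model.eval()

# One-step-ahead prediction for every training window
preds = []
with torch.no_grad():
    for seq, _ in train_ds:
        inp = seq.unsqueeze(0).to(device)    # add batch dimension
        out = model(inp).cpu().squeeze(0)    # predicted next frame, flattened
        preds.append(out.numpy())

# Pick a few random windows to compare against the ground truth
n_examples = 5
indices = random.sample(range(len(train_ds)), n_examples)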

fig, axes = plt.subplots(n_examples, 2, figsize=(10, 3*n_examples))
for i, idx in enumerate(indices):
    _, true_frame = train_ds[idx]
    pred_frame    = preds[idx]

    true_img = true_frame.numpy().reshape(nx, ny)
    pred_img = pred_frame.reshape(nx, ny)

    ax = axes[i, 0]
    ax.imshow(true_img, origin="lower")
    ax.set_title(f"True snapshot #{idx}")
    ax.axis("off")

    ax = axes[i, 1]
    ax.imshow(pred_img, origin="lower")
    ax.set_title(f"Predicted snapshot #{idx}")
    ax.axis("off")

plt.tight_layout()
plt.show()
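
Side-by-side images give only a qualitative check, so a per-snapshot error number can complement them. The sketch below computes the MSE and a relative L2 error for each predicted frame. The arrays `true_frames` and `pred_frames` are synthetic stand-ins (in the notebook they would be built from `train_ds` and `preds`), and `frame_errors` is a hypothetical helper.

```python
import numpy as np


def frame_errors(true_frames, pred_frames):
    """Per-frame MSE and relative L2 error for arrays of shape (n_frames, feat_dim)."""
    diff = pred_frames - true_frames
    mse = np.mean(diff ** 2, axis=1)
    rel_l2 = np.linalg.norm(diff, axis=1) / np.linalg.norm(true_frames, axis=1)
    return mse, rel_l2


# Synthetic data (assumption): five flattened frames, predictions offset by 0.1
rng = np.random.default_rng(0)
true_frames = rng.normal(size=(5, 12))
pred_frames = true_frames + 0.1

mse, rel_l2 = frame_errors(true_frames, pred_frames)
print(mse)
print(rel_l2)
```

The relative L2 error is scale-free, which makes it easier to compare across snapshots whose magnitudes differ.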

model.eval()

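# The window_size frames just before the held-out frames, and the held-out frames themselves
last_window = flat[-(validation_frames + window_size):-validation_frames]
future_true = flat[-validation_frames:]

input_tensor = torch.from_numpy(last_window).float().unsqueeze(0).to(device)  # (1, window_size, feat_dim)
predictions = []

# Autoregressive rollout: each prediction is fed back in as the newest frame
with torch.no_grad():
    for _ in range(validation_frames):
        out = model(input_tensor)
        frame_pred = out[0]                          # (feat_dim,)
        predictions.append(frame_pred.cpu().numpy())
        # Drop the oldest frame and append the prediction
        input_tensor = torch.cat([
            input_tensor[:, 1:, :],
            frame_pred.unsqueeze(0).unsqueeze(0).to(device)
        ], dim=1)

pred_np = np.stack(predictions)
gt = future_true.reshape(validation_frames, nx, ny)
pr = pred_np.reshape(validation_frames, nx, ny)
err = pr - gt  # signed per-pixel error (computed here, not plotted below)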

fig, axes = plt.subplots(validation_frames, 2, figsize=(8, 3*validation_frames))
for t in range(validation_frames):
    ax = axes[t, 0]
    ax.imshow(gt[t], origin='lower')
    ax.set_title(f"t = −{validation_frames-t}  true")
    ax.axis("off")

    ax = axes[t, 1]
    ax.imshow(pr[t], origin='lower')
    ax.set_title(f"t = −{validation_frames-t}  pred")
    ax.axis("off")

plt.tight_layout()
plt.show()
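
The rollout above feeds each prediction back into the input window, so errors can accumulate from step to step. The sketch below reproduces the same sliding-window update with a toy stand-in predictor and reports the RMSE per forecast step. `toy_model`, `rollout`, and the synthetic data are assumptions for illustration only and say nothing about the trained model's behaviour.

```python
import numpy as np


def toy_model(window):
    """Stand-in predictor (assumption): repeats the most recent frame."""
    return window[-1].copy()


def rollout(model_fn, last_window, n_steps):
    """Autoregressive forecast: append each prediction, drop the oldest frame."""
    window = last_window.copy()
    predictions = []
    for _ in range(n_steps):
        frame_pred = model_fn(window)
        predictions.append(frame_pred)
        window = np.concatenate([window[1:], frame_pred[None, :]], axis=0)
    return np.stack(predictions)


window_size, validation_frames, feat_dim = 4, 3, 3
# Synthetic sequence: frame t has the value t in every entry
flat = np.arange(10, dtype=float)[:, None] * np.ones((1, feat_dim))

last_window = flat[-(validation_frames + window_size):-validation_frames]
future_true = flat[-validation_frames:]

pred_np = rollout(toy_model, last_window, validation_frames)
err = pred_np - future_true
rmse_per_step = np.sqrt(np.mean(err ** 2, axis=1))
print(rmse_per_step)  # [1. 2. 3.]
```

With this toy predictor the error grows by one unit per step because the forecast never moves while the true signal keeps increasing. Plotting the same per-step RMSE for the real rollout would show how quickly its forecasts drift over the held-out frames.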
