Transformers
Source
CS25: Transformers United V2
The Annotated Transformer
Transformers from Scratch
Transformers: State-of-the-Art Natural Language Processing
[1] Transformers (Original Paper)
Language is often treated as a time series for modeling convenience, but this framing is fundamentally incomplete. A sentence is not a simple left-to-right stream of tokens — it is a network of relationships between words, where meaning emerges from structure, reference, and dependency rather than from temporal proximity alone.
Consider the sentence:
"The trophy didn't fit in the suitcase because it was too big."
The pronoun "it" could refer to either "trophy" or "suitcase", and resolving this ambiguity requires connecting "it" back to "trophy" across six intervening words. A model that processes tokens one by one must somehow preserve this referential relationship through every intermediate step. This illustrates a general principle: meaning is not always determined by neighboring words, and a language model must therefore learn not only the order of tokens but also the relationships between them, which may be local or arbitrarily distant.
RNNs and LSTMs process language sequentially, accumulating past information into a hidden state $h_t$ updated at each time step:
$$h_t = f(h_{t-1}, x_t)$$
where $x_t$ is the input at step $t$ and $f$ is the learned recurrent update.
This design forces the model to compress everything it has read — all preceding words, their relationships, and their context — into a single fixed-dimensional vector before processing the next token. As the sequence grows longer, earlier words gradually fade: the gradient signal from an early token must backpropagate through every subsequent time step, attenuating exponentially along the way. This is the vanishing gradient problem.
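The attenuation can be seen in a toy setting. The sketch below assumes a scalar recurrence $h_t = \tanh(w\,h_{t-1})$ with no input term, a hypothetical weight `w = 0.9`, and a hypothetical initial state of 0.5. It is an illustration of the chain rule over many steps, not a real RNN.

```python
import math

def gradient_through_time(w: float, steps: int, h0: float = 0.5) -> float:
    """Return |d h_T / d h_0| for the toy recurrence h_t = tanh(w * h_{t-1})."""
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - tanh(w * h_prev)^2)
        grad *= w * (1.0 - h * h)
    return abs(grad)

# Gradient magnitude reaching the first state after 1, 10, and 50 steps.
grads = {steps: gradient_through_time(w=0.9, steps=steps) for steps in (1, 10, 50)}
for steps, g in grads.items():
    print(f"steps={steps:2d}  |dh_T/dh_0| = {g:.2e}")
```

Each step multiplies the gradient by a factor no larger than `w`, so with `w < 1` the product shrinks at least geometrically in the number of steps.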
Natural language is inherently ambiguous — a single word can carry different meanings depending on the context in which it appears.
Consider the following two sentences:
"I swam across the river to get to the other bank."
"I walked across the road to get cash from the bank."
In both sentences, the word "bank" appears, yet its meaning differs entirely based on the surrounding context. In the first sentence, "bank" refers to the edge of a river; in the second, it refers to a financial institution.
How does a model resolve this ambiguity? The key lies in identifying which words in the sentence are most informative. In the first sentence, the word "river" provides the critical clue; in the second, it is "cash".
This example illustrates a fundamental challenge in sequence modeling:
Not all elements of the input sequence contribute equally — some words carry far more contextual weight than others when assigning meaning or making predictions.
This motivates the need for a mechanism that can selectively focus on the most relevant parts of the input. That mechanism is attention.
The deeper problem is architectural. Sequential hidden states conflate two distinct concerns: the order in which computation proceeds, and the structure of relationships between elements. Natural language requires the latter but does not strictly require the former.
A strong language model should satisfy the following properties:
None of these properties are naturally expressed by a left-to-right recurrence. What is needed is a mechanism that allows any token to directly attend to all other tokens in the sequence and weigh their relevance — regardless of distance. This is the core intuition behind the attention mechanism, and it forms the architectural foundation of the Transformer.
The attention mechanism addresses the bottleneck of sequential models by allowing any token in a sequence to directly interact with any other token, regardless of their distance. Rather than routing information through a chain of hidden states, attention computes a weighted combination of all token representations simultaneously — where the weights reflect how relevant each token is to the one being processed.
Intuitively, attention can be thought of as a soft, differentiable lookup. When interpreting the word "it" in the sentence "The trophy didn't fit in the suitcase because it was too big", a human reader does not replay the entire sentence sequentially — they immediately draw a connection to the most contextually relevant word, "trophy". Attention allows a model to do the same: for each token, it produces a distribution over all other tokens that represents where the model should "look" to gather relevant context.
This visualization demonstrates that attention allows a model to dynamically focus on different parts of the input, assigning varying degrees of importance depending on the context. This ability is what makes transformers highly effective in capturing long-range dependencies and meaning in natural language.
Formally, given a sequence of token representations, the attention mechanism computes a new representation for each token as a weighted sum of all token representations in the sequence:
$$y_i = \sum_{j} \alpha_{ij} \, v_j$$
where $v_j$ is the representation of token $j$, and $\alpha_{ij}$ is the attention weight assigned by token $i$ to token $j$. The weights satisfy:
$$\alpha_{ij} \ge 0, \qquad \sum_{j} \alpha_{ij} = 1$$
so the output is a convex combination of all token representations. A weight $\alpha_{ij}$ close to 1 means token $i$ relies heavily on token $j$ for its contextual representation; a weight close to 0 means token $j$ is largely ignored.
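As a small numerical check of the weighted-sum view, the sketch below uses hand-picked, hypothetical value vectors and attention weights for a three-token sequence. In a real model the weights would be computed from the tokens, not written by hand.

```python
import numpy as np

# Hypothetical value vectors v_j, one row per token.
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Hypothetical attention weights: row i is token i's distribution over tokens j.
alpha = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.3, 0.3, 0.4]])

# y_i = sum_j alpha_ij * v_j, computed for every token i at once.
Y = alpha @ V

print(alpha.sum(axis=1))  # each row of weights sums to 1
print(Y)                  # each output row is a convex combination of the rows of V
```

Token 0 places weight 0.8 on itself, so its output stays close to its own value vector, while token 2 spreads its weight and ends up near the average.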
Class Activation Mapping (CAM) is a technique for visualizing which spatial regions of an image a CNN focuses on when making a classification decision. Given the feature map $F^l \in \mathbb{R}^{C \times H \times W}$ at the final convolutional layer and the classification weights $\omega^c \in \mathbb{R}^C$ for class $c$, the class activation map is:
$$M^c(x, y) = \sum_{k=1}^{C} \omega^c_k \, F^l_k(x, y)$$
This produces a spatial heatmap that highlights the regions most relevant to the predicted class — a weighted aggregation of feature channels, where the weights reflect class-specific importance.
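The channel-weighted sum can be sketched in a few lines. The feature map and class weights below are random stand-ins (an assumption for illustration), not the output of a trained CNN, and the min-max normalization is a common display convention that the text does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)

C, H, W = 4, 5, 5                 # hypothetical channel count and spatial size
F_l = rng.random((C, H, W))       # stand-in for the final conv feature map F^l
omega_c = rng.random(C)           # stand-in for the class-c weights omega^c

# Weighted sum over channels k at every spatial location (h, w).
cam = np.einsum("k,khw->hw", omega_c, F_l)

# Rescale to [0, 1] so the map can be shown as a heatmap.
cam_norm = (cam - cam.min()) / (cam.max() - cam.min())
print(cam_norm.shape)
```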
The structural parallel to attention is direct. In CAM, the model asks: "which spatial locations matter for this classification?" In attention, each token asks: "which other tokens matter for interpreting me?" Both mechanisms answer their respective questions through a weighted sum — CAM over spatial positions, attention over sequence positions. In both cases, the weights are not hand-crafted but emerge from the learned parameters of the model.
There is a deeper conceptual similarity as well. CAM revealed something surprising: a network trained purely on image-level classification labels, with no spatial supervision, nonetheless learned to localize objects. The weights $\omega^{c}$ encode where to look, even though the training signal never explicitly said so. Attention in Transformers exhibits an analogous emergent property — models trained on next-token prediction learn to route information through linguistically meaningful connections, such as coreference and syntactic agreement, without ever being explicitly told to do so.
Before introducing the mathematical formulation of attention, it is helpful to build intuition from a familiar concept in computer science: the dictionary lookup.
In Python, a dictionary maps keys to values and supports exact lookup:
library = {
    "deep learning"   : "Goodfellow et al., 2016",
    "transformers"    : "Vaswani et al., 2017",
    "computer vision" : "Szeliski, 2010",
}
query = "transformers"
result = library[query] # returns exactly "Vaswani et al., 2017"
This is a hard lookup — the query must match a key exactly, and only one value is returned. If the query does not match any key, the lookup fails entirely.
Now consider a more flexible query:
query = "fast and strongly typed language"
library = {
    "python"  : "interpreted, dynamic typing, readable syntax",
    "c++"     : "compiled, static typing, high performance",
    "java"    : "compiled, static typing, platform independent",
    "haskell" : "functional, static typing, lazy evaluation",
}
keys = list(library.keys())
values = list(library.values())
# similarity scores between the query and each key (conceptual, normalized to sum to 1)
scores = [0.05, 0.60, 0.30, 0.05]
# output: weighted blend of values. With vector-valued entries this would be
# sum(s * v for s, v in zip(scores, values)); strings cannot be scaled by a float,
# so each description is paired with its weight instead.
output = list(zip(scores, values))
No key in the dictionary matches the query "fast and strongly typed language" exactly, so a hard lookup would fail immediately. Yet intuitively, "c++" and "java" are far more relevant than "python" or "haskell", because their descriptions align closely with the concepts of compilation, static typing, and performance.
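A soft lookup handles this gracefully. Rather than demanding an exact match, it scores every key by its similarity to the query, normalizes those scores into weights that sum to one, and returns the weighted combination of all values.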
In the example above, the output is dominated by the descriptions of "c++" and "java", while "python" and "haskell" contribute only marginally.
This softening is precisely what makes attention compatible with backpropagation. A hard argmax over keys produces no useful gradient; a differentiable weighted sum does — and can therefore be optimized end-to-end.
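To make the contrast between hard and soft lookup concrete, the sketch below replaces the text descriptions with small numeric vectors. The key, query, and value arrays are hypothetical stand-ins for embedded text, and the dot product with softmax is one possible choice of similarity and normalization.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical 2-d key vectors (one row per dictionary entry) and a query vector.
keys = np.array([[0.0, 0.1],
                 [1.0, 1.0],
                 [0.8, 0.6],
                 [0.1, 0.0]])
query = np.array([1.0, 0.9])

# Hypothetical values stored under each key.
values = np.array([[10.0], [20.0], [30.0], [40.0]])

scores = keys @ query              # similarity of the query to every key
weights = softmax(scores)          # normalized weights that sum to 1

soft_output = weights @ values             # soft lookup: blend of all values
hard_output = values[np.argmax(scores)]    # hard lookup: best match only

print(weights.round(3), soft_output, hard_output)
```

The hard lookup returns only the best-matching value, whereas the soft output changes smoothly as the query or keys change, which is what gradient-based training relies on.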
Before a model can process language, raw text must be converted into a mathematical form amenable to computation. This requires two sequential steps: tokenization, which segments text into discrete units, and embedding, which maps those units into a continuous vector space.
A language model does not process raw text directly. Instead, the input is first decomposed into a sequence of elementary units known as tokens. A token is the minimum unit of input that a language model processes — the model reads, predicts, and generates text at the level of individual tokens.
Depending on the chosen strategy, tokens may correspond to whole words, subword units, or individual characters. The choice of tokenization scheme affects both the vocabulary size and the model's ability to handle rare or unseen words.
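A minimal sketch of two of these strategies, word-level and character-level, using plain string operations. The vocabulary construction (sorted unique tokens mapped to integer ids) is an assumption for illustration; subword schemes need a trained tokenizer and are not shown.

```python
text = "the dog bit the man"

# Word-level tokens: split on whitespace.
word_tokens = text.split()

# Character-level tokens: every character, including spaces.
char_tokens = list(text)

# Hypothetical vocabulary: sorted unique tokens mapped to integer ids.
word_vocab = {tok: idx for idx, tok in enumerate(sorted(set(word_tokens)))}
word_ids = [word_vocab[tok] for tok in word_tokens]

print(word_tokens, word_ids)
print(len(word_vocab), "word types vs", len(set(char_tokens)), "character types")
```

Word-level tokenization gives short sequences but a vocabulary that grows with the corpus; character-level tokenization gives a small vocabulary but much longer sequences.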
Once the input has been tokenized, each token must be assigned a numerical representation. This is accomplished through an embedding, which maps each token to a real-valued vector in a high-dimensional space.
Formally, let the vocabulary consist of $V$ distinct tokens. Each token is assigned a unique vector in $\mathbb{R}^{d_{\text{model}}}$, where $d_{\text{model}}$ is a user-defined hyperparameter specifying the embedding dimension:
$$\text{token}_i \mapsto \mathbf{e}_i \in \mathbb{R}^{d_{\text{model}}}, \qquad i = 1, \dots, V$$
The resulting embedding vectors have several important properties. First, they are dense and continuous — in contrast to one-hot encodings, every dimension contributes meaningful information. Second, the vectors are learned parameters, updated during training to minimize the model's loss. Third, and most significantly, the geometry of the embedding space reflects semantic relationships: words with similar meanings tend to occupy nearby regions of the space, while semantically unrelated words are mapped to distant regions.
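In code, an embedding is a table lookup. The sketch below assumes a randomly initialized table `E` (in a trained model these rows would be learned) and hypothetical token ids, and checks that indexing the table matches multiplying a one-hot encoding by it.

```python
import numpy as np

rng = np.random.default_rng(0)

V, d_model = 4, 8                       # hypothetical vocabulary size and embedding dimension
E = rng.normal(size=(V, d_model))       # embedding table: one row per token

token_ids = np.array([3, 1, 0, 3, 2])   # hypothetical tokenized sentence

X = E[token_ids]                        # dense lookup: shape (sequence length, d_model)
one_hot = np.eye(V)[token_ids]          # sparse one-hot encoding: shape (sequence length, V)

print(X.shape, one_hot.shape)
```

Here `one_hot @ E` reproduces `X`, so the lookup is equivalent to a linear layer applied to one-hot inputs, only cheaper.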
This geometric structure is not imposed explicitly but emerges naturally from training. As a concrete illustration, consider the embedding vectors for the words "man", "woman", "king", and "queen". Each word is represented as a vector over semantic dimensions such as gender, royalty, and humanness. When these vectors are projected into two-dimensional space, semantically related words appear in close proximity — "king" and "queen" cluster together, as do "man" and "woman" — while the directional offset between "man" and "woman" is approximately preserved in the offset between "king" and "queen". This reflects the model's ability to encode not only similarity but relational structure within the embedding space.
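The relational structure can be illustrated with hand-written toy vectors. The three coordinates below (gender, royalty, humanness) and their values are hypothetical; learned embeddings are not this clean and their dimensions are not individually interpretable.

```python
import numpy as np

# Hypothetical coordinates: [gender, royalty, humanness]
emb = {
    "man":   np.array([ 1.0, 0.0, 1.0]),
    "woman": np.array([-1.0, 0.0, 1.0]),
    "king":  np.array([ 1.0, 1.0, 1.0]),
    "queen": np.array([-1.0, 1.0, 1.0]),
}

# The offset man -> woman matches the offset king -> queen.
offset_common = emb["woman"] - emb["man"]
offset_royal = emb["queen"] - emb["king"]

# Analogy arithmetic: king - man + woman, then find the nearest stored word.
analogy = emb["king"] - emb["man"] + emb["woman"]
nearest = min(emb, key=lambda word: np.linalg.norm(emb[word] - analogy))

print(offset_common, offset_royal, nearest)
```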
One of the key properties of the attention mechanism is that it treats its inputs as a set — it does not inherently account for the order or relative position of elements in a sequence. However, language is fundamentally ordered: "the dog bit the man" and "the man bit the dog" contain identical words but carry entirely different meanings. Positional information must therefore be explicitly injected into the input representations.
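The order-blindness can be checked directly. The sketch below assumes a bare-bones self-attention with no learned projections (queries, keys, and values are all the input itself) and a random input. Shuffling the rows of the input only shuffles the rows of the output in the same way.

```python
import numpy as np

def self_attention(X):
    """Self-attention with Q = K = V = X (no learned projections, an assumption)."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))          # 5 hypothetical tokens, 4 features each
perm = np.array([4, 2, 0, 3, 1])     # an arbitrary reordering of the tokens

out = self_attention(X)
out_perm = self_attention(X[perm])

# Each token receives the same output regardless of where it sits in the sequence.
print(np.allclose(out_perm, out[perm]))
```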
This raises a natural question: how should position be encoded and incorporated into the model?
Before introducing the mathematical formulation, consider a simple analogy. Suppose we want to assign a unique identifier to each position in a sequence using binary numbers. Position 1 is encoded as 0001, position 2 as 0010, position 3 as 0011, and so on.
Each bit in the binary representation oscillates at a different frequency: the least significant bit flips at every step, the next bit flips every two steps, the next every four steps, and so on. Each position receives a unique code, and the rate of change differs systematically across dimensions — lower-order bits vary rapidly, while higher-order bits vary slowly.
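The differing flip rates can be counted directly. This sketch assumes 4-bit codes over positions 0 to 15 and counts how often each bit changes between consecutive positions.

```python
positions = range(16)

# 4-bit binary code for every position, most significant bit first.
codes = [format(pos, "04b") for pos in positions]

def count_flips(bit: int) -> int:
    """Number of times the given bit (0 = least significant) changes between neighbors."""
    bits = [(pos >> bit) & 1 for pos in positions]
    return sum(a != b for a, b in zip(bits, bits[1:]))

flip_counts = [count_flips(bit) for bit in range(4)]

print(codes[:4])       # codes for positions 0, 1, 2, 3
print(flip_counts)     # flips for bit 0 (fastest) through bit 3 (slowest)
```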
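Instead of binary digits, each dimension of the positional encoding vector oscillates as a sinusoid whose frequency decreases geometrically with the dimension index:
$$\text{PE}_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad \text{PE}_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$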
where $pos$ denotes the position in the sequence, $i$ is the dimension index, and $d_{\text{model}}$ is the embedding dimensionality. Even-indexed dimensions use the sine function; odd-indexed dimensions use the cosine function.
The correspondence to the binary analogy is direct: the highest-frequency sinusoid plays the role of the least significant bit, and the lowest-frequency sinusoid plays the role of the most significant bit.
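A minimal NumPy sketch of the sinusoidal encoding follows. It assumes the base constant 10000 from the original Transformer paper and an even `d_model`; the function name and the sizes used are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int, base: float = 10000.0):
    """Return a (seq_len, d_model) array: sine on even columns, cosine on odd columns."""
    assert d_model % 2 == 0, "this sketch assumes an even embedding dimension"
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model / 2), index of each sin/cos pair
    angles = pos / base ** (2 * i / d_model)   # frequency decreases geometrically with i
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)
print(pe[:3, :4].round(3))   # first columns oscillate fastest, like low-order bits
```

Because the encoding depends only on `pos` and the column index, it can be precomputed once and added to the token embeddings of any sequence up to `seq_len`.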
The positional encoding vector $\text{PE}_{pos} \in \mathbb{R}^{d_{\text{model}}}$ is added element-wise to the corresponding token embedding $\mathbf{e}_{pos} \in \mathbb{R}^{d_{\text{model}}}$:
$$X_{pos} = \mathbf{e}_{pos} + \text{PE}_{pos}$$
This additive operation allows the model to retain semantic information from the token embedding while simultaneously acquiring awareness of sequential order from the positional encoding.
A natural concern arises at this point. If the two vectors are simply added together, the resulting vector $X_i$ cannot be decomposed back into its two components — the semantic content and the positional signal are entangled. Does this mean that information is lost in the process?
The answer is that information is indeed mixed, but not lost in any harmful sense.
A useful analogy is that of a musical chord. When two notes are played simultaneously, the resulting sound cannot be cleanly separated back into its individual tones by a passive listener. Yet the chord as a whole conveys richer information than either note alone, and a trained musician can respond appropriately to it. In the same way, the summed vector $X_i$ encodes both what the token is and where it appears, and the model learns to respond to this joint signal.
Furthermore, in practice, the embedding and positional encoding tend to occupy somewhat different regions of the high-dimensional space, and their sum tends to preserve both signals in a recoverable manner: not by explicit decomposition, but through the learned weights of the model. Empirically, models trained with this scheme perform well both on tasks that depend on word order and on tasks that depend on semantic content, which suggests that neither source of information is effectively suppressed by the addition.
Having represented each token as a vector in $\mathbb{R}^d$, the attention mechanism must determine how relevant each token is to every other token. This requires a measure of similarity between vectors.
A natural choice is the cosine similarity, which measures the angle between two vectors regardless of their magnitude:
$$\cos(q, k) = \frac{q \cdot k}{\lVert q \rVert \, \lVert k \rVert}$$
In practice, however, when the embedding vectors are normalized or when the magnitudes are controlled through training, the denominator becomes approximately constant. In this case, cosine similarity is proportional to the dot product, which is cheaper to compute:
$$q \cdot k = \sum_{i=1}^{d} q_i k_i$$
The intuition is straightforward: if two token representations point in similar directions in the embedding space, their dot product is large — they are semantically related. If they are orthogonal, the dot product is zero — they carry unrelated information.
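The relationship between the two measures can be checked on small hand-picked vectors. The vectors below are hypothetical: `b` is a scaled copy of `a`, and `c` is orthogonal to `a`.

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector norms."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a, twice the length
c = np.array([3.0, 0.0, -1.0])   # orthogonal to a

print(a @ b, cosine_similarity(a, b))   # large dot product, cosine of 1
print(a @ c, cosine_similarity(a, c))   # zero dot product, cosine of 0

# After normalizing to unit length, the dot product equals the cosine similarity.
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
print(a_hat @ b_hat)
```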
In matrix form, when computing all query-key similarity pairs simultaneously:
$$S = QK^T$$
where entry $S_{ij}$ is the dot product between the $i$-th query and the $j$-th key.
Scaling the Dot Product
There is one practical issue with the raw dot product. As the dimension $d$ grows, the dot products tend to grow large in magnitude. To see why, consider $q$ and $k$ with i.i.d. entries drawn from $\mathcal{N}(0, 1)$. The dot product $q \cdot k = \sum_{i=1}^d q_i k_i$ has variance $d$, so its standard deviation grows as $\sqrt{d}$. Large dot products push the subsequent softmax into saturation — one score dominates and the remaining weights collapse toward zero, causing gradients to vanish.
To correct for this, the dot product is divided by $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
This scaling keeps the variance of the scores at 1 regardless of $d_k$, ensuring stable softmax outputs and well-behaved gradients throughout training. This operation is known as scaled dot-product attention.
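The variance argument can be checked empirically. The sketch below assumes query and key entries drawn i.i.d. from a standard normal distribution, as in the derivation, and uses hypothetical dimensions 16 and 256 with a fixed random seed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 10_000                      # number of random (q, k) pairs per dimension

stats = {}
for d in (16, 256):
    q = rng.standard_normal((n_pairs, d))
    k = rng.standard_normal((n_pairs, d))
    raw = (q * k).sum(axis=1)         # unscaled dot products
    scaled = raw / np.sqrt(d)         # scaled dot products
    stats[d] = (raw.var(), scaled.var())

for d, (raw_var, scaled_var) in stats.items():
    print(f"d={d:3d}  var(raw)={raw_var:8.2f}  var(scaled)={scaled_var:.3f}")
```

The raw variance tracks `d`, while the scaled variance stays near 1 for both dimensions, which keeps the softmax inputs in a similar range as the model width changes.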
Visualization of Attention as Matrix Operations
To build intuition for the attention computation, it is instructive to trace the dimensions of each matrix involved. Consider the special case of a single query vector $q \in \mathbb{R}^{1 \times d_k}$, with $m$ key and value vectors:
$$q \in \mathbb{R}^{1 \times d_k}, \qquad K \in \mathbb{R}^{m \times d_k}, \qquad V \in \mathbb{R}^{m \times d_v}$$
The attention computation proceeds in two steps. First, the similarity between the query and every key is computed via the scaled dot product:
$$s = \frac{qK^T}{\sqrt{d_k}} \in \mathbb{R}^{1 \times m}$$
This produces a row vector of $m$ similarity scores — one for each key. Applying softmax converts these scores into attention weights $\alpha \in \mathbb{R}^{1 \times m}$, which sum to one.
Second, the output is computed as a weighted sum of the value vectors:
$$y = \alpha V \in \mathbb{R}^{1 \times d_v}$$
As illustrated in the figure below, the single query vector (a thin horizontal slice) is multiplied against $K^T$ to produce a thin score vector of length $m$. This score vector is then multiplied against $V$ to produce the output $y$ — again a thin horizontal vector of dimension $d_v$. The shape of the query propagates through the entire computation: a single query in, a single output vector out.
The output $y$ is not any single value vector, but a soft blend of all value vectors:
$$y = \sum_{j=1}^{m} \alpha_j v_j$$
dominated by the values whose keys were most similar to the query.
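The shapes in the single-query case can be traced in code. This sketch assumes random stand-in matrices and hypothetical sizes `m = 6`, `d_k = 4`, `d_v = 3`.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
m, d_k, d_v = 6, 4, 3

q = rng.normal(size=(1, d_k))     # one query
K = rng.normal(size=(m, d_k))     # m keys
V = rng.normal(size=(m, d_v))     # m values

s = q @ K.T / np.sqrt(d_k)        # step 1: scaled scores, shape (1, m)
alpha = softmax(s)                # attention weights, shape (1, m)
y = alpha @ V                     # step 2: weighted sum of values, shape (1, d_v)

print(s.shape, alpha.shape, y.shape)
```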
The General Case: Multiple Queries
In practice, a sequence of $m$ tokens is processed simultaneously rather than one query at a time. Each token acts as a query and must attend to all other tokens in the sequence. The computation therefore extends naturally to the case of multiple queries, stacked as rows of the query matrix $Q$:
$$Q \in \mathbb{R}^{m \times d_k}, \qquad K \in \mathbb{R}^{m \times d_k}, \qquad V \in \mathbb{R}^{m \times d_v}$$
As illustrated in the figure above, $Q$ and $K$ are now both block matrices rather than thin slices. The scaled dot-product between all query-key pairs is computed in a single matrix multiplication:
$$\frac{QK^T}{\sqrt{d_k}} \in \mathbb{R}^{m \times m}$$
Each entry $(i, j)$ of this matrix represents the similarity score between the $i$-th query and the $j$-th key. Applying softmax row-wise converts each row into a valid probability distribution over the $m$ keys, yielding the attention weight matrix $A \in \mathbb{R}^{m \times m}$.
The output matrix is then obtained by multiplying the attention weights against the value matrix:
$$Y = AV \in \mathbb{R}^{m \times d_v}$$
Each row $y_i$ of $Y$ is the output representation for the $i$-th token: a weighted sum of all value vectors, where the weights reflect how closely each key matched that token's query:
$$y_i = \sum_{j=1}^{m} A_{ij} \, v_j$$
The full attention computation for an entire input sequence of $m$ tokens is therefore carried out in three matrix operations: $QK^T$, softmax, and multiplication by $V$. This is computationally efficient and, crucially, fully parallelizable — all tokens are processed simultaneously without the sequential dependencies that characterize recurrent architectures.
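The batched form can be checked against the one-query-at-a-time form. The sketch below assumes random stand-in matrices and a hypothetical helper named `attention`; it confirms that row `i` of the batched output equals the result of running query `i` on its own.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; returns the output and the weight matrix."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)                   # row-wise softmax
    return A @ V, A

rng = np.random.default_rng(0)
m, d_k, d_v = 6, 4, 3
Q = rng.normal(size=(m, d_k))
K = rng.normal(size=(m, d_k))
V = rng.normal(size=(m, d_v))

Y, A = attention(Q, K, V)               # all m queries in one pass
y_2, _ = attention(Q[2:3], K, V)        # query 2 processed alone

print(A.shape, Y.shape, np.allclose(Y[2], y_2[0]))
```

No query depends on the result of another, which is why the rows can be computed in parallel.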
(1) Projecting Inputs into Query, Key, and Value Spaces
So far, we have treated $Q$, $K$, and $V$ as given. In practice, however, the model receives a single input: the embedded token matrix $X \in \mathbb{R}^{N \times d_{\text{model}}}$, where $N$ is the number of tokens in the sequence and $d_{\text{model}}$ is the embedding dimension. The question then arises naturally — where do $Q$, $K$, and $V$ come from?
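The answer is that $Q$, $K$, and $V$ are each obtained by linearly projecting the same input matrix $X$ through three separate learned weight matrices:
$$Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V$$
where the projection matrices are:
$$W^Q,\; W^K,\; W^V \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$$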
As illustrated in the figure below, the same input $X \in \mathbb{R}^{N \times d_{\text{model}}}$ is multiplied by each of the three weight matrices independently, producing three distinct projections of size $N \times d_{\text{model}}$. Each projection extracts a different aspect of the input: the query projection encodes what each token is looking for, the key projection encodes what each token has to offer, and the value projection encodes the content to be aggregated.
These three weight matrices, $W^Q$, $W^K$, and $W^V$, are learnable parameters of the model. They are initialized randomly and updated through gradient descent during training. The model thereby learns, from data alone, how to project the input into a space where meaningful similarity comparisons can be made. There is no hand-crafted notion of relevance — the attention pattern that emerges is entirely determined by what the task requires.
In other words, for attention to work correctly, the model must learn to project the input into query, key, and value spaces that are appropriate for the task at hand. This is precisely what training achieves: by optimizing $W^Q$, $W^K$, and $W^V$ through gradient descent, the model learns which aspects of each token should be compared, which should be retrieved, and how the resulting outputs should be composed. The quality of attention is therefore inseparable from the quality of these three learned projections.
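A minimal sketch of the three projections follows. The weight matrices here are randomly initialized stand-ins (in a trained model they would be learned), and the sizes `N = 5`, `d_model = 8` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_model = 5, 8

X = rng.normal(size=(N, d_model))     # embedded input sequence

# Three separate projection matrices, each d_model x d_model.
W_Q = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
W_K = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
W_V = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))

# The same input is projected three ways.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

print(Q.shape, K.shape, V.shape)
```

Because the three matrices differ, the same token gets different query, key, and value vectors, which lets the model separate what a token looks for from what it offers.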
(2) Compute attention weighting and extract features with high attention
The attention operation described in the previous section constitutes a single self-attention head. A single head projects the input into one query, key, and value space and learns one pattern of relevance across the sequence. While this is already expressive, it is limited to attending to one type of relationship at a time.
In practice, language exhibits multiple types of dependency simultaneously. A single token may be syntactically related to one token, coreferentially linked to another, and semantically associated with a third. A single attention head cannot capture all of these patterns at once, as it produces a single weighted combination of the value vectors.
Multi-head attention addresses this limitation by running $h$ attention heads in parallel, each with its own independent projection matrices $W^Q_i$, $W^K_i$, and $W^V_i$:
$$\text{head}_i = \text{Attention}(XW^Q_i,\; XW^K_i,\; XW^V_i), \qquad i = 1, \dots, h$$
Each head operates in a lower-dimensional subspace and learns to attend to a different aspect of the input. The outputs of all heads are then concatenated and passed through a final linear projection $W^O$:
$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O$$
This allows the model to jointly attend to information from multiple representational subspaces, making multi-head attention significantly more expressive than a single head alone. Each attention head serves as a modular building block that can be composed into larger network architectures, as will be discussed in the following section.
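The head-splitting step can be sketched in NumPy. This version assumes one full-width projection per role that is reshaped into `h` heads of width `d_model // h`, which is equivalent to giving each head its own narrower matrix. All weights are random stand-ins and the function name is illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Return the (N, d_model) output and the (h, N, N) per-head attention weights."""
    N, d_model = X.shape
    d_k = d_model // h

    def split(M):
        # (N, d_model) -> (h, N, d_k): head i takes columns i*d_k to (i+1)*d_k.
        return M.reshape(N, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))     # (h, N, N)
    heads = A @ V                                            # (h, N, d_k)
    concat = heads.transpose(1, 0, 2).reshape(N, d_model)    # concatenate the heads
    return concat @ W_O, A

rng = np.random.default_rng(0)
N, d_model, h = 5, 8, 2
X = rng.normal(size=(N, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
                      for _ in range(4))

out, A = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape, A.shape)
```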
Having introduced tokenization, embedding, positional encoding, and multi-head attention, we are now in a position to assemble these components into a single architectural unit known as the Transformer block.
A Transformer block consists of two sub-layers. The first is a multi-head self-attention layer, which allows each token to gather information from all other tokens in the sequence. The second is a position-wise feed-forward network (MLP), which applies a nonlinear transformation independently to each token representation. The specific configuration of these sub-layers depends on the purpose of the model.
Two additional components are incorporated to improve training stability and optimization efficiency.
The first is a residual connection, which bypasses each sub-layer by adding the input directly to the output:
$$x' = x + \text{Sublayer}(x)$$
Residual connections allow gradients to flow directly through the network without passing through the sub-layer, alleviating the vanishing gradient problem in deep architectures.
The second is layer normalization, applied after each residual addition:
$$\text{LayerNorm}\big(x + \text{Sublayer}(x)\big)$$
Layer normalization stabilizes the distribution of activations across the embedding dimension, leading to faster and more reliable convergence during training.
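The residual-then-normalize pattern can be sketched as follows. The `sublayer` function is a hypothetical stand-in for attention or the feed-forward network, and this `layer_norm` omits the learnable scale and shift that `nn.LayerNorm` includes.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token (row) to zero mean and unit variance over its features."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
N, d_model = 5, 8
x = rng.normal(loc=3.0, scale=2.0, size=(N, d_model))   # deliberately off-center input
W = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))

def sublayer(x):
    # Hypothetical stand-in for a sub-layer (attention or feed-forward network).
    return np.tanh(x @ W)

out = layer_norm(x + sublayer(x))    # residual addition followed by normalization

print(out.mean(axis=-1).round(6))    # per-token mean after normalization
print(out.std(axis=-1).round(4))     # per-token standard deviation after normalization
```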
Together, these components form a single Transformer block. Multiple such blocks are stacked sequentially to form the full model, with each block refining the token representations produced by the previous one.
import numpy as np
import math
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split
import random
random.seed(42)
from google.colab import drive
drive.mount('/content/drive')
X = np.load('/content/drive/MyDrive/DL/DL_data/TransformerData/classification_X.npy')
y = np.load('/content/drive/MyDrive/DL/DL_data/TransformerData/classification_y.npy')
print(f"Loaded X {X.shape}, y {y.shape}")
class TimeSeriesDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.from_numpy(X).float()
        self.y = torch.from_numpy(y).long()

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
ds = TimeSeriesDataset(X, y)
n_train = int(0.8 * len(ds))
torch.manual_seed(42)
generator = torch.Generator().manual_seed(42)
train_ds, test_ds = random_split(ds, [n_train, len(ds) - n_train], generator=generator)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=64, shuffle=False)
print(f"Train samples: {len(train_ds)}, Test samples: {len(test_ds)}")
Visualize a few samples
num_examples = 3
plt.figure(figsize=(6, 2 * num_examples))
for i in range(num_examples):
    sig, label = ds[i]
    plt.subplot(num_examples, 1, i+1)
    plt.plot(sig.numpy(), label=f"class={label.item()}")
    plt.title(f"Example {i} — class {label.item()}")
plt.tight_layout()
plt.show()
class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k, dropout=0.0):
        super().__init__()
        self.d_k = d_k
        self.dropout = nn.Dropout(dropout) if dropout > 0 else None

    def forward(self, Q, K, V):
        # Q, K, V: (..., T, d_k); scores and weights: (..., T, T)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        weights = F.softmax(scores, dim=-1)
        if self.dropout is not None:
            weights = self.dropout(weights)
        out = torch.matmul(weights, V)
        return out
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads, dropout=0.0):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.q_lin = nn.Linear(d_model, d_model)
        self.k_lin = nn.Linear(d_model, d_model)
        self.v_lin = nn.Linear(d_model, d_model)
        self.attn = ScaledDotProductAttention(self.d_k, dropout)
        self.fc = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout) if dropout > 0 else None
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        B, T, _ = x.size()
        # Project, then split into heads: (B, T, d_model) -> (B, n_heads, T, d_k)
        Q = self.q_lin(x).view(B, T, self.n_heads, self.d_k).permute(0, 2, 1, 3)
        K = self.k_lin(x).view(B, T, self.n_heads, self.d_k).permute(0, 2, 1, 3)
        V = self.v_lin(x).view(B, T, self.n_heads, self.d_k).permute(0, 2, 1, 3)
        out = self.attn(Q, K, V)
        # Merge heads back: (B, n_heads, T, d_k) -> (B, T, d_model)
        out = out.permute(0, 2, 1, 3).contiguous().view(B, T, -1)
        out = self.fc(out)
        if self.dropout is not None:
            out = self.dropout(out)
        # Residual connection followed by layer normalization
        return self.norm(x + out)
class EncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=nhead,
            dropout=dropout,
            batch_first=True
        )
        self.linear1 = nn.Linear(d_model, d_ff)
        self.act = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop1 = nn.Dropout(dropout)
        self.drop2 = nn.Dropout(dropout)
        self.attn_weights = None

    def forward(self, x):
        # Self-attention sub-layer; per-head weights are kept for visualization
        attn_out, weights = self.self_attn(
            x, x, x,
            need_weights=True,
            average_attn_weights=False
        )
        self.attn_weights = weights
        x = x + self.drop1(attn_out)
        x = self.norm1(x)
        # Position-wise feed-forward sub-layer
        ff = self.linear2(self.dropout(self.act(self.linear1(x))))
        x = x + self.drop2(ff)
        x = self.norm2(x)
        return x
class Transformer(nn.Module):
    def __init__(self, seq_len=2000, d_model=64, nhead=4, nlayers=3, d_ff=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(1, d_model),
            nn.LayerNorm(d_model)
        )
        # Learned positional embedding, one vector per timestep
        self.pos = nn.Parameter(torch.randn(1, seq_len, d_model))
        self.layers = nn.ModuleList([
            EncoderLayer(d_model, nhead, d_ff, dropout=0.1)
            for _ in range(nlayers)
        ])
        self.cls = nn.Sequential(
            nn.Linear(d_model, 64),
            nn.ReLU(),
            nn.Linear(64, 2)
        )
        self.attn_maps = []

    def forward(self, x):
        B, T = x.shape
        x = self.proj(x.unsqueeze(-1))      # (B, T) -> (B, T, d_model)
        x = x + 0.1 * self.pos[:, :T]
        self.attn_maps = []
        for lyr in self.layers:
            x = lyr(x)
            self.attn_maps.append(lyr.attn_weights)
        logits = self.cls(x.mean(dim=1))    # mean-pool over time, then classify
        return logits
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Transformer(seq_len=X.shape[1]).to(device)
lr = 5e-4
epochs = 5
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
for epoch in range(1, epochs + 1):
    model.train()
    total_loss = 0.0
    total_correct = 0
    total = 0
    for Xb, yb in train_loader:
        Xb, yb = Xb.to(device), yb.to(device)
        optimizer.zero_grad()
        logits = model(Xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * yb.size(0)
        preds = logits.argmax(dim=1)
        total_correct += (preds == yb).sum().item()
        total += yb.size(0)
    train_loss = total_loss / total
    train_acc = total_correct / total * 100
    print(f"Epoch {epoch}/{epochs} | "
          f"Train loss: {train_loss:.4f} | "
          f"Train acc: {train_acc:5.2f}%")
You can skip training by loading pretrained model weights instead:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Transformer(seq_len=X.shape[1], d_model=64, nhead=4, nlayers=3, d_ff=128)
model = model.to(device)
load_path = '/content/drive/MyDrive/DL/DL_data/TransformerData/transformer_weights.pth'
model.load_state_dict(torch.load(load_path, map_location=device))
model.eval()
print(f"Loaded model weights from {load_path}")
model.eval()
criterion = nn.CrossEntropyLoss()
test_loss = 0.0
test_correct = 0
total = 0
with torch.no_grad():
    for batch in test_loader:
        Xb, yb = batch
        Xb, yb = Xb.to(device), yb.to(device)
        logits = model(Xb)
        loss = criterion(logits, yb)
        bs = yb.size(0)
        test_loss += loss.item() * bs
        preds = logits.argmax(dim=1)
        correct = (preds == yb).sum().item()
        test_correct += correct
        total += bs
val_loss = test_loss / total
val_acc = test_correct / total * 100
print("Val Loss:", round(val_loss, 4), " Val Acc:", round(val_acc, 2), "%")
all_indices = list(range(len(test_ds)))
sample_indices = random.sample(all_indices, 3)
for i in range(3):
    idx = sample_indices[i]
    sig, true_label = test_ds[idx]
    sig_np = sig.numpy()
    with torch.no_grad():
        out = model(sig.unsqueeze(0).to(device))
    pred_label = out.argmax(dim=1).cpu().item()
    plt.figure(figsize=(6, 2))
    plt.plot(sig_np)
    plt.title("Sample " + str(i+1) +
              " (idx=" + str(idx) + ") — True: " + str(true_label.item()) +
              ", Pred: " + str(pred_label))
    plt.xlabel("Timestep")
    plt.ylabel("Signal")
    plt.show()
Visualization of attention maps
sig_idx = 500
sig = X[test_ds.indices[sig_idx]]
# Cast to float32 so the input dtype matches the model weights
sig_t = torch.tensor(sig, dtype=torch.float32).unsqueeze(0).to(device)
model.eval()
with torch.no_grad():
    _ = model(sig_t)
attn_all = torch.stack(model.attn_maps)
attn_avg = attn_all.mean(dim=2).mean(dim=0)[0]
plt.figure(figsize=(5,4))
plt.title("Average attention (all layers & heads)")
plt.imshow(attn_avg.cpu(), origin='lower', aspect='auto')
plt.colorbar(label="attention weight")
plt.xlabel("Key timestep"); plt.ylabel("Query timestep")
plt.clim([0,0.015])
plt.tight_layout()
plt.show()
layer_id = 2
maps_per_head = attn_all[layer_id, 0]
n_heads = maps_per_head.shape[0]
cols = min(n_heads, 4)
rows = (n_heads + cols - 1) // cols
# squeeze=False keeps axes a 2-D array even when there is a single subplot
fig, axes = plt.subplots(rows, cols, figsize=(8, 3), squeeze=False)
axes = axes.flatten()
for h in range(n_heads):
    ax = axes[h]
    im = ax.imshow(maps_per_head[h].cpu(), origin='lower', aspect='auto', vmin=0, vmax=0.002)
    ax.set_title(f"Layer {layer_id}, Head {h}")
    ax.set_xlabel("Key timestep")
    ax.set_ylabel("Query timestep")
for ax in axes[n_heads:]:
    ax.axis("off")
plt.tight_layout()
plt.show()
sig = X[test_ds.indices[sig_idx]]
# Cast to float32 so the input dtype matches the model weights
sig_t = torch.tensor(sig, dtype=torch.float32).unsqueeze(0).to(device)
model.eval()
with torch.no_grad():
    _ = model(sig_t)
attn_all = torch.stack(model.attn_maps)
score = attn_all.mean(dim=2).mean(dim=0).mean(dim=1)[0].cpu().numpy()
min_s = score.min()
max_s = score.max()
score = (score - min_s) / (max_s - min_s + 1e-9)
thr = 0.1
mask = score >= thr
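plt.figure(figsize=(6, 3))
plt.plot(sig, label="Signal (class=1)")
# Shade the timesteps whose normalized attention score reaches the threshold
plt.fill_between(
    np.arange(len(sig)),
    sig.min(), sig.max(),
    where=mask,
    color="red", alpha=0.25,
    label=f"Attention ≥ {thr:.2f}"
)
plt.xlabel("Timestep")
plt.ylabel("Signal amplitude")
plt.legend()  # without this call the labels set above are never displayed
plt.tight_layout()
plt.show()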
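The two plots above reduce the stacked attention tensor by averaging over heads, then layers, and finally over query positions. The following is a minimal NumPy sketch of those reductions, assuming the stacked maps have shape (layers, batch, heads, T, T) as the indexing above suggests. The helper `summarize_attention` and the random toy input are hypothetical and only illustrate the shapes.

```python
import numpy as np

def summarize_attention(attn):
    """attn: array of shape (n_layers, batch, n_heads, T, T)."""
    per_layer = attn.mean(axis=2)        # average over heads  -> (n_layers, batch, T, T)
    avg_map = per_layer.mean(axis=0)[0]  # average over layers, keep sample 0 -> (T, T)
    key_score = avg_map.mean(axis=0)     # average over queries -> one score per key timestep
    return avg_map, key_score

# Hypothetical toy input: random maps whose rows are normalized like softmax output
rng = np.random.default_rng(0)
raw = rng.random((2, 1, 4, 5, 5))
toy_attn = raw / raw.sum(axis=-1, keepdims=True)

toy_map, toy_score = summarize_attention(toy_attn)
print(toy_map.shape, toy_score.shape)  # (5, 5) (5,)
```

Because every row of a softmax output sums to 1, the averaged map keeps that property, which is why the color limits in the plots are set to small values for long sequences.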
import numpy as np
import math
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split
import random
random.seed(42)
from google.colab import drive
drive.mount('/content/drive')
signal = np.load('/content/drive/MyDrive/DL/DL_data/TransformerData/raw_signal.npy')
print(f"Loaded raw signal: {signal.shape}")
n_step = 25
n_input = 100
n_output = 100
stride = 5
HOLDOUT = 5_000
signal = (signal - signal.mean()) / signal.std()
train_signal = signal[:-HOLDOUT]
val_signal = signal[-HOLDOUT:]
total_in_len = n_step * n_input
total_out_len = n_output
n_samples = (len(train_signal) - total_in_len - total_out_len) // stride
print(f"Total samples: {n_samples}")
X = np.zeros((n_samples, n_step, n_input), dtype=np.float32)
y = np.zeros((n_samples, n_output), dtype=np.float32)
for i in range(n_samples):
    start = i * stride
    end = start + total_in_len
    # Input: n_step consecutive blocks of n_input points; target: the next n_output points
    X[i] = train_signal[start:end].reshape(n_step, n_input)
    y[i] = train_signal[end:end + n_output]
print(f"Sliced X {X.shape}, y {y.shape}")
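To make the slicing above easier to follow, here is a small sketch of the same sliding-window logic on a toy ramp signal. The helper name `make_windows` and the tiny sizes are assumptions chosen for readability and are not part of the notebook.

```python
import numpy as np

def make_windows(signal, n_step, n_input, n_output, stride):
    """Slice a 1-D signal into (n_step, n_input) inputs and n_output targets."""
    total_in_len = n_step * n_input
    n_samples = (len(signal) - total_in_len - n_output) // stride
    X = np.zeros((n_samples, n_step, n_input), dtype=np.float32)
    y = np.zeros((n_samples, n_output), dtype=np.float32)
    for i in range(n_samples):
        start = i * stride
        end = start + total_in_len
        X[i] = signal[start:end].reshape(n_step, n_input)
        y[i] = signal[end:end + n_output]
    return X, y

# Toy ramp 0, 1, ..., 39 so that each value equals its own timestep index
toy_signal = np.arange(40, dtype=np.float32)
X_toy, y_toy = make_windows(toy_signal, n_step=3, n_input=4, n_output=2, stride=5)

print(X_toy.shape, y_toy.shape)  # (5, 3, 4) (5, 2)
print(X_toy[1])                  # rows 5..8, 9..12, 13..16
print(y_toy[1])                  # [17. 18.]
```

Sample 1 starts at timestep 5 (one stride after sample 0), and its target begins immediately after the last input point.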
class ForecastDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.from_numpy(X).float()
        self.y = torch.from_numpy(y).float()

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
ds = ForecastDataset(X, y)
n_train = int(0.8 * len(ds))
train_ds, test_ds = random_split(ds, [n_train, len(ds) - n_train])
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=64, shuffle=False)
print(f"Train samples: {len(train_ds)}, Test samples: {len(test_ds)}")
plt.figure(figsize=(7, 3))
plt.plot(signal)
plt.title("Full signal")
plt.xlabel("Timestep"); plt.ylabel("Amplitude")
plt.show()
example_x = X[0].reshape(-1)
plt.figure(figsize=(7, 3))
plt.plot(example_x)
plt.title("Example input slice")
plt.xlabel("Timestep in slice"); plt.ylabel("Amplitude")
plt.show()
class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k, dropout=0.0):
        super().__init__()
        self.d_k = d_k
        self.dropout = nn.Dropout(dropout) if dropout else None

    def forward(self, Q, K, V, mask=None):
        # Similarity of every query with every key, scaled by sqrt(d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # mask is True where attention is allowed; block everything else
            scores = scores.masked_fill(~mask, -1e9)
        weights = F.softmax(scores, dim=-1)
        if self.dropout:
            weights = self.dropout(weights)
        out = torch.matmul(weights, V)
        return out, weights
class MaskedMultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads, dropout=0.0):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.q_lin = nn.Linear(d_model, d_model)
        self.k_lin = nn.Linear(d_model, d_model)
        self.v_lin = nn.Linear(d_model, d_model)
        self.sdp_attention = ScaledDotProductAttention(self.head_dim, dropout)
        self.out_lin = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout) if dropout else None

    def forward(self, query, key, value, mask):
        B, T, _ = query.size()

        def _split(x):
            # (B, T, d_model) -> (B * n_heads, T, head_dim)
            return (x.view(B, T, self.n_heads, self.head_dim)
                     .permute(0, 2, 1, 3)
                     .reshape(-1, T, self.head_dim))

        q = _split(self.q_lin(query))
        k = _split(self.k_lin(key))
        v = _split(self.v_lin(value))
        # Repeat each sample's mask once per head to match the (B * n_heads) layout
        mask_heads = mask.repeat_interleave(self.n_heads, dim=0)
        attn_out, attn_w = self.sdp_attention(q, k, v, mask_heads)
        # (B * n_heads, T, head_dim) -> (B, T, d_model)
        attn_out = (attn_out.view(B, self.n_heads, T, self.head_dim)
                            .permute(0, 2, 1, 3)
                            .contiguous()
                            .view(B, T, self.d_model))
        out = self.out_lin(attn_out)
        if self.dropout:
            out = self.dropout(out)
        attn_w = attn_w.view(B, self.n_heads, T, T)
        return out, attn_w
def make_causal_mask(seq: torch.Tensor) -> torch.Tensor:
    B, T, _ = seq.size()
    # Lower-triangular (including the diagonal): True where attention is allowed
    mask = 1 - torch.triu(torch.ones(T, T, device=seq.device), diagonal=1)
    return mask.bool().unsqueeze(0).expand(B, -1, -1)
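The effect of the causal mask on the attention weights can be checked on a tiny example. The sketch below uses NumPy rather than PyTorch, purely for illustration; `causal_mask_np` and `masked_softmax` are hypothetical helpers that mirror `make_causal_mask` and the `masked_fill(~mask, -1e9)` step above.

```python
import numpy as np

def causal_mask_np(T):
    # True where attention is allowed: key index <= query index
    return np.tril(np.ones((T, T), dtype=bool))

def masked_softmax(scores, mask):
    # Blocked positions receive a very negative score, hence ~0 weight
    scores = np.where(mask, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

toy_T = 4
toy_mask = causal_mask_np(toy_T)
toy_scores = np.random.default_rng(0).normal(size=(toy_T, toy_T))
toy_weights = masked_softmax(toy_scores, toy_mask)

print(toy_mask.astype(int))   # lower-triangular matrix of ones
print(toy_weights.round(3))   # zeros above the diagonal, each row sums to 1
```

The first query can only attend to itself, so its row is a single weight of 1. Later queries spread their weight over the current and earlier timesteps only.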
class DecoderLayer(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MaskedMultiHeadAttention(d_model, n_heads, dropout)
        self.linear1 = nn.Linear(d_model, d_ff)
        self.act = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop1 = nn.Dropout(dropout)
        self.drop2 = nn.Dropout(dropout)
        self.attn_weights = None

    def forward(self, x, mask):
        # Masked self-attention with residual connection and post-layer-norm
        attn_out, w = self.self_attn(x, x, x, mask)
        self.attn_weights = w  # kept for later visualization
        x = self.norm1(x + self.drop1(attn_out))
        # Position-wise feed-forward block with residual connection
        ff = self.linear2(self.dropout(self.act(self.linear1(x))))
        x = self.norm2(x + self.drop2(ff))
        return x
class Transformer(nn.Module):
    def __init__(
        self,
        n_step = 25,
        n_input = 100,
        n_output = 100,
        d_model = 128,
        n_heads = 8,
        n_layers = 2,
        d_ff = 512
    ):
        super().__init__()
        # Embed each block of n_input points into a d_model-dimensional token
        self.proj = nn.Sequential(
            nn.Linear(n_input, d_model),
            nn.LayerNorm(d_model)
        )
        # Learned positional embedding, one vector per step
        self.pos = nn.Parameter(torch.randn(1, n_step, d_model))
        self.layers = nn.ModuleList([
            DecoderLayer(d_model, n_heads, d_ff, dropout=0.1)
            for _ in range(n_layers)
        ])
        # Regression head over the flattened sequence of tokens
        self.mlp = nn.Sequential(
            nn.Linear(d_model * n_step, 256),
            nn.ReLU(),
            nn.Linear(256, n_output)
        )
        self.attn_maps = []

    def forward(self, x, return_attn=False):
        x = self.proj(x)
        x = x + 0.1 * self.pos[:, : x.size(1)]
        mask = make_causal_mask(x)
        self.attn_maps = []
        for layer in self.layers:
            x = layer(x, mask)
            self.attn_maps.append(layer.attn_weights)
        out = self.mlp(x.flatten(start_dim=1))
        if return_attn:
            return out, self.attn_maps
        return out
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Transformer(
n_step = X.shape[1],
n_input = X.shape[2],
n_output = y.shape[1]
).to(device)
lr = 5e-4
epochs = 10
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
for epoch in range(1, epochs + 1):
    model.train()
    total_loss = 0.0
    total = 0
    for Xb, yb in train_loader:
        Xb, yb = Xb.to(device), yb.to(device)
        optimizer.zero_grad()
        preds = model(Xb)
        loss = criterion(preds, yb)
        loss.backward()
        optimizer.step()
        # Weight each batch loss by its size so the epoch average is per sample
        total_loss += loss.item() * yb.size(0)
        total += yb.size(0)
    train_loss = total_loss / total
    print(f"Epoch {epoch}/{epochs} | Train MSE: {train_loss:.6f}")
You can skip training by downloading the pretrained model here
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Transformer(
n_step = n_step,
n_input = n_input,
n_output = n_output,
d_model = 128,
n_heads = 8,
n_layers = 2,
d_ff = 512
).to(device)
load_path = "/content/drive/MyDrive/DL/DL_data/TransformerData/forecast_weights.pth"
model.load_state_dict(torch.load(load_path, map_location=device))
model.eval()
print(f"Loaded model weights from {load_path}")
horizon = 2500
val_seg = signal[-(n_step * n_input + horizon):]
window_flat = val_seg[: n_step * n_input]
gt_future = val_seg[n_step * n_input:]
window = (torch.from_numpy(window_flat.astype(np.float32))
.view(1, n_step, n_input)
.to(device))
model.eval()
with torch.no_grad():
    one_pred = model(window).cpu().numpy().ravel()
plt.figure(figsize=(4.3, 3))
plt.plot(range(n_step * n_input), window_flat, label="Given")
# The prediction spans n_output points (equal to n_input in this setup)
plt.plot(range(n_step * n_input, n_step * n_input + n_output),
         one_pred, 'r', label="Predicted")
plt.plot(range(n_step * n_input, n_step * n_input + n_output),
         gt_future[:n_output], 'b--', label="Ground truth")
plt.axvline(n_step * n_input, color='k', linestyle='--')
plt.xlabel("Timestep"); plt.ylabel("Amplitude"); plt.legend()
plt.show()
preds = []
roll_window = window.clone()
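# Autoregressive rollout: each prediction is fed back in as the newest block.
# This relies on n_output == n_input, so that one prediction fills one step of the window.
with torch.no_grad():
    for _ in range(horizon // n_output):
        out = model(roll_window)
        preds.append(out.cpu().numpy().ravel())
        roll_window = torch.cat([roll_window[:, 1:], out.unsqueeze(1)], dim=1)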
preds = np.concatenate(preds)
plt.figure(figsize=(8, 3))
plt.plot(range(n_step * n_input), window_flat, label="Given")
plt.plot(range(n_step * n_input, n_step * n_input + horizon),
preds, 'r', label="Predicted")
plt.plot(range(n_step * n_input, n_step * n_input + horizon),
gt_future, 'b--', label="Ground truth")
plt.axvline(n_step * n_input, color='k', linestyle='--')
plt.xlabel("Timestep"); plt.ylabel("Amplitude"); plt.legend()
plt.show()
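The rolling forecast above drops the oldest block of the window and appends the newest prediction at every step. The sketch below isolates that window update in NumPy. Note that `ramp_model` is a hypothetical stand-in predictor, not the trained Transformer, and `rollout` is an illustrative helper name.

```python
import numpy as np

def rollout(predict_fn, window, n_blocks):
    """Autoregressively extend a (n_step, n_input) window by n_blocks blocks."""
    window = window.copy()
    preds = []
    for _ in range(n_blocks):
        out = predict_fn(window)  # shape (n_input,)
        preds.append(out)
        # Drop the oldest block and append the prediction as the newest block
        window = np.concatenate([window[1:], out[None, :]], axis=0)
    return np.concatenate(preds)

def ramp_model(window):
    # Hypothetical predictor: continue a linear ramp from the last block
    return window[-1] + window.shape[1]

toy_window = np.arange(12, dtype=np.float32).reshape(3, 4)  # values 0..11
toy_preds = rollout(ramp_model, toy_window, n_blocks=2)
print(toy_preds)  # [12. 13. 14. 15. 16. 17. 18. 19.]
```

After the first few steps the window consists partly or entirely of the model's own outputs, so errors can accumulate over a long horizon. Comparing the rollout against the held-out ground truth, as in the plot above, is a way to see this.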
Making a Surrogate Model
import numpy as np
import math
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split
import random
random.seed(42)
from google.colab import drive
drive.mount('/content/drive')
loads = np.load('/content/drive/MyDrive/DL/DL_data/TransformerData/loads.npy') # shape (N, T_load)
stresses = np.load('/content/drive/MyDrive/DL/DL_data/TransformerData/stresses.npy') # shape (N, T_stress)
print(f"Loaded loads: {loads.shape}")
print(f"Loaded stresses: {stresses.shape}")
loads = (loads - loads.mean()) / loads.std()
stresses = (stresses - stresses.mean()) / stresses.std()
class CantileverDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.from_numpy(X).float()
        self.y = torch.from_numpy(y).float()

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
ds = CantileverDataset(loads, stresses)
n_train = int(0.8 * len(ds))
train_ds, test_ds = random_split(ds, [n_train, len(ds) - n_train])
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=64, shuffle=False)
print(f"Train samples: {len(train_ds)}, Test samples: {len(test_ds)}")
x0, y0 = ds[0]
plt.figure(figsize=(6,4))
plt.subplot(2,1,1)
plt.plot(x0.numpy())
plt.title("Load sequence (example 0)")
plt.xlabel("Spatial index"); plt.ylabel("Load")
plt.subplot(2,1,2)
plt.plot(y0.numpy())
plt.title("Stress profile (example 0)")
plt.xlabel("Spatial index"); plt.ylabel("Stress")
plt.tight_layout()
plt.show()
class TransformerEncoderRegressor(nn.Module):
    def __init__(self, seq_len, d_model=64, nhead=8, num_layers=3, dim_feedforward=256, dropout=0.1):
        super().__init__()
        # Each scalar load value becomes one d_model-dimensional token
        self.input_proj = nn.Linear(1, d_model)
        # Fixed sinusoidal positional encoding
        pe = torch.zeros(seq_len, d_model)
        pos = torch.arange(seq_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pos_encoding", pe.unsqueeze(0))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Map each token back to a scalar stress value
        self.output_head = nn.Linear(d_model, 1)

    def forward(self, x):
        h = self.input_proj(x)
        h = h + self.pos_encoding
        h = self.encoder(h)
        return self.output_head(h)
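The positional encoding built in `__init__` above fills even channels with sines and odd channels with cosines at geometrically spaced frequencies. A minimal NumPy sketch of the same construction follows, assuming an even `d_model`. The function name `sinusoidal_pe` is hypothetical and used only for illustration.

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    # Assumes d_model is even, as in the model above
    pe = np.zeros((seq_len, d_model), dtype=np.float32)
    pos = np.arange(seq_len, dtype=np.float32)[:, None]
    div = np.exp(np.arange(0, d_model, 2, dtype=np.float32) * -(np.log(10000.0) / d_model))
    pe[:, 0::2] = np.sin(pos * div)  # even channels
    pe[:, 1::2] = np.cos(pos * div)  # odd channels
    return pe

pe_toy = sinusoidal_pe(seq_len=6, d_model=8)
print(pe_toy.shape)  # (6, 8)
print(pe_toy[0])     # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

Every position receives a distinct, bounded pattern, which lets the encoder tell spatial indices apart even though self-attention itself is order-agnostic.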
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TransformerEncoderRegressor(loads.shape[1]).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
epochs = 10
for epoch in range(1, epochs + 1):
    model.train()
    train_loss = 0.0
    for src, tgt in train_loader:
        # Add a feature dimension: (B, T) -> (B, T, 1)
        src = src.unsqueeze(-1).to(device)
        tgt = tgt.unsqueeze(-1).to(device)
        pred = model(src)
        loss = criterion(pred, tgt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    train_loss /= len(train_loader)

    model.eval()
    test_loss = 0.0
    with torch.no_grad():
        for src, tgt in test_loader:
            src = src.unsqueeze(-1).to(device)
            tgt = tgt.unsqueeze(-1).to(device)
            pred = model(src)
            loss = criterion(pred, tgt)
            test_loss += loss.item()
    test_loss /= len(test_loader)
    # Report the test loss as well; it was computed but not printed before
    print(f"Epoch {epoch}/{epochs} | Train MSE: {train_loss:.6f} | Test MSE: {test_loss:.6f}")
model.eval()
all_preds = []
with torch.no_grad():
    for src, _ in test_loader:
        src = src.unsqueeze(-1).to(device)
        out = model(src).cpu().squeeze(-1)
        all_preds.append(out)
preds = torch.cat(all_preds, dim=0)
n_examples = 5
indices = random.sample(range(len(test_ds)), n_examples)
fig, axes = plt.subplots(n_examples, 2, figsize=(7, 2*n_examples))
for i, idx in enumerate(indices):
    load, true = test_ds[idx]
    pred = preds[idx]  # test_loader is not shuffled, so preds is aligned with test_ds
    # Left column: input load profile
    ax = axes[i, 0]
    ax.plot(load.numpy(), linewidth=2)
    ax.set_title(f"Load profile #{idx}")
    ax.set_xlabel("Spatial index")
    ax.set_ylabel("Load")
    # Right column: true vs. predicted stress profile
    ax = axes[i, 1]
    ax.plot(true.numpy(), label="True", linewidth=2)
    ax.plot(pred.numpy(), "--", label="Predicted", linewidth=2)
    ax.set_title(f"Stress profile #{idx}")
    ax.set_xlabel("Spatial index")
    ax.set_ylabel("Stress")
    ax.legend()
plt.tight_layout()
plt.show()
Download data
Flow field snapshots here
import numpy as np
import math
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split
import random
random.seed(42)
u = np.load('/content/drive/MyDrive/DL/DL_data/TransformerData/u_field.npy')
print(f"Loaded u_field: {u.shape}")
window_size = 5
validation_frames = 5
n_batch = 16
nx, ny, n_snap = u.shape  # the snapshot index is the last axis
flat = u.transpose(2, 0, 1).reshape(n_snap, -1)
train_flat = flat[:-validation_frames]
val_flat = flat[-(validation_frames + window_size):]
class FlowfieldDataset(Dataset):
    def __init__(self, arr, ctx):
        xs, ys = [], []
        # Each sample: ctx consecutive flattened frames -> the frame that follows
        for i in range(len(arr) - ctx):
            xs.append(arr[i:i+ctx])
            ys.append(arr[i+ctx])
        self.x = torch.from_numpy(np.stack(xs)).float()
        self.y = torch.from_numpy(np.stack(ys)).float()

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]
train_ds = FlowfieldDataset(train_flat, window_size)
val_ds = FlowfieldDataset(val_flat, window_size)
train_loader = DataLoader(train_ds, batch_size=n_batch, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=n_batch, shuffle=False)
print(f"Train samples: {len(train_ds)}, Val samples: {len(val_ds)}")
class TransformerDecoder(nn.Module):
    # Decoder-only model: encoder layers applied with a causal mask
    def __init__(
        self,
        ctx,
        feat_dim,
        d_model=512,
        nhead=8,
        num_layers=4,
        dim_feedforward=2048,
        dropout=0.1
    ):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, d_model)
        pe = torch.zeros(ctx, d_model)
        pos = torch.arange(ctx).unsqueeze(1).float()
        div = torch.exp(
            torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        # Non-persistent buffers follow model.to(device) but stay out of the
        # state_dict, so previously saved weights still load unchanged
        self.register_buffer("pos_encoding", pe.unsqueeze(0), persistent=False)
        # True above the diagonal marks future positions that must not be attended to
        mask = torch.triu(torch.ones(ctx, ctx), diagonal=1).bool()
        self.register_buffer("causal_mask", mask, persistent=False)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out_proj = nn.Linear(d_model, feat_dim)

    def forward(self, x):
        h = self.in_proj(x) + self.pos_encoding
        h = self.encoder(h, self.causal_mask)
        # Predict the next frame from the last token only
        return self.out_proj(h[:, -1])
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
feat_dim = train_flat.shape[1]
model = TransformerDecoder(window_size, feat_dim).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
epochs = 200
for epoch in range(1, epochs + 1):
    model.train()
    train_loss = 0.0
    for src, tgt in train_loader:
        src, tgt = src.to(device), tgt.to(device)
        pred = model(src)
        loss = criterion(pred, tgt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    train_loss /= len(train_loader)
    print(f"Epoch {epoch}/{epochs} | Train MSE: {train_loss:.6f}")
You can skip training by downloading the pretrained model here
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
feat_dim = train_flat.shape[1]
model = TransformerDecoder(window_size, feat_dim).to(device)
load_path = '/content/drive/MyDrive/DL/DL_data/TransformerData/u_forecast_weights.pth'
state = torch.load(load_path, map_location=device)
model.load_state_dict(state)
model.eval()
print(f"Loaded TransformerDecoder weights from {load_path}")
model.eval()
preds = []
with torch.no_grad():
    # One-step-ahead predictions on the training windows
    for seq, _ in train_ds:
        inp = seq.unsqueeze(0).to(device)
        out = model(inp).cpu().squeeze(0)
        preds.append(out.numpy())
n_examples = 5
indices = random.sample(range(len(train_ds)), n_examples)
fig, axes = plt.subplots(n_examples, 2, figsize=(10, 3*n_examples))
for i, idx in enumerate(indices):
    _, true_frame = train_ds[idx]
    pred_frame = preds[idx]
    # Reshape the flattened frames back to the (nx, ny) grid
    true_img = true_frame.numpy().reshape(nx, ny)
    pred_img = pred_frame.reshape(nx, ny)
    ax = axes[i, 0]
    ax.imshow(true_img, origin="lower")
    ax.set_title(f"True snapshot #{idx}")
    ax.axis("off")
    ax = axes[i, 1]
    ax.imshow(pred_img, origin="lower")
    ax.set_title(f"Predicted snapshot #{idx}")
    ax.axis("off")
plt.tight_layout()
plt.show()
model.eval()
last_window = flat[-(validation_frames + window_size):-validation_frames]
future_true = flat[-validation_frames:]
input_tensor = torch.from_numpy(last_window).float().unsqueeze(0).to(device)
predictions = []
with torch.no_grad():
    for _ in range(validation_frames):
        out = model(input_tensor)
        frame_pred = out[0]
        predictions.append(frame_pred.cpu().numpy())
        # Slide the window: drop the oldest frame, append the predicted one
        input_tensor = torch.cat([
            input_tensor[:, 1:, :],
            frame_pred.unsqueeze(0).unsqueeze(0).to(device)
        ], dim=1)
pred_np = np.stack(predictions)
gt = future_true.reshape(validation_frames, nx, ny)
pr = pred_np.reshape(validation_frames, nx, ny)
err = pr - gt
fig, axes = plt.subplots(validation_frames, 2, figsize=(8, 3*validation_frames))
for t in range(validation_frames):
    ax = axes[t, 0]
    ax.imshow(gt[t], origin='lower')
    ax.set_title(f"t = −{validation_frames-t} true")
    ax.axis("off")
    ax = axes[t, 1]
    ax.imshow(pr[t], origin='lower')
    ax.set_title(f"t = −{validation_frames-t} pred")
    ax.axis("off")
plt.tight_layout()
plt.show()