Style Transfer
Neural Style Transfer (NST) is a technique in computer vision and image synthesis that generates a new image by combining the content of one image with the style of another. The content image defines the spatial layout and semantic structure of objects, while the style image contributes color, texture, and brushstroke patterns.
The figure above illustrates a representative example of artistic style transfer. On the left is a content image of a European riverside town (Tübingen, Germany); the inset shows the style image, Vincent van Gogh's The Starry Night (1889). On the right is the synthesized output: the architectural structure and spatial layout of the photograph are preserved, while the swirling brushstrokes, deep blue palette, and impasto texture of van Gogh's painting are faithfully reproduced.
The technique was first introduced by Gatys et al. (2015), who demonstrated that a pretrained CNN naturally disentangles content and style representations — enabling high-quality artistic style transfer through direct pixel-space optimization. Prior to this, approaches to texture transfer relied on hand-crafted features and classical texture synthesis pipelines, which lacked the semantic awareness needed for perceptually convincing results.
In this lecture, we will:
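Given two input images:
A content image $I_c$, which supplies the spatial layout and semantic structure
A style image $I_s$, which supplies color, texture, and brushstroke patterns
The objective is to synthesize a third image $I_g$ (the generated image) such that:
$I_g$ preserves the content of $I_c$
$I_g$ reproduces the style of $I_s$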
This is accomplished by optimizing a loss function that balances content and style terms. The resulting image $I_g$ is computed iteratively so that it matches the high-level features of the content image while reproducing the statistical patterns (style) extracted from the style image.
Neural style transfer relies on the hierarchical representation capabilities of convolutional neural networks (CNNs), such as VGG-19 pretrained on the ImageNet dataset. These networks extract multi-scale features from images, which are useful for encoding both spatial content and visual style.
Lower (or upstream) convolutional layers tend to capture local structures, such as edges, textures, and basic color contrasts
Higher (or downstream) convolutional layers encode more abstract representations, including object shapes, part relationships, and spatial configurations
The figure above illustrates this hierarchy. As the input passes through successive convolutional layers, the learned filters progress from detecting simple oriented edges, to combinations of edges forming facial parts, to high-level object models — each level providing a richer semantic abstraction of the input.
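One concrete way to see why early layers respond to local structure while deeper layers can respond to whole objects is to compute each layer's receptive field, that is, how many input pixels along one axis can influence a single activation. The sketch below is an illustrative aside rather than part of the original method. It assumes the VGG16 layout used in the implementation part of this lecture (3x3 convolutions with stride 1, 2x2 max-pooling with stride 2), and the helper name `receptive_fields` is made up for this example.

```python
# Assumed VGG16 convolutional layout: (name, kernel size, stride).
VGG16_LAYERS = [
    ("conv1_1", 3, 1), ("conv1_2", 3, 1), ("pool1", 2, 2),
    ("conv2_1", 3, 1), ("conv2_2", 3, 1), ("pool2", 2, 2),
    ("conv3_1", 3, 1), ("conv3_2", 3, 1), ("conv3_3", 3, 1), ("pool3", 2, 2),
    ("conv4_1", 3, 1), ("conv4_2", 3, 1), ("conv4_3", 3, 1), ("pool4", 2, 2),
    ("conv5_1", 3, 1), ("conv5_2", 3, 1), ("conv5_3", 3, 1),
]

def receptive_fields(layers):
    """Return {layer name: receptive field width in input pixels}."""
    rf, jump = 1, 1  # jump = spacing, in input pixels, between adjacent activations
    out = {}
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) * jump
        jump *= stride              # pooling doubles the spacing for later layers
        out[name] = rf
    return out

rf = receptive_fields(VGG16_LAYERS)
for name in ["conv1_1", "conv2_1", "conv3_1", "conv4_2", "conv5_1"]:
    print(f"{name}: {rf[name]} x {rf[name]} pixels")
```

Under these assumptions a conv1_1 activation sees only a 3x3 patch, whereas a conv4_2 activation can depend on a 76x76 region, which is consistent with the edge-to-object progression described above.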
Because of this hierarchy, intermediate feature maps extracted from different layers of a CNN can be used to characterize the content and style of an image:
Content features are drawn from deeper layers, where spatial semantics are preserved
Style features are obtained by computing feature correlations (e.g., via Gram matrices) across multiple layers, capturing texture and appearance
While this interpretation is not mathematically rigorous in a strict sense, it is strongly supported by empirical results across many experiments. In practice, this layer-wise disentanglement of content and style forms the foundation for neural style transfer.
This section outlines the overall structure of the neural style transfer procedure. The method follows a well-defined sequence of steps that combine feature extraction, loss computation, and image optimization.
Step 1: Select a pretrained CNN
Step 2: Extract features from the content and style images
Content features come from deeper layers (e.g., conv4_2) that preserve spatial structure and object layout.
Style features come from several layers (e.g., conv1_1, conv2_1, conv3_1, conv4_1, conv5_1), where texture and low-level statistics are more prominent.
Step 3: Initialize the generated image
Step 4: Optimization loop
Step 5: Result
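After optimization, the resulting image $I_g$ will combine:
The spatial layout and semantic structure of the content image $I_c$
The color, texture, and brushstroke patterns of the style image $I_s$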
This synthesis is achieved by aligning intermediate representations of $I_g$ with those of both $I_c$ and $I_s$ through loss minimization.
In neural style transfer, the optimization of the generated image relies on carefully defined loss functions. These guide the synthesis toward blending content from one image with the artistic style of another. This section outlines each component of the loss and explains how they are combined and optimized.
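We use a VGG network pretrained on ImageNet as our feature extractor (VGG-19 in the original formulation; the implementation later in this lecture uses VGG16). Let:
$x$ denote an input image
$F^l(x) \in \mathbb{R}^{C_l \times H_l \times W_l}$ denote the feature map of $x$ at layer $l$, with $C_l$ channels and spatial size $H_l \times W_l$
We reshape this to $F^l(x) \in \mathbb{R}^{C_l \times N_l}$, where $N_l = H_l \times W_l$.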
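Shallow layers (e.g., relu1_1) encode fine-grained spatial detail.
Deep layers (e.g., relu4_2) encode semantic, structural content.
The content of an image is encoded in the high-level feature activations. Given a content image $I_c$ and a generated image $I_g$, the content loss measures the distance between their feature activations at layer $l$:
$$ \mathcal{L}_{\text{content}}(I_g, I_c) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij}(I_g) - F^l_{ij}(I_c) \right)^2 $$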
This loss encourages the generated image $I_g$ to preserve the semantic structure of the content image $I_c$. Importantly, it is computed in feature space, not pixel space.
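As a minimal sketch of this definition, the NumPy function below computes half the sum of squared differences between two feature maps, the same form that appears in the implementation part of this lecture. The random arrays are hypothetical stand-ins for CNN activations; in practice both inputs would come from the same layer of the frozen network.

```python
import numpy as np

def content_loss(feat_gen, feat_content):
    """0.5 * sum of squared differences between two feature maps."""
    diff = np.asarray(feat_gen, dtype=np.float64) - np.asarray(feat_content, dtype=np.float64)
    return 0.5 * np.sum(diff ** 2)

rng = np.random.default_rng(0)
# Hypothetical feature maps of shape (C_l, H_l, W_l); real ones would come from the CNN.
feat_c = rng.normal(size=(8, 4, 4))
feat_g = feat_c + 0.1                  # every activation shifted by 0.1

print(content_loss(feat_c, feat_c))    # identical features give zero loss
print(content_loss(feat_g, feat_c))    # 0.5 * 128 entries * 0.1**2, up to rounding
```

Because the comparison happens on activations rather than pixels, two images with different colors or textures can still have a small content loss as long as the chosen layer responds to them similarly.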
Style is not a spatially localized property — it manifests as texture statistics that should be position-invariant. We capture this by computing pairwise feature correlations across spatial positions.
The Gram Matrix
Given the feature map $F^l(I) \in \mathbb{R}^{C_l \times H_l \times W_l}$ at layer $l$, we first flatten the spatial dimensions to obtain $F^l(I) \in \mathbb{R}^{C_l \times N_l}$, where $N_l = H_l \times W_l$. The Gram matrix $G^l(I) \in \mathbb{R}^{C_l \times C_l}$ is then defined as:
$$ G^l_{ij}(I) = \frac{1}{N_l} \sum_{k=1}^{N_l} F^l_{ik}(I) \, F^l_{jk}(I) $$
or equivalently in matrix form:
$$ G^l(I) = \frac{1}{N_l} \, F^l(I) \left( F^l(I) \right)^T $$
Intuitively, $G^l_{ij}$ measures how much feature channels $i$ and $j$ co-activate across all spatial positions. This operation discards spatial layout entirely while preserving the co-occurrence statistics of features — precisely what characterizes texture and artistic style.
Note the normalization factor $\frac{1}{N_l}$: without it, the Gram matrix magnitude grows with image resolution, making the style loss resolution-sensitive and difficult to tune consistently.
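The two properties claimed here, invariance to spatial layout and insensitivity to resolution once the $\frac{1}{N_l}$ factor is included, can be checked numerically. The following NumPy sketch uses a small random array as a hypothetical feature map; `gram_matrix` is an illustrative helper, not the function used in the implementation part of this lecture.

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a (C, H, W) feature map, normalized by N = H * W."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)          # F in R^{C x N}
    return flat @ flat.T / (h * w)         # G in R^{C x C}

rng = np.random.default_rng(0)
feat = rng.normal(size=(3, 4, 5))          # hypothetical feature map: C=3, H=4, W=5

# 1) Shuffling spatial positions (same permutation for every channel) leaves G unchanged.
perm = rng.permutation(4 * 5)
shuffled = feat.reshape(3, -1)[:, perm].reshape(3, 4, 5)
print(np.allclose(gram_matrix(feat), gram_matrix(shuffled)))

# 2) Tiling the map to twice the width doubles N but, thanks to 1/N, does not change G.
tiled = np.concatenate([feat, feat], axis=2)
print(np.allclose(gram_matrix(feat), gram_matrix(tiled)))
```

Both checks print `True`: the Gram matrix records which channels fire together, not where they fire, and the normalization keeps its scale independent of the number of spatial positions.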
Style Loss
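The style loss at layer $l$ is then the squared Frobenius norm between the Gram matrices of the generated image $I_g$ and the style image $I_s$:
$$ \mathcal{L}_{\text{style}}^{l}(I_g, I_s) = \left\| G^l(I_g) - G^l(I_s) \right\|_F^2 $$
The total style loss aggregates contributions across a set of layers $\mathcal{L}_s$ with per-layer weights $w_l$:
$$ \mathcal{L}_{\text{style}}(I_g, I_s) = \sum_{l \in \mathcal{L}_s} w_l \, \mathcal{L}_{\text{style}}^{l}(I_g, I_s) $$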
Using multiple layers is critical: lower layers capture fine-grained texture (brushstroke-level patterns), while higher layers encode coarser compositional and color distributions. The choice of $\mathcal{L}_s$ and the weights $w_l$ therefore directly controls the granularity of the transferred style.
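A minimal sketch of the multi-layer style loss, assuming the squared Frobenius distance between normalized Gram matrices described above. The layer names, feature shapes, and weights are illustrative choices for this example only, and the random arrays stand in for real CNN activations.

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a (C, H, W) feature map, normalized by N = H * W."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)
    return flat @ flat.T / (h * w)

def style_loss(feats_gen, feats_style, layer_weights):
    """Weighted sum over layers of squared Frobenius distances between Gram matrices."""
    total = 0.0
    for name, w_l in layer_weights.items():
        diff = gram_matrix(feats_gen[name]) - gram_matrix(feats_style[name])
        total += w_l * np.sum(diff ** 2)
    return total

rng = np.random.default_rng(0)
shapes = {"conv1_1": (4, 8, 8), "conv2_1": (8, 4, 4)}   # hypothetical (C, H, W) per layer
feats_s = {k: rng.normal(size=s) for k, s in shapes.items()}
feats_g = {k: rng.normal(size=s) for k, s in shapes.items()}
weights = {"conv1_1": 0.5, "conv2_1": 0.5}

print(style_loss(feats_s, feats_s, weights))            # identical statistics: zero loss
print(style_loss(feats_g, feats_s, weights))            # different statistics: positive loss
# Setting a layer's weight to zero removes its contribution entirely.
print(style_loss(feats_g, feats_s, {"conv1_1": 1.0, "conv2_1": 0.0}))
```

Shifting weight toward shallow layers emphasizes fine texture, while shifting it toward deep layers emphasizes coarser patterns, which is the control over granularity mentioned above.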
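Optimization may introduce high-frequency noise in $I_g$. This is mitigated by the total variation (TV) loss, which encourages spatial smoothness by penalizing large intensity differences between neighboring pixels. Writing $I_g(i,j)$ for the pixel at row $i$ and column $j$, the squared (L2) form is:
$$ \mathcal{L}_{\text{TV}}(I_g) = \sum_{i,j} \left[ \left( I_g(i+1,j) - I_g(i,j) \right)^2 + \left( I_g(i,j+1) - I_g(i,j) \right)^2 \right] $$
or, in its L1 form:
$$ \mathcal{L}_{\text{TV}}(I_g) = \sum_{i,j} \left[ \left| I_g(i+1,j) - I_g(i,j) \right| + \left| I_g(i,j+1) - I_g(i,j) \right| \right] $$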
Unlike the squared (L2) variant, the L1 formulation does not overly suppress large local differences — it tolerates sharp edges and object boundaries while still discouraging fine-grained high-frequency noise. This property is desirable in style transfer, where perceptually meaningful discontinuities (e.g., outlines of objects) should be preserved in the generated image.
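The difference between the two variants is easy to see on a toy example. The sketch below, written for a single-channel image and using sums rather than means, compares a sharp edge with a gradual ramp that covers the same total intensity change. The function names `tv_l1` and `tv_l2` are illustrative.

```python
import numpy as np

def tv_l1(img):
    """Sum of absolute differences between vertically and horizontally adjacent pixels."""
    dh = np.abs(img[1:, :] - img[:-1, :]).sum()
    dw = np.abs(img[:, 1:] - img[:, :-1]).sum()
    return dh + dw

def tv_l2(img):
    """Sum of squared differences between vertically and horizontally adjacent pixels."""
    dh = ((img[1:, :] - img[:-1, :]) ** 2).sum()
    dw = ((img[:, 1:] - img[:, :-1]) ** 2).sum()
    return dh + dw

# Two 4x4 grayscale images that both rise from 0 to 9 across each row.
edge = np.tile(np.array([0.0, 0.0, 9.0, 9.0]), (4, 1))   # one sharp jump
ramp = np.tile(np.array([0.0, 3.0, 6.0, 9.0]), (4, 1))   # three small steps

print(tv_l1(edge), tv_l1(ramp))   # 36.0 36.0
print(tv_l2(edge), tv_l2(ramp))   # 324.0 108.0
```

The L1 penalty is identical for the edge and the ramp, so it has no preference for blurring a boundary, whereas the L2 penalty charges the sharp edge three times as much and therefore pushes the optimizer toward smearing it.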
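The total loss combines content, style, and total variation components:
$$ \mathcal{L}_{\text{total}} = \alpha \, \mathcal{L}_{\text{content}} + \beta \, \mathcal{L}_{\text{style}} + \gamma \, \mathcal{L}_{\text{TV}} $$
where:
$\alpha$ weights the content term
$\beta$ weights the style term
$\gamma$ weights the total variation term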
During optimization (the network itself is never trained):
Initialize $I_g$ (e.g., with white noise or a copy of $I_c$)
Keep all CNN weights fixed
Compute each loss via forward passes
Backpropagate gradients with respect to $I_g$ only
Update $I_g$ using an optimizer
Repeat until synthesis achieves a satisfactory blend of content, style, and smoothness
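The steps above can be sketched end to end on a toy problem. To keep the example small and dependency-free, a fixed random 1x1 convolution (the matrix `W`) stands in for the frozen CNN, images are flattened to shape (channels, pixels), the TV term is omitted, and gradients are written out by hand instead of using automatic differentiation. All names, sizes, and the simple step-halving rule are assumptions made for this illustration, not the lecture's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
C, N, D = 3, 16, 4                      # image channels, pixels, feature channels

W = rng.normal(size=(D, C))             # frozen toy "network": a fixed 1x1 convolution
W_frozen_copy = W.copy()
x_content = rng.uniform(size=(C, N))    # stand-in content image, flattened to (C, N)
x_style = rng.uniform(size=(C, N))      # stand-in style image

def gram(f):
    return f @ f.T / f.shape[1]

f_content = W @ x_content               # targets are computed once; they never change
g_style = gram(W @ x_style)

def loss_and_grad(x, alpha=1.0, beta=1.0):
    """Total loss and its gradient with respect to the image x only."""
    f = W @ x
    diff_c = f - f_content
    diff_s = gram(f) - g_style
    loss = alpha * 0.5 * np.sum(diff_c ** 2) + beta * np.sum(diff_s ** 2)
    grad_f = alpha * diff_c + beta * (4.0 / f.shape[1]) * diff_s @ f
    return loss, W.T @ grad_f           # chain rule back through the frozen layer

x_gen = rng.uniform(size=(C, N))        # initialize the generated image with noise
initial_loss, _ = loss_and_grad(x_gen)
lr = 0.1
for step in range(300):
    loss, grad = loss_and_grad(x_gen)
    candidate = x_gen - lr * grad
    if loss_and_grad(candidate)[0] < loss:
        x_gen = candidate               # accept the step: only the image is updated
    else:
        lr *= 0.5                       # step too large: shrink it and try again
final_loss, _ = loss_and_grad(x_gen)
print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

Note that `W` is read but never written inside the loop: the only quantity that changes is `x_gen`, which mirrors the "keep all CNN weights fixed" and "gradients with respect to $I_g$ only" steps.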
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.applications import VGG16
import cv2
h_image, w_image = 600, 1000
from google.colab import drive
drive.mount('/content/drive')
img_path = '/content/drive/MyDrive/DL/DL_data/kaist_nubzuki.jpg'
img_content = cv2.imread(img_path)
img_content = cv2.cvtColor(img_content, cv2.COLOR_BGR2RGB)
img_content = cv2.resize(img_content, (w_image, h_image))
# Visualization
plt.figure(figsize = (8, 6))
plt.imshow(img_content)
plt.axis('off')
plt.show()
img_style_path = '/content/drive/MyDrive/DL/DL_data/la_muse.jpg'
img_style = cv2.imread(img_style_path)
img_style = cv2.cvtColor(img_style, cv2.COLOR_BGR2RGB)
img_style = cv2.resize(img_style, (w_image, h_image))
plt.figure(figsize = (8, 6))
plt.imshow(img_style)
plt.axis('off')
plt.show()
Pre-trained Model (VGG16)
model = VGG16(weights='imagenet', include_top=True)
model.summary()
vgg16_weights = model.get_weights()

# get_weights() lists parameters in layer order as [kernel, bias, kernel, bias, ...],
# so the 13 convolutional layers occupy indices 0-25.
conv_names = [
    'conv1_1', 'conv1_2',
    'conv2_1', 'conv2_2',
    'conv3_1', 'conv3_2', 'conv3_3',
    'conv4_1', 'conv4_2', 'conv4_3',
    'conv5_1', 'conv5_2', 'conv5_3',
]

# kernel size: [kernel_height, kernel_width, input_ch, output_ch]
weights = {name: tf.constant(vgg16_weights[2 * i]) for i, name in enumerate(conv_names)}

# bias size: [output_ch] or [neuron_size]
biases = {name: tf.constant(vgg16_weights[2 * i + 1]) for i, name in enumerate(conv_names)}
# input layer: [1, image_height, image_width, channels]
input_style = tf.constant(img_style[np.newaxis, :, :, :], dtype=tf.float32)
input_content = tf.constant(img_content[np.newaxis, :, :, :], dtype=tf.float32)
def net(x, weights, biases):
    def conv_relu(x, w, b):
        return tf.nn.relu(tf.nn.conv2d(x, w, strides=[1,1,1,1], padding='SAME') + b)

    def maxpool(x):
        return tf.nn.max_pool2d(x, ksize=2, strides=2, padding='VALID')

    conv1_1 = conv_relu(x, weights['conv1_1'], biases['conv1_1'])
    conv1_2 = conv_relu(conv1_1, weights['conv1_2'], biases['conv1_2'])
    maxp1 = maxpool(conv1_2)

    conv2_1 = conv_relu(maxp1, weights['conv2_1'], biases['conv2_1'])
    conv2_2 = conv_relu(conv2_1, weights['conv2_2'], biases['conv2_2'])
    maxp2 = maxpool(conv2_2)

    conv3_1 = conv_relu(maxp2, weights['conv3_1'], biases['conv3_1'])
    conv3_2 = conv_relu(conv3_1, weights['conv3_2'], biases['conv3_2'])
    conv3_3 = conv_relu(conv3_2, weights['conv3_3'], biases['conv3_3'])
    maxp3 = maxpool(conv3_3)

    conv4_1 = conv_relu(maxp3, weights['conv4_1'], biases['conv4_1'])
    conv4_2 = conv_relu(conv4_1, weights['conv4_2'], biases['conv4_2'])
    conv4_3 = conv_relu(conv4_2, weights['conv4_3'], biases['conv4_3'])
    maxp4 = maxpool(conv4_3)

    conv5_1 = conv_relu(maxp4, weights['conv5_1'], biases['conv5_1'])
    conv5_2 = conv_relu(conv5_1, weights['conv5_2'], biases['conv5_2'])
    conv5_3 = conv_relu(conv5_2, weights['conv5_3'], biases['conv5_3'])
    maxp5 = maxpool(conv5_3)

    # Only the layers used for the style and content losses are returned
    return {
        'conv1_1': conv1_1,
        'conv2_1': conv2_1,
        'conv3_1': conv3_1,
        'conv4_1': conv4_1,
        'conv4_2': conv4_2,
        'conv5_1': conv5_1
    }
layers_style = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1']
layers_content = ['conv4_2']
LR = 30.0
optimizer = tf.keras.optimizers.Adam(learning_rate=LR)
Image Composition (Generation) as tf.Variable
# composite image is the only variable that needs to be updated
input_gen = tf.Variable(tf.random.uniform([1, h_image, w_image, 3], maxval=255.0))
Style Loss and Content Loss
(1) Style loss
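$$ G^l_{ij}(I) = \sum_{k} F^l_{ik}(I) \cdot F^l_{jk}(I) $$
or
$$G^{l}(I) = F^{l}(I) \; \left( F^{l}(I) \right)^T$$
(2) Content loss
$$ \mathcal{L}_{\text{content}}(I_g, I_c) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij}(I_g) - F^l_{ij}(I_c) \right)^2 $$
def get_gram_matrix(conv):
    # Flatten spatial positions: [1, H, W, C] -> [N, C] with N = H * W
    shape = tf.shape(conv)
    c = shape[-1]
    features = tf.reshape(conv, [-1, c])
    # [C, N] x [N, C] -> [C, C] channel co-activation matrix
    gram = tf.matmul(features, features, transpose_a=True)
    # Normalize by the number of feature entries (N * C) so the scale does not grow with resolution
    return gram / tf.cast(tf.size(features), tf.float32)

def get_loss_style(g1, g2):
    # Mean squared difference between two Gram matrices
    return tf.reduce_mean(tf.square(g1 - g2))

def get_loss_content(f1, f2):
    # Mean squared difference between two feature maps (a rescaled version of the formula above)
    return tf.reduce_mean(tf.square(f1 - f2))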
Composite Image
# The style and content targets do not depend on input_gen, so compute them once
f_s = net(input_style, weights, biases)
f_c = net(input_content, weights, biases)
gram_s = {l: get_gram_matrix(f_s[l]) for l in layers_style}

for epoch in range(1001):
    with tf.GradientTape() as tape:
        f_g = net(input_gen, weights, biases)

        loss_style = tf.add_n([get_loss_style(get_gram_matrix(f_g[l]), gram_s[l]) for l in layers_style])
        loss_content = tf.add_n([get_loss_content(f_g[l], f_c[l]) for l in layers_content])
        loss_total = loss_style + loss_content

    # Gradients are taken with respect to the generated image only
    grads = tape.gradient(loss_total, [input_gen])
    optimizer.apply_gradients(zip(grads, [input_gen]))

    if epoch % 250 == 0:
        print(f"Epoch: {epoch}")
        print(f"Style loss: {loss_style.numpy():.2f}")
        print(f"Content loss: {loss_content.numpy():.2f}\n")
image = tf.clip_by_value(tf.round(input_gen), 0, 255)
image_np = tf.squeeze(image).numpy().astype(np.uint8)
plt.figure(figsize = (8, 6))
plt.imshow(image_np)
plt.axis('off')
plt.show()
Style Transfer with Total Variation Loss
Sometimes, the composite images we learn have a lot of high-frequency noise, particularly bright or dark pixels.
One common noise reduction method is total variation denoising.
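def get_loss_TV(image):
    # L1 total variation of a [1, H, W, C] image: mean absolute difference
    # between horizontally and vertically adjacent pixels
    diff_w = image[:, :, 1:, :] - image[:, :, :-1, :]
    diff_h = image[:, 1:, :, :] - image[:, :-1, :, :]
    loss = tf.reduce_mean(tf.abs(diff_w)) + tf.reduce_mean(tf.abs(diff_h))
    return loss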
input_gen = tf.Variable(tf.random.uniform([1, h_image, w_image, 3], maxval=255.0))
optimizer = tf.keras.optimizers.Adam(learning_rate=LR)
# The style and content targets do not depend on input_gen, so compute them once
f_s = net(input_style, weights, biases)
f_c = net(input_content, weights, biases)
gram_s = {l: get_gram_matrix(f_s[l]) for l in layers_style}

for epoch in range(1001):
    with tf.GradientTape() as tape:
        f_g = net(input_gen, weights, biases)

        loss_style = tf.add_n([get_loss_style(get_gram_matrix(f_g[l]), gram_s[l]) for l in layers_style])
        loss_content = tf.add_n([get_loss_content(f_g[l], f_c[l]) for l in layers_content])
        loss_TV = get_loss_TV(input_gen)
        loss_total = loss_style + loss_content + 5 * loss_TV

    # Gradients are taken with respect to the generated image only
    grads = tape.gradient(loss_total, [input_gen])
    optimizer.apply_gradients(zip(grads, [input_gen]))

    if epoch % 250 == 0:
        print('Epoch: {}'.format(epoch))
        print('Style loss: {:.4f}'.format(loss_style.numpy()))
        print('Content loss: {:.4f}'.format(loss_content.numpy()))
        print('TV loss: {:.4f}\n'.format(5 * loss_TV.numpy()))
image = tf.clip_by_value(tf.round(input_gen), 0, 255)
image = tf.squeeze(image).numpy().astype(np.uint8)
plt.figure(figsize = (8, 6))
plt.imshow(image)
plt.axis('off')
plt.show()
Beyond its visual appeal, NST offers several conceptually important insights that extend well beyond the task of image stylization.
The layer-wise interpretation of a CNN is perhaps the most elegant aspect of the framework. The observation that lower layers encode texture and higher layers encode semantics was not designed into VGG — it emerged from supervised training on image classification. NST exploits this emergent structure deliberately, treating a classifier's internal representations as a perceptual measurement device. This suggests that the hierarchical organization of features in deep networks carries genuine geometric and semantic meaning, not merely discriminative utility.
The definition of content and style similarity is notably non-trivial. Content is measured as a direct feature distance in activation space, while style is measured as a distance between second-order statistics (Gram matrices) — discarding spatial layout entirely. The fact that these two very different notions of similarity, when combined, produce perceptually coherent results is a strong empirical validation of the representational structure learned by CNNs.
A subtle but important observation is that the neural network itself is never trained during style transfer. VGG is fixed; its weights are not updated. What is being optimized is the input image itself, pixel by pixel. This is a fundamentally different computational paradigm from standard deep learning, where the model is the optimization variable. Here, the model is a fixed measurement instrument and the data is the unknown.
This perspective connects NST to a broader class of problems known as inverse design — a paradigm with significant engineering relevance. Rather than asking "given a design, what is its output?", inverse design asks "given a desired output, what is the optimal input?" NST is a clean instantiation of this idea: the desired output is a perceptual specification (match this content, match this style), and the solution is found by gradient-based optimization over the input space. The same framework has since been applied to photonic structure design, drug molecule generation, acoustic metamaterials, and aerodynamic shape optimization — wherever a differentiable forward model exists and the design space is continuous.
Finally, NST can be viewed as an early example of using a neural network as a learned perceptual loss. Instead of defining image quality with hand-crafted metrics (e.g., PSNR, SSIM), the loss is computed in a deep feature space that correlates more closely with human perception. This idea later became central to image super-resolution, image-to-image translation, and generative adversarial network training.