Style Transfer

1. Neural Style Transfer

Neural Style Transfer (NST) is a technique in computer vision and image synthesis that generates a new image by combining the content of one image with the style of another. The content image defines the spatial layout and semantic structure of objects, while the style image contributes color, texture, and brushstroke patterns.



The figure above illustrates a representative example of artistic style transfer. On the left is a content image of a European riverside town (Tübingen, Germany); the inset shows the style image, Vincent van Gogh's The Starry Night (1889). On the right is the synthesized output: the architectural structure and spatial layout of the photograph are preserved, while the swirling brushstrokes, deep blue palette, and impasto texture of van Gogh's painting are faithfully reproduced.

The technique was first introduced by Gatys et al. (2015), who demonstrated that a pretrained CNN naturally disentangles content and style representations — enabling high-quality artistic style transfer through direct pixel-space optimization. Prior to this, approaches to texture transfer relied on hand-crafted features and classical texture synthesis pipelines, which lacked the semantic awareness needed for perceptually convincing results.


In this lecture, we will:

  • Understand the theoretical foundations from a representation-learning perspective
  • Derive the optimization objective rigorously
  • Implement the original Gatys et al. algorithm
  • Analyze the limitations of the optimization-based approach and survey modern feed-forward alternatives


1.1. Problem Definition

Given two input images:

  • A content image $I_c$, which provides the spatial structure, layout, and objects that should be preserved
  • A style image $I_s$, which contributes the textures, colors, and artistic patterns to be transferred

The objective is to synthesize a third image $I_g$ (the generated image) such that:

  • The structural and semantic content in $I_g$ resembles that of $I_c$
  • The visual appearance (in terms of style) of $I_g$ matches that of $I_s$

This is accomplished by optimizing a loss function that balances content and style terms. The resulting image $I_g$ is computed iteratively so that it matches the high-level features of the content image while reproducing the statistical patterns (style) extracted from the style image.


1.2. CNN as a Feature Extractor

Neural style transfer relies on the hierarchical representation capabilities of convolutional neural networks (CNNs), such as VGG-19 pretrained on the ImageNet dataset. These networks extract multi-scale features from images, which are useful for encoding both spatial content and visual style.

  • Lower (or upstream) convolutional layers tend to capture local structures, such as edges, textures, and basic color contrasts

  • Higher (or downstream) convolutional layers encode more abstract representations, including object shapes, part relationships, and spatial configurations



The figure above illustrates this hierarchy. As the input passes through successive convolutional layers, the learned filters progress from detecting simple oriented edges, to combinations of edges forming facial parts, to high-level object models — each level providing a richer semantic abstraction of the input.


Because of this hierarchy, intermediate feature maps extracted from different layers of a CNN can be used to characterize the content and style of an image:

  • Content features are drawn from deeper layers, where spatial semantics are preserved

  • Style features are obtained by computing feature correlations (e.g., via Gram matrices) across multiple layers, capturing texture and appearance


While this interpretation is not mathematically rigorous in a strict sense, it is strongly supported by empirical results across many experiments. In practice, this layer-wise disentanglement of content and style forms the foundation for neural style transfer.



1.3. Overview of the Style Transfer Procedure

This section outlines the overall structure of the neural style transfer procedure. The method follows a well-defined sequence of steps that combine feature extraction, loss computation, and image optimization.


Step 1: Select a pretrained CNN

  • Use a fixed, pretrained convolutional neural network such as VGG-19, trained on ImageNet.
  • The network serves as a feature extractor; its weights remain unchanged during the optimization process.

Step 2: Extract features from the content and style images

  • Feed the content image $I_c$ and the style image $I_s$ separately through the same pretrained network.
  • Content features are extracted from a deeper layer (e.g., conv4_2), which preserves spatial structure and object layout.
  • Style features are extracted from multiple layers spanning shallow to deep (e.g., conv1_1, conv2_1, conv3_1, conv4_1, conv5_1), so that texture statistics are captured at several scales.


Step 3: Initialize the generated image

  • Initialize the generated image $I_g$, for example with white noise or with a copy of $I_c$.
  • Unlike $I_c$ and $I_s$, the generated image $I_g$ is the only input that will be updated through gradient-based optimization.


Step 4: Optimization loop

  • Forward $I_g$ through the same pretrained network to compute its content and style features.
  • Compute the content loss between the features of $I_g$ and those of $I_c$.
  • Compute the style loss between the Gram matrices of $I_g$ and those of $I_s$ across the selected layers.
  • Backpropagate the total loss with respect to the pixels of $I_g$.
  • Update $I_g$ using an optimizer such as L-BFGS or Adam.
  • Repeat this process for a fixed number of iterations or until convergence.

Step 5: Result
After optimization, the resulting image $I_g$ will combine:

  • The structural content of $I_c$, and
  • The visual style of $I_s$

This synthesis is achieved by aligning intermediate representations of $I_g$ with those of both $I_c$ and $I_s$ through loss minimization.
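
The five steps above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the Gatys et al. pipeline: the "network" is a fixed random linear map `W` standing in for the pretrained CNN, only a feature-matching (content-like) term is used, and plain gradient descent replaces L-BFGS or Adam. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: a fixed "feature extractor" (hypothetical stand-in for a pretrained CNN)
W = rng.normal(size=(8, 16))

def features(x):
    return W @ x

# Step 2: features of the content input, computed once and never updated
x_content = rng.normal(size=16)
f_content = features(x_content)

# Step 3: initialize the generated input with noise
x_gen = rng.normal(size=16)

def loss_fn(x):
    return 0.5 * np.sum((features(x) - f_content) ** 2)

loss_start = loss_fn(x_gen)

# Step 4: gradient descent on the input; W is never modified
lr = 0.01
for _ in range(1000):
    residual = features(x_gen) - f_content
    grad = W.T @ residual          # gradient of the loss with respect to x_gen
    x_gen = x_gen - lr * grad

# Step 5: the optimized input now matches the target features much more closely
loss_end = loss_fn(x_gen)
print(loss_start, loss_end)
```

The point of the sketch is the division of roles: `W` plays the part of the frozen network, and the only quantity that changes inside the loop is the input `x_gen`.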




1.4. Loss Function in Neural Style Transfer

In neural style transfer, the optimization of the generated image relies on carefully defined loss functions. These guide the synthesis toward blending content from one image with the artistic style of another. This section outlines each component of the loss and explains how they are combined and optimized.


We use a VGG network pretrained on ImageNet as our feature extractor (VGG-19 in Gatys et al.; the implementation in Section 2 uses VGG16). Let:

  • $x$ denote an input image
  • $F^l(x) \in \mathbb{R}^{C_l \times H_l \times W_l}$ denote the feature map at layer $l$, where $C_l$ is the number of channels, and $H_l, W_l$ are the spatial dimensions

We reshape this to $F^l(x) \in \mathbb{R}^{C_l \times N_l}$, where $N_l = H_l \times W_l$.


  • Lower layers (e.g., relu1_1) encode fine-grained spatial detail.
  • Higher layers (e.g., relu4_2) encode semantic, structural content.

1.4.1. Content Representation

The content of an image is encoded in the high-level feature activations. Given a content image $I_c$ and a generated image $I_g$, the content loss measures the distance between their feature activations at layer $l$:


$$ \mathcal{L}_{\text{content}}(I_g, I_c) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij}(I_g) - F^l_{ij}(I_c) \right)^2 $$

This loss encourages the generated image $I_g$ to preserve the semantic structure of the content image $I_c$. Importantly, it is computed in feature space, not pixel space.
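
A minimal NumPy sketch of this loss and of its gradient with respect to the generated features follows. The function names are illustrative, and the small matrices are made-up values rather than real VGG activations.

```python
import numpy as np

def content_loss(F_g, F_c):
    """0.5 * sum of squared differences between two (C, N) feature matrices."""
    return 0.5 * np.sum((F_g - F_c) ** 2)

def content_loss_grad(F_g, F_c):
    """Gradient of the content loss with respect to F_g."""
    return F_g - F_c

F_c = np.array([[1.0, 2.0], [3.0, 4.0]])
F_g = np.array([[1.0, 0.0], [3.0, 5.0]])

loss = content_loss(F_g, F_c)        # 0.5 * (0 + 4 + 0 + 1) = 2.5
grad = content_loss_grad(F_g, F_c)   # [[0, -2], [0, 1]]
print(loss)
print(grad)
```

In the full method this gradient is propagated further back through the network to the pixels of $I_g$; the sketch stops at the feature level.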


1.4.2. Style Representation via Gram Matrices

Style is not a spatially localized property: it manifests as texture statistics that should be position-invariant. We capture this by computing correlations between pairs of feature channels, aggregated over all spatial positions.


The Gram Matrix

Given the feature map $F^l(I) \in \mathbb{R}^{C_l \times H_l \times W_l}$ at layer $l$, we first flatten the spatial dimensions to obtain $F^l(I) \in \mathbb{R}^{C_l \times N_l}$, where $N_l = H_l \times W_l$. The Gram matrix $G^l(I) \in \mathbb{R}^{C_l \times C_l}$ is then defined as:


$$G^l_{ij}(I) = \frac{1}{N_l} \sum_{k=1}^{N_l} F^l_{ik}(I) \cdot F^l_{jk}(I)$$

or equivalently in matrix form:


$$G^l(I) = \frac{1}{N_l} F^l(I) \left( F^l(I) \right)^T$$

Intuitively, $G^l_{ij}$ measures how much feature channels $i$ and $j$ co-activate across all spatial positions. This operation discards spatial layout entirely while preserving the co-occurrence statistics of features — precisely what characterizes texture and artistic style.

Note the normalization factor $\frac{1}{N_l}$: without it, the Gram matrix magnitude grows with image resolution, making the style loss resolution-sensitive and difficult to tune consistently.
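
The two properties claimed above (spatial layout is discarded, and the $\frac{1}{N_l}$ factor removes the dependence on resolution) can be checked numerically. The sketch below uses random arrays in place of real feature maps, and `gram_matrix` is an illustrative helper, not the notebook's function.

```python
import numpy as np

def gram_matrix(feature_map):
    """Normalized Gram matrix of a (C, H, W) feature map."""
    C, H, W = feature_map.shape
    F = feature_map.reshape(C, H * W)      # flatten spatial dims: (C, N)
    return F @ F.T / (H * W)

rng = np.random.default_rng(0)
fmap = rng.normal(size=(4, 6, 5))          # made-up activations, C=4, H=6, W=5
G = gram_matrix(fmap)

# Shuffle the spatial positions identically in every channel
perm = rng.permutation(6 * 5)
shuffled = fmap.reshape(4, -1)[:, perm].reshape(4, 6, 5)
G_shuffled = gram_matrix(shuffled)

# Stack the map on top of itself: same statistics, twice as many positions
tiled = np.concatenate([fmap, fmap], axis=1)
G_tiled = gram_matrix(tiled)

print(np.allclose(G, G_shuffled))   # layout does not matter
print(np.allclose(G, G_tiled))      # resolution does not matter after normalization
```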


Style Loss

The style loss at layer $l$ is then the squared Frobenius norm of the difference between the Gram matrices of the generated image $I_g$ and the style image $I_s$. Because $G^l$ is already normalized by $N_l$, only the channel count remains in the prefactor:


$$\mathcal{L}_{\text{style}}^l(I_g, I_s) = \frac{1}{4 C_l^2} \left\| G^l(I_g) - G^l(I_s) \right\|^2_F$$

The total style loss aggregates contributions across a set of layers $\mathcal{L}_s$ with per-layer weights $w_l$:


$$\mathcal{L}_{\text{style}}(I_g, I_s) = \sum_{l \in \mathcal{L}_s} w_l \, \mathcal{L}_{\text{style}}^l(I_g, I_s)$$

Using multiple layers is critical: lower layers capture fine-grained texture (brushstroke-level patterns), while higher layers encode coarser compositional and color distributions. The choice of $\mathcal{L}_s$ and the weights $w_l$ therefore directly controls the granularity of the transferred style.
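
The aggregation over layers can be sketched as follows. The per-layer prefactor is passed in as `scale` because normalization conventions differ between implementations; the value $1/(4 C_l^2)$ used in the example is one possible choice, and the layer names, weights, and random features are placeholders.

```python
import numpy as np

def gram_matrix(F):
    """Normalized Gram matrix of a (C, N) feature matrix."""
    return F @ F.T / F.shape[1]

def layer_style_loss(F_g, F_s, scale):
    """Scaled squared Frobenius distance between two Gram matrices."""
    diff = gram_matrix(F_g) - gram_matrix(F_s)
    return scale * np.sum(diff ** 2)

def style_loss(feats_g, feats_s, layer_weights, scales):
    """Weighted sum of per-layer style losses over the layers in layer_weights."""
    return sum(
        w * layer_style_loss(feats_g[name], feats_s[name], scales[name])
        for name, w in layer_weights.items()
    )

rng = np.random.default_rng(0)
feats_s = {"conv1_1": rng.normal(size=(3, 20)), "conv2_1": rng.normal(size=(4, 10))}
feats_g = {name: rng.normal(size=F.shape) for name, F in feats_s.items()}
layer_weights = {"conv1_1": 0.5, "conv2_1": 0.5}
scales = {name: 1.0 / (4 * F.shape[0] ** 2) for name, F in feats_s.items()}

loss = style_loss(feats_g, feats_s, layer_weights, scales)
loss_same = style_loss(feats_s, feats_s, layer_weights, scales)   # identical features
print(loss, loss_same)
```

Shifting weight toward `conv1_1` in `layer_weights` would emphasize fine texture, while shifting it toward deeper layers would emphasize coarser patterns.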


1.4.3. Total Variation Loss

Optimization may introduce high-frequency noise in $I_g$. This is mitigated by the total variation (TV) loss, which encourages spatial smoothness by penalizing large intensity differences between neighboring pixels:


$$ \mathcal{L}_{\text{TV}}(I_g) = \sum_{i,j} \left( \lvert I_g[i,j+1] - I_g[i,j] \rvert + \lvert I_g[i+1,j] - I_g[i,j]\rvert \right) $$

or, writing $x_{i,j} = I_g[i,j]$:


$$\sum_{i,j} \left( \left| x_{i,j+1}- x_{i,j}\right| + \left| x_{i+1,j} - x_{i,j}\right| \right)$$

Unlike the squared (L2) variant, the L1 formulation does not overly suppress large local differences — it tolerates sharp edges and object boundaries while still discouraging fine-grained high-frequency noise. This property is desirable in style transfer, where perceptually meaningful discontinuities (e.g., outlines of objects) should be preserved in the generated image.
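
The difference between the two variants shows up on a one-row example: a sharp edge and a gradual ramp that rise by the same total amount. The helpers `tv_l1` and `tv_l2` are illustrative names, and the signals are made up for the comparison.

```python
import numpy as np

def tv_l1(img):
    """Anisotropic L1 total variation of a 2D image."""
    dh = np.abs(img[:, 1:] - img[:, :-1]).sum()   # horizontal neighbors
    dv = np.abs(img[1:, :] - img[:-1, :]).sum()   # vertical neighbors
    return dh + dv

def tv_l2(img):
    """Squared (L2) variant of the same penalty."""
    dh = ((img[:, 1:] - img[:, :-1]) ** 2).sum()
    dv = ((img[1:, :] - img[:-1, :]) ** 2).sum()
    return dh + dv

# One row of 11 pixels: a single jump versus ten small steps of 0.1
edge = np.array([[0.0] * 5 + [1.0] * 6])
ramp = np.linspace(0.0, 1.0, 11)[np.newaxis, :]

print(tv_l1(edge), tv_l1(ramp))   # both are (approximately) 1.0
print(tv_l2(edge), tv_l2(ramp))   # 1.0 versus approximately 0.1
```

The L1 penalty charges the edge and the ramp the same amount, so it has no preference for blurring the edge. The squared penalty charges the edge ten times more than the ramp, which pushes the optimizer toward smeared boundaries.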



1.4.4. Combined Loss and Optimization

The total loss combines content, style, and total variation components:


$$ \mathcal{L}_{\text{total}} = \alpha \cdot \mathcal{L}_{\text{content}} + \beta \cdot \mathcal{L}_{\text{style}} + \gamma \cdot \mathcal{L}_{\text{TV}} $$

where:

  • $\alpha$: weight for preserving content
  • $\beta$: weight for applying style
  • $\gamma$: weight for smoothing artifacts

During optimization:

  1. Initialize $I_g$ (e.g., with white noise or a copy of $I_c$)

  2. Keep all CNN weights fixed

  3. Compute each loss via forward passes

  4. Backpropagate gradients with respect to $I_g$ only

  5. Update $I_g$ using an optimizer

  6. Repeat until synthesis achieves a satisfactory blend of content, style, and smoothness
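
The effect of the weights can be seen on a toy 1D problem. The sketch below is an assumption-laden simplification: it keeps only the content and TV terms (no style term, no network), measures content directly on the signal values, and uses plain subgradient descent. The signal and all names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D "content": a step edge corrupted by noise (made-up data)
content = np.concatenate([np.zeros(25), np.ones(25)]) + 0.2 * rng.normal(size=50)

def tv(x):
    return np.abs(np.diff(x)).sum()

def tv_subgrad(x):
    s = np.sign(np.diff(x))      # sign of x[k+1] - x[k]
    g = np.zeros_like(x)
    g[1:] += s                   # contribution to the derivative w.r.t. x[k+1]
    g[:-1] -= s                  # contribution to the derivative w.r.t. x[k]
    return g

def optimize(alpha, gamma, steps=500, lr=0.01):
    x = content.copy()           # initialize with a copy of the content
    for _ in range(steps):
        grad = alpha * (x - content) + gamma * tv_subgrad(x)
        x -= lr * grad           # only the signal is updated
    return x

x_plain = optimize(alpha=1.0, gamma=0.0)    # no smoothing: stays at the content
x_smooth = optimize(alpha=1.0, gamma=0.5)   # TV term suppresses the noise

print(tv(x_plain), tv(x_smooth))
```

Raising `gamma` relative to `alpha` trades fidelity to the content for smoothness, which is the same balance that $\alpha$, $\beta$, and $\gamma$ control in the full loss.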





2. Style Transfer in Python


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.applications import VGG16
import cv2

Content Image

In [ ]:
h_image, w_image = 600, 1000
In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
img_path = '/content/drive/MyDrive/DL/DL_data/kaist_nubzuki.jpg'
img_content = cv2.imread(img_path)
img_content = cv2.cvtColor(img_content, cv2.COLOR_BGR2RGB)
img_content = cv2.resize(img_content, (w_image, h_image))

# Visualize the content image
plt.figure(figsize = (8, 6))
plt.imshow(img_content)
plt.axis('off')
plt.show()

Style Image

In [ ]:
img_style_path = '/content/drive/MyDrive/DL/DL_data/la_muse.jpg'

img_style = cv2.imread(img_style_path)
img_style = cv2.cvtColor(img_style, cv2.COLOR_BGR2RGB)
img_style = cv2.resize(img_style, (w_image, h_image))

plt.figure(figsize = (8, 6))
plt.imshow(img_style)
plt.axis('off')
plt.show()

Pre-trained Model (VGG16)

In [ ]:
model = VGG16(weights='imagenet', include_top=True)
model.summary()
Model: "vgg16"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer (InputLayer)        │ (None, 224, 224, 3)    │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ block1_conv1 (Conv2D)           │ (None, 224, 224, 64)   │         1,792 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ block1_conv2 (Conv2D)           │ (None, 224, 224, 64)   │        36,928 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ block1_pool (MaxPooling2D)      │ (None, 112, 112, 64)   │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ block2_conv1 (Conv2D)           │ (None, 112, 112, 128)  │        73,856 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ block2_conv2 (Conv2D)           │ (None, 112, 112, 128)  │       147,584 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ block2_pool (MaxPooling2D)      │ (None, 56, 56, 128)    │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ block3_conv1 (Conv2D)           │ (None, 56, 56, 256)    │       295,168 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ block3_conv2 (Conv2D)           │ (None, 56, 56, 256)    │       590,080 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ block3_conv3 (Conv2D)           │ (None, 56, 56, 256)    │       590,080 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ block3_pool (MaxPooling2D)      │ (None, 28, 28, 256)    │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ block4_conv1 (Conv2D)           │ (None, 28, 28, 512)    │     1,180,160 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ block4_conv2 (Conv2D)           │ (None, 28, 28, 512)    │     2,359,808 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ block4_conv3 (Conv2D)           │ (None, 28, 28, 512)    │     2,359,808 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ block4_pool (MaxPooling2D)      │ (None, 14, 14, 512)    │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ block5_conv1 (Conv2D)           │ (None, 14, 14, 512)    │     2,359,808 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ block5_conv2 (Conv2D)           │ (None, 14, 14, 512)    │     2,359,808 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ block5_conv3 (Conv2D)           │ (None, 14, 14, 512)    │     2,359,808 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ block5_pool (MaxPooling2D)      │ (None, 7, 7, 512)      │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ flatten (Flatten)               │ (None, 25088)          │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ fc1 (Dense)                     │ (None, 4096)           │   102,764,544 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ fc2 (Dense)                     │ (None, 4096)           │    16,781,312 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ predictions (Dense)             │ (None, 1000)           │     4,097,000 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 138,357,544 (527.79 MB)
 Trainable params: 138,357,544 (527.79 MB)
 Non-trainable params: 0 (0.00 B)
In [ ]:
vgg16_weights = model.get_weights()

# kernel size: [kernel_height, kernel_width, input_ch, output_ch]
weights = {
    'conv1_1': tf.constant(vgg16_weights[0]),
    'conv1_2': tf.constant(vgg16_weights[2]),
    'conv2_1': tf.constant(vgg16_weights[4]),
    'conv2_2': tf.constant(vgg16_weights[6]),
    'conv3_1': tf.constant(vgg16_weights[8]),
    'conv3_2': tf.constant(vgg16_weights[10]),
    'conv3_3': tf.constant(vgg16_weights[12]),
    'conv4_1': tf.constant(vgg16_weights[14]),
    'conv4_2': tf.constant(vgg16_weights[16]),
    'conv4_3': tf.constant(vgg16_weights[18]),
    'conv5_1': tf.constant(vgg16_weights[20]),
    'conv5_2': tf.constant(vgg16_weights[22]),
    'conv5_3': tf.constant(vgg16_weights[24]),
}

# bias size: [output_ch] or [neuron_size]
biases = {
    'conv1_1': tf.constant(vgg16_weights[1]),
    'conv1_2': tf.constant(vgg16_weights[3]),
    'conv2_1': tf.constant(vgg16_weights[5]),
    'conv2_2': tf.constant(vgg16_weights[7]),
    'conv3_1': tf.constant(vgg16_weights[9]),
    'conv3_2': tf.constant(vgg16_weights[11]),
    'conv3_3': tf.constant(vgg16_weights[13]),
    'conv4_1': tf.constant(vgg16_weights[15]),
    'conv4_2': tf.constant(vgg16_weights[17]),
    'conv4_3': tf.constant(vgg16_weights[19]),
    'conv5_1': tf.constant(vgg16_weights[21]),
    'conv5_2': tf.constant(vgg16_weights[23]),
    'conv5_3': tf.constant(vgg16_weights[25]),
}
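
The indices in the two dictionaries follow a regular pattern: `get_weights()` lists each layer's kernel followed by its bias, in network order. A short pure-Python sketch (the variable names are illustrative) that reproduces the same index layout:

```python
# Number of convolutional layers in each of the five VGG16 blocks
convs_per_block = [2, 2, 3, 3, 3]

layer_names = [
    f"conv{block}_{i}"
    for block, n in enumerate(convs_per_block, start=1)
    for i in range(1, n + 1)
]

# Kernel at even positions, bias at the following odd position
kernel_index = {name: 2 * k for k, name in enumerate(layer_names)}
bias_index = {name: 2 * k + 1 for k, name in enumerate(layer_names)}

print(layer_names)
print(kernel_index["conv4_2"], bias_index["conv4_2"])   # 16 17
```

With these maps, the dictionaries above could be built in a loop instead of being written out by hand; the explicit form is kept in the notebook because it makes the layer-to-index mapping visible.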
In [ ]:
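# input layer: [1, image_height, image_width, channels]
# note: raw RGB values in [0, 255] are fed to the network here; the Keras
# VGG16 preprocessing step (preprocess_input) is not applied in this notebook
input_style = tf.constant(img_style[np.newaxis, :, :, :], dtype=tf.float32)
input_content = tf.constant(img_content[np.newaxis, :, :, :], dtype=tf.float32)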
In [ ]:
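def net(x, weights, biases):
    # Forward pass through the VGG16 convolutional blocks with frozen weights;
    # returns the activations used for the style and content losses.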
    def conv_relu(x, w, b):
        return tf.nn.relu(tf.nn.conv2d(x, w, strides=[1,1,1,1], padding='SAME') + b)

    def maxpool(x):
        return tf.nn.max_pool2d(x, ksize=2, strides=2, padding='VALID')

    conv1_1 = conv_relu(x, weights['conv1_1'], biases['conv1_1'])
    conv1_2 = conv_relu(conv1_1, weights['conv1_2'], biases['conv1_2'])
    maxp1 = maxpool(conv1_2)

    conv2_1 = conv_relu(maxp1, weights['conv2_1'], biases['conv2_1'])
    conv2_2 = conv_relu(conv2_1, weights['conv2_2'], biases['conv2_2'])
    maxp2 = maxpool(conv2_2)

    conv3_1 = conv_relu(maxp2, weights['conv3_1'], biases['conv3_1'])
    conv3_2 = conv_relu(conv3_1, weights['conv3_2'], biases['conv3_2'])
    conv3_3 = conv_relu(conv3_2, weights['conv3_3'], biases['conv3_3'])
    maxp3 = maxpool(conv3_3)

    conv4_1 = conv_relu(maxp3, weights['conv4_1'], biases['conv4_1'])
    conv4_2 = conv_relu(conv4_1, weights['conv4_2'], biases['conv4_2'])
    conv4_3 = conv_relu(conv4_2, weights['conv4_3'], biases['conv4_3'])
    maxp4 = maxpool(conv4_3)

    conv5_1 = conv_relu(maxp4, weights['conv5_1'], biases['conv5_1'])
    conv5_2 = conv_relu(conv5_1, weights['conv5_2'], biases['conv5_2'])
    conv5_3 = conv_relu(conv5_2, weights['conv5_3'], biases['conv5_3'])
    maxp5 = maxpool(conv5_3)

    return {
        'conv1_1': conv1_1,
        'conv2_1': conv2_1,
        'conv3_1': conv3_1,
        'conv4_1': conv4_1,
        'conv4_2': conv4_2,
        'conv5_1': conv5_1
    }
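
Because the network is applied to 600 × 1000 images rather than the 224 × 224 input shown in the model summary, the feature maps have different spatial sizes. The sketch below tracks them, assuming 'SAME' convolutions (size preserved) and 2 × 2 'VALID' max pooling with stride 2 as in `net`; `feature_shapes` is a hypothetical helper.

```python
def feature_shapes(h, w):
    """Spatial size and channel count of the first conv output in each VGG16 block."""
    channels = [64, 128, 256, 512, 512]
    shapes = {}
    for block, c in enumerate(channels, start=1):
        shapes[f"conv{block}_1"] = (h, w, c)
        # 2x2 max pooling with stride 2 and 'VALID' padding floors odd sizes
        h, w = h // 2, w // 2
    return shapes

shapes = feature_shapes(600, 1000)
for name, shape in shapes.items():
    print(name, shape)
```

For the images used here, `conv5_1` has spatial size 37 × 62, and `conv4_2` has the same shape as `conv4_1`. The Gram matrices depend only on the channel count, so they stay 64 × 64 up to 512 × 512 regardless of the image size.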
In [ ]:
layers_style = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1']
layers_content = ['conv4_2']
LR = 30.0
optimizer = tf.keras.optimizers.Adam(learning_rate=LR)
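
A learning rate of 30 looks large next to typical values for training network weights. One way to read it (an interpretation, not something the notebook states) is that the variable being optimized holds pixel values in the range 0 to 255, and the first Adam update has a magnitude close to the learning rate regardless of the gradient scale. The sketch below uses the textbook Adam formulas, not the Keras internals.

```python
import numpy as np

def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-7):
    """Size of the first Adam update for a given gradient (textbook formulation)."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)      # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (np.sqrt(v_hat) + eps)

grads = np.array([1e-3, 1.0, 1e3, -5.0])
steps = adam_first_step(grads, lr=30.0)
print(steps)   # each entry has magnitude close to 30, with the sign of the gradient
```

Under this reading, each early iteration moves a pixel by roughly 30 intensity levels, which is a sizeable but not unreasonable step on a 0 to 255 scale.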

Image Composition (Generation) as tf.Variable

In [ ]:
# composite image is the only variable that needs to be updated
input_gen = tf.Variable(tf.random.uniform([1, h_image, w_image, 3], maxval=255.0))

Style Loss and Content Loss

(1) Style loss

$$ G^l_{ij}(I) = \frac{1}{N_l} \sum_{k=1}^{N_l} F^l_{ik}(I) \cdot F^l_{jk}(I) $$

or

$$G^{l}(I) = \frac{1}{N_l} F^{l}(I) \; \left( F^{l}(I) \right)^T$$

(2) Content loss

$$ \mathcal{L}_{\text{content}}(I_g, I_c) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij}(I_g) - F^l_{ij}(I_c) \right)^2 $$
In [ ]:
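def get_gram_matrix(conv):
    # conv: [1, H, W, C] activations; returns the [C, C] Gram matrix
    shape = tf.shape(conv)
    c = shape[-1]
    features = tf.reshape(conv, [-1, c])                      # [N, C], N = H * W
    gram = tf.matmul(features, features, transpose_a=True)   # channel-by-channel sums over positions
    # tf.size(features) = N * C, so this divides by N * C rather than by N alone
    return gram / tf.cast(tf.size(features), tf.float32)

def get_loss_style(g1, g2):
    # mean squared difference between two Gram matrices
    return tf.reduce_mean(tf.square(g1 - g2))

def get_loss_content(f1, f2):
    # mean (rather than 0.5 * sum) of squared feature differences;
    # this differs from the formula only by a constant factor
    return tf.reduce_mean(tf.square(f1 - f2))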
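
The relation between `get_gram_matrix` and the $(C_l \times N_l)$ notation of Section 1.4.2 can be checked with a NumPy mirror of the function. The array below is random and only stands in for a real activation tensor; `gram_channels_last` is an illustrative name.

```python
import numpy as np

def gram_channels_last(conv):
    """NumPy mirror of get_gram_matrix for a (1, H, W, C) array."""
    c = conv.shape[-1]
    features = conv.reshape(-1, c)       # (N, C), N = H * W
    gram = features.T @ features         # (C, C)
    return gram / features.size          # divide by N * C

rng = np.random.default_rng(0)
conv = rng.normal(size=(1, 3, 4, 5))     # H=3, W=4, C=5

G_code = gram_channels_last(conv)

# Same quantity in the (C, N) notation of Section 1.4.2
F = conv.reshape(-1, 5).T                # (C, N)
C, N = F.shape
G_text = F @ F.T / N

print(np.allclose(G_code, G_text / C))   # the code's Gram is the text's Gram divided by C
```

The extra division by the channel count only rescales the style loss of each layer, so it acts like a fixed choice of the per-layer weights $w_l$.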

Composite Image

In [ ]:
# Style and content features do not depend on input_gen, so compute them once
f_s = net(input_style, weights, biases)
f_c = net(input_content, weights, biases)

for epoch in range(1001):
    with tf.GradientTape() as tape:
        f_g = net(input_gen, weights, biases)
        loss_style = tf.add_n([get_loss_style(get_gram_matrix(f_g[l]), get_gram_matrix(f_s[l])) for l in layers_style])
        loss_content = tf.add_n([get_loss_content(f_g[l], f_c[l]) for l in layers_content])
        loss_total = loss_style + loss_content

    grads = tape.gradient(loss_total, [input_gen])
    optimizer.apply_gradients(zip(grads, [input_gen]))

    if epoch % 250 == 0:
        print(f"Epoch: {epoch}")
        print(f"Style loss: {loss_style.numpy():.2f}")
        print(f"Content loss: {loss_content.numpy():.2f}\n")

        image = tf.clip_by_value(tf.round(input_gen), 0, 255)
        image_np = tf.squeeze(image).numpy().astype(np.uint8)
        plt.figure(figsize = (8, 6))
        plt.imshow(image_np)
        plt.axis('off')
        plt.show()
Epoch: 0
Style loss: 2617855.25
Content loss: 63957.09

Epoch: 250
Style loss: 4217.70
Content loss: 11564.44

Epoch: 500
Style loss: 3826.63
Content loss: 8350.68

Epoch: 750
Style loss: 3933.34
Content loss: 7738.49

Epoch: 1000
Style loss: 4603.01
Content loss: 7517.92


Style Transfer with Total Variation Loss

  • The synthesized composite image often contains high-frequency noise, visible as isolated bright or dark pixels.

  • A common way to reduce this noise is total variation denoising, which penalizes differences between neighboring pixels.


$$\sum_{i,j} \; \left| x_{i,j+1}- x_{i,j}\right| + \left| x_{i+1,j} - x_{i,j}\right|$$
In [ ]:
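def get_loss_TV(image):
    # image: [1, H, W, C]; mean absolute difference between neighboring pixels
    diff_w = tf.abs(image[:, :, 1:, :] - image[:, :, :-1, :])   # horizontal neighbors
    diff_h = tf.abs(image[:, 1:, :, :] - image[:, :-1, :, :])   # vertical neighbors
    return tf.reduce_mean(diff_w) + tf.reduce_mean(diff_h)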
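
Note that `get_loss_TV` averages the absolute differences, whereas the formula sums them. A NumPy mirror (with made-up random images and illustrative function names) shows what this changes: the sum grows with the number of pixels, while the mean stays on the same scale.

```python
import numpy as np

def tv_mean(img):
    """NumPy mirror of get_loss_TV for a (1, H, W, C) array."""
    dw = np.abs(img[:, :, 1:, :] - img[:, :, :-1, :])   # horizontal neighbors
    dh = np.abs(img[:, 1:, :, :] - img[:, :-1, :, :])   # vertical neighbors
    return dw.mean() + dh.mean()

def tv_sum(img):
    """Sum-based total variation, as in the formula."""
    dw = np.abs(img[:, :, 1:, :] - img[:, :, :-1, :])
    dh = np.abs(img[:, 1:, :, :] - img[:, :-1, :, :])
    return dw.sum() + dh.sum()

rng = np.random.default_rng(0)
small = rng.uniform(0, 255, size=(1, 8, 8, 3))
large = np.tile(small, (1, 4, 4, 1))     # same pattern, 16 times as many pixels

print(tv_sum(small), tv_sum(large))      # grows with image size
print(tv_mean(small), tv_mean(large))    # stays on a similar scale
```

A plausible reading is that the mean keeps the TV weight (5 in the cell below) meaningful across image sizes, in the same spirit as the $\frac{1}{N_l}$ normalization of the Gram matrix.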
In [ ]:
input_gen = tf.Variable(tf.random.uniform([1, h_image, w_image, 3], maxval=255.0))
optimizer = tf.keras.optimizers.Adam(learning_rate=LR)

# Style and content features do not depend on input_gen, so compute them once
f_s = net(input_style, weights, biases)
f_c = net(input_content, weights, biases)

for epoch in range(1001):
    with tf.GradientTape() as tape:
        f_g = net(input_gen, weights, biases)

        loss_style = tf.add_n([get_loss_style(get_gram_matrix(f_g[l]), get_gram_matrix(f_s[l])) for l in layers_style])
        loss_content = tf.add_n([get_loss_content(f_g[l], f_c[l]) for l in layers_content])
        loss_TV = get_loss_TV(input_gen)
        loss_total = loss_style + loss_content + 5 * loss_TV

    grads = tape.gradient(loss_total, [input_gen])
    optimizer.apply_gradients(zip(grads, [input_gen]))

    if epoch % 250 == 0:
        print('Epoch: {}'.format(epoch))
        print('Style loss: {:.4f}'.format(loss_style.numpy()))
        print('Content loss: {:.4f}'.format(loss_content.numpy()))
        print('TV loss: {:.4f}\n'.format(5 * loss_TV.numpy()))

        image = tf.clip_by_value(tf.round(input_gen), 0, 255)
        image = tf.squeeze(image).numpy().astype(np.uint8)
        plt.figure(figsize = (8, 6))
        plt.imshow(image)
        plt.axis('off')
        plt.show()
Epoch: 0
Style loss: 2609827.7500
Content loss: 63995.9844
TV loss: 850.0329

Epoch: 250
Style loss: 4189.2368
Content loss: 11473.0176
TV loss: 737.1957

Epoch: 500
Style loss: 3646.8010
Content loss: 8222.9346
TV loss: 648.4105

Epoch: 750
Style loss: 3643.7043
Content loss: 7462.5835
TV loss: 551.3561

Epoch: 1000
Style loss: 4379.9604
Content loss: 7669.1387
TV loss: 462.1234



3. Discussion: Why Neural Style Transfer is Intellectually Interesting

Beyond its visual appeal, NST offers several conceptually important insights that extend well beyond the task of image stylization.


The layer-wise interpretation of a CNN is perhaps the most elegant aspect of the framework. The observation that lower layers encode texture and higher layers encode semantics was not designed into VGG — it emerged from supervised training on image classification. NST exploits this emergent structure deliberately, treating a classifier's internal representations as a perceptual measurement device. This suggests that the hierarchical organization of features in deep networks carries genuine geometric and semantic meaning, not merely discriminative utility.


The definition of content and style similarity is notably non-trivial. Content is measured as a direct feature distance in activation space, while style is measured as a distance between second-order statistics (Gram matrices) — discarding spatial layout entirely. The fact that these two very different notions of similarity, when combined, produce perceptually coherent results is a strong empirical validation of the representational structure learned by CNNs.


A subtle but important observation is that the neural network itself is never trained during style transfer. VGG is fixed; its weights are not updated. What is being optimized is the input image itself, pixel by pixel. This is a fundamentally different computational paradigm from standard deep learning, where the model is the optimization variable. Here, the model is a fixed measurement instrument and the data is the unknown.


This perspective connects NST to a broader class of problems known as inverse design — a paradigm with significant engineering relevance. Rather than asking "given a design, what is its output?", inverse design asks "given a desired output, what is the optimal input?" NST is a clean instantiation of this idea: the desired output is a perceptual specification (match this content, match this style), and the solution is found by gradient-based optimization over the input space. The same framework has since been applied to photonic structure design, drug molecule generation, acoustic metamaterials, and aerodynamic shape optimization — wherever a differentiable forward model exists and the design space is continuous.


Finally, NST can be viewed as an early example of using a neural network as a learned perceptual loss. Instead of defining image quality with hand-crafted metrics (e.g., PSNR, SSIM), the loss is computed in a deep feature space that correlates more closely with human perception. This idea later became central to image super-resolution, image-to-image translation, and generative adversarial network training.

