Recurrent Neural Networks (RNN)

1. Time Series Data

1.1. What We've Covered So Far

We've explored regression, classification, clustering, and dimensionality reduction - all based on snapshot-type data. In other words, the data we've worked with so far assumes that each sample is independent of the others, with no temporal structure or sequential dependency.

For example, in image classification tasks - such as determining whether an image depicts a cat or a dog - the order in which the images are presented to the model does not influence its prediction. Each input is treated as a standalone instance, with no relationship to the images before or after it.


Sequence Matters in Time Series

Unlike image classification where each input is independent, in time series data, the order of data points is critical. For example, if the position of a ball is recorded at constant time intervals, we can use the past sequence of positions to predict where the ball will be next.

As illustrated in the image, the ball moves along a path, and its future position depends on its past trajectory. If we ignore the sequential nature of the data, we lose the ability to model motion, trends, or dynamic behavior. In this context, sequence carries essential information that cannot be overlooked.



However, up to this point, we haven't introduced any neural network tools that are capable of handling sequential information. Before we dive into how to model this, let's first take a closer look at the nature of time series data.


1.2. Time-Series Data

A sequence is an ordered set of observations where the position or order of elements carries essential meaning. Common examples include:

  • Natural language (e.g., sentences)

  • Medical signals (e.g., ECG, EEG)

  • Audio data (e.g., speech waveforms)

  • Vibration or sensor measurements in manufacturing


Definition of time-series

A time series is formally defined as a function


$$x: T \rightarrow \mathbb{R}^n$$

Where

$$T=\{\cdots, t_{-2},t_{-1},t_0,t_1,t_2, \cdots \}$$


is an ordered set of time indices, and $\mathbb{R}^n$ denotes an $n$-dimensional real-valued vector space. In other words, a time series maps each point in time to an $n$-dimensional vector of observations.
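To make this concrete, a time series can be stored as an array of shape $(T, n)$, where row $t$ holds the observation vector at time $t$. Below is a minimal NumPy sketch; the three channels are made up purely for illustration.

import numpy as np

# A minimal sketch: a 3-dimensional time series with 1000 time indices, stored
# as an array of shape (T, n) so that row t is the observation vector x(t).
T, n = 1000, 3
t = np.arange(T)

x = np.stack([
    np.sin(0.02 * t),             # channel 1: slow oscillation
    np.cos(0.05 * t),             # channel 2: faster oscillation
    0.1 * np.random.randn(T),     # channel 3: measurement noise
], axis = 1)

print(x.shape)    # (1000, 3) -> x[t] is the n-dimensional observation at time t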


Sequence Modeling

In the real world, most data is sequential in nature. Sequence modeling focuses on:

  • using past and present observations to predict the future, and

  • capturing how past events influence future outcomes.

The goal is to learn the structure and potentially hierarchical patterns within the sequence.




2. Recurrent Neural Network (RNN)

2.1. Feedforward Network and Sequential Data

Suppose there are five data points given in a specific order.

In the first attempt,

  • we use an artificial neural network (ANN),
  • but the sequence or order information is not incorporated.


Now, consider the case where we have more than five historical data points:

  • the lack of sequential modeling becomes even more critical.


For improved visualization, the same structure is illustrated below using a 3D representation.



2.2. Representation Shortcut

Simplified 2D Representation

Each box in the diagram actually represents an entire layer consisting of many neurons.

  • For simplicity, we represent each layer as a single box.

  • The input at each time step is a vector.

  • For example, $X_t$ itself is not a single value, but a collection of values from the last five time steps (e.g., $x_{t-4}, x_{t-3}, x_{t-2}, x_{t-1}, x_{t}$)

  • Each hidden layer contains many neurons.

  • The output layer may also consist of multiple neurons.

  • In reality, each box corresponds to a full layer with many units.




Sliding Predictor

Suppose we focus on the last four time steps (or observations). At each time step, the model uses a sliding window that captures the most recent observations (e.g., $X_{t-3}$, $X_{t-2}$, $X_{t-1}$, $X_t$) to predict the target $Y_t$.


Note that although the term "time step" is used, each input vector at a given time step consists of multiple values (i.e., it is not a single value but a multi-dimensional vector).



Finite-Response Model


$$Y_t = f(X_t, X_{t-1}, \cdots, X_{t-N})$$


where $N$ denotes the width of the system, i.e., the number of past time steps that influence the current output.


The sliding predictor can thus be interpreted as a finite-response model.
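As a quick illustration, the sliding window can be built directly in NumPy. The sketch below (with a hypothetical helper name and a synthetic sine signal) stacks the current and previous $N$ observations into each input and, as one common choice, uses the next observation as the target.

import numpy as np

# A minimal sliding-window sketch (hypothetical helper): each input stacks the
# current and the previous N observations, and the target is the next value.
def make_finite_response_dataset(x, N):
    X, Y = [], []
    for t in range(N, len(x) - 1):
        X.append(x[t - N:t + 1])    # (X_{t-N}, ..., X_t)
        Y.append(x[t + 1])          # target to predict
    return np.array(X), np.array(Y)

x = np.sin(0.1 * np.arange(500))                 # synthetic signal
X, Y = make_finite_response_dataset(x, N = 10)
print(X.shape, Y.shape)                          # (489, 11) (489,)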


Increasing the "History" Length Toward Infinite-Response Systems

In many real-world systems, the influence of past inputs does not vanish after a fixed number of steps. Instead, the system exhibits an infinite-response behavior, where the effects of earlier inputs persist, although often with progressively weaker influence over time.

As we increase the number of past time steps (the "history") used as input, the network complexity grows significantly.


$$Y_t = f(X_t, X_{t-1}, \cdots, X_{t-\infty})$$


This results in higher computational costs and a greater risk of overfitting.



Implementing a true infinite-response system is not feasible, as it would require storing and processing an unbounded history of data.


Any ideas?


$$Y_t = f(X_t, X_{t-1}, \cdots, X_{t-\infty}) \quad \rightarrow \quad Y_t = f(X_t, \color{red}{Y_{t-1}})$$


  • Instead of feeding a long history into the network, we recursively use previous predictions to carry information forward.

  • This approach is known as autoregressive modeling, where recursion occurs through the output.

  • In theory, the output $Y_t$ encodes information about the entire past.

  • It enables trading space (memory) for time, by compressing the historical information into a recursive structure.



Autoregressive modeling marks an important conceptual transition leading to recurrent neural networks (RNNs).
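The sketch below illustrates the idea with a toy, hand-written recursion. The rule for $f$ is fixed here purely for illustration; in a learned model it would be a neural network. Note that only the previous output is kept in memory, not the full history.

import numpy as np

# Toy autoregressive recursion: Y_t = f(X_t, Y_{t-1}); only Y_{t-1} is stored.
def f(X_t, Y_prev):
    return 0.9 * Y_prev + 0.1 * X_t    # a fixed linear rule, purely illustrative

X = np.sin(0.1 * np.arange(50))
Y_prev = 0.0
outputs = []
for X_t in X:
    Y_t = f(X_t, Y_prev)               # previous output is fed back
    outputs.append(Y_t)
    Y_prev = Y_t

print(len(outputs))                    # 50 outputs computed with O(1) memory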



2.3. An Alternate Model for Infinite Response Systems

The state-space model


$$ \begin{align*} H_t &= f(X_t, H_{t-1})\\ Y_t &= g(H_t) \end{align*} $$


Where

  • $H_t$ is the hidden (or internal) state at time $t$, summarizing all relevant information from the past.

  • $f(\cdot)$ updates the hidden state based on the current input $X_t$ and the previous hidden state $H_{t-1}$.

  • $g(\cdot)$ maps the hidden state to the output $Y_t$.


Rather than directly feeding the previous output $Y_{t-1}$ back into the model (as in standard autoregressive modeling), the hidden state $H_{t-1}$ is used. This approach compresses historical information into a latent state, enabling the system to maintain memory of the past without explicitly storing all previous outputs.
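A toy NumPy sketch of this recursion is given below. Here $f$ and $g$ are fixed random linear maps with a tanh nonlinearity, chosen only to illustrate the update pattern; in an RNN these maps are learned from data.

import numpy as np

# Toy state-space recursion: H_t = f(X_t, H_{t-1}), Y_t = g(H_t).
rng = np.random.default_rng(0)
n_in, n_state, n_out = 3, 8, 1

Wx = rng.normal(scale = 0.3, size = (n_state, n_in))      # input-to-state map
Wh = rng.normal(scale = 0.3, size = (n_state, n_state))   # state-to-state map
Wy = rng.normal(scale = 0.3, size = (n_out, n_state))     # state-to-output map

def f(X_t, H_prev):
    return np.tanh(Wx @ X_t + Wh @ H_prev)   # hidden state update

def g(H_t):
    return Wy @ H_t                          # output map

H = np.zeros(n_state)
for X_t in rng.normal(size = (20, n_in)):    # a sequence of 20 input vectors
    H = f(X_t, H)                            # H_t summarizes the past
    Y_t = g(H)

print(Y_t.shape)                             # (1,)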


This structure is quite well-known in control theory, where state-space models are widely used to describe the evolution of dynamic systems. In modern machine learning, this principle underlies architectures like recurrent neural networks (RNNs), which similarly update hidden states recursively to model sequential data.


To build a recurrent neural network (RNN) from the concept of the state-space model, consider three independent artificial neural networks (ANNs) as illustrated below (two equivalent perspectives are shown).

Each ANN takes five sequential inputs. However, because an ANN consists of fully connected layers, it does not inherently consider the sequential order of the inputs. Instead, the network simply extracts key features in the hidden layers to produce the output, without explicitly modeling temporal dependencies.



For improved visualization, the same concept is illustrated below using a 3D representation, where the structure of inputs, hidden layers, and outputs is more intuitively shown. Here, the multiple ANNs are shown processing sequential inputs, but without explicit temporal connections among them.



Each ANN takes five sequential inputs; however, it does not inherently account for the sequential nature of the data. Instead, it treats the input as a static vector, extracting key features in the hidden layers to produce the output, without explicitly modeling temporal dependencies across time steps. While stacking multiple ANNs could process different segments of data independently, this approach fails to capture how information evolves over time.

To address this limitation, connections can be introduced between the hidden layers of otherwise independent ANNs. By propagating compressed information from one hidden layer to the next, the network begins to incorporate temporal context into its representations. This enables the hidden layers to retain and utilize information from previous time steps, laying the foundation for modeling sequential dependencies in dynamic systems. This principle forms the conceptual basis for recurrent neural networks (RNNs).



In time-series data, determining both the number of sequential points within each input vector $X_t$ and the number of past time steps $N$ to consider is a critical modeling choice.

These parameters should be carefully selected based on the characteristics of the data and the requirements of the target application, as they significantly affect the model’s ability to capture temporal dependencies.



2.4. Recurrent Neural Network (RNN)

A recurrent neural network (RNN) is specifically designed to model sequential data by incorporating temporal dependencies.

In an RNN, the hidden state acts as a summary of all past information up to the current time step. Rather than treating each input independently, the network recursively updates its hidden state based on both the current input and the previous hidden state. This recursive structure enables the RNN to capture the dynamic evolution of data over time.

Thus, the hidden state serves as a compressed representation of the entire history, allowing the network to make informed predictions based on both present and past inputs.



The unfolded version of an RNN, as shown above, is a conceptual tool used to illustrate how the network operates across time steps. At each time step, the hidden state is updated based on the current input and the previous hidden state, and the output is generated accordingly.

However, in practice, an RNN is implemented as a folded architecture:

  • A single network is repeatedly applied at each time step.

  • Memory of the past is maintained through the hidden states, without explicitly creating multiple copies of the network.

  • The loops in the diagram indicate this recurrence.


Thus, while the unfolded view is useful for understanding the flow of information, the actual RNN consists of one network block that recurrently updates and carries forward its internal memory.

This recurrent structure enables the RNN to model sequences efficiently without increasing the size of the network over time.
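In Keras, this folded architecture is a single recurrent layer applied over the whole sequence. The sketch below uses assumed shapes (25 time steps of 10-dimensional inputs) purely for illustration; stacking additional recurrent layers with return_sequences = True gives the multi-layer variant shown next.

import tensorflow as tf

# A minimal folded-RNN sketch (assumed shapes): one SimpleRNN block is applied
# recurrently across 25 time steps, and a Dense layer maps the final hidden
# state to the output Y_t.
rnn = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (25, 10)),   # (n_step, n_input)
    tf.keras.layers.SimpleRNN(32),             # hidden state of size 32
    tf.keras.layers.Dense(1),                  # output Y_t
])

rnn.summary()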



Multiple Recurrent Layer RNN



2.4.1. Training Recurrent Neural Networks (RNNs)

Training an RNN involves adjusting its parameters so that it can learn to model the sequential dependencies in time-series data.

Unlike feedforward neural networks, RNNs require special considerations during training due to their recursive structure and temporal dependencies across time steps.


Loss Function over Sequences

In RNNs, the loss is typically defined over the entire sequence, not just at a single time step.

Suppose the model produces outputs $\hat{Y}_t$ at each time step $t$, and the true target values are $Y_t$. The total loss $L$ is often computed as:


$$L = \sum_{t=1}^{T}\ell(\hat Y_t, Y_t)$$


where $\ell(\cdot,\cdot)$ is a pointwise loss function (e.g., mean squared error for regression, cross-entropy for classification).


The goal is to minimize the sum of losses across all time steps, allowing the network to learn dependencies throughout the sequence.
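The sketch below shows one way this looks in Keras under assumed shapes: with return_sequences = True the recurrent layer emits an output at every time step, and the mean squared error is then averaged over all $T$ steps, which matches the summed loss above up to a constant factor.

import numpy as np
import tensorflow as tf

# Sequence loss sketch (assumed shapes): an output is produced at every time
# step, and the loss is averaged over all steps of the sequence.
T, n_in = 10, 4
model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (T, n_in)),
    tf.keras.layers.SimpleRNN(16, return_sequences = True),   # output at each step
    tf.keras.layers.Dense(1),
])
model.compile(optimizer = 'adam', loss = 'mean_squared_error')

X = np.random.randn(8, T, n_in).astype('float32')   # dummy sequences
Y = np.random.randn(8, T, 1).astype('float32')      # dummy per-step targets
print(model.evaluate(X, Y, verbose = 0))             # mean of the per-step losses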


Backpropagation Through Time (BPTT)

Because of the recurrent connections, gradient computation must account for how the hidden states evolve over time.

The standard technique is called Backpropagation Through Time (BPTT).


In BPTT:

  • The RNN is unfolded across all time steps, treating each unfolded copy like a layer in a deep feedforward network.

  • Gradients are computed by applying the chain rule through time as well as through the network layers.

  • Parameters are updated by accumulating gradients across all time steps.


This allows the model to adjust not only for the current input but also for the influence of earlier hidden states.
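With automatic differentiation, BPTT does not have to be implemented by hand: the unrolling happens inside the framework when gradients are requested. The sketch below (toy shapes, chosen only for illustration) computes gradients of a sequence loss with tf.GradientTape; they are accumulated through every time step of the recurrent layer.

import tensorflow as tf

# BPTT sketch (toy shapes): gradients flow back through all time steps of the
# SimpleRNN when tape.gradient is called.
model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (5, 3)),
    tf.keras.layers.SimpleRNN(8),
    tf.keras.layers.Dense(1),
])

X = tf.random.normal((4, 5, 3))   # (batch, time steps, features)
Y = tf.random.normal((4, 1))

with tf.GradientTape() as tape:
    Y_hat = model(X, training = True)
    loss = tf.reduce_mean(tf.square(Y_hat - Y))

grads = tape.gradient(loss, model.trainable_variables)   # accumulated over time
print([g.shape for g in grads])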


Vanishing and Exploding Gradients

However, training RNNs with BPTT introduces a critical challenge:

  • Vanishing gradients: Gradients shrink exponentially as they are propagated backward through many time steps, causing the network to "forget" long-term dependencies.

  • Exploding gradients: Gradients grow exponentially and may cause unstable updates or numerical overflow.


Both problems are rooted in the repeated multiplication of gradient terms over time steps during backpropagation.
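A common practical remedy for exploding gradients is gradient clipping, which rescales gradients whose norm exceeds a threshold. A minimal Keras sketch is shown below; the threshold of 1.0 is an arbitrary choice for illustration.

import tensorflow as tf

# Clip the global gradient norm during optimization to limit exploding gradients.
optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001, clipnorm = 1.0)

# model.compile(optimizer = optimizer, loss = 'mean_squared_error')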


2.4.2. Dynamical Systems and Natural Language Processing

In most physical and dynamical systems, the influence of an input on the output response tends to decay exponentially over time. This means that recent inputs typically have a stronger and more immediate effect on the system’s behavior, while distant past inputs gradually lose their impact. In this sense, the standard RNN architecture is well suited for modeling many physical and dynamical systems.



However, many applications in natural language processing (NLP) do not exhibit this characteristic. In NLP tasks, long-term dependencies often remain critically important. For example, in a paragraph, a subject introduced early on may heavily influence the meaning and interpretation of sentences that occur much later. Similarly, in machine translation or document summarization, words or phrases from the distant past must be remembered accurately to produce coherent outputs.

As a result, capturing long-term dependencies is essential for successful modeling in NLP. This requirement has motivated the development of specialized recurrent architectures, such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), which are explicitly designed to mitigate the limitations of standard RNNs and maintain important information over longer sequences.



2.4.3. Natural Language Processing and Word Embedding

In natural language processing (NLP), the raw input data consists of words, phrases, or entire documents, which are inherently discrete and symbolic. Neural networks, however, require numerical vector representations as inputs. Thus, word embeddings are essential techniques that map discrete words into continuous vector spaces, where semantic similarity is preserved:

  • Words with similar meanings are mapped to nearby points in the embedding space.

  • This allows neural networks to leverage both syntactic and semantic relationships between words.
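In Keras, this mapping is provided by an Embedding layer. The sketch below uses assumed sizes (a 10,000-word vocabulary, 64-dimensional embeddings, and random token ids) purely for illustration.

import numpy as np
import tensorflow as tf

# Embedding sketch (assumed sizes): integer token ids are mapped to dense
# 64-dimensional vectors that an RNN/LSTM can consume.
embedding = tf.keras.layers.Embedding(input_dim = 10000, output_dim = 64)

token_ids = np.random.randint(0, 10000, size = (32, 20))   # (batch, sequence length)
embedded = embedding(token_ids)
print(embedded.shape)                                      # (32, 20, 64)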


You might wonder why, in sequential modeling, sets of consecutive points are grouped together as follows:


$$X_t = \begin{bmatrix}x_{t-4} \\ x_{t-3} \\ x_{t-2} \\ x_{t-1} \\ x_{t}\end{bmatrix}$$


This construction becomes more intuitive if you think of it as analogous to word embeddings in NLP.


Just as a sequence of embedded word vectors forms the input to a language model, a sequence of past observations is collected to form the input to a sequential model for time-series data.





3. LSTM Networks

3.1. Long-Term Dependencies

While recurrent neural networks (RNNs) can model sequences by propagating hidden states through time, they often suffer from difficulties in learning long-term dependencies due to issues like vanishing or exploding gradients. In natural language processing (NLP), this challenge becomes even more critical.


Consider the sentence: "I grew up in France … I speak fluent French."


To correctly predict or understand the word "French," the model must retain information introduced many words earlier ("France").


However, a standard RNN may struggle to preserve such long-range dependencies because gradients diminish over time during backpropagation.

In contrast, the Long Short-Term Memory (LSTM) architecture introduces a memory cell and gating mechanisms that explicitly control the flow of information. This design enables LSTMs to retain critical information across long distances in the sequence, making them particularly effective for tasks that require remembering context over extended spans.





3.2. Long Short-Term Memory (LSTM)

A typical LSTM cell can be illustrated as follows:

  • The forget gate controls what information to discard from the previous cell state.

  • The input gate decides what new information to add.

  • The cell state carries important information forward across time steps.

  • The output gate determines what information to expose as the hidden state.
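For reference, these gates correspond to the standard LSTM update equations, written here with the hidden state $H_t$ and input $X_t$ as above, $\sigma$ the sigmoid function, and $\odot$ element-wise multiplication:

$$
\begin{align*}
f_t &= \sigma\left(W_f\,[H_{t-1}, X_t] + b_f\right) &&\text{(forget gate)}\\
i_t &= \sigma\left(W_i\,[H_{t-1}, X_t] + b_i\right) &&\text{(input gate)}\\
\tilde{C}_t &= \tanh\left(W_C\,[H_{t-1}, X_t] + b_C\right) &&\text{(candidate cell state)}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t &&\text{(cell state update)}\\
o_t &= \sigma\left(W_o\,[H_{t-1}, X_t] + b_o\right) &&\text{(output gate)}\\
H_t &= o_t \odot \tanh\left(C_t\right) &&\text{(hidden state)}
\end{align*}
$$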





We will not go deeper into the theoretical details of RNNs or LSTMs here. However, this does not mean that you do not need to understand how these networks work: a good conceptual understanding is still important for effective modeling.

In this course, we will focus more on how to model and use time-series data in practical applications rather than on theoretical analysis.

From a practical standpoint, implementing RNNs and LSTMs in Python is simple. Most deep learning libraries provide prebuilt modules for these architectures, allowing easy integration into real-world workflows.




4. Implementation for RNN and LSTM

Classification for Time-Series Data

In many real-world applications, time-series data is classified into categories based on patterns or behaviors observed over time.

The figure below illustrates the typical workflow for time-series classification using a recurrent neural network (RNN) or long short-term memory (LSTM) network:


Input Sequence:

  • A continuous time-series signal is segmented into meaningful chunks based on time windows or detected events.

Sequence Modeling:

  • Each chunk of sequential data, consisting of multiple time steps ($X_1, \cdots, X_t$), is passed through an RNN or LSTM network.

  • The network processes the sequence while maintaining hidden states that capture temporal dependencies and dynamics.


Feature Extraction and Classification:

  • After processing the sequence, the final hidden state (or an aggregation of hidden states) is fed into a fully connected layer to produce classification results.
  • Typical outputs in binary classification tasks include labels like "OK" (normal) or "NG" (not good / anomaly).

This architecture enables the model to recognize temporal patterns and dynamics across sequences, making it highly effective for time-series classification problems in domains such as manufacturing, healthcare monitoring, speech recognition, and finance.
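A minimal Keras sketch of this workflow is shown below. The shapes (50 time steps with 8 features) and the two-class output are assumptions made only for illustration.

import tensorflow as tf

# Time-series classification sketch (assumed shapes): the final hidden state
# summarizes the sequence, and a Dense softmax layer produces class
# probabilities (e.g., OK vs. NG).
clf = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (50, 8)),             # (n_step, n_input)
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(2, activation = 'softmax'),
])

clf.compile(optimizer = 'adam',
            loss = 'sparse_categorical_crossentropy',
            metrics = ['accuracy'])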



Prediction for Time-series Data (Regression)

In many applications, the goal is not to classify time-series data, but rather to predict future values based on historical observations. This task is typically framed as a regression problem.

The figure below illustrates the workflow for time-series prediction using a recurrent neural network (RNN) or long short-term memory (LSTM) network:


Input Sequence:

  • A continuous time-series signal is segmented into overlapping or sliding chunks, each containing a fixed number of past observations.

Sequence Modeling:

  • Each chunk, consisting of sequential inputs ($X_1, \cdots, X_t$), is passed through an RNN or LSTM network. The network processes the sequence while maintaining a hidden state that captures the temporal patterns embedded in the data.

Feature Extraction and Prediction:

  • After processing the sequence, the final hidden state (or an aggregation of hidden states) is passed through a fully connected layer to predict a continuous value (or multiple values). This output represents the forecasted future behavior of the time-series.

This approach enables the model to learn complex, nonlinear temporal dependencies, making it highly effective for tasks such as stock price prediction, sensor signal forecasting, energy consumption estimation, and weather modeling.



4.1. Lab 1: Vibration Signals

In this section, we will explore a prediction example using vibration signals. Vibration signals are commonly encountered in applications such as mechanical fault diagnosis, structural health monitoring, and predictive maintenance. Accurately forecasting the future behavior of vibration signals can provide early warning signs of system anomalies or failures.

We will demonstrate how sequential models, such as LSTMs, can be applied to learn temporal patterns from historical vibration data and predict future trends.



Import Library

In [ ]:
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np

Load Vibration Data

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
data = np.load('/content/drive/MyDrive/ML/ML_data/rnn_vibration_signal.npy')

print(data.shape)
print('\n')

plt.figure(figsize = (6, 4))
plt.title('Time signal for RNN')
plt.plot(data[0:2000])
plt.xlim(0,2000)
plt.show()
(82000,)



LSTM Architecture

The figure below shows the architecture used for time-series prediction based on vibration signals.





Sequence Length:

  • The input sequence consists of $n_{\text{step}} = 25$ time steps.

  • Each input at a time step is a 100-dimensional vector ($n_{\text{input}} = 100$).


LSTM Layers:

  • Two stacked LSTM layers are used to process the sequence.

  • The first LSTM layer outputs a hidden state of size $n_{\text{LSTM1}} = 100$.

  • The second LSTM layer further processes this hidden representation, with a hidden size of $n_{\text{LSTM2}} = 100$.


Fully Connected Layers:

  • After processing through the LSTM layers, the final hidden state is passed through a fully connected (dense) layer.

  • The fully connected layer first maps the LSTM output to a hidden representation of size $n_{\text{hidden}} = 100$,

  • and then finally outputs a 100-dimensional prediction vector ($n_{\text{output}} = 100$).


Thus, the network maps an input sequence of length 25 with 100 features per step into a predicted output vector of dimension 100, capturing temporal dependencies over the entire sequence.


In [ ]:
n_step = 25
n_input = 100

# LSTM shape
n_lstm1 = 100
n_lstm2 = 100

# fully connected
n_hidden = 100
n_output = 100
In [ ]:
lstm_network = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (n_step, n_input)),
    tf.keras.layers.LSTM(n_lstm1, return_sequences = True),
    tf.keras.layers.LSTM(n_lstm2),
    tf.keras.layers.Dense(n_hidden),
    tf.keras.layers.Dense(n_output),
])

lstm_network.summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                     Output Shape                  Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ lstm_8 (LSTM)                   │ (None, 25, 100)        │        80,400 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ lstm_9 (LSTM)                   │ (None, 100)            │        80,400 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_8 (Dense)                 │ (None, 100)            │        10,100 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_9 (Dense)                 │ (None, 100)            │        10,100 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 181,000 (707.03 KB)
 Trainable params: 181,000 (707.03 KB)
 Non-trainable params: 0 (0.00 B)
In [ ]:
lstm_network.compile(optimizer = 'adam',
                     loss = 'mean_squared_error',
                     metrics = ['mse'])

Dataset Preparation for Vibration Signal Prediction

The following function is used to generate training and testing datasets from a continuous vibration signal:

In [ ]:
def dataset(data, n_samples, n_step = n_step, dim_input = n_input, dim_output = n_output, stride = 5):

    train_x_list = []
    train_y_list = []
    for i in range(n_samples):
        # input: a window of n_step*dim_input consecutive points,
        # reshaped to (n_step, dim_input) for the LSTM
        train_x = data[i*stride:i*stride + n_step*dim_input]
        train_x = train_x.reshape(n_step, dim_input)
        train_x_list.append(train_x)

        # target: the next dim_output points immediately after the input window
        train_y = data[i*stride + n_step*dim_input:i*stride + n_step*dim_input + dim_output]
        train_y_list.append(train_y)

    train_data = np.array(train_x_list)
    train_label = np.array(train_y_list)

    # one fixed test sequence taken from a later part of the signal
    test_data = data[10000:10000 + n_step*dim_input]
    test_data = test_data.reshape(1, n_step, dim_input)

    return train_data, train_label, test_data
In [ ]:
train_data, train_label, test_data = dataset(data, 5000)

lstm_network.fit(train_data, train_label, epochs = 3)
Epoch 1/3
157/157 ━━━━━━━━━━━━━━━━━━━━ 3s 7ms/step - loss: 0.0609 - mse: 0.0609
Epoch 2/3
157/157 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - loss: 0.0072 - mse: 0.0072
Epoch 3/3
157/157 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - loss: 0.0051 - mse: 0.0051
Out[ ]:
<keras.src.callbacks.history.History at 0x7dc4d4c30c90>

Testing or Evaluating

The model receives a sequence of past vibration data (test input) and outputs a prediction for the next segment of the signal.

In [ ]:
test_pred = lstm_network.predict(test_data, verbose = 0).ravel()
test_label = data[10000:10000 + n_step*n_input + n_input]

plt.figure(figsize = (6, 4))
plt.plot(np.arange(0, n_step*n_input + n_input), test_label, 'b', label = 'Ground truth')
plt.plot(np.arange(n_step*n_input, n_step*n_input + n_input), test_pred, 'r', label = 'Prediction')
plt.vlines(n_step*n_input, -1, 1, colors = 'r', linestyles = 'dashed')
plt.legend(loc = 'upper left')
plt.xlim(0, len(test_label))
plt.show()

Generating Future Time-Series Predictions

After training the LSTM model, we evaluate its ability to predict future time steps by recursively generating new signal values:

  • Starting with an initial test_data sequence, the model predicts one future output.

  • The predicted output is appended to the input sequence, while the oldest time step is discarded (sliding the window forward).

  • This updated sequence is used to predict the next future value.

  • This recursive process is repeated n_step times to generate multiple future points.

In [ ]:
gen_signal = []

for i in range(n_step):
    # predict the next n_input points from the current window
    test_pred = lstm_network.predict(test_data, verbose = 0)
    gen_signal.append(test_pred.ravel())
    test_pred = test_pred[:, np.newaxis, :]

    # slide the window: drop the oldest time step and append the prediction
    test_data = test_data[:, 1:, :]
    test_data = np.concatenate([test_data, test_pred], axis = 1)

gen_signal = np.concatenate(gen_signal)

test_label = data[10000:10000 + n_step*n_input + n_step*n_input]

plt.figure(figsize = (6, 4))
plt.plot(np.arange(0, n_step*n_input + n_step*n_input), test_label, 'b', label = 'Ground truth')
plt.plot(np.arange(n_step*n_input,  n_step*n_input + n_step*n_input), gen_signal, 'r', label = 'Prediction')
plt.vlines(n_step*n_input, -1, 1, colors = 'r', linestyles = 'dashed')
plt.legend(loc = 'upper left')
plt.xlim(0, len(test_label))
plt.show()

Of course, as the model recursively generates future predictions, prediction errors tend to accumulate over time. Each new prediction is conditioned, at least in part, on previously predicted values rather than on true observations. Consequently, even small errors introduced early in the sequence can propagate and amplify as the prediction horizon extends, leading to a gradual degradation in predictive accuracy.

This phenomenon is well-known in autoregressive forecasting frameworks and highlights the challenges associated with long-term prediction in time-series modeling.




4.2. Lab 2: MNIST

In this lab, you will build an LSTM-based model to predict the second half of an MNIST image given its first half.

  • An MNIST image has a shape of 28 $\times$ 28 pixels.

  • Each image will be split into 28 sequential pieces, where each piece corresponds to a row vector of size 1 $\times$ 28.

  • The task is framed as a sequence prediction problem:

    • The first 14 rows (14 $\times$ 28) will be provided as input to the model,

    • and the model will recursively predict the remaining 14 rows (14 $\times$ 28).


The input is treated as a time series, where each time step corresponds to a row of pixels. The model learns to generate future "rows" of the image based on the previously observed sequence.

This setting naturally illustrates how LSTM networks can be used for sequence-to-sequence prediction, even in image-based applications.




In [ ]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

Load MNIST Data

In [ ]:
(train_imgs, train_labels), (test_imgs, test_labels) = tf.keras.datasets.mnist.load_data()

train_imgs = train_imgs/255.0
test_imgs = test_imgs/255.0

print('train_x: ', train_imgs.shape)
print('test_x: ', test_imgs.shape)
train_x:  (60000, 28, 28)
test_x:  (10000, 28, 28)

Plot a randomly selected image with its label

In [ ]:
idx = np.random.randint(len(train_imgs))
img = train_imgs[idx].reshape(28,28)
label = train_labels[idx]

plt.figure(figsize = (5, 3))
plt.imshow(img,'gray')
plt.title("Label : {}".format(label))
plt.xticks([])
plt.yticks([])
plt.show()

Define LSTM Structure




In [ ]:
n_step = 14
n_input = 28

## LSTM shape
n_lstm1 = 10
n_lstm2 = 10

## Fully connected
n_hidden = 100
n_output = 28
In [ ]:
lstm_network = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (n_step, n_input)),
    tf.keras.layers.LSTM(n_lstm1, return_sequences = True),
    tf.keras.layers.LSTM(n_lstm2),
    tf.keras.layers.Dense(n_hidden, activation = 'relu'),
    tf.keras.layers.Dense(n_output),
])

lstm_network.summary()
Model: "sequential_5"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                     Output Shape                  Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ lstm_10 (LSTM)                  │ (None, 14, 10)         │         1,560 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ lstm_11 (LSTM)                  │ (None, 10)             │           840 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_10 (Dense)                │ (None, 100)            │         1,100 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_11 (Dense)                │ (None, 28)             │         2,828 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 6,328 (24.72 KB)
 Trainable params: 6,328 (24.72 KB)
 Non-trainable params: 0 (0.00 B)

Define Cost, Initializer and Optimizer

Regression: Squared loss


$$ \frac{1}{N} \sum_{i=1}^{N} (\hat{y}^{(i)} - y^{(i)})^2$$

In [ ]:
lstm_network.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.0005),
                     loss = 'mean_squared_error')

Training

In [ ]:
def preprocess_dataset(imgs, n_step):
    N, H, W = imgs.shape
    num_windows = H - n_step          # sliding windows per image (28 - 14 = 14)
    total = N * num_windows

    X = np.zeros((total, n_step, W), dtype=np.float32)
    Y = np.zeros((total, W), dtype=np.float32)

    idx = 0
    for img in imgs:
        for j in range(num_windows):
            X[idx] = img[j : j+n_step]    # input: n_step consecutive rows
            Y[idx] = img[j+n_step]        # target: the next row
            idx += 1
    return X, Y
In [ ]:
train_x, train_y = preprocess_dataset(train_imgs, 14)
In [ ]:
lstm_network.fit(train_x, train_y, batch_size = 50, epochs = 3, validation_split = 0.1)
Epoch 1/3
15120/15120 ━━━━━━━━━━━━━━━━━━━━ 83s 5ms/step - loss: 0.0357 - val_loss: 0.0184
Epoch 2/3
15120/15120 ━━━━━━━━━━━━━━━━━━━━ 81s 5ms/step - loss: 0.0177 - val_loss: 0.0163
Epoch 3/3
15120/15120 ━━━━━━━━━━━━━━━━━━━━ 81s 5ms/step - loss: 0.0161 - val_loss: 0.0155
Out[ ]:
<keras.src.callbacks.history.History at 0x7dc4eaccaa10>

Test or Evaluate

In the evaluation phase, the trained model is used to predict the missing portion of an MNIST image.

  • Each MNIST image has a size of 28 $\times$ 28 pixels.

  • The model is tasked with predicting the second half of the image, where each predicted output corresponds to a row vector of size 1 $\times$ 28.


Procedure:

  • The first 14 rows (14 $\times$ 28) of the image are provided as input to the model.

  • The model then recursively predicts the remaining 14 rows (14 $\times$ 28), one row at a time.

  • Each newly predicted row is fed back into the model to assist in generating subsequent predictions.


This setup evaluates the model's ability to understand sequential patterns within the image and generate a coherent continuation based on partial information.

In [ ]:
test_x = test_imgs[10].reshape(28,28)
test_y = test_labels[10]

gen_img = []

sample = test_x[0:14, :]          # first 14 rows, kept for visualization
input_img = sample.copy()

feeding_img = test_x[0:14, :]     # rolling input window for the model

for i in range(n_step):
    # predict the next row, then slide the window: drop the oldest row
    # and append the predicted one
    test_pred = lstm_network.predict(feeding_img.reshape(1, 14, 28), verbose = 0)
    feeding_img = np.delete(feeding_img, 0, 0)
    feeding_img = np.vstack([feeding_img, test_pred])
    gen_img.append(test_pred)

# stack the 14 generated rows under the 14 input rows to form the full image
for i in range(n_step):
    sample = np.vstack([sample, gen_img[i]])

plt.figure(figsize = (8, 20))
plt.subplot(1,3,1)
plt.imshow(test_x, 'gray')
plt.title('Original Img')
plt.xticks([])
plt.yticks([])

plt.subplot(1,3,2)
plt.imshow(input_img, 'gray')
plt.title('Input')
plt.xticks([])
plt.yticks([])

plt.subplot(1,3,3)
plt.imshow(sample, 'gray')
plt.title('Generated Img')
plt.xticks([])
plt.yticks([])
plt.show()