RNN and LSTM

By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST

1. Time-Series Analysis

1.1. So Far

Regression, classification, and dimension reduction
- All based on snapshot-type data

Now the sequence matters

What is a sequence?
- sentence
- medical signals
- speech waveform
- vibration measurement
Sequence Modeling
- Most of the real-world data is time-series
- There are important bits to be considered
  - Past events
  - Relationship between events
    - Causality
    - Credit assignment
Learning the structure and hierarchy
Use the past and present observations to predict the future

1.2. (Deterministic) Sequences and Difference Equations

We will focus on linear difference equations (LDE), a surprisingly rich topic both theoretically and practically.

For example,

$$ y[0]=1,\quad y[1]=\frac{1}{2},\quad y[2]=\frac{1}{4},\quad \cdots $$

or by closed-form expression,

$$y[n]=\left(\frac{1}{2}\right)^n,\quad n \geq 0 $$

or with a difference equation and an initial condition,

$$y[n]=\frac{1}{2}y[n-1],\quad y[0]=1$$

Higher-order homogeneous LDE

$$y[n]=\alpha_1 y[n-1] + \alpha_2 y[n-2] + \cdots + \alpha_k y[n-k]$$
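The recursion above can be iterated directly. The following is a minimal sketch, assuming NumPy; the helper name `simulate_lde` and the Fibonacci example are illustrative additions, not part of the original notes. It checks that the first-order recursion reproduces the closed form $\left(\frac{1}{2}\right)^n$.

```python
import numpy as np

def simulate_lde(alphas, initial, n_terms):
    """Iterate y[n] = alphas[0]*y[n-1] + ... + alphas[k-1]*y[n-k]."""
    k = len(alphas)
    y = list(initial)              # the first k values are the initial conditions
    for n in range(k, n_terms):
        # y[n-1], y[n-2], ..., y[n-k] paired with alpha_1, ..., alpha_k
        past = y[n - k:n][::-1]
        y.append(float(np.dot(alphas, past)))
    return np.array(y)

# First-order example from the text: y[n] = 0.5*y[n-1], y[0] = 1
y_rec = simulate_lde([0.5], [1.0], 10)
y_closed = 0.5 ** np.arange(10)

# Second-order example (Fibonacci): y[n] = y[n-1] + y[n-2], y[0] = y[1] = 1
fib = simulate_lde([1.0, 1.0], [1.0, 1.0], 10)

print(y_rec[:4])
print(np.allclose(y_rec, y_closed))
print(fib)
```

A $k$-th order equation needs $k$ initial values, which is why `initial` is a list rather than a single number.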

1.3. (Stochastic) Time Series Analysis

1.3.1. Stationarity and Non-Stationary Series

A series is stationary if there is no systematic change in mean and variance over time
- Example: radio static
A series is non-stationary if its mean or variance changes over time
- Example: GDP, population, weather, etc.
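A rough way to see the difference is to compare summary statistics over different parts of a series. The sketch below uses synthetic data (white noise as a stand-in for radio static, and the same noise with an added linear drift); the helper `half_stats` and all constants are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(2000)

static = rng.normal(0.0, 1.0, size=t.size)   # stationary: like radio static
trending = 0.01 * t + static                 # non-stationary: mean drifts upward

def half_stats(x):
    """Return (mean, variance) for the first and second halves of x."""
    first, second = np.split(x, 2)
    return (first.mean(), first.var()), (second.mean(), second.var())

(static_m1, static_v1), (static_m2, static_v2) = half_stats(static)
(trend_m1, trend_v1), (trend_m2, trend_v2) = half_stats(trending)

print("static   mean shift:", static_m2 - static_m1)
print("trending mean shift:", trend_m2 - trend_m1)
```

Comparing two halves is only a quick diagnostic, not a formal stationarity test.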

1.3.2. Dealing with Non-Stationarity

Linear trends

Non-linear trends

For example, population may grow exponentially

Seasonal trends

Some series may exhibit seasonal trends
For example, weather pattern, employment, inflation, etc.

Combining Linear, Quadratic, and Seasonal Trends

Some data may have a combination of trends

One solution is to apply repeated differencing to the series
For example, first remove the seasonal trend, then remove the linear trend
Inspect the model fit by examining a Q-Q plot of the residuals
Alternatively, include both linear and cyclical trend terms in the model

$$ \begin{align*} Y_t &= \beta_1 + \beta_2 Y_{t-1} \\ &+ \beta_3 t + \beta_4 t^{\beta_5} \\ &+ \beta_6 \sin \frac{2\pi}{s}t + \beta_7 \cos \frac{2\pi}{s}t \\ &+ u_t \end{align*} $$
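Both approaches can be tried on a synthetic series. The sketch below is a simplified illustration: the period `s = 12`, the coefficients, and the noise-free signal are assumptions, and the regression keeps only the intercept, linear, and sine/cosine terms of the model above (the autoregressive and power-law terms are left out to keep the fit linear).

```python
import numpy as np

s = 12                                   # assumed seasonal period
t = np.arange(120, dtype=float)
# Synthetic series: intercept + linear trend + seasonal component (no noise)
y = 2.0 + 0.5 * t + 3.0 * np.sin(2 * np.pi * t / s)

# Seasonal differencing removes the periodic part: y[t] - y[t-s]
seasonal_diff = y[s:] - y[:-s]           # equals 0.5 * s everywhere
# First differencing then removes what is left of the linear trend
residual = np.diff(seasonal_diff)        # approximately zero

# Alternative: regress on trend and cyclical terms directly
X = np.column_stack([np.ones_like(t), t,
                     np.sin(2 * np.pi * t / s),
                     np.cos(2 * np.pi * t / s)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(seasonal_diff[:3])
print(np.abs(residual).max())
print(np.round(beta, 3))
```

On real data the residual would not vanish; its distribution is what the Q-Q plot is meant to inspect.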

1.4. Time-Series Data

(Almost) all data coming from a manufacturing environment are time-series data

sensor data,
process times,
material measurement,
equipment maintenance history,
image data, etc.

A manufacturing application is typically about one of the following:

prediction of time-series values
anomaly detection on time-series data
classification of time-series values
metrology and inspection

Definition of time-series

$$x: T \rightarrow \mathbb{R}^n \;\; \text{where}\;\; T=\{\cdots, t_{-2},t_{-1},t_0,t_1,t_2, \cdots \}$$

Example: material measurements: when $n=3$

$$x(t) = \begin{bmatrix} \text{average thickness}(t)\\ \text{thickness variance}(t)\\ \text{resistivity}(t) \end{bmatrix} $$

Supervised and Unsupervised Learning for Time-series

For supervised learning, we define two time series

$$x: T \rightarrow \mathbb{R}^n \;\; \text{and} \;\; y: T \rightarrow \mathbb{R}^m$$

Supervised time-series learning:

$$ \begin{align*} \text{predict} \quad &y(t_k) \\ \text{given} \quad & x(t_k), x(t_{k-1}), \cdots \;\, \text{and} \;\, y(t_{k-1}), y(t_{k-2}), \cdots \end{align*} $$
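In practice the "given" part is truncated to a finite number of lags. The following sketch builds such input/target pairs from two toy series; the function name `make_windows`, the lag count, and the toy data are hypothetical choices for illustration.

```python
import numpy as np

def make_windows(x, y, n_lags):
    """Build (features, targets) for predicting y(t_k) from recent x and past y."""
    features, targets = [], []
    for k in range(n_lags, len(y)):
        x_hist = x[k - n_lags + 1:k + 1]     # x(t_{k-n_lags+1}), ..., x(t_k), oldest first
        y_hist = y[k - n_lags:k]             # y(t_{k-n_lags}), ..., y(t_{k-1}), oldest first
        features.append(np.concatenate([x_hist.ravel(), y_hist.ravel()]))
        targets.append(y[k])
    return np.array(features), np.array(targets)

# Toy data: x has n = 2 channels, y has m = 1 channel
T = 50
x = np.arange(2 * T, dtype=float).reshape(T, 2)
y = np.arange(T, dtype=float).reshape(T, 1)

F, Y = make_windows(x, y, n_lags=3)
print(F.shape, Y.shape)
```

Each row of `F` can then be fed to any regressor or classifier; a recurrent model would instead keep the time axis rather than flattening it.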

Unsupervised time-series anomaly detection

Find a time segment that is considerably different from the rest

$$ \begin{align*} \text{find} \quad & k^* \\ \text{such that} \quad & x(t_k) |_{k=k^*}^{k^*+s} \;\; \text{is significantly different from} \;\, x(t_k) |_{k=-\infty}^{\infty} \end{align*} $$
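One very simple way to make "significantly different" concrete is to score every segment by how far its mean lies from the global mean. This is only a sketch under that assumption; the synthetic signal, the injected anomaly, and the scoring rule are illustrative and not a method prescribed by these notes.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 0.05, size=1000)
x[300:320] += 5.0                       # injected anomaly (illustration only)

s = 20                                  # segment length
# Mean of every length-s segment x[k:k+s], computed with a moving average
segment_means = np.convolve(x, np.ones(s) / s, mode="valid")

# Score each segment by its distance from the global mean, in global std units
scores = np.abs(segment_means - x.mean()) / x.std()
k_star = int(np.argmax(scores))
print(k_star, scores[k_star])
```

A mean-shift score misses anomalies that change the shape or frequency content of the signal while leaving the mean untouched, so real applications usually need richer features.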

1.5. Markov Process

1.5.1. Sequential Processes

Most classifiers ignore the sequential aspects of data
Consider a system which can occupy one of $N$ discrete states or categories

$q_t \in \{S_1,S_2,\cdots,S_N\}$

We are interested in stochastic systems, in which state evolution is random
Any joint distribution can be factored into a series of conditional distributions

$$p(q_0,q_1,\cdots,q_T ) = p(q_0) \; p(q_1 \mid q_0) \; p( q_2 \mid q_1 q_0 ) \; p( q_3 \mid q_2 q_1 q_0 ) \cdots$$
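Almost impossible to compute: each conditional depends on the entire history
If each state depends only on the previous state, the factorization simplifies to
$$p(q_0,q_1,\cdots,q_T ) = p(q_0) \; p(q_1 \mid q_0) \; p( q_2 \mid q_1 ) \; p( q_3 \mid q_2 ) \cdots$$
Possible and tractable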
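A quick parameter count shows why the second factorization is tractable. The sketch below assumes $N$ discrete states, a sequence $q_0, \cdots, q_T$, and a single transition table shared across all time steps (time-homogeneity is an added assumption); the function names are illustrative.

```python
def full_joint_params(N, T):
    """Free parameters of an unrestricted joint over q_0..q_T with N states each."""
    return N ** (T + 1) - 1

def markov_params(N):
    """Free parameters of p(q_0) plus one shared transition table p(q_{t+1} | q_t)."""
    return (N - 1) + N * (N - 1)

N, T = 3, 10
print(full_joint_params(N, T))   # 3**11 - 1
print(markov_params(N))          # 2 + 3*2
```

The first count grows exponentially with the sequence length, while the second does not depend on it at all.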

1.5.2. Markov Process

(Assumption) for a Markov process, the next state depends only on the current state:

$$ p(q_{t+1} \mid q_t,\cdots,q_0) = p(q_{t+1} \mid q_t)$$

More clearly

$$ p(q_{t+1} = s_j \mid q_t = s_i) = p(q_{t+1} = s_j \mid q_t = s_i,\; \text{any earlier history})$$

Given current state, the past does not matter
The state captures all relevant information from the history
The state is a sufficient statistic of the future

1.5.3. State Transition Matrix

For a Markov state $s$ and successor state $s'$, the state transition probability is defined by

$$ P_{ss'} = P\left[S_{t+1} = s' \mid S_t = s \right] $$

State transition matrix $P$ defines transition probabilities from all states $s$ to all successor states $s'$.
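A few basic operations on a transition matrix are sketched below, assuming NumPy and an illustrative 3-state matrix. The helper `sequence_probability` and the initial distribution `p0` are names introduced here for illustration.

```python
import numpy as np

# Illustrative 3-state transition matrix: row s, column s' holds P_{ss'}
P = np.array([[0.0, 0.0, 1.0],
              [1/2, 1/2, 0.0],
              [1/3, 2/3, 0.0]])

# Every row is a probability distribution over successor states
assert np.allclose(P.sum(axis=1), 1.0)

def sequence_probability(states, P, p0):
    """p(q_0, ..., q_T) = p(q_0) * prod_t P[q_{t-1}, q_t] under the Markov assumption."""
    prob = p0[states[0]]
    for prev, nxt in zip(states[:-1], states[1:]):
        prob *= P[prev, nxt]
    return prob

p0 = np.array([1.0, 0.0, 0.0])           # start in S1 (index 0)
p_seq = sequence_probability([0, 2, 1, 1], P, p0)

# Two-step transition probabilities are given by the matrix square
P2 = np.linalg.matrix_power(P, 2)
print(p_seq)
print(P2[0])
```

Because state 0 always moves to state 2, the first row of `P2` equals the third row of `P`.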

Example: Markov chain (MC) episodes

Sample episodes starting from $S_1$

import numpy as np

P = [[0, 0, 1],
    [1/2, 1/2, 0],
    [1/3, 2/3, 0]]

print(P[1][:])

[0.5, 0.5, 0]

a = np.random.choice(3,1,p = P[1][:])
print(a)

[1]

# sequential processes
# sequence generated by Markov chain
# S1 = 0, S2 = 1, S3 = 2

# starting from 0
x = 0
S = []
S.append(x)

for i in range(50):
    x = np.random.choice(3,1,p = P[x][:])[0]
    S.append(x)

print(S)

[0, 2, 1, 1, 1, 0, 2, 0, 2, 0, 2, 1, 0, 2, 0, 2, 1, 1, 0, 2, 0, 2, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 0, 2, 0, 2, 1, 1, 0, 2, 1, 1, 0, 2, 1, 0, 2, 1, 0, 2, 1]
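For a chain like this one, long episodes visit each state with a frequency that approaches the stationary distribution $\pi = \pi P$. The sketch below computes $\pi$ from the left eigenvector of $P$ and compares it with empirical visit frequencies; the eigenvector approach, the seed, and the episode length are choices made here for illustration.

```python
import numpy as np

P = np.array([[0.0, 0.0, 1.0],
              [1/2, 1/2, 0.0],
              [1/3, 2/3, 0.0]])

# Stationary distribution: left eigenvector of P with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()                        # normalize to a probability vector

# Empirical visit frequencies from a long simulated episode
rng = np.random.default_rng(0)
x, counts = 0, np.zeros(3)
for _ in range(20000):
    x = rng.choice(3, p=P[x])
    counts[x] += 1
freq = counts / counts.sum()

print(np.round(pi, 3))
print(np.round(freq, 3))
```

Solving $\pi = \pi P$ by hand for this matrix gives $\pi = (0.3, 0.4, 0.3)$, which the eigenvector computation should reproduce.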

1.6. Hidden Markov Models

Discrete state-space model
- Used in speech recognition
- State representation is simple
- Hard to scale up the training
Assumption
- We can observe something that is affected by the true state
- Natural way of thinking
Limited sensors (incomplete state information)
- But still partially related
Noisy sensors
- Unreliable

True state (or hidden variable) follows Markov chain
Observation emitted from state
- $Y_t$ is noisily determined depending on the current state $X_t$
Forward: sequence of observations can be generated
Question: state estimation

$$P(X_T = s_i \mid Y_1 Y_2 \cdots Y_T)$$

HMM can do this, but with many difficulties
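When the transition and emission probabilities are known, the state estimate above can be computed with the forward recursion (predict with the Markov chain, then reweight by the observation likelihood). The sketch below assumes a hypothetical 2-state, 2-symbol model; the matrices `A`, `B`, `pi` and all of their values are illustrative.

```python
import numpy as np

def forward_filter(obs, A, B, pi):
    """Return P(X_T = s_i | Y_1..Y_T) using the normalized forward recursion."""
    belief = pi * B[:, obs[0]]            # weight the prior by the first observation
    belief /= belief.sum()
    for y in obs[1:]:
        belief = belief @ A               # predict: propagate through the Markov chain
        belief = belief * B[:, y]         # update: weight by the emission likelihood
        belief /= belief.sum()
    return belief

# Hypothetical 2-state model; all numbers are illustrative
A = np.array([[0.9, 0.1],                 # A[i, j] = P(X_{t+1} = j | X_t = i)
              [0.1, 0.9]])
B = np.array([[0.8, 0.2],                 # B[i, k] = P(Y_t = k | X_t = i)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])                 # initial state distribution

posterior = forward_filter([0, 0, 0], A, B, pi)
print(posterior)
```

Three consecutive observations of symbol 0 push the belief strongly toward state 0. Learning `A` and `B` from data is the harder part and is not covered by this sketch.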

1.7. Kalman Filter

Linear dynamical system of motion

$$ \begin{align*} x_{t+1} &= A x_t + B u_t \\ z_t &= Cx_t \end{align*} $$

What are $A$, $B$, and $C$? (state transition, input, and measurement matrices)
Continuous State space model
- For filtering and control applications
- Linear-Gaussian state space model
- Widely used in many applications:
  - GPS, weather systems, etc.
Weakness
- Linear state space model assumed
- Difficult to apply to highly non-linear domains
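A minimal Kalman filter for the linear system above might look as follows. This is a sketch under several assumptions not stated in the notes: a position/velocity state with a known acceleration input, Gaussian measurement noise, and hand-picked covariances `Q` and `R`.

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 1.0
A = np.array([[1.0, dt], [0.0, 1.0]])     # state: [position, velocity]
B = np.array([[0.5 * dt**2], [dt]])       # input: acceleration
C = np.array([[1.0, 0.0]])                # only position is measured
Q = 1e-4 * np.eye(2)                      # assumed process noise covariance
R = np.array([[1.0]])                     # assumed measurement noise covariance

# Simulate the true system and noisy measurements
n_steps, u = 100, np.array([0.01])
x_true = np.array([0.0, 1.0])
truth, meas = [], []
for _ in range(n_steps):
    truth.append(x_true[0])
    meas.append(C @ x_true + rng.normal(0.0, 1.0, size=1))
    x_true = A @ x_true + B @ u

# Kalman filter: alternate measurement update and time update
x_hat, P = np.zeros(2), 10.0 * np.eye(2)
est = []
for z in meas:
    K = P @ C.T @ np.linalg.inv(C @ P @ C.T + R)    # Kalman gain
    x_hat = x_hat + K @ (z - C @ x_hat)             # correct with measurement
    P = (np.eye(2) - K @ C) @ P
    est.append(x_hat[0])
    x_hat = A @ x_hat + B @ u                       # predict next state
    P = A @ P @ A.T + Q

truth, est = np.array(truth), np.array(est)
meas = np.array(meas).ravel()
rmse_meas = np.sqrt(np.mean((meas[50:] - truth[50:]) ** 2))
rmse_est = np.sqrt(np.mean((est[50:] - truth[50:]) ** 2))
print(rmse_meas, rmse_est)
```

The filtered position error should come out smaller than the raw measurement error, because the filter combines each measurement with the model's prediction.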

2. Recurrent Neural Network (RNN)

RNNs are a family of neural networks for processing sequential data

2.1. Feedforward Network and Sequential Data

Separate parameters for each value of the time index
Cannot share statistical strength across different time indices

2.2. Representation Shortcut

Input at each time is a vector
Each layer has many neurons
Output layer too may have many neurons
But we will represent everything as simple boxes
Each box actually represents an entire layer with many units

2.3. An Alternate Model for Infinite Response Systems

The state-space model

$$ \begin{align*} h_t &= f(x_t, h_{t-1})\\ y_t &= g(h_t) \end{align*} $$

This is a recurrent neural network
State summarizes information about the entire past
Single Hidden Layer RNN (Simplest State-Space Model)

Multiple Recurrent Layer RNN

Recurrent Neural Network
- Simplified models often drawn
- The loops imply recurrence
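The state-space form $h_t = f(x_t, h_{t-1})$, $y_t = g(h_t)$ can be written out in a few lines of NumPy. The sketch below assumes a tanh state update and a linear readout, which is one common choice; the layer sizes and random weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_input, n_hidden, n_output = 3, 5, 2

# The same parameters are reused at every time step (hypothetical sizes and values)
W_x = rng.normal(0, 0.5, (n_hidden, n_input))
W_h = rng.normal(0, 0.5, (n_hidden, n_hidden))
b_h = np.zeros(n_hidden)
W_y = rng.normal(0, 0.5, (n_output, n_hidden))

def f(x_t, h_prev):
    """State update h_t = f(x_t, h_{t-1}); tanh is one common choice."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b_h)

def g(h_t):
    """Readout y_t = g(h_t); here a plain linear map."""
    return W_y @ h_t

xs = rng.normal(size=(10, n_input))       # a sequence of 10 input vectors
h = np.zeros(n_hidden)                    # initial state
ys = []
for x_t in xs:
    h = f(x_t, h)                         # the state carries the past forward
    ys.append(g(h))
ys = np.array(ys)
print(ys.shape)
```

The loop is the "unrolled" view of the recurrence: the drawn loop in the simplified diagrams corresponds to `h` being fed back into `f`.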

3. LSTM Networks

3.1. Long-Term Dependencies

Gradients propagated over many stages tend to either vanish or explode
Difficulty with long-term dependencies arises from the exponentially smaller weights given to long-term interactions
Introduce a memory state that runs through only linear operators
Use gating units to control the updates of the state
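The vanishing and exploding behavior can be reproduced with a purely linear recurrence, where backpropagating through $t$ steps multiplies the gradient by the same matrix $t$ times. The sketch below ignores the nonlinearity and uses a scaled orthogonal matrix so that the growth rate is exact; both are simplifying assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random orthogonal matrix, so that scale * Q_mat has all singular values equal to scale
Q_mat, _ = np.linalg.qr(rng.normal(size=(4, 4)))

def norm_after(scale, n_steps):
    """Norm of a unit gradient vector after n_steps multiplications by (scale * Q_mat).T."""
    W = scale * Q_mat
    grad = np.ones(4) / 2.0               # unit-norm starting vector
    for _ in range(n_steps):
        grad = W.T @ grad                 # one step of backpropagation through a linear recurrence
    return np.linalg.norm(grad)

for scale in (0.9, 1.0, 1.1):
    print(scale, norm_after(scale, 50))
```

After 50 steps the norm equals `scale ** 50`, so a factor only slightly below or above 1 already shrinks or grows the gradient by orders of magnitude.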

Example: "I grew up in France… I speak fluent French."

3.2. Long Short-Term Memory (LSTM)

Consists of a memory cell and a set of gating units
- Memory cell is the context that carries over
- Forget gate controls erase operation
- Input gate controls write operation
- Output gate controls the read operation

Connect LSTM cells in a recurrent manner
Train parameters in LSTM cells
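The erase, write, and read operations can be made concrete with a single LSTM step in NumPy. This is a sketch of the standard LSTM equations; the stacking order of the gate blocks inside `W`, `U`, `b`, the sizes, and the random weights are assumptions chosen here and do not necessarily match the layout used by any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b stack the forget, input, candidate, and output blocks."""
    n = h_prev.size
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0 * n:1 * n])           # forget gate: how much of c_prev to erase
    i = sigmoid(z[1 * n:2 * n])           # input gate: how much new content to write
    g = np.tanh(z[2 * n:3 * n])           # candidate content
    o = sigmoid(z[3 * n:4 * n])           # output gate: how much of the cell to read
    c_t = f * c_prev + i * g              # memory cell updated with elementwise gating only
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_units = 3, 4                      # hypothetical sizes
W = rng.normal(0, 0.5, (4 * n_units, n_in))
U = rng.normal(0, 0.5, (4 * n_units, n_units))
b = np.zeros(4 * n_units)

h, c = np.zeros(n_units), np.zeros(n_units)
for x_t in rng.normal(size=(6, n_in)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

When the forget gate is close to 1 and the input gate close to 0, `c_t` is copied forward almost unchanged, which is how the cell keeps information over many steps.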

LSTM for Classification

LSTM for Prediction

3.3. LSTM with TensorFlow

An example of predicting the next segment of a time signal
Regression problem
Acceleration signal of rotating machinery
Time series data and RNN

Import Library

import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
from six.moves import cPickle

Load Data

Import the acceleration signal of rotating machinery
rnn_time_signal.pkl

from google.colab import drive
drive.mount('/content/drive')

# Open the file in a context manager so it is closed after loading
with open('/content/drive/MyDrive/DL/DL_data/rnn_time_signal.pkl', 'rb') as f:
    data = cPickle.load(f)

plt.figure(figsize = (8, 4))
plt.title('Time signal for RNN')
plt.plot(data[0:2000])
plt.xlim(0,2000)
plt.show()

LSTM Model Training

n_step = 25
n_input = 100

# LSTM shape
n_lstm1 = 100
n_lstm2 = 100

# fully connected
n_hidden = 100
n_output = 100

lstm_network = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (n_step, n_input)),
    tf.keras.layers.LSTM(n_lstm1, return_sequences = True),
    tf.keras.layers.LSTM(n_lstm2),
    tf.keras.layers.Dense(n_hidden),
    tf.keras.layers.Dense(n_output),
])

lstm_network.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 lstm (LSTM)                 (None, 25, 100)           80400     
                                                                 
 lstm_1 (LSTM)               (None, 100)               80400     
                                                                 
 dense (Dense)               (None, 100)               10100     
                                                                 
 dense_1 (Dense)             (None, 100)               10100     
                                                                 
=================================================================
Total params: 181,000
Trainable params: 181,000
Non-trainable params: 0
_________________________________________________________________
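The parameter counts in the summary can be reproduced by hand. The sketch below uses the usual counting for an LSTM layer (four gate blocks, each with input weights, recurrent weights, and a bias) and for a dense layer; the helper names are illustrative.

```python
def lstm_params(n_in, n_units):
    """Four gate blocks, each with input weights, recurrent weights, and a bias."""
    return 4 * (n_in * n_units + n_units * n_units + n_units)

def dense_params(n_in, n_out):
    """Weight matrix plus bias vector."""
    return n_in * n_out + n_out

counts = [
    lstm_params(100, 100),    # lstm:    input 100 -> 100 units
    lstm_params(100, 100),    # lstm_1:  input 100 -> 100 units
    dense_params(100, 100),   # dense
    dense_params(100, 100),   # dense_1
]
print(counts, sum(counts))
```

Note that the counts do not depend on `n_step`: the same weights are applied at each of the 25 time steps.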

lstm_network.compile(optimizer = 'adam',
                     loss = 'mean_squared_error',
                     metrics = ['mse'])

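def dataset(data, n_samples, n_step = n_step, dim_input = n_input, dim_output = n_output, stride = 5):
    # Slide a window over the signal with the given stride.
    # Each input is n_step*dim_input consecutive samples reshaped to (n_step, dim_input);
    # its label is the next dim_output samples.
    window = n_step*dim_input

    train_x_list = []
    train_y_list = []
    for i in range(n_samples):
        start = i*stride
        train_x = data[start:start + window]
        train_x = train_x.reshape(n_step, dim_input)
        train_x_list.append(train_x)

        train_y = data[start + window:start + window + dim_output]
        train_y_list.append(train_y)

    train_data = np.array(train_x_list)
    train_label = np.array(train_y_list)

    # Test input: a single window starting at sample 10000.
    # Note: with n_samples = 5000 and stride = 5, this window lies inside the training range.
    test_data = data[10000:10000 + window]
    test_data = test_data.reshape(1, n_step, dim_input)

    return train_data, train_label, test_data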

train_data, train_label, test_data = dataset(data, 5000)

lstm_network.fit(train_data, train_label, epochs = 3)

Epoch 1/3
157/157 [==============================] - 40s 144ms/step - loss: 0.0332 - mse: 0.0332
Epoch 2/3
157/157 [==============================] - 9s 57ms/step - loss: 0.0069 - mse: 0.0069
Epoch 3/3
157/157 [==============================] - 14s 87ms/step - loss: 0.0052 - mse: 0.0052

Testing or Evaluating

Predict future time signal

test_pred = lstm_network.predict(test_data).ravel()
test_label = data[10000:10000 + n_step*n_input + n_input]

plt.figure(figsize = (8, 4))
plt.plot(np.arange(0, n_step*n_input + n_input), test_label, 'b', label = 'Ground truth')
plt.plot(np.arange(n_step*n_input, n_step*n_input + n_input), test_pred, 'r', label = 'Prediction')
plt.vlines(n_step*n_input, -1, 1, colors = 'r', linestyles = 'dashed')
plt.legend(fontsize = 15, loc = 'upper left')
plt.xlim(0, len(test_label))
plt.show()

gen_signal = []

for i in range(n_step):
    test_pred = lstm_network.predict(test_data, verbose = 0)
    gen_signal.append(test_pred.ravel())
    test_pred = test_pred[:, np.newaxis, :]

    test_data = test_data[:, 1:, :]
    test_data = np.concatenate([test_data, test_pred], axis = 1)

gen_signal = np.concatenate(gen_signal)

test_label = data[10000:10000 + n_step*n_input + n_step*n_input]

plt.figure(figsize = (8, 4))
plt.plot(np.arange(0, n_step*n_input + n_step*n_input), test_label, 'b', label = 'Ground truth')
plt.plot(np.arange(n_step*n_input,  n_step*n_input + n_step*n_input), gen_signal, 'r', label = 'Prediction')
plt.vlines(n_step*n_input, -1, 1, colors = 'r', linestyles = 'dashed')
plt.legend(fontsize=15, loc = 'upper left')
plt.xlim(0, len(test_label))
plt.show()
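The generation loop above feeds each prediction back into the input window. The same pattern can be isolated and checked without a trained network. In the sketch below, `rollout` and `toy_predict` are hypothetical names, and the stand-in predictor simply halves the most recent row so the result is easy to verify by hand.

```python
import numpy as np

def rollout(predict_fn, window, n_iters):
    """Feed each prediction back as the newest row of the input window."""
    generated = []
    for _ in range(n_iters):
        pred = predict_fn(window)                         # shape (1, n_input)
        generated.append(pred.ravel())
        window = np.concatenate([window[:, 1:, :],        # drop the oldest step
                                 pred[:, np.newaxis, :]], # append the prediction
                                axis=1)
    return np.concatenate(generated), window

# Stand-in predictor (not the trained LSTM): halve the most recent row
def toy_predict(window):
    return 0.5 * window[:, -1, :]

window0 = np.ones((1, 4, 3))              # (batch, n_step, n_input), toy sizes
signal, final_window = rollout(toy_predict, window0, n_iters=3)
print(signal)
```

Because every step consumes earlier predictions, errors can accumulate over a long rollout, so multi-step forecasts typically drift further from the ground truth than the single-step prediction.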
