RNN and LSTM


By Prof. Seungchul Lee
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST


1. Time-Series Analysis

1.1. So Far

  • Regression, Classification, Dimension Reduction

  • Based on snapshot-type data



  • Sequence matters


  • What is a sequence?

    • sentence
    • medical signals
    • speech waveform
    • vibration measurement
  • Sequence Modeling

    • Most of the real-world data is time-series
    • There are important bits to be considered
      • Past events
      • Relationship between events
        • Causality
        • Credit assignment
  • Learning the structure and hierarchy

  • Use the past and present observations to predict the future

1.2. (Deterministic) Sequences and Difference Equations

We will focus on linear difference equations (LDE), a surprisingly rich topic both theoretically and practically.

For example,


$$ y[0]=1,\quad y[1]=\frac{1}{2},\quad y[2]=\frac{1}{4},\quad \cdots $$

or by closed-form expression,


$$y[n]=\left(\frac{1}{2}\right)^n,\quad n \geq 0 $$

or with a difference equation and an initial condition,


$$y[n]=\frac{1}{2}y[n-1],\quad y[0]=1$$

Higher-order homogeneous LDE


$$y[n]=\alpha_1 y[n-1] + \alpha_2 y[n-2] + \cdots + \alpha_k y[n-k]$$
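A homogeneous LDE can be simulated by iterating the recursion from its initial conditions. The following is a minimal sketch; the function name `simulate_lde` and the example coefficients are illustrative assumptions, not part of the lecture material.

```python
import numpy as np

def simulate_lde(alphas, init, n_steps):
    """Iterate y[n] = alphas[0]*y[n-1] + ... + alphas[k-1]*y[n-k].

    alphas : coefficients (alpha_1, ..., alpha_k)
    init   : initial conditions y[0], ..., y[k-1]
    """
    k = len(alphas)
    y = list(init)
    for n in range(k, n_steps):
        # most recent value first, so it lines up with alpha_1, alpha_2, ...
        past = y[n - k:n][::-1]
        y.append(float(np.dot(alphas, past)))
    return np.array(y)

# First-order example from the text: y[n] = 0.5*y[n-1], y[0] = 1
y = simulate_lde([0.5], [1.0], 6)
closed_form = 0.5 ** np.arange(6)
print(y)
print(np.allclose(y, closed_form))
```

The same function handles higher orders: for instance, `simulate_lde([1, 1], [0, 1], 8)` iterates a second-order recursion with two initial conditions.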

1.3. (Stochastic) Time Series Analysis

1.3.1. Stationarity and Non-Stationary Series

  • A series is stationary if there is no systematic change in mean and variance over time

    • Example: radio static
  • A series is non-stationary if its mean or variance changes over time

    • Example: GDP, population, weather, etc.

1.3.2. Dealing with Non-Stationarity


Linear trends




Non-linear trends

  • For example, population may grow exponentially



Seasonal trends
  • Some series may exhibit seasonal trends

  • For example, weather pattern, employment, inflation, etc.



Combining Linear, Quadratic, and Seasonal Trends

  • Some data may have a combination of trends


  • One solution is to apply repeated differencing to the series

  • For example, first remove the seasonal trend, then remove the linear trend

  • Inspect the model fit by examining a Q-Q plot of the residuals

  • Alternatively, include both linear and cyclical trend terms in the model


\begin{align*} Y_t &= \beta_1 + \beta_2 Y_{t-1} \\ &+ \beta_3 t + \beta_4 t^{\beta_5} \\ &+ \beta_6 \sin \frac{2\pi}{s}t + \beta_7 \cos \frac{2\pi}{s}t \\ &+ u_t \end{align*}
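The repeated-differencing idea mentioned above can be sketched on synthetic data. The seasonal period `s`, the trend slope, and the noise level below are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
s = 12                       # assumed seasonal period
t = np.arange(240)

# synthetic series: linear trend + seasonal cycle + noise
y = 0.05 * t + np.sin(2 * np.pi * t / s) + 0.1 * rng.standard_normal(t.size)

# seasonal differencing: y[t] - y[t-s] cancels the period-s cycle
# and turns the linear trend into a constant offset of 0.05 * s
y_seasonal = y[s:] - y[:-s]

# first differencing then removes that remaining constant offset
y_diff = np.diff(y_seasonal)

print(y_seasonal.mean())   # close to 0.05 * s
print(y_diff.mean())       # close to 0
```

After both steps only noise-like residuals should remain, which is what a residual Q-Q plot would then be used to check.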

1.4. Time-Series Data

Almost all data coming from a manufacturing environment are time-series data

  • sensor data,
  • process times,
  • material measurement,
  • equipment maintenance history,
  • image data, etc.

A manufacturing application is typically about one of the following:

  • prediction of time-series values
  • anomaly detection on time-series data
  • classification of time-series values
  • metrology and inspection

Definition of time-series


$$x: T \rightarrow \mathbb{R}^n \;\; \text{where}\;\; T=\{\cdots, t_{-2},t_{-1},t_0,t_1,t_2, \cdots \}$$

Example: material measurements with $n=3$


$$x(t) = \begin{bmatrix} \text{average thickness}(t)\\ \text{thickness variance}(t)\\ \text{resistivity}(t) \end{bmatrix} $$

Supervised and Unsupervised Learning for Time-series

For supervised learning, we define two time series


$$x: T \rightarrow \mathbb{R}^n \;\; \text{and} \;\; y: T \rightarrow \mathbb{R}^m$$

Supervised time-series learning:


$$ \begin{align*} \text{predict} \quad &y(t_k) \\ \text{given} \quad & x(t_k), x(t_{k-1}), \cdots \;\, \text{and} \;\, y(t_{k-1}), y(t_{k-2}), \cdots \end{align*} $$
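In practice the infinite history is truncated to a fixed number of lags. A minimal sketch of turning two series into (features, target) pairs follows; the helper name `make_windows` and the parameter `n_lags` are hypothetical, and the toy series are made up for illustration.

```python
import numpy as np

def make_windows(x, y, n_lags):
    """Build (features, target) pairs for predicting y[k].

    Features for step k: x[k], x[k-1], ..., x[k-n_lags+1]
                    and  y[k-1], ..., y[k-n_lags].
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    feats, targets = [], []
    for k in range(n_lags, len(y)):
        x_part = x[k - n_lags + 1:k + 1][::-1]   # x[k], x[k-1], ...
        y_part = y[k - n_lags:k][::-1]           # y[k-1], y[k-2], ...
        feats.append(np.concatenate([x_part.ravel(), y_part.ravel()]))
        targets.append(y[k])
    return np.array(feats), np.array(targets)

x = np.arange(10.0)          # toy scalar input series
y = 2.0 * np.arange(10.0)    # toy scalar output series
F, T = make_windows(x, y, n_lags=3)
print(F.shape, T.shape)
print(F[0], T[0])
```

Any regressor can then be fit on `F` and `T`; the target `y[k]` never appears among its own features.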

Unsupervised time-series anomaly detection

  • Find a time segment that is considerably different from the rest

$$ \begin{align*} \text{find} \quad & k^* \\ \text{such that} \quad & x(t_k) |_{k=k^*}^{k^*+s} \;\; \text{is significantly different from} \;\, x(t_k) |_{k=-\infty}^{\infty} \end{align*} $$
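One simple way to make "significantly different" concrete is to score every window of length $s+1$ by how far its mean lies from the global mean. This scoring rule is an assumption made here for illustration; the function name `most_anomalous_segment` and the injected anomaly are likewise hypothetical.

```python
import numpy as np

def most_anomalous_segment(x, s):
    """Return k* whose window x[k*], ..., x[k*+s] has the most unusual mean.

    The score is the z-score of the window mean relative to the whole series.
    """
    x = np.asarray(x, dtype=float)
    w = s + 1                                   # window covers k, ..., k+s
    mu, sigma = x.mean(), x.std()
    # moving average; entry k is the mean of x[k:k+w]
    window_means = np.convolve(x, np.ones(w) / w, mode='valid')
    scores = np.abs(window_means - mu) / (sigma / np.sqrt(w))
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(0)
x = rng.standard_normal(500)
x[300:320] += 4.0            # injected anomaly, for illustration only
k_star, _ = most_anomalous_segment(x, s=19)
print(k_star)
```

A mean-based score only detects level shifts; changes in variance or shape would need a different statistic.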

1.5. Markov Process

1.5.1. Sequential Processes

  • Most classifiers ignore the sequential aspects of data

  • Consider a system which can occupy one of $N$ discrete states or categories

$q_t \in \{S_1,S_2,\cdots,S_N\}$

  • We are interested in stochastic systems, in which state evolution is random

  • Any joint distribution can be factored into a series of conditional distributions


$$p(q_0,q_1,\cdots,q_T ) = p(q_0) \; p(q_1 \mid q_0) \; p( q_2 \mid q_1, q_0 ) \; p( q_3 \mid q_2, q_1, q_0 ) \cdots$$
Almost impossible to compute in general, since each factor conditions on the entire history.

$$p(q_0,q_1,\cdots,q_T ) = p(q_0) \; p(q_1 \mid q_0) \; p( q_2 \mid q_1 ) \; p( q_3 \mid q_2 ) \cdots$$
Possible and tractable if each state depends only on its predecessor.
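With $N$ states, the full joint over $T+1$ steps has $N^{T+1}-1$ free parameters, whereas the simplified factorization needs only an initial distribution and one $N \times N$ transition matrix. A minimal sketch of evaluating the simplified joint is shown here; the matrix values and the initial distribution `p0` are illustrative assumptions.

```python
import numpy as np

# P[i, j] = p(q_{t+1} = S_{j+1} | q_t = S_{i+1}); values chosen for illustration
P = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.5, 0.0],
              [1/3, 2/3, 0.0]])
p0 = np.array([1.0, 0.0, 0.0])   # assumed initial distribution p(q_0)

def markov_sequence_prob(seq, p0, P):
    """p(q_0, ..., q_T) = p(q_0) * prod_t p(q_t | q_{t-1})."""
    prob = p0[seq[0]]
    for prev, curr in zip(seq[:-1], seq[1:]):
        prob *= P[prev, curr]
    return prob

# 1 * P[0,2] * P[2,1] * P[1,1] * P[1,0] = 1 * 1 * 2/3 * 1/2 * 1/2
print(markov_sequence_prob([0, 2, 1, 1, 0], p0, P))
```

For long sequences it is safer to sum log-probabilities instead of multiplying, to avoid numerical underflow.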

1.5.2. Markov Process

  • (Assumption) for a Markov process, the next state depends only on the current state:

$$ p(q_{t+1} \mid q_t,\cdots,q_0) = p(q_{t+1} \mid q_t)$$
  • More explicitly

$$ p(q_{t+1} = s_j \mid q_t = s_i) = p(q_{t+1} = s_j \mid q_t = s_i,\; \text{any earlier history})$$
  • Given current state, the past does not matter
  • The state captures all relevant information from the history
  • The state is a sufficient statistic of the future

1.5.3. State Transition Matrix

For a Markov state $s$ and successor state $s'$, the state transition probability is defined by


$$ P_{ss'} = P\left[S_{t+1} = s' \mid S_t = s \right] $$

State transition matrix $P$ defines transition probabilities from all states $s$ to all successor states $s'$.





Example: Markov chain (MC) episodes

  • sample episodes starting from $S_1$
In [ ]:
import numpy as np

# Row i holds the transition probabilities out of state S_{i+1}; each row sums to 1
P = [[0, 0, 1],
     [1/2, 1/2, 0],
     [1/3, 2/3, 0]]
In [ ]:
print(P[1][:])
[0.5, 0.5, 0]
In [ ]:
a = np.random.choice(3,1,p = P[1][:])
print(a)
[1]
In [ ]:
# sequential processes
# sequence generated by Markov chain
# S1 = 0, S2 = 1, S3 = 2

# starting from 0
x = 0
S = []
S.append(x)

for i in range(50):
    x = np.random.choice(3,1,p = P[x][:])[0]
    S.append(x)

print(S)
[0, 2, 1, 1, 1, 0, 2, 0, 2, 0, 2, 1, 0, 2, 0, 2, 1, 1, 0, 2, 0, 2, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 0, 2, 0, 2, 1, 1, 0, 2, 1, 1, 0, 2, 1, 0, 2, 1, 0, 2, 1]
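Going the other way, the transition matrix can be estimated from a sampled episode by counting transitions. The sketch below redefines the same `P` so that it runs on its own; the episode length and the use of `np.random.default_rng` are choices made here, not part of the original cells.

```python
import numpy as np

P = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.5, 0.0],
              [1/3, 2/3, 0.0]])
rng = np.random.default_rng(0)

# sample a long episode starting from S1 (index 0)
x, S = 0, [0]
for _ in range(20000):
    x = int(rng.choice(3, p=P[x]))
    S.append(x)

# count observed transitions i -> j, then normalize each row
counts = np.zeros((3, 3))
for i, j in zip(S[:-1], S[1:]):
    counts[i, j] += 1
P_hat = counts / counts.sum(axis=1, keepdims=True)

print(np.round(P_hat, 2))
```

With a long enough episode, `P_hat` approaches `P`; transitions with zero probability are never observed, so their estimated entries stay exactly zero.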

1.6. Hidden Markov Models

  • Discrete state-space model

    • Used in speech recognition
    • State representation is simple
    • Hard to scale up the training
  • Assumption

    • We can observe something that is affected by the true state
    • Natural way of thinking
  • Limited sensors (incomplete state information)

    • But still partially related
  • Noisy sensors

    • Unreliable


  • True state (or hidden variable) follows Markov chain

  • Observation emitted from state

    • $Y_t$ is noisily determined depending on the current state $X_t$
  • Forward: sequence of observations can be generated

  • Question: state estimation


$$P(X_T = s_i \mid Y_1 Y_2 \cdots Y_T)$$
  • HMM can do this, but with many difficulties
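For a small discrete HMM, the state-estimation question above can be answered with the normalized forward recursion. This is a minimal sketch: the two-state model and all numbers in `pi`, `A`, and `B` are hypothetical, and observations are assumed to be discrete symbol indices.

```python
import numpy as np

def forward_filter(pi, A, B, obs):
    """Return P(X_t = s_i | Y_1..Y_t) for every t via the normalized forward recursion.

    pi : (N,)   initial state distribution
    A  : (N, N) transitions, A[i, j] = P(X_{t+1} = j | X_t = i)
    B  : (N, M) emissions,   B[i, k] = P(Y_t = k | X_t = i)
    obs: sequence of observed symbol indices
    """
    alpha = pi * B[:, obs[0]]
    alpha = alpha / alpha.sum()
    out = [alpha]
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]   # predict with A, then weight by the likelihood of y
        alpha = alpha / alpha.sum()     # normalize to a posterior
        out.append(alpha)
    return np.array(out)

# Hypothetical two-state example (all numbers are assumptions for illustration)
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.8, 0.2],
              [0.3, 0.7]])
posterior = forward_filter(pi, A, B, [0, 0, 1, 1, 1])
print(posterior[-1])   # P(X_T | Y_1..Y_T)
```

The last row of `posterior` is the filtered estimate at time $T$; the cost grows as $N^2$ per step, which is one reason large state spaces are hard for HMMs.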

1.7. Kalman Filter

  • Linear dynamical system of motion


$$ \begin{align*} x_{t+1} &= A x_t + B u_t \\ z_t &= Cx_t \end{align*} $$
  • A, B, C ?

  • Continuous State space model

    • For filtering and control applications
    • Linear-Gaussian state space model
    • Widely used in many applications:
      • GPS, weather systems, etc.
  • Weakness

    • Linear state space model assumed
    • Difficult to apply to highly non-linear domains
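A minimal sketch of one Kalman filter predict/update cycle for the linear system above follows. The text shows noise-free dynamics; the Gaussian noise covariances `Q` and `R`, the constant-velocity example, and the function name `kalman_step` are assumptions added here for illustration.

```python
import numpy as np

def kalman_step(x, P, u, z, A, B, C, Q, R):
    """One cycle for x_{t+1} = A x_t + B u_t + w_t,  z_t = C x_t + v_t,
    with w_t ~ N(0, Q) and v_t ~ N(0, R)."""
    # predict the next state and its covariance
    x_pred = A @ x + B @ u
    P_pred = A @ P @ A.T + Q
    # update with the new measurement z
    S = C @ P_pred @ C.T + R                 # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (z - C @ x_pred)
    P_new = (np.eye(len(x)) - K @ C) @ P_pred
    return x_new, P_new

dt = 1.0
A = np.array([[1.0, dt], [0.0, 1.0]])   # position-velocity dynamics
B = np.array([[0.5 * dt**2], [dt]])     # acceleration input
C = np.array([[1.0, 0.0]])              # only position is measured
Q = 1e-4 * np.eye(2)
R = np.array([[0.25]])                  # measurement noise variance (std 0.5)

rng = np.random.default_rng(0)
x_true = np.array([0.0, 1.0])           # true state: position 0, velocity 1
x_est, P_est = np.zeros(2), np.eye(2)   # initial guess and its uncertainty
u = np.array([0.0])                     # no control input in this toy run

for _ in range(50):
    x_true = A @ x_true + B @ u
    z = C @ x_true + rng.normal(0.0, 0.5, size=1)
    x_est, P_est = kalman_step(x_est, P_est, u, z, A, B, C, Q, R)

print(x_true, x_est)
```

Although only position is measured, the velocity estimate also converges, because the dynamics matrix `A` couples the two state components.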

2. Recurrent Neural Network (RNN)

  • RNNs are a family of neural networks for processing sequential data

2.1. Feedforward Network and Sequential Data



  • Separate parameters for each value of the time index

  • Cannot share statistical strength across different time indices



2.2. Representation Shortcut

  • Input at each time is a vector
  • Each layer has many neurons
  • The output layer may also have many neurons
  • But we will represent everything as simple boxes
  • Each box actually represents an entire layer with many units


2.3. An Alternate Model for Infinite Response Systems

  • The state-space model

$$ \begin{align*} h_t &= f(x_t, h_{t-1})\\ y_t &= g(h_t) \end{align*} $$
  • This is a recurrent neural network

  • State summarizes information about the entire past

  • Single Hidden Layer RNN (Simplest State-Space Model)



  • Multiple Recurrent Layer RNN


  • Recurrent Neural Network
    • Simplified models often drawn
    • The loops imply recurrence
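The state-space recursion $h_t = f(x_t, h_{t-1})$, $y_t = g(h_t)$ can be sketched in NumPy. The text leaves $f$ and $g$ unspecified; a `tanh` state update and a linear readout are assumed here, and all sizes and weight names are hypothetical.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), y_t = W_hy h_t + b_y over a sequence."""
    h = np.zeros(W_hh.shape[0])      # initial state h_0 = 0 (an assumption)
    ys = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)   # f(x_t, h_{t-1})
        ys.append(W_hy @ h + b_y)                # g(h_t)
    return np.array(ys), h

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 5, 2, 7   # hypothetical sizes
params = (0.1 * rng.standard_normal((n_hid, n_in)),
          0.1 * rng.standard_normal((n_hid, n_hid)),
          0.1 * rng.standard_normal((n_out, n_hid)),
          np.zeros(n_hid), np.zeros(n_out))
ys, h_T = rnn_forward(rng.standard_normal((T, n_in)), *params)
print(ys.shape, h_T.shape)
```

The same weights are reused at every time step, which is what lets the network share statistical strength across time indices.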


3. LSTM Networks

3.1. Long-Term Dependencies

  • Gradients propagated over many stages tend to either vanish or explode
  • Difficulty with long-term dependencies arises from the exponentially smaller weights given to long-term interactions
  • Introduce a memory state that runs through only linear operators
  • Use gating units to control the updates of the state
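The vanishing and exploding behavior can be illustrated with a simplified linear recurrence $h_t = W h_{t-1}$, where the gradient through $n$ steps involves the $n$-th power of $W$. Dropping the nonlinearity is an assumption made to keep the sketch short; the matrices below are illustrative.

```python
import numpy as np

def gradient_norm_through_time(W, n_steps):
    """Spectral norm of W^n for n = 1..n_steps (Jacobian of h_n w.r.t. h_0 when h_t = W h_{t-1})."""
    J = np.eye(W.shape[0])
    norms = []
    for _ in range(n_steps):
        J = W @ J
        norms.append(np.linalg.norm(J, 2))
    return np.array(norms)

W_small = 0.9 * np.eye(3)   # spectral radius < 1: gradients vanish
W_large = 1.1 * np.eye(3)   # spectral radius > 1: gradients explode
print(gradient_norm_through_time(W_small, 50)[-1])
print(gradient_norm_through_time(W_large, 50)[-1])
```

Both norms change exponentially with the number of steps, which is why a memory path with near-identity dynamics helps preserve long-term information.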


Example: "I grew up in France… I speak fluent French."



3.2. Long Short-Term Memory (LSTM)

  • Consists of a memory cell and a set of gating units
    • Memory cell is the context that carries over
    • Forget gate controls the erase operation
    • Input gate controls the write operation
    • Output gate controls the read operation



  • Connect LSTM cells in a recurrent manner
  • Train parameters in LSTM cells
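The gate roles listed above can be made concrete with a single step of a standard LSTM cell written in NumPy. This is a sketch under assumptions: the stacking order of the weight blocks, the sizes, and the function name `lstm_step` are chosen here and do not reflect how any particular library stores its weights.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell.

    W: (4*n, d) input weights, U: (4*n, n) recurrent weights, b: (4*n,) biases,
    stacked in the order [forget, input, candidate, output] (an ordering chosen here).
    """
    n = h_prev.shape[0]
    a = W @ x + U @ h_prev + b
    f = sigmoid(a[0*n:1*n])        # forget gate: how much of c_prev to keep (erase)
    i = sigmoid(a[1*n:2*n])        # input gate: how much new content to write
    g = np.tanh(a[2*n:3*n])        # candidate content
    o = sigmoid(a[3*n:4*n])        # output gate: how much of the cell to read out
    c = f * c_prev + i * g         # memory cell update
    h = o * np.tanh(c)             # hidden state
    return h, c

rng = np.random.default_rng(0)
d, n = 4, 3                        # hypothetical input and hidden sizes
W = 0.1 * rng.standard_normal((4 * n, d))
U = 0.1 * rng.standard_normal((4 * n, n))
b = np.zeros(4 * n)

h, c = np.zeros(n), np.zeros(n)
for x in rng.standard_normal((6, d)):   # a length-6 input sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h, c)
```

Note that the cell state `c` is updated only by elementwise scaling and addition, which is the linear memory path that eases long-term dependencies.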

LSTM for Classification



LSTM for Prediction



3.3. LSTM with TensorFlow

  • An example of predicting the next segment of a time signal

  • Regression problem

  • The dataset is the time signal stored in rnn_time_signal.pkl

  • Time series data and RNN



Import Library

In [ ]:
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
from six.moves import cPickle

Load Time Signal Data

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
# Load the pickled time signal; the context manager closes the file after reading
with open('/content/drive/MyDrive/DL/DL_data/rnn_time_signal.pkl', 'rb') as f:
    data = cPickle.load(f)

plt.figure(figsize = (8, 4))
plt.title('Time signal for RNN')
plt.plot(data[0:2000])
plt.xlim(0,2000)
plt.show()

LSTM Model Training



In [ ]:
n_step = 25
n_input = 100

# LSTM shape
n_lstm1 = 100
n_lstm2 = 100

# fully connected
n_hidden = 100
n_output = 100
In [ ]:
lstm_network = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (n_step, n_input)),
    tf.keras.layers.LSTM(n_lstm1, return_sequences = True),
    tf.keras.layers.LSTM(n_lstm2),
    tf.keras.layers.Dense(n_hidden),
    tf.keras.layers.Dense(n_output),
])

lstm_network.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 lstm (LSTM)                 (None, 25, 100)           80400     
                                                                 
 lstm_1 (LSTM)               (None, 100)               80400     
                                                                 
 dense (Dense)               (None, 100)               10100     
                                                                 
 dense_1 (Dense)             (None, 100)               10100     
                                                                 
=================================================================
Total params: 181,000
Trainable params: 181,000
Non-trainable params: 0
_________________________________________________________________
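The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four weight blocks (three gates plus the candidate), each with input weights, recurrent weights, and a bias; the helper names are hypothetical.

```python
def lstm_param_count(n_input, n_units):
    # 4 blocks, each with n_input input weights, n_units recurrent weights, and 1 bias per unit
    return 4 * (n_input + n_units + 1) * n_units

def dense_param_count(n_input, n_units):
    # one weight per input plus one bias, per unit
    return (n_input + 1) * n_units

total = (lstm_param_count(100, 100) + lstm_param_count(100, 100)
         + dense_param_count(100, 100) + dense_param_count(100, 100))
print(lstm_param_count(100, 100), dense_param_count(100, 100), total)
```

The values agree with the 80,400 per LSTM layer, 10,100 per dense layer, and 181,000 total reported in the summary.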
In [ ]:
lstm_network.compile(optimizer = 'adam',
                     loss = 'mean_squared_error',
                     metrics = ['mse'])
In [ ]:
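def dataset(data, n_samples, n_step = n_step, dim_input = n_input, dim_output = n_output, stride = 5):
    # Each sample uses n_step*dim_input consecutive points as input (reshaped to n_step rows)
    # and the following dim_output points as the label; samples start every `stride` points.
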
    train_x_list = []
    train_y_list = []
    for i in range(n_samples):
        train_x = data[i*stride:i*stride + n_step*dim_input]
        train_x = train_x.reshape(n_step, dim_input)
        train_x_list.append(train_x)

        train_y = data[i*stride + n_step*dim_input:i*stride + n_step*dim_input + dim_output]
        train_y_list.append(train_y)

    train_data = np.array(train_x_list)
    train_label = np.array(train_y_list)

    test_data = data[10000:10000 + n_step*dim_input]
    test_data = test_data.reshape(1, n_step, dim_input)

    return train_data, train_label, test_data
In [ ]:
train_data, train_label, test_data = dataset(data, 5000)
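To see what `dataset` produces, its slicing can be traced on a toy array. The sizes below are deliberately small and are not the values used in this notebook.

```python
import numpy as np

n_step, dim_input, dim_output, stride = 3, 4, 4, 5   # toy sizes for illustration
data = np.arange(100.0)

i = 1                                                # second sample
start = i * stride
x = data[start:start + n_step * dim_input].reshape(n_step, dim_input)
y = data[start + n_step * dim_input:start + n_step * dim_input + dim_output]

print(x)   # n_step rows of dim_input consecutive points
print(y)   # the dim_output points that immediately follow the input window
```

Because consecutive samples start only `stride` points apart, neighboring training samples overlap heavily.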
In [ ]:
lstm_network.fit(train_data, train_label, epochs = 3)
Epoch 1/3
157/157 [==============================] - 40s 144ms/step - loss: 0.0332 - mse: 0.0332
Epoch 2/3
157/157 [==============================] - 9s 57ms/step - loss: 0.0069 - mse: 0.0069
Epoch 3/3
157/157 [==============================] - 14s 87ms/step - loss: 0.0052 - mse: 0.0052
Out[ ]:
<keras.callbacks.History at 0x7afa141bb7f0>

Testing or Evaluating

  • Predict future time signal
In [ ]:
test_pred = lstm_network.predict(test_data).ravel()
test_label = data[10000:10000 + n_step*n_input + n_input]

plt.figure(figsize = (8, 4))
plt.plot(np.arange(0, n_step*n_input + n_input), test_label, 'b', label = 'Ground truth')
plt.plot(np.arange(n_step*n_input, n_step*n_input + n_input), test_pred, 'r', label = 'Prediction')
plt.vlines(n_step*n_input, -1, 1, colors = 'r', linestyles = 'dashed')
plt.legend(fontsize = 15, loc = 'upper left')
plt.xlim(0, len(test_label))
plt.show()
1/1 [==============================] - 2s 2s/step
In [ ]:
gen_signal = []

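for i in range(n_step):
    # predict the next n_input points from the current window
    test_pred = lstm_network.predict(test_data, verbose = 0)
    gen_signal.append(test_pred.ravel())
    test_pred = test_pred[:, np.newaxis, :]

    # slide the window: drop the oldest step and append the prediction
    test_data = test_data[:, 1:, :]
    test_data = np.concatenate([test_data, test_pred], axis = 1)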

gen_signal = np.concatenate(gen_signal)

test_label = data[10000:10000 + n_step*n_input + n_step*n_input]

plt.figure(figsize = (8, 4))
plt.plot(np.arange(0, n_step*n_input + n_step*n_input), test_label, 'b', label = 'Ground truth')
plt.plot(np.arange(n_step*n_input,  n_step*n_input + n_step*n_input), gen_signal, 'r', label = 'Prediction')
plt.vlines(n_step*n_input, -1, 1, colors = 'r', linestyles = 'dashed')
plt.legend(fontsize=15, loc = 'upper left')
plt.xlim(0, len(test_label))
plt.show()