Recurrent Neural Networks (RNN)
Table of Contents
- Sequence matters
What is a sequence?
- sentence
- medical signals
- speech waveform
- vibration measurement
Sequence Modeling
- Most of the real-world data is time-series
- There are important bits to be considered
- Past events
- Relationship between events
- Causality
- Credit assignment
Learning the structure and hierarchy
Use the past and present observations to predict the future
1.2. (Deterministic) Sequences and Difference Equations
We will focus on linear difference equations (LDE), a surprisingly rich topic both theoretically and practically.
For example,
$$ y[0]=1,\quad y[1]=\frac{1}{2},\quad y[2]=\frac{1}{4},\quad \cdots $$
or by closed-form expression,
$$y[n]=\left(\frac{1}{2}\right)^n,\quad n \geq 0 $$
or with a difference equation and an initial condition,
$$y[n]=\frac{1}{2}y[n-1],\quad y[0]=1$$
High-order homogeneous LDE
$$y[n]=\alpha_1 y[n-1] + \alpha_2 y[n-2] + \cdots + \alpha_k y[n-k]$$
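A short sketch that simulates such a difference equation numerically and checks the first-order example against its closed-form solution (the helper function name and coefficients are illustrative):

import numpy as np

# simulate a high-order homogeneous LDE: y[n] = a1*y[n-1] + ... + ak*y[n-k]
def simulate_lde(alpha, y_init, N):
    k = len(alpha)
    y = list(y_init)                                   # initial conditions y[0], ..., y[k-1]
    for n in range(k, N):
        y.append(sum(alpha[j] * y[n - 1 - j] for j in range(k)))
    return np.array(y)

# first-order example: y[n] = 0.5 * y[n-1], y[0] = 1
y = simulate_lde([0.5], [1.0], 10)
print(np.allclose(y, 0.5 ** np.arange(10)))            # matches the closed form (1/2)^n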
1.3. (Stochastic) Time Series Analysis
1.3.1. Stationarity and Non-Stationary Series
A series is stationary if there is no systematic change in mean and variance over time
- Example: radio static
A series is non-stationary if mean and variance change over time
- Example: GDP, population, weather, etc.
1.3.2. Dealing with Non-Stationarity
Linear trends
Non-linear trends
- For example, population may grow exponentially
Seasonal trends
Some series may exhibit seasonal trends
For example, weather patterns, employment, inflation, etc.
Combining Linear, Quadratic, and Seasonal Trends
- Some data may have a combination of trends
One solution is to apply repeated differencing to the series
For example, first remove the seasonal trend, then remove the linear trend (a short sketch follows)
Inspect the model fit by examining a Q-Q plot of the residuals
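A minimal sketch of this two-step differencing, assuming a 1-D array x and a known seasonal period s (both synthetic here):

import numpy as np

# synthetic series: a seasonal cycle of period s riding on a random-walk trend
s = 12
t = np.arange(240)
x = np.random.randn(240).cumsum() + 3 * np.sin(2 * np.pi * t / s)

x_no_season = x[s:] - x[:-s]          # seasonal differencing with lag s
x_stationary = np.diff(x_no_season)   # first differencing removes the remaining trend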
Alternatively, include both linear and cyclical trend terms in the model
\begin{align*} Y_t &= \beta_1 + \beta_2 Y_{t-1} \\ &+ \beta_3 t + \beta_4 t^{\beta_5} \\ &+ \beta_6 \sin \frac{2\pi}{s}t + \beta_7 \cos \frac{2\pi}{s}t \\ &+ u_t \end{align*}
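A simplified variant of this model, dropping the lagged term $\beta_2 Y_{t-1}$ and the power term and fitting only an intercept, a linear trend, and sinusoidal seasonal terms by ordinary least squares, might look like the following sketch on synthetic data:

import numpy as np

# synthetic series with a linear trend plus a seasonal cycle of period s
s, T = 12, 240
t = np.arange(T)
y = 0.5 * t + 3 * np.sin(2 * np.pi * t / s) + np.random.randn(T)

# design matrix: intercept, linear trend, sin and cos seasonal terms
X = np.column_stack([np.ones(T), t,
                     np.sin(2 * np.pi * t / s),
                     np.cos(2 * np.pi * t / s)])

beta, *_ = np.linalg.lstsq(X, y, rcond = None)
residuals = y - X @ beta    # inspect these (e.g., with a Q-Q plot) to judge the fit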
1.4. Time-Series Data
(Almost) all data coming from a manufacturing environment are time-series data
- sensor data,
- process times,
- material measurement,
- equipment maintenance history,
- image data, etc.
Manufacturing applications typically involve one of the following:
- prediction of time-series values
- anomaly detection on time-series data
- classification of time-series values
- metrology and inspection
1.4.1. Definition of time-series
$$x: T \rightarrow \mathbb{R}^n \;\; \text{where}\;\; T=\{\cdots, t_{-2},t_{-1},t_0,t_1,t_2, \cdots \}$$
Example: material measurements with $n=3$
$$x(t) = \begin{bmatrix} \text{average thickness}(t)\\ \text{thickness variance}(t)\\ \text{resistivity}(t) \end{bmatrix} $$
1.4.2. Supervised and Unsupervised Learning for Time-series
For supervised learning, we define two time series
$$x: T \rightarrow \mathbb{R}^n \;\; \text{and} \;\; y: T \rightarrow \mathbb{R}^m$$
Supervised time-series learning:
$$ \begin{align*} \text{predict} \quad &y(t_k) \\ \text{given} \quad & x(t_k), x(t_{k-1}), \cdots \;\, \text{and} \;\, y(t_{k-1}), y(t_{k-2}), \cdots \end{align*} $$
Unsupervised time-series anomaly detection
- Find a time segment that is considerably different from the rest
$$ \begin{align*} \text{find} \quad & k^* \\ \text{such that} \quad & x(t_k) |_{k=k^*}^{k^*+s} \;\; \text{is significantly different from} \;\, x(t_k) |_{k=-\infty}^{\infty} \end{align*} $$
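One naive way to phrase this search (a sketch only; the window length and the mean-shift scoring rule are illustrative choices, not the only option) is to slide a window over the series and flag the segment whose mean deviates most from the global statistics:

import numpy as np

def most_anomalous_segment(x, s):
    """Return the start index of the length-s window whose mean is furthest
    from the overall mean, measured in units of the overall std."""
    mu, sigma = x.mean(), x.std()
    scores = [abs(x[k:k + s].mean() - mu) / sigma for k in range(len(x) - s)]
    return int(np.argmax(scores))

x = np.random.randn(1000)
x[400:430] += 5.0                        # inject an anomalous segment
print(most_anomalous_segment(x, 30))     # close to 400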
1.5. Markov Process
1.5.1. Sequential Processes
Most classifiers ignore the sequential aspects of data
Consider a system which can occupy one of $N$ discrete states or categories $q_t \in \{S_1,S_2,\cdots,S_N\}$
We are interested in stochastic systems, in which state evolution is random
Any joint distribution can be factored into a series of conditional distributions
$$p(q_0,q_1,\cdots,q_T ) = p(q_0) \; p(q_1 \mid q_0) \; p( q_2 \mid q_1 q_0 ) \; p( q_3 \mid q_2 q_1 q_0 ) \cdots$$
Under the first-order Markov assumption (introduced in the next subsection), this factorization simplifies to
$$p(q_0,q_1,\cdots,q_T ) = p(q_0) \; p(q_1 \mid q_0) \; p( q_2 \mid q_1 ) \; p( q_3 \mid q_2 ) \cdots$$
1.5.2. Markov Process
- (Assumption) for a Markov process, the next state depends only on the current state:
$$ p(q_{t+1} \mid q_t,\cdots,q_0) = p(q_{t+1} \mid q_t)$$
- More precisely,
$$ p(q_{t+1} = s_j \mid q_t = s_i) = p(q_{t+1} = s_j \mid q_t = s_i,\; \text{any earlier history})$$
- Given current state, the past does not matter
- The state captures all relevant information from the history
- The state is a sufficient statistic of the future
1.5.3. State Transition Matrix
For a Markov state $s$ and successor state $s'$, the state transition probability is defined by
$$ P_{ss'} = P \left[S_{t+1} = s' \mid S_t = s \right] $$
State transition matrix $P$ defines transition probabilities from all states $s$ to all successor states $s'$.
Example: Markov chain (MC) episodes
- Sample episodes starting from $S_1$
import numpy as np

P = [[0, 0, 1],
     [1/2, 1/2, 0],
     [1/3, 2/3, 0]]

print(P[1][:])

a = np.random.choice(3, 1, p = P[1][:])
print(a)
# sequential processes
# sequence generated by Markov chain
# S1 = 0, S2 = 1, S3 = 2
# starting from 0

x = 0
S = []
S.append(x)

for i in range(50):
    x = np.random.choice(3, 1, p = P[x][:])[0]
    S.append(x)

print(S)
1.6. Hidden Markov Models
Discrete state-space model
- Used in speech recognition
- State representation is simple
- Hard to scale up the training
Assumption
- We can observe something that is affected by the true state
- A natural way of thinking
Limited sensors (incomplete state information)
- But still partially related
Noisy sensors
- Unreliable
True state (or hidden variable) follows Markov chain
Observation emitted from state
- $Y_t$ is noisily determined depending on the current state $X_t$
Forward: sequence of observations can be generated
Question: state estimation
$$P(X_T = s_i \mid Y_1 Y_2 \cdots Y_T)$$
- HMM can do this, but with many difficulties
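For a small discrete HMM, the filtering distribution above can be computed exactly with the forward recursion. A minimal sketch (the transition matrix A, emission matrix B, prior pi, and observation sequence below are all arbitrary illustrations):

import numpy as np

A = np.array([[0.9, 0.1],      # transition probabilities P(x_t | x_{t-1})
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],      # emission probabilities P(y_t | x_t)
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])      # prior over the initial state
y = [0, 1, 1, 0, 1]            # observed sequence

alpha = pi * B[:, y[0]]
alpha /= alpha.sum()
for obs in y[1:]:
    alpha = (A.T @ alpha) * B[:, obs]   # predict, then weight by the observation likelihood
    alpha /= alpha.sum()                # normalize to get P(x_t | y_1, ..., y_t)

print(alpha)   # filtering distribution over the hidden state at the final time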
1.7. Kalman Filter
- Linear dynamical system of motion
$$ \begin{align*} x_{t+1} &= A x_t + B u_t \\ z_t &= Cx_t \end{align*} $$
What are $A$, $B$, and $C$?
Continuous state-space model
- For filtering and control applications
- Linear-Gaussian state space model
- Widely used in many applications:
- GPS, weather systems, etc.
Weakness
- Linear state space model assumed
- Difficult to apply to highly non-linear domains
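For reference, a minimal sketch of the standard Kalman predict/update recursion for the linear model above, assuming additive Gaussian process and measurement noise (the matrices A, C and the noise covariances Q, R below are illustrative, not from any particular application):

import numpy as np

A = np.array([[1.0, 1.0],    # constant-velocity dynamics: position, velocity
              [0.0, 1.0]])
C = np.array([[1.0, 0.0]])   # we only observe the position
Q = 0.01 * np.eye(2)         # process noise covariance (assumed)
R = np.array([[0.5]])        # measurement noise covariance (assumed)

x_hat = np.zeros(2)          # state estimate
P = np.eye(2)                # estimate covariance

def kalman_step(x_hat, P, z):
    # predict
    x_pred = A @ x_hat
    P_pred = A @ P @ A.T + Q
    # update
    K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)   # Kalman gain
    x_new = x_pred + K @ (z - C @ x_pred)
    P_new = (np.eye(2) - K @ C) @ P_pred
    return x_new, P_new

for z in [0.9, 2.1, 2.8, 4.2]:                # noisy position measurements
    x_hat, P = kalman_step(x_hat, P, np.array([z]))
print(x_hat)                                   # estimated position and velocity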
2. Recurrent Neural Network (RNN)
- RNNs are a family of neural networks for processing sequential data
2.1. Feedforward Network and Sequential Data
A feedforward network needs separate parameters for each value of the time index
It cannot share statistical strength across different time indices
2.2. Representation Shortcut
- Input at each time is a vector
- Each layer has many neurons
- Output layer too may have many neurons
- But we will represent everything with simple boxes
- Each box actually represents an entire layer with many units
2.3. An Alternate Model for Infinite Response Systems
- The state-space model
$$ \begin{align*} h_t &= f(x_t, h_{t-1})\\ y_t &= g(h_t) \end{align*} $$
This is a recurrent neural network
State summarizes information about the entire past
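A minimal NumPy sketch of this state-space view, taking $f$ to be $\tanh$ of an affine map and $g$ to be a linear readout (dimensions and weights below are arbitrary):

import numpy as np

n_input, n_hidden, n_output = 3, 5, 2
rng = np.random.default_rng(0)

# randomly initialized parameters, shared across all time steps
W_xh = rng.normal(size = (n_hidden, n_input))
W_hh = rng.normal(size = (n_hidden, n_hidden))
W_hy = rng.normal(size = (n_output, n_hidden))
b_h = np.zeros(n_hidden)

def rnn_forward(xs):
    h = np.zeros(n_hidden)           # initial state
    ys = []
    for x in xs:                     # one step per element of the sequence
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)   # h_t = f(x_t, h_{t-1})
        ys.append(W_hy @ h)                       # y_t = g(h_t)
    return np.array(ys)

xs = rng.normal(size = (10, n_input))   # a length-10 input sequence
print(rnn_forward(xs).shape)            # (10, 2)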
Single Hidden Layer RNN (Simplest State-Space Model)
- Multiple Recurrent Layer RNN
- Recurrent Neural Network
- Simplified models are often drawn
- The loops imply recurrence
3. LSTM Networks
3.1. Long-Term Dependencies
- Gradients propagated over many stages tend to either vanish or explode
- Difficulty with long-term dependencies arises from the exponentially smaller weights given to long-term interactions
- Introduce a memory state that runs through only linear operators
- Use gating units to control the updates of the state
Example: "I grew up in Franceā¦ I speak fluent French."
3.2. Long Short-Term Memory (LSTM)
- Consists of a memory cell and a set of gating units
- Memory cell is the context that carries over
- Forget gate controls erase operation
- Input gate controls write operation
- Output gate controls the read operation (a single-step sketch of these gates follows this list)
- Connect LSTM cells in a recurrent manner
- Train parameters in LSTM cells
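A single LSTM time step written out in NumPy to make the three gates explicit (a sketch only; the weights below are random, biases are omitted, and the names are illustrative):

import numpy as np

n_input, n_hidden = 4, 8
rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one weight matrix per gate, each acting on the concatenation [h_{t-1}, x_t]
W_f, W_i, W_o, W_c = (rng.normal(size = (n_hidden, n_hidden + n_input)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z)              # forget gate: what to erase from the cell
    i = sigmoid(W_i @ z)              # input gate: what to write to the cell
    o = sigmoid(W_o @ z)              # output gate: what to read out
    c_tilde = np.tanh(W_c @ z)        # candidate cell content
    c = f * c_prev + i * c_tilde      # memory cell carries over through linear operations
    h = o * np.tanh(c)                # hidden state exposed to the next layer
    return h, c

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
x = rng.normal(size = n_input)
h, c = lstm_step(x, h, c)
print(h.shape, c.shape)   # (8,) (8,)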
3.2.1. LSTM for Classification
3.2.2. LSTM for Prediction
3.3. LSTM with TensorFlow
An example of predicting the next piece of a time signal
Regression problem
Acceleration signal from rotating machinery
Time-series data and RNN
Import Library
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
from six.moves import cPickle
Load Data
Import the acceleration signal of rotating machinery
from google.colab import drive
drive.mount('/content/drive')
data = cPickle.load(open('/content/drive/MyDrive/DL_Colab/DL_data/rnn_time_signal.pkl', 'rb'))
plt.figure(figsize = (6, 4))
plt.title('Time signal for RNN')
plt.plot(data[0:2000])
plt.xlim(0,2000)
plt.show()
LSTM Model Training
# input: n_step time steps, each containing n_input signal samples
n_step = 25
n_input = 100
# LSTM shape
n_lstm1 = 100
n_lstm2 = 100
# fully connected
n_hidden = 100
n_output = 100
lstm_network = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape = (n_step, n_input)),
    tf.keras.layers.LSTM(n_lstm1, return_sequences = True),
    tf.keras.layers.LSTM(n_lstm2),
    tf.keras.layers.Dense(n_hidden),
    tf.keras.layers.Dense(n_output),
])

lstm_network.summary()
lstm_network.summary()
lstm_network.compile(optimizer = 'adam',
                     loss = 'mean_squared_error',
                     metrics = ['mse'])
def dataset(data, n_samples, n_step = n_step, dim_input = n_input, dim_output = n_output, stride = 5):
    train_x_list = []
    train_y_list = []

    for i in range(n_samples):
        # input: a window of n_step*dim_input consecutive samples, reshaped to (n_step, dim_input)
        train_x = data[i*stride:i*stride + n_step*dim_input]
        train_x = train_x.reshape(n_step, dim_input)
        train_x_list.append(train_x)

        # target: the next dim_output samples immediately after the input window
        train_y = data[i*stride + n_step*dim_input:i*stride + n_step*dim_input + dim_output]
        train_y_list.append(train_y)

    train_data = np.array(train_x_list)
    train_label = np.array(train_y_list)

    # one held-out window for testing
    test_data = data[10000:10000 + n_step*dim_input]
    test_data = test_data.reshape(1, n_step, dim_input)

    return train_data, train_label, test_data
train_data, train_label, test_data = dataset(data, 5000)
lstm_network.fit(train_data, train_label, epochs = 3)
Testing or Evaluating
- Predict future time signal
test_pred = lstm_network.predict(test_data).ravel()
test_label = data[10000:10000 + n_step*n_input + n_input]
plt.figure(figsize = (6, 4))
plt.plot(np.arange(0, n_step*n_input + n_input), test_label, 'b', label = 'Ground truth')
plt.plot(np.arange(n_step*n_input, n_step*n_input + n_input), test_pred, 'r', label = 'Prediction')
plt.vlines(n_step*n_input, -1, 1, colors = 'r', linestyles = 'dashed')
plt.legend(fontsize = 12, loc = 'upper left')
plt.xlim(0, len(test_label))
plt.show()
gen_signal = []

# feed each prediction back in as the newest step, dropping the oldest one
for i in range(n_step):
    test_pred = lstm_network.predict(test_data, verbose = 0)
    gen_signal.append(test_pred.ravel())

    test_pred = test_pred[:, np.newaxis, :]
    test_data = test_data[:, 1:, :]
    test_data = np.concatenate([test_data, test_pred], axis = 1)

gen_signal = np.concatenate(gen_signal)
test_label = data[10000:10000 + n_step*n_input + n_step*n_input]
plt.figure(figsize = (6, 4))
plt.plot(np.arange(0, n_step*n_input + n_step*n_input), test_label, 'b', label = 'Ground truth')
plt.plot(np.arange(n_step*n_input, n_step*n_input + n_step*n_input), gen_signal, 'r', label = 'Prediction')
plt.vlines(n_step*n_input, -1, 1, colors = 'r', linestyles = 'dashed')
plt.legend(fontsize = 12, loc = 'upper left')
plt.xlim(0, len(test_label))
plt.show()