Recurrent Neural Networks (RNN)
Table of Contents
- Sequence matters
What is a sequence?
- sentence
- medical signals
- speech waveform
- vibration measurement
Sequence Modeling
- Most real-world data is time-series
- There are important aspects to consider:
- Past events
- Relationship between events
- Causality
- Credit assignment
- Learn the structure and hierarchy of the data
- Use past and present observations to predict the future
1.2. (Deterministic) Sequences and Difference Equations
We will focus on linear difference equations (LDE), a surprisingly rich topic both theoretically and practically.
For example,
$$ y[0]=1,\quad y[1]=\frac{1}{2},\quad y[2]=\frac{1}{4},\quad \cdots $$
or by a closed-form expression,
$$y[n]=\left(\frac{1}{2}\right)^n,\quad n \geq 0 $$
or with a difference equation and an initial condition,
$$y[n]=\frac{1}{2}y[n-1],\quad y[0]=1$$
A $k$-th order homogeneous LDE has the form
$$y[n]=\alpha_1 y[n-1] + \alpha_2 y[n-2] + \cdots + \alpha_k y[n-k]$$
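As a quick check, here is a minimal Python sketch (NumPy assumed) that iterates the first-order example $y[n] = \frac{1}{2}y[n-1]$ and compares it with the closed form $\left(\frac{1}{2}\right)^n$:

```python
import numpy as np

# First-order homogeneous LDE: y[n] = 0.5 * y[n-1], with y[0] = 1
N = 10
y = np.empty(N)
y[0] = 1.0
for n in range(1, N):
    y[n] = 0.5 * y[n - 1]

# Closed-form solution: y[n] = (1/2)^n
y_closed = 0.5 ** np.arange(N)

print(np.allclose(y, y_closed))  # True
```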
1.3. (Stochastic) Time Series Analysis
1.3.1. Stationary and Non-Stationary Series
A series is stationary if there is no systematic change in mean and variance over time
- Example: radio static
A series is non-stationary if mean and variance change over time
- Example: GDP, population, weather, etc.
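One quick diagnostic is to compare mean and variance across different time windows. A minimal sketch, assuming NumPy, with white noise as the stationary example and a random walk standing in for a drifting series:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stationary example: white noise ("radio static")
noise = rng.normal(0, 1, 1000)

# Non-stationary example: a random walk, whose mean and variance drift over time
walk = np.cumsum(rng.normal(0, 1, 1000))

# Compare summary statistics over the first and second halves of each series
for name, series in [("noise", noise), ("walk", walk)]:
    first, second = series[:500], series[500:]
    print(f"{name}: means {first.mean():.2f} / {second.mean():.2f}, "
          f"variances {first.var():.2f} / {second.var():.2f}")
```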
1.3.2. Dealing with Non-Stationarity
Linear trends
Non-linear trends
- For example, population may grow exponentially
Seasonal trends
- Some series may exhibit recurring seasonal patterns
- For example, weather patterns, employment, inflation, etc.
Combining Linear, Quadratic, and Seasonal Trends
- Some data may have a combination of trends
- One solution is to apply repeated differencing to the series, as sketched below
- For example, first remove the seasonal trend, then remove the linear trend
- Inspect the model fit by examining a Q-Q plot of the residuals
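A minimal sketch of repeated differencing, assuming NumPy and a hypothetical monthly series with a yearly (lag-12) cycle:

```python
import numpy as np

# Hypothetical monthly series: linear trend + yearly cycle + noise
rng = np.random.default_rng(0)
t = np.arange(120)
y = 0.3 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, t.size)

# First remove the seasonal trend with lag-12 differencing ...
y_deseason = y[12:] - y[:-12]
# ... then remove the remaining linear trend with first differencing
residuals = np.diff(y_deseason)

print(residuals.mean(), residuals.var())  # should resemble plain noise
```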
- Alternatively, include both trend and cyclical terms in the model
\begin{align*} Y_t &= \beta_1 + \beta_2 Y_{t-1} \\ &+ \beta_3 t + \beta_4 t^{\beta_5} \\ &+ \beta_6 \sin \frac{2\pi}{s}t + \beta_7 \cos \frac{2\pi}{s}t \\ &+ u_t \end{align*}
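Because the $t^{\beta_5}$ term makes this model nonlinear in its parameters, a plain least-squares sketch needs a simplification; here $\beta_5$ is fixed at 2 (the quadratic case from the heading above), and the series is hypothetical (NumPy assumed):

```python
import numpy as np

# Hypothetical monthly series with trend and seasonality, period s = 12
rng = np.random.default_rng(0)
s = 12
t = np.arange(1, 121)
y = 2 + 0.5 * t + 0.01 * t**2 + 3 * np.sin(2 * np.pi * t / s) + rng.normal(0, 1, t.size)

# Design matrix: intercept, Y_{t-1}, t, t^2, and the seasonal pair
X = np.column_stack([
    np.ones(t.size - 1),
    y[:-1],                          # lagged term Y_{t-1}
    t[1:],                           # linear trend
    t[1:] ** 2,                      # power trend with beta_5 fixed at 2
    np.sin(2 * np.pi * t[1:] / s),
    np.cos(2 * np.pi * t[1:] / s),
])
beta, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
print(beta)
```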
1.4. Time-Series Data
(Almost) all data coming from a manufacturing environment is time-series data:
- sensor data,
- process times,
- material measurement,
- equipment maintenance history,
- image data, etc.
Manufacturing applications typically involve one of the following:
- prediction of time-series values
- anomaly detection on time-series data
- classification of time-series values
- metrology and inspection
1.4.1. Definition of time-series
$$x: T \rightarrow \mathbb{R}^n \;\; \text{where}\;\; T=\{\cdots, t_{-2},t_{-1},t_0,t_1,t_2, \cdots \}$$
Example: material measurements with $n=3$
$$x(t) = \begin{bmatrix} \text{average thickness}(t)\\ \text{thickness variance}(t)\\ \text{resistivity}(t) \end{bmatrix} $$
1.4.2. Supervised and Unsupervised Learning for Time-series
For supervised learning, we define two time series
$$x: T \rightarrow \mathbb{R}^n \;\; \text{and} \;\; y: T \rightarrow \mathbb{R}^m$$
Supervised time-series learning:
$$ \begin{align*} \text{predict} \quad &y(t_k) \\ \text{given} \quad & x(t_k), x(t_{k-1}), \cdots \;\, \text{and} \;\, y(t_{k-1}), y(t_{k-2}), \cdots \end{align*} $$
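A minimal sketch of this setup, assuming NumPy; `make_supervised`, the window length `p`, and the toy series are illustrative choices, not from the source:

```python
import numpy as np

def make_supervised(x, y, p):
    """Build (features, target) pairs: predict y(t_k) from
    x(t_k), ..., x(t_{k-p}) and y(t_{k-1}), ..., y(t_{k-p})."""
    X, Y = [], []
    for k in range(p, len(y)):
        X.append(np.concatenate([x[k - p:k + 1], y[k - p:k]]))
        Y.append(y[k])
    return np.array(X), np.array(Y)

rng = np.random.default_rng(0)
x = rng.normal(size=100)                      # input series
y = np.convolve(x, [0.5, 0.3], mode="same")   # hypothetical target series
X, Y = make_supervised(x, y, p=3)
print(X.shape, Y.shape)                       # (97, 7) (97,)
```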
Unsupervised time-series anomaly detection
- Find a time segment that is considerably different from the rest
$$ \begin{align*} \text{find} \quad & k^* \\ \text{such that} \quad & x(t_k) |_{k=k^*}^{k^*+s} \;\; \text{is significantly different from} \;\, x(t_k) |_{k=-\infty}^{\infty} \end{align*} $$
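One simple instantiation of this search scores each length-$s$ window by how far its mean lies from the global mean, in units of the global standard deviation. A sketch assuming NumPy >= 1.20 (for `sliding_window_view`); the scoring rule is an illustrative choice, not from the source:

```python
import numpy as np

def most_anomalous_segment(x, s):
    """Return the start index k* of the length-s window whose mean deviates
    most (in global standard deviations) from the overall series mean."""
    mu, sigma = x.mean(), x.std()
    windows = np.lib.stride_tricks.sliding_window_view(x, s)
    scores = np.abs(windows.mean(axis=1) - mu) / sigma
    return int(scores.argmax())

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 500)
x[200:220] += 4.0                        # inject an anomalous segment
print(most_anomalous_segment(x, s=20))   # ~200
```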
1.5. Markov Process
1.5.1. Sequential Processes
Most classifiers ignore the sequential aspects of data
Consider a system which can occupy one of $N$ discrete states or categories $q_t \in \{S_1,S_2,\cdots,S_N\}$
We are interested in stochastic systems, in which state evolution is random
Any joint distribution can be factored into a series of conditional distributions:
$$p(q_0,q_1,\cdots,q_T ) = p(q_0) \; p(q_1 \mid q_0) \; p( q_2 \mid q_1, q_0 ) \; p( q_3 \mid q_2, q_1, q_0 ) \cdots$$
Under the Markov assumption (Section 1.5.2), this simplifies to
$$p(q_0,q_1,\cdots,q_T ) = p(q_0) \; p(q_1 \mid q_0) \; p( q_2 \mid q_1 ) \; p( q_3 \mid q_2 ) \cdots$$
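This factorization makes the joint probability a product of one-step terms. A minimal sketch with a hypothetical two-state chain; the numbers in `pi` and `P` are made up, and the matrix form of `P` is formalized in Section 1.5.3:

```python
import numpy as np

# Hypothetical two-state chain: initial distribution pi, and the
# transition probabilities p(q_{t+1} | q_t) arranged as a matrix P
pi = np.array([0.6, 0.4])
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

def joint_prob(states, pi, P):
    """p(q_0, ..., q_T) = p(q_0) * product over t of p(q_{t+1} | q_t)."""
    prob = pi[states[0]]
    for prev, nxt in zip(states[:-1], states[1:]):
        prob *= P[prev, nxt]
    return prob

print(joint_prob([0, 0, 1, 1], pi, P))   # 0.6 * 0.9 * 0.1 * 0.7 = 0.0378
```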
1.5.2. Markov Process
- (Assumption) For a Markov process, the next state depends only on the current state:
$$ p(q_{t+1} \mid q_t,\cdots,q_0) = p(q_{t+1} \mid q_t)$$
- More explicitly:
$$ p(q_{t+1} = s_j \mid q_t = s_i) = p(q_{t+1} = s_j \mid q_t = s_i,\; \text{any earlier history})$$
- Given current state, the past does not matter
- The state captures all relevant information from the history
- The state is a sufficient statistic of the future
1.5.3. State Transition Matrix
For a Markov state $s$ and successor state $s'$, the state transition probability is defined by
$$ P_{ss'} = P \left[S_{t+1} = s' \mid S_t = s \right] $$
State transition matrix $P$ defines transition probabilities from all states $s$ to all successor states $s'$; each row of $P$ sums to one.
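A minimal simulation sketch, assuming NumPy; the 3-state transition matrix is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state transition matrix; row i holds p(S_{t+1} = s' | S_t = s_i)
P = np.array([[0.8, 0.15, 0.05],
              [0.2, 0.6,  0.2 ],
              [0.1, 0.3,  0.6 ]])
assert np.allclose(P.sum(axis=1), 1.0)   # each row is a probability distribution

def simulate(P, q0, T):
    """Sample a state sequence q_0, ..., q_T from the chain."""
    states = [q0]
    for _ in range(T):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(simulate(P, q0=0, T=10))
```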