Markov Reward Process
http://iailab.kaist.ac.kr/
Industrial AI Lab at KAIST
Sources
- David Silver's RL Course at UCL
- Prof. Zico Kolter at CMU
Table of Contents
1. Markov Reward Process
Suppose that each transition in a Markov chain is associated with a reward, $r$
As the Markov chain proceeds from state to state, there is an associated sequence of rewards
Discount factor $\gamma$
Later, we will study dynamic programming and Markov decision theory $\implies$ Markov Decision Process (MDP)
1.1. Definition of MRP
Definition: A Markov Reward Process is a tuple $\langle S,P,R,\gamma \rangle$
- $S$ is a finite set of states
- $P$ is a state transition probability matrix
$$P_{ss'} = P\left[S_{t+1} = s' \mid S_t = s \right]$$
- $R$ is a reward function, $R_s = \mathbb{E} \left[ R_{t+1} \mid S_t = s \right]$
- $\gamma$ is a discount factor, $\gamma \in [0,1]$
Example: student MRP (the seven-state example from David Silver's course, with states C1, C2, C3, Pass, Pub, FB, and Sleep; its rewards and transition probabilities are defined in the code below)
1.2. Reward over Multiple Transitions (= Return $G_t$)
Definition: The return $G_t$ is the total discounted reward from time-step $t$
$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$$
Discount factor $\gamma$
- It is reasonable to maximize the sum of rewards
- It is also reasonable to prefer rewards now to rewards later
- One solution: let the value of rewards decay exponentially
- Mathematically convenient (avoids infinite returns and values)
- Humans often act as if there were a discount factor $\gamma < 1$
- $\gamma = 0$: Only care about immediate reward
- $\gamma = 1$: Future reward is as beneficial as immediate reward
import numpy as np

# states: [C1 C2 C3 Pass Pub FB Sleep] = [0 1 2 3 4 5 6]
R = [-2, -2, -2, 10, 1, -1, 0]   # reward for each state
gamma = 0.9

# return G for a given state sequence
S = [0, 1, 2, 4, 2, 4]

# discounted sum of the rewards of the first five states in the sequence
G = 0
for i in range(5):
    G = G + (gamma**i)*R[S[i]]

print(G)
# states: [C1 C2 C3 Pass Pub FB Sleep] = [0 1 2 3 4 5 6]
R = [-2, -2, -2, 10, 1, -1, 0]
gamma = 0.9

# state transition probability matrix
P = [[0,   0.5, 0,   0,   0,   0.5, 0  ],
     [0,   0,   0.8, 0,   0,   0,   0.2],
     [0,   0,   0,   0.6, 0.4, 0,   0  ],
     [0,   0,   0,   0,   0,   0,   1  ],
     [0.2, 0.4, 0.4, 0,   0,   0,   0  ],
     [0.1, 0,   0,   0,   0,   0.9, 0  ],
     [0,   0,   0,   0,   0,   0,   1  ]]

# sequence generated by the Markov chain, starting from state 0 (C1)
x = 0
S = [x]
for i in range(5):
    x = np.random.choice(len(P), 1, p=P[x])[0]
    S.append(x)

# discounted return of the sampled sequence (first five states)
G = 0
for i in range(5):
    G = G + (gamma**i)*R[S[i]]

print(S)
print(G)
1.3. Value Function
Definition: The state value function $v(s)$ of an MRP is the expected return starting from state $s$
$$ \begin{align*} v(s) & = \mathbb{E}[G_t \mid S_t = s] \\ & = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s] \end{align*} $$
- The value function $v(s)$ gives the long-term value of state $s$
- Sample returns for student MRP: starting from $S_1 = C_1$ with $\gamma = \frac{1}{2}$
$$G_1 = R_2 + \gamma R_3 + \cdots + \gamma^{T-2} R_T$$
- (Naive) Computing the value function of an MRP
- Generate a large number of episodes and compute the average return from each state (see the sketch below)
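For example, one possible episode of the student MRP is C1 → C2 → C3 → Pass → Sleep; with $\gamma = \frac{1}{2}$ its return is
$$G_1 = -2 + \frac{1}{2}(-2) + \frac{1}{4}(-2) + \frac{1}{8}(10) = -2.25$$
A minimal Monte-Carlo sketch of this naive approach, assuming the lists P, R and the value of gamma from the cells above are still in scope (the names sample_return, n_steps, n_episodes, and v_mc are illustrative, and each episode is truncated after a fixed number of steps):
# naive Monte-Carlo estimate of v(s): average many sampled returns
def sample_return(s0, n_steps=50):
    # roll out one (truncated) episode from s0 and accumulate the discounted return
    x, G = s0, 0
    for k in range(n_steps):
        G = G + (gamma**k)*R[x]
        x = np.random.choice(len(P), p=P[x])
    return G

n_episodes = 1000
v_mc = [np.mean([sample_return(s) for _ in range(n_episodes)]) for s in range(7)]
print(v_mc)
The resulting averages should be close to the values obtained from the Bellman equation in the next section.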
2. Bellman Equations for MRP
- The value function $v(S_t)$ can be decomposed into two parts:
- Immediate reward $R_{t+1}$ at state $S_t$
- Discounted value of successor state $\gamma v (S_{t+1})$
$$ \begin{align*} v(s) &= \mathbb{E} \left[G_t \mid S_t = s \right] \\ &= \mathbb{E} \left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right] \\ &= \mathbb{E} \left[R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \cdots \right) \mid S_t = s \right] \\ &= \mathbb{E} \left[R_{t+1} + \gamma G_{t+1} \mid S_t = s \right] \\ &= \mathbb{E} \left[R_{t+1} + \gamma v \left(S_{t+1} \right) \mid S_t = s \right] \end{align*} $$
- Bellman Equation for Student MRP
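For example, using the rewards and transition probabilities of the student MRP from the code above, the Bellman equation for state C3 reads
$$v(\text{C3}) = -2 + \gamma \left( 0.6 \, v(\text{Pass}) + 0.4 \, v(\text{Pub}) \right)$$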
2.1. Bellman Equation in Matrix Form
$$v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'}v\left(s'\right) \qquad \forall s$$
The Bellman equation can be expressed concisely using matrices,
$$v = R + \gamma P v$$ where $v$ is a column vector with one entry per state
$$ \begin{bmatrix} v(1)\\ \vdots \\ v(n) \end{bmatrix} = \begin{bmatrix} R_1\\ \vdots \\ R_n \end{bmatrix} + \gamma \begin{bmatrix} p_{11} & \cdots & p_{1n}\\ \vdots & \ddots & \vdots \\ p_{n1} & \cdots & p_{nn} \end{bmatrix} \begin{bmatrix} v(1)\\ \vdots \\ v(n) \end{bmatrix} $$
2.2. Solving the Bellman Equation
The Bellman equation is a linear equation
It can be solved directly:
$$ \begin{align*} v &= R + \gamma P v \\ (I-\gamma P) v & = R \\ v & = (I - \gamma P)^{-1}R \end{align*} $$
- Direct solution is only possible for small MRPs
- Computational complexity is $O(n^3)$ for $n$ states
There are many iterative methods for large MRPs:
- Dynamic programming
- Monte-Carlo simulation
- Temporal-Difference learning
Iterative algorithm for value function (Value Iteration)
Initialize $v_1(s) = 0$ for all $s$
For $k=1$ until convergence
- for all $s$ in $S$
$$v_{k+1}(s) \;\; \longleftarrow \;\; R(s) + \gamma \sum_{s' \in S} p\left(s' \mid s\right) v_k \left(s'\right) $$
# states: [C1 C2 C3 Pass Pub FB Sleep] = [0 1 2 3 4 5 6]
P = [[0,   0.5, 0,   0,   0,   0.5, 0  ],
     [0,   0,   0.8, 0,   0,   0,   0.2],
     [0,   0,   0,   0.6, 0.4, 0,   0  ],
     [0,   0,   0,   0,   0,   0,   1  ],
     [0.2, 0.4, 0.4, 0,   0,   0,   0  ],
     [0.1, 0,   0,   0,   0,   0.9, 0  ],
     [0,   0,   0,   0,   0,   0,   1  ]]
R = [-2, -2, -2, 10, 1, -1, 0]

P = np.asmatrix(P)
R = np.asmatrix(R)
R = R.T          # column vector of rewards

gamma = 0.9

# direct solution: v = (I - gamma*P)^{-1} R
v = (np.eye(7) - gamma*P).I*R
print(v)
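As a quick sanity check, a minimal sketch assuming the cell above has been run: the computed v should satisfy the Bellman equation $v = R + \gamma P v$.
# verify that v is a fixed point of the Bellman equation
print(np.allclose(v, R + gamma*P*v))   # expected: True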
# value iteration: v <- R + gamma*P*v, starting from v = 0
gamma = 0.9
v = np.zeros([7,1])

for i in range(100):
    v = R + gamma*P*v

print(v)
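As another check, a sketch assuming the cells above have been run (so that P, R, gamma, and the iterated v are in scope; the name v_direct is illustrative): the fixed point reached by the iteration should agree with the direct solution.
# compare the iterated value estimate with the closed-form solution
v_direct = (np.eye(7) - gamma*P).I*R
print(np.max(np.abs(v - v_direct)))   # should be (near) zero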
# with gamma = 1 the iteration still converges for this MRP:
# every trajectory eventually reaches the absorbing Sleep state (reward 0)
gamma = 1
v = np.zeros([7,1])

for i in range(100):
    v = R + gamma*P*v

print(v)