Markov Reward Process
Definition: A Markov Reward Process is a tuple $\langle S, P, R, \gamma \rangle$, where $S$ is a finite set of states, $P$ is the state transition probability matrix with $P_{ss'} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s\right]$, $R$ is the reward function with $R_s = \mathbb{E}\left[R_{t+1} \mid S_t = s\right]$, and $\gamma \in [0, 1]$ is a discount factor
Example: student MRP

The student MRP has seven states, Class 1 (C1), Class 2 (C2), Class 3 (C3), Pass, Pub, Facebook (FB), and Sleep (terminal), with rewards $-2, -2, -2, +10, +1, -1, 0$ respectively; the transition probabilities are given by the matrix $P$ used in the code below.
Definition: The return $G_t$ is the total discounted reward from time-step $t$

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$$
Discount factor $\gamma \in [0, 1]$: a value close to $0$ gives a "myopic" evaluation that weights immediate rewards heavily, while a value close to $1$ gives a "far-sighted" evaluation that weights future rewards almost as much as immediate ones.
import numpy as np

# state indices: [C1 C2 C3 Pass Pub FB Sleep] = [0 1 2 3 4 5 6]
R = [-2, -2, -2, 10, 1, -1, 0]
gamma = 0.9

# return of a given sample sequence of states
S = [0, 1, 2, 4, 2, 4]
G = 0
for i, s in enumerate(S):
    G += (gamma**i) * R[s]
print(G)
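To see the effect of the discount factor, the return of the same sequence can be recomputed for several values of $\gamma$ (a small sketch reusing the variables above; note how smaller $\gamma$ weights the immediate $-2$ rewards more heavily relative to the later ones):

# effect of the discount factor on the same sample sequence
for g in [0, 0.5, 0.9, 1.0]:
    G = sum((g**i) * R[s] for i, s in enumerate(S))
    print(f"gamma = {g}: G = {G:.3f}")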
# redefine the MRP: rewards, discount, and transition matrix
R = [-2, -2, -2, 10, 1, -1, 0]
gamma = 0.9
P = [[0,   0.5, 0,   0,   0,   0.5, 0  ],
     [0,   0,   0.8, 0,   0,   0,   0.2],
     [0,   0,   0,   0.6, 0.4, 0,   0  ],
     [0,   0,   0,   0,   0,   0,   1  ],
     [0.2, 0.4, 0.4, 0,   0,   0,   0  ],
     [0.1, 0,   0,   0,   0,   0.9, 0  ],
     [0,   0,   0,   0,   0,   0,   1  ]]

# sample a sequence from the Markov chain
# [C1 C2 C3 Pass Pub FB Sleep] = [0 1 2 3 4 5 6]
# starting from 0 (C1)
x = 0
S = [x]
for i in range(5):
    x = np.random.choice(len(P), p=P[x])
    S.append(x)

# return of the sampled sequence
G = 0
for i, s in enumerate(S):
    G += (gamma**i) * R[s]

print(S)
print(G)
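The value function defined next is the expected return, so it can already be estimated from this simulation by averaging sampled returns over many episodes. A minimal Monte Carlo sketch, assuming episodes are run until the absorbing Sleep state (the helper sample_return is illustrative, not part of the original code):

# Monte Carlo estimate of v(C1): average the return over many sampled episodes
def sample_return(start, P, R, gamma):
    x, G, discount = start, 0.0, 1.0
    while x != 6:                          # 6 = Sleep (absorbing state)
        G += discount * R[x]
        discount *= gamma
        x = np.random.choice(len(P), p=P[x])
    return G

returns = [sample_return(0, P, R, gamma) for _ in range(10000)]
print(np.mean(returns))                    # approaches v(C1) for gamma = 0.9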
Definition: The state value function $v(s)$ of an MRP is the expected return starting from state $s$
$$
\begin{align*}
v(s) &= \mathbb{E} \left[G_t \mid S_t = s \right] \\
&= \mathbb{E} \left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right] \\
&= \mathbb{E} \left[R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \cdots \right) \mid S_t = s \right] \\
&= \mathbb{E} \left[R_{t+1} + \gamma G_{t+1} \mid S_t = s \right] \\
&= \mathbb{E} \left[R_{t+1} + \gamma v \left(S_{t+1} \right) \mid S_t = s \right]
\end{align*}
$$
$$v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'} \, v\left(s'\right) \qquad \forall s$$
The Bellman equation can be expressed concisely using matrices, where $v$ is a column vector with one entry per state:

$$v = R + \gamma P v \quad \Longrightarrow \quad v = \left(I - \gamma P\right)^{-1} R$$
Iterative algorithm for the value function (Value Iteration)

Initialize $v_1(s) = 0$ for all $s$

For $k = 1$ until convergence, apply the Bellman update $v_{k+1} = R + \gamma P v_k$
# direct (closed-form) solution: v = (I - gamma*P)^{-1} R
# [C1 C2 C3 Pass Pub FB Sleep] = [0 1 2 3 4 5 6]
P = np.array([[0,   0.5, 0,   0,   0,   0.5, 0  ],
              [0,   0,   0.8, 0,   0,   0,   0.2],
              [0,   0,   0,   0.6, 0.4, 0,   0  ],
              [0,   0,   0,   0,   0,   0,   1  ],
              [0.2, 0.4, 0.4, 0,   0,   0,   0  ],
              [0.1, 0,   0,   0,   0,   0.9, 0  ],
              [0,   0,   0,   0,   0,   0,   1  ]])
R = np.array([-2, -2, -2, 10, 1, -1, 0]).reshape(-1, 1)  # column vector
gamma = 0.9

# solve (I - gamma*P) v = R instead of forming the inverse explicitly
v = np.linalg.solve(np.eye(7) - gamma * P, R)
print(v)
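As a quick sanity check, the solved $v$ should satisfy the Bellman equation $v = R + \gamma P v$ up to numerical precision:

# the direct solution satisfies the Bellman equation
print(np.allclose(v, R + gamma * (P @ v)))  # expect True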
# iterative solution, gamma = 0.9
gamma = 0.9
v = np.zeros([7, 1])
for i in range(100):
    v = R + gamma * (P @ v)
print(v)
# gamma = 1: (I - P) is singular, so the direct inverse does not exist,
# but the iteration still converges because every episode ends in Sleep
gamma = 1
v = np.zeros([7, 1])
for i in range(100):
    v = R + gamma * (P @ v)
print(v)
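There is no direct solution for $\gamma = 1$: every row of $P$ sums to one, so $P$ has eigenvalue $1$ and $I - P$ is singular. A quick numerical check:

# I - P is singular when gamma = 1, so (I - gamma*P) cannot be inverted
print(np.linalg.det(np.eye(7) - P))         # approximately 0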
%%html
<center><iframe src="https://www.youtube.com/embed/y_5qhQbAZdM?rel=0"
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/U1bn0CUcp2w?rel=0"
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>