Markov Decision Process
Source
Table of Contents
Definition: A Markov Decision Process is a tuple $\langle S,\color{red}{A},P,R,\gamma \rangle$
Example: Startup MDP
Given an MDP $\mathcal{M} = \langle S,A,P,R,\gamma \rangle$ and a policy $\pi$
The state sequence $s_1, s_2, \cdots $ is a Markov process $\langle S, P^{\pi} \rangle$
The state and reward sequence is a Markov reward process $\langle S,P^{\pi},R^{\pi},\gamma \rangle$
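Here $P^{\pi}$ (and, in general, $R^{\pi}$) is obtained by averaging over the actions selected by the policy; in this note the reward depends only on the state, so $R^{\pi} = R$:
$$
P^{\pi}\left(s' \mid s\right) = \sum_{a \in A} \pi\left(a \mid s\right) P\left(s' \mid s,a\right)
$$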
Questions on MDP policy
How many possible policies are there in our example? (see the quick count after this list)
Which of the above two policies is best?
How do you compute the optimal policy?
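(A quick count for the first question: a deterministic policy picks one of the $2$ actions in each of the $4$ states of the startup example, so there are $2^{4} = 16$ deterministic policies.)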
State-value function
Action-value function
$$
\begin{align*}
v_{\pi}(s) &= \sum_{a \in A}\pi(a \mid s) \underline{q_{\pi}(s,a)} \\ \\
&= \sum_{a \in A}\pi(a \mid s) \left( R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a \right) v_{\pi}\left(s'\right) \right)
\end{align*}
$$
$$
\begin{align*}
q_{\pi}(s,a) &= R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a \right) \underline{v_{\pi} \left(s'\right)}\\ \\
&= R(s) + \gamma \sum_{s' \in S} P\left(s' \mid s,a \right) \left( \sum_{a' \in A} \pi \left(a' \mid s' \right) q_{\pi} \left(s', a' \right) \right)
\end{align*}
$$
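For a fixed policy the Bellman expectation equation is linear in $v_{\pi}$, so stacking the states into vectors gives a closed-form solution, which the code below computes directly (and also approximates by iterating the update):
$$
v_{\pi} = R + \gamma P^{\pi} v_{\pi} \quad \Longrightarrow \quad v_{\pi} = \left(I - \gamma P^{\pi}\right)^{-1} R
$$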
Example
# [PU PF RU RF] = [0 1 2 3]
import numpy as np

# evaluate the first of the two policies: P is its induced transition matrix P^pi
P = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0.5, 0, 0.5, 0],
     [0, 1, 0, 0]]
R = [0, 0, 10, 10]

P = np.asmatrix(P)
R = np.asmatrix(R)
R = R.T
gamma = 0.9

# closed-form policy evaluation: v_pi = (I - gamma*P)^{-1} R
v = (np.eye(4) - gamma*P).I*R
print(v)

# iterative policy evaluation: repeatedly apply the Bellman expectation update
v = np.zeros([4, 1])
for i in range(100):
    v = R + gamma*P*v

print(v)
# [PU PF RU RF] = [0 1 2 3]
import numpy as np

# evaluate the second policy (always advertising): P is its induced transition matrix P^pi
P = [[0.5, 0.5, 0, 0],
     [0, 1, 0, 0],
     [0.5, 0.5, 0, 0],
     [0, 1, 0, 0]]
R = [0, 0, 10, 10]

P = np.asmatrix(P)
R = np.asmatrix(R)
R = R.T
gamma = 0.9

# closed-form policy evaluation: v_pi = (I - gamma*P)^{-1} R
v = (np.eye(4) - gamma*P).I*R
print(v)

# iterative policy evaluation
v = np.zeros([4, 1])
for i in range(100):
    v = R + gamma*P*v

print(v)
For any MDP there exists an optimal policy $\pi_{*}$ that is at least as good as every other policy in every state, and its optimal value function is written $v_{*}(s)$
We can directly define the optimal value function using the Bellman optimality equation,
and the optimal policy is simply the action that attains this maximum
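In the notation above, the Bellman optimality equation and the corresponding greedy optimal policy are:
$$
\begin{align*}
v_{*}(s) &= R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P\left(s' \mid s,a\right) v_{*}\left(s'\right) \\ \\
\pi_{*}(s) &= \arg\max_{a \in A} \sum_{s' \in S} P\left(s' \mid s,a\right) v_{*}\left(s'\right)
\end{align*}
$$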
Algorithm
$\quad$ 1. Initialize an estimate of the value function arbitrarily (or with zeros)
$\quad$ 2. Repeat the update $\hat{v}(s) \leftarrow R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P\left(s' \mid s,a\right) \hat{v}\left(s'\right)$ for every $s \in S$
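A minimal sketch of these two steps (a hypothetical value_iteration helper; it assumes the transition model is stored as a NumPy array T with T[a, s, s'] = P(s' | s, a) and the rewards as a vector R[s], which is not the dictionary representation used later in this notebook):
import numpy as np

def value_iteration(T, R, gamma, iters=100):
    # T: assumed transition array with T[a, s, s'] = P(s' | s, a)
    # R: reward vector indexed by state
    T = np.asarray(T, dtype=float)
    R = np.asarray(R, dtype=float)
    v = np.zeros(len(R))                    # 1. initialize the value estimate with zeros
    for _ in range(iters):                  # 2. repeat the Bellman optimality update
        q = R[:, None] + gamma*(T @ v).T    # q[s, a] = R(s) + gamma * sum_s' P(s'|s,a) v(s')
        v = q.max(axis=1)                   # v(s) <- max_a q(s, a)
    return v, q.argmax(axis=1)              # value estimate and a greedy policy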
Note
Algorithm
$\quad$ 1. Initialize a policy $\hat{\pi}$ (e.g., randomly)
$\quad$ 2. Compute the value function of that policy, $v_{\hat{\pi}}$ (e.g., by solving the linear system, or by iterating the Bellman expectation equation)
$\quad$ 3. Update $\hat{\pi}$ to be the greedy policy with respect to $v_{\hat{\pi}}$
$\quad$ 4. If the policy $\hat{\pi}$ changed in the last iteration, return to step 2
Note
Given a policy $\pi$, first evaluate it to obtain $v_{\pi}$
Improve the policy by acting greedily with respect to $v_{\pi}$
Policy iteration requires fewer iterations than value iteration, but each iteration requires solving a linear system instead of just applying the Bellman operator
In practice, policy iteration is often faster, especially when the transition probabilities are structured (e.g., sparse) so that the linear system can be solved efficiently
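A minimal sketch of this in the same style (a hypothetical policy_iteration helper using the assumed T[a, s, s'] array from the sketch above, with an exact linear solve for the evaluation step; the notebook's own implementation below instead evaluates the policy iteratively):
import numpy as np

def policy_iteration(T, R, gamma):
    # T: assumed transition array with T[a, s, s'] = P(s' | s, a); R: reward vector R[s]
    T = np.asarray(T, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(R)
    pi = np.zeros(n, dtype=int)                           # 1. start from an arbitrary policy
    while True:
        P_pi = T[pi, np.arange(n), :]                     # transition matrix induced by pi
        v = np.linalg.solve(np.eye(n) - gamma*P_pi, R)    # 2. evaluate: v_pi = (I - gamma*P_pi)^{-1} R
        q = R[:, None] + gamma*(T @ v).T                  # action values under v_pi
        pi_new = q.argmax(axis=1)                         # 3. greedy improvement
        if np.array_equal(pi_new, pi):                    # 4. stop once the policy is stable
            return pi, v
        pi = pi_new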
%%html
<center><iframe src="https://www.youtube.com/embed/XLEpZzdLxI4?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/4R6kDYDcq8U?rel=0"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
Define MDP as a two-level dictionary
P is a two-level dictionary where the first key is the state and the second key is the action.
States [0, 1, 2, 3] correspond to [PU, PF, RU, RF]
Actions [0, 1] correspond to [Saving money, Advertising]
P[state][action] is a list of tuples (probability, nextstate)
For example,
P[0][0] = [(1, 0)]
P[3][0] = [(0.5, 2), (0.5, 3)]
P = {
    0: {0: [(1, 0)],
        1: [(0.5, 0), (0.5, 1)]},
    1: {0: [(0.5, 0), (0.5, 3)],
        1: [(1, 1)]},
    2: {0: [(0.5, 0), (0.5, 2)],
        1: [(0.5, 0), (0.5, 1)]},
    3: {0: [(0.5, 2), (0.5, 3)],
        1: [(1, 1)]},
}
R = [0, 0, 10, 10]
gamma = 0.9
States = [0, 1, 2, 3]
Actions = [0, 1]
v = [0]*4
P[2][0]
# compute the above summation, sum_{s'} P(s' | s, a) v(s'), for state 2 and action 0
v = [0, 0, 10, 10]
temp = 0
for trans in P[2][0]:
    temp = temp + trans[0]*v[trans[1]]
print(temp)
# the same summation, shortened with a generator expression
sum(trans[0]*v[trans[1]] for trans in P[2][0])
# optimal value function via value iteration
v = [0]*4
for i in range(100):
    for s in States:
        q_0 = sum(trans[0]*v[trans[1]] for trans in P[s][0])
        q_1 = sum(trans[0]*v[trans[1]] for trans in P[s][1])
        v[s] = R[s] + gamma*max(q_0, q_1)

print(v)
# optimal policy: act greedily once v has been computed
optPolicy = [0]*4
for s in States:
    q_0 = sum(trans[0]*v[trans[1]] for trans in P[s][0])
    q_1 = sum(trans[0]*v[trans[1]] for trans in P[s][1])
    optPolicy[s] = np.argmax([q_0, q_1])

print(optPolicy)
# shortened version of value iteration and greedy policy extraction
v = [0]*4
for i in range(100):
    for s in States:
        v[s] = R[s] + gamma*max([sum([trans[0]*v[trans[1]] for trans in P[s][a]]) for a in Actions])

print(v)

optPolicy = [0]*4
for s in States:
    optPolicy[s] = np.argmax([sum([trans[0]*v[trans[1]] for trans in P[s][a]]) for a in Actions])

print(optPolicy)
# random initial (deterministic) policy: one action per state
policy = np.random.randint(0,2,4)
policy
# policy evaluation: iteratively apply the Bellman expectation update for a fixed policy
def cal_value(policy):
    v = [0]*4
    for i in range(100):
        for s in States:
            q = sum(trans[0]*v[trans[1]] for trans in P[s][policy[s]])
            v[s] = R[s] + gamma*q
    return v

V_pi = cal_value(policy)
print(V_pi)
# policy iteration: alternate greedy policy improvement and policy evaluation
for _ in range(10):
    for s in range(4):
        values_temp = []
        for action in [0,1]:
            temp = 0
            for trans in P[s][action]:
                temp = temp + trans[0]*V_pi[trans[1]]
            values_temp.append(temp)
        policy[s] = np.argmax(values_temp)   # greedy improvement in state s
    V_pi = cal_value(policy)                 # re-evaluate the improved policy

optPolicy = policy
print(optPolicy)
P = {
0: {0: [(0.9,0), (0.1,1), (0,4)],
1: [(0.8,1), (0.1,4), (0.1,0)],
2: [(0.8,4), (0.1,1), (0.1,0)],
3: [(0.9,0), (0.1,4)]},
1: {0: [(0.8,1), (0.1,2), (0.1,0)],
1: [(0.8,2), (0.2,1)],
2: [(0.8,1), (0.1,0), (0.1,2)],
3: [(0.8,0), (0.2,1)]},
2: {0: [(0.8,2), (0.1,3), (0.1,1)],
1: [(0.8,3), (0.1,5), (0.1,2)],
2: [(0.8,5), (0.1,1), (0.1,3)],
3: [(0.8,1), (0.1,2), (0.1,5)]},
3: {0: [(0.9,3), (0.1,2)],
1: [(0.9,3), (0.1,6)],
2: [(0.8,6), (0.1,2), (0.1,3)],
3: [(0.8,2), (0.1,3), (0.1,6)]},
4: {0: [(0.8,0), (0.2,4)],
1: [(0.8,4), (0.1,7), (0.1,0)],
2: [(0.8,7), (0.2,4)],
3: [(0.8,4), (0.1,0), (0.1,7)]},
5: {0: [(0.8,2), (0.1,6), (0.1,5)],
1: [(0.8,6), (0.1,9), (0.1,2)],
2: [(0.8,9), (0.1,5), (0.1,6)],
3: [(0.8,5), (0.1,2), (0.1,9)]},
6: {0: [(0.8,3), (0.1,6), (0.1,5)],
1: [(0.8,6), (0.1,10), (0.1,3)],
2: [(0.8,10), (0.1,5), (0.1,6)],
3: [(0.8,5), (0.1,3), (0.1,10)]},
7: {0: [(0.8,4), (0.1,8), (0.1,7)],
1: [(0.8,8), (0.1,7), (0.1,4)],
2: [(0.9,7), (0.1,8)],
3: [(0.9,7), (0.1,4)]},
8: {0: [(0.8,8), (0.1,9), (0.1,7)],
1: [(0.8,9), (0.2,8)],
2: [(0.8,8), (0.1,7), (0.1,9)],
3: [(0.8,7), (0.2,8)]},
9: {0: [(0.8,5), (0.1,10), (0.1,8)],
1: [(0.8,9), (0.1,9), (0.1,5)],
2: [(0.8,9), (0.1,8), (0.1,10)],
3: [(0.8,8), (0.1,5), (0.1,9)]},
10: {0: [(0.8,6), (0.1,10), (0.1,9)],
1: [(0.9,10), (0.1,6)],
2: [(0.9,10), (0.1,9)],
3: [(0.8,9), (0.1,6), (0.1,10)]}
}
R = [0, 0, 0, 1, 0, 0, -100, 0, 0, 0, 0]
gamma = 0.9
States = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Actions = [0, 1, 2, 3] # [north, east, south, west]
# optimal value function via value iteration
v = [0]*11
for i in range(100):
    for s in States:
        v[s] = R[s] + gamma*max([sum([trans[0]*v[trans[1]] for trans in P[s][a]]) for a in Actions])

print(v)
# optimal policy: act greedily with respect to v
optPolicy = [0]*11
for s in States:
    optPolicy[s] = np.argmax([sum([trans[0]*v[trans[1]] for trans in P[s][a]]) for a in Actions])

print(optPolicy)
%%html
<center><iframe src="https://www.youtube.com/embed/3TXa2IQ7Vhg?rel=0"
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/FG4WrSCpENc?rel=0"
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/M3sbfB5C0W4?rel=0"
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/HaBUSk6rat4?rel=0"
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')