1RL - Planning

Last time we talked about MDPs. Along with an MDP come some value functions you can calculate,
and we saw that when we "know" the MDP (i.e. we know the tuple $<S, A, P, R, \gamma>$) we can calculate the value functions, and hence recursively calculate the optimal value functions, and hence the optimal policy.

Planning, unlike reinforcement learning, is the act of solving (finding the optimal value functions and policy of) a known MDP.

There are two types of planning problems, both assuming the MDP is fully known.
Prediction: given a policy, figure out its value function (evaluate the policy).
Control: find the optimal policy.

Prediction in planning by DP

$$v_\pi(s)=\sum_{a\in A}\pi(a|s)\left(R_s^a+\gamma\sum_{s'\in S}P_{ss'}^a\,v_\pi(s')\right)$$

Above we have the Bellman expectation equation for the state value function of an MDP.

How do we use dynamic programming to evaluate $v_\pi$? Well, first we can ease ourselves in by realizing that, for a given policy, from an MDP we can extract out an MRP:

Notice that $R_s^a$ is a matrix that gives the expected immediate reward for taking action $a$ in state $s$. Hence $\sum_{a\in A}\pi(a|s)R_s^a=R_s^\pi$. That is, for a fixed policy, if we know the expected reward for every state-action pair $(s,a)$, then under that policy the expected reward for a state $s$ is $\sum_{a\in A}\pi(a|s)R_s^a$, where $\pi(a|s)$ is the probability of picking action $a$ in state $s$.

And moreover, if the environment transition dynamics from state $s$ to state $s'$ under action $a$ are given by the distribution tensor $P_{ss'}^a$, then to get the expected dynamics from state $s$ to $s'$ we have to average out the contribution of the actions from our policy. Hence $P_{ss'}^\pi=\sum_{a\in A}\pi(a|s)P_{ss'}^a$.

The tuple $<S, P^\pi, R^\pi, \gamma>$ is then a Markov reward process under the policy $\pi$.
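
As a concrete sketch of this averaging (NumPy, with array shapes that are my own convention and not from the notes: `R[s, a]` for rewards, `P[s, a, s_next]` for transitions, `pi[s, a]` for $\pi(a|s)$):

```python
import numpy as np

def mdp_to_mrp(R, P, pi):
    """Average the MDP reward and transition tensors over a stochastic policy."""
    R_pi = np.einsum("sa,sa->s", pi, R)    # R^pi_s = sum_a pi(a|s) R_s^a
    P_pi = np.einsum("sa,sax->sx", pi, P)  # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
    return R_pi, P_pi
```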

Now then, distributing the sum over actions,

$$v_\pi(s)=\sum_{a\in A}\pi(a|s)R_s^a+\gamma\sum_{s'\in S}\sum_{a\in A}\pi(a|s)P_{ss'}^a\,v_\pi(s')$$

Equivalently,

$$v_\pi(s)=R_s^\pi+\gamma\sum_{s'\in S}P_{ss'}^\pi\,v_\pi(s')$$

Now for the algorithm that calculates $v_\pi$ using dynamic programming:

```
Algorithm: policy evaluation, given the MRP of an MDP under that policy
Input: MRP of an MDP under a policy π: <S, P^π, R^π, γ>
Output: the state value function of that policy, v_π

initialize v_π ← [0, 0, …, 0], a vector of zeros, one per state
While True do:
    v'_π ← [0, 0, …, 0]
    For i in Range[0, |S|-1] do:
        v'_π[i] ← v'_π[i] + R^π[i]
        For j in Range[0, |S|-1] do:
            v'_π[i] ← v'_π[i] + γ P^π[i, j] v_π[j]
    v_π ← v'_π
Return v_π
```
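
Here is a minimal Python sketch of the same idea, vectorized with NumPy instead of the explicit double loop; the tolerance-based stopping rule is my own addition, since the pseudocode above just loops:

```python
import numpy as np

def evaluate_policy(R_pi, P_pi, gamma, tol=1e-8):
    """Iterative policy evaluation on the MRP <S, P^pi, R^pi, gamma>.

    R_pi: (n,) expected reward per state, P_pi: (n, n) transition matrix.
    """
    v = np.zeros_like(R_pi, dtype=float)
    while True:
        v_new = R_pi + gamma * P_pi @ v   # one Bellman expectation backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```

The matrix-vector product `P_pi @ v` does the same $O(n^2)$ work per sweep as the double loop.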

So what's the time and space complexity of this algorithm like?
Well, we need to store $S$, which takes $O(n)$ space ($n$ is the number of states); $R^\pi$ also takes $O(n)$ space, but $P^\pi$ is a matrix that takes $O(n^2)$ space. We also need to store $v_\pi$ and $v'_\pi$. So in total $O(4n+n^2)=O(n^2)$ space.
And as you can see there are two nested loops, each going over the states; the computation in the inner loop is $O(1)$, and the computation of the outer loop before the inner loop starts is also $O(1)$. Therefore overall $O(n^2)$ time complexity PER ITERATION.

But how many iterations do we need to get a good estimate of the actual value function of the policy, $v_\pi$? There is a theorem that this iteration does converge. Suppose we need $K$ iterations to converge; the algorithm then takes $O(Kn^2)$ time and still occupies $O(n^2)$ space.

Control in planning by DP:

Sure, we can iterate on the value function of a fixed policy to get closer to the true value function under that policy, but CAN we iterate towards a better policy? The ANSWER IS YES!

Policy iteration by DP:

The rough idea is as follows: start with an initial DETERMINISTIC policy $\pi(s)=a$.
Then extract the corresponding MRP from this policy, evaluate the policy using the above algorithm, and get its value function: call it $v_\pi$.
Recall $v_\pi(s)=\sum_{a\in A}\pi(a|s)\,q_\pi(s,a)$ and
$q_\pi(s,a)=R_s^a+\gamma\sum_{s'\in S}P_{ss'}^a\,v_\pi(s')$.

Using the second equation, get $q_\pi$ from $v_\pi$. Then
the new (deterministic) policy is one which acts greedily w.r.t. $v_\pi$, and hence $q_\pi$. I.e. at every state, the new policy picks the action with the highest action value for that state:

$$\pi'(s)=\arg\max_{a\in A} q_\pi(s,a)$$

Now, for deterministic policies, $\pi(a|s)=0$ unless $a=\pi(s)$, in which case $\pi(a|s)=1$.

Consider any state $s$. With our old deterministic policy, suppose $\pi(a|s)=1$ if and only if $a=a_\pi$.
In our new policy, $\pi'(a|s)=1$ if and only if $a=a^*$, where $a^*=\arg\max_{a\in A}q_\pi(s,a)$. Therefore $q_\pi(s,a_\pi)\le q_\pi(s,a^*)$.

And notice that $v_\pi(s)=\sum_{a\neq a_\pi}\pi(a|s)\,q_\pi(s,a)+\pi(a_\pi|s)\,q_\pi(s,a_\pi)=0+q_\pi(s,a_\pi)$.
And with a similar decomposition,
$q_\pi(s,\pi'(s))=\sum_{a\neq a^*}\pi'(a|s)\,q_\pi(s,a)+\pi'(a^*|s)\,q_\pi(s,a^*)=0+q_\pi(s,a^*)$.

And therefore $v_\pi(s)\le q_\pi(s,\pi'(s))$ for any state $s$, and hence (by the policy improvement argument, which unrolls this one-step inequality) $v_{\pi'}(s)\ge v_\pi(s)$ for any state $s$. Therefore, in the partial ordering defined on policies earlier, $\pi'\ge\pi$.
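
For the record (this unrolling is the standard policy improvement theorem argument, not something spelled out above), the one-step inequality chains all the way to $v_{\pi'}$:

$$
\begin{aligned}
v_\pi(s) &\le q_\pi(s,\pi'(s)) = \mathbb{E}_{\pi'}\!\left[R_{t+1}+\gamma v_\pi(S_{t+1}) \mid S_t=s\right] \\
&\le \mathbb{E}_{\pi'}\!\left[R_{t+1}+\gamma q_\pi(S_{t+1},\pi'(S_{t+1})) \mid S_t=s\right] \\
&\le \mathbb{E}_{\pi'}\!\left[R_{t+1}+\gamma R_{t+2}+\gamma^2 q_\pi(S_{t+2},\pi'(S_{t+2})) \mid S_t=s\right] \\
&\le \cdots \le \mathbb{E}_{\pi'}\!\left[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots \mid S_t=s\right] = v_{\pi'}(s).
\end{aligned}
$$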

This is cool! Take any deterministic policy, calculate its state value function using the above algorithm, then get its action value function, and then make a new deterministic policy that is greedy on the action values.

We can iterate this process over and over! And when improvements stop, we have the optimal state value function, the optimal action value function, and the optimal policy!

Let us write some functions to do our bidding, and this time we will keep the loops implicit.

First we write a function to get an MRP from an MDP given a policy.

```
Function: Extract MRP from MDP given det policy
Input: MDP <S, A, R, P, γ>, det policy π
Output: MRP under det policy <S, R^π, P^π, γ>
Signature: extract(S, A, R, P, γ, π)

R^π_s ← Σ_{a∈A} π(a|s) R^a_s        ∀ s ∈ S
P^π_{ss'} ← Σ_{a∈A} π(a|s) P^a_{ss'}    ∀ s, s' ∈ S
Return <S, R^π, P^π, γ>
```
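
A quick Python sketch of `extract` (my own conventions: the deterministic policy is an integer array `pi[s] = a`, so the sums over $a$ collapse to indexing):

```python
import numpy as np

def extract(S, A, R, P, gamma, pi):
    """R[s, a] and P[s, a, s'] are the MDP tensors; pi[s] is the chosen action."""
    idx = np.arange(len(S))
    R_pi = R[idx, pi]       # R^pi_s = R_s^{pi(s)}
    P_pi = P[idx, pi, :]    # P^pi_{ss'} = P^{pi(s)}_{ss'}
    return S, R_pi, P_pi, gamma
```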

Next, we write a function that does one step of the value-function iteration algorithm, given the policy and its corresponding induced MRP, followed by a function that does one step of policy improvement (acting greedily).

```
Function: Single step of value function approximation
Input: MRP <S, R^π, P^π, γ>, det policy π, old value function v_π
Output: new value function (closer to the true value function) v'_π
Signature: calculate-v(v, π, S, R^π, P^π, γ)

Return v'_π(s) ← R^π_s + γ Σ_{s'∈S} P^π_{ss'} v_π(s')    ∀ s ∈ S
```

```
Function: Single step of policy iteration
Input: current policy π, MDP, approx/exact value function of the current policy v_π
Output: new policy π', greedy w.r.t. that value function, with v_{π'}(s) ≥ v_π(s) ∀ s ∈ S
Signature: p-iterate(v_π, π, P, R, γ)

q_π(s, a) ← R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_π(s')    ∀ s, a
Return π'(s) ← argmax_{a∈A} q_π(s, a)    ∀ s
```
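
And hedged Python sketches of both functions, in the same array conventions as before (the exact argument lists are my simplification of the signatures above):

```python
import numpy as np

def calculate_v(v, R_pi, P_pi, gamma):
    # One backup of the Bellman expectation equation for the induced MRP.
    return R_pi + gamma * P_pi @ v

def p_iterate(v, R, P, gamma):
    # q_pi(s, a) = R_s^a + gamma * sum_s' P^a_{ss'} v_pi(s') for all s, a,
    # then act greedily: pi'(s) = argmax_a q_pi(s, a).
    q = R + gamma * np.einsum("sax,x->sa", P, v)
    return np.argmax(q, axis=1)
```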

Nice!

So the policy iteration algorithm just alternates these steps (I won't write down the whole thing).
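
Still, here is a rough sketch of how the pieces above compose into the full loop (stopping when the greedy policy stops changing is my own choice of termination test, and so is the fixed number of evaluation sweeps):

```python
import numpy as np

def policy_iteration(S, A, R, P, gamma, n_eval_sweeps=100):
    # Uses the extract / calculate_v / p_iterate sketches from above.
    pi = np.zeros(len(S), dtype=int)            # arbitrary initial deterministic policy
    while True:
        _, R_pi, P_pi, _ = extract(S, A, R, P, gamma, pi)
        v = np.zeros(len(S))
        for _ in range(n_eval_sweeps):          # (approximate) policy evaluation
            v = calculate_v(v, R_pi, P_pi, gamma)
        pi_new = p_iterate(v, R, P, gamma)      # greedy improvement
        if np.array_equal(pi_new, pi):          # improvement stopped => done
            return pi, v
        pi = pi_new
```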

Here is a picture to summarize it all
!Support/Pasted image 20250328055823.png

But now we think: how the fuck do we find the optimal value function without iterating on the policy? Cause after all, if we have the optimal state value function, we can quickly get the optimal action value function, and hence the optimal policy is just argmaxing on the action value function.

So in the next idea (value iteration) we will not evaluate a policy and get the value function that way; instead we will try to use the BELLMAN OPTIMALITY recursion to directly get the optimal value function.

Value iteration with DP:

First of all, we need a theorem.
!Support/Pasted image 20250328060316.png

Now let us find some intuition for this. Imagine there is a ONE PLAYER GAME where you make some decisions. Also imagine that the game ends in exactly 3 different states, $s_a, s_b, s_c$, where $s_a$ has a reward of $-1$, $s_b$ a reward of $0$, and $s_c$ of $+1$.
Then obviously $v_*(s_a)=-1$, $v_*(s_b)=0$, $v_*(s_c)=+1$. Given this initial information, we could use the Bellman optimality equation to back up this optimal value function to other states (imagine backing up these values through a tree where the three final states are the leaves). Below is the Bellman optimality equation.

$$v_*(s)=\max_{a\in A}\left(R_s^a+\gamma\sum_{s'\in S}P_{ss'}^a\,v_*(s')\right)$$

In truth, we just start with some initial $v$ and iterate the above equation over and over, and somehow this gives us the optimal value function. What is going on?

Think about this: no matter what $v$ we start out with, after the first iteration of the above equation AT LEAST the leaf states have their exact values assigned to them, and from there, maxing over actions can propagate those values back up the state-action tree. Or, if you like, starting out with $v=<0,\dots,0>$ we assign the correct (optimal) value to the leaf states, which is just the reward of the leaf state (where you can't really take any action). But we are not going to write proofs; this works because engineering :)
Think about this: no matter what v we start out with, at the first iteration of the above equation, AT LEAST THE leaf states have exact rewards assigned to them, and from there, maxing of rewards can propagate back up the state-action tree. Or if you like starting out with v=<0,...0> we assign the correct (optimal v) for the leaf states, which is just the reward for the leaf state (where you can't really take any action). But we are not going to write proofs, this works because engineering :)