1RL - Planning

Last time we talked about MDPs. Along with an MDP come some value functions you can calculate,
and we saw that when we "know" the MDP (i.e. we know the tuple $<S, A, P, R, \gamma>$) we can calculate the value functions, and hence recursively calculate the optimal value functions, and hence the optimal policy.

Planning, unlike reinforcement learning, is the act of solving (finding the optimal value functions and policy of) a known MDP.

There are two types of planning problems, both assuming the MDP is fully known.
Prediction: given a policy, figure out its value function (evaluate the policy).
Control: find the optimal policy.

Prediction in planning by DP

$$v_\pi(s)=\sum_{a\in A}\pi(a|s)\left(R_s^a+\gamma\sum_{s'\in S}P_{ss'}^a\,v_\pi(s')\right)$$

Above we have the Bellman expectation equation for the state value function of an MDP.

How do we use dynamic programming to evaluate $v_\pi$? Well, first we can ease ourselves in by realizing that, for a given policy, from an MDP we can extract out an MRP:

Notice that $R_s^a$ is a matrix that gives the expected immediate reward for taking action $a$ in state $s$. Hence $\sum_{a\in A}\pi(a|s)R_s^a=R_s^\pi$. That is, for a fixed policy, if we know the expected reward for every state-action pair $(s,a)$, then under that policy the expected reward for a state $s$ is $\sum_{a\in A}\pi(a|s)R_s^a$, where $\pi(a|s)$ is the probability of picking action $a$ in state $s$.

And moreover, if the environment transition dynamics from state $s$ to state $s'$ under action $a$ are given by the distribution tensor $P_{ss'}^a$, then to get the expected dynamics from state $s$ to $s'$ we have to average out the contribution of the actions from our policy. Hence $P_{ss'}^\pi=\sum_{a\in A}\pi(a|s)P_{ss'}^a$.

The tuple $<S, P^\pi, R^\pi, \gamma>$ is then a Markov reward process under the policy $\pi$.
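
As a concrete sketch of this averaging (NumPy, with array shapes that are my own convention and not from the notes: `R[s, a]` for rewards, `P[s, a, s_next]` for transitions, `pi[s, a]` for $\pi(a|s)$):

```python
import numpy as np

def mdp_to_mrp(R, P, pi):
    """Average the MDP reward and transition tensors over a stochastic policy."""
    R_pi = np.einsum("sa,sa->s", pi, R)    # R^pi_s = sum_a pi(a|s) R_s^a
    P_pi = np.einsum("sa,sax->sx", pi, P)  # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
    return R_pi, P_pi
```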

Now then, distributing the sum over actions,

$$v_\pi(s)=\sum_{a\in A}\pi(a|s)R_s^a+\gamma\sum_{s'\in S}\sum_{a\in A}\pi(a|s)P_{ss'}^a\,v_\pi(s')$$

Equivalently,

$$v_\pi(s)=R_s^\pi+\gamma\sum_{s'\in S}P_{ss'}^\pi\,v_\pi(s')$$

Now for the algorithm that calculates $v_\pi$ using dynamic programming:

```
Algorithm: policy evaluation, given the MRP of an MDP under that policy
Input: MRP of an MDP under a policy π: <S, P^π, R^π, γ>
Output: the state value function of that policy, v_π

initialize v_π ← [0, 0, …, 0], a vector of zeros, one per state
While True do:
    v'_π ← [0, 0, …, 0]
    For i in Range[0, |S|-1] do:
        v'_π[i] ← v'_π[i] + R^π[i]
        For j in Range[0, |S|-1] do:
            v'_π[i] ← v'_π[i] + γ P^π[i, j] v_π[j]
    v_π ← v'_π
Return v_π
```
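
Here is a minimal Python sketch of the same idea, vectorized with NumPy instead of the explicit double loop; the tolerance-based stopping rule is my own addition, since the pseudocode above just loops:

```python
import numpy as np

def evaluate_policy(R_pi, P_pi, gamma, tol=1e-8):
    """Iterative policy evaluation on the MRP <S, P^pi, R^pi, gamma>.

    R_pi: (n,) expected reward per state, P_pi: (n, n) transition matrix.
    """
    v = np.zeros_like(R_pi, dtype=float)
    while True:
        v_new = R_pi + gamma * P_pi @ v   # one Bellman expectation backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```

The matrix-vector product `P_pi @ v` does the same $O(n^2)$ work per sweep as the double loop.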

So what's the time and space complexity of this algorithm like?
Well, we need to store $S$, which takes $O(n)$ space ($n$ is the number of states); $R^\pi$ also takes $O(n)$ space, but $P^\pi$ is a matrix that takes $O(n^2)$ space. We also need to store $v_\pi$ and $v'_\pi$. So in total $O(4n+n^2)=O(n^2)$ space.
And as you can see there are two nested loops, each going over the states; the computation in the inner loop is $O(1)$, and the computation of the outer loop before the inner loop starts is also $O(1)$. Therefore overall $O(n^2)$ time complexity PER ITERATION.

But how many iterations do we need to get a good estimate of the actual value function of the policy, $v_\pi$? There is a theorem that this iteration does converge. Suppose we need $K$ iterations to converge; the algorithm then takes $O(Kn^2)$ time and still occupies $O(n^2)$ space.

Control in planning by DP:

Sure, we can iterate on the value function of a fixed policy to get closer to the true value function under that policy, but CAN we iterate towards a better policy? The ANSWER IS YES!

Policy iteration by DP:

The rough idea is as follows: start with an initial DETERMINISTIC policy $\pi(s)=a$.
Then extract the corresponding MRP from this policy, evaluate the policy using the above algorithm, and get its value function: call it $v_\pi$.
Recall $v_\pi(s)=\sum_{a\in A}\pi(a|s)\,q_\pi(s,a)$ and
$q_\pi(s,a)=R_s^a+\gamma\sum_{s'\in S}P_{ss'}^a\,v_\pi(s')$.

Using the second equation, get $q_\pi$ from $v_\pi$. Then
the new (deterministic) policy is one which acts greedily w.r.t. $v_\pi$, and hence $q_\pi$. I.e. at every state, the new policy picks the action with the highest action value for that state:

$$\pi'(s)=\arg\max_{a\in A} q_\pi(s,a)$$

Now, for deterministic policies, $\pi(a|s)=0$ unless $a=\pi(s)$, in which case $\pi(a|s)=1$.

Consider any state $s$. With our old deterministic policy, suppose $\pi(a|s)=1$ if and only if $a=a_\pi$.
In our new policy, $\pi'(a|s)=1$ if and only if $a=a^*$, where $a^*=\arg\max_{a\in A}q_\pi(s,a)$. Therefore $q_\pi(s,a_\pi)\le q_\pi(s,a^*)$.

And notice that $v_\pi(s)=\sum_{a\neq a_\pi}\pi(a|s)\,q_\pi(s,a)+\pi(a_\pi|s)\,q_\pi(s,a_\pi)=0+q_\pi(s,a_\pi)$.
And with a similar decomposition,
$q_\pi(s,\pi'(s))=\sum_{a\neq a^*}\pi'(a|s)\,q_\pi(s,a)+\pi'(a^*|s)\,q_\pi(s,a^*)=0+q_\pi(s,a^*)$.

And therefore $v_\pi(s)\le q_\pi(s,\pi'(s))$ for any state $s$, and hence (by the policy improvement argument, which unrolls this one-step inequality) $v_{\pi'}(s)\ge v_\pi(s)$ for any state $s$. Therefore, in the partial ordering defined on policies earlier, $\pi'\ge\pi$.
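
For the record (this unrolling is the standard policy improvement theorem argument, not something spelled out above), the one-step inequality chains all the way to $v_{\pi'}$:

$$
\begin{aligned}
v_\pi(s) &\le q_\pi(s,\pi'(s)) = \mathbb{E}_{\pi'}\!\left[R_{t+1}+\gamma v_\pi(S_{t+1}) \mid S_t=s\right] \\
&\le \mathbb{E}_{\pi'}\!\left[R_{t+1}+\gamma q_\pi(S_{t+1},\pi'(S_{t+1})) \mid S_t=s\right] \\
&\le \mathbb{E}_{\pi'}\!\left[R_{t+1}+\gamma R_{t+2}+\gamma^2 q_\pi(S_{t+2},\pi'(S_{t+2})) \mid S_t=s\right] \\
&\le \cdots \le \mathbb{E}_{\pi'}\!\left[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots \mid S_t=s\right] = v_{\pi'}(s).
\end{aligned}
$$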

This is cool! Take any deterministic policy, calculate its state value function using the above algorithm, then get its action value function, and then make a new deterministic policy that is greedy on the action values.

We can iterate this process over and over! And when improvements stop, we have the optimal state value function, the optimal action value function, and the optimal policy!

Let us write some functions to do our bidding, and this time we will keep the loops implicit.

First we write a function to get an MRP from an MDP given a policy.

```
Function: Extract MRP from MDP given det policy
Input: MDP <S, A, R, P, γ>, det policy π
Output: MRP under det policy <S, R^π, P^π, γ>
Signature: extract(S, A, R, P, γ, π)

R^π_s ← Σ_{a∈A} π(a|s) R^a_s        ∀ s ∈ S
P^π_{ss'} ← Σ_{a∈A} π(a|s) P^a_{ss'}    ∀ s, s' ∈ S
Return <S, R^π, P^π, γ>
```
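
A quick Python sketch of `extract` (my own conventions: the deterministic policy is an integer array `pi[s] = a`, so the sums over $a$ collapse to indexing):

```python
import numpy as np

def extract(S, A, R, P, gamma, pi):
    """R[s, a] and P[s, a, s'] are the MDP tensors; pi[s] is the chosen action."""
    idx = np.arange(len(S))
    R_pi = R[idx, pi]       # R^pi_s = R_s^{pi(s)}
    P_pi = P[idx, pi, :]    # P^pi_{ss'} = P^{pi(s)}_{ss'}
    return S, R_pi, P_pi, gamma
```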

Next, we write a function that does one step of the value-function iteration algorithm, given the policy and its corresponding induced MRP, followed by a function that does one step of policy improvement (acting greedily).

```
Function: Single step of value function approximation
Input: MRP <S, R^π, P^π, γ>, det policy π, old value function v_π
Output: new value function (closer to the true value function) v'_π
Signature: calculate-v(v, π, S, R^π, P^π, γ)

Return v'_π(s) ← R^π_s + γ Σ_{s'∈S} P^π_{ss'} v_π(s')    ∀ s ∈ S
```

```
Function: Single step of policy iteration
Input: current policy π, MDP, approx/exact value function of the current policy v_π
Output: new policy π', greedy w.r.t. that value function, with v_{π'}(s) ≥ v_π(s) ∀ s ∈ S
Signature: p-iterate(v_π, π, P, R, γ)

q_π(s, a) ← R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_π(s')    ∀ s, a
Return π'(s) ← argmax_{a∈A} q_π(s, a)    ∀ s
```
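
And hedged Python sketches of both functions, in the same array conventions as before (the exact argument lists are my simplification of the signatures above):

```python
import numpy as np

def calculate_v(v, R_pi, P_pi, gamma):
    # One backup of the Bellman expectation equation for the induced MRP.
    return R_pi + gamma * P_pi @ v

def p_iterate(v, R, P, gamma):
    # q_pi(s, a) = R_s^a + gamma * sum_s' P^a_{ss'} v_pi(s') for all s, a,
    # then act greedily: pi'(s) = argmax_a q_pi(s, a).
    q = R + gamma * np.einsum("sax,x->sa", P, v)
    return np.argmax(q, axis=1)
```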

Nice!

So the policy iteration algorithm just alternates these steps (I won't write down the whole thing).
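
Still, here is a rough sketch of how the pieces above compose into the full loop (stopping when the greedy policy stops changing is my own choice of termination test, and so is the fixed number of evaluation sweeps):

```python
import numpy as np

def policy_iteration(S, A, R, P, gamma, n_eval_sweeps=100):
    # Uses the extract / calculate_v / p_iterate sketches from above.
    pi = np.zeros(len(S), dtype=int)            # arbitrary initial deterministic policy
    while True:
        _, R_pi, P_pi, _ = extract(S, A, R, P, gamma, pi)
        v = np.zeros(len(S))
        for _ in range(n_eval_sweeps):          # (approximate) policy evaluation
            v = calculate_v(v, R_pi, P_pi, gamma)
        pi_new = p_iterate(v, R, P, gamma)      # greedy improvement
        if np.array_equal(pi_new, pi):          # improvement stopped => done
            return pi, v
        pi = pi_new
```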

Here is a picture to summarize it all
!Support/Pasted image 20250328055823.png

But now we think: how the fuck do we find the optimal value function without iterating on the policy? Cause after all, if we have the optimal state value function, we can quickly get the optimal action value function, and hence the optimal policy is just argmaxing on the action value function.

So in the next idea (value iteration) we will not evaluate a policy and get the value function that way; instead we will try to use the BELLMAN OPTIMALITY recursion to directly get the optimal value function.

Value iteration with DP:

First of all, we need a theorem.
!Support/Pasted image 20250328060316.png

Now let us find some intuition for this. Imagine there is a ONE PLAYER GAME where you make some decisions. Also imagine that the game ends in exactly 3 different states, $s_a, s_b, s_c$, where $s_a$ has a reward of $-1$, $s_b$ a reward of $0$, and $s_c$ of $+1$.
Then obviously $v_*(s_a)=-1$, $v_*(s_b)=0$, $v_*(s_c)=+1$. Given this initial information, we could use the Bellman optimality equation to back up this optimal value function to other states (imagine backing up these values through a tree where the three final states are the leaves). Below is the Bellman optimality equation.

$$v_*(s)=\max_{a\in A}\left(R_s^a+\gamma\sum_{s'\in S}P_{ss'}^a\,v_*(s')\right)$$

In truth, we just start with some initial $v$ and iterate the above equation over and over, and somehow this gives us the optimal value function. What is going on?

Think about this: no matter what $v$ we start out with, after the first iteration of the above equation AT LEAST the leaf states have their exact values assigned to them, and from there, maxing over actions can propagate those values back up the state-action tree. Or, if you like, starting out with $v=<0,\dots,0>$ we assign the correct (optimal) value to the leaf states, which is just the reward of the leaf state (where you can't really take any action). But we are not going to write proofs; this works because engineering :)
Think about this: no matter what v we start out with, at the first iteration of the above equation, AT LEAST THE leaf states have exact rewards assigned to them, and from there, maxing of rewards can propagate back up the state-action tree. Or if you like starting out with v=<0,...0> we assign the correct (optimal v) for the leaf states, which is just the reward for the leaf state (where you can't really take any action). But we are not going to write proofs, this works because engineering :)