2RL - model-free control
We are back to reinforcement learning problems rather than planning, which means we are not given a model of the MDP. The first method is Monte-Carlo estimation:
- works only for episodic MDPs (every episode must terminate)
- we simply sample many episodes by following the policy, and the value of a state is the mean return observed from that state across these episodes.
- there is no bootstrapping: we finish complete episodes, and only once we have sampled enough of them do we take the average discounted return from each state as its state-value estimate (a small sketch of computing these returns follows below).
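As a minimal sketch of the return computation, assuming an episode is represented simply as a list of rewards (the function name discounted_returns is hypothetical), the discounted return G_t for every time-step of a finished episode can be computed with one backward pass:

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t for every time-step of one complete episode."""
    returns = []
    g = 0.0
    # Walk backwards: G_t = r_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# Example: a 3-step episode with rewards 1, 0, 2 and gamma = 0.9
print(discounted_returns([1, 0, 2]))  # [2.62, 1.8, 2.0]
```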
Let's do Monte-Carlo policy evaluation.
First-visit Monte-Carlo policy evaluation:
For an episode, at the first time-step t that state s is visited:
- we increment a visit counter, N(s) ← N(s) + 1
- and we increment the total return, S(s) ← S(s) + G_t
We then do this over many episodes, and divide the accumulated return by the visit count to get the estimate V(s) = S(s) / N(s); by the law of large numbers, V(s) → v_π(s) as N(s) → ∞.
Finally, we can just do this for all states to estimate the whole state-value function.
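Putting the procedure together, here is a sketch of first-visit Monte-Carlo policy evaluation. It assumes a sample_episode(policy) function (a hypothetical stand-in for interacting with the environment) that returns one complete episode as a list of (state, reward) pairs; everything else follows the counting scheme above.

```python
from collections import defaultdict

def first_visit_mc_evaluation(sample_episode, policy, num_episodes=1000, gamma=0.9):
    """Estimate V(s) for a fixed policy with first-visit Monte-Carlo.

    sample_episode(policy) is assumed to return a complete episode as a
    list of (state, reward) pairs, where the reward is the one received
    after leaving that state.
    """
    N = defaultdict(int)    # visit counts N(s)
    S = defaultdict(float)  # accumulated returns S(s)

    for _ in range(num_episodes):
        episode = sample_episode(policy)
        states = [s for s, _ in episode]
        rewards = [r for _, r in episode]

        # Backward pass: G_t = r_{t+1} + gamma * G_{t+1}
        returns = [0.0] * len(rewards)
        g = 0.0
        for t in reversed(range(len(rewards))):
            g = rewards[t] + gamma * g
            returns[t] = g

        # Only the first visit to each state in this episode counts
        seen = set()
        for t, s in enumerate(states):
            if s in seen:
                continue
            seen.add(s)
            N[s] += 1
            S[s] += returns[t]

    # V(s) = S(s) / N(s), which converges to v_pi(s) as N(s) grows
    return {s: S[s] / N[s] for s in N}
```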