2RL - model-free control

We are back to reinforcement learning problems, not planning anymore, which means we are no longer given a model of the MDP. The first method is Monte-Carlo estimation.

Monte-Carlo estimation:

Let's do Monte-Carlo policy evaluation.

First-visit Monte-Carlo policy evaluation

For an episode, let t be the FIRST time we visit a state s.
Then we increment Nvisit(s) (only on that first visit), finish the episode, and record the return Gt (the discounted sum of rewards from time t to the end of the episode); then we update vestim(s) += Gt.

We then do this over many episodes, and divide vestim(s) by Nvisit(s) to get the average return from s. For example, if s is first visited in three episodes with returns 4, 6 and 8, the estimate is (4+6+8)/3 = 6.

Finally, we just do this for all states, which gives us an estimate of the whole value function. A sketch in code follows below.
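
To make the procedure concrete, here is a minimal Python sketch of first-visit Monte-Carlo policy evaluation as described above. The environment interface (env.reset() returning a state, env.step(action) returning (next_state, reward, done)) and the policy(state) function are illustrative assumptions, not anything specified in these notes.

```python
from collections import defaultdict

def first_visit_mc(env, policy, num_episodes=10_000, gamma=1.0):
    # Nvisit(s): number of episodes in which s was first-visited.
    n_visit = defaultdict(int)
    # vestim(s): running sum of first-visit returns Gt.
    v_estim = defaultdict(float)

    for _ in range(num_episodes):
        # Roll out one full episode under the policy.
        episode = []  # (state, reward) pairs, in order
        state, done = env.reset(), False  # assumed interface
        while not done:
            next_state, reward, done = env.step(policy(state))
            episode.append((state, reward))
            state = next_state

        # Remember the FIRST time each state was visited.
        first_time = {}
        for t, (s, _) in enumerate(episode):
            first_time.setdefault(s, t)

        # Walk backwards, accumulating Gt = r + gamma * G_{t+1}.
        g = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            g = r + gamma * g
            if first_time[s] == t:  # count only the first visit
                n_visit[s] += 1
                v_estim[s] += g

    # Average the accumulated returns: vestim(s) / Nvisit(s).
    return {s: v_estim[s] / n_visit[s] for s in v_estim}
```

The backwards pass is just a convenience: it computes every Gt in a single sweep instead of re-summing the rewards separately for each visit.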