2RL - model-free control

We are back to reinforcement learning problems, not planning anymore, which means we are no longer given a model of the MDP. The first method is Monte-Carlo estimation.

Monte-Carlo estimation:

Let's do Monte-Carlo policy evaluation.

First-visit Monte-Carlo policy evaluation

For an episode, let t be the FIRST time we visit a state s.
Then we increment Nvisit(s) (only on that first visit), finish the episode, and record the return Gt (the discounted sum of rewards from time t to the end of the episode); then we update vestim(s) += Gt.

We then do this over many episodes, and divide vestim(s) by Nvisit(s) to get the average return from s. For example, if s is first visited in three episodes with returns 4, 6 and 8, the estimate is (4+6+8)/3 = 6.

Finally, we just do this for all states, which gives us an estimate of the whole value function. A sketch in code follows below.
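
To make the procedure concrete, here is a minimal Python sketch of first-visit Monte-Carlo policy evaluation as described above. The environment interface (env.reset() returning a state, env.step(action) returning (next_state, reward, done)) and the policy(state) function are illustrative assumptions, not anything specified in these notes.

```python
from collections import defaultdict

def first_visit_mc(env, policy, num_episodes=10_000, gamma=1.0):
    # Nvisit(s): number of episodes in which s was first-visited.
    n_visit = defaultdict(int)
    # vestim(s): running sum of first-visit returns Gt.
    v_estim = defaultdict(float)

    for _ in range(num_episodes):
        # Roll out one full episode under the policy.
        episode = []  # (state, reward) pairs, in order
        state, done = env.reset(), False  # assumed interface
        while not done:
            next_state, reward, done = env.step(policy(state))
            episode.append((state, reward))
            state = next_state

        # Remember the FIRST time each state was visited.
        first_time = {}
        for t, (s, _) in enumerate(episode):
            first_time.setdefault(s, t)

        # Walk backwards, accumulating Gt = r + gamma * G_{t+1}.
        g = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            g = r + gamma * g
            if first_time[s] == t:  # count only the first visit
                n_visit[s] += 1
                v_estim[s] += g

    # Average the accumulated returns: vestim(s) / Nvisit(s).
    return {s: v_estim[s] / n_visit[s] for s in v_estim}
```

The backwards pass is just a convenience: it computes every Gt in a single sweep instead of re-summing the rewards separately for each visit.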