Model-free means we do sampling, because we cannot enumerate what the next state will be; we only know it once we run it. Recall that for a known model we have Policy & value iteration; for the model-free setting we have the “sampling versions” of them.
Note also that we will need function approximation, because we need generalization. So we first define a loss (objective) function, and then apply stochastic gradient descent to update the parameters.
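As a rough sketch of that pipeline (the linear features, learning rate, and sampled return below are illustrative assumptions, not from these notes), a single SGD step on a squared-error value objective could look like:

```python
import numpy as np

# Minimal sketch: linear value-function approximation trained by SGD.
# The per-sample objective is 0.5 * (target - v_hat(s; w))^2, where the
# target could be a sampled Monte Carlo return or a bootstrapped TD target.

def v_hat(w, features):
    """Approximate value of a state given its feature vector."""
    return np.dot(w, features)

def sgd_update(w, features, target, alpha=0.01):
    """One stochastic gradient step on the squared error."""
    error = target - v_hat(w, features)    # prediction error for this sample
    return w + alpha * error * features    # gradient step for a linear v_hat

# Illustrative usage: one sampled (state features, return) pair.
w = np.zeros(4)
phi = np.array([1.0, 0.5, 0.0, 2.0])   # hypothetical state features
G = 3.0                                # hypothetical sampled return
w = sgd_update(w, phi, G)
```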
The most basic, bias-free method is Monte Carlo. An overview of the methods can be found in Temporal difference.
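As a sketch (the episode format, the first-visit variant, and the names are assumptions, not from these notes), tabular Monte Carlo prediction can be written as:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=0.99):
    """First-visit Monte Carlo evaluation over a batch of episodes.

    Each episode is assumed to be a list of (state, reward) pairs
    collected by running the policy we want to evaluate.
    """
    V = defaultdict(float)            # value estimates per state
    returns_count = defaultdict(int)  # number of first visits per state
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return.
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = reward + gamma * G
            # First-visit check: skip if the state already appeared earlier.
            if all(s != state for s, _ in episode[:t]):
                returns_count[state] += 1
                # Incremental average of the sampled returns.
                V[state] += (G - V[state]) / returns_count[state]
    return V
```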
These are the prediction methods (they predict the value function). Now for the control part (improving the policy):
Recall that for MC evaluation we are sampling states, which means there could be states we never try: no exploration.
A common choice is ε-greedy exploration: with probability ε pick a random action, otherwise act greedily with respect to the current Q.
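As a concrete sketch (the Q-table layout and the helper name are illustrative, not from the notes), ε-greedy action selection can look like:

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """With probability epsilon explore uniformly, otherwise exploit Q.

    Q is assumed to be a mapping from (state, action) to estimated value.
    """
    if random.random() < epsilon:
        return random.randrange(n_actions)                      # explore
    return max(range(n_actions), key=lambda a: Q[(state, a)])   # exploit
```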
Well, that is very convenient, but how do we make sure it is sound? That is guaranteed by the theorem: GLIE model-free control converges to the optimal action-value function, Q(s, a) → q*(s, a).
We can also use TD learning instead of MC for policy evaluation, and it is still sound.
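For reference (this is the standard textbook form, not spelled out in these notes), the tabular TD(0) evaluation update replaces the full Monte Carlo return with a bootstrapped target:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$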
Finally, here is a very simple tabular SARSA that does both prediction and control.
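Below is a minimal sketch of what that can look like (the environment interface, with reset() returning a state, step(action) returning (next_state, reward, done), and an n_actions attribute, is an assumption, not something specified in these notes):

```python
import random
from collections import defaultdict

def sarsa(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: on-policy TD prediction plus epsilon-greedy control.

    Assumes a hypothetical env exposing reset() -> state,
    step(action) -> (next_state, reward, done), and n_actions.
    """
    Q = defaultdict(float)  # action-value table, keyed by (state, action)

    def policy(state):
        # Epsilon-greedy behaviour policy derived from the current Q.
        if random.random() < epsilon:
            return random.randrange(env.n_actions)
        return max(range(env.n_actions), key=lambda a: Q[(state, a)])

    for _ in range(n_episodes):
        state = env.reset()
        action = policy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = policy(next_state)
            # SARSA backup: bootstrap from the action the policy actually takes next.
            target = reward if done else reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```

Because the TD target uses the action the ε-greedy policy actually takes next, this is on-policy: evaluation (the TD update on Q) and control (acting ε-greedily with respect to the updated Q) happen in the same loop.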