{% import macros as m with context %}

.. _nstep_game:

Week 11: N-step Sarsa
=============================================================================================================

{{ m.embed_game('week11_nstep') }}

.. topic:: Controls
    :class: margin

    :kbd:`arrows`
        Move Pacman and execute a step of the Sarsa algorithm
    :kbd:`Space`
        Take a single action according to the current policy
    :kbd:`p`
        Follow the current policy
    :kbd:`r`
        Reset the game

.. rubric:: Run locally

:gitref:`../irlc/lectures/lec11/lecture_11_nstep_open.py`

What you see
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The example shows the n-step Sarsa algorithm applied to a gridworld. The gridworld has a single, terminal reward. The n-step Sarsa method will eventually learn the optimal policy.

How it works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The n-step method is a bit annoying to implement, and I would refer to the pseudo-code in (:cite:t:`sutton`, Section 7.2) for a complete picture. However, the basic idea is that we define the n-step return:

.. math::
    G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n})

What this means is that at a given time :math:`t`, we need to know what happens :math:`n` steps into the future, and knowing this, we can compute the above quantity. So concretely, once Pacman has taken at least :math:`n` steps, then in each subsequent step we can compute the corresponding :math:`n`-step return (in the past, so to speak). This allows Pacman to update :math:`Q(S_t, A_t)` -- but remember that :math:`t` lies :math:`n` steps in the past, so :math:`S_t`, :math:`A_t` is the state/action Pacman visited :math:`n` steps previously. The update is then:

.. math::
    Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (G_{t:t+n} - Q(S_t, A_t))

This corresponds to regular Sarsa when :math:`n=1`. There is one small wrinkle (and quite frankly, this is why the algorithm is annoying to implement): after the episode has terminated, we are 'missing' :math:`n` updates, since we were updating the values :math:`n` steps in the past. These have to be applied, which is why you see :math:`n` Q-values change in the last step. This is concretely implemented in the last :python:`if`-statement in the pseudo-code in (:cite:t:`sutton`, Section 7.2).
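
To make the bookkeeping concrete, here is a minimal sketch of tabular n-step Sarsa in the style of the pseudo-code in (:cite:t:`sutton`, Section 7.2). It is not the course implementation: it assumes an old-style Gym environment (``env.reset()`` returning a state and ``env.step(a)`` returning ``(s, r, done, info)``) with hashable states and a discrete action space, and the names (``nstep_sarsa``, ``eps_greedy``) are purely illustrative.

.. code-block:: python

    import numpy as np
    from collections import defaultdict

    def nstep_sarsa(env, n=4, gamma=0.95, alpha=0.1, epsilon=0.1, episodes=500):
        """Sketch of tabular n-step Sarsa; assumes an old-style Gym interface."""
        Q = defaultdict(lambda: np.zeros(env.action_space.n))

        def eps_greedy(s):
            # epsilon-greedy action selection from the current Q-estimate.
            if np.random.rand() < epsilon:
                return env.action_space.sample()
            return int(np.argmax(Q[s]))

        for _ in range(episodes):
            # Buffers so that S[k], A[k], R[k] are the state, action and reward
            # at time k (R[0] is an unused placeholder).
            S, A, R = [env.reset()], [], [0.0]
            A.append(eps_greedy(S[0]))
            T = float('inf')   # time of termination (unknown until the episode ends)
            t = 0
            while True:
                if t < T:
                    s_next, r, done, _ = env.step(A[t])
                    S.append(s_next)
                    R.append(r)
                    if done:
                        T = t + 1
                    else:
                        A.append(eps_greedy(s_next))
                tau = t - n + 1          # the time whose Q-value is updated now
                if tau >= 0:
                    # n-step return G_{tau:tau+n}, truncated at termination.
                    G = sum(gamma ** (k - tau - 1) * R[k]
                            for k in range(tau + 1, min(tau + n, T) + 1))
                    if tau + n < T:
                        G += gamma ** n * Q[S[tau + n]][A[tau + n]]
                    Q[S[tau]][A[tau]] += alpha * (G - Q[S[tau]][A[tau]])
                if tau == T - 1:         # all outstanding updates have been applied
                    break
                t += 1
        return Q

Note how ``tau`` is the time whose value is being updated: it trails ``t`` by :math:`n-1` steps, and the loop only stops once ``tau`` reaches :math:`T-1`. The final iterations, where :math:`t \geq T`, are precisely the 'missing' updates mentioned above, which is why several Q-values change when the episode ends.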