{% import macros as m with context %}

.. _sarsalambda_game:

Week 12: Sarsa(Lambda)
=============================================================================================================

{{ m.embed_game('week12_sarsa_lambda') }}

.. topic:: Controls
    :class: margin

    :kbd:`arrows`
        Move Pacman and execute a step of the Sarsa(Lambda) algorithm
    :kbd:`Space`
        Take a single action according to the current policy
    :kbd:`p`
        Follow the current policy
    :kbd:`r`
        Reset the game

    .. rubric:: Run locally

    :gitref:`../irlc/lectures/lec12/lecture_12_sarsa_lambda_open.py`

What you see
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The example shows the Sarsa(Lambda) algorithm on a gridworld. The living reward is 0, and the agent obtains a reward of +1 at the exit square.

The light-blue numbers are the values of the eligibility trace :math:`E(s,a)`, which are used to update the :math:`Q`-values.

How it works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When the agent takes action :math:`a` in state :math:`s`, receives an immediate reward :math:`r`, moves to a new state :math:`s'`, and there takes action :math:`a'`, the temporal-difference error :math:`\delta` is computed and the eligibility trace of the visited pair :math:`(s, a)` is incremented:

.. math::
    :label: sarsalambda

    \delta & = r + \gamma Q(s', a') - Q(s, a) \\
    E(s, a) & = E(s, a) + 1

Then for *all* states and actions the following updates are performed:

.. math::

    Q(s, a) & = Q(s, a) + \alpha \delta E(s,a) \\
    E(s,a) & = \gamma \lambda E(s,a)

Actions are selected epsilon-greedy with respect to the :math:`Q`-values.

.. warning::

    Similar to :ref:`sarsa_game`, the updates to the :math:`Q`-values seem to lag a step behind where Pacman is, because we need to know the next action :math:`a'` in order to update :math:`Q(s, a)` (see above).

    For visualization purposes, the simulation shows the value of the eligibility trace just after the update :math:`Q(s, a) \leftarrow Q(s, a) + \alpha \delta E(s,a)`, since I think it is easier to understand the update when one can see the value of the eligibility trace used in the update rather than the value after it has been exponentially decayed by :math:`E(s,a) \leftarrow \gamma \lambda E(s,a)`.
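
The update rule can be sketched in a few lines of Python. This is a minimal illustration and not the code used by the simulation: the function name ``sarsa_lambda_step``, the dictionary-based tables, the three-square corridor, and the parameter values :math:`\alpha = 0.5`, :math:`\gamma = 0.9`, :math:`\lambda = 0.8` are all assumptions made for the example. Epsilon-greedy action selection is left out, so the agent simply moves right on every step.

```python
from collections import defaultdict


def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, alpha, gamma, lam):
    """Apply one Sarsa(Lambda) update in place and return the TD error.

    Q and E map (state, action) pairs to floats; missing entries count as 0.
    """
    # TD error for the observed transition (s, a, r, s', a').
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    # Increment the trace of the pair that was just visited.
    E[(s, a)] += 1.0
    # Update every pair with a non-zero trace, then decay its trace.
    # Pairs that were never visited have E = 0 and would not change anyway.
    for key in list(E):
        Q[key] += alpha * delta * E[key]
        E[key] *= gamma * lam
    return delta


# Hypothetical corridor: squares 0, 1, 2 and an exit square 3.
# The living reward is 0 and entering the exit gives a reward of +1.
Q = defaultdict(float)
E = defaultdict(float)
alpha, gamma, lam = 0.5, 0.9, 0.8

for s in range(3):
    r = 1.0 if s + 1 == 3 else 0.0
    # The exit square is never updated, so Q(3, "right") stays 0.
    sarsa_lambda_step(Q, E, s, "right", r, s + 1, "right", alpha, gamma, lam)

for s in range(3):
    print(s, round(Q[(s, "right")], 4))
```

In this sketch the first two steps have :math:`\delta = 0`, so they change no :math:`Q`-values and only build up the traces. On the third step :math:`\delta = 1`, and all three squares are updated at once in proportion to their traces, giving approximately 0.2592, 0.36, and 0.5 for squares 0, 1, and 2. One-step Sarsa would have updated only the last square.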