{% import macros as m with context %}

.. _tdlambda_game:

Week 12: TD(Lambda)
=============================================================================================================
{{ m.embed_game('week12_td_lambda') }}

.. topic:: Controls
    :class: margin

    :kbd:`arrows`
        Move Pacman and execute a step of the TD(Lambda) algorithm
    :kbd:`Space`
        Take a single action according to the current policy
    :kbd:`p`
        Follow the current policy
    :kbd:`r`
        Reset the game

.. rubric:: Run locally

:gitref:`../irlc/lectures/lec12/lecture_12_td_lambda.py`

What you see
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The example shows the TD(Lambda) algorithm on a gridworld. The living reward is 0, and the agent obtains a reward of +1 at the exit square. The light-blue numbers are the values of the eligibility trace, i.e. :math:`E(s)`, which are used to update the :math:`V`-values.

How it works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When the agent takes action :math:`a` in a state :math:`s`, then gets an immediate reward of :math:`r` and moves to a new state :math:`s'`, the TD error and the eligibility trace of :math:`s` are first updated as follows:

.. math::
    :label: tdlambda

    \delta & = r + \gamma V(s') - V(s) \\
    E(s) & = E(s) + 1

Then for *all* states the following updates are performed:

.. math::

    V(s) & = V(s) + \alpha \delta E(s) \\
    E(s) & = \gamma \lambda E(s)

In this implementation, the game is refreshed (updated) right after :math:`V(s) \leftarrow V(s) + \alpha \delta E(s)`, but before the eligibility trace is exponentially decayed. I think this is easier for visualization purposes.
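
The two-step update above can be written as a short function. The sketch below is only illustrative and is not taken from the course code: the function name :code:`td_lambda_step`, the dictionary-based tables, and the hyperparameter values are assumptions, and the traces are cleared at the end of an episode.

.. code-block:: python

    from collections import defaultdict

    def td_lambda_step(V, E, s, r, sp, done, alpha=0.5, gamma=1.0, lamb=0.8):
        """One TD(lambda) update after observing (s, r, s')."""
        # TD error: delta = r + gamma * V(s') - V(s); V(s') is 0 in a terminal state.
        delta = r + (0 if done else gamma * V[sp]) - V[s]
        E[s] += 1                       # accumulating eligibility trace for the visited state
        for x in list(E.keys()):        # update *all* states with a nonzero trace
            V[x] += alpha * delta * E[x]
            E[x] *= gamma * lamb        # exponential decay of the trace
        if done:
            E.clear()                   # start the next episode with fresh traces
        return delta

    # Tables default to 0 for unseen states:
    V, E = defaultdict(float), defaultdict(float)
    td_lambda_step(V, E, s=(0, 0), r=0.0, sp=(0, 1), done=False)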