{% import macros as m with context %}

.. _game_td0:

Week 10: TD-learning
=============================================================================================================

{{ m.embed_game('week10_td') }}

.. topic:: Controls
    :class: margin

    :kbd:`arrows`
        Move pacman and execute a step of TD(0)
    :kbd:`Space`
        Take a single action according to a random policy
    :kbd:`p`
        Follow the random policy
    :kbd:`r`
        Reset the game

.. rubric:: Run locally

:gitref:`../irlc/lectures/lec10/lecture_10_td_keyboard.py`.

What you see
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The example shows the TD(0) algorithm applied to a deterministic gridworld environment with a living reward of :math:`-0.05` per step and no discount (:math:`\gamma = 1`). The goal is to get to the upper-right corner.

Every time you move pacman, the game executes a single update of the TD(0) algorithm. It takes quite a few steps for the algorithm to converge, but once it has converged it shows the value function :math:`v_\pi(s)` for the current policy -- which by default is the random policy. This means the algorithm computes the same result as the policy evaluation seen in :ref:`policy_evaluation`.

How it works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When you transition from a state :math:`s` to a state :math:`s'` and receive the reward :math:`R_{t+1}`, the algorithm iteratively updates :math:`V(s)` according to the rule:

.. math::
    V(s) \leftarrow V(s) + \alpha \left( R_{t+1} + \gamma V(s') - V(s) \right)

where :math:`\alpha = 0.5` is the learning rate.
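
A minimal, self-contained sketch of the same update is shown below. It does not use the course's ``irlc`` environments; the 4x4 layout, the coordinate convention and the transition function are illustrative assumptions, while the living reward of :math:`-0.05`, :math:`\gamma = 1` and :math:`\alpha = 0.5` match the example above.

.. code-block:: python

    # Tabular TD(0) evaluation of a random policy on a tiny deterministic gridworld
    # (illustrative sketch; the grid layout and dynamics are assumptions).
    import random
    from collections import defaultdict

    def td0_estimate(episodes=5000, alpha=0.5, gamma=1.0, size=4, living_reward=-0.05):
        goal = (size - 1, size - 1)            # the "upper-right corner" in (x, y) coordinates
        actions = [(1, 0), (-1, 0), (0, 1), (0, -1)]
        V = defaultdict(float)                 # V(s) is initialized to 0 for every state

        def step(s, a):
            # Deterministic transition: moves are clipped to the grid; reaching the goal ends the episode.
            x = min(max(s[0] + a[0], 0), size - 1)
            y = min(max(s[1] + a[1], 0), size - 1)
            s_next = (x, y)
            return s_next, living_reward, s_next == goal

        for _ in range(episodes):
            s, done = (0, 0), False
            while not done:
                a = random.choice(actions)             # the random policy pi
                s_next, r, done = step(s, a)
                target = r if done else r + gamma * V[s_next]
                V[s] += alpha * (target - V[s])        # the TD(0) update
                s = s_next
        return V

    if __name__ == "__main__":
        V = td0_estimate()
        for y in range(3, -1, -1):                     # print rows from top to bottom
            print(" ".join(f"{V[(x, y)]:7.2f}" for x in range(4)))

Because the step size :math:`\alpha = 0.5` is kept constant, the estimates remain somewhat noisy; decaying :math:`\alpha` over time would bring the values closer to the exact policy-evaluation result.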