{% import macros as m with context %}

.. _dynaq_game:

Week 13: DynaQ
=============================================================================================================

{{ m.embed_game('week13_dynaq') }}

.. topic:: Controls
    :class: margin

    :kbd:`arrows`
        Move pacman and execute a step of the Dyna-Q algorithm

    :kbd:`Space`
        Take a single action according to the current policy

    :kbd:`p`
        Follow the current policy

    :kbd:`r`
        Reset the game

.. rubric:: Run locally

:gitref:`../irlc/lectures/lec13/lecture_13_dyna_q_5_maze.py`

What you see
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The example shows the Dyna-Q algorithm on a gridworld. The living reward is 0, and the agent obtains a reward of +1 on the exit square. The obstacles make the problem harder to solve, since exploration becomes very slow.

How it works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When the agent takes action :math:`a` in a state :math:`s`, receives an immediate reward :math:`r`, and moves to a new state :math:`s'`, the Q-values are updated according to the usual Q-learning rule (see :ref:`q_game`):

.. math::
    :label: dynaq

    Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right]

Secondly, the agent pushes the tuple :math:`(s, a, r, s')` onto a list. Finally, since these four pieces of information are all that is needed to perform a :math:`Q`-update, the agent can randomly select :math:`n=5` such tuples from the list and thereby perform :math:`n` additional Q-updates on past transitions -- since :math:`Q`-learning is off-policy, this is perfectly fine from a convergence perspective. This means that more than one :math:`Q(s, a)`-value is updated in each step, which makes Dyna-Q converge much faster after the first episode.
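
To make these steps concrete, here is a minimal Python sketch of a single Dyna-Q step under the assumptions above. It is not the implementation in :gitref:`../irlc/lectures/lec13/lecture_13_dyna_q_5_maze.py`; the function name ``dyna_q_step``, the dictionary representation of :math:`Q`, and the ``memory`` list are illustrative choices, and terminal-state handling is omitted for brevity.

.. code-block:: python

    import random
    from collections import defaultdict

    def dyna_q_step(Q, memory, s, a, r, sp, actions, alpha=0.1, gamma=0.95, n=5):
        """One Dyna-Q step: a real Q-update followed by n replayed updates.

        Q maps (state, action) -> value, memory is a list of past
        (s, a, r, s') tuples, and `actions` lists the available actions.
        """
        def q_update(s, a, r, sp):
            # Standard Q-learning update using the max over next-state actions.
            max_next = max(Q[(sp, ap)] for ap in actions)
            Q[(s, a)] += alpha * (r + gamma * max_next - Q[(s, a)])

        # 1) Update Q from the real transition just observed.
        q_update(s, a, r, sp)
        # 2) Store the transition for later replay.
        memory.append((s, a, r, sp))
        # 3) Replay n randomly chosen past transitions (the planning updates).
        for _ in range(n):
            q_update(*random.choice(memory))

    # Usage with hypothetical grid coordinates and action names:
    Q = defaultdict(float)
    memory = []
    actions = ["up", "down", "left", "right"]
    dyna_q_step(Q, memory, s=(0, 0), a="right", r=0.0, sp=(1, 0), actions=actions)

Because the replayed tuples were collected under earlier behaviour, step 3 relies on :math:`Q`-learning being off-policy, exactly as described above.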