{% import macros as m with context %}

.. _dynaq_game:

Week 13: DynaQ
=============================================================================================================

{{ m.embed_game('week13_dynaq') }}

.. topic:: Controls
    :class: margin

    :kbd:`arrows`
        Move pacman and execute a step of the Dyna-Q algorithm

    :kbd:`Space`
        Take a single action according to the current policy

    :kbd:`p`
        Follow the current policy

    :kbd:`r`
        Reset the game

.. rubric:: Run locally

:gitref:`../irlc/lectures/lec13/lecture_13_dyna_q_5_maze.py`

What you see
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The example shows the Dyna-Q algorithm on a gridworld. The living reward is 0, and the agent obtains a reward of +1 on the exit square. The obstacles make the problem harder to solve, since exploration becomes very slow.

How it works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When the agent takes action :math:`a` in a state :math:`s`, receives an immediate reward :math:`r`, and moves to a new state :math:`s'`, the Q-values are updated according to the usual Q-learning rule (see :ref:`q_game`):

.. math::
    :label: dynaq

    Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right]

Secondly, the agent pushes the tuple :math:`(s, a, r, s')` onto a list. Finally, since these four pieces of information are all that is needed to perform a :math:`Q`-update, the agent can randomly select :math:`n=5` such tuples from the list and thereby perform :math:`n` additional Q-updates on past transitions -- since :math:`Q`-learning is off-policy, this is perfectly fine from a convergence perspective. This means that more than one :math:`Q(s, a)`-value is updated in each step, which makes Dyna-Q converge much faster after the first episode.
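
To make these steps concrete, here is a minimal Python sketch of a single Dyna-Q step under the assumptions above. It is not the implementation in :gitref:`../irlc/lectures/lec13/lecture_13_dyna_q_5_maze.py`; the function name ``dyna_q_step``, the dictionary representation of :math:`Q`, and the ``memory`` list are illustrative choices, and terminal-state handling is omitted for brevity.

.. code-block:: python

    import random
    from collections import defaultdict

    def dyna_q_step(Q, memory, s, a, r, sp, actions, alpha=0.1, gamma=0.95, n=5):
        """One Dyna-Q step: a real Q-update followed by n replayed updates.

        Q maps (state, action) -> value, memory is a list of past
        (s, a, r, s') tuples, and `actions` lists the available actions.
        """
        def q_update(s, a, r, sp):
            # Standard Q-learning update using the max over next-state actions.
            max_next = max(Q[(sp, ap)] for ap in actions)
            Q[(s, a)] += alpha * (r + gamma * max_next - Q[(s, a)])

        # 1) Update Q from the real transition just observed.
        q_update(s, a, r, sp)
        # 2) Store the transition for later replay.
        memory.append((s, a, r, sp))
        # 3) Replay n randomly chosen past transitions (the planning updates).
        for _ in range(n):
            q_update(*random.choice(memory))

    # Usage with hypothetical grid coordinates and action names:
    Q = defaultdict(float)
    memory = []
    actions = ["up", "down", "left", "right"]
    dyna_q_step(Q, memory, s=(0, 0), a="right", r=0.0, sp=(1, 0), actions=actions)

Because the replayed tuples were collected under earlier behaviour, step 3 relies on :math:`Q`-learning being off-policy, exactly as described above.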