{% import macros as m with context  %}

.. _q_game:

Week 11: Q-learning
=============================================================================================================

{{ m.embed_game('week11_q') }}


.. topic:: Controls
    :class: margin

    :kbd:`arrows`
        Move pacman and execute a step of the Q-learning algorithm
    :kbd:`Space`
        Take a single action according to the current policy
    :kbd:`p`
        Follow the current policy
    :kbd:`r`
        Reset the game

    .. rubric:: Run locally

    :gitref:`../irlc/lectures/lec11/lecture_11_q.py`

What you see
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The example shows the Q-learning algorithm on a gridworld. The living reward is 0, and the agent obtains a reward of +1 and -1 on the two exit squares.
The four values shown in each grid cell :math:`s` are the four Q-values :math:`Q(s,a)`, one for each action.

How it works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When the agent takes action :math:`a` in a state :math:`s`, and then gets an immediate reward of :math:`r` and moves to a new state :math:`s'`, the Q-values are updated according to the rule

.. math::

    Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right]
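
A minimal sketch of one such update in Python, assuming ``Q`` is a dictionary mapping (state, action)-pairs to values and that a hypothetical helper ``actions(s)`` returns the actions available in a state; this is an illustration, not the course implementation:

.. code-block:: python

    from collections import defaultdict

    Q = defaultdict(float)   # unseen (s, a)-pairs default to Q(s, a) = 0
    alpha, gamma = 0.5, 1.0  # learning rate and discount factor (assumed values)

    def q_update(s, a, r, sp, done, actions):
        """One Q-learning update after taking action a in s, getting reward r and moving to sp."""
        # max_{a'} Q(s', a') is taken to be 0 when s' is a terminal state.
        max_next = 0 if done else max(Q[(sp, ap)] for ap in actions(sp))
        Q[(s, a)] += alpha * (r + gamma * max_next - Q[(s, a)])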


This update rule will eventually learn the optimal action-value function :math:`q_{*}(s,a)`, as long as all actions are tried infinitely often.
Concretely, the agent follows an epsilon-greedy policy with respect to the current Q-values :math:`Q(s,a)` shown in the
simulation. This ensures that the agent frequently takes actions that it thinks are good, while still exploring enough for the Q-values to eventually converge to the optimal values.
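
The epsilon-greedy rule can be sketched as follows, reusing the ``Q`` dictionary and the hypothetical ``actions(s)`` helper from the sketch above, with an assumed exploration rate ``epsilon``:

.. code-block:: python

    import random

    epsilon = 0.1  # exploration probability (assumed value)

    def epsilon_greedy(s, actions):
        """Pick a random action with probability epsilon, otherwise a greedy one."""
        A = list(actions(s))
        if random.random() < epsilon:
            return random.choice(A)              # explore: uniformly random action
        return max(A, key=lambda a: Q[(s, a)])   # exploit: argmax_a Q(s, a)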