{% import macros as m with context %}

.. _bandit_game:

Week 8: Simple bandit
=============================================================================================================

{{ m.embed_game('week8_bandit') }}

.. topic:: Controls
    :class: margin

    :kbd:`0`, :kbd:`1`, ..., :kbd:`9`
        Select (take) a specific action :math:`a_k = 0, \dots, 9`.
    :kbd:`Space`
        Take a single action from the simple bandit algorithm.
    :kbd:`q`
        Show the true average rewards :math:`q^*_k`. The highest indicates the best arm.
    :kbd:`p`
        Automatically take actions.
    :kbd:`r`
        Reset the game.

.. rubric:: Run locally

:gitref:`../irlc/lectures/lec08/demo_bandit.py`.

{# .. margin:: Stuff in the margin. #}

What you see
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The example shows the simple bandit algorithm applied to a 10-armed problem with binary rewards. In step :math:`k`, when you select an arm :math:`a_k`, the environment gives you a reward of :math:`r_k = 1` with probability :math:`q^*_k` and a reward of 0 otherwise. The height of each bar shows the current average reward for that arm, and the numbers show how many times each arm has been tried. You can show the actual average rewards :math:`q^*_k` by pressing :kbd:`q` (although this is cheating: if we knew these values, there would be no point in running a bandit algorithm!).

How it works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

todo.
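
The sketch below is a rough, assumed illustration rather than the course implementation in :gitref:`../irlc/lectures/lec08/demo_bandit.py`: it takes the simple bandit algorithm to be :math:`\varepsilon`-greedy action selection with incremental sample-average estimates, run on a 10-armed problem with binary rewards. The arm probabilities ``q_star``, the exploration rate ``epsilon`` and the number of steps are illustrative choices, not values taken from the demo.

.. code-block:: python

    import numpy as np

    def simple_bandit(q_star, steps=1000, epsilon=0.1):
        """Assumed epsilon-greedy simple bandit on a k-armed binary-reward problem.

        q_star[a] is the (hidden) probability that arm a pays out a reward of 1.
        """
        k = len(q_star)
        Q = np.zeros(k)             # current average reward for each arm (the bar heights)
        N = np.zeros(k, dtype=int)  # how many times each arm has been tried (the numbers)
        rng = np.random.default_rng()
        for _ in range(steps):
            # Explore a random arm with probability epsilon, otherwise act greedily.
            if rng.random() < epsilon:
                a = int(rng.integers(k))
            else:
                a = int(np.argmax(Q))
            # Binary reward: 1 with probability q_star[a], otherwise 0.
            r = 1.0 if rng.random() < q_star[a] else 0.0
            # Incremental update of the sample average for the chosen arm.
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]
        return Q, N

    if __name__ == "__main__":
        # Hypothetical true arm probabilities for a 10-armed problem.
        q_star = np.random.default_rng(0).uniform(size=10)
        Q, N = simple_bandit(q_star, steps=5000)
        print("Estimated best arm:", int(np.argmax(Q)), "true best arm:", int(np.argmax(q_star)))

Pressing :kbd:`Space` in the demo roughly corresponds to one iteration of the loop above: pick an arm, observe a binary reward, and update that arm's running average and count.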