{% import macros as m with context %}

.. _bandit_game:

Week 8: UCB bandit algorithm
=============================================================================================================
{{ m.embed_game('week8_ucb_bandit') }}

.. topic:: Controls
    :class: margin

    :kbd:`0`, :kbd:`1`, ..., :kbd:`9`
        Select (take) a specific action :math:`a_k = 0, \dots, 9`.
    :kbd:`Space`
        Take a single action from the bandit algorithm
    :kbd:`q`
        Show the true average rewards :math:`q^*_k`. The highest indicates the best arm.
    :kbd:`u`
        Show the estimated upper bounds for each :math:`q_a` (as red lines)
    :kbd:`p`
        Automatically take actions
    :kbd:`r`
        Reset the game

.. rubric:: Run locally

:gitref:`../irlc/lectures/lec08/demo_bandit_ucb.py`.

What you see
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The example shows the UCB bandit algorithm applied to a 10-armed problem with binary rewards. In step :math:`k`, when you select an arm :math:`a_k`, the environment gives you a reward of :math:`r_k=1` with probability :math:`q_k^*` and otherwise a reward of 0.

The height of each bar shows the current average reward for that arm, and the numbers show how often each arm has been tried. You can show the actual average rewards :math:`q_k^*` by pressing :kbd:`q` (although this is cheating! If we knew these values, there would be no point in running a bandit algorithm).

How it works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The UCB agent keeps track of the average reward for each arm, :math:`Q_t(a)`, and how many times each arm has been attempted, :math:`N_t(a)`. For each arm it then computes the upper bound:

.. math::
    Q_t(a) + c \sqrt{ \frac{\log t}{N_t(a)} }

and selects the arm with the highest upper bound. The constant :math:`c` is chosen by the user, for instance :math:`c=2`.
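
The sketch below illustrates this rule. It is a minimal, self-contained example and not the course implementation from :gitref:`../irlc/lectures/lec08/demo_bandit_ucb.py`; the names (``BinaryBandit``, ``UCBAgent``, ``select_action``, etc.) are made up for illustration. It pairs a 10-armed binary-reward environment, as described above, with an agent that selects :math:`\arg\max_a \left[ Q_t(a) + c \sqrt{\log t / N_t(a)} \right]`.

.. code-block:: python

    # Minimal sketch of a binary-reward bandit and a UCB agent (illustrative names).
    import numpy as np

    class BinaryBandit:
        """k-armed bandit: arm a pays reward 1 with probability q_star[a], else 0."""
        def __init__(self, k=10, seed=0):
            self.rng = np.random.default_rng(seed)
            self.q_star = self.rng.uniform(size=k)  # true (hidden) success probabilities

        def pull(self, a):
            return 1.0 if self.rng.random() < self.q_star[a] else 0.0

    class UCBAgent:
        def __init__(self, k=10, c=2.0):
            self.c = c
            self.Q = np.zeros(k)   # average reward per arm, Q_t(a)
            self.N = np.zeros(k)   # number of times each arm was tried, N_t(a)
            self.t = 0             # total number of actions taken so far

        def upper_bounds(self):
            # Arms that have never been tried get an infinite bound,
            # so every arm is pulled at least once.
            with np.errstate(divide="ignore", invalid="ignore"):
                bonus = self.c * np.sqrt(np.log(max(self.t, 1)) / self.N)
            return np.where(self.N == 0, np.inf, self.Q + bonus)

        def select_action(self):
            return int(np.argmax(self.upper_bounds()))

        def update(self, a, r):
            self.t += 1
            self.N[a] += 1
            self.Q[a] += (r - self.Q[a]) / self.N[a]  # incremental mean update

    if __name__ == "__main__":
        env, agent = BinaryBandit(k=10), UCBAgent(k=10, c=2.0)
        for _ in range(1000):
            a = agent.select_action()
            r = env.pull(a)
            agent.update(a, r)
        print("best arm:       ", int(np.argmax(env.q_star)))
        print("most-pulled arm:", int(np.argmax(agent.N)))

After enough steps, the most-pulled arm typically coincides with the arm with the highest :math:`q^*`, which mirrors what you see in the demo when pressing :kbd:`p` and then :kbd:`q`.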