# Week 8: UCB bandit algorithm
## What you see
The example shows the UCB bandit algorithm applied to a 10-armed problem with binary rewards. In step \(k\), when you select an arm \(a_k\), the environment gives you a reward of \(r_k=1\) with probability \(q^*(a_k)\) and otherwise a reward of \(r_k=0\).
The height of each bar shows the current average reward for that arm, and the numbers show how often each arm has been tried. You can show the actual success probabilities \(q^*(a)\) by pressing q (although this is cheating: if we knew these values, there would be no point in running a bandit algorithm!).
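As a minimal sketch of such an environment (the name `pull` and the uniformly drawn success probabilities are illustrative assumptions, not the course code):

```python
import numpy as np

rng = np.random.default_rng(0)
q_star = rng.uniform(size=10)  # hidden success probability of each arm (made up here)

def pull(a):
    """Bernoulli reward: 1 with probability q_star[a], otherwise 0."""
    return int(rng.random() < q_star[a])
```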
## How it works
The UCB agent keeps track of the average reward for each arm, \(Q_t(a)\), and of how many times each arm has been tried, \(N_t(a)\). At step \(t\), it computes an upper confidence bound for each arm,

\[U_t(a) = Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}},\]

and selects the arm with the highest upper bound. The constant \(c\) controls the amount of exploration and is chosen by the user, for instance \(c=2\).
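The loop below is a minimal sketch of this rule (the name `ucb_select` and the 1000-step horizon are illustrative assumptions; `pull` is the environment sketched above, not the course's actual code):

```python
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    """Return the arm with the highest upper confidence bound."""
    untried = N == 0
    if untried.any():                   # an untried arm has an infinite bound,
        return int(np.argmax(untried))  # so try every arm once first
    bounds = Q + c * np.sqrt(np.log(t) / N)
    return int(np.argmax(bounds))

Q = np.zeros(10)  # average reward per arm, Q_t(a)
N = np.zeros(10)  # pull counts per arm, N_t(a)
for t in range(1, 1001):
    a = ucb_select(Q, N, t)
    r = pull(a)                # reward from the environment above
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]  # incremental update of the average
```

Note that an arm with \(N_t(a)=0\) has an infinite upper bound, which is why the code tries every arm once before applying the formula.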