Week 8: Simple bandit
What you see
The example shows the simple bandit algorithm applied to a 10-armed problem with binary rewards. In step \(k\), when you select an arm \(a_k\), the environment gives you a reward of \(r_k=1\) with probability \(q^*(a_k)\) and otherwise a reward of \(r_k=0\).
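To make the setup concrete, here is a minimal sketch of such an environment in Python; the class name, the `pull` method, and the uniform draw of the success probabilities are illustrative assumptions rather than the actual interface used in the example:

```python
import numpy as np

class BinaryBandit:
    """A k-armed bandit where arm a pays reward 1 with probability q_star[a]."""
    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        # True expected rewards q^*(a); drawn uniformly here as an assumption.
        self.q_star = self.rng.uniform(size=k)

    def pull(self, a):
        # Reward r = 1 with probability q^*(a), otherwise r = 0.
        return int(self.rng.random() < self.q_star[a])
```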
The height of each bar shows the current average reward for that arm, and the number above it shows how often the arm has been tried. You can display the true expected rewards \(q^*(a)\) by pressing q (although this is cheating: if we knew these values, there would be no point in running a bandit algorithm!).
How it works
The bandit keeps track of the average reward obtained so far from each arm, stored as \(Q_k(a)\). When arm \(a\) is tried for the \(n\)-th time and yields reward \(r\), this average can be updated incrementally as \(Q(a) \leftarrow Q(a) + \frac{1}{n}\bigl(r - Q(a)\bigr)\). With probability \(\epsilon\), the bandit selects a random arm, and otherwise it selects the arm \(a\) with the highest value of \(Q_k(a)\). The latter is called being greedy, since we select the arm we (currently) think is best.
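A minimal sketch of this \(\epsilon\)-greedy loop, assuming the `BinaryBandit` environment sketched above; the function name, the random tie-breaking among greedy arms, and the incremental average update are choices made here for illustration:

```python
import numpy as np

def simple_bandit(env, epsilon=0.1, steps=1000, seed=None):
    """Run epsilon-greedy on env; return estimated values Q and pull counts N."""
    rng = np.random.default_rng(seed)
    k = len(env.q_star)
    Q = np.zeros(k)             # Q[a]: current average reward of arm a (bar heights)
    N = np.zeros(k, dtype=int)  # N[a]: how often arm a has been tried (bar labels)
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.integers(k)  # explore: pick a random arm
        else:
            # Exploit: pick a greedy arm, breaking ties at random.
            a = rng.choice(np.flatnonzero(Q == Q.max()))
        r = env.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]  # incremental update of the running average
    return Q, N
```

For example, `Q, N = simple_bandit(BinaryBandit(k=10, seed=0), epsilon=0.1, steps=1000)` returns the two quantities shown in the visualization: `Q` corresponds to the bar heights and `N` to the counts. With \(\epsilon = 0\) the bandit can lock onto a suboptimal arm early, whereas any \(\epsilon > 0\) ensures every arm keeps being tried occasionally.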