{% import macros as m with context %}
{% set week = 8 %}
{{ m.exercise_head(week) }}

Bandits
-------------------------------------------------------------------------
Bandits are an example of a particular, simplified reinforcement learning setup where the emphasis is on exploration. We are therefore going to model bandits as reinforcement learning problems using the :class:`~gymnasium.Env` and :class:`~irlc.ex01.agent.Agent` classes. Let's first explore the bandit problem corresponding to the simple 10-armed bandit in (:cite:t:`sutton`). Recall that in this bandit problem:

- There are :math:`k=10` arms.
- An action :math:`a_t \in \{0, 1, \dots, 9\}` selects an arm, and we then obtain a reward :math:`r_t`.
- An episode corresponds to e.g. :math:`T=1000` such interactions, and the goal is to maximize the accumulated reward.
- We want to be able to reset the environment and re-do the experiment many times.

A basic interaction with the environment is as follows:

.. runblock:: pycon

    >>> from irlc.ex08.bandits import StationaryBandit
    >>> env = StationaryBandit(k=10)    # 10-armed testbed.
    >>> env.reset()                     # Reset env.q_star
    >>> s, r, _, _, info = env.step(3)
    >>> print(f"The reward we got from taking arm a=3 was {r=}")

The script prints the immediate reward :math:`r_0` obtained in the first step. Internally, the environment stores the (expected) reward of each arm as :python:`env.q_star`. These values are reset (sampled from a normal distribution) when the :func:`irlc.ex08.bandits.StationaryBandit.reset` method is called, and they can be used to compute the regret. An example:

.. runblock:: pycon

    >>> from irlc.ex08.bandits import StationaryBandit
    >>> env = StationaryBandit(k=4)     # 4-armed testbed.
    >>> env.reset()                     # Reset all parameters.
    >>> env.q_star
    >>> env.reset()                     # Reset again
    >>> env.q_star
    >>> env.optimal_action              # This variable stores the optimal action
    >>> _, r, _, _, info = env.step(2)  # Take action a=2
    >>> print(f"Reward from a=2 was {r=}, the regret was {info['average_regret']=}")

This code also computes the per-step regret, :python:`info['average_regret']`, which for an action :math:`a` is defined as

.. math::
    \max_{a'} q^*_{a'} - q^*_{a}

Perhaps you can tell how it can be computed using ``env.optimal_action`` and ``env.q_star``?

Implementing a bandit environment
*************************************************************************
To implement a new bandit environment, all you need to do is implement the reward function and compute the regret. To simplify things I have made a helper class :class:`~irlc.ex08.bandits.BanditEnvironment` where you only need to implement the :func:`~irlc.ex08.bandits.BanditEnvironment.reset` function and a function :func:`~irlc.ex08.bandits.BanditEnvironment.bandit_step`, which accepts an action and returns the reward :math:`r_t` and the regret.

.. literalinclude:: ../../shared/output/bandits_doc.py
    :language: python
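
As an illustration, the following is a rough sketch of what a small custom bandit could look like: a hypothetical Bernoulli bandit where arm :math:`a` pays out a reward of 1 with some fixed (unknown) probability. The sketch assumes that :class:`~irlc.ex08.bandits.BanditEnvironment` stores the number of arms as :python:`self.k` and that :python:`reset` and :python:`bandit_step` have the signatures described above; check the included source and the class documentation below for the exact interface.

.. code-block:: python

    # Sketch only: a hypothetical Bernoulli bandit, not part of the exercise code.
    import numpy as np
    from irlc.ex08.bandits import BanditEnvironment

    class BernoulliBandit(BanditEnvironment):
        """ Arm a pays out a reward of 1 with probability p[a] and 0 otherwise. """
        def reset(self):
            # Draw a new (unknown) win-probability for each of the k arms.
            self.p = np.random.uniform(0, 1, self.k)

        def bandit_step(self, a):
            reward = 1 if np.random.rand() < self.p[a] else 0
            # Per-step regret: expected reward of the best arm minus that of the chosen arm.
            regret = self.p.max() - self.p[a]
            return reward, regret

The pattern is the same as for the 10-armed testbed: :python:`reset` samples the unknown reward parameters, and :python:`bandit_step` uses them to produce a reward and the corresponding regret.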

Training and visualizing bandit agents
****************************************************************************************
Agents for bandit problems are implemented as usual by overwriting the :func:`~irlc.ex01.agent.Agent.pi` and :func:`~irlc.ex01.agent.Agent.train` functions in the :class:`~irlc.ex01.agent.Agent` class. You can train a bandit agent using :func:`~irlc.ex01.agent.train` as usual, but to simplify things I have made a convenience method to train and visualize multiple bandits as follows:

.. tip::
    :class: margin

    If you set :python:`use_cache=True` then old simulations will be stored and plotted until there are :python:`max_episodes` simulations in total. Saving old simulations for later plotting is essential in reinforcement learning, where a single experiment can take a long time to run.

.. plot::
    :include-source:

    >>> from irlc.ex08.bandits import StationaryBandit, eval_and_plot
    >>> from irlc.ex08.simple_agents import BasicAgent
    >>> env = StationaryBandit(k=10)                                         # 10-armed testbed
    >>> agents = [BasicAgent(env, epsilon=.1), BasicAgent(env, epsilon=.01)] # Create two agents
    >>> k = eval_and_plot(env, agents, num_episodes=10, steps=100, max_episodes=10, use_cache=True) # Fight!

Classes and functions
-------------------------------------------------------------------------
.. autoclass:: irlc.ex08.bandits.BanditEnvironment
    :show-inheritance:
    :members:

.. autoclass:: irlc.ex08.bandits.StationaryBandit
    :show-inheritance:
    :members:

Solutions to selected exercises
-------------------------------------------------------------------------------------------------------

{% if show_solution[week] %}
.. admonition:: Solution to conceptual problem 1
    :class: dropdown

    **Part a:** Since the simple bandit explores at random, we can never **rule out** that it explored, because it might select the optimal action during exploration. However, if the agent selects a sub-optimal action, we know this must have occurred by exploration. We track the Q-values:

    - :math:`t=1`: :math:`Q(1)=-1`, :math:`Q(2)=0`, :math:`Q(3)=0`, :math:`Q(4)=0`
    - :math:`t=2`: :math:`Q(1)=-1`, :math:`Q(2)=1`, :math:`Q(3)=0`, :math:`Q(4)=0`
    - :math:`t=3`: :math:`Q(1)=-1`, :math:`Q(2)=\frac{-1}{2}`, :math:`Q(3)=0`, :math:`Q(4)=0`
    - :math:`t=4`: :math:`Q(1)=-1`, :math:`Q(2)=\frac{1}{3}`, :math:`Q(3)=0`, :math:`Q(4)=0`
    - :math:`t=5`: :math:`Q(1)=-1`, :math:`Q(2)=\frac{1}{3}`, :math:`Q(3)=0`, :math:`Q(4)=0`

    So the method must have selected a sub-optimal action at time :math:`t=4` and at time :math:`t=5`.

    **Part b+c:** Based on the figure, the average reward of the first three arms is given by the horizontal black lines, so roughly 0.1, -0.8 and 1.5. The best arm is :math:`a=3` since it has the highest average reward of all ten arms, and therefore the regret of the three arms is

    - :math:`1.5 - 0.1 = 1.4`
    - :math:`1.5 - (-0.8) = 2.3`
    - :math:`1.5 - 1.5 = 0`

    Note that the regret can never be negative. It is called regret because it is the average reward we miss out on: we have no regret when we take the optimal arm, and more regret the worse an arm is compared to the optimal one.
{% endif %}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=3496d439-e79d-411b-a75a-afe3013687ec', 'Problem 8.1', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=6a6a40f6-6822-4b4a-a6cf-afe3013c9bfc', 'Problem 8.2', True) }}

.. admonition:: Conceptual Question: UCB Bandits
    :class: dropdown

    **Part a:** Since the rewards from each arm are always the same, the :math:`Q`-values are averages of a constant and therefore constant themselves. So:

    .. math::
        f_0(t) & = 5 + 2\sqrt{ \frac{ \log 1000 }{500} } \\
        f_1(t) & = 10 + 2\sqrt{ \frac{ \log 1000 }{500} }

    **Part b:** UCB1 will select arm 1 as it has the highest bound (see above). This is natural, since arm 1 has the higher reward estimate, and both arms have been tried the same number of times, so there is no reason to prefer one over the other for the sake of exploration.
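
    Plugging in the numbers as a quick sanity check of the expressions above (using :math:`\log 1000 \approx 6.91`):

    .. math::
        f_0(1000) & = 5 + 2\sqrt{ \tfrac{6.91}{500} } \approx 5.24 \\
        f_1(1000) & = 10 + 2\sqrt{ \tfrac{6.91}{500} } \approx 10.24

    Both arms receive the same exploration bonus (about 0.24), so the difference between the bounds is exactly the difference between the reward estimates.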

    **Part c:** For :math:`t \geq 1000` the bounds will be:

    .. math::
        f_0(t) & = 5 + 2\sqrt{ \frac{ \log t }{500} } \\
        f_1(t) & = 10 + 2\sqrt{ \frac{ \log t }{t-500} }

    **Part d:** The 'rough sketch' should indicate that the upper confidence bound :math:`f_0` grows very slowly (as :math:`\sqrt{\log t}`), while :math:`f_1` falls towards 10 (roughly as :math:`\sqrt{\log(t)/t}`). The curve for :math:`f_0` starts out 5 units below :math:`f_1`, and so they cross... eventually... long after the heat death of the universe:

    .. plot::
        :caption: Cheat mode on

        import matplotlib.pyplot as plt
        import numpy as np

        lt = np.linspace(np.log(1000), 5000)                         # The x-axis is log(t), starting at t=1000.
        plt.plot(lt, 5 + 2*np.sqrt(lt / 500), 'k-')                  # f_0 as a function of log(t).
        np.seterr(all="ignore")                                      # Prevent numpy warnings due to very large numbers.
        plt.plot(lt, 10 + 2*np.sqrt(lt / (np.exp(lt) - 500)), 'r-')  # f_1 as a function of log(t).
        plt.xlabel('log(t)')

    **Part e:** When the two bounds cross, the algorithm will switch arm.
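
    To put a (very rough) number on "eventually": for large :math:`t` the exploration bonus of :math:`f_1` is negligible, so the curves cross approximately when :math:`5 + 2\sqrt{\log(t)/500} = 10`, i.e.

    .. math::
        \log t \approx 500 \cdot \left( \frac{5}{2} \right)^2 = 3125
        \qquad \Leftrightarrow \qquad
        t \approx e^{3125} \approx 10^{1357},

    which is also why the plot above uses :math:`\log(t)` on the x-axis.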