{% import macros as m with context  %}
{% set week = 1 %}
{{ m.exercise_head(week) }}

This weeks exercise will give you an introduction to the three main components of this course. They are:

The Environment
    Represented by an gymnasium :class:`gymnasium.Env` class.
    This class contains the python-implementation of the problem we want to solve (:cite:t:`herlau`, Subsection 4.3.1).
    It is responsible for maintaining an internal state (the state which the environment is in right now),
    and has a function which responds to actions (:func:`gymnasium.Env.step`).
The Agent
    Represented as a :class:`~irlc.ex01.agent.Agent` class.
    The agent interacts with the environment (:cite:t:`herlau`, Subsection 4.3.2). We can think about it as a robot
    (in our case, a simulated robot) which gets information from the environment and decides
    which actions to take. The agent can (and often do) maintain an internal state for planning or learning. This internal state may include
    a model of the environment.
Training
    Finally, the agent and environment must interact in the world loop (:cite:t:`herlau`, Subsection 4.3.4). In this course this is accomplished using the :func:`~irlc.ex01.agent.train` function.
    What it does it to feed observed states in the agent (i.e., call a method of the agent) which allows the agent to compute an action. This action
    is then fed back into the environment. This is sometimes called the **world loop**.


.. plot::
    :caption: An agent (represented by the yellow pacman) in an environment. Plot generated by the software in this course.

    from irlc.gridworld.gridworld_environments import FrozenLake
    from irlc import Agent, interactive
    env = FrozenLake(render_mode="human")  # Pass render_mode='human' for visualization.
    env, agent = interactive(env, Agent(env))  # For plotting the agents information.
    env.reset()     # You always need to call reset
    env.plot()      # Plot the environment.
    env.close()


.. _inventory_environment:

Inventory environment
------------------------------------------------------------------------------------------------------------------
The environment represents the problem we wish to solve. In reinforcement learning it could be a game, and in control theory it could be a simulation of e.g. a car driving around a track.

Nearly all environments will need to maintain an internal state :math:`x_k`. In a computer game, this represents the position of the player, enemies, etc., or in a control environment it represents positions and velocities
of the object(s) that we are trying to control. To do this effectively, we represent environments using classes. For instance, the inventory environment defined as so:


.. note::
    :class: margin

    To be annoying, gymnasium environments typically denote the state by :python:`s` (rather than :math:`x_k`), and actions by :python:`a` (rather than :math:`u_k`). This notation is taken from reinforcement learning.


.. literalinclude:: ../../shared/output/inventory_environment_a.py
    :language: python

The following example shows how we can use the environment:

.. note::
    :class: margin

    You can ignore the ``info``-dictionary for now.
    It is useful in some reinforcement learning methods where the ``info``-dictionary can be used to store extra information such as the (true) goal location for performance monitoring, but it is perhaps a bit annoying when you are just starting out.


.. runblock:: pycon

    >>> from irlc.ex01.inventory_environment import InventoryEnvironment
    >>> env = InventoryEnvironment(N=4)
    >>> x0, info = env.reset()
    >>> print("Initial state is", x0)
    >>> print("The 'extra information' is", info)

Mathematically, recall the states and actions are denoted by:

.. math::
    x_0, u_0, x_1, u_1, \cdots

When you call the ``reset``-function, it returns the starting state :math:`x_0` as the variable ``x0``, as well as a dictionary ``info`` with *optional extra information*.

.. tip::
    :class: margin

    Since actions and observations can be both discrete and continuous, different environments will use different action and observation spaces depending on the situation.
    In the example above they are instances of the ``Discrete`` observation space, representing integers :math:`\{0,1,2,\cdots,n-1\}`. You can
    get :math:`n` using ``env.observation_space.n`` and ``env.action_space.n``.

The environment also defines two variables, the ``observation_space`` and the ``action_space``. Lets take a look at both:

.. runblock:: pycon

    >>> import gymnasium as gym
    >>> from irlc.ex01.inventory_environment import InventoryEnvironment
    >>> env = InventoryEnvironment()
    >>> print(env.observation_space)
    >>> print(env.action_space)
    >>> a = env.action_space.sample() # Get a random action
    >>> print("Action is", a)
    >>> print("Is this a valid action?", a in env.action_space, "is 9 a valid action?", 9 in env.action_space)

The observation and action spaces can be used to initialize the algorithm we want to use to solve the problem.
Conveniently, the spaces contains a ``sample``-method which be used to generate a random action: ``env.action_space.sample()``.

The step-function
**********************************************************************************************************
To actually take (execute) an action, you just need to pass it into the ``step``-function of the environment:

.. runblock:: pycon

    >>> from irlc.ex01.inventory_environment import InventoryEnvironment
    >>> env = InventoryEnvironment()
    >>> s0, _ = env.reset()
    >>> next_state, reward, terminated, truncated, info = env.step(2) # Take action 2.
    >>> print(f"Went from state {s0} to {next_state} when taking action 3 and got {reward=}")


In this example, we first reset the environment to get the starting state :math:`s_0` as ``x0``,
then tell the environment we want to take action :math:`a_0=2`, after which the ``step`` function returns 5 arguments:

.. note::
    :class: margin

    We use reward (and not cost) to be consistent with the gymnasium environment specification. You can get the cost by multiplying with minus one, i.e. ``cost =  -reward``.

*   ``next_state``: The next state :math:`s_1` (the one the state is in at the end of the program),
*   ``reward``: the first reward :math:`r_1` (recall that the reward is 1-indexed)
*   ``terminated``: A :python:`bool` indicating if the environment terminated or not.
*   ``truncated``: A :python:`bool` indicating if the environment was forced to terminate prematurely. You can assume it is :python:`False`
*   ``info``: A dictionary with possible extra information. Similar to the dictionary returned by ``reset``.

You can assume the ``truncated``-variable is :python:`False`, and in most cases here in the beginning the :python:`info`-dictionary will be empty.

Agents
-------------------------------------
.. tip::
    :class: margin

    You can ignore the :func:`~irlc.ex01.Agent.train`-function for now. We will only consider against that need to be trained during the Reinforcement Learning part of the course.

The :class:`~irlc.ex01.agent.Agent` will be presented as a class with a policy function, which we denote by :func:`~irlc.ex01.Agent.pi`, and a training-function denoted by :func:`~irlc.ex01.Agent.train`.
The following code defines an Agent which simply generates random actions (i.e. random numbers from the set :math:`\{0, 1, 2\}`).

.. literalinclude:: ../../shared/output/inventory_environment_b.py
    :language: python

Notice that the agent **inherits** from the class ``Agent`` -- In this specific example this is cosmetic, but all agents you write should do this since it:

- Ensures the function signature (order and number of parameters) to the :func:`~irlc.ex01.Agent.pi` and :func:`~irlc.ex01.Agent.train`-function are the same
- It gives the agent access to a few helper functions. Most notably, the agent will by default implement a random policy -- since many reinforcement learning methods requires us to take random actions some of the time (exploration), this will actually be quite helpful


Training
-----------------------------------------
The training function lets the agent and environment interact with each other, i.e., generate episodes. This is how we eventually train and test all of our methods (i.e., agents).

The training function is quite simple, and it is very useful to experiment with your own version first to really understand how the agent and environment interacts.

The main part of the code is the generation of a single rollout, which implements the world loop as described in (:cite:t:`herlau`, Subsection 4.3.4), and which can be sketched as:

.. literalinclude:: ../../shared/output/inventory_environment_d.py
    :language: python


.. note::
    :class: margin

    In the :math:`f_k, g_k`-notation this is:

    .. math::

        \sum_{k=0}^{N-1} -g_{k}(x_k, u_k, w_k)


The last line will print out the total reward from one episode computed as:

.. math::

    \sum_{k=0}^{N-1} r_{k+1}

where :math:`r_k` is the ``reward``-variable in each step.

.. note::

    To summarize, the training-function will simulate the interaction between the agent and the environment for one episode as follows:

    - Reset the environment to get the first state ``x0``
    - Uses the agents policy to compute the first action, i.e. ``a0 = agent.pi(x0, 0)`` (the ``0`` refers to :math:`k=0`)
    - Let the environment compute the next state using ``env.step(a0)``
    - Repeat until the environment terminates.


The recommended train-function
********************************************************************************************************************

Although the train-function is very simple,
Eventually, you should use the :func:`~irlc.ex01.Agent.train`-method I have written for reproducibility/simplicity, and because it provides more features such as experiment management.
A basic usage of the :func:`~irlc.ex01.Agent.train`-function to train the agent for a single episode is as follows:

.. runblock:: pycon

    >>> from irlc.ex01.inventory_environment import InventoryEnvironment, RandomAgent
    >>> from irlc import train          # Import the train-function
    >>> env = InventoryEnvironment()    # set up an environment
    >>> agent = RandomAgent(env)        # Set up the RandomAgent (which takes random actions)
    >>> stats, _ = train(env,agent,num_episodes=1,verbose=False)  # Train for one complete episode (rollout)
    >>> print("Accumulated reward of first episode:", stats[0]['Accumulated Reward'])


The training function returns a ``stats``-variable which is actually a dictionary. This is because we may want to know more about what happened during an episode than just the
accumulated reward. Here is what the variable will contain in general:

.. runblock:: pycon

    >>> from irlc.ex01.inventory_environment import InventoryEnvironment
    >>> from irlc import Agent, train # Import the Agent and train-function
    >>> env = InventoryEnvironment()    # set up an environment
    >>> agent = Agent(env)              # Set up the default agent (which takes random actions)
    >>> stats, _ = train(env,agent,num_episodes=1,verbose=False)  # Train for one complete episode (rollout)
    >>> for k in stats[0].keys():
    ...    print(k, stats[0][k])


Multiple episodes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The previous code only computed the average reward of a single episode of length :math:`N` (``num_episodes=1``).
To estimate the average cost of a given policy we must do this :math:`T` times and compute the average:

This is easily accomplished using the training function:

.. runblock:: pycon

    >>> import numpy as np
    >>> from irlc import train, Agent
    >>> from irlc.ex01.inventory_environment import InventoryEnvironment
    >>> env = InventoryEnvironment()
    >>> stats, _ = train(env, Agent(env), num_episodes=1000,verbose=False)  # Perform 1000 rollouts using Agent class
    >>> avg_reward = np.mean([stat['Accumulated Reward'] for stat in stats])
    >>> print("[Agent class] Average cost of random policy J_pi_random(0)=", -avg_reward)

This code computes:

.. math::

    \text{average cost} \approx -\frac{ \sum_{t=1}^T \text{ Reward of episode number } t. }{T}


What this tells us is how good our policy is on average, in this case the average cost is :math:`\approx 4`. When we design policies, the lower the expected cost is, the better the policy is.

The advantage of using the environment, agent and train-functionality is that we get a high degree of reuseability.

.. note::
    :class: margin

    It is not important how :math:`Q`-learning works at this point. The example illustrates the environments and agents allows you to structure experiments in the same way throughout the course.

For instance, we can train a :math:`Q`-learning agent, it will learn a better policy and therefore obtain a lower cost of :math:`2.75`:

.. runblock:: pycon

    >>> import numpy as np
    >>> from irlc import train
    >>> from irlc.ex11.q_agent import QAgent  # You will implement this in week 11.
    >>> from irlc.ex01.inventory_environment import InventoryEnvironment
    >>> env = InventoryEnvironment()
    >>> stats, _ = train(env, QAgent(env), num_episodes=1000,verbose=False)  # Perform 1000 rollouts using Agent class
    >>> avg_reward = np.mean([stat['Accumulated Reward'] for stat in stats])
    >>> print("[Agent class] Average cost of random policy J_pi_random(0)=", -avg_reward)

Next week, we will compute the truly optimal cost.

Getting states and actions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The second output argument of the training function gives you access to the states and actions computed in each episode.
This code illustrates how you can print out the states and actions for 3 episodes:

.. runblock:: pycon

    >>> from irlc.ex01.inventory_environment import InventoryEnvironment
    >>> from irlc import Agent, train # Import the Agent and train-function
    >>> env = InventoryEnvironment()    # set up an environment
    >>> agent = Agent(env)              # Set up the default agent (which takes random actions)
    >>> _, trajectories = train(env,agent,num_episodes=3,verbose=False)  # Train for three complete episodes
    >>> for k, trajectory in enumerate(trajectories):
    ...    print("episiode", k, "states", trajectory.state)
    ...    print("episiode", k, "actions", trajectory.action)
    ...


Pacman
-------------------------------------------------------------------------------------------------------------------

The inventory-environment is a little bit boring, so we will lastly also look at the Pacman-example. Our goal is to build an agent that can play Pacman, but to get to that point
we first need to familiarize ourselves with the Pacman game environment.

The Pacman levels are represented as a python :python:`str`, which allows us to create varied Pacman environments. For instance, this is how we can craete
a Pacman game level based on a small maze with 4 food pellets:

.. runblock:: pycon

    >>> maze = """
    ... %%%%%%%
    ... %    .%
    ... %.P%% %
    ... %.   .%
    ... %%%%%%%
    ... """
    >>> from irlc.pacman.pacman_environment import PacmanEnvironment
    >>> env = PacmanEnvironment(maze)
    >>> x0, _ = env.reset() # Works just like any other environment.

This requires a bit too much imagination, so you can plot the environment by passing :python:`render_mode='human'` to the
environment (this is standard for gymnasium) and then use the :func:`~irlc.plotenv`-function to plot it:

.. plot::
    :caption: A basic example of plotting Pacman.
    :width: 400

    from irlc import plotenv
    from irlc.pacman.pacman_environment import PacmanEnvironment, datadiscs
    env = PacmanEnvironment(layout_str=datadiscs, render_mode='human')
    env.reset()
    plotenv(env)
    env.close()

If you want to try to play Pacman with a keyboard, you can copy-paste the following snippet into a text editor and it will work.

.. code-block:: python

    from irlc.pacman.pacman_environment import PacmanEnvironment, datadiscs
    from irlc import interactive, savepdf, Agent, train
    env = PacmanEnvironment(layout_str=datadiscs, render_mode='human')
    env, agent = Interactive(env, Agent(env)) # This makes the environment interactive. Ignore that it needs an Agent for now.
    train(env, agent, num_episodes=2)
    env.close()

The Pacman environment makes use of the freedom we have in specifying the actions
and the states. For instance, the states are actually objects, so we can ask a state to tell us what actions are available:

.. runblock:: pycon

    >>> from irlc.pacman.pacman_environment import PacmanEnvironment, datadiscs
    >>> from irlc import interactive, savepdf, Agent, train
    >>> env = PacmanEnvironment(layout_str=datadiscs, render_mode='human')
    >>> x0, _ = env.reset()
    >>> print("Available actions in the starting state are", x0.A())
    >>> env.close()

It is important your agent only uses actions that are available in a given state, otherwise you will get an error like this:

.. runblock:: pycon

    >>> from irlc.pacman.pacman_environment import PacmanEnvironment, datadiscs
    >>> from irlc import interactive, savepdf, Agent, train
    >>> env = PacmanEnvironment(layout_str=datadiscs, render_mode='human')
    >>> x0, _ = env.reset()
    >>> env.step("Right") # Results in an error.
    >>> env.close()

.. tip::
    :class: margin

    In this case Pacman could eat both pellets by going down. To get more interesting behavior, use the variable :python:`k` or the state :python:`x`.


Let's put all of this together. The following example defines a level with two food pellets and a simple agent that eats both of them:

.. runblock:: pycon

    >>> maze = """
    ... %%%%%%%
    ... % P   %
    ... % .%% %
    ... % .   %
    ... %%%%%%%"""
    >>> from irlc.pacman.pacman_environment import PacmanEnvironment
    >>> from irlc import Agent, train
    >>> env = PacmanEnvironment(maze)
    >>> class HungryHippo:
    ...     def pi(self, x, k):
    ...         return "South"
    ...
    >>> stats, _ = train(env, HungryHippo(), num_episodes=1)
    >>> print("Your score was", stats['accumulated_reward'])
    >>> env.close()


Classes and functions
------------------------------------------------------------------------------------------------------

.. autoclass:: irlc.ex01.agent.Agent
  :members:

.. autofunction:: irlc.ex01.agent.train

.. autofunction:: irlc.utils.player_wrapper.interactive

.. autofunction:: irlc.plotenv

Solutions to selected exercises
-------------------------------------------------------------------------------------------------------


{% if show_solution[week] %}

.. admonition:: Solution to problem 1
    :class: dropdown

    **Part a:** For action :math:`u=0` Bob ends up with :math:`1.1x_0` kroner.
    For action :math:`u=1` bob ends up with

    .. math::

        \mathbb{E}[\text{Amount} | u=1] = \frac{3}{4}(x_0 + 12) + \frac{1}{4} 0 = \frac{3}{4}(x_0 + 12)

    If we plug in :math:`x_0 = 20` we get 22 when :math:`u=0` and :math:`24` when :math:`u=1` so :math:`u=1` is right.

    **Part b:** The policy depends on :math:`x_0`. It outputs :math:`u=0` when

    .. math::
        1.1 x_0 < \frac{3}{4} x_0 + 9

    Simplify we get: :math:`\mu_0(x_0) = 0` when :math:`x_0 < \frac{180}{7}` and otherwise :math:`\mu_0(x_0) = 1`.

{% endif %}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=331022ed-b743-432b-95d5-b10f00f5665b', 'Problem 3 & 4: Inventory control', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=aea73b70-a0f1-47ea-8488-b10901724489', 'Problem 5', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=9e159fad-87fd-4da2-9869-b10901752af7', 'Problem 6', True) }}