{% import macros as m with context  %}
{% set week = 2 %}
{{ m.exercise_head(week) }}

This week will be concerned with dynamical programming. Dynamical programming is a technique that uses a
**model** of the environment to create an optimal plan. This means there are now 3 classes to content with:

The environment
    Represented by an OpenAI gymnasium :class:`~gymnasium.Env` class. This class represents the problem we are trying to solve and has e.g. a :meth:`~gymnasium.Env.step`-method
The agent
    Represented as a :class:`~irlc.ex01.agent.Agent` class. This class represents the agent which which acts in the environment. In our case,
    the agent should plan optimally using dynamical programming and the policy method  :meth:`~irlc.ex01.agent.Agent.pi`, should simply execute each step of that plan
The model
    This class (:class:`~irlc.ex02.dp_model.DPModel`) represents the dynamical programming model the agent use to plan. It must implement python functions equivalent to
    :math:`f_k(x_k, u_k, w_k)`, :math:`g_k(x_k, u_k, w_k)` and so on. The agent should use the **model** clas to plan, and not the **environment**.


The dynamical programming model
--------------------------------------------------
Using a class to represent a model can be confusing, but since we will use the same trick in control it is worth considering
why and how we choose to represent the model this way.

The model is basically just a collection of functions. For instance, consider these functions corresponding to :math:`f_k` and :math:`g_k`:

.. runblock:: pycon

    >>> def f(x, u, w, k):
    ...     return x+u-w #Compute and return x_{k+1} here
    >>>
    >>> def g(x, u, w, k):
    ...     return x**2 + u**2 #Compute and return g_k(x, u, w) here


A model is then just a class that collects the function as so:

.. runblock:: pycon

    >>> class MyModel:
    ...     def f(self, x, u, w, k):
    ...         return x+u-w #Compute and return x_{k+1} here
    ...     def g(self, x, u, w, k):
    ...         return x**2 + u**2 #Compute and return g_k(x, u, w) here
    >>>
    >>> model = MyModel()
    >>> print(model.g(2, 1, 0, 0))

.. note::
    :class: margin

    signature means the name and order of variables

What the class :class:`~irlc.ex02.dp_model.DPModel` (see below) does is just to specify the signature of the functions ``f``, ``g``, etc. shown above.
This helps us to specify the model in the same way every time, and is a very common technique in programming.
In other words, the class :class:`~irlc.ex02.dp_model.DPModel`  itself is very simple, and looks something like this:

.. runblock:: pycon

    >>> class SimplifiedDPModel:
    ...     def f(self, x, u, w, k):
    ...         raise NotImplementedError("Implement f_k here")
    ...     def g(self, x, u, w, k):
    ...         raise NotImplementedError("Implement f_k here")

And when we actually want to implement a specific model, for instance of the inventory environment, we can do that as follows (see (:cite:t:`herlau`, Subsection 5.1.2)):

.. literalinclude:: ../../shared/output/inventory_a.py
    :language: python

And the model can be used as:

.. runblock:: pycon

    >>> from irlc.ex02.inventory import InventoryDPModel
    >>> model = InventoryDPModel(N=5) # Plan over a horizon of N=5.
    >>> model.g(x=1, u=1, w=0, k=0) # Computes g_0(x=1, u=1, w=0)
    >>> print("State space at time k=0", model.S(0))

This means that when you implement the DP algorithm, you only need to refer to functions such as ``model.S``, ``model.f``,
and so on without thinking about any specific environment -- if you use the functions correctly, your implementation will
work with any specific model that matches the signature in the :class:`~irlc.ex02.dp_model.DPModel`  class!.

The following example shows the start of the DP algorithm implementation. It represents the cost functions :math:`J_k`` as
a list of dictionaries ``J``,
so that ``J[N][x]`` corresponds to the terminal cost :math:`J_N(x)`.

.. runblock:: pycon

    >>> def very_simplified_dp(model):
    ...     N = model.N                                  # Planning horizon
    ...     print("The planning horizon is", N)
    ...     print("The state space is", model.S(0))
    ...     J = [dict() for k in range(N+1)]             # Initialize J_k = [J_0, J_1, ..., J_N]
    ...     J[N] = {x: model.gN(x) for x in model.S(N) } # Initialize J_N by looping over terminal states model.S(N)
    ...     return J                                     # Insert the rest of the DP algorithm here.
    >>>
    >>> from irlc.ex02.inventory import InventoryDPModel
    >>> model = InventoryDPModel(N=4)
    >>> J = very_simplified_dp(model)                   # Call the DP algorithm and get J back.
    >>> print(J)


A primer on Pacman
-----------------------------------------------------------------------------------------------
In the first project, you will work with the Pacman game. I have included a more complete description of the
Pacman game at :ref:`pacman`. 


Classes and functions
----------------------------------------------------------------------------------
This section provides an overview of the most important classes and functions:

.. autoclass:: irlc.ex02.dp_model.DPModel
  :members:

.. autofunction:: irlc.ex02.dp.DP_stochastic


Solutions to selected exercises
-------------------------------------------------------------------------------------------------------

{% if show_solution[week] %}

.. admonition:: Solution to problem 1
    :class: dropdown

    **Part a:** If :math:`m=1` then :math:`p(w | x, u) = \frac{1}{2}`

    **Part b:** If :math:`m=2` then there are three outcomes, :math:`w=0, 1, 2`. Their probability is:

    .. math::

        p(w=0 | x, u) & = \frac{1}{4}, \\
        p(w=1 | x, u) & = p(v_1=0, v_2=1 | x, u) + p(v_1=1, v_2=0 | x, u) = \frac{1}{2}, \\
        p(w=2 | x, u) & = \frac{1}{4}

    In general this is a binomial distribution. Note the problem is reminiscent of the Pacman-problem where we want to compute the probability of each position (after the ghosts have moved), :math:`w`, which results as several individual moves by the ghosts :math:`v_i`. These probabilities need to be aggregated correctly in the code.

    **Part c:** This will be the case when:

    .. math::

        p(w'_k | x_k, u_k) = \begin{cases}\frac{1}{2} & \text{if } w'_k = x_k + u_k \\
        \frac{1}{2} & \text{if } w'_k = 1 + x_k + u_k \\
        0 & \text{otherwise}
        \end{cases}

    Note that the point of the problem is to get to an exact expression for :math:`p(w'_k | x_k, u_k)` -- not to say it is the same as the previous problem, or one half (without details), etc. The precision would be required if you wanted to implement it.
{% endif %}


{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=7bbcca3b-86f3-4b92-8b83-b2830164b40b', 'Problem 2.2: Deterministic Inventory', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=dc8c6b04-4492-4e3b-9fd8-b10f00fe8222', 'Problem 2.3: Stochastic DP', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=dd0cf60c-a130-4cc4-a5a2-b10f010058cb', 'Problem 2.4: Dynamical Programming Agent', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=a667094e-028d-4e0d-af37-b10f0109bd2a', 'Problem 2.7: Flower-store', True) }}