{% import macros as m with context %}
{% set week = 9 %}
{{ m.exercise_head(week) }}

Markov decision processes
-------------------------------------------------------------------------
The Markov Decision Process (MDP) is a mathematical description of the problem solved by most reinforcement learning methods.
It is represented by the class :class:`~irlc.ex09.mdp.MDP`, and it stands in the same relationship to methods such as value iteration as the :class:`~irlc.ex02.dp_model.DPModel` does to the dynamic programming method.

Let's look at the MDP for the frozen-lake gridworld environment:

.. plot::
    :include-source:

    from irlc.gridworld.gridworld_environments import FrozenLake
    from irlc import Agent, interactive
    env = FrozenLake(render_mode="human")       # Pass render_mode='human' for visualization.
    env, agent = interactive(env, Agent(env))   # Make the environment interactive.
    env.reset()                                 # You always need to call reset.
    env.plot()                                  # Plot the environment.
    env.close()

We can instantiate the MDP class directly, or extract it from the gridworld environment as ``env.mdp``. A couple of examples follow. This code prints the initial state and the actions available in the initial state:

.. runblock:: pycon

    >>> from irlc.gridworld.gridworld_environments import FrozenLake
    >>> mdp = FrozenLake().mdp              # Get the MDP.
    >>> mdp.initial_state                   # States are tuples (x, y).
    >>> mdp.A(mdp.initial_state)            # Get all actions in the initial state.

.. warning::
    The actions available in the different states are **not** the same in the gridworld environment. Specifically, the winning (goal) state only has a single action, which takes the environment to the terminal state. Let's take a look:

    .. runblock:: pycon

        >>> from irlc.gridworld.gridworld_environments import FrozenLake
        >>> mdp = FrozenLake().mdp
        >>> mdp.A( (0, 3) )     # Actions in the starting state.
        >>> mdp.A( (0, 0) )     # Actions in the bottom-left state.

Transition probabilities
********************************************************************
The transition probabilities

.. math::
    P(s', r | s, a)

are represented by the function :func:`~irlc.ex09.mdp.MDP.Psr`, which returns a dictionary of the form ``{(s1, r1): p1, (s2, r2): p2, ...}`` (compare with :func:`irlc.ex02.dp_model.DPModel.Pw`). An example:

.. runblock:: pycon

    >>> from irlc.gridworld.gridworld_environments import FrozenLake
    >>> mdp = FrozenLake().mdp
    >>> transitions = mdp.Psr(mdp.initial_state, 0)
    >>> for (sp, r), p in transitions.items():
    ...     print(f"P(s'={sp}, r={r} | s={mdp.initial_state}, a=0) = {p}")

.. note::
    It is easy to miss, but (:cite:t:`sutton`) distinguish between

    - the set of all states, :math:`\mathcal{S}^+`, corresponding to :func:`~irlc.ex09.mdp.MDP.states`, and
    - the set of all non-terminal states, :math:`\mathcal{S}`, corresponding to :func:`~irlc.ex09.mdp.MDP.nonterminal_states`.

    The policy, the value function, the action-value function, and the transition probabilities :math:`P(s',r|s,a)` are **only** defined in non-terminal states, and it is a common mistake in reinforcement learning to update or evaluate them in terminal states. Let's take a look:

    .. runblock:: pycon

        >>> from irlc.gridworld.gridworld_environments import FrozenLake
        >>> mdp = FrozenLake().mdp      # Get the MDP.
        >>> mdp.nonterminal_states
        >>> mdp.states
        >>> terminal_states = [s for s in mdp.states if s not in mdp.nonterminal_states]
        >>> terminal_states
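These functions are all you need in order to evaluate the right-hand side of Bellman's equations. As a small illustration, the sketch below computes the one-step lookahead values :math:`\sum_{s',r} P(s',r|s,a)\left(r + \gamma V(s')\right)` for every action in the initial state. The all-zero value function and the discount factor :math:`\gamma = 0.9` are arbitrary choices made for this example; note how terminal successor states, which are not in ``V``, simply contribute :math:`\gamma \cdot 0`.

.. code-block:: python

    from irlc.gridworld.gridworld_environments import FrozenLake

    mdp = FrozenLake().mdp
    gamma = 0.9                                   # Discount factor, chosen for illustration.
    V = {s: 0 for s in mdp.nonterminal_states}    # Value function; only defined in non-terminal states.

    s = mdp.initial_state
    for a in mdp.A(s):
        # One-step lookahead: sum over (s', r) of P(s', r | s, a) * (r + gamma * V(s')).
        # Terminal successor states are not in V and contribute gamma * 0.
        q = sum(p * (r + gamma * V.get(sp, 0)) for (sp, r), p in mdp.Psr(s, a).items())
        print(f"Q(s={s}, a={a}) = {q}")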
Building an MDP
************************************************************************
Building your own MDP is easy. Let's implement an MDP corresponding to a small gridworld example (see (:cite:t:`sutton`), Example 4.1); a second, even smaller example is sketched after the notes below.

.. literalinclude:: ../../shared/output/small_gridworld.py
    :language: python

Note that:

- The states are tuples of (row, col) coordinates ``(i, j)``.
- Each state has 4 actions, so the action-space function ``A(state)`` simply returns the list ``[0, 1, 2, 3]``.
- The ``is_terminal`` function determines whether we are in a corner (:math:`(0,0)` or :math:`(3,3)`).
- The transition probabilities are deterministic, so the dictionary returned by ``Psr`` has a single key ``(next_state, reward)`` with the value ``1``.
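To make the pattern stand on its own, here is a minimal sketch of an even smaller MDP with a single non-terminal state. It is only meant as an illustration: the class name is made up, and it is assumed that the base class accepts the initial state through its constructor; consult the class documentation below if your version differs.

.. code-block:: python

    from irlc.ex09.mdp import MDP

    class TwoStateMDP(MDP):
        """A tiny MDP: state 0 is the only non-terminal state, state 1 is terminal.

        Action 0 stays in state 0 with reward 0; action 1 moves to the terminal
        state with reward +1.
        """
        def __init__(self):
            # Assumption: the base class stores the initial state passed to it here.
            super().__init__(initial_state=0)

        def is_terminal(self, state):
            return state == 1

        def A(self, state):
            return [0, 1]               # Both actions are available in the non-terminal state.

        def Psr(self, state, action):
            if action == 0:
                return {(0, 0.0): 1.0}  # Stay in state 0, reward 0, with probability 1.
            else:
                return {(1, 1.0): 1.0}  # Move to the terminal state, reward +1.

    mdp = TwoStateMDP()
    print(mdp.A(mdp.initial_state))      # [0, 1]
    print(mdp.Psr(mdp.initial_state, 1)) # {(1, 1.0): 1.0}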
Classes and functions
-------------------------------------------------------------------------
.. autoclass:: irlc.ex09.mdp.MDP
    :show-inheritance:
    :members:

Solutions to selected exercises
-------------------------------------------------------------------------------------------------------

{% if show_solution[week] %}
.. admonition:: Solution to Sutton problem 3.14
    :class: dropdown

    The update is :math:`V(s) \leftarrow \sum_a \pi(a|s) \sum_{s', r} p(s', r |s, a) (r + \gamma V(s'))`. In our case the per-step reward is :math:`r=0`, the discount is :math:`\gamma = 0.9`, and the policy is uniform, :math:`\pi(a | s) = \frac{1}{4}`. Reading off the four values of the value function surrounding the center square we get:

    .. math::
        V(s) \leftarrow \sum_{a=1}^4 \frac{1}{4} \left(0 + \gamma V(s') \right) = \frac{0.9(2.3 + 0.7 - 0.4 + 0.4)}{4} = 0.675.

.. admonition:: Solution to Sutton problem 4.1
    :class: dropdown

    According to the relationships we saw at the beginning of today's lecture, the action-value function satisfies:

    .. math::
        q_{\pi}(s,a) = \mathbb{E}\left[ r + \gamma \sum_{a'} \pi(a'|s') q_\pi(s', a') \mid s, a\right] = \mathbb{E}[-1 + \gamma v_\pi(s') \mid s, a]

    Furthermore, when we take an action we know what the next state :math:`s'` is, so the expectation disappears. We can now read off the value function from Figure 4.1, and we get that in the first case :math:`q_\pi(11, \text{down}) = -1` and in the second that :math:`q_\pi(7, \text{down}) = -1 + \gamma v_\pi(11) = -1 - 14\gamma = -15`.

.. admonition:: Solution to Problem 2 (Bellman's equations)
    :class: dropdown

    **Part a:** Recall that Bellman's equation (it does not matter which one) says :math:`v(s) = \mathbb{E}[R + \gamma v(s') \mid S_t = s]`. Therefore:

    .. math::
        v_1 = \frac{1}{2} \left( R + \gamma v_1 \right) + \frac{1}{2} \gamma v_2 = \frac{1}{2} \left( 1 + \gamma v_1 + \gamma v_2 \right)

    Solving this, we get :math:`v_1 = \frac{1 + \gamma v_2}{2-\gamma} = 2`.

    **Part b:** We already have the first equation from the previous part. Applying a similar argument to the second state leaves us with:

    .. math::
        -1 & = (\gamma - 2) v_1 + \gamma v_2 \\
        -2 & = (\gamma - 2) v_2 + \gamma v_1

    This can be written as a linear system of equations:

    .. math::
        \begin{bmatrix} -1 \\ -2 \end{bmatrix} = \begin{bmatrix} \gamma - 2 & \gamma \\ \gamma & \gamma - 2 \end{bmatrix} \mathbf{v}

    .. tip::
        During the exam, a few students correctly found the system of two equations (in my opinion, the difficult part), but when they used Maple to rewrite it as a linear system, Maple produced some very strange expressions, probably due to the :math:`\gamma`. When you use Maple to simplify or solve, try to check that you can reproduce the result by hand if you have time.

    **Bonus question:** Using the standard formula for the inverse of a :math:`2 \times 2` matrix (Google it!), the inverse of :math:`A` is :math:`A^{-1} = \frac{1}{4 - 4\gamma}\begin{bmatrix} \gamma - 2 & -\gamma \\ -\gamma & \gamma - 2 \end{bmatrix}`. Then find :math:`\mathbf{v}` as

    .. math::
        \mathbf{v} = A^{-1} b = \frac{1}{4 - 4\gamma} \begin{bmatrix} 2 + \gamma \\ 4 - \gamma \end{bmatrix}.

{% endif %}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=3c44f154-2419-48d4-acd2-b146015dfe76', 'Problem 9.1-9.3', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=2bf01550-197a-4a81-938c-b146015dfe76', 'Problem 9.4: Policy evaluation', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=349172af-f639-44c8-9698-b146015dfe76', 'Problem 9.5: Policy iteration', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=6c860e51-c2fd-4776-9d2c-b1460169dd7c', 'Problem 9.6: Value iteration', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=df72bb2e-2a18-4789-b10f-b1460169dd7c', 'Problem 9.8', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=566dd3c0-69ec-4e5f-8a53-b1460169dd60', 'Problem 9.9 - Gambler', True) }}