{% import macros as m with context %}
{% set week = 10 %}
{{ m.exercise_head(week) }}

Tabular methods (Q-learning, Sarsa, etc.)
-------------------------------------------------------------------------
As the name suggests, tabular methods require us to maintain a table of :math:`Q`-values or state-values :math:`V`.
The :math:`Q`-values in particular can be a bit tricky to keep track of, and I have therefore made a helper class
:class:`irlc.ex09.rl_agent.TabularAgent` which will hopefully simplify the process.

.. note::
    The main complication we need to deal with when representing the Q-values is that different states may have
    different action spaces, i.e. that :math:`\mathcal{A}(s) \neq \mathcal{A}(s')`. Gymnasium's way of dealing with
    this situation is to use the ``info``-dictionary, so that :python:`s, info = env.reset()` will specify an
    :python:`info['mask']` variable. This is a numpy ndarray such that a given action ``a`` is available exactly when
    :python:`info['mask'][a] == 1`. You can read more about this choice at
    `the gymnasium discrete space documentation`_.

The :math:`Q`-values behave like a 2d numpy ndarray:

.. runblock:: pycon

    >>> from irlc.ex09.rl_agent import TabularAgent
    >>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
    >>> env = BookGridEnvironment()
    >>> agent = TabularAgent(env, epsilon=0.3)  # Use epsilon-greedy exploration.
    >>> state, _ = env.reset()
    >>> state
    >>> agent.Q[state, 1] = 2   # Update a Q-value
    >>> agent.Q[state, 1]       # Get a Q-value
    >>> agent.Q[state, 0]       # Q-values are by default zero

To implement masking, the :python:`agent.Q`-table has two special functions which require the ``info``-dictionary.
**As long as you stick to these two functions and pass the correct info dictionary, you will not get into trouble**.

- To get the optimal action, use ``agent.Q.get_optimal_action(s, info_s)``:

  .. runblock:: pycon

      >>> from irlc.ex09.rl_agent import TabularAgent
      >>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
      >>> env = BookGridEnvironment()
      >>> agent = TabularAgent(env)
      >>> state, info = env.reset()                # Get the info-dictionary corresponding to s
      >>> agent.Q[state, 1] = 2.5                  # Update a Q-value; action a=1 is now optimal.
      >>> agent.Q.get_optimal_action(state, info)  # Note we pass along the info-dictionary corresponding to this state

- To get all Q-values corresponding to a state, use ``agent.Q.get_Qs(s, info_s)``:

  .. runblock:: pycon

      >>> from irlc.ex09.rl_agent import TabularAgent
      >>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
      >>> env = BookGridEnvironment()
      >>> agent = TabularAgent(env)
      >>> state, info = env.reset()                  # Get the info-dictionary corresponding to s
      >>> agent.Q[state, 1] = 2.5                    # Update a Q-value; action a=1 is now optimal.
      >>> actions, Qs = agent.Q.get_Qs(state, info)  # Note we pass along the info-dictionary corresponding to this state
      >>> actions                                    # All actions that are available in this state (after masking)
      >>> Qs                                         # All Q-values available in this state (after masking)

You can combine this functionality to get e.g. the maximal Q-value using ``agent.Q[s, agent.Q.get_optimal_action(s, info)]``.

.. note::
    The Q-table will remember the masking information for a given state and warn you if you try to access an action
    that has previously been masked.

We often want to perform :math:`\varepsilon`-greedy exploration. To simplify this, the agent has the function
``agent.pi_eps``. Since this function uses the Q-values, it also requires an ``info``-dictionary:

.. runblock:: pycon

    >>> from irlc.ex09.rl_agent import TabularAgent
    >>> from irlc.gridworld.gridworld_environments import BookGridEnvironment
    >>> env = BookGridEnvironment()
    >>> agent = TabularAgent(env, epsilon=0.1)  # epsilon-greedy exploration
    >>> state, info = env.reset()               # to get a state and info-dictionary
    >>> a = agent.pi_eps(state, info)           # Epsilon-greedy action selection
    >>> a

.. warning::
    In the ``train(s, a, r, sp, done, info_s, info_sp)``-method, remember to use the ``info``-dictionary corresponding
    to the state:

    - use ``self.Q.get_Qs(s, info_s)`` and ``self.Q.get_Qs(sp, info_sp)``
    - **never** use ``self.Q.get_Qs(s, info_sp)``

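As an illustration of the warning above, here is a minimal sketch of how a Q-learning style ``train``-method could be
written on top of :class:`irlc.ex09.rl_agent.TabularAgent`. It is not part of the course code: the class name
``SketchQAgent``, the attributes ``self.alpha`` and ``self.gamma`` and their default values are made up for this
example, and only the functionality documented above (the ``Q[s, a]``-indexing, ``Q.get_optimal_action`` and the
``train``-signature) is assumed to exist. Note how ``info_sp`` is only ever paired with ``sp``:

.. code-block:: python

    from irlc.ex09.rl_agent import TabularAgent

    class SketchQAgent(TabularAgent):
        """Hypothetical Q-learning agent; it only illustrates the info-dictionary convention."""
        def __init__(self, env, gamma=0.95, alpha=0.1, epsilon=0.1):
            self.gamma, self.alpha = gamma, alpha   # discount factor and learning rate (made-up attribute names)
            super().__init__(env, epsilon=epsilon)

        def train(self, s, a, r, sp, done=False, info_s=None, info_sp=None):
            if done:
                # No future reward can be obtained from a terminal state.
                max_q = 0
            else:
                # Bootstrap from sp: info_sp is paired with sp, never with s.
                a_star = self.Q.get_optimal_action(sp, info_sp)
                max_q = self.Q[sp, a_star]
            # Standard Q-learning update of Q(s, a).
            self.Q[s, a] += self.alpha * (r + self.gamma * max_q - self.Q[s, a])
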
Classes and functions
-------------------------------------------------------------------------

.. autoclass:: irlc.ex09.rl_agent.TabularAgent
    :show-inheritance:
    :members:

.. autoclass:: irlc.ex09.rl_agent.TabularQ
    :show-inheritance:
    :members:

Solutions to selected exercises
-------------------------------------------------------------------------------------------------------

{% if show_solution[week] %}
.. admonition:: Solution to Problem 1 - The single-state example
    :class: dropdown

    **Part a:** Consider first the :math:`T=3` example. Clearly :math:`N^{\text{first}}(s_1) = 1`, and the return is:

    .. math::
        G_0 = R_1 + \gamma R_2 + \gamma^2 R_3 = 1 + \gamma + \gamma^2 = \frac{1 - \gamma^3}{1-\gamma}

    Therefore, for a general episode length :math:`T`, the accumulated reward is:

    .. math::
        S^{\text{first}}(s_1) = G_0 = \sum_{t=0}^{T-1}\gamma^t R_{t+1} = \sum_{t=0}^{T-1}\gamma^t
          = \frac{1-\gamma^{T}}{1-\gamma} \equiv h(T).

    **Part b:** From the previous part it is clear that the return only depends on the number of future visits to the
    state. If we again consider the :math:`T=3` example, we see that :math:`N^{\text{every}}(s_1) = T` since we visit
    :math:`s_1` a total of :math:`T` times. Each visit contributes a (future) return :math:`G_t`: the first visit gives
    the return computed above, the next visit is based on :math:`T-1` future rewards, and so on. For :math:`T=3`:

    .. math::
        S^{\text{every}}(s_1) & = (R_1 + \gamma R_2 + \gamma^2 R_3) + (R_2 + \gamma R_3) + R_3 \\
        & = h(3) + h(2) + h(1)

    So in general:

    .. math::
        S^{\text{every}}(s_1) & = \sum_{t=1}^T h(t) = \sum_{t=1}^T \frac{1-\gamma^{t}}{1-\gamma}
          = \frac{T - \gamma \sum_{k=0}^{T-1} \gamma^k}{1-\gamma} \\
        & = \frac{T - \gamma \frac{1-\gamma^T}{1-\gamma}}{1-\gamma} \\
        & = \frac{T - \gamma h(T)}{1-\gamma}

    **Part c:** Given the above we can compute the two estimators for :math:`m=1` and see that they clearly give
    different results. Although they agree when :math:`T \rightarrow \infty`, this limit is not interesting since the
    value of :math:`T` is random and not under our control.

    **Part d:** Given :math:`m` episodes of lengths :math:`T_1, \dots, T_m`, clearly :math:`N^{\text{first}}(s_1) = m` and

    .. math::
        S^{\text{first}}(s_1) = \sum_{k=1}^m h(T_k) = \frac{m - \sum_{k=1}^m \gamma^{T_k}}{1-\gamma}

    **Part e:** Similarly, clearly :math:`N^{\text{every}}(s_1) = \sum_{k=1}^m T_k` and

    .. math::
        S^{\text{every}}(s_1) = \sum_{k=1}^m \frac{T_k - \gamma h(T_k)}{1-\gamma}
          = \frac{\sum_{k=1}^m T_k - \gamma \sum_{k=1}^m h(T_k)}{1-\gamma}

    **Part f:** By direct calculation we get:

    .. math::
        \frac{m}{N^{\text{every}}(s_1)} = \frac{1}{\frac{1}{m}\sum_{k=1}^m T_k} \approx \frac{1}{\mathbb{E}[T]}
          = \frac{1}{\frac{1}{1-p}} = 1-p

    According to the law of large numbers, the error in this approximation scales as :math:`\frac{1}{\sqrt{m}}`.

    **Part g:** Starting with first-visit we have:

    .. math::
        \frac{S^{\text{first}}(s_1)}{N^{\text{first}}(s_1)} & = \frac{1}{m} \sum_{k=1}^m h(T_k)
          = \frac{1 - \frac{1}{m} \sum_{k=1}^m \gamma^{T_k}}{1-\gamma} \\
        & \approx \frac{1}{1-\gamma}\left[ 1 - \mathbb{E}[\gamma^T] \right] \\
        & = \frac{1}{1-\gamma}\left[1 - \gamma \frac{1-p}{1 - \gamma p} \right] \\
        & = \frac{1}{1-\gamma p}

    For every-visit we get:

    .. math::
        \frac{S^{\text{every}}(s_1)}{N^{\text{every}}(s_1)} & = \frac{1}{1-\gamma} \frac{1}{m}
          \left[ \sum_{k=1}^m T_k - \gamma \sum_{k=1}^m h(T_k) \right] \frac{m}{N^{\text{every}}(s_1)} \\
        & \approx \left[ \frac{1}{m} \sum_{k=1}^m T_k - \gamma \frac{1}{m} \sum_{k=1}^m h(T_k) \right] \frac{1-p}{1-\gamma} \\
        & \approx \left[ \frac{1}{1-p} - \gamma \frac{1}{1-\gamma p} \right] \frac{1-p}{1-\gamma} \\
        & = \frac{1}{1-\gamma p}

    Here the first approximation uses Part f, and the second uses
    :math:`\frac{1}{m}\sum_{k} T_k \approx \frac{1}{1-p}` together with the fact that
    :math:`\frac{1}{m}\sum_{k} h(T_k)` is exactly the first-visit estimate :math:`\approx \frac{1}{1-\gamma p}`
    computed above. So we see that first-visit and every-visit agree when :math:`m` is large, and the true value is
    :math:`v_\pi(s_1) = \frac{1}{1-\gamma p}`.

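    As an optional sanity check (not part of the official solution), the limit :math:`\frac{1}{1-\gamma p}` can be
    verified numerically. The following self-contained script does not use the course code, and the function names
    ``sample_T`` and ``mc_estimates`` are made up for this check: it samples :math:`m` episode lengths from
    :math:`p(T) = p^{T-1}(1-p)` and computes both estimators.

    .. code-block:: python

        import numpy as np

        def sample_T(p, rng):
            """Sample the number of visits to s1 before jumping to the terminal state s2."""
            T = 1
            while rng.random() < p:  # stay in s1 with probability p
                T += 1
            return T

        def mc_estimates(p=0.5, gamma=0.9, m=100_000, seed=0):
            """Return the first-visit and every-visit Monte Carlo estimates of v_pi(s1)."""
            rng = np.random.default_rng(seed)
            h = lambda T: (1 - gamma ** T) / (1 - gamma)  # discounted return of T unit rewards
            S_first = N_first = S_every = N_every = 0.0
            for _ in range(m):
                T = sample_T(p, rng)
                S_first += h(T)                                # first visit: one return G_0 = h(T) per episode
                N_first += 1
                S_every += sum(h(t) for t in range(1, T + 1))  # every visit: h(T), h(T-1), ..., h(1)
                N_every += T
            return S_first / N_first, S_every / N_every

        p, gamma = 0.5, 0.9
        v_first, v_every = mc_estimates(p, gamma)
        print(v_first, v_every, 1 / (1 - gamma * p))  # all three numbers should be close
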
.. admonition:: Bonus proofs

    - The chance of staying in :math:`s_1` in a single step is :math:`p`, so the chance of staying :math:`T-1` times
      is :math:`p^{T-1}`, and the chance of then jumping to :math:`s_2` is :math:`1-p`. Therefore
      :math:`p(T) = p^{T-1}(1-p)`.

    - .. math::
          \mathbb{E}[T] & = (1-p)\sum_{T=1}^\infty T p^{T-1} = (1-p) \sum_{T=0}^\infty \frac{d}{dp} p^T \\
          & = (1-p) \frac{d}{dp} \sum_{T=0}^\infty p^T = (1-p)\frac{d}{dp} \frac{1}{1-p}
            = (1-p)\frac{1}{(1-p)^2} = \frac{1}{1-p}.

      (the :math:`T=0` term is zero, so the lower limit may be shifted before differentiating)

    - .. math::
          \mathbb{E}[\gamma^T] = (1-p)\sum_{T=1}^\infty \gamma^T p^{T-1}
            = (1-p)\gamma \sum_{T=1}^\infty (\gamma p)^{T-1} = \frac{(1-p)\gamma}{1-\gamma p}.

{% endif %}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=6e346c07-86a5-4675-8e86-b14f01472f3b', 'Problem 10.1-10.2: MC Value estimation', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=02e99fd4-3d50-464e-a089-afe30129c52d', 'Problem 10.3-10.4: MC control', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=3bb0db76-7868-40ba-8e0d-afe3012fecf9', 'Problem 10.5: TD Learning', True) }}