{% import macros as m with context %}
{% set week = 11 %}
{{ m.exercise_head(week) }}

Linear function approximators
----------------------------------------------------------------------------------

The idea behind linear function approximation of :math:`Q`-values is that

- We initialize (and eventually learn) a :math:`d`-dimensional weight vector :math:`w \in \mathbb{R}^d`
- We assume there exists a function to compute a :math:`d`-dimensional feature vector :math:`x(s,a) \in \mathbb{R}^d`
- The :math:`Q`-values are then represented as

.. math::
    Q(s,a) = x(s,a)^\top w

Learning is therefore entirely about updating :math:`w`.

We are going to use a class, :class:`~irlc.ex11.feature_encoder.LinearQEncoder`, to implement the tile-coding procedure for defining :math:`x(s,a)` as described in (:cite:t:`sutton`). The following example shows how to initialize the linear :math:`Q`-values and compute them in a given state:

.. runblock:: pycon

    >>> import gymnasium as gym
    >>> env = gym.make('MountainCar-v0')
    >>> from irlc.ex11.feature_encoder import LinearQEncoder
    >>> Q = LinearQEncoder(env, tilings=8)  # as in (:cite:t:`sutton`)
    >>> s, _ = env.reset()
    >>> a = env.action_space.sample()
    >>> Q(s,a)          # Compute a Q-value.
    >>> Q.d             # Get the number of dimensions
    >>> Q.x(s,a)[:4]    # Get the first four coordinates of the x-vector
    >>> Q.w[:4]         # Get the first four coordinates of the w-vector

For learning, you can simply update :math:`w` as any other variable, and there is a convenience method to get the optimal action. The following example illustrates basic usage:

.. runblock:: pycon

    >>> import gymnasium as gym
    >>> env = gym.make('MountainCar-v0')
    >>> from irlc.ex11.feature_encoder import LinearQEncoder
    >>> Q = LinearQEncoder(env, tilings=8)
    >>> s, _ = env.reset()
    >>> a = env.action_space.sample()
    >>> Q.w = Q.w + 2 * Q.w      # w <-- 3*w
    >>> Q.get_optimal_action(s)  # Get the optimal action in state s

.. note::
    Depending on how :math:`x(s,a)` is defined, the linear encoder can behave very differently. I have therefore included a few different classes in ``irlc.ex11.feature_encoder`` which differ only in how :math:`x(s,a)` is computed. I have chosen to focus this guide on the linear tile-encoder, which is used in the MountainCar environment and is the main example in (:cite:t:`sutton`). The API for the other classes is entirely similar.
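Putting the two examples together, a semi-gradient update can be written directly in terms of ``Q.w`` and ``Q.x``: since :math:`Q(s,a) = x(s,a)^\top w`, the gradient of :math:`Q(s,a)` with respect to :math:`w` is simply :math:`x(s,a)`. The following is a minimal sketch of one episode of semi-gradient Sarsa written against this API; it assumes ``Q.x(s, a)`` returns a NumPy array of the same shape as ``Q.w``, and the hyperparameters are illustrative values, not those used in the exercises.

.. code-block:: python

    import numpy as np
    import gymnasium as gym
    from irlc.ex11.feature_encoder import LinearQEncoder

    env = gym.make('MountainCar-v0')
    Q = LinearQEncoder(env, tilings=8)
    alpha, gamma, epsilon = 0.5 / 8, 1.0, 0.05  # illustrative values only

    def eps_greedy(s):
        # Random action with probability epsilon, otherwise the greedy action.
        return env.action_space.sample() if np.random.rand() < epsilon else Q.get_optimal_action(s)

    s, _ = env.reset()
    a = eps_greedy(s)
    done = False
    while not done:
        sp, r, terminated, truncated, _ = env.step(a)
        ap = eps_greedy(sp)
        # Sarsa target; drop the bootstrap term when the episode terminates.
        target = r if terminated else r + gamma * Q(sp, ap)
        # Semi-gradient step: grad_w Q(s,a) = x(s,a) for the linear encoder.
        Q.w = Q.w + alpha * (target - Q(s, a)) * Q.x(s, a)
        s, a = sp, ap
        done = terminated or truncated
    env.close()

For semi-gradient Q-learning (cf. Problem 11.3), only the target changes: :math:`Q(s', a')` is replaced by :math:`\max_{a'} Q(s', a')`.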
Classes and functions
-------------------------------------------------------------------------

.. autoclass:: irlc.ex11.feature_encoder.FeatureEncoder
    :show-inheritance:
    :members:

.. autoclass:: irlc.ex11.feature_encoder.LinearQEncoder
    :show-inheritance:
    :members:

Solutions to selected exercises
-------------------------------------------------------------------------------------------------------

{% if show_solution[week] %}
.. admonition:: Solution to the conceptual exam problem
    :class: dropdown

    **Part a:** Since the immediate reward is zero, the next :math:`Q`-value will be determined by the :math:`Q`-value associated with the state north of the agent and the action the agent generates in that state:

    .. math::
        Q(s,a) = Q(s,a) + \alpha (r + \gamma Q(s', a') - Q(s,a) )

    If the exploration rate is non-zero, all actions :math:`a'` may occur, giving rise to two different new values. This means the :math:`Q`-value can be updated to:

    .. math::
        Q(s, \texttt{North}) = 0.0 \quad \text{or} \quad Q(s, \texttt{North}) = 0.432

    **Part b:** It is evident that we need to propagate the :math:`Q`-value from the northern square to the :math:`Q`-value we wish to update.
    To do this, we first go :math:`\texttt{north}`, but then, to change that :math:`Q`-value, we must select :math:`\texttt{east}` in that state. We then backtrack (:math:`\texttt{west}`, :math:`\texttt{south}`), so that Pacman is back in the original state, and a single step :math:`\texttt{east}` means Pacman updates the green :math:`Q`-value. It can be updated to a non-zero value since the next action generated by Sarsa can select the :math:`Q`-value associated with the red square. The answer is therefore 5, and the actions are:

    .. math::
        \texttt{north}, \texttt{east}, \texttt{west}, \texttt{south}, \texttt{east}

    **Part c:** After convergence, Sarsa will have learned the :math:`Q`-values associated with the :math:`\varepsilon`-soft policy :math:`\pi`, i.e. it converges to :math:`q_\pi`. The policy will clearly attempt to move the agent towards the goal square with a :math:`+1` reward, and since :math:`\gamma<1` it will attempt to do so quickly. The fastest way to do that is either north or south of the central pillar. The southern route, associated with :math:`Q_e`, takes the agent next to the dangerous square with a :math:`-1` reward. There is a chance of at least :math:`\frac{\varepsilon}{4}` of randomly falling into that square under the :math:`\varepsilon`-soft policy. We can therefore conclude that this path is far more dangerous, and this must be reflected in the :math:`Q`-values. Hence, :math:`Q_e < Q_n`. Note that this will not be true for :math:`Q`-learning, where the two paths are valued the same. This argument is similar to the cliff-walking example we saw in the exercises and in (:cite:t:`sutton`).
{% endif %}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=28374cb3-7857-4dfe-9326-afea011f039a', 'Problem 11.1: Q-learning agent', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=e87906f1-ce7b-44c0-8d03-afea0124acd6', 'Problem 11.2: Sarsa-learning agent', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=32fbdba5-3ba5-4c8f-a956-afea0129b143', 'Problem 11.3: Semi-gradient Q-agent', True) }}