{% import macros as m with context %}
{% set week = 13 %}
{{ m.exercise_head(week) }}

Deep Q-learning
-------------------------------------------------------------------------
To help you implement deep Q-learning, I have provided a couple of helper classes.

The replay buffer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The replay buffer, :class:`~irlc.ex13.buffer.BasicBuffer`, is basically a list that holds consecutive observations :math:`(s_t, a_t, r_{t+1}, s_{t+1})`. It has a function to push experience into the buffer and a function to sample a batch from the buffer:

.. literalinclude:: ../../shared/output/deepq_agent_buffer_b_stripped.py

The deep network
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The second helper class represents the :math:`Q`-network; it is this class which actually does all the deep learning. You can find a description in :class:`~irlc.ex13.dqn_network.DQNNetwork`. Let's say the state has dimension :math:`n`. The :math:`Q`-network accepts a tensor of shape ``batch_size x n`` and returns a tensor of shape ``batch_size x actions``. An example:

.. runblock:: pycon

    >>> from irlc.ex13.torch_networks import TorchNetwork
    >>> import gymnasium as gym
    >>> import numpy as np
    >>> env = gym.make("CartPole-v1")
    >>> Q = TorchNetwork(env, trainable=True, learning_rate=0.001) # DQN network requires an env to set network dimensions
    >>> batch_size = 32 # As an example
    >>> states = np.random.rand(batch_size, env.observation_space.shape[0]) # Creates some dummy input
    >>> states.shape # batch_size x n
    >>> qvals = Q(states) # Evaluate Q(s,a)
    >>> qvals.shape # This is a tensor of dimension batch_size x actions
    >>> print(qvals[0,1]) # Get Q(s_0, 1)
    >>> Y = np.random.rand(batch_size, env.action_space.n) # Generate target Q-values (training data)
    >>> Q.fit(states, Y) # Train the Q-network for 1 gradient descent step

Finally, to implement double-:math:`Q` learning we have to adapt the weights of one network towards those of another. This can be done using the method :func:`~irlc.ex13.dqn_network.DQNNetwork.update_Phi`, which computes

.. math::
    w_i \leftarrow w_i + \tau (w'_i - w_i)

An example:

.. literalinclude:: ../../shared/output/double_deepq_agent_target_stripped.py

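
To see how these pieces fit together, the sketch below implements the same three ideas without the irlc helper classes: a small replay buffer with ``push``/``sample``, one gradient-descent step of the :math:`Q`-network towards a batch of targets (what ``fit`` does), and the soft update :math:`w_i \leftarrow w_i + \tau (w'_i - w_i)` computed by ``update_Phi``. It is only an illustrative sketch in plain PyTorch, not the implementation used in the exercises, and the names (``ReplayBuffer``, ``soft_update``, ``make_qnet``) as well as the values of :math:`\tau`, the learning rate and the network size are arbitrary choices made for the example.

.. code-block:: python

    import random
    from collections import deque
    import numpy as np
    import torch
    import torch.nn as nn
    import gymnasium as gym

    class ReplayBuffer:
        """Minimal replay buffer: stores (s, a, r, s', done) tuples and samples random batches."""
        def __init__(self, capacity=10_000):
            self.memory = deque(maxlen=capacity)

        def push(self, s, a, r, sp, done):
            self.memory.append((s, a, r, sp, done))

        def sample(self, batch_size):
            batch = random.sample(self.memory, batch_size)
            s, a, r, sp, done = map(np.array, zip(*batch))
            return s, a, r, sp, done

    def soft_update(target, source, tau):
        """Soft update of the target network: w_i <- w_i + tau * (w'_i - w_i)."""
        for wt, ws in zip(target.parameters(), source.parameters()):
            wt.data.add_(tau * (ws.data - wt.data))

    def make_qnet(n, actions):
        """A tiny Q-network mapping a batch of states (batch_size x n) to Q-values (batch_size x actions)."""
        return nn.Sequential(nn.Linear(n, 64), nn.ReLU(), nn.Linear(64, actions))

    env = gym.make("CartPole-v1")
    n, actions = env.observation_space.shape[0], env.action_space.n
    Q, Q_target = make_qnet(n, actions), make_qnet(n, actions)
    Q_target.load_state_dict(Q.state_dict())   # start the two networks with identical weights
    optimizer = torch.optim.Adam(Q.parameters(), lr=1e-3)
    buffer, gamma, tau, batch_size = ReplayBuffer(), 0.99, 0.08, 32

    # Collect a bit of experience with a random policy and push it into the buffer.
    s, _ = env.reset(seed=0)
    for _ in range(200):
        a = env.action_space.sample()
        sp, r, terminated, truncated, _ = env.step(a)
        buffer.push(s, a, r, sp, terminated)
        s = sp if not (terminated or truncated) else env.reset()[0]

    # One DQN gradient step on a sampled batch, followed by a soft update of the target network.
    s_b, a_b, r_b, sp_b, done_b = buffer.sample(batch_size)
    s_b = torch.as_tensor(s_b, dtype=torch.float32)
    sp_b = torch.as_tensor(sp_b, dtype=torch.float32)
    with torch.no_grad():
        target = torch.as_tensor(r_b, dtype=torch.float32) \
                 + gamma * (1 - torch.as_tensor(done_b, dtype=torch.float32)) * Q_target(sp_b).max(dim=1).values
    q_sa = Q(s_b)[torch.arange(batch_size), torch.as_tensor(a_b)]  # Q(s_i, a_i) for each transition in the batch
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    soft_update(Q_target, Q, tau)

Setting :math:`\tau = 1` in the last line would copy the weights outright; smaller values of :math:`\tau` move the target network slowly towards the trained network, which is what the update rule above does.
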
Classes and functions
-------------------------------------------------------------------------
.. autoclass:: irlc.ex13.dqn_network.DQNNetwork
    :members:

.. autoclass:: irlc.ex13.buffer.BasicBuffer
    :members:

Solutions to selected exercises
-------------------------------------------------------------------------------------------------------

{% if show_solution[week] %}
.. admonition:: Solution to the conceptual problem 13.1
    :class: dropdown

    Recall that according to the Bellman equations, the optimal value function satisfies

    .. math::
        v_*(s) = \max_a \mathbb{E}\left[ R_{t+1} + \gamma v_*(S_{t+1}) | S_t=s, A_t = a\right]

    and if a function satisfies this relationship, it must be the optimal value function. We now note that

    .. math::
        \max_a Q(s,a) = h(s) + \max_a g(s,a) - \max_a g(s,a) = h(s)

    Therefore, if we take :math:`\max_a` on both sides of

    .. math::
        Q(s,a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') | S_t=s, A_t = a\right]

    we get

    .. math::
        h(s) = \max_a \mathbb{E}\left[ R_{t+1} + \gamma h(S_{t+1}) | S_t=s, A_t = a\right]

    and so :math:`h(s)` must be the optimal value function.
{% endif %}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=a90a495e-39fb-4a05-a8d2-aff100dab923', 'Problem 13.1: Dyna-Q', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=467b39f5-5e7d-420e-90bc-aff100dfee43', 'Problem 13.2: Tabular double-Q', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=cc90b190-cbf7-4747-8e67-aff100e9e95d', 'Problem 13.3: DQN', True) }}