{% import macros as m with context %}
{% set week = 13 %}
{{ m.exercise_head(week) }}

Deep Q-learning
-------------------------------------------------------------------------
To help you implement deep Q-learning, I have provided a couple of helper classes.

The replay buffer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The replay buffer, :class:`~irlc.ex13.buffer.BasicBuffer`, is basically a list that holds consecutive observations :math:`(s_t, a_t, r_{t+1}, s_{t+1})`. It has a function to push experience into the buffer and a function to sample a batch from the buffer:

.. literalinclude:: ../../shared/output/deepq_agent_buffer_b_stripped.py

The deep network
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The second helper class represents the :math:`Q`-network; it is this class which actually does all the deep learning. You can find a description in :class:`~irlc.ex13.dqn_network.DQNNetwork`. Let's say the state has dimension :math:`n`. The :math:`Q`-network accepts a tensor of shape ``batch_size x n`` and returns a tensor of shape ``batch_size x actions``. An example:

.. runblock:: pycon

    >>> from irlc.ex13.torch_networks import TorchNetwork
    >>> import gymnasium as gym
    >>> import numpy as np
    >>> env = gym.make("CartPole-v1")
    >>> Q = TorchNetwork(env, trainable=True, learning_rate=0.001) # DQN network requires an env to set network dimensions
    >>> batch_size = 32 # As an example
    >>> states = np.random.rand(batch_size, env.observation_space.shape[0]) # Creates some dummy input
    >>> states.shape # batch_size x n
    >>> qvals = Q(states) # Evaluate Q(s,a)
    >>> qvals.shape # This is a tensor of dimension batch_size x actions
    >>> print(qvals[0,1]) # Get Q(s_0, 1)
    >>> Y = np.random.rand(batch_size, env.action_space.n) # Generate target Q-values (training data)
    >>> Q.fit(states, Y) # Train the Q-network for 1 gradient descent step

Finally, to implement double-:math:`Q` learning we have to adapt the weights of one network towards those of another. This can be done using the method :func:`~irlc.ex13.dqn_network.DQNNetwork.update_Phi`, which computes

.. math::
    w_i \leftarrow w_i + \tau (w'_i - w_i)

An example:

.. literalinclude:: ../../shared/output/double_deepq_agent_target_stripped.py

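
To see how these pieces fit together, the sketch below implements the same three ideas without the irlc helper classes: a small replay buffer with ``push``/``sample``, one gradient-descent step of the :math:`Q`-network towards a batch of targets (what ``fit`` does), and the soft update :math:`w_i \leftarrow w_i + \tau (w'_i - w_i)` computed by ``update_Phi``. It is only an illustrative sketch in plain PyTorch, not the implementation used in the exercises, and the names (``ReplayBuffer``, ``soft_update``, ``make_qnet``) as well as the values of :math:`\tau`, the learning rate and the network size are arbitrary choices made for the example.

.. code-block:: python

    import random
    from collections import deque
    import numpy as np
    import torch
    import torch.nn as nn
    import gymnasium as gym

    class ReplayBuffer:
        """Minimal replay buffer: stores (s, a, r, s', done) tuples and samples random batches."""
        def __init__(self, capacity=10_000):
            self.memory = deque(maxlen=capacity)

        def push(self, s, a, r, sp, done):
            self.memory.append((s, a, r, sp, done))

        def sample(self, batch_size):
            batch = random.sample(self.memory, batch_size)
            s, a, r, sp, done = map(np.array, zip(*batch))
            return s, a, r, sp, done

    def soft_update(target, source, tau):
        """Soft update of the target network: w_i <- w_i + tau * (w'_i - w_i)."""
        for wt, ws in zip(target.parameters(), source.parameters()):
            wt.data.add_(tau * (ws.data - wt.data))

    def make_qnet(n, actions):
        """A tiny Q-network mapping a batch of states (batch_size x n) to Q-values (batch_size x actions)."""
        return nn.Sequential(nn.Linear(n, 64), nn.ReLU(), nn.Linear(64, actions))

    env = gym.make("CartPole-v1")
    n, actions = env.observation_space.shape[0], env.action_space.n
    Q, Q_target = make_qnet(n, actions), make_qnet(n, actions)
    Q_target.load_state_dict(Q.state_dict())   # start the two networks with identical weights
    optimizer = torch.optim.Adam(Q.parameters(), lr=1e-3)
    buffer, gamma, tau, batch_size = ReplayBuffer(), 0.99, 0.08, 32

    # Collect a bit of experience with a random policy and push it into the buffer.
    s, _ = env.reset(seed=0)
    for _ in range(200):
        a = env.action_space.sample()
        sp, r, terminated, truncated, _ = env.step(a)
        buffer.push(s, a, r, sp, terminated)
        s = sp if not (terminated or truncated) else env.reset()[0]

    # One DQN gradient step on a sampled batch, followed by a soft update of the target network.
    s_b, a_b, r_b, sp_b, done_b = buffer.sample(batch_size)
    s_b = torch.as_tensor(s_b, dtype=torch.float32)
    sp_b = torch.as_tensor(sp_b, dtype=torch.float32)
    with torch.no_grad():
        target = torch.as_tensor(r_b, dtype=torch.float32) \
                 + gamma * (1 - torch.as_tensor(done_b, dtype=torch.float32)) * Q_target(sp_b).max(dim=1).values
    q_sa = Q(s_b)[torch.arange(batch_size), torch.as_tensor(a_b)]  # Q(s_i, a_i) for each transition in the batch
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    soft_update(Q_target, Q, tau)

Setting :math:`\tau = 1` in the last line would copy the weights outright; smaller values of :math:`\tau` move the target network slowly towards the trained network, which is what the update rule above does.
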
Classes and functions
-------------------------------------------------------------------------
.. autoclass:: irlc.ex13.dqn_network.DQNNetwork
    :members:

.. autoclass:: irlc.ex13.buffer.BasicBuffer
    :members:

Solutions to selected exercises
-------------------------------------------------------------------------------------------------------

{% if show_solution[week] %}
.. admonition:: Solution to the conceptual problem 13.1
    :class: dropdown

    Recall that according to the Bellman equations, the optimal value function satisfies

    .. math::
        v_*(s) = \max_a \mathbb{E}\left[ R_{t+1} + \gamma v_*(S_{t+1}) | S_t=s, A_t = a\right]

    and if a function satisfies this relationship, it must be the optimal value function. We now note that

    .. math::
        \max_a Q(s,a) = h(s) + \max_a g(s,a) - \max_a g(s,a) = h(s)

    Therefore, if we take :math:`\max_a` on both sides of

    .. math::
        Q(s,a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') | S_t=s, A_t = a\right]

    we get

    .. math::
        h(s) = \max_a \mathbb{E}\left[ R_{t+1} + \gamma h(S_{t+1}) | S_t=s, A_t = a\right]

    and so :math:`h(s)` must be the optimal value function.
{% endif %}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=a90a495e-39fb-4a05-a8d2-aff100dab923', 'Problem 13.1: Dyna-Q', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=467b39f5-5e7d-420e-90bc-aff100dfee43', 'Problem 13.2: Tabular double-Q', True) }}

{{ m.embed('https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=cc90b190-cbf7-4747-8e67-aff100e9e95d', 'Problem 13.3: DQN', True) }}