Exercise 13: Deep Q-learning#
Note
The exercise material is divided into general information (found on this page) and the actual exercise instructions. You can download this week's exercise instructions here:
You are encouraged to prepare homework problem 1 (indicated by a hand in the PDF file) at home and present your solution during the exercise session.
To get the newest version of the course material, please see Making sure your files are up to date
Deep Q-learning#
To help you implement deep Q-learning, I have provided a couple of helper classes.
The replay buffer#
The replay buffer, BasicBuffer, is essentially a list that holds consecutive transitions \((s_t, a_t, r_{t+1}, s_{t+1})\).
It has a function to push experience into the buffer, and a function to sample a batch from the buffer:
# deepq_agent.py
self.memory = BasicBuffer(replay_buffer_size) if buffer is None else buffer
self.memory.push(s, a, r, sp, done) # save current observation
""" First we sample from replay buffer. Returns numpy Arrays of dimension
> [self.batch_size] x [...]]
for instance 'a' will be of dimension [self.batch_size x 1].
"""
s,a,r,sp,done = self.memory.sample(self.batch_size)
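As a point of reference, here is a minimal sketch of how the buffer could be filled from environment interaction and then sampled. It only uses the BasicBuffer methods documented below; the variable names and the random-action loop are my own, and a real agent would act according to its (e.g. epsilon-greedy) policy rather than at random.
import gymnasium as gym
from irlc.ex13.buffer import BasicBuffer

env = gym.make("CartPole-v1")
memory = BasicBuffer(max_size=2000)
s, info = env.reset()
for _ in range(200):                                 # collect a bit of experience
    a = env.action_space.sample()                    # stand-in for the agent's policy
    sp, r, terminated, truncated, info = env.step(a)
    done = terminated or truncated
    memory.push(s, a, r, sp, done)                   # store (s_t, a_t, r_{t+1}, s_{t+1}, done)
    s = sp if not done else env.reset()[0]
s_batch, a_batch, r_batch, sp_batch, done_batch = memory.sample(32)
print(s_batch.shape)                                 # (32, 4), i.e. batch_size x n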
The deep network#
The second helper class represents the \(Q\)-network; it is this class which actually does all the deep learning. You can find a description in DQNNetwork.
Let's say the state has dimension \(n\). The \(Q\)-network accepts a tensor of shape batch_size x n and returns a tensor of shape batch_size x actions. An example:
>>> from irlc.ex13.torch_networks import TorchNetwork
>>> import gymnasium as gym
>>> import numpy as np
>>> env = gym.make("CartPole-v1")
>>> Q = TorchNetwork(env, trainable=True, learning_rate=0.001) # DQN network requires an env to set network dimensions
>>> batch_size = 32 # As an example
>>> states = np.random.rand(batch_size, env.observation_space.shape[0]) # Creates some dummy input
>>> states.shape # batch_size x n
(32, 4)
>>> qvals = Q(states) # Evaluate Q(s,a)
>>> qvals.shape # This is a tensor of dimension batch_size x actions
(32, 2)
>>> print(qvals[0,1]) # Get Q(s_0, 1)
-0.043145366
>>> Y = np.random.rand(batch_size, env.action_space.n) # Generate target Q-values (training data)
>>> Q.fit(states, Y) # Train the Q-network for 1 gradient descent step
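Continuing the example above, the sketch below shows one way the standard Q-learning target \(y = r + \gamma \max_{a'} Q(s', a')\) could be turned into training data for Q.fit(). The variable names gamma, next_states, rewards and actions, as well as the dummy data, are my own, and I assume, as the shapes above suggest, that Q(...) returns a writable numpy array.
gamma = 0.99                                                               # discount factor (illustrative value)
next_states = np.random.rand(batch_size, env.observation_space.shape[0])  # dummy next states s'
rewards = np.random.rand(batch_size)                                       # dummy rewards r
actions = np.random.randint(env.action_space.n, size=batch_size)           # dummy actions a
Y = Q(states)                                          # start from the network's current predictions
y = rewards + gamma * np.max(Q(next_states), axis=1)   # TD target for the taken actions
Y[np.arange(batch_size), actions] = y                  # only the chosen actions get a new target
Q.fit(states, Y)                                       # one gradient-descent step towards Y
In a real agent the bootstrap term \(\gamma \max_{a'} Q(s', a')\) would also be zeroed out for terminal states, using the done flag stored in the replay buffer.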
Finally, to implement double-\(Q\) learning we have to adapt the weights in one network towards those in another. This can be done using the method update_Phi(), which performs the Polyak update
\[w_i \leftarrow w_i + \tau (w'_i - w_i)\]
for each weight \(w_i\) in this network and corresponding weight \(w'_i\) in the source network. An example:
# double_deepq_agent.py
self.target.update_Phi(self.Q, tau=self.tau)
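To see how the pieces fit together, here is a sketch of one double-Q training step (not the provided agent implementation, just an illustration using the methods described on this page). Q, target and memory are assumed to behave as the DQNNetwork and BasicBuffer classes documented below, and gamma, batch_size and tau are illustrative values.
import numpy as np

def double_q_train_step(Q, target, memory, batch_size=32, gamma=0.99, tau=0.01):
    s, a, r, sp, done = memory.sample(batch_size)
    a_star = np.argmax(Q(sp), axis=1)                       # online network picks the greedy action
    q_sp = target(sp)[np.arange(batch_size), a_star]        # target network evaluates that action
    y = r.squeeze() + gamma * q_sp * (1 - done.squeeze())   # no bootstrapping from terminal states
    Y = Q(s)                                                # only the taken actions get a new target
    Y[np.arange(batch_size), a.squeeze().astype(int)] = y
    Q.fit(s, Y)                                             # gradient step on the online network
    target.update_Phi(Q, tau=tau)                           # the target slowly tracks the online network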
Classes and functions#
- class irlc.ex13.dqn_network.DQNNetwork[source]#
A class representing a deep Q-network. Note that the network is batched, i.e. s is assumed to be a numpy array of dimension batch_size x n. The following example shows how you can evaluate the Q-values in a given state:
>>> from irlc.ex13.torch_networks import TorchNetwork
>>> import gymnasium as gym
>>> import numpy as np
>>> env = gym.make("CartPole-v1")
>>> Q = TorchNetwork(env, trainable=True, learning_rate=0.001) # DQN network requires an env to set network dimensions
>>> batch_size = 32 # As an example
>>> states = np.random.rand(batch_size, env.observation_space.shape[0]) # Creates some dummy input
>>> states.shape # batch_size x n
(32, 4)
>>> qvals = Q(states) # Evaluate Q(s,a)
>>> qvals.shape # This is a tensor of dimension batch_size x actions
(32, 2)
>>> print(qvals[0,1]) # Get Q(s_0, 1)
-0.12550023
>>> Y = np.random.rand(batch_size, env.action_space.n) # Generate target Q-values (training data)
>>> Q.fit(states, Y) # Train the Q-network for 1 gradient descent step
- update_Phi(source, tau=0.01)[source]#
Update (adapt) the weights in this network towards those in source by a small amount.
For each weight \(w_i\) in (this) network, and each corresponding weight \(w'_i\) in the source network, the following Polyak update is performed:
\[w_i \leftarrow w_i + \tau (w'_i - w_i)\]
- Parameters:
source – The network to update the weights towards
tau – Update rate (rate of change \(\tau\))
- Returns:
None
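As a quick numerical illustration of the update (not part of the library), with \(\tau = 0.01\) each weight moves one percent of the way towards the corresponding source weight:
import numpy as np
tau = 0.01
w = np.array([0.5, -1.0])      # weights in this (target) network
wp = np.array([1.0, 0.0])      # corresponding weights in the source network
w = w + tau * (wp - w)         # Polyak update: move a fraction tau towards the source
print(w)                       # approximately [ 0.505 -0.99 ]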
- class irlc.ex13.buffer.BasicBuffer(max_size=2000)[source]#
The buffer class is used to keep track of past experience and sample it for learning.
- __init__(max_size=2000)[source]#
Creates a new (empty) buffer.
- Parameters:
max_size – Maximum number of elements in the buffer. This should be a large number, e.g. 100,000.
- push(state, action, reward, next_state, done)[source]#
Add information from a single step, \((s_t, a_t, r_{t+1}, s_{t+1}, \text{done})\) to the buffer.
>>> import gymnasium as gym
>>> from irlc.ex13.buffer import BasicBuffer
>>> env = gym.make("CartPole-v1")
>>> b = BasicBuffer()
>>> s, info = env.reset()
>>> a = env.action_space.sample()
>>> sp, r, done, _, info = env.step(a)
>>> b.push(s, a, r, sp, done)
>>> len(b) # Get number of elements in buffer
1
- Parameters:
state – A state \(s_t\)
action – Action taken \(a_t\)
reward – Reward obtained \(r_{t+1}\)
next_state – Next state transitioned to \(s_{t+1}\)
done –
True
if the environment terminated elseFalse
- Returns:
None
- sample(batch_size)[source]#
Sample batch_size elements from the buffer for use in training a deep Q-learning method. The elements returned will all be numpy ndarrays where the first dimension is the batch dimension, i.e. of size batch_size.
>>> import gymnasium as gym
>>> from irlc.ex13.buffer import BasicBuffer
>>> env = gym.make("CartPole-v1")
>>> b = BasicBuffer()
>>> s, info = env.reset()
>>> a = env.action_space.sample()
>>> sp, r, done, _, _ = env.step(a)
>>> b.push(s, a, r, sp, done)
>>> S, A, R, SP, DONE = b.sample(batch_size=32)
>>> S.shape # Dimension batch_size x n
(32, 4)
>>> R.shape # Dimension batch_size x 1
(32, 1)
- Parameters:
batch_size – Number of elements to sample
- Returns:
S - Matrix of size batch_size x n of sampled states
A - Matrix of size batch_size x 1 of sampled actions
R - Matrix of size batch_size x 1 of sampled rewards
SP - Matrix of size batch_size x n of sampled states transitioned to
DONE - Matrix of size batch_size x 1 of bools indicating if the environment terminated
Solutions to selected exercises#
Problem 13.1: Dyna-Q
Problem 13.2: Tabular double-Q
Problem 13.3: DQN