Exercise 13: Q-learning and deep-Q learning#

Note

  • The exercise material is divided into general information (found on this page) and the actual exercise instructions. You can download this week's exercise instructions from here:

  • You are encouraged to prepare homework problem 1 (indicated by a hand symbol in the PDF file) at home and present your solution during the exercise session.

  • To get the newest version of the course material, please see Making sure your files are up to date

Deep Q-learning#

To help you implement deep Q-learning, I have provided a couple of helper classes.

The replay buffer#

The replay buffer, BasicBuffer, is essentially a list that holds past transitions \((s_t, a_t, r_{t+1}, s_{t+1})\) together with a done-flag. It has a function to push experience into the buffer and a function to sample a batch from the buffer:

# deepq_agent.py
self.memory = BasicBuffer(replay_buffer_size) if buffer is None else buffer 
self.memory.push(s, a, r, sp, done) # save current observation 
""" First we sample from replay buffer. Returns numpy Arrays of dimension 
> [self.batch_size] x [...]]
for instance 'a' will be of dimension [self.batch_size x 1]. 
"""
s,a,r,sp,done = self.memory.sample(self.batch_size) 
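
If you want intuition for what BasicBuffer does, a minimal replay buffer can be sketched with a bounded deque. The class below is only an illustrative stand-in (the name, the uniform sampling scheme and the output shapes are assumptions based on the description above), not the provided implementation:

# Minimal replay-buffer sketch (illustrative only; the provided BasicBuffer may differ).
import random
from collections import deque
import numpy as np

class TinyBuffer:
    def __init__(self, max_size=2000):
        self.buffer = deque(maxlen=max_size)  # oldest experience is dropped automatically

    def push(self, s, a, r, sp, done):
        self.buffer.append((s, a, r, sp, done))

    def sample(self, batch_size):
        batch = random.sample(list(self.buffer), batch_size)  # uniform sampling without replacement
        s, a, r, sp, done = map(np.asarray, zip(*batch))
        # Reshape the scalar entries to batch_size x 1 to mimic the documented output shapes.
        return s, a.reshape(-1, 1), r.reshape(-1, 1), sp, done.reshape(-1, 1)

Using a deque with maxlen means the oldest transitions are discarded automatically once the buffer is full.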

The deep network#

The second helper class represents the \(Q\)-network, and it is this class that actually does all the deep learning. You can find a description in DQNNetwork.

Let's say the state has dimension \(n\). The \(Q\)-network accepts a tensor of shape batch_size x n and returns a tensor of shape batch_size x actions. An example:

>>> from irlc.ex13.torch_networks import TorchNetwork
>>> import gymnasium as gym
>>> import numpy as np
>>> env = gym.make("CartPole-v1")
>>> Q = TorchNetwork(env, trainable=True, learning_rate=0.001) # DQN network requires an env to set network dimensions
>>> batch_size = 32 # As an example
>>> states = np.random.rand(batch_size, env.observation_space.shape[0]) # Creates some dummy input
>>> states.shape    # batch_size x n
(32, 4)
>>> qvals = Q(states) # Evaluate Q(s,a)
>>> qvals.shape # This is a tensor of dimension batch_size x actions
(32, 2)
>>> print(qvals[0,1]) # Get Q(s_0, 1)
-0.22357672
>>> Y = np.random.rand(batch_size, env.action_space.n) # Generate target Q-values (training data)
>>> Q.fit(states, Y)                      # Train the Q-network for 1 gradient descent step
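
Combining the buffer and the network, a single DQN update step could look roughly like the sketch below. It assumes numpy is imported as np, that the agent stores its network and buffer as self.Q and self.memory as in the snippets above, and it uses a made-up discount factor; the actual recipe in deepq_agent.py may differ. Since fit expects a full batch_size x actions target matrix, we copy the current predictions and only overwrite the entries of the actions that were actually taken:

# Sketch of one DQN update step (assumed structure; see deepq_agent.py for the actual agent).
gamma = 0.99                                     # discount factor; the value used in the exercise may differ
s, a, r, sp, done = self.memory.sample(self.batch_size)
target = np.array(self.Q(s))                     # start from the current predictions, batch_size x actions
q_next = np.max(self.Q(sp), axis=1)              # max_a' Q(s', a'), one value per batch element
idx = np.arange(self.batch_size)
# Only the entries for the actions actually taken are changed; done masks out terminal transitions.
target[idx, a.squeeze().astype(int)] = r.squeeze() + gamma * (1 - done.squeeze()) * q_next
self.Q.fit(s, target)                            # one gradient step towards the targets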

Finally, to implement double-\(Q\) learning we have to slowly adapt the weights of one network towards those of another. This can be done using the method update_Phi(), which computes:

\[w_i \leftarrow w_i + \tau (w'_i - w_i)\]

An example:

# double_deepq_agent.py
self.target.update_Phi(self.Q, tau=self.tau)  # Polyak-update the target network towards the online network self.Q
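
The idea of the double-\(Q\) construction is that the online network self.Q selects the greedy action while the slowly-updated copy self.target evaluates it, which reduces the overestimation bias of the plain max-based target. Below is a sketch of the resulting target, reusing the names and the assumed discount factor from the DQN sketch above; the agent in double_deepq_agent.py may organize this differently:

# Sketch of the double-Q target (illustrative; names follow the snippets above).
idx = np.arange(self.batch_size)
a_star = np.argmax(self.Q(sp), axis=1)            # greedy action chosen by the online network
q_eval = np.array(self.target(sp))[idx, a_star]   # ...but evaluated by the target network
y = r.squeeze() + gamma * (1 - done.squeeze()) * q_eval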

Classes and functions#

class irlc.ex13.dqn_network.DQNNetwork[source]#

A class representing a deep Q-network. Note that the network is batched, i.e. s is assumed to be a numpy array of dimension batch_size x n.

The following example shows how you can evaluate the Q-values in a given state:

>>> from irlc.ex13.torch_networks import TorchNetwork
>>> import gymnasium as gym
>>> import numpy as np
>>> env = gym.make("CartPole-v1")
>>> Q = TorchNetwork(env, trainable=True, learning_rate=0.001) # DQN network requires an env to set network dimensions
>>> batch_size = 32 # As an example
>>> states = np.random.rand(batch_size, env.observation_space.shape[0]) # Creates some dummy input
>>> states.shape    # batch_size x n
(32, 4)
>>> qvals = Q(states) # Evaluate Q(s,a)
>>> qvals.shape # This is a tensor of dimension batch_size x actions
(32, 2)
>>> print(qvals[0,1]) # Get Q(s_0, 1)
0.011426022
>>> Y = np.random.rand(batch_size, env.action_space.n) # Generate target Q-values (training data)
>>> Q.fit(states, Y)                      # Train the Q-network for 1 gradient descent step

update_Phi(source, tau=0.01)[source]#

Updates (adapts) the weights in this network towards those in source by a small amount.

For each weight \(w_i\) in (this) network, and each corresponding weight \(w'_i\) in the source network, the following Polyak update is performed:

\[w_i \leftarrow w_i + \tau (w'_i - w_i)\]
Parameters:
  • source – The network whose weights this network is updated towards

  • tau – Update rate (the rate of change \(\tau\))

Returns:

None
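
For intuition, the Polyak update above can be written in a few lines of PyTorch. This is only a sketch of the formula, not necessarily how update_Phi is implemented internally:

# A possible Polyak update between two torch modules (not necessarily how update_Phi does it).
import torch

def polyak_update(target_net: torch.nn.Module, source_net: torch.nn.Module, tau: float = 0.01) -> None:
    with torch.no_grad():
        for w, w_src in zip(target_net.parameters(), source_net.parameters()):
            w.add_(tau * (w_src - w))  # w <- w + tau * (w' - w)

With a small \(\tau\), the target network trails the online network slowly, which keeps the regression targets stable during training.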

fit(s, target)[source]#

Fit the network weights by minimizing

\[\frac{1}{B}\sum_{i=1}^B \sum_{a=1}^K \| q_\phi(s_i)_a - y_{i,a} \|^2\]

where target corresponds to \(y\) and is a [batch_size x actions] matrix of target Q-values.

Parameters:
  • s – Batch of states, a numpy array of dimension batch_size x n

  • target – Matrix of target Q-values \(y\) of dimension batch_size x actions

Returns:

None
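
As a sanity check, the loss can be evaluated by hand. The numpy snippet below computes the displayed expression for a tiny made-up batch with \(B = 2\) states and \(K = 2\) actions:

# Evaluating the displayed loss by hand for B = 2 states and K = 2 actions.
import numpy as np
q = np.array([[1.0, 0.5],    # q_phi(s_1)
              [0.2, 0.3]])   # q_phi(s_2)
y = np.array([[1.2, 0.5],    # targets y_1
              [0.0, 0.3]])   # targets y_2
loss = np.mean(np.sum((q - y) ** 2, axis=1))   # (1/B) sum_i sum_a (q_phi(s_i)_a - y_ia)^2
print(loss)                                    # ~0.04 = ((-0.2)^2 + 0.2^2) / 2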

class irlc.ex13.buffer.BasicBuffer(max_size=2000)[source]#

The buffer class is used to keep track of past experience and sample it for learning.

__init__(max_size=2000)[source]#

Creates a new (empty) buffer.

Parameters:

max_size – Maximum number of elements in the buffer. This should be a large number like 100,000.

push(state, action, reward, next_state, done)[source]#

Add information from a single step, \((s_t, a_t, r_{t+1}, s_{t+1}, \text{done})\) to the buffer.

>>> import gymnasium as gym
>>> from irlc.ex13.buffer import BasicBuffer
>>> env = gym.make("CartPole-v1")
>>> b = BasicBuffer()
>>> s, info = env.reset()
>>> a = env.action_space.sample()
>>> sp, r, done, _, info = env.step(a)
>>> b.push(s, a, r, sp, done)
>>> len(b) # Get number of elements in buffer
1
Parameters:
  • state – A state \(s_t\)

  • action – Action taken \(a_t\)

  • reward – Reward obtained \(r_{t+1}\)

  • next_state – Next state transitioned to \(s_{t+1}\)

  • done – True if the environment terminated, else False

Returns:

None

sample(batch_size)[source]#

Sample batch_size elements from the buffer for use in training a deep Q-learning method. The elements returned will all be numpy ndarrays where the first dimension is the batch dimension, i.e. of size batch_size.

>>> import gymnasium as gym
>>> from irlc.ex13.buffer import BasicBuffer
>>> env = gym.make("CartPole-v1")
>>> b = BasicBuffer()
>>> s, info = env.reset()
>>> a = env.action_space.sample()
>>> sp, r, done, _, _ = env.step(a)
>>> b.push(s, a, r, sp, done)
>>> S, A, R, SP, DONE = b.sample(batch_size=32)
>>> S.shape # Dimension batch_size x n
(32, 4)
>>> R.shape # Dimension batch_size x 1
(32, 1)
Parameters:

batch_size – Number of elements to sample

Returns:

  • S - Matrix of size batch_size x n of sampled states

  • A - Matrix of size batch_size x 1 of sampled actions

  • R - Matrix of size batch_size x 1 of sampled rewards

  • SP - Matrix of size batch_size x n of sampled states transitioned to

  • DONE - Matrix of size batch_size x 1 of bools indicating if the environment terminated

save(path)[source]#

Use this to save the content of the buffer to a file

Parameters:

path – Path to save the buffer content to (use the same argument with load)

Returns:

None

load(path)[source]#

Use this to load buffer content from a file

Parameters:

path – Path to load from (use same argument with save)

Returns:

None
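
For completeness, a short usage sketch of save and load, continuing the buffer b from the examples above (the file name is just a placeholder):

>>> b.save("buffer_backup.pkl")   # placeholder file name
>>> b2 = BasicBuffer()
>>> b2.load("buffer_backup.pkl")  # b2 now contains the experience stored by b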

Solutions to selected exercises#