Exercise 13: Q-learning and deep-Q learning#

Note

  • The exercise material is divided into general information (found on this page) and the actual exercise instructions. You can download this week's exercise instructions from here:

  • You are encouraged to prepare homework problem 1 (indicated by a hand symbol in the PDF file) at home and present your solution during the exercise session.

  • To get the newest version of the course material, please see Making sure your files are up to date

Deep Q-learning#

To help you implement deep Q-learning, I have provided a couple of helper classes.

The replay buffer#

The replay buffer, BasicBuffer, is essentially a list that holds past transitions \((s_t, a_t, r_{t+1}, s_{t+1})\) together with a done-flag. It has a function to push experience into the buffer and a function to sample a batch from the buffer:

# deepq_agent.py
self.memory = BasicBuffer(replay_buffer_size) if buffer is None else buffer 
self.memory.push(s, a, r, sp, done) # save current observation 
""" First we sample from replay buffer. Returns numpy Arrays of dimension 
> [self.batch_size] x [...]]
for instance 'a' will be of dimension [self.batch_size x 1]. 
"""
s,a,r,sp,done = self.memory.sample(self.batch_size) 
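
If you want intuition for what BasicBuffer does, a minimal replay buffer can be sketched with a bounded deque. The class below is only an illustrative stand-in (the name, the uniform sampling scheme and the output shapes are assumptions based on the description above), not the provided implementation:

# Minimal replay-buffer sketch (illustrative only; the provided BasicBuffer may differ).
import random
from collections import deque
import numpy as np

class TinyBuffer:
    def __init__(self, max_size=2000):
        self.buffer = deque(maxlen=max_size)  # oldest experience is dropped automatically

    def push(self, s, a, r, sp, done):
        self.buffer.append((s, a, r, sp, done))

    def sample(self, batch_size):
        batch = random.sample(list(self.buffer), batch_size)  # uniform sampling without replacement
        s, a, r, sp, done = map(np.asarray, zip(*batch))
        # Reshape the scalar entries to batch_size x 1 to mimic the documented output shapes.
        return s, a.reshape(-1, 1), r.reshape(-1, 1), sp, done.reshape(-1, 1)

Using a deque with maxlen means the oldest transitions are discarded automatically once the buffer is full.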

The deep network#

The second helper class represents the \(Q\)-network, and it is this class that actually does all the deep learning. You can find a description in DQNNetwork.

Let's say the state has dimension \(n\). The \(Q\)-network accepts a tensor of shape batch_size x n and returns a tensor of shape batch_size x actions. An example:

>>> from irlc.ex13.torch_networks import TorchNetwork
>>> import gymnasium as gym
>>> import numpy as np
>>> env = gym.make("CartPole-v1")
>>> Q = TorchNetwork(env, trainable=True, learning_rate=0.001) # DQN network requires an env to set network dimensions
>>> batch_size = 32 # As an example
>>> states = np.random.rand(batch_size, env.observation_space.shape[0]) # Creates some dummy input
>>> states.shape    # batch_size x n
(32, 4)
>>> qvals = Q(states) # Evaluate Q(s,a)
>>> qvals.shape # This is a tensor of dimension batch_size x actions
(32, 2)
>>> print(qvals[0,1]) # Get Q(s_0, 1)
-0.22357672
>>> Y = np.random.rand(batch_size, env.action_space.n) # Generate target Q-values (training data)
>>> Q.fit(states, Y)                      # Train the Q-network for 1 gradient descent step
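
Combining the buffer and the network, a single DQN update step could look roughly like the sketch below. It assumes numpy is imported as np, that the agent stores its network and buffer as self.Q and self.memory as in the snippets above, and it uses a made-up discount factor; the actual recipe in deepq_agent.py may differ. Since fit expects a full batch_size x actions target matrix, we copy the current predictions and only overwrite the entries of the actions that were actually taken:

# Sketch of one DQN update step (assumed structure; see deepq_agent.py for the actual agent).
gamma = 0.99                                     # discount factor; the value used in the exercise may differ
s, a, r, sp, done = self.memory.sample(self.batch_size)
target = np.array(self.Q(s))                     # start from the current predictions, batch_size x actions
q_next = np.max(self.Q(sp), axis=1)              # max_a' Q(s', a'), one value per batch element
idx = np.arange(self.batch_size)
# Only the entries for the actions actually taken are changed; done masks out terminal transitions.
target[idx, a.squeeze().astype(int)] = r.squeeze() + gamma * (1 - done.squeeze()) * q_next
self.Q.fit(s, target)                            # one gradient step towards the targets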

Finally, to implement double-\(Q\) learning we have to slowly adapt the weights of one network towards those of another. This can be done using the method update_Phi(), which computes:

\[w_i \leftarrow w_i + \tau (w'_i - w_i)\]

An example:

# double_deepq_agent.py
self.target.update_Phi(self.Q, tau=self.tau)  # Polyak-update the target network towards the online network self.Q
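
The idea of the double-\(Q\) construction is that the online network self.Q selects the greedy action while the slowly-updated copy self.target evaluates it, which reduces the overestimation bias of the plain max-based target. Below is a sketch of the resulting target, reusing the names and the assumed discount factor from the DQN sketch above; the agent in double_deepq_agent.py may organize this differently:

# Sketch of the double-Q target (illustrative; names follow the snippets above).
idx = np.arange(self.batch_size)
a_star = np.argmax(self.Q(sp), axis=1)            # greedy action chosen by the online network
q_eval = np.array(self.target(sp))[idx, a_star]   # ...but evaluated by the target network
y = r.squeeze() + gamma * (1 - done.squeeze()) * q_eval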

Classes and functions#

class irlc.ex13.dqn_network.DQNNetwork[source]#

A class representing a deep Q-network. Note that the network is batched, i.e. s is assumed to be a numpy array of dimension batch_size x n.

The following example shows how you can evaluate the Q-values in a given state:

>>> from irlc.ex13.torch_networks import TorchNetwork
>>> import gymnasium as gym
>>> import numpy as np
>>> env = gym.make("CartPole-v1")
>>> Q = TorchNetwork(env, trainable=True, learning_rate=0.001) # DQN network requires an env to set network dimensions
>>> batch_size = 32 # As an example
>>> states = np.random.rand(batch_size, env.observation_space.shape[0]) # Creates some dummy input
>>> states.shape    # batch_size x n
(32, 4)
>>> qvals = Q(states) # Evaluate Q(s,a)
>>> qvals.shape # This is a tensor of dimension batch_size x actions
(32, 2)
>>> print(qvals[0,1]) # Get Q(s_0, 1)
0.011426022
>>> Y = np.random.rand(batch_size, env.action_space.n) # Generate target Q-values (training data)
>>> Q.fit(states, Y)                      # Train the Q-network for 1 gradient descent step

update_Phi(source, tau=0.01)[source]#

Updates (adapts) the weights in this network towards those in source by a small amount.

For each weight \(w_i\) in (this) network, and each corresponding weight \(w'_i\) in the source network, the following Polyak update is performed:

\[w_i \leftarrow w_i + \tau (w'_i - w_i)\]
Parameters:
  • source – The network whose weights this network is updated towards

  • tau – Update rate (the rate of change \(\tau\))

Returns:

None
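
For intuition, the Polyak update above can be written in a few lines of PyTorch. This is only a sketch of the formula, not necessarily how update_Phi is implemented internally:

# A possible Polyak update between two torch modules (not necessarily how update_Phi does it).
import torch

def polyak_update(target_net: torch.nn.Module, source_net: torch.nn.Module, tau: float = 0.01) -> None:
    with torch.no_grad():
        for w, w_src in zip(target_net.parameters(), source_net.parameters()):
            w.add_(tau * (w_src - w))  # w <- w + tau * (w' - w)

With a small \(\tau\), the target network trails the online network slowly, which keeps the regression targets stable during training.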

fit(s, target)[source]#

Fit the network weights by minimizing

\[\frac{1}{B}\sum_{i=1}^B \sum_{a=1}^K \| q_\phi(s_i)_a - y_{i,a} \|^2\]

where target corresponds to \(y\) and is a [batch_size x actions] matrix of target Q-values.

Parameters:
  • s – Batch of states, a numpy array of dimension batch_size x n

  • target – Matrix of target Q-values \(y\) of dimension batch_size x actions

Returns:

None
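
As a sanity check, the loss can be evaluated by hand. The numpy snippet below computes the displayed expression for a tiny made-up batch with \(B = 2\) states and \(K = 2\) actions:

# Evaluating the displayed loss by hand for B = 2 states and K = 2 actions.
import numpy as np
q = np.array([[1.0, 0.5],    # q_phi(s_1)
              [0.2, 0.3]])   # q_phi(s_2)
y = np.array([[1.2, 0.5],    # targets y_1
              [0.0, 0.3]])   # targets y_2
loss = np.mean(np.sum((q - y) ** 2, axis=1))   # (1/B) sum_i sum_a (q_phi(s_i)_a - y_ia)^2
print(loss)                                    # ~0.04 = ((-0.2)^2 + 0.2^2) / 2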

class irlc.ex13.buffer.BasicBuffer(max_size=2000)[source]#

The buffer class is used to keep track of past experience and sample it for learning.

__init__(max_size=2000)[source]#

Creates a new (empty) buffer.

Parameters:

max_size – Maximum number of elements in the buffer. This should be a large number like 100,000.

push(state, action, reward, next_state, done)[source]#

Add information from a single step, \((s_t, a_t, r_{t+1}, s_{t+1}, \text{done})\) to the buffer.

>>> import gymnasium as gym
>>> from irlc.ex13.buffer import BasicBuffer
>>> env = gym.make("CartPole-v1")
>>> b = BasicBuffer()
>>> s, info = env.reset()
>>> a = env.action_space.sample()
>>> sp, r, done, _, info = env.step(a)
>>> b.push(s, a, r, sp, done)
>>> len(b) # Get number of elements in buffer
1
Parameters:
  • state – A state \(s_t\)

  • action – Action taken \(a_t\)

  • reward – Reward obtained \(r_{t+1}\)

  • next_state – Next state transitioned to \(s_{t+1}\)

  • done – True if the environment terminated, else False

Returns:

None

sample(batch_size)[source]#

Sample batch_size elements from the buffer for use in training a deep Q-learning method. The elements returned will all be numpy ndarrays where the first dimension is the batch dimension, i.e. of size batch_size.

>>> import gymnasium as gym
>>> from irlc.ex13.buffer import BasicBuffer
>>> env = gym.make("CartPole-v1")
>>> b = BasicBuffer()
>>> s, info = env.reset()
>>> a = env.action_space.sample()
>>> sp, r, done, _, _ = env.step(a)
>>> b.push(s, a, r, sp, done)
>>> S, A, R, SP, DONE = b.sample(batch_size=32)
>>> S.shape # Dimension batch_size x n
(32, 4)
>>> R.shape # Dimension batch_size x 1
(32, 1)
Parameters:

batch_size – Number of elements to sample

Returns:

  • S - Matrix of size batch_size x n of sampled states

  • A - Matrix of size batch_size x 1 of sampled actions

  • R - Matrix of size batch_size x 1 of sampled rewards

  • SP - Matrix of size batch_size x n of sampled states transitioned to

  • DONE - Matrix of size batch_size x 1 of bools indicating if the environment terminated

save(path)[source]#

Use this to save the content of the buffer to a file

Parameters:

path – Path to save the buffer content to (use the same argument with load)

Returns:

None

load(path)[source]#

Use this to load buffer content from a file

Parameters:

path – Path to load from (use same argument with save)

Returns:

None
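
For completeness, a short usage sketch of save and load, continuing the buffer b from the examples above (the file name is just a placeholder):

>>> b.save("buffer_backup.pkl")   # placeholder file name
>>> b2 = BasicBuffer()
>>> b2.load("buffer_backup.pkl")  # b2 now contains the experience stored by b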

Solutions to selected exercises#