Assignment #2 Solution




These questions require thought, but do not require long answers. Please be as concise as possible.

We encourage students to discuss in groups for assignments. We ask that you abide by the university Honor Code and that of the Computer Science department. If you have discussed the problems with others, please include a statement saying who you discussed problems with. Failure to follow these instructions will be reported to the O ce of Community Standards. We reserve the right to run a fraud-detection software on your code. Please refer to website, Academic Collaboration and Misconduct section for details about collaboration policy.

Please review any additional instructions posted on the assignment page. When you are ready to submit, please follow the instructions on the course website. Make sure you test your code using the provided commands and do not edit outside of the marked areas.

You’ll need to download the starter code and ll the appropriate functions following the instructions from the handout and the code’s documentation. Training DeepMind’s network on Pong takes roughly 12 hours on GPU, so please start early! (Only a completed run will recieve full credit) We will give you access to an Azure GPU cluster. You’ll nd the setup instructions on the course assignment page.


In this assignment we will implement deep Q learning, following DeepMind’s paper ([mnih2015human] and [mnih-atari-2013]) that learns to play Atari from raw pixels. The purpose is to understand the e ec-tiveness of deep neural network as well as some of the techniques used in practice to stabilize training and achieve better performance. You’ll also have to get comfortable with Tensor ow. We will train our networks on the Pong-v0 environment from OpenAI gym, but the code can easily be applied to any other environment.

In Pong, one player wins if the ball passes by the other player. Winning a game gives a reward of 1, while losing gives a negative reward of -1. An episode is over when one of the two players reaches 21 wins. Thus, the nal score is between -21 (lost episode) or +21 (won episode). Our agent plays against a decent hard-coded AI player. Average human performance is 3 (reported in [mnih-atari-2013]). If you go to the end of the homework successfully, you will train an AI agent with super-human performance, reaching at least +10 (hopefully more!).

  • Test Environment (5 pts)

Before running our code on Pong, it is crucial to test our code on a test environment. You should be able to run your models on CPU in no more than a few minutes on the following environment:

  • 4 states: 0; 1; 2; 3
  • 5 actions: 0; 1; 2; 3; 4. Action 0 i 3 goes to state i, while action 4 makes the agent stay in the same state.
  • Rewards: Going to state i from states 0, 1, and 3 gives a reward R(i), where R(0) = 0:1; R(1) =

0:2; R(2) = 0; R(3) = 0:1. If we start in state 2, then the rewards de nd above are multiplied by 10. See Table 1 for the full transition and reward structure.

  • One episode lasts 5 time steps (for a total of 5 actions) and always starts in state 0 (no rewards at the initial state).


State (s) Action (a) Next State (s’) Reward (R)
0 0 0 0.1
0 1 1 -0.2
0 2 2 0.0
0 3 3 -0.1
0 4 0 0.1
1 0 0 0.1
1 1 1 -0.2
1 2 2 0.0
1 3 3 -0.1
1 4 1 -0.2
2 0 0 -1.0
2 1 1 2.0
2 2 2 0.0
2 3 3 1.0
2 4 2 0.0
3 0 0 0.1
3 1 1 -0.2
3 2 2 0.0
3 3 3 -0.1
3 4 3 -0.1

Table 1: Transition table for the Test Environment

An example of a path (or an episode) in the test environment is shown in Figure 1, and the trajectory can be represented in terms of st; at; Rt as: s0 = 0; a0 = 1; R0 = 0:2; s1 = 1; a1 = 2; R1 = 0; s2 = 2; a2 = 4; R2 = 0; s3 = 2; a3 = 3; R3 = ( 0:1) ( 10) = 1; s4 = 3; a4 = 0; R4 = 0:1; s5 = 0.

Figure 1: Example of a path in the Test Environment

  1. (written 5pts) What is the maximum sum of rewards that can be achieved in a single episode in the test environment, assuming = 1?
  • Q-learning (12 pts)

Tabular setting In the tabular setting, we maintain a table Q(s; a) for each tuple state-action. Given an experience sample (s; a; r; s0), our update rule is

r + a0 2A 0 0 (1)
Q(s; a) = Q(s; a) + max Q (s ; a ) Q (s; a)  ;

where     2 R is the learning rate,       the discount factor.

Approximation setting Due to the scale of Atari environments, we cannot reasonably learn and store a Q value for each state-action tuple. We will instead represent our Q values as a function q^(s; a; w) where w are parameters of the function (typically a neural network’s weights and bias parameters). In this approximation setting, our update rule becomes

r + a0 2A 0 0 q^(s; a; w) rw q^(s; a; w): (2)
w = w + max q^(s ; a ; w)

In other words, we are try to minimize

L(w) = Es;a;r;s0 r +  a0 2A 0 0 2
q^(s; a; w)
max q^(s ; a ; w) (3)

Target Network DeepMind’s paper [mnih2015human] [mnih-atari-2013] maintains two sets of pa-rameters, w (to compute q^(s; a)) and w (target network, to compute q^(s0; a0)) such that our update rule becomes

w = w +   r +  a0 2A s0; a0; wq^(s; a; w)  rwq^(s; a; w): (4)
max q^

The target network’s parameters are updated with the Q-network’s parameters occasionally and are kept xed between individual updates. Note that when computing the update, we don’t compute gradients with respect to w (these are considered xed weights).

Replay Memory As we play, we store our transitions (s; a; r; s0) in a bu er. Old examples are deleted as we store new transitions. To update our parameters, we sample a minibatch from the bu er and perform a stochastic gradient descent update.

-Greedy Exploration Strategy During training, we use an -greedy strategy. DeepMind’s paper [mnih2015human] [mnih-atari-2013] decreases from 1 to 0:1 during the rst million steps. At test time, the agent choses a random action with probability soft = 0:05.

There are several things to be noted:

  • In this assignment, we will update w every learning freq steps by using a minibatch of experi-ences sampled from the replay bu er.
  • DeepMind’s deep Q network takes as input the state s and outputs a vector of size = number of actions. In the Pong environment, we have 6 actions, thus q^(s; w) 2 R6.
  • The input of the deep Q network is the concatenation 4 consecutive steps, which results in an input after preprocessing of shape (80 80 4).

We will now examine these assumptions and implement the epsilon-greedy strategy.

  1. (written 3pts) What is one bene t of using experience replay?
  2. (written 3pts) What is one bene t of the target network?
  3. (written 3pts) What is one bene t of representing the Q function as q^(s; w) 2 RK
  4. (coding 3pts) Implement the get action and update functions in q1 Test your implementation by running python q1
  • Linear Approximation (26 pts)
  1. (written 3pts) Show that Equations (1) and (2) from section 2 above are exactly the same when q^(s; a; w) = wT x(s; a), where w 2 RjSjjAj and x : S A ! RjSjjAj such that

1           if s0 = s; a0 = a

x(s; a)s0 ;a0  =

0    otherwise

for all (s; a) 2 S A, x(s; a) is a vector of length jSjjAj where the element corresponding to s0 2 S; a0 2 A is 1 when s0 = s; a0 = a and is 0 otherwise.

  1. (written 3pts) Derive the gradient with regard to the value function parameter w 2 Rn given q^(s; a; w) = wT x(s; a) for any function x(s; a) 7!x 2 Rn and write the update rule for w.
  2. (coding 15pts) We will now implement linear approximation in Tensor ow. This question will setup the whole pipeline for the remiander of the assignment. You’ll need to implement the following functions in q2 (pleasd read throughq2 :
  • add placeholders op
  • get q values op

add update target op

  • add loss op
  • add optimizer op

Test your code by running python q2 locally on CPU. This will run linear approx-imation with Tensor ow on the test environment. Running this implementation should only take a minute or two.

  1. (written 5pts) Do you reach the optimal achievable reward on the test environment? Attach the plot scores.png from the directory results/q2 linear to your writeup.
  • Implementing DeepMind’s DQN (15 pts)
  1. (coding 10pts) Implement the deep Q-network as described in [mnih2015human] by implementing get q values op in q3 The rest of the code inherits from what you wrote for linear approximation. Test your implementation locally on CPU on the test environment by running python q3 Running this implementation should only take a minute or two.
  2. (written 5pts) Attach the plot of scores, scores.png, from the directory results/q3 nature to your writeup. Compare this model with linear approximation. How do the nal performances compare? How about the training time?
  • DQN on Atari (27 pts)

The Atari environment from OpenAI gym returns observations (or original frames) of size (210 160 3), the last dimension corresponds to the RGB channels lled with values between 0 and 255 (uint8). Following DeepMind’s paper [mnih2015human], we will apply some preprocessing to the observations:

  • Single frame encoding: To encode a single frame, we take the maximum value for each pixel color value over the frame being encoded and the previous frame. In other words, we return a pixel-wise max-pooling of the last 2 observations.
  • Dimensionality reduction: Convert the encoded frame to grey scale, and rescale it to (80 80 1). (See Figure 2)

The above preprocessing is applied to the 4 most recent observations and these encoded frames are stacked together to produce the input (of shape (80 80 4)) to the Q-function. Also, for each time we decide on an action, we perform that action for 4 time steps. This reduces the frequency of decisions without impacting the performance too much and enables us to play 4 times as many games while training. You can refer to the Methods Section of [mnih2015human] for more details.

(a) Original input (210      160    3) with RGB colors        (b) After preprocessing in grey scale of shape (80  80    1)

Figure 2: Pong-v0 environment

  1. (written 2pts) Why do we use the last 4 time steps as input to the network for playing Atari games?
  2. (written 5pts) What’s the number of parameters of the DQN model (for Pong) if the input to the Q-network is a tensor of shape (80; 80; 4) and we use “SAME” padding? How many parameters are required for the linear Q-network, assuming the input is still of shape (80; 80; 4)? How do the number of parameters compare between the two models?
  3. (coding and written 5pts). Now, we’re ready to train on the Atari Pong-v0 environment. First, launch linear approximation on pong with python q4 train atari on Azure’s GPU. This will train the model for 500,000 steps and should take approximately an hour. What do you notice about the performance?
  4. (coding and written 10 pts). In this question, we’ll train the agent with DeepMind’s architecture on the Atari Pong-v0 environment. Run python q5 train atari on Azure’s GPU. This will train the model for 5 million steps and should take around 12 hours. Attach the plot scores.png from the directory results/q5 train atari nature to your writeup. You should get a score of around 13-15 after 5 million time steps. As stated previously, the Deepmind paper claims

average human performance is  3.

As the training time is roughly 12 hours, you may want to check after a few epochs that your network is making progress. The following are some training tips:

  • If you terminate your terminal session, the training will stop. In order to avoid this, you should use screen to run your training in the background.
  • The evaluation score printed on terminal should start at -21 and increase.
  • The max of the q values should also be increasing
  • The standard deviation of q shouldn’t be too small. Otherwise it means that all states have similar q values

You may want to use Tensorboard to track the history of the printed metrics. You can monitor your training with Tensorboard by typing the command tensorboard –logdir=results and then connecting to ip-of-you-machine:6006. Below are our Tensorboard graphs from one training session:


  1. (written 5pts) Compare the performance of the DeepMind architecure with the linear Q-network approximation. How can you explain the gap in performance?
  • Real world RL with neural networks (10 pts)

Given a stream of batches of n environment interactions (si; ai; ri; s0i) we want to learn the optimal value function using a neural network. The underlying MDP has a nite sized action space.

  1. (written 4pts) Your friend rst suggests the following approach
  • Initialize parameters of neural network V

For each batch of k tuples (si; ai; ri; s0i) do Stochastic Gradient Descent with loss function V (si)j2 where yi = maxai [ri + V (s0i)]

What is the problem with this approach? (Hint: Think about the type of data we have)

  1. (written 3pts) Your friend now suggests the following
  • Initialize parameters of neural network for state-action value function Q (s; a)
  • For each batch of k tuples (si; ai; ri; s0i) do Stochastic Gradient Descent with loss function Q (si; ai)j2 where yi = ri + V (s0i)

Pki=0 jyi

Pki=0 jyi

Now as we just have the network Q (s; a) how would you determine V (s) needed for the above training procedure?

  1. (written 3pts) Is the above method of learning the Q network guaranteed to give us an approximation of the optimal state action value function?

error: Content is protected !!