Programming Assignment 5: Reinforcement Learning Solution




Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. This homework programming assignment consists of 2 parts. In the first part, you are tasked with solving a simple MDP using value iteration algorithm. In the second part, you are asked to train a tic-tac-toe player using Q-learning.

Please use Python 3.6 orPython 3.7 to implement your homework assignment.

1. Homework Description

1.1 Task 1: 2D Maze (30 pts)

Task Description

Your agent is trapped in a 2D Maze, and your are helping your agent to determine the policy to get out of the maze.

There are obstacles in the maze that your agent should avoid. If your agent crashes into an obstacle, 100 Health Points of your agent will be deducted (HP -100). Each time your agent moves, 1 HP will also be deducted (HP -1). When your agent arrives at the destination, your agent will receive 100 HP (HP +100). (Hint: think of HP as reward)

However, there is also uncertainty in the agent’s navigation. The agent will go in the correct direction 70% of the time (10% in each other direction, including borders).

The goal is to compute the best policy given the maze usingvalue iteration.


  • Policy:

    • The lowercase letter x represents an obstacle

    • Period . represents destination

    • ^ v < > represents NORTH, SOUTH, WEST, EAST respectively

  • Utilities here are just for demonstration, do NOT round in your algorithm

  • It is guaranteed that there is NO tie in all test cases

Value Iteration

Recall the bellman equation to calculate the expected utility starting in sand acting optimally:

And the value iteration algorithm:

Parameter Selection

Please use exact same values

  • Discount factor gamma = 0.9

  • Maximum error allowed epsilon = 0.1

  • Initial conditions U0= 0.0

Input and Output Format

The input file will be formatted as follows:



followed by number_of_obstacles lines of x, y #location of obstacles x, y #location of destination

For the above mentioned example, the input is:

And your output should be :

NOblank line at the end of the file, \nfor line break


You are given blackbox51.txt, blackbox52.txt for development, and your code will only be graded with hidden test cases blackbox53.txt and blackbox54.txt

The program should be run in the following way:

python3 input_file output_file => output_file

For example, when we grade with hidden blackbox53, we will run:

python3 blackbox53.txt blackbox53-policy.txt => blackbox53-policy.txt

The score will be calculated as:

Correctness (Exact Same)









Hint: Linux diff command or python filecmp.cmp() function can be used to compare two files

1.2 Task 2: Tic-Tac-Toe (70 pts)

Task Description

In this part, you are asked to solve a real-life problem Tic-tac-toe. If you don’t remember it, familiarize yourself with the rules (

You are given 6 files:

  • tic-tac-toe board, 3 by 3 gridx











  • RandomPlayer.pyc, SmartPlayer.pyc, PerfectPlayer.pyc (cpython-3.6): think of them as 3 blackboxes. You do not need to know how things work inside of each player, but it may be helpful to know how they behave. See next section for more details

  • Q-Learning Player to be implemented

  • where all players will be called to play tic-tac-toe games and where your QLearner will be trained and tested, i.e. grading script, simply run python3

Your task is to complete In, method move() and learn() must be implemented, and the number of games GAME_NUMneeded for training the Q-Learner must be set. Any other helper functions can be added inside the QLearnerclass.

To see how these two methods and the variable GAME_NUM will be used, please see for more details.

You are not supposed to modify, TicTacToe.pyas well as other Players. The only python file you need to edit is


  • RandomPlayer: moves randomly

  • SmartPlayer: somehow better than RandomPlayer, but cannot beat PerfectPlayer

  • PerfectPlayer: never lose

If we let these 3 players play against each other, game results will look like this:

Figure 2.1


Recall the formula:

Q(s,a) ← (1 – ) Q(s,a)+ (R(s) + max Q(s’,a’))


You are encouraged to make any improvement to your Q-Learner in terms of speed and performance to get prepared for the upcoming competition.

Hint: The rewards will only be assigned for the last action taken by the agent

Parameters Selection

You are free to choose values for all parameters, i.e. reward for WIN,DRAW,LOSE, learning rate, discount factor and initial conditions.


This time, the grading script ( is given to you. After execution, the game results will be printed out. See Figure 2.2

Figure 2.2

In summary section, you can see the Win/Draw rate against each opponent as well as the final grade for task 2. The task 2 score is calculated as:






Win+Draw Rate

>= 95%




< 95% and >= 85%

15 * Rate

15 * Rate

10 * Rate

< 85%




If your player is not based on Q-Learning (e.g. rule-based), you will lose ALLpoints for task 2.

In your implementation, please do not use any existing machine learning library call. You must implement the algorithm yourself. Please develop your code yourself and do not copy from other students or from the Internet.

Your code will be autograded for technical correctness. Please name your file correctly, or you will wreak havoc on the autograder. The maximum running time is 3 minutes for each task.

2. Submission:

2.1 Submit your code to Vocareum

  • Submit,​​ on​ ​Vocareum

  • The program will terminate and fail if it exceeds the 3 minutestime limit.

  • After the submission script/grading script finishes, you can view your submission report immediately to see if your code works, while the grading report will be released after the deadline of the homework.

  • You don’t have to keep the page open while the scripts are running.

You do NOT need to submit your code or report to Blackboard. Your score on Vocareumwill be your final score. The submission instructions and examples are provided in Appendix A.

Appendix A: Vocareum Submission

This time your total score on Vocareum will be your final grade for assignment 5, so it is important for you to fully understand how to submit your work and how final grades will be generated.

There are 2 parts on Vocareum as well. Two parts are independent so that you can work on and test them separately.

Switch Task

Task 1

As usual, upload your GridMDP.pyto the work directory.

After you submit the code, you will be able to see the submission report. The submission scriptwill only check blackbox51and blackbox52.

Passedmeans you are correct on that blackbox

Otherwise, you will see Failed

After the due time, the grading script will check your code using blackbox53and blackbox54to generate score for task1.

Task 2

Only​​ needs to be submitted.

All starter code (including, * and are the SAMEon Vocareum, and you do not need to submit them.

If everything works fine, you should be able to see the submission report either in terminalor under Detailstab.

It is really important for you to know that the grade shown in the submission report is NOTyour final grade. To be fair, the grading script(exact SAMEas submission script) will run your​​againto generate the final score after the deadline.

Final Grade

After grades published, you can check your grades on Vocareum and the total score here will be your final grade for this assignment.

error: Content is protected !!