GridWorld is a 2D rectangular grid of size (N rows, N columns) with an agent starting off at one grid cell, moving from cell to cell through the grid, and eventually exiting after
collecting a reward. This grid environment is described as follows:
State space: GridWorld has N rows × N columns distinct states. We use s to denote the state. The agent starts in the bottom-left cell (row 1, column 1, marked as a green cell). There exist one or more terminal states (blue cells) that can be located anywhere in the grid (except the bottom-left cell). There may also be walls (red cells) that the agent cannot be moved to. An instance of this grid is shown in the figure below.
Actions: At every non-terminal state, the agent can either walk or run in any of the four directions (Up, Down, Left, and Right), which results in 8 possible actions: “Walk Up”, “Walk Down”, “Walk Left”, “Walk Right”, “Run Up”, “Run Down”, “Run Left”, “Run Right”. At the terminal state, the only possible action is “Exit”. We use A(s) to denote the set of all possible actions at state s .
Transition model: GridWorld is stochastic because the actions can be unreliable. In this environment, action “Walk X” (X can be Up, Down, Left, or Right) moves the agent one cell in the X direction with probability pwalk , but with probabilities 0.5 (1 − pwalk) and 0.5 (1 − pwalk) moves the agent one cell at angles of 90° and -90° to the direction X, respectively.
Furthermore, action “Run X” moves the agent two cells in the X direction with probability prun , but with probabilities 0.5 (1 − prun) and 0.5 (1 − prun) moves the agent two cells at angles of 90° and -90° to the direction X, respectively.
If moving in a particular direction causes the agent to bump into a wall, the movement fails, and the agent stays in the same cell “i, j“ . We write P (s′|s, a) to denote the probability of reaching state s′ if action a is done in state s . The following examples illustrate the environment dynamics:
Assume that the agent chooses action “Walk Up” at “4,4” as shown in figure below. This action moves the agent to (5,4) with probability pwalk , but with probability 0.5 (1 − pwalk) , it moves the agent right to “4,5”, and with probability 0.5 (1 − pwalk) , it moves the agent left to “4,3”.
Assume that the agent chooses action “Run Up” at “4,4” as shown in figure below. This action moves the agent two cells up, but because it causes the agent to bump into a wall, the agent stays at “4,4” with probability prun . With probability 0.5 (1 − prun) , the agent moves two cells right to “4,6”. Finally, this action moves the agent two cells left with probability 0.5 (1 − prun) , but because of the wall at “4,2”, the agent stays at “4,4” with probability 0.5 (1 − prun) . Hence, as a result of this action, the agent moves to “4,6” with probability 0.5 (1 − prun) and stays at “4,4” with probability prun + 0.5 (1 − prun) .
Assume that the agent chooses action “Walk Right” at “4,1” as shown in figure below. Then, the agent moves to “5,1” and “3,1”, each with probability 0.5 (1 − pwalk) and stays at “4,1” with probability pwalk .
Assume that the agent chooses action “Run Right” at “4,1” as shown in figure below. Then, the agent moves to “2,1” with probability 0.5 (1 − prun) and stays at “4,1” with probability prun + 0.5 (1 − prun) .
Rewards: When the agent takes action a in state s , it receives a reward, R(s, a) . For all non-terminal states, s:
R(s,Walk X) = Rwalk (a constant).
R(s,Run X) =Rrun (a constant).
Furthermore, if there are K terminal states, the reward in the kth terminal state, s is R(s,Exit) = Rterminal(k) (a constant).
The agent prefers receiving its reward as soon as possible. Therefore, it applies a discount factor, γ , to future rewards. For example, if it runs once and then exits from the
kth terminal state, it will receive a total reward of Rrun + γRterminal(k). If it instead walks twice and then exits from the kth terminal state, it will receive a total reward of
Rwalk + γRwalk + γ2 Rterminal(k).
In this assignment, you need to find the action that the agent should take at each state to maximize its expected reward. You can use any algorithm for this homework.
Ties between the possible actions are broken based on the following preference order:
“Walk Up” > “Walk Down” > “Walk Left” > “Walk Right” > “Run Up” > “Run Down” > “Run Left” > “Run Right”.
The grid will contain at most 1,000,000 cells.
As the wall states are never reached, it does not matter what the optimal actions are at these states. However, for consistency, you should print “None” as the optimal action for a wall state.
<GRID SIZE> includes two positive integers separated by a comma “,” where the first one is the number of rows ( N rows ) of the grid, and the second one is the number of
columns ( N columns ) of the grid.
<WALL CELLS NUMBER> is a number (greater than or equal to zero) that specifies the number of wall cells in the grid.
<WALL CELLS POSITION> contains <WALL CELLS NUMBER> lines where each line includes 2 values separated by a comma “,”. The two values in each line indicate the row and column numbers of the gird where a wall is located.
<TERMINAL STATES NUMBER> is a number (greater than or equal to one) that specifies the number of terminal states in the grid.
<TERMINAL STATES POSITION and REWARDS> contains <TERMINAL STATES NUMBER> lines where each line includes 3 values separated by a comma “,”. The first two values in the k -th line indicate the row and column numbers of the grid where the
kthterminal state is located, and the third value (a float) denotes Rterminal(k) . <TRANSITION MODEL> contains two probability values separated by a comma “,”, where the first value is pwalk and the second value is prun .
<REWARDS> contains two floating-point numbers separated by a comma “,” where the
first value is Rwalk and the second value is Rrun.
<DISCOUNT FACTOR>which is a number in the interval [0,1].
<OPTIMAL ACTION GRID> contains N rows lines where each line includes N columns strings separated by a comma “,”. The j -th string at the i -th line should indicate the
optimal action (one of “Walk Up”, “Walk Down”, “Walk Left”, “Walk Right”, “Run Up”, “Run Down”, “Run Left”, or “Run Right” for non-terminal states; “Exit” for terminal states; or “None” for wall states) at state “ N rows − i + 1 , j ”.
Your homework submission will be tested as follows: Your program should take no command-line arguments. It should read a text file called “input.txt” in the current directory that contains a problem definition. It should write a file “output.txt” with your solution. The formats for files “input.txt” and “output.txt” are specified below. End-of-line convention is Unix (because Vocareum is a Unix system).
During grading, 50 test cases will be used. If your homework submission passes all 50 test cases, you receive 100 points. For each test case that fails, you will lose 2 points.
Note that if your code does not compile, or somehow fails to load and parse input.txt, or writes an incorrectly formatted output.txt, or no output.txt at all, or OuTpUt.TxT, you will get zero points.
Sample Test Cases
You are provided with 3 sample test cases and outputs. Please find them in the Homework 3 folder.
You must use Python 2.7 to implement your homework assignment. You are allowed to use standard libraries only. You have to implement any other functions or methods by yourself.
You need to create a file named “hw3cs561s2018.py”. When you submit the homework on labs.vocareum.com, the following commands will be executed: python hw3cs561s2018.py
Your code must create a file named “output.txt” and print its output there. For each test case, the grading script will put an “input.txt” file in your work folder, runs your program (which reads “input.txt”), and check the “output.txt” file generated by your code. The grading script will replace the files automatically, so you do NOT need to do anything for that part.
Homework should be submitted through Vocareum. Homework submitted through email, in person during office hours, or through any channel other than Vocareum will not be accepted. Please only upload your code to the “/work” directory. Don’t create any subfolder or upload any other files. Please refer to http://help.vocareum.com/article/30-getting-started-students to get started with Vocareum.
Your program should handle all test cases within a reasonable time. A maximum runtime of 3 minutesper test case will be set on Vocareum.
You are strongly recommended to submit at least two hours ahead of the deadline, as the submission system around the deadline might be slow and possibly cause late submission. Late submissions will not be graded.