Classical BlocksWorld Domain [Winograd, 1971].
The objective is to put blocks on top of each other in a specific order to form a tower. Initially all blocks are unstacked and on the table. STATE: The state of the MDP is defined by n integer values [s_1 ... s_n]: s_i = j indicates that block i is on top of block j (for compactness, s_i = i indicates that block i is on the table).
Example: [0 1 2 3 4 0] means all blocks are on the table except block 5, which is on top of block 0.
ACTIONS: At each step, the agent can take a block, and put it on top of another block or move it to the table, given that blocks do not have any other blocks on top of them prior to this action.
TRANSITION: There is 30% probability of failure for each move, in which case the agent drops the moving block on the table. Otherwise the move succeeds.
REWARD: The reward is -0.001 for each step where the tower is not built and +1.0 when the tower is built.
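For illustration, here is a minimal sketch (not part of the rlpy BlocksWorld API) of a hypothetical helper that lists the "clear" blocks, i.e., those with nothing on top of them, which is exactly the condition that determines the legal moves above::

    def clear_blocks(s):
        """Return the blocks with nothing on top of them, given the encoding
        s[i] = j ("block i is on top of block j"; s[i] = i means "on the table")."""
        # Block j is covered if some other block i sits on it.
        covered = {j for i, j in enumerate(s) if i != j}
        return [b for b in range(len(s)) if b not in covered]

    # For the example state above, only block 0 is covered (by block 5).
    print(clear_blocks([0, 1, 2, 3, 4, 0]))   # -> [1, 2, 3, 4, 5]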
REFERENCE:
See also
Alborz Geramifard, Finale Doshi, Joshua Redding, Nicholas Roy, and Jonathan How. Online discovery of feature dependencies. International Conference on Machine Learning (ICML), pages 881-888. ACM, June 2011.
reward when the tower is completed
reward per step
Total number of blocks
discount factor
Used to plot the domain
Goal tower size
A simple Chain MDP.
STATE: s0 <-> s1 <-> ... <-> sn
ACTIONS: are left [0] and right [1], deterministic.
Note
The actions [left, right] are available in ALL states, but if left is selected in s0 or right in sn, then s remains unchanged.
The task is to reach sn from s0, after which the episode terminates.
Note
The optimal policy is to always go right.
REWARD: -1 per step, 0 at the goal (episode terminates).
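As a minimal sketch of the dynamics and reward just described (illustrative only, not the rlpy ChainMDP implementation; the state indexing and goal placement are assumptions)::

    LEFT, RIGHT = 0, 1

    def chain_step(s, a, chain_size):
        """One deterministic step: states are 0 .. chain_size-1, the goal is the
        right-most state; invalid moves at the ends leave the state unchanged."""
        ns = s - 1 if a == LEFT else s + 1
        ns = min(max(ns, 0), chain_size - 1)
        terminal = (ns == chain_size - 1)
        reward = 0.0 if terminal else -1.0   # -1 per step, 0 on reaching the goal
        return ns, reward, terminal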
REFERENCE:
See also
Michail G. Lagoudakis and Ronald Parr. Least-Squares Policy Iteration. Journal of Machine Learning Research (2003), Issue 4.
Parameters:  chainSize – Number of states ‘n’ in the chain. 

Reward for each timestep spent in the goal region
Used for graphical radius of states
Reward for each timestep
Number of states in the chain
Set by the domain = min(100,rows*cols)
Random start location, goal is to proceed to nearest reward.
STATE: s0 <-> s1 <-> ... <-> s49
ACTIONS: left [0] or right [1]
Actions succeed with probability .9, otherwise execute opposite action. Note that the actions [left, right] are available in ALL states, but if left is selected in s0 or right in s49, then s remains unchanged.
Note
The optimal policy is to always go to the nearest goal
REWARD: +1 at states 10 and 41 (indices 9 and 40). The reward is obtained when transitioning out of a reward state, not when first entering it.
Note that this class provides the function :py:meth:`~rlpy.Domains.FiftyChain.L_inf_distance_to_V_star`, which accepts an arbitrary representation and returns the error between it and the optimal value function V*. The user can also enforce actions under the optimal policy (ignoring the agent's policy) by setting using_optimal_policy=True in FiftyChain.py.
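As a sketch of the optimal policy referred to above (always head toward the nearest reward state; illustrative only, not the rlpy implementation)::

    LEFT, RIGHT = 0, 1
    GOAL_STATES = [9, 40]   # indices of the reward states, per the description above

    def optimal_action(s, goals=GOAL_STATES):
        """Move toward the nearest reward state."""
        nearest = min(goals, key=lambda g: abs(g - s))
        return RIGHT if nearest > s else LEFT

    print(optimal_action(0))    # -> 1 (right, toward state index 9)
    print(optimal_action(15))   # -> 0 (left, back toward state index 9)
    print(optimal_action(30))   # -> 1 (right, toward state index 40)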
REFERENCE:
See also
Michail G. Lagoudakis and Ronald Parr. Least-Squares Policy Iteration. Journal of Machine Learning Research (2003), Issue 4.
Reward for each timestep spent in the goal region
Indices of states with rewards
Parameters:  representation – An arbitrary learned representation of the value function. 

Returns:  the L-infinity distance between the parameter representation and the optimal one.
Number of states in the chain
Set by the domain = min(100,rows*cols)
Probability of taking the other (unselected) action
A domain based on the last puzzle of Doors and Rooms Game stage 53.
The goal of the game is to get all elements of a 4x4 board to have value 1.
The initial state is the following:
1 0 0 0
0 0 0 0
0 1 0 0
0 0 1 0
STATE: a 4x4 array of binary values.
ACTION: Invert the value of a given [Row, Col] (from 0 -> 1 or 1 -> 0).
TRANSITION: Deterministically flip all elements of the board on the same row OR col of the action.
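A minimal numpy sketch of one reading of this transition (flip the full row and the full column, with the chosen cell itself ending up inverted exactly once, consistent with the ACTION description above; this may differ in detail from the rlpy implementation)::

    import numpy as np

    def flip(board, row, col):
        """Invert every cell in the given row and column of a binary board."""
        ns = board.copy()
        ns[row, :] ^= 1
        ns[:, col] ^= 1
        ns[row, col] ^= 1   # row and column flips cancel at the chosen cell, so flip it once more
        return ns

    board = np.zeros((4, 4), dtype=int)
    board[0, 0] = board[2, 1] = board[3, 2] = 1   # the initial state shown above
    print(flip(board, 1, 1))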
REWARD: -1 per step; 0 when the board is solved [all ones].
REFERENCE:
See also
The GridWorld domain simulates a path-planning problem for a mobile robot in an environment with obstacles. The goal of the agent is to navigate from the starting point to the goal state. The map is loaded from a text file filled with numbers showing the map with the following coding for each cell:
STATE: The Row and Column corresponding to the agent’s location.
ACTIONS: Four cardinal directions: up, down, left, right (given that the destination is not blocked or out of range).
TRANSITION: There is a 30% probability of failure for each move, in which case the action is replaced with a random action. Otherwise the move succeeds and the agent moves in the intended direction.
REWARD: The reward on each step is -0.001, except for actions that bring the agent to the goal, which receive a reward of +1.
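A minimal sketch of the 30% action noise described under TRANSITION (illustrative; the map loading, blocked cells, and goal handling of the real domain are omitted)::

    import numpy as np

    # Cardinal moves as (row, col) offsets: up, down, left, right.
    ACTIONS = np.array([[-1, 0], [1, 0], [0, -1], [0, 1]])
    NOISE = 0.3   # probability that the chosen action is replaced by a random one

    def noisy_move(pos, a, rng=np.random):
        """Apply the intended action with probability 0.7, otherwise a random action."""
        if rng.rand() < NOISE:
            a = rng.randint(len(ACTIONS))
        return np.asarray(pos) + ACTIONS[a]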
Up, Down, Left, Right
Number of rows and columns of the map
Reward constants
Movement Noise
Number of rows and columns of the map
Set by the domain = min(100,rows*cols)
Simulation of HIV Treatment. The aim is to find an optimal drug schedule.
STATE: The state contains concentrations of 6 different cells:
ACTIONS: The therapy consists of 2 drugs (reverse transcriptase inhibitor [RTI] and protease inhibitor [PI]), each of which can be activated or not. The action space therefore contains 4 actions, one for each combination of the two drugs being on or off.
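A small sketch enumerating the four drug combinations implied above (each of RTI and PI either on or off); the encoding and ordering are illustrative, not necessarily those used internally::

    from itertools import product

    # Each action is a pair (RTI active?, PI active?).
    ACTIONS = list(product([False, True], repeat=2))
    # -> [(False, False), (False, True), (True, False), (True, True)]

    for i, (rti, pi) in enumerate(ACTIONS):
        print(i, "RTI on" if rti else "RTI off", "|", "PI on" if pi else "PI off")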
REFERENCE:
See also
Ernst, D., Stan, G.-B., Gonçalves, J., & Wehenkel, L. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Proceedings of the 45th IEEE Conference on Decision and Control (2006).
measurement every 5 days
total of 1000 days with a measurement every 5 days
whether observed states are in log10 space or not
only update the graphs in showDomain every x steps
Implementation of a simulator that models one of the Stanford autonomous helicopters (an XCell Tempest helicopter) in the flight regime close to hover.
Adapted from the RL-Community Java implementation.
STATE: The state of the helicopter is described by a 20-dimensional vector with the following entries:
REFERENCE:
See also
Abbeel, P., Ganapathi, V., & Ng, A. Learning vehicular dynamics, with application to modeling helicopters. Advances in Neural Information Processing Systems (2006).
[m] maximum deviation in position in each dimension
[m/s] maximum velocity in each dimension
all possible actions
discount factor
length of one timestep
wind in neutral orientation
Warning
This domain has an internal hidden state, as it actually is a POMDP. Besides the 12-dimensional observable state, there is an internal state saved as self.hidden_state_ (time and long-term noise which simulates gusts of wind). Be aware of this state if you use this class to produce samples which are not in order.
Implementation of a simulator that models one of the Stanford autonomous helicopters (an XCell Tempest helicopter) in the flight regime close to hover.
Adapted from the RL-Community Java implementation.
STATE: The state of the helicopter is described by a 12-dimensional vector with the following entries:
REFERENCE:
See also
Abbeel, P., Ganapathi, V., & Ng, A. Learning vehicular dynamics, with application to modeling helicopters. Advances in Neural Information Processing Systems (2006).
Formulated as an MDP, the intruder monitoring task is to guard danger zones using cameras so that if an intruder moves to a danger zone, at least one camera is pointing at that location.
All locations are on a 2D grid.
The episode is finished after 1000 steps.
STATE:
Location of: [ Agent_1, Agent_2, ... Agent n ]
Location of: [ Intruder_1, Intruder_2, ... Intruder_m ]
Where n is number of agents, m is number of intruders.
ACTIONS: [Up, Down, Left, Right, Remain]^n (one action for each agent).
TRANSITION: Each agent can move in 4 directions or stay still. There is no noise on any movement. Each intruder moves with a fixed policy (specified by the user); by default, the intruder policy is uniform random.
The map of the world contains a fixed number of danger zones. Maps are simple text files contained in the Domains/IntruderMonitoringMaps/ directory.
REWARD:
-1 for every visit of an intruder to a danger zone with no camera present.
The team receives a penalty whenever there is an intruder on a danger zone in the absence of an agent. The task is to allocate agents on the map so that intruders do not enter the danger zones without attendance of an agent.
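A minimal sketch of this reward rule (locations as (row, col) tuples; names are illustrative, not the rlpy API)::

    def danger_zone_reward(agent_locs, intruder_locs, danger_zones):
        """-1 for each danger zone holding at least one intruder and no agent."""
        agents = {tuple(a) for a in agent_locs}
        intruders = {tuple(i) for i in intruder_locs}
        return -sum(1 for z in map(tuple, danger_zones)
                    if z in intruders and z not in agents)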
Actions: Up, Down, Left, Right, Null
Number of rows and columns of the map
Rewards
Parameters:  s_i – The state of a single agent (where the domain state s = [s_0, ... s_i ... s_NUMBER_OF_AGENTS]). 

Returns:  the valid actions the agent can take in state s_i.
Default random action among possible.
Number of Cooperating agents
Number of Intruders
Number of rows and columns of the map
directory with maps shipped with rlpy
Returns all possible actions for a single (2D) agent state s_i (where the domain state s = [s_0, ... s_i ... s_NUMBER_OF_AGENTS])
 tile the [R,C] for all actions
 add all actions to the results
 Find feasible rows and add them as possible actions
Move all intruders according to the IntruderPolicy(), default uniform random action. Move all agents according to the selected action a. Calculate the reward = Number of danger zones being violated by intruders while no agents are present (ie, intruder occupies a danger cell with no agent simultaneously occupying the cell).
The goal is to drive an underpowered car up the hill.
STATE: Position and velocity of the car [x, xdot]
ACTIONS: [Acc backwards, Coast, Acc forward]
TRANSITIONS: Move along the hill with some noise on the movement.
REWARD: -1 per step and 0 at or beyond the goal (x - goal > 0).
There is optional noise on vehicle acceleration.
REFERENCE: Based on the RL-Community Java implementation.
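For orientation, a sketch of the classic mountain-car update (Sutton & Barto formulation); the rlpy implementation adds the configurable acceleration noise mentioned above, and its constants and bounds may differ::

    import numpy as np

    def mountain_car_step(x, xdot, a, noise=0.0, rng=np.random):
        """One step of the classic dynamics; a is -1 (reverse), 0 (coast), or +1 (forward)."""
        accel = 0.001 * a + noise * 0.001 * rng.uniform(-1, 1)   # optional acceleration noise
        xdot = np.clip(xdot + accel - 0.0025 * np.cos(3 * x), -0.07, 0.07)
        x = np.clip(x + xdot, -1.2, 0.6)
        return x, xdot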
Parameters:  noise – Magnitude of noise (times accelerationFactor) in stochastic velocity changes 

X-position of the goal location (should be at/near the hill peak)
Reward for reaching the goal.
Upper bound on car velocity
Upper bound on domain position
Hill peaks are generated as a sinusoid; this is the frequency of that sinusoid.
Magnitude of noise (times accelerationFactor) in stochastic velocity changes
Persistent Search and Track Mission with multiple Unmanned Aerial Vehicle (UAV) agents.
The goal is to perform surveillance and communicate it back to base in the presence of stochastic communication and "health" (overall system functionality) constraints, all without losing any UAVs to fuel exhaustion.
STATE:
Each UAV has 4 state dimensions:
Domain state vector consists of 4 blocks of states, each corresponding to a property of the UAVs (listed above)
So for example:
>>> state = [1,2,9,3,1,0,1,1]
corresponds to blocks
>>> loc, fuel, act_status, sens_status = [1,2], [9,3], [1,0], [1,1]
which has the meaning:
UAV 1 in location 1, with 9 fuel units remaining, and sensor + actuator with status 1 (functioning). UAV 2 in location 2, 3 fuel units remaining, actuator with status 0 and sensor with status 1.
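A minimal sketch of splitting such a flat state vector back into the four per-UAV blocks for the two-UAV example above (the block order follows the example; it is not necessarily the internal PST.StateStruct layout)::

    import numpy as np

    def split_uav_state(state, num_uav):
        """Split a flat PST-style state vector into (loc, fuel, actuator, sensor) blocks."""
        loc, fuel, act_status, sens_status = np.asarray(state).reshape(4, num_uav)
        return loc, fuel, act_status, sens_status

    print(split_uav_state([1, 2, 9, 3, 1, 0, 1, 1], num_uav=2))
    # -> (array([1, 2]), array([9, 3]), array([1, 0]), array([1, 1]))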
ACTIONS:
Each UAV can take one of 3 actions: {RETREAT, LOITER, ADVANCE}. The size of the action space is therefore \(3^n\), where n is the number of UAVs.
Detailed description: The objective of the mission is to fly to the surveillance node and perform surveillance on a target, while ensuring that a communication link with the base is maintained by having a UAV with a working actuator loitering on the communication node.
Movement of each UAV is deterministic, with a 5% failure rate for both the actuator and the sensor of each UAV on each step. A penalty is applied for each unit of fuel consumed, which occurs when a UAV moves between locations or when it is loitering above a COMMS or SURVEIL location (i.e., no penalty when loitering at REFUEL or BASE).
A UAV with a failed sensor cannot perform surveillance. A UAV with a failed actuator cannot perform surveillance or communication, and can only take actions leading it back to the REFUEL or BASE states, where it may loiter.
Loitering for 1 timestep at REFUEL assigns fuel of 10 to that UAV.
Loitering for 1 timestep at BASE assigns status 1 (functioning) to Actuator and Sensor.
Finally, if any UAV runs out of fuel, the episode terminates with a large penalty.
REWARD
The objective of the mission is to fly to the surveillance node and perform surveillance on a target, while ensuring that a communication link with the base is maintained by having a UAV with a working actuator loitering on the communication node.
The agent receives +20 if an ally with a working sensor is at the surveillance node while an ally with a working actuator is at the communication node, a penalty of -50 if any UAV crashes, and always a small penalty for burned fuel.
REFERENCE:
See also
J. D. Redding, T. Toksoz, N. Ure, A. Geramifard, J. P. How, M. Vavrina, and J. Vian. Distributed Multi-Agent Persistent Surveillance and Tracking With Health Management. AIAA Guidance, Navigation, and Control Conference (2011).
Parameters:  NUM_UAV – the number of UAVs in the domain 

Negative reward coefficient for the fuel burn penalty [not mentioned in the MDP tutorial]
Number of fuel units at start
Reward (negative) coefficient for movement (i.e., fuel burned while loitering might be penalized above, but no movement cost)
Number of targets in surveillance region; SURVEIL_REWARD is multiplied by the number of targets successfully observed
Probability that an actuator fails on this timestep for a given UAV
Probability that a sensor fails on this timestep for a given UAV
Per-step, per-UAV reward coefficient for performing surveillance on each step [C_cov]
Discount factor
Appends the arguments into a numpy array to create an RLPy state vector.
Convert generic RLPy state s to internal state
Parameters:  s – RLPy state 

Returns:  PST.StateStruct – the custom structure used by this domain. 
Converts a custom PST.StateStruct to an RLPy state vector.
Parameters:  sState – the PST.StateStruct object 

Returns:  RLPy state vector 
Returns a list of unique id’s based on possible permutations of a list of integer lists. The length of the integer lists need not be the same.
Parameters: 


Returns:  int – unique value associated with a list of lists of this length. 
Given a list of lists of the form [[0,1,2],[0,1],[1,2],[0,1]] ..., return a unique id for each permutation between lists; e.g., the above would return 3*2*2*2 values ranging from 0 to 3^4 - 1 (3 is the max value possible in each of the lists, maxValue).
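One plausible reading of this mapping is a fixed-radix encoding in which each list contributes one digit in base maxValue; the sketch below reproduces the 3*2*2*2 distinct ids within [0, 3^4 - 1] for the example, but may not match the domain's own helper exactly::

    from itertools import product

    def permutation_id(choice, max_value):
        """Encode one element drawn from each list as a single base-`max_value` integer."""
        return sum(x * max_value ** i for i, x in enumerate(choice))

    lists = [[0, 1, 2], [0, 1], [1, 2], [0, 1]]
    ids = {permutation_id(c, max_value=3) for c in product(*lists)}
    print(len(ids), min(ids), max(ids))   # 24 distinct ids, all within [0, 3**4 - 1]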
Pacman domain, which acts as a wrapper for the Pacman implementation from the BerkeleyX/CS188.1x course project 3.
STATE: The state vector has a series of dimensions:
nf and nc are map-dependent, and ng can be set as a parameter. Based on the above, the total dimensionality of the state vector is map-dependent and is given by (2 + 3*ng + nf + nc).
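A one-line sketch of that dimensionality count, with ng, nf, and nc as described above (number of ghosts, food pellets, and capsules; the latter two are read off the chosen layout)::

    def pacman_state_dim(num_ghosts, num_food, num_capsules):
        """2 for Pacman's position, 3 per ghost, 1 per food pellet, 1 per capsule."""
        return 2 + 3 * num_ghosts + num_food + num_capsules

    print(pacman_state_dim(num_ghosts=2, num_food=30, num_capsules=4))   # -> 42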
ACTIONS: Move Pacman [up, down, left, right, stay]
REWARD: See the Berkeley project website below for more info.
Note
The visualization runs as fast as your CPU will permit; to slow things down so gameplay is actually visible, uncomment time.sleep() in the showDomain() method.
REFERENCE: This domain is an RLPy wrapper for the implementation from the BerkeleyX/CS188.1x course project 3
See the original source code (zipped)
For more details of the domain see the original package in the Domains/PacmanPackage folder.
location of layouts shipped with rlpy
Checks whether the game should terminate at the given state. (Termination occurs on failure, i.e., Pacman is eaten by a ghost or runs out of time, and on success, when all food on the map has been eaten.) If the game should terminate, returns the proper indication to the step function. Accounts for scoring changes in terminal states.
get the internal game state represented as a numpy array
The goal of this domain is to maneuver a small ball on a plate into a hole. The plate may contain obstacles which should be avoided.
REFERENCE:
See also
G.D. Konidaris and A.G. Barto. Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining. Advances in Neural Information Processing Systems 22, pages 1015-1023, December 2009.
default location of config files shipped with rlpy
Implementation of the puddle world benchmark as described in references below.
STATE: 2-dimensional vector s; each dimension is continuous in [0,1].
ACTIONS: [right, up, left, down]. Note that it is not possible to loiter.
REFERENCE:
See also
Jong, N. & Stone, P.: Kernel-based models for reinforcement learning, ICML (2006)
See also
Sutton, R. S.: Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding, NIPS (1996)
This is a simple simulation of Remote Controlled Car in a room with no obstacle.
STATE: 4 continuous dimensions:
x, y: (center point on the line connecting the back wheels),
speed (S on the webpage)
positive values => turning right, negative values => turning left
ACTIONS: Two action dimensions:
This leads to 3 x 3 = 9 possible actions.
REWARD: -1 per step, +100 at goal.
REFERENCE:
Simulation of system of network computers.
Computers in a network randomly fail and influence the probability of connected machines failing as well; the system administrator must work to keep as many machines running as possible, but she can only fix one at a time.
STATE:
Each computer has binary state {BROKEN, RUNNING}.
The state space is thus 2^n, where n is the number of computers in the system.
All computers are connected to each other by a fixed topology and initially have state RUNNING.
Example
[1 1 0 1] -> computers 0, 1, 3 are RUNNING, computer 2 is BROKEN.
ACTIONS: The action space is the integers [0, n], where n corresponds to taking no action, and [0, n-1] selects a computer to repair.
Repairing a computer causes its state to become RUNNING regardless of its previous state. However, a penalty of -0.75 is applied for taking a repair action.
REWARD: +1 is awarded for each computer with status RUNNING, but -0.75 is applied for any repair action taken (i.e., any action other than the no-op n).
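A minimal sketch of this reward computation, assuming the binary status vector above (1 = RUNNING) and that the last action index n is the no-op described under ACTIONS; names are illustrative, not the rlpy API::

    import numpy as np

    REPAIR_PENALTY = -0.75

    def sysadmin_reward(status, action, n_computers):
        """+1 per RUNNING machine, plus -0.75 if a repair action was taken."""
        reward = float(np.sum(status))
        if action != n_computers:   # any action other than the no-op repairs a machine
            reward += REPAIR_PENALTY
        return reward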
Visualization: Broken computers are colored red, and any links to other computers change from solid to dotted, reflecting the higher probability of failure of those machines.
REFERENCE
See also
Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient Solution Algorithms for Factored MDPs. Journal of Artificial Intelligence Research (2003), Issue 19, pages 399-468.
Parameters:  networkmapname – The name of the file to use as the computer network map. Assumed to be located in the SystemAdministratorMaps directory of RLPy. 

For ring structures, Parr enforces asymmetry by having one machine get extra reward for being RUNNING.
Probability that a machine becomes RUNNING after a repair action is taken
Probability of a machine randomly selfrepairing (no penalty)
Discount factor
Maximum number of steps
Parameters:  path – Path to the map file, of form ‘/Domains/SystemAdministratorMaps/mapname.txt’ 

Sets the internal variables _Neighbors and _Edges, where each cell of _Neighbors is a list containing the neighbors of computer node <i> at index <i>, and _Edges is a list of tuples (node1, node2) where node1 and node2 share an edge and node1 < node2.
A swimmer consisting of a chain of d links connected by rotational joints. Each joint is actuated. The goal is to move the swimmer to a specified goal position.
Note
Adapted from Yuval Tassa's swimmer implementation in MATLAB, available at http://www.cs.washington.edu/people/postdocs/tassa/code/
See also
Tassa, Y., Erez, T., & Smart, B. (2007). Receding Horizon Differential Dynamic Programming. In Advances in Neural Information Processing Systems.