BlocksWorld

class rlpy.Domains.BlocksWorld.BlocksWorld(blocks=6, towerSize=6, noise=0.3)[source]

Classical BlocksWorld Domain [Winograd, 1971].

The objective is to put blocks on top of each other in a specific order to form a tower. Initially all blocks are unstacked and on the table.

STATE: The state of the MDP is defined by n integer values [s_1 ... s_n]: s_i = j indicates that block i is on top of block j (for compactness, s_i = i indicates that block i is on the table).

[0 1 2 3 4 0] => all blocks are on the table except block 5, which is on top of block 0

ACTIONS: At each step, the agent can take a clear block (one with no other block on top of it) and either put it on top of another clear block or move it to the table.

TRANSITION: Each move fails with probability noise (0.3 by default), in which case the agent drops the moving block on the table. Otherwise the move succeeds.

REWARD: The reward is -0.001 for each step where the tower is not built and +1.0 when the tower is built.
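
A minimal usage sketch is shown below. It assumes the generic rlpy Domain interface of rlpy 1.x, i.e. that s0() returns a (state, terminal, possible actions) tuple and step(a) returns (reward, next state, terminal, possible actions); check your installed version if these differ.

    import numpy as np
    from rlpy.Domains import BlocksWorld

    domain = BlocksWorld(blocks=6, towerSize=6, noise=0.3)
    s, terminal, p_actions = domain.s0()   # initially every block is on the table, so s_i = i
    print(s)

    # take random feasible actions until the tower is built or the episode cap is hit
    for _ in range(domain.episodeCap):
        a = np.random.choice(p_actions)
        r, s, terminal, p_actions = domain.step(a)
        if terminal:
            break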

REFERENCE:

See also

Alborz Geramifard, Finale Doshi, Joshua Redding, Nicholas Roy, and Jonathan How. Online discovery of feature dependencies. International Conference on Machine Learning (ICML), pages 881-888. ACM, June 2011

GOAL_REWARD = 1

reward when the tower is completed

STEP_REWARD = -0.001

reward per step

blocks = 0

Total number of blocks

discount_factor = 1

discount factor

domain_fig = None

Used to plot the domain

towerSize = 0

Goal tower size

Linear Chain MDP

class rlpy.Domains.ChainMDP.ChainMDP(chainSize=2)[source]

A simple Chain MDP.

STATE: s0 <-> s1 <-> ... <-> sn

ACTIONS: are left [0] and right [1], deterministic.

Note

The actions [left, right] are available in ALL states, but if left is selected in s0 or right in sn, then s remains unchanged.

The task is to reach sn from s0, after which the episode terminates.

Note

The optimal policy is to always go right.

REWARD: -1 per step, 0 at goal (terminates)
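
A short rollout of the always-go-right policy, under the same Domain interface assumptions (s0()/step() return tuples) as the BlocksWorld sketch above:

    from rlpy.Domains import ChainMDP

    RIGHT = 1
    domain = ChainMDP(chainSize=10)
    s, terminal, p_actions = domain.s0()

    total_reward, steps = 0.0, 0
    while not terminal and steps < 100:      # safety cap
        r, s, terminal, p_actions = domain.step(RIGHT)
        total_reward += r
        steps += 1

    print(steps, total_reward)               # roughly chainSize - 1 steps of -1 reward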

REFERENCE:

See also

Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, Volume 4 (2003).

Parameters: chainSize – Number of states ‘n’ in the chain.
GOAL_REWARD = 0

Reward for each timestep spent in the goal region

RADIUS = 0.5

Used for graphical radius of states

STEP_REWARD = -1

Reward for each timestep

chainSize = 0

Number of states in the chain

episodeCap = 0

Maximum number of steps per episode (set by the domain)

Fifty-State Chain MDP

class rlpy.Domains.FiftyChain.FiftyChain[source]

Random start location, goal is to proceed to nearest reward.

STATE: s0 <-> s1 <-> ... <-> s49

ACTIONS: left [0] or right [1]

Actions succeed with probability .9, otherwise execute opposite action. Note that the actions [left, right] are available in ALL states, but if left is selected in s0 or right in s49, then s remains unchanged.

Note

The optimal policy is to always go to the nearest goal

REWARD: +1 at the 10th and 41st states (indices 9 and 40). The reward is obtained when transitioning out of a reward state, not when first entering it.

Note that this class provides the function L_inf_distance_to_V_star(), which accepts an arbitrary representation and returns the L-infinity error between it and the optimal value function. The user can also enforce actions under the optimal policy (ignoring the agent’s policy) by setting using_optimal_policy=True in FiftyChain.py.
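
As a standalone illustration of the nearest-goal policy described above (this is not the domain's own storeOptimalPolicy() code):

    GOAL_STATES = [9, 40]   # indices of the two reward states
    LEFT, RIGHT = 0, 1

    def nearest_goal_action(s):
        """Action that moves state s toward the closest reward state."""
        nearest = min(GOAL_STATES, key=lambda g: abs(g - s))
        return LEFT if nearest < s else RIGHT

    assert nearest_goal_action(0) == RIGHT    # head toward state 9
    assert nearest_goal_action(20) == LEFT    # state 9 is closer than state 40
    assert nearest_goal_action(30) == RIGHT   # state 40 is closer than state 9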

REFERENCE:

See also

Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, Volume 4 (2003).

GOAL_REWARD = 1

Reward for each timestep spent in the goal region

GOAL_STATES = [9, 40]

Indices of states with rewards

L_inf_distance_to_V_star(representation)[source]
Parameters: representation – An arbitrary learned representation of the value function.
Returns: the L-infinity distance between the given representation and the optimal one.
chainSize = 50

Number of states in the chain

episodeCap = 50

Maximum number of steps per episode (set by the domain)

p_action_failure = 0.1

Probability of taking the other (unselected) action

storeOptimalPolicy()[source]

Computes and stores the optimal policy on this particular chain.

Warning

This ONLY applies for the scenario where two states provide reward - this policy will be suboptimal for all other domains!

FlipBoard

class rlpy.Domains.FlipBoard.FlipBoard[source]

A domain based on the last puzzle of Doors and Rooms Game stage 5-3.

The goal of the game is to get all elements of a 4x4 board to have value 1.

The initial state is the following:

1 0 0 0
0 0 0 0
0 1 0 0
0 0 1 0

STATE: a 4x4 array of binary values.

ACTION: Invert the value of a given [Row, Col] (from 0->1 or 1->0).

TRANSITION: Deterministically flip all elements of the board in the same row or column as the action.

REWARD: -1 per step; 0 when the board is solved (all ones).

REFERENCE:
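
A standalone numpy sketch of the transition described above (not the domain's own code). How the chosen cell itself is treated is an assumption here: it ends up flipped a net single time, consistent with the ACTION description.

    import numpy as np

    board = np.array([[1, 0, 0, 0],
                      [0, 0, 0, 0],
                      [0, 1, 0, 0],
                      [0, 0, 1, 0]])      # the initial state shown above

    def flip(board, row, col):
        """Invert every cell in the chosen row and column."""
        new = board.copy()
        new[row, :] ^= 1
        new[:, col] ^= 1
        new[row, col] ^= 1   # the intersection was flipped twice above; flip once more (assumption)
        return new

    print(flip(board, 0, 0))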

GridWorld

class rlpy.Domains.GridWorld.GridWorld(mapname='/home/bob/git/rlpy/rlpy/Domains/GridWorldMaps/4x5.txt', noise=0.1, episodeCap=None)[source]

The GridWorld domain simulates a path-planning problem for a mobile robot in an environment with obstacles. The goal of the agent is to navigate from the starting point to the goal state. The map is loaded from a text file of numbers, with the following coding for each cell:

  • 0: empty
  • 1: blocked
  • 2: start
  • 3: goal
  • 4: pit

STATE: The Row and Column corresponding to the agent’s location.

ACTIONS: Four cardinal directions: up, down, left, right (given that the destination is not blocked or out of range).

TRANSITION: Each move fails with probability noise (0.1 by default), in which case the action is replaced with a random action at that timestep. Otherwise the move succeeds and the agent moves in the intended direction.

REWARD: The reward on each step is -0.001, except for actions that bring the agent to the goal, which yield a reward of +1.
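
A minimal end-to-end sketch in the style of the rlpy tutorials; the component constructors used below (Tabular, eGreedy, Q_Learning, Experiment) and their keyword arguments follow the tutorial examples and may differ slightly across rlpy versions.

    from rlpy.Domains import GridWorld
    from rlpy.Agents import Q_Learning
    from rlpy.Representations import Tabular
    from rlpy.Policies import eGreedy
    from rlpy.Experiments import Experiment

    domain = GridWorld(noise=0.1)          # default 4x5 map shipped with rlpy
    representation = Tabular(domain)
    policy = eGreedy(representation, epsilon=0.2)
    agent = Q_Learning(policy=policy, representation=representation,
                       discount_factor=domain.discount_factor,
                       initial_learn_rate=0.1)
    experiment = Experiment(agent=agent, domain=domain, exp_id=1,
                            max_steps=10000, num_policy_checks=10,
                            checks_per_policy=10, path="Results/GridWorld")
    experiment.run()
    experiment.save()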

ACTIONS = array([[-1, 0], [ 1, 0], [ 0, -1], [ 0, 1]])

Up, Down, Left, Right

COLS = 0

Number of columns of the map

GOAL_REWARD = 1

Reward constants

NOISE = 0

Movement Noise

ROWS = 0

Number of rows of the map

episodeCap = None

Set by the domain = min(100,rows*cols)

HIV Treatment

class rlpy.Domains.HIVTreatment.HIVTreatment[source]

Simulation of HIV Treatment. The aim is to find an optimal drug schedule.

STATE: The state contains concentrations of 6 different cells:

  • T1: non-infected CD4+ T-lymphocytes [cells / ml]
  • T1*: infected CD4+ T-lymphocytes [cells / ml]
  • T2: non-infected macrophages [cells / ml]
  • T2*: infected macrophages [cells / ml]
  • V: number of free HI viruses [copies / ml]
  • E: number of cytotoxic T-lymphocytes [cells / ml]

ACTIONS: The therapy consists of 2 drugs (reverse transcriptase inhibitor [RTI] and protease inhibitor [PI]), each of which can be active or not. The action space therefore consists of 4 actions (see the sketch after the list):

  • 0: none active
  • 1: RTI active
  • 2: PI active
  • 3: RTI and PI active
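
The action index can therefore be read as a pair of drug switches. The mapping below is an illustration of that encoding (not code from the domain), and the s0()/step() calls rely on the same generic Domain interface assumed in the earlier sketches.

    from rlpy.Domains import HIVTreatment

    # action index -> (RTI active, PI active), following the list above
    ACTION_EFFECTS = {0: (False, False), 1: (True, False),
                      2: (False, True),  3: (True, True)}

    domain = HIVTreatment()
    s, terminal, p_actions = domain.s0()
    r, s, terminal, p_actions = domain.step(3)   # both drugs active for one dt = 5 day interval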

REFERENCE:

See also

Ernst, D., Stan, G., Gonçalves, J. & Wehenkel, L. Clinical data based optimal STI strategies for HIV: A reinforcement learning approach. In Proceedings of the 45th IEEE Conference on Decision and Control (2006).

dt = 5

measurement every 5 days

episodeCap = 200

total of 1000 days with a measurement every 5 days

logspace = True

whether observed states are in log10 space or not

showDomain(a=0, s=None)[source]

shows a live graph of each concentration

show_domain_every = 20

only update the graphs in showDomain every x steps

Helicopter Hovering

class rlpy.Domains.HelicopterHover.HelicopterHoverExtended(noise_level=1.0, discount_factor=0.95)[source]

Implementation of a simulator that models one of the Stanford autonomous helicopters (an XCell Tempest helicopter) in the flight regime close to hover.

Adapted from the RL-Community Java Implementation

STATE: The state of the helicopter is described by a 20-dimensional vector with the following entries:

  • 0: xerr [helicopter x-coord position - desired x-coord position] – helicopter’s x-axis points forward
  • 1: yerr [helicopter y-coord position - desired y-coord position] – helicopter’s y-axis points to the right
  • 2: zerr [helicopter z-coord position - desired z-coord position] – helicopter’s z-axis points down
  • 3: u [forward velocity]
  • 4: v [sideways velocity (to the right)]
  • 5: w [downward velocity]
  • 6: p [angular rate around helicopter’s x axis]
  • 7: q [angular rate around helicopter’s y axis]
  • 8: r [angular rate around helicopter’s z axis]
  • 9-12: orientation of the helicopter in the world frame as a quaternion
  • 13-18: current noise due to gusts (usually not observable!)
  • 19: t [number of timesteps in the current episode]

REFERENCE:

See also

Abbeel, P., Ganapathi, V. & Ng, A. Learning vehicular dynamics, with application to modeling helicopters. Advances in Neural Information Processing Systems (2006).

MAX_POS = 20.0

[m] maximum deviation in position in each dimension

MAX_VEL = 10.0

[m/s] maximum velocity in each dimension

actions = array([[-0.2 , -0.2 , -0.2 , 0. ], [-0.2 , -0.2 , -0.2 , 0.15], [-0.2 , -0.2 , -0.2 , 0.3 ], ..., [ 0.2 , 0.2 , 0.2 , 0.15], [ 0.2 , 0.2 , 0.2 , 0.3 ], [ 0.2 , 0.2 , 0.2 , 0.5 ]])

all possible actions

discount_factor = 0.95

discount factor

dt = 0.01

length of one timestep

wind = array([ 0., 0., 0.])

wind in neutral orientation

class rlpy.Domains.HelicopterHover.HelicopterHover(noise_level=1.0, discount_factor=0.95)[source]

Warning

This domain has an internal hidden state, as it actually is a POMDP. Besides the 12-dimensional observable state, there is an internal state saved as self.hidden_state_ (time and long-term noise which simulates gusts of wind). Be aware of this state if you use this class to produce samples which are not in order.

Implementation of a simulator that models one of the Stanford autonomous helicopters (an XCell Tempest helicopter) in the flight regime close to hover.

Adapted from the RL-Community Java Implementation

STATE: The state of the helicopter is described by a 12-dimensional vector with the following entries:

  • 0: xerr [helicopter x-coord position - desired x-coord position] – helicopter’s x-axis points forward
  • 1: yerr [helicopter y-coord position - desired y-coord position] – helicopter’s y-axis points to the right
  • 2: zerr [helicopter z-coord position - desired z-coord position] – helicopter’s z-axis points down
  • 3: u [forward velocity]
  • 4: v [sideways velocity (to the right)]
  • 5: w [downward velocity]
  • 6: p [angular rate around helicopter’s x axis]
  • 7: q [angular rate around helicopter’s y axis]
  • 8: r [angular rate around helicopter’s z axis]
  • 9-11: orientation of the world in the helicopter frame as a quaternion

REFERENCE:

See also

Abbeel, P., Ganapathi, V. & Ng, A. Learning vehicular dynamics, with application to modeling helicopters. Advances in Neural Information Processing Systems (2006).

Intruder Monitoring

class rlpy.Domains.IntruderMonitoring.IntruderMonitoring(mapname='/home/bob/git/rlpy/rlpy/Domains/IntruderMonitoringMaps/4x4_2A_3I.txt')[source]

Formulated as an MDP, the intruder monitoring task is to guard danger zones using cameras so that if an intruder moves to a danger zone, at least one camera is pointing at that location.

All locations are on a 2-D grid.

The episode is finished after 1000 steps.

STATE:

Location of: [ Agent_1, Agent_2, ... Agent n ]

Location of: [ Intruder_1, Intruder_2, ... Intruder_m ]

Where n is number of agents, m is number of intruders.

ACTIONS: [Up, Down, Left, Right, Remain]^n (one action for each agent).

TRANSITION: Each agent can move in 4 directions or stay still. There is no noise on any movement. Each intruder moves with a fixed policy (specified by the user); by default, the intruder policy is uniformly random.

The map of the world contains a fixed number of danger zones. Maps are simple text files contained in the Domains/IntruderMonitoringMaps/ directory.

REWARD:

-1 for every visit of an intruder to a danger zone with no camera present.

The team receives a penalty whenever there is an intruder on a danger zone in the absence of an agent. The task is to allocate agents on the map so that intruders do not enter the danger zones without attendance of an agent.
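
A standalone sketch of that reward computation (the function and argument names are illustrative, not the domain's internals):

    INTRUSION_PENALTY = -1.0

    def intrusion_reward(agent_locs, intruder_locs, danger_zones):
        """Penalty for each danger zone occupied by an intruder with no agent on the same cell."""
        agents = {tuple(a) for a in agent_locs}
        intruders = {tuple(i) for i in intruder_locs}
        return INTRUSION_PENALTY * sum(1 for z in map(tuple, danger_zones)
                                       if z in intruders and z not in agents)

    # only (2, 3) is an unattended intrusion, so the reward is -1.0
    print(intrusion_reward(agent_locs=[(0, 0)], intruder_locs=[(2, 3), (0, 0)],
                           danger_zones=[(2, 3), (0, 0)]))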

ACTIONS_PER_AGENT = array([[-1, 0], [ 1, 0], [ 0, -1], [ 0, 1], [ 0, 0]])

Actions: Up, Down, Left, Right, Null

COLS = 0

Number of columns of the map

INTRUSION_PENALTY = -1.0

Rewards

IntruderPolicy(s_i)[source]
Parameters: s_i – The state of a single agent (where the domain state s = [s_0, ... s_i ... s_NUMBER_OF_AGENTS]).
Returns: a valid action for the agent in state s_i to take.

By default, a random action among the possible ones.

NUMBER_OF_AGENTS = 0

Number of Cooperating agents

NUMBER_OF_INTRUDERS = 0

Number of Intruders

ROWS = 0

Number of rows of the map

default_map_dir = '/home/bob/git/rlpy/rlpy/Domains/IntruderMonitoringMaps'

directory with maps shipped with rlpy

possibleActionsPerAgent(s_i)[source]

Returns all possible actions for a single (2-D) agent state s_i (where the domain state s = [s_0, ... s_i ... s_NUMBER_OF_AGENTS])

  1. tile the [R,C] for all actions
  2. add all actions to the results
  3. Find feasible rows and add them as possible actions
step(a)[source]

Move all intruders according to IntruderPolicy() (uniformly random by default). Move all agents according to the selected action a. Calculate the reward: the intrusion penalty for each danger zone occupied by an intruder while no agent is present (i.e., an intruder occupies a danger cell with no agent simultaneously occupying that cell).

Mountain Car

class rlpy.Domains.MountainCar.MountainCar(noise=0)[source]

The goal is to drive an underpowered car up to the top of the hill.

STATE: Position and velocity of the car [x, xdot]

ACTIONS: [Acc backwards, Coast, Acc forward]

TRANSITIONS: Move along the hill with some noise on the movement.

REWARD: -1 per step and 0 at or beyond goal (x-goal > 0).

There is optional noise on vehicle acceleration.

REFERENCE: Based on RL-Community Java Implementation

Parameters: noise – Magnitude of noise (times accelerationFactor) in stochastic velocity changes
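
A random-action rollout sketch, under the same Domain interface assumptions as the earlier examples:

    import numpy as np
    from rlpy.Domains import MountainCar

    domain = MountainCar(noise=0)
    s, terminal, p_actions = domain.s0()      # s = [x, xdot]
    for _ in range(1000):
        a = np.random.choice(p_actions)       # 0: accelerate backwards, 1: coast, 2: accelerate forward
        r, s, terminal, p_actions = domain.step(a)
        if terminal:                          # x has reached GOAL = 0.5
            break
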
GOAL = 0.5

X-Position of the goal location (Should be at/near hill peak)

GOAL_REWARD = 0

Reward for reaching the goal.

XDOTMAX = 0.07

Upper bound on car velocity

XMAX = 0.6

Upper bound on domain position

hillPeakFrequency = 3.0

Hill peaks are generated as sinusoid; this is freq. of that sinusoid.

isTerminal()[source]
Returns: True if the car has reached or exceeded the goal position.
noise = 0

Magnitude of noise (times accelerationFactor) in stochastic velocity changes

step(a)[source]

Take acceleration action a, adding noise as specified in __init__().

Persistent Search and Track Mission

class rlpy.Domains.PST.PST(NUM_UAV=3)[source]

Persistent Search and Track Mission with multiple Unmanned Aerial Vehicle (UAV) agents.

Goal is to perform surveillance and communicate it back to base in the presence of stochastic communication and “health” (overall system functionality) constraints, all without losing any UAVs because they run out of fuel.

STATE:

Each UAV has 4 state dimensions:

  • LOC: position of a UAV: BASE (0), REFUEL (1), COMMS (2), SURVEIL (3).
  • FUEL: integer fuel qty remaining.
  • ACT_STATUS: Actuator status: see description for info.
  • SENS_STATUS: Sensor status: see description for info.

The domain state vector consists of 4 blocks of states, each corresponding to a property of the UAVs (listed above).

So for example:

>>> state = [1,2,9,3,1,0,1,1]

corresponds to blocks

>>> loc, fuel, act_status, sens_status = [1,2], [9,3], [1,0], [1,1]

which has the meaning:

UAV 1 in location 1, with 9 fuel units remaining, and sensor + actuator with status 1 (functioning). UAV 2 in location 2, 3 fuel units remaining, actuator with status 0 and sensor with status 1.

ACTIONS:

Each UAV can take one of 3 actions: {RETREAT, LOITER, ADVANCE}. Thus, the joint action space has 3^n actions, where n is the number of UAVs.
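
A standalone sketch of how that joint action space is formed (the index-to-action ordering is an illustration, not necessarily the one PST uses internally):

    from itertools import product

    RETREAT, LOITER, ADVANCE = 0, 1, 2
    NUM_UAV = 3

    joint_actions = list(product([RETREAT, LOITER, ADVANCE], repeat=NUM_UAV))
    print(len(joint_actions))    # 3 ** NUM_UAV = 27
    print(joint_actions[0])      # (0, 0, 0): every UAV retreats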

Detailed Description

The objective of the mission is to fly to the surveillance node and perform surveillance on a target, while ensuring that a communication link with the base is maintained by having a UAV with a working actuator loitering on the communication node.

Movement of each UAV is deterministic with 5% failure rate for both the actuator and sensor of each UAV on each step. A penalty is applied for each unit of fuel consumed, which occurs when a UAV moves between locations or when it is loitering above a COMMS or SURVEIL location (ie, no penalty when loitering at REFUEL or BASE).

A UAV with a failed sensor cannot perform surveillance. A UAV with a failed actuator cannot perform surveillance or communication, and can only take actions leading it back to the REFUEL or BASE states, where it may loiter.

Loitering for 1 timestep at REFUEL assigns fuel of 10 to that UAV.

Loitering for 1 timestep at BASE assigns status 1 (functioning) to Actuator and Sensor.

Finally, if any UAV runs out of fuel, the episode terminates with a large penalty.

REWARD

The objective of the mission is to fly to the surveillance node and perform surveillance on a target, while ensuring that a communication link with the base is maintained by having a UAV with a working actuator loitering on the communication node.

The agent receives +20 if a UAV with a working sensor is at the surveillance node while a UAV with a working actuator is at the communication node, a penalty of -50 if any UAV crashes, and always a small penalty for burned fuel.

REFERENCE:

See also

J. D. Redding, T. Toksoz, N. Ure, A. Geramifard, J. P. How, M. Vavrina, and J. Vian. Distributed Multi-Agent Persistent Surveillance and Tracking With Health Management. AIAA Guidance, Navigation, and Control Conference (2011).

Parameters: NUM_UAV – the number of UAVs in the domain
FUEL_BURN_REWARD_COEFF = -1

Negative reward coefficient for the fuel burn penalty [not mentioned in MDP Tutorial]

FULL_FUEL = 10

Number of fuel units at start

MOVE_REWARD_COEFF = 0

Reward (negative) coefficient for movement (i.e., fuel burned while loitering might be penalized above, but no movement cost)

NUM_TARGET = 1

Number of targets in surveillance region; SURVEIL_REWARD is multiplied by the number of targets successfully observed

P_ACT_FAIL = 0.05

Probability that an actuator fails on this timestep for a given UAV

P_SENSOR_FAIL = 0.05

Probability that a sensor fails on this timestep for a given UAV

SURVEIL_REWARD = 20

Per-step, per-UAV reward coefficient for performing surveillance on each step [C_cov]

discount_factor = 0.9

Discount factor

properties2StateVec(locations, fuel, actuator, sensor)[source]

Concatenates the arguments into a numpy array to create an RLPy state vector.

state2Struct(s)[source]

Convert generic RLPy state s to internal state

Parameters: s – RLPy state
Returns: PST.StateStruct – the custom structure used by this domain.
struct2State(sState)[source]

Converts a custom PST.StateStruct to an RLPy state vector.

Parameters: sState – the PST.StateStruct object
Returns: RLPy state vector
vecList2id(x, maxValue)[source]

Returns a list of unique ids based on the possible combinations across a list of integer lists. The lengths of the integer lists need not be the same.

Parameters:
  • x – A list of varying-length lists
  • maxValue – the largest value a cell of x can take.
Returns:

int – unique value associated with a list of lists of this length.

Given a list of lists of the form [[0,1,2],[0,1],[1,2],[0,1]]..., return a unique id for each combination across the lists; e.g., the example above yields 3*2*2*2 = 24 ids in the range 0 to 3^4 - 1 (3 being maxValue, the largest value possible in each of the lists).
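
An illustrative stand-in for the same encoding (not the method's own code): each combination that picks one entry per sub-list is read as a base-maxValue number, so every id falls in [0, maxValue**len(x) - 1].

    from itertools import product

    def combination_ids(x, max_value):
        """Unique base-`max_value` id for every way of picking one element per sub-list."""
        ids = []
        for combo in product(*x):
            uid = 0
            for v in combo:
                uid = uid * max_value + v
            ids.append(uid)
        return ids

    ids = combination_ids([[0, 1, 2], [0, 1], [1, 2], [0, 1]], max_value=3)
    print(len(ids))            # 3 * 2 * 2 * 2 = 24 combinations
    print(max(ids) < 3 ** 4)   # True: ids stay within [0, 3**4 - 1]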

vecList2idHelper(x, actionIDs, ind, curActionList, maxValue, limits)[source]

Helper method for vecList2id().

Returns: a list of unique ids based on the possible combinations of this list of lists.

See vecList2id()

Pacman

class rlpy.Domains.Pacman.Pacman(noise=0.1, timeout=30, layoutFile='/home/bob/git/rlpy/rlpy/Domains/PacmanPackage/layouts/trickyClassic.lay', numGhostAgents=1000)[source]

Pacman domain, which acts as a wrapper for the Pacman implementation from the BerkeleyX/CS188.1x course project 3.

STATE: The state vector has a series of dimensions:

  • [2] The x and y coordinates of pacman
  • [3 * ng] the x and y coordinates as well as the scare time of each ghost (“scare time” is how long the ghost remains scared after Pacman consumes a capsule)
  • [nf] binary variables indicating if a food is still on the board or not
  • [nc] binary variables for each capsule indicating if it is still on the board or not

nf and nc are map-dependent, and ng can be set as a parameter. Based on above, total dimensionality of state vector is map-dependent, and given by (2 + 3*ng + nf + nc).

ACTIONS: Move Pacman [up, down, left, right, stay]

REWARD: See the Berkeley project website below for more info.

Note

The visualization runs as fast as your CPU will permit; to slow things down so gameplay is actually visible, uncomment time.sleep() in the showDomain() method.

REFERENCE: This domain is an RLPy wrapper for the implementation from the BerkeleyX/CS188.1x course project 3

See the original source code (zipped)

For more details of the domain see the original package in the Domains/PacmanPackage folder.

layoutFile:
filename of the map file
noise:
with this probability pacman makes a random move instead of the one specified by the action
default_layout_dir = '/home/bob/git/rlpy/rlpy/Domains/PacmanPackage/layouts'

location of layouts shipped with rlpy

isTerminal()[source]

Checks whether the game should terminate at the given state. (Terminate for failure, ie eaten by ghost or out of time, and for success, all food on map eaten.) If game should terminate, returns the proper indication to step function. Accounts for scoring changes in terminal states.

s0()[source]

Re-initializes internal states when an episode starts and returns the initial state vector s.

state

get the internal game state represented as a numpy array

step(a)[source]

Applies actions from outside the Pacman domain to the given state. Internal states accounted for along with scoring and terminal checking. Returns a tuple of form (reward, new state vector, terminal)

Pinball

class rlpy.Domains.Pinball.Pinball(noise=0.1, episodeCap=1000, configuration='/home/bob/git/rlpy/rlpy/Domains/PinballConfigs/pinball_simple_single.cfg')[source]

The goal of this domain is to maneuver a small ball on a plate into a hole. The plate may contain obstacles which should be avoided.

STATE:
The state is given by a 4-dimensional vector, consisting of position and velocity of the ball.
ACTIONS:
There are 5 actions, standing for slanting the plate in the x or y direction or keeping the plate horizontal.
REWARD:
Slanting the plate costs -4 reward in addition to -1 reward for each timestep. When the ball reaches the hole, the agent receives 10000 units of reward.

REFERENCE:

See also

G.D. Konidaris and A.G. Barto: Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining. Advances in Neural Information Processing Systems 22, pages 1015-1023, December 2009.

configuration:
location of the configuration file
episodeCap:
maximum length of an episode
noise:
with probability noise, a uniformly random action is executed
default_config_dir = '/home/bob/git/rlpy/rlpy/Domains/PinballConfigs'

default location of config files shipped with rlpy

PuddleWorld

class rlpy.Domains.PuddleWorld.PuddleWorld(noise_level=0.01, discount_factor=1.0)[source]

Implementation of the puddle world benchmark as described in references below.

STATE: 2-dimensional vector, s, each dimension is continuous in [0,1]

ACTIONS: [right, up, left, down] - NOTE it is not possible to loiter.

REWARD: 0 for goal state, -1 for each step, and an additional penalty
for passing near puddles.

REFERENCE:

See also

Jong, N. & Stone, P.: Kernel-based models for reinforcement learning, ICML (2006)

See also

Sutton, R. S.: Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding, NIPS(1996)

RCCar

class rlpy.Domains.RCCar.RCCar(noise=0)[source]

This is a simple simulation of a remote-controlled car in a room with no obstacles.

STATE: 4 continuous dimensions:

  • x, y: (center point on the line connecting the back wheels),

  • speed (S on the webpage)

  • heading (theta on the webpage) w.r.t. body frame.

    positive values => turning right, negative values => turning left

ACTIONS: Two action dimensions:

  • accel [forward, coast, backward]
  • phi [turn left, straight, turn Right]

This leads to 3 x 3 = 9 possible actions.

REWARD: -1 per step, 100 at goal.

REFERENCE:

System Administrator

class rlpy.Domains.SystemAdministrator.SystemAdministrator(networkmapname='/home/bob/git/rlpy/rlpy/Domains/SystemAdministratorMaps/20MachTutorial.txt')[source]

Simulation of a network of computers.

Computers in a network randomly fail and influence the probability of connected machines failing as well - the system administrator must work to keep as many machines running as possible, but she can only fix one at a time.

STATE:

Each computer has binary state {BROKEN, RUNNING}.

The state space is thus 2^n, where n is the number of computers in the system.

All computers are connected to each other by a fixed topology and initially have state RUNNING.

Example

[1 1 0 1] -> computers 0,1,3 are RUNNING, computer 2 is BROKEN.

ACTIONS: The action space is the integers [0,n], where n corresponds to taking no action, and [0,n-1] selects a computer to repair.

Repairing a computer causes its state to become RUNNING regardless of its previous state. However, a penalty of -0.75 is applied for taking a repair action.

REWARD: +1 is awarded for each computer with status RUNNING, but -0.75 is applied for any repair action taken (i.e., whenever the no-action choice is not selected).
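
A standalone sketch of that reward (the constant names are illustrative, not the class attributes):

    import numpy as np

    RUNNING_REWARD = 1.0      # per RUNNING machine, per step
    REPAIR_PENALTY = -0.75    # applied once whenever a repair action is taken

    def reward(status, repaired):
        """status: binary vector with 1 = RUNNING; repaired: True if a repair action was taken."""
        return RUNNING_REWARD * np.sum(status) + (REPAIR_PENALTY if repaired else 0.0)

    print(reward(np.array([1, 1, 0, 1]), repaired=True))   # 3 * 1.0 - 0.75 = 2.25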

Visualization: Broken computers are colored red, and any links to other computers change from solid to dotted, reflecting the higher probability of failure of those machines.

REFERENCE:

See also

Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient Solution Algorithms for Factored MDPs. Journal of Artificial Intelligence Research (2003) Issue 19, p 399-468.

Parameters: networkmapname – The name of the file to use as the computer network map. Assumed to be located in the SystemAdministratorMaps directory of RLPy.
IS_RING = False

For ring structures, Parr enforces asymmetry by having one machine get extra reward for being RUNNING.

P_REBOOT_REPAIR = 1.0

Probability that a machine becomes RUNNING after a repair action is taken

P_SELF_REPAIR = 0.04

Probability of a machine randomly self-repairing (no penalty)

discount_factor = 0.95

Discount factor

episodeCap = 200

Maximum number of steps

loadNetwork(path)[source]
Parameters: path – Path to the map file, of the form ‘/Domains/SystemAdministratorMaps/mapname.txt’

Sets the internal variables _Neighbors and _Edges, where each cell of _Neighbors is a list containing the neighbors of computer node <i> at index <i>, and _Edges is a list of tuples (node1, node2) where node1 and node2 share an edge and node1 < node2.

setNeighbors()[source]

Sets the internal NEIGHBORS variable

Note

Requires a call to setUniqueEdges() first.

setUniqueEdges(neighborsList)[source]
Parameters: neighborsList – each element at index i is a list of nodes connected to the node at i.

Constructs a list of tuples (node1, node2), where node1 and node2 share an edge and node1 < node2, and sets the unique edges of the network (all edges are bidirectional).

Swimmer

class rlpy.Domains.Swimmer.Swimmer(d=3, k1=7.5, k2=0.3)[source]

A swimmer consisting of a chain of d links connected by rotational joints. Each joint is actuated. The goal is to move the swimmer to a specified goal position.

States:
2 dimensions: position of nose relative to goal
d - 1 dimensions: angles
2 dimensions: velocity of the nose
d dimensions: angular velocities
Actions:
each joint torque is discretized in 3 values: -2, 0, 2

Note

adapted from Yuval Tassa’s swimmer implementation in Matlab, available at http://www.cs.washington.edu/people/postdocs/tassa/code/

See also

Tassa, Y., Erez, T., & Smart, B. (2007). Receding Horizon Differential Dynamic Programming. In Advances in Neural Information Processing Systems.

d:
number of joints