The Reinforcement Learning Library for Education and Research

class rlpy.Domains.Domain.Domain[source]

The Domain controls the environment in which the Agent resides as well as the reward function the Agent is subject to.

The Agent interacts with the Domain in discrete timesteps (see step()); a sequence of such steps forms an episode. At each step, the Agent informs the Domain which indexed action it wants to perform. The Domain then calculates the effect this action has on the environment and updates its internal state accordingly. It also returns the new state to the agent, along with a reward/penalty and a flag indicating whether or not the episode is over (in which case the agent is reset to its initial state).

This process repeats until the Domain determines that the Agent has either completed its goal or failed. The Experiment controls this cycle.

Because Agents are designed to be agnostic to the Domain that they are acting within and the problem they are trying to solve, the Domain needs to completely describe everything related to the task. Therefore, the Domain must not only define the observations that the Agent receives, but also the states it can be in, the actions that it can perform, and the relationships between the three.

The Domain class is a base class that provides the basic framework for all Domains. It provides the methods and attributes that allow child classes to interact with the Agent and Experiment classes within the RLPy library. Domains should also provide methods to visualize the Domain itself and the Agent's learning (showDomain() and showLearning(), respectively).

All new domain implementations should inherit from Domain.
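The documented interface can be sketched with a small, self-contained example. This sketch does not import rlpy; `MiniDomainBase` and `ChainDomain` are hypothetical stand-ins that mimic the attributes and methods described on this page (s0(), step(), isTerminal(), possibleActions()):

```python
import numpy as np

class MiniDomainBase:
    """Hypothetical stand-in for rlpy.Domains.Domain.Domain,
    mimicking the attributes documented on this page."""
    discount_factor = 0.9
    actions_num = 0
    continuous_dims = []
    episodeCap = None

    def __init__(self):
        self.state = None

class ChainDomain(MiniDomainBase):
    """A 5-state chain: action 0 moves left, action 1 moves right.
    Reaching the rightmost state yields reward +1 and ends the episode."""
    actions_num = 2
    statespace_limits = np.array([[0, 4]])
    episodeCap = 50

    def s0(self):
        self.state = np.array([0])
        return self.state.copy()

    def isTerminal(self):
        return self.state[0] == 4

    def possibleActions(self, s=None):
        return np.arange(self.actions_num)

    def step(self, a):
        delta = 1 if a == 1 else -1
        self.state = np.clip(self.state + delta, 0, 4)
        r = 1.0 if self.isTerminal() else 0.0
        return r, self.state.copy(), self.isTerminal(), self.possibleActions()

# One episode of the step cycle described above:
domain = ChainDomain()
s = domain.s0()
total = 0.0
for _ in range(domain.episodeCap):
    r, s, terminal, p_actions = domain.step(1)  # always move right
    total += r
    if terminal:
        break
# total == 1.0: the goal state is reached after 4 steps
```

Note that the episode loop itself is normally driven by the Experiment class, not by the Domain.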


Though the state s can take on almost any value, any dimension not marked as ‘continuous’ is assumed to be integer-valued.

actions_num = 0

The number of Actions the agent can perform

continuous_dims = []

List of the continuous dimensions of the domain

discount_factor = 0.9

The discount factor by which rewards are reduced

discrete_statespace_limits = []

Limits of each dimension of a discrete state space. This is the same as statespace_limits, but without the extra -0.5/+0.5 added at each end of each dimension.
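The relationship between the two limit arrays can be shown with a short sketch (the example limits are illustrative, not taken from any specific rlpy domain):

```python
import numpy as np

# Discrete limits for a hypothetical 2-D grid domain:
# each row is [min, max] for one integer-valued dimension.
discrete_statespace_limits = np.array([[0, 9], [0, 4]], dtype=float)

# statespace_limits widens each discrete dimension by 0.5 on both ends,
# so each integer value sits at the center of a unit-width bin.
statespace_limits = discrete_statespace_limits + np.array([-0.5, 0.5])

print(statespace_limits)
# [[-0.5  9.5]
#  [-0.5  4.5]]
```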

episodeCap = None

The maximum number of steps per episode (after which the domain returns to its start state).


Any stochastic behavior in __init__() is broken out into this function, so that if the random seed is later changed (e.g., by the Experiment), other member variables and functions are updated accordingly.


isTerminal()[source]

Returns True if the current Domain.state is a terminal one, i.e., one that ends the episode. This often results from reaching either a failure or a goal state.

The default implementation never terminates.

Returns: True if the state is a terminal state, False otherwise.

loadRandomState()[source]

Loads the random state stored in self.random_state_backup.

logger = None

A logger object that records printed output to a file


possibleActions(s=None)[source]

The default version returns an enumeration of all actions [0, 1, 2...]. We suggest overriding this method in your domain, especially if not all actions are available from every state.

Parameters: s – The state to query for possible actions (overrides self.state if s is not None)
Returns: A numpy array containing every possible action in the domain.


These actions must be integers; internally they may be handled using other datatypes. See vec2id() and id2vec() for converting between integers and multidimensional quantities.
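The conversion between integer action IDs and multidimensional action vectors can be sketched with NumPy's row-major index helpers (the function names `vec_to_id`/`id_to_vec` and the joystick example are illustrative stand-ins, not the rlpy vec2id()/id2vec() implementations):

```python
import numpy as np

# Each action dimension has a fixed number of discrete values, e.g. a
# hypothetical 2-D "joystick" with 3 horizontal and 3 vertical settings,
# giving 3 * 3 == 9 flat action IDs.
limits = (3, 3)

def vec_to_id(vec, limits):
    """Flatten a multidimensional action into a single integer ID (row-major)."""
    return int(np.ravel_multi_index(vec, limits))

def id_to_vec(action_id, limits):
    """Recover the multidimensional action from its integer ID."""
    return tuple(int(i) for i in np.unravel_index(action_id, limits))

aid = vec_to_id((2, 1), limits)   # row-major: 2 * 3 + 1 == 7
assert id_to_vec(aid, limits) == (2, 1)
```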


s0()[source]

Begins a new episode and returns the initial observed state of the Domain. Sets self.state accordingly.

Returns: A numpy array that defines the initial domain state.
sampleStep(a, num_samples)[source]

Sample a set number of next states and rewards from the domain. This function is used when state transitions are stochastic; deterministic transitions will yield an identical result regardless of num_samples, since repeatedly sampling a (state,action) pair will always yield the same tuple (r,ns,terminal). See step().

Parameters:
  • a – The action to attempt.
  • num_samples – The number of next states and rewards to sample.

Returns: A tuple of arrays (S[], A[]) where S is an array of next states and A is an array of the corresponding rewards.
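The sampling behavior can be illustrated with a self-contained toy (the function `sample_step`, its `noise` parameter, and the chain dynamics are illustrative assumptions, not part of the rlpy API):

```python
import numpy as np

def sample_step(state, a, num_samples, noise=0.1, rng=None):
    """Sample `num_samples` (next state, reward) pairs for a noisy 1-D move:
    with probability `noise`, the effect of the action is reversed."""
    if rng is None:
        rng = np.random.default_rng(0)
    next_states = np.empty(num_samples)
    rewards = np.empty(num_samples)
    for i in range(num_samples):
        direction = 1 if a == 1 else -1
        if rng.random() < noise:
            direction = -direction  # stochastic "slip"
        next_states[i] = state + direction
        rewards[i] = 1.0 if next_states[i] >= 5 else 0.0
    return next_states, rewards

# A deterministic transition (noise=0.0) yields identical samples every time,
# so num_samples > 1 adds no information; with noise > 0 the samples
# approximate the transition distribution.
S, R = sample_step(state=4, a=1, num_samples=10, noise=0.0)
```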


Stores the state of the random generator. Using loadRandomState() this state can be loaded.

show(a=None, representation=None)[source]

Shows a visualization of the current state of the domain and that of learning.

See showDomain() and showLearning(), both called by this method.


Some domains override this function to allow an optional s parameter to be passed, which overrides the self.state internal to the domain; however, not all have this capability.

Parameters:
  • a – The action being performed.
  • representation – The learned value function Representation.

showDomain(a=0)[source]

Abstract Method:

Shows a visualization of the current state of the domain.

Parameters: a – The action being performed.

showLearning(representation)[source]

Abstract Method:

Shows a visualization of the current learning, usually in the form of a gridded value function and policy. It is thus really only feasible for domains with one- or two-dimensional state spaces.

Parameters: representation – the learned value function Representation used to generate the value function / policy plots.
state_space_dims = 0

Number of dimensions of the state space

states_num = 0

The number of possible states in the domain

statespace_limits = []

Limits of each dimension of the state space. Each row corresponds to one dimension and has two elements [min, max]


step(a)[source]

Abstract Method:

Performs the action a and updates the Domain state accordingly. Returns the reward/penalty the agent obtains for the (state, action) pair determined by Domain.state and the parameter a, the next state the agent transitions into, and a boolean indicating whether a goal or fail state has been reached.


Domains often specify stochastic internal state transitions, such that the result of a (state, action) pair may vary across calls (see also the sampleStep() method). Check each domain's noise parameters if you require deterministic transitions.

Parameters: a – The action to perform.


The action a must be an integer >= 0, and might better be called the “actionID”. See the class description Domain above.

Returns: The tuple (r, ns, t, p_actions) = (reward [value], next observed state, isTerminal [boolean], possible actions from the next state).