Representation

class rlpy.Representations.Representation.Representation(domain, discretization=20, seed=1)[source]

The Representation is the Agent's model of the value function associated with a Domain.

As the Agent interacts with the Domain, it receives updates in the form of state, action, reward, next state, next action.

The Agent passes these quantities to its Representation, which is responsible for maintaining the value function, usually in some lower-dimensional feature space. Agents can later query the Representation for the value of being in a state, V(s), or the value of taking an action in a particular state (known as the Q-function, Q(s,a)).

Note

Throughout the framework, phi refers to the vector of features; phi or phi_s is thus the vector of feature functions evaluated at the state s. phi_s_a appends |A|-1 copies of phi_s, such that |phi_s_a| = |A| * |phi|, where |A| is the size of the action space and |phi| is the number of features. Each of these blocks corresponds to a state-action pair; all blocks except for the selected action a are set to 0.

The Representation class is a base class that provides the basic framework for all representations. It provides the methods and attributes that allow child classes to interact with the Agent and Domain classes within the RLPy library.

All new representation implementations should inherit from this class.

Note

At present, it is assumed that the Linear Function approximator family of representations is being used.

Parameters:
  • domain – the problem Domain to learn
  • discretization – Number of bins used for each continuous dimension. For discrete dimensions, this parameter is ignored.
  • seed – the seed for this representation's random number generator.
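A concrete representation is constructed directly on a Domain instance. A minimal usage sketch, assuming the Tabular representation and GridWorld domain shipped with RLPy (class names and default arguments may vary between versions):

    from rlpy.Domains import GridWorld
    from rlpy.Representations import Tabular

    domain = GridWorld()                                 # a small discrete benchmark domain
    representation = Tabular(domain, discretization=20)  # discretization is ignored for discrete dimensions
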
Q(s, terminal, a, phi_s=None)[source]

Returns the learned value of a state-action pair, Q(s,a).

Parameters:
  • s – The queried state in the state-action pair.
  • terminal – Whether or not s is a terminal state
  • a – The queried action in the state-action pair.
  • phi_s – (optional) The feature vector evaluated at state s. If the feature vector phi(s) has already been cached, pass it here as input so that it need not be computed again.
Returns:

(float) the value of the state-action pair (s,a), Q(s,a).
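Under the linear function approximation assumed by this class (see the note above), this value is the dot product of the learned weight vector with the state-action feature vector phi_sa. An illustrative sketch, assuming s, terminal and a are already defined and that phi_sa() returns a dense numpy vector:

    import numpy as np

    phi_sa = representation.phi_sa(s, terminal, a)       # copy-paste state-action features (see phi_sa())
    q_value = np.dot(representation.weight_vec, phi_sa)  # linear estimate; should match representation.Q(s, terminal, a)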

Q_oneStepLookAhead(s, a, ns_samples, policy=None)[source]

Returns the state-action value, Q(s,a), by performing a one-step look-ahead on the domain.

Note

For an example of how this function works, see Line 8 of Figure 4.3 in Sutton and Barto 1998.

If the domain does not define expectedStep(), this function uses ns_samples samples to estimate the one-step look-ahead. If a policy is passed (as in policy evaluation), it is used to generate the action for the next state. Otherwise the best action is selected.

Note

This function should not be called in any RL algorithms unless the underlying domain is an approximation of the true model.

Parameters:
  • s – The given state
  • a – The given action
  • ns_samples – The number of samples used to estimate the one-step look-ahead.
  • policy – (optional) Used to select the action in the next state (after taking action a) when estimating the one-step look-ahead. If policy == None, the best action will be selected.
Returns:

The one-step lookahead state-action value, Q(s,a).
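Conceptually, the look-ahead estimates the Bellman backup for (s, a). A rough sketch of the sampled, default-policy case (the helper sample_transition(s, a) is hypothetical and stands in for the domain's own step model):

    def q_lookahead_sketch(representation, s, a, ns_samples, discount_factor):
        """Estimate Q(s, a) as the average of r + discount_factor * V(s') over sampled transitions.

        sample_transition(s, a) is a hypothetical stand-in returning
        (reward, next_state, next_terminal, next_possible_actions)."""
        total = 0.0
        for _ in range(ns_samples):
            r, ns, nt, next_actions = sample_transition(s, a)
            total += r + discount_factor * representation.V(ns, nt, next_actions)
        return total / ns_samples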

Qs(s, terminal, phi_s=None)[source]

Returns an array of actions available at a state and their associated values.

Parameters:
  • s – The queried state
  • terminal – Whether or not s is a terminal state
  • phi_s – (optional) The feature vector evaluated at state s. If the feature vector phi(s) has already been cached, pass it here as input so that it need not be computed again.
Returns:

The tuple (Q, A) where:

  • Q: an array of Q(s,a), the values of each action at s.
  • A: the corresponding array of actionIDs (integers)

Note

This function is distinct from Q(), which computes the Q function for an (s,a) pair.

Instead, this function Qs() computes all Q function values (for all possible actions) at a given state s.
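A usage sketch following the return form documented above (whether Qs() returns the (Q, A) tuple or only the value array may differ between library versions; s and terminal are assumed to be defined):

    import numpy as np

    Q_values, action_ids = representation.Qs(s, terminal)   # values and ids of all actions at s
    greedy_action = action_ids[np.argmax(Q_values)]          # one action maximizing Q(s, a)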

Qs_oneStepLookAhead(s, ns_samples, policy=None)[source]

Returns an array of actions and their associated values Q(s,a), by performing one step look-ahead on the domain for each of them.

Note

For an example of how this function works, see Line 8 of Figure 4.3 in Sutton and Barto 1998.

If the domain does not define expectedStep(), this function uses ns_samples samples to estimate the one-step look-ahead. If a policy is passed (as in policy evaluation), it is used to generate the action for the next state. Otherwise the best action is selected.

Note

This function should not be called in any RL algorithms unless the underlying domain is an approximation of the true model.

Parameters:
  • s – The given state
  • ns_samples – The number of samples used to estimate the one-step look-ahead.
  • policy – (optional) Used to select the action in the next state (after taking action a) when estimating the one-step look-ahead. If policy == None, the best action will be selected.
Returns:

an array of length |A| containing the Q(s,a) for each possible a, where |A| is the number of possible actions from state s

V(s, terminal, p_actions, phi_s=None)[source]

Returns the value of state s under possible actions p_actions.

Parameters:
  • s – The queried state
  • terminal – Whether or not s is a terminal state
  • p_actions – the set of possible actions
  • phi_s – (optional) The feature vector evaluated at state s. If the feature vector phi(s) has already been cached, pass it here as input so that it need not be computed again.

See Qs().

V_oneStepLookAhead(s, ns_samples)[source]

Returns the value of being in state s, V(s), by performing one step look-ahead on the domain.

Note

For an example of how this function works, see Line 6 of Figure 4.5 in Sutton and Barto 1998.

If the domain does not define expectedStep(), this function uses ns_samples samples to estimate the one-step look-ahead.

Note

This function should not be called in any RL algorithms unless the underlying domain is an approximation of the true model.

Parameters:
  • s – The given state
  • ns_samples – The number of samples used to estimate the one-step look-ahead.
Returns:

The value of being in state s, V(s).

actions_num = 0

Number of actions in the representation

activeInitialFeatures(s)[source]

Returns the indices of the active initial features of state s, based on the bins in each dimension.

Parameters:s – The given state
Returns:The active initial features of this representation (before expansion)

addNewWeight()[source]

Add a new zero weight, corresponding to a newly added feature, to all actions.

agg_states_num = 0

Number of aggregated states based on the discretization. If the representation is adaptive, set to the best resolution possible.

batchBestAction(all_s, all_phi_s, action_mask=None, useSparse=True)[source]

Accepts a batch of states, returns the best action associated with each.

Note

See bestAction()

Parameters:
  • all_s – An array of all the states to consider.
  • all_phi_s – The feature vectors evaluated at a series of states. Has dimension p x n, where p is the number of states (indexed by row), and n is the number of features.
  • action_mask – (optional) a p x |A| mask on the possible actions to consider, where |A| is the size of the action space. The mask is a binary 2-d array, where 1 indicates an active mask (action is unavailable) while 0 indicates a possible action.
  • useSparse – Determines whether or not to use sparse matrix libraries provided with numpy.
Returns:

An array of the best action associated with each state.

batchPhi_s_a(all_phi_s, all_actions, all_phi_s_a=None, use_sparse=False)[source]

Builds the feature vector for a series of state-action pairs (s,a) using the copy-paste method.

Note

See phi_sa() for more information.

Parameters:
  • all_phi_s – The feature vectors evaluated at a series of states. Has dimension p x n, where p is the number of states (indexed by row), and n is the number of features.
  • all_actions – The set of actions corresponding to each feature. Dimension p x 1, where p is the number of states included in this batch.
  • all_phi_s_a – (Optional) The batch of feature vectors for the state-action pairs (s,a), constructed with the copy-paste method. If this batch has already been cached, pass it here as input so that it need not be computed again.
  • use_sparse – Determines whether or not to use sparse matrix libraries provided with numpy.
Returns:

all_phi_s_a, the batch of state-action feature vectors, of dimension p x (n * |A|), where |A| is the number of actions.

bestAction(s, terminal, p_actions, phi_s=None)[source]

Returns the best action at a given state. If there are multiple best actions, this method selects one of them uniformly randomly. If phi_s [the feature vector at state s] is given, it is used to speed up code by preventing re-computation within this function.

See bestActions()

Parameters:
  • s – The given state
  • terminal – Whether or not the state s is a terminal one.
  • p_actions – the set of possible actions at state s
  • phi_s – (optional) the feature vector at state (s).
Returns:

The best action at the given state.

bestActions(s, terminal, p_actions, phi_s=None)[source]

Returns a list of the best actions at a given state. If phi_s [the feature vector at state s] is given, it is used to speed up code by preventing re-computation within this function.

See bestAction()

Parameters:
  • s – The given state
  • terminal – Whether or not the state s is a terminal one.
  • p_actions – the set of possible actions at state s
  • phi_s – (optional) the feature vector at state (s).
Returns:

A list of the best actions at the given state.
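A usage sketch covering both methods, assuming s and terminal are defined and that the Domain exposes the currently available actions (e.g. through a possibleActions() method):

    p_actions = domain.possibleActions()                              # actions available in the current state
    a = representation.bestAction(s, terminal, p_actions)             # one greedy action (ties broken at random)
    greedy_set = representation.bestActions(s, terminal, p_actions)   # all greedy actions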

binState(s)[source]

Returns a vector where each element is the zero-indexed bin number corresponding to the given state. (See hashState().) Note that this vector will have the same dimensionality as s.

Note

This method is binary compact: the negative case of binary features is excluded from feature activation. For example, if the domain has a light and the light is off, no feature is added, because the very absence of the feature itself corresponds to the light being off.
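The binning itself is simple. A simplified sketch of the idea for a single continuous dimension (hypothetical helper, not the library's internal code):

    import numpy as np

    def bin_of(value, lo, hi, n_bins):
        """Zero-indexed bin of a continuous value in [lo, hi] split into n_bins equal bins."""
        width = (hi - lo) / n_bins
        return int(np.clip((value - lo) // width, 0, n_bins - 1))

    # With lo=0.0, hi=1.0, n_bins=20: bin_of(0.37, 0.0, 1.0, 20) -> 7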

binWidth_per_dim = 0

Width of bins in each dimension

bins_per_dim = 0

Number of possible states per dimension [1-by-dim]

domain = None

The Domain that this Representation is modeling

expectedStepCached = None

A dictionary used to cache expected results of step(). Used for planning algorithms

featureType()[source]

Abstract Method

Return the data type of the underlying features (e.g., 'float').

features_num = 0

Number of features in the representation

hashState(s)[source]

Returns a unique id for a given state. Essentially, enumerate all possible states and return the ID associated with s.

Under the hood: first, discretize continuous dimensions into bins as necessary. Then map the resulting bin vector to a unique integer.
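A sketch of the enumeration idea: treat the bin vector as digits of a mixed-radix number whose radices are the bins per dimension (hypothetical helper, not the library's internal code):

    import numpy as np

    def hash_state_sketch(bin_vector, bins_per_dim):
        """Map a vector of zero-indexed bins to a unique integer id (row-major enumeration)."""
        return int(np.ravel_multi_index(bin_vector, bins_per_dim))

    # With bins_per_dim = (4, 5): hash_state_sketch((2, 3), (4, 5)) -> 2 * 5 + 3 = 13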

init_randomization()[source]

Any stochastic behavior in __init__() is broken out into this function so that if the random seed is later changed (e.g., by the Experiment), other member variables and functions are updated accordingly.

isDynamic = False

True if the number of features may change during execution.

phi(s, terminal)[source]

Returns phi_nonTerminal() for a given representation, or a zero feature vector in a terminal state.

Parameters:
  • s – The state for which to compute the feature vector
  • terminal – Whether or not s is a terminal state
Returns:

numpy array, the feature vector evaluated at state s.

Note

If state s is terminal the feature vector is returned as zeros! This prevents the learning algorithm from wrongfully associating the end of one episode with the start of the next (e.g., thinking that reaching the terminal state causes it to teleport back to the start state s0).
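A sketch of this behavior (equivalent logic, not the library's exact code):

    import numpy as np

    def phi_sketch(representation, s, terminal):
        """Zero feature vector at terminal states, phi_nonTerminal(s) otherwise."""
        if terminal:
            return np.zeros(representation.features_num)
        return representation.phi_nonTerminal(s)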

phi_nonTerminal(s)[source]

Abstract Method

Returns the feature vector evaluated at state s for non-terminal states; see function phi() for the general case.

Parameters:s – The given state
Returns:The feature vector evaluated at state s.
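
Putting the abstract methods together, a minimal custom representation might look like the sketch below. This is an illustrative assumption about the required overrides; real subclasses typically also compute their bins and set features_num before calling the base constructor, as done here:

    import numpy as np
    from rlpy.Representations.Representation import Representation

    class FirstDimIndicator(Representation):
        """Toy sketch: one binary feature per bin of the first state dimension."""

        def __init__(self, domain, discretization=20):
            # Compute bins and the feature count before the base class
            # allocates its weight vector (assumed, but common, pattern).
            self.setBinsPerDimension(domain, discretization)
            self.features_num = int(self.bins_per_dim[0])
            super(FirstDimIndicator, self).__init__(domain, discretization)

        def phi_nonTerminal(self, s):
            phi = np.zeros(self.features_num)
            phi[int(self.binState(s)[0])] = 1.0  # activate the bin containing s along dimension 0
            return phi

        def featureType(self):
            return bool
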
phi_sa(s, terminal, a, phi_s=None, snippet=False)[source]

Returns the feature vector corresponding to a state-action pair. We use the copy-paste technique (Lagoudakis & Parr 2003). Essentially, we append the phi(s) vector to itself |A| times, where |A| is the size of the action space, and zero the feature values of all of these blocks except the one corresponding to the actionID a.

When snippet == False we construct and return the full, sparse phi_sa. When snippet == True, we return the tuple (phi_s, index1, index2) where index1 and index2 are the indices defining the ends of the phi_s block which WOULD be nonzero if we were to construct the full phi_sa.

Parameters:
  • s – The queried state in the state-action pair.
  • terminal – Whether or not s is a terminal state
  • a – The queried action in the state-action pair.
  • phi_s – (optional) The feature vector evaluated at state s. If the feature vector phi(s) has already been cached, pass it here as input so that it need not be computed again.
  • snippet – if True, do not return a single phi_sa vector, but instead a tuple of the components needed to create it. See return value below.
Returns:

If snippet==False, return the enormous phi_sa vector constructed by the copy-paste method. If snippet==True, do not construct phi_sa, only return a tuple (phi_s, index1, index2) as described above.
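A sketch of the copy-paste construction for the snippet == False case (illustrative dense-array version, not the library's sparse implementation):

    import numpy as np

    def phi_sa_sketch(phi_s, a, num_actions):
        """Place phi_s in the a-th of num_actions blocks; all other blocks remain zero."""
        n = len(phi_s)
        phi_sa = np.zeros(n * num_actions)
        phi_sa[a * n:(a + 1) * n] = phi_s
        return phi_sa

    # phi_sa_sketch([0.5, 1.0], a=1, num_actions=3) -> array([0. , 0. , 0.5, 1. , 0. , 0. ])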

post_discover(s, terminal, a, td_error, phi_s)[source]

Identifies and adds (“discovers”) new features for this adaptive representation AFTER having obtained the TD-Error. For example, see iFDD. In that class, a new feature is added based on regions of high TD-Error.

Note

For adaptive representations that do not require access to TD-Error to determine which features to add next, you may use pre_discover() instead.

Parameters:
  • s – The state
  • terminal – boolean, whether or not s is a terminal state.
  • a – The action
  • td_error – The temporal difference error at this transition.
  • phi_s – The feature vector evaluated at state s.
Returns:

The number of new features added to the representation

pre_discover(s, terminal, a, sn, terminaln)[source]

Identifies and adds (“discovers”) new features for this adaptive representation BEFORE having obtained the TD-Error. For example, see IncrementalTabular. In that class, a new feature is added anytime a novel state is observed.

Note

For adaptive representations that require access to TD-Error to determine which features to add next, use post_discover() instead.

Parameters:
  • s – The state
  • terminal – boolean, whether or not s is a terminal state.
  • a – The action
  • sn – The next state
  • terminaln – boolean, whether or not sn is a terminal state.
Returns:

The number of new features added to the representation

setBinsPerDimension(domain, discretization)[source]

Set the number of bins for each dimension of the domain. Continuous dimensions are sliced using the discretization parameter.

Parameters:
  • domain – the problem Domain to learn
  • discretization – The number of bins a continuous dimension should be sliced into.

stateID2state(s_id)[source]

Returns the state vector corresponding to a state_id. If dimensions are continuous, it returns the state representing the middle of the bin (each dimension is discretized according to representation.discretization).

Parameters:s_id – The id of the state, often calculated using the state2bin function
Returns:The state s corresponding to the integer s_id.

stateInTheMiddleOfGrid(s)[source]

Accepts a continuous state s, bins it into the discretized domain, and returns the state of the nearest gridpoint. Essentially, we snap s to the nearest gridpoint and return that gridpoint state. For continuous MDPs this plays a major role in improving speed by enabling caching of next samples.

Parameters:s – The given state
Returns:The nearest state s which is captured by the discretization.

weight_vec = None

A numpy array of the Linear Weights, one for each feature (theta)