The Representation is the Agent's model of the value function associated with a Domain.
As the Agent interacts with the Domain, it receives updates in the form of state, action, reward, next state, next action.
The Agent passes these quantities to its Representation, which is responsible for maintaining the value function, usually in some lower-dimensional feature space. Agents can later query the Representation for the value of being in a state, V(s), or the value of taking an action in a particular state (known as the Q-function, Q(s,a)).
Note
Throughout the framework, phi refers to the vector of features; phi or phi_s is thus the vector of feature functions evaluated at the state s. phi_s_a appends |A|-1 copies of phi_s, such that |phi_s_a| = |A| * |phi|, where |A| is the size of the action space and |phi| is the number of features. Each of these blocks corresponds to a state-action pair; all blocks except for the selected action a are set to 0.
The Representation class is a base class that provides the basic framework for all representations. It provides the methods and attributes that allow child classes to interact with the Agent and Domain classes within the RLPy library.
All new representation implementations should inherit from this class.
Note
At present, it is assumed that the Linear Function approximator family of representations is being used.
Parameters: 


Returns the learned value of a state-action pair, Q(s,a).
Parameters: 


Returns:  (float) the value of the state-action pair (s,a), Q(s,a). 
Returns the state-action value, Q(s,a), by performing a one-step lookahead on the domain.
Note
For an example of how this function works, see Line 8 of Figure 4.3 in Sutton and Barto 1998.
If the domain does not define expectedStep(), this function uses ns_samples samples to estimate the one-step lookahead. If a policy is passed (as in policy evaluation), it is used to generate the action for the next state. Otherwise, the best action is selected.
Note
This function should not be called in any RL algorithms unless the underlying domain is an approximation of the true model.
Parameters: 


Returns:  The one-step lookahead state-action value, Q(s,a). 
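The two cases described above (an exact expectation when the model is available, a sample average otherwise) can be sketched as follows. This is an illustrative sketch only, not the library's code: `q_one_step_lookahead`, `expected_step`, `sample_step`, `value_fn`, and `gamma` are hypothetical names, and the tuple layouts they use are assumptions rather than the Domain interface. `value_fn` stands in for the value of the next state, whether that comes from the best action or from a supplied policy.

```python
def q_one_step_lookahead(s, a, value_fn, gamma,
                         expected_step=None, sample_step=None, ns_samples=10):
    """Estimate Q(s,a) = E[r + gamma * V(s')] with a one-step lookahead."""
    if expected_step is not None:
        # Exact expectation over (probability, reward, next_state, terminal) tuples.
        total = 0.0
        for p, r, ns, terminal in expected_step(s, a):
            total += p * (r + (0.0 if terminal else gamma * value_fn(ns)))
        return total
    # Fallback: average ns_samples sampled transitions (reward, next_state, terminal).
    total = 0.0
    for _ in range(ns_samples):
        r, ns, terminal = sample_step(s, a)
        total += r + (0.0 if terminal else gamma * value_fn(ns))
    return total / ns_samples


# Toy model (hypothetical): from any (s, a), reach state 1 or terminal state 2.
value_fn = lambda ns: {1: 10.0, 2: 0.0}[ns]
model = lambda s, a: [(0.5, 1.0, 1, False), (0.5, 0.0, 2, True)]
sampler = lambda s, a: (1.0, 1, False)  # deterministic sampler for the demo

q_exact = q_one_step_lookahead(0, 0, value_fn, 0.9, expected_step=model)
q_sampled = q_one_step_lookahead(0, 0, value_fn, 0.9, sample_step=sampler, ns_samples=5)
print(q_exact, q_sampled)
```

Terminal next states contribute only their reward, which mirrors the zero feature vector that phi() returns for terminal states.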
Returns an array of actions available at a state and their associated values.
Parameters: 


Returns:  The tuple (Q,A) where:  Q: an array of Q(s,a), the values of each action at s; A: the corresponding array of actions.

Note
This function is distinct from Q(), which computes the Q function for an (s,a) pair.
Instead, this function Qs() computes all Q function values (for all possible actions) at a given state s.
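To make the distinction concrete, here is a minimal sketch of computing every action value at once under the linear-approximator assumption. It is not the library's implementation: `qs_all_actions` is a hypothetical name, and the layout of the weights (one contiguous block of features per action) is an assumption.

```python
import numpy as np


def qs_all_actions(weights, phi_s, n_actions):
    """Return Q(s,a) for every action a as an array of length n_actions."""
    # Assumed layout: the flat weight vector holds one row of weights per action.
    per_action = weights.reshape(n_actions, -1)
    return per_action.dot(phi_s)


weights = np.array([0.5, 1.0, -1.0, 2.0, 0.0, 3.0])  # 2 actions x 3 features
phi_s = np.array([1.0, 0.0, 1.0])
q = qs_all_actions(weights, phi_s, n_actions=2)
print(q)           # one value per action
print(q.argmax())  # index of the greedy action
```

A single-pair query then amounts to reading one entry of this array.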
Returns an array of actions and their associated values Q(s,a), by performing a one-step lookahead on the domain for each of them.
Note
For an example of how this function works, see Line 8 of Figure 4.3 in Sutton and Barto 1998.
If the domain does not define expectedStep(), this function uses ns_samples samples to estimate the one-step lookahead. If a policy is passed (as in policy evaluation), it is used to generate the action for the next state. Otherwise, the best action is selected.
Note
This function should not be called in any RL algorithms unless the underlying domain is an approximation of the true model.
Parameters: 


Returns:  an array of length A containing the Q(s,a) for each possible a, where A is the number of possible actions from state s 
Returns the value of state s under possible actions p_actions.
Parameters: 


See Qs().
Returns the value of being in state s, V(s), by performing a one-step lookahead on the domain.
Note
For an example of how this function works, see Line 6 of Figure 4.5 in Sutton and Barto 1998.
If the domain does not define expectedStep(), this function uses ns_samples samples to estimate the one-step lookahead.
Note
This function should not be called in any RL algorithms unless the underlying domain is an approximation of the true model.
Parameters: 


Returns:  The value of being in state s, V(s). 
Number of actions in the representation
Returns the index of active initial features based on bins in each dimension.
Parameters:  s – The state 
Returns:  The active initial features of this representation (before expansion) 

Add a new zero weight, corresponding to a newly added feature, to all actions.
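One way to picture this operation, assuming the weights are stored as a flat vector with one block of feature weights per action (an assumption; `add_zero_weight` is a hypothetical helper, not the method itself):

```python
import numpy as np


def add_zero_weight(weights, n_actions):
    """Append one zero weight to every action's block of a flat weight vector."""
    per_action = weights.reshape(n_actions, -1)  # (n_actions, n_features)
    zeros = np.zeros((n_actions, 1), dtype=weights.dtype)
    # The new feature starts at weight 0 for every action,
    # so adding it leaves all current Q values unchanged.
    return np.hstack((per_action, zeros)).reshape(-1)


weights = np.array([1.0, 2.0, 3.0, 4.0])  # 2 actions x 2 features
new_weights = add_zero_weight(weights, n_actions=2)
print(new_weights)  # blocks become [1, 2, 0] and [3, 4, 0]
```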
Number of aggregated states based on the discretization. If the representation is adaptive, set to the best resolution possible.
Accepts a batch of states, returns the best action associated with each.
Note
See bestAction()
Parameters: 


Returns:  An array of the best action associated with each state. 
Builds the feature vector for a series of state-action pairs (s,a) using the copy-paste method.
Note
See phi_sa() for more information.
Parameters: 


Returns:  all_phi_s_a (of dimension p x (s_a) ) 
Returns the best action at a given state. If there are multiple best actions, this method selects one of them uniformly at random. If phi_s (the feature vector at state s) is given, it is used to avoid recomputing the features within this function.
See bestActions()
Parameters: 


Returns:  The best action at the given state. 
Returns a list of the best actions at a given state. If phi_s (the feature vector at state s) is given, it is used to avoid recomputing the features within this function.
See bestAction()
Parameters: 


Returns:  A list of the best actions at the given state. 
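The relationship between the two methods (collect every action that ties for the maximum, then pick one uniformly at random) can be sketched as below. The function names, the explicit `q_values` and `actions` arguments, and the tie tolerance `tol` are assumptions made for illustration.

```python
import random


def best_actions(q_values, actions, tol=1e-10):
    """Return every action whose value ties for the maximum (within tol)."""
    q_max = max(q_values)
    return [a for q, a in zip(q_values, actions) if abs(q - q_max) <= tol]


def best_action(q_values, actions, rng=random):
    """Pick one of the tied best actions uniformly at random."""
    return rng.choice(best_actions(q_values, actions))


q_values = [1.0, 3.0, 3.0, -2.0]
actions = [0, 1, 2, 3]
print(best_actions(q_values, actions))  # actions 1 and 2 tie for the maximum
print(best_action(q_values, actions))   # either 1 or 2
```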
Returns a vector where each element is the zero-indexed bin number corresponding to the given state. (See hashState().) Note that this vector will have the same dimensionality as s.
(Note: This method is binary compact; the negative case of binary features is excluded from feature activation. For example, if the domain has a light and the light is off, no feature will be added. This is because the very absence of the feature itself corresponds to the light being off.)
Width of bins in each dimension
Number of possible states per dimension [1-by-dim]
The Domain that this Representation is modeling
A dictionary used to cache expected results of step(). Used for planning algorithms
Abstract Method
Return the data type for the underlying features (e.g., 'float').
Number of features in the representation
Returns a unique id for a given state. Essentially, enumerate all possible states and return the ID associated with s.
Under the hood: first, discretize continuous dimensions into bins as necessary. Then map the binstate to an integer.
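The two steps can be sketched as follows. This is a hedged illustration, not the library's code: the helper names, the uniform bin widths, and the particular (row-major) ordering of the integer IDs are assumptions.

```python
def state_to_bins(s, lower, upper, bins_per_dim):
    """Map each dimension of s to a zero-indexed bin number."""
    bins = []
    for x, lo, hi, n in zip(s, lower, upper, bins_per_dim):
        width = (hi - lo) / n
        b = int((x - lo) / width)
        bins.append(min(max(b, 0), n - 1))  # clamp so x == hi lands in the last bin
    return bins


def hash_bins(bins, bins_per_dim):
    """Map a bin vector to a unique integer using a mixed-radix encoding."""
    state_id = 0
    for b, n in zip(bins, bins_per_dim):
        state_id = state_id * n + b
    return state_id


lower, upper, bins_per_dim = [0.0, -1.0], [1.0, 1.0], [4, 5]
bins = state_to_bins([0.6, 0.1], lower, upper, bins_per_dim)
state_id = hash_bins(bins, bins_per_dim)
print(bins, state_id)
```

With 4 and 5 bins per dimension, every bin vector maps to a distinct ID between 0 and 19.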
Any stochastic behavior in __init__() is broken out into this function so that if the random seed is later changed (e.g., by the Experiment), other member variables and functions are updated accordingly.
True if the number of features may change during execution.
Returns phi_nonTerminal() for a given representation, or a zero feature vector in a terminal state.
Parameters:  s – The state for which to compute the feature vector 

Returns:  numpy array, the feature vector evaluated at state s. 
Note
If state s is terminal the feature vector is returned as zeros! This prevents the learning algorithm from wrongfully associating the end of one episode with the start of the next (e.g., thinking that reaching the terminal state causes it to teleport back to the start state s0).
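A minimal sketch of this dispatch, under stated assumptions: the standalone function below takes `is_terminal`, `phi_non_terminal`, and `features_num` as explicit arguments purely for illustration, whereas the method itself belongs to the Representation.

```python
import numpy as np


def phi(s, is_terminal, phi_non_terminal, features_num):
    """Feature vector at s: all zeros if s is terminal, else phi_non_terminal(s)."""
    if is_terminal(s):
        return np.zeros(features_num)
    return phi_non_terminal(s)


# Toy tabular features over 3 states; state 2 is terminal (hypothetical setup).
one_hot = lambda s: np.eye(3)[s]
is_terminal = lambda s: s == 2

print(phi(0, is_terminal, one_hot, 3))  # one-hot vector for state 0
print(phi(2, is_terminal, one_hot, 3))  # all zeros
```

Because the terminal feature vector is all zeros, any linear value estimate at a terminal state is exactly 0.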
Abstract Method
Returns the feature vector evaluated at state s for nonterminal states; see function phi() for the general case.
Parameters:  s – The given state 

Returns:  The feature vector evaluated at state s. 
Returns the feature vector corresponding to a state-action pair. We use the copy-paste technique (Lagoudakis & Parr 2003). Essentially, we append the phi(s) vector to itself |A| times, where |A| is the size of the action space. We zero the feature values of all of these blocks except the one corresponding to the actionID a.
When snippet == False we construct and return the full, sparse phi_sa. When snippet == True, we return the tuple (phi_s, index1, index2) where index1 and index2 are the indices defining the ends of the phi_s block which WOULD be nonzero if we were to construct the full phi_sa.
Parameters: 


Returns:  If snippet==False, return the enormous phi_sa vector constructed by the copy-paste method. If snippet==True, do not construct phi_sa, only return a tuple (phi_s, index1, index2) as described above. 
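The copy-paste construction and the snippet shortcut can be illustrated with a small NumPy sketch. It is not the library's implementation: the standalone signature is hypothetical, it builds a dense rather than sparse vector, and treating the second index as exclusive (a half-open range) is an assumption.

```python
import numpy as np


def phi_sa(phi_s, a, n_actions, snippet=False):
    """Copy-paste features: phi_s placed in the block for action a, zeros elsewhere."""
    n = len(phi_s)
    start, end = a * n, (a + 1) * n
    if snippet:
        # Skip building the large vector; return the block and its bounds.
        return phi_s, start, end
    out = np.zeros(n_actions * n, dtype=phi_s.dtype)
    out[start:end] = phi_s
    return out


phi_s = np.array([1.0, 0.0, 2.0])
full = phi_sa(phi_s, a=1, n_actions=3)
block, index1, index2 = phi_sa(phi_s, a=1, n_actions=3, snippet=True)
print(full)            # nonzero only in the middle block
print(index1, index2)  # bounds of that block
```

The snippet form is useful when only one block's weights are needed, since the dot product with the full vector touches only that block.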
Identifies and adds (“discovers”) new features for this adaptive representation AFTER having obtained the TDError. For example, see iFDD. In that class, a new feature is added based on regions of high TDError.
Note
For adaptive representations that do not require access to TDError to determine which features to add next, you may use pre_discover() instead.
Parameters: 


Returns:  The number of new features added to the representation 
Identifies and adds (“discovers”) new features for this adaptive representation BEFORE having obtained the TDError. For example, see IncrementalTabular. In that class, a new feature is added anytime a novel state is observed.
Note
For adaptive representations that require access to TDError to determine which features to add next, use post_discover() instead.
Parameters: 


Returns:  The number of new features added to the representation 
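As a rough illustration of discovery without TDError, the toy class below adds one feature the first time each state is seen, in the spirit of the IncrementalTabular description above. The class name, its attributes, and the single-argument `pre_discover(s)` signature are hypothetical simplifications.

```python
class NovelStateFeatures:
    """Toy adaptive representation: one feature per distinct state seen so far."""

    def __init__(self):
        self.feature_index = {}  # state key -> feature index
        self.features_num = 0

    def pre_discover(self, s):
        """Add a feature if s is novel; return the number of features added."""
        key = tuple(s)
        if key in self.feature_index:
            return 0
        self.feature_index[key] = self.features_num
        self.features_num += 1
        return 1


rep = NovelStateFeatures()
print(rep.pre_discover([0, 1]))  # novel state: one feature added
print(rep.pre_discover([0, 1]))  # already seen: nothing added
```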
Set the number of bins for each dimension of the domain. Continuous spaces will be sliced using the discretization parameter.
Parameters:  domain – The problem Domain to learn
discretization – The number of bins a continuous domain should be sliced into.
Returns the state vector corresponding to a state_id. If dimensions are continuous, it returns the state representing the middle of the bin (each dimension is discretized according to representation.discretization).
Parameters:  s_id – The id of the state, often calculated using the state2bin function 

Returns:  The state s corresponding to the integer s_id. 
Accepts a continuous state s, bins it into the discretized domain, and returns the state of the nearest gridpoint. Essentially, we snap s to the nearest gridpoint and return that gridpoint state. For continuous MDPs this plays a major role in improving speed through caching of next samples.
Parameters:  s – The given state 

Returns:  The nearest state s which is captured by the discretization. 
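The snapping step might look like the following sketch, which assumes uniform bins and takes the gridpoints to be bin centers; `snap_to_bin_center` and its arguments are hypothetical names.

```python
def snap_to_bin_center(s, lower, upper, bins_per_dim):
    """Return the center of the bin that contains s, dimension by dimension."""
    snapped = []
    for x, lo, hi, n in zip(s, lower, upper, bins_per_dim):
        width = (hi - lo) / n
        b = min(max(int((x - lo) / width), 0), n - 1)  # clamp to a valid bin
        snapped.append(lo + (b + 0.5) * width)
    return snapped


lower, upper, bins_per_dim = [0.0, -1.0], [1.0, 1.0], [4, 5]
p = snap_to_bin_center([0.6, 0.1], lower, upper, bins_per_dim)
q = snap_to_bin_center([0.7, 0.15], lower, upper, bins_per_dim)
print(p, q)
```

Nearby states that fall in the same bin snap to the same point, which is what allows sampled next states to be cached and reused.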
A numpy array of the Linear Weights, one for each feature (theta)