The Reinforcement Learning Library for Education and Research


MDP Solvers

class rlpy.MDPSolvers.MDPSolver.MDPSolver(job_id, representation, domain, planning_time=inf, convergence_threshold=0.005, ns_samples=100, project_path='.', log_interval=5000, show=False)[source]

MDPSolver is the base class for model-based reinforcement learning agents and planners.


  • job_id (int): Job ID number used for running multiple jobs on a cluster.
  • representation (Representation): Representation used for the value function.
  • domain (Domain): Domain (MDP) to solve.
  • planning_time (int): Maximum amount of time in seconds allowed for planning. Defaults to inf (unlimited).
  • convergence_threshold (float): Threshold for determining whether the value function has converged.
  • ns_samples (int): Number of samples of the successor states to take.
  • project_path (str): Output path for saving the results of running the MDPSolver on a domain.
  • log_interval (int): Minimum number of seconds between displaying logged information.
  • show (bool): Enable visualization?

BellmanBackup(s, a, ns_samples, policy=None)[source]

Apply a Bellman backup to the state-action pair (s, a), i.e. Q(s,a) = E[r + discount_factor * V(s')]. If a policy is given, then Q(s,a) = E[r + discount_factor * Q(s', pi(s'))].

  • s (ndarray): The current state.
  • a (int): The action taken in state s.
  • ns_samples (int): Number of next state samples to use.
  • policy (Policy): Policy object to use for sampling actions.
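As a rough sketch of this sample-based backup, the toy two-state MDP below estimates Q(s, a) by averaging ns_samples draws of r + discount_factor * V(s'). The model, the sample_next_state and reward helpers, and the bellman_backup name are illustrative assumptions, not rlpy's implementation.

```python
import numpy as np

# Toy 2-state, 2-action MDP used only for illustration; the transition
# model and helper names are assumptions, not rlpy's API.
rng = np.random.default_rng(0)
discount_factor = 0.9
V = np.array([0.0, 1.0])  # current value estimate per state

def sample_next_state(s, a):
    # Hypothetical stochastic model: action a reaches state a w.p. 0.8.
    return a if rng.random() < 0.8 else 1 - a

def reward(s, a, ns):
    # Reward 1 for landing in state 1, else 0.
    return 1.0 if ns == 1 else 0.0

def bellman_backup(s, a, ns_samples):
    # Estimate Q(s, a) = E[r + discount_factor * V(s')] by sampling
    # ns_samples successor states and averaging.
    total = 0.0
    for _ in range(ns_samples):
        ns = sample_next_state(s, a)
        total += reward(s, a, ns) + discount_factor * V[ns]
    return total / ns_samples

q = bellman_backup(0, 1, ns_samples=100)  # roughly 0.8 * (1 + 0.9)
```

With a policy argument, the same loop would instead accumulate r + discount_factor * Q(s', pi(s')) using the policy's sampled action at each successor state.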

Check whether the representation is Tabular, as Policy Iteration and Value Iteration only work with a Tabular representation.
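The guard can be sketched as a simple type check. The Representation and Tabular classes below are stand-ins mirroring the names in this documentation, not rlpy's actual class hierarchy.

```python
# Stand-in classes; rlpy's real Representation/Tabular hierarchy differs.
class Representation: ...
class Tabular(Representation): ...

def is_tabular(representation):
    # Policy Iteration and Value Iteration sweep every discrete state,
    # so the value function must be stored exactly per state, i.e. the
    # representation must be tabular.
    return isinstance(representation, Tabular)
```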


Return matrices S, A, NS, R, T, where each row of each numpy 2d-array is a sample obtained by following the current policy.

  • S: (#samples) x (# state space dimensions)
  • A: (#samples) x (1) int (actionIDs stored as integers)
  • NS: (#samples) x (# state space dimensions)
  • R: (#samples) x (1) float
  • T: (#samples) x (1) bool

See Q_MC() and MC_episode().
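A minimal sketch of filling those five matrices by rolling out a policy in a toy one-dimensional chain domain. The domain, the uniform-random "current policy", and the terminal condition are all made up for illustration; only the array shapes and dtypes follow the description above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, state_dim = 5, 1

# Preallocate the sample matrices with the documented shapes.
S = np.empty((n_samples, state_dim))          # states
A = np.empty((n_samples, 1), dtype=int)       # actionIDs as integers
NS = np.empty((n_samples, state_dim))         # next states
R = np.empty((n_samples, 1))                  # rewards
T = np.empty((n_samples, 1), dtype=bool)      # terminal flags

s = np.zeros(state_dim)
for i in range(n_samples):
    a = rng.integers(2)                # "current policy": uniform random
    ns = s + (1 if a == 1 else -1)     # deterministic chain transition
    terminal = bool(ns[0] >= 3)        # hypothetical goal at position 3
    S[i], A[i], NS[i], R[i], T[i] = s, a, ns, float(terminal), terminal
    s = np.zeros(state_dim) if terminal else ns  # reset on termination
```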


Return a boolean indicating whether there is time left for planning.
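The time-budget bookkeeping behind such a check can be sketched as follows; the Planner class and has_time name are illustrative, not rlpy's implementation, but the planning_time parameter matches the constructor argument documented above.

```python
import time

class Planner:
    # Minimal time-budget bookkeeping; a sketch, not rlpy's code.
    def __init__(self, planning_time=float("inf")):
        self.planning_time = planning_time
        self.start_time = time.time()  # planning clock starts here

    def has_time(self):
        # True while elapsed planning time is under the budget;
        # planning_time=inf means the budget never runs out.
        return time.time() - self.start_time < self.planning_time

p = Planner(planning_time=60)
```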


Set exploration to zero and sample one episode from the domain.


Solve the domain MDP.
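For a tabular representation, a solve() of the value-iteration flavor would sweep all states with Bellman backups until the largest change in the value function falls below convergence_threshold. The sketch below shows that loop on a made-up deterministic 3-state chain; the transition table P and reward table R are assumptions for illustration only.

```python
import numpy as np

n_states, n_actions = 3, 2
discount_factor = 0.9
convergence_threshold = 0.005  # matches the documented default

# Toy deterministic model: P[s, a] -> next state, R[s, a] -> reward.
P = np.array([[0, 1], [0, 2], [2, 2]])
R = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

V = np.zeros(n_states)
while True:
    Q = R + discount_factor * V[P]   # Bellman backup for every (s, a)
    V_new = Q.max(axis=1)            # greedy value per state
    done = np.max(np.abs(V_new - V)) < convergence_threshold
    V = V_new
    if done:
        break
```

A sampled-model solver would replace the exact expectation V[P] with ns_samples Monte Carlo draws per state-action pair, as in BellmanBackup above.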