Deep Experience Planning

Alexander Neitz, Kyrill Schmid, Lisbeth Claessens, Lenz Belzner

Notebook

Introduction

We investigate the combination of statistical online planning based on local search with value function approximations learned from past observations. Many current state-of-the-art statistical online planners perform some kind of finite-horizon search. In many application domains, the search horizon is far smaller than the episode length or system lifetime. This raises the question of how to evaluate final search states, i.e. states reached by a particular search run when the finite horizon is met. Deep Experience Planning (DEEP) aims to enhance statistical online planning with value function approximations learned from previously observed system transitions.

A non-exhaustive list of potentially related work:

  • Syam, Rafiuddin, Keigo Watanabe, and Kiyotaka Izumi. "An adaptive actor-critic algorithm with multi-step simulated experiences for controlling nonholonomic mobile robots." Soft Computing 11.1 (2007): 81-89.

Deep Experience Planning

The key idea of DEEP is to learn a value function approximation from observed transitions, and to use the learned approximation as a heuristic for evaluating states reached at the search horizon of a statistical online planner. In some sense, the search strategy used for sampling the domain model can be seen as an actor, while the learned value function acts as a critic. Our hypothesis is that using a value function approximation in this way improves the overall performance of a statistical online planner.
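As a rough illustration of this actor-critic view, the following minimal Python sketch scores a single simulated rollout by combining its observed rewards with a learned value estimate of the final state. The names `rewards`, `final_state`, and `value_fn` are hypothetical and only anticipate the notation introduced below.

```python
def deep_score(rewards, final_state, value_fn):
    """Score one simulated rollout: the planner's search strategy supplies the
    rollout (the 'actor'), while the learned value function scores the state
    reached at the search horizon (the 'critic')."""
    return sum(rewards) + value_fn(final_state)
```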

A DEEP agent maintains a simulation of the environment in order to sample potential consequences of its action choices. Based on these simulations, the DEEP agent is able to evaluate the quality of its behavioral options, and can act w.r.t. some given optimization objective (e.g. maximization of expected reward). Passing a current state and an action to the simulation allows the agent to sample a potential successor state and a corresponding reward. Given a state space S, an action space A, and real-valued rewards, a simulation $\Delta$ has the following form.

$\Delta : P(S \times \mathbb{R} ~|~ S \times A)$
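To make this interface concrete, here is a minimal Python sketch of such a simulation as a generative model: given a state and an action, it samples one successor state together with a reward. The toy 1-D dynamics below are purely hypothetical and only illustrate the signature of $\Delta$.

```python
import random
from typing import Tuple

def toy_simulation(state: float, action: int) -> Tuple[float, float]:
    """One sample from Delta: a possible successor state and its reward,
    given the current state and the chosen action (toy 1-D dynamics)."""
    next_state = state + action + random.gauss(0.0, 0.1)
    reward = -abs(next_state)  # e.g. reward for staying close to the origin
    return next_state, reward
```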

Current statistical simulation-based planners perform simulation up to some horizon h. For such a simulation, the planning agent observes a sequence of states, actions and rewards like the following:

$s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{h-1}, a_{h-1}, r_{h-1}, s_h$

Based on these observations, a possible optimization criterion is the cumulative reward CR, i.e. the sum of rewards gathered from executing the corresponding sequence of actions. A planning agent estimates the quality of a sequence of actions by its CR.

$Q(a_0, ..., a_{h-1}) = \sum_{0 \leq i < h} r_i$
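A corresponding rollout sketch, reusing the hypothetical simulation interface from above: execute the action sequence in the simulation and sum the h observed rewards. Since $\Delta$ is stochastic, an agent would typically average this estimate over several rollouts of the same action sequence.

```python
from typing import Callable, Sequence

def cumulative_reward(simulate: Callable, s0, actions: Sequence) -> float:
    """One sampled estimate of Q(a_0, ..., a_{h-1}): run the action sequence
    through the simulation and sum the observed rewards."""
    state, total = s0, 0.0
    for action in actions:            # h = len(actions) simulation steps
        state, reward = simulate(state, action)
        total += reward
    return total
```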

Local Planning with Value Functions

While the basic local planning approach as described above can be very effective, DEEP enhances the evaluation of actions by employing a value function to estimate the expected value of the final simulation state $s_h$. For a given MDP (S, A, T, R), the value function is recursively defined as follows.