Deep Experience Planning: Leveraging Local Planning with Learned Value Functions

Alexander Neitz, Kyrill Schmid, Lisbeth Claessens, Lenz Belzner

Introduction

We investigate the combination of statistical online planning based on local search and value function approximations learned from past observations. Many current state-of-the-art statistical online planners perform some kind of finite-horizon search. In many application domains, the search horizon is far smaller than the episode length or system lifetime. This raises the question of how to evaluate final search states, i.e. states reached by a particular search run when the finite horizon is met. Deep Experience Planning (DEEP) aims at leveraging statistical online planning with value function approximations learned from previously observed system transitions.

A non-exhaustive list of potentially related work.

Deep Experience Planning

The key idea of DEEP is to learn a value function approximation from observed transitions, and to use the learned approximation as a heuristic for evaluating states reached at the search horizon of a statistical online planner. In some sense, the search strategy used for sampling the domain model can be seen as an actor, while the learned value function acts as a critic. Our hypothesis is that using a value function approximation in this way improves the overall performance of a statistical online planner.

Local Planning

A DEEP agent maintains a simulation of the environment in order to sample potential consequences of its action choices. Based on these simulations, the DEEP agent is able to evaluate the quality of its behavioral options, and can act w.r.t. some given optimization objective (e.g. maximization of expected reward). Passing a current state and an action to the simulation allows the agent to sample a potential successor state and an observed reward. Given a state space $S$, an action space $A$, and a reward domain $R \subseteq \mathbb{R}$, a simulation has the following form.

$sim : S \times A \to S \times R$
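
As an illustration, the following Python sketch shows one possible shape of such a simulation interface. The Simulation base class and the NoisyChainSim toy domain are hypothetical helpers introduced here only to make the signature concrete; they are not part of the approach itself.

import random

class Simulation:
    """Generative model: maps a (state, action) pair to a sampled (successor, reward)."""

    def sample(self, state, action):
        raise NotImplementedError

class NoisyChainSim(Simulation):
    """Hypothetical toy domain: walk left/right on a chain, reward for reaching the last cell."""

    def __init__(self, length=10, noise=0.1):
        self.length = length
        self.noise = noise

    def sample(self, state, action):
        # With probability `noise`, the intended move (+1 or -1) is flipped.
        move = action if random.random() > self.noise else -action
        successor = min(max(state + move, 0), self.length - 1)
        reward = 1.0 if successor == self.length - 1 else 0.0
        return successor, reward

A single call such as sim.sample(3, +1) then corresponds to drawing one sample $(s', r)$ from the simulation.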

Current statistical simulation-based planners perform simulation up to some horizon $h$. For such a simulation, the planning agent observes a sequence of states, actions and rewards like the following:

$s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_{h-1}, a_{h-1}, r_{h-1}, s_h$

Based on these observations, a possible optimization criterion is the cumulative reward $\sum_{i=0}^{h-1} r_i$, i.e. the sum of rewards gathered from executing the corresponding plan $p = (a_0, a_1, \dots, a_{h-1})$, the sequence of actions. A planning agent estimates the quality of a plan by its cumulative reward.
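
The following sketch illustrates finite-horizon plan evaluation under these definitions. The helper names (rollout, cumulative_reward, evaluate_plan) and the averaging over a fixed number of rollouts are illustrative choices, assuming the sample interface sketched above.

def rollout(sim, state, plan):
    """Simulate a fixed action sequence (plan) from a state; return observed rewards and the final state."""
    rewards = []
    for action in plan:
        state, reward = sim.sample(state, action)
        rewards.append(reward)
    return rewards, state

def cumulative_reward(rewards):
    """Plan quality estimate: the plain sum of rewards observed along a rollout."""
    return sum(rewards)

def evaluate_plan(sim, state, plan, n_samples=16):
    """Average the cumulative reward over several stochastic rollouts of the same plan."""
    total = 0.0
    for _ in range(n_samples):
        rewards, _ = rollout(sim, state, plan)
        total += cumulative_reward(rewards)
    return total / n_samples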

Local Planning with a Value Function

While the basic local planning approach as described above can be very effective, DEEP enhances the evaluation of action sequences by employing a value function to estimate the expected value of the final simulation state $s_h$. For a given MDP with state-action transition distribution $P(s' \mid s, a)$ and reward function $R(s, a)$, the value function is recursively defined as follows.

$V(s) = \max_{a \in A} \left( R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \, V(s') \right)$

Here, $\gamma \in [0, 1]$ is a discount factor.

That is, the value function of a state is defined by the best action that is executable in this state, where 'best' is determined w.r.t. potential future reward.

In the case of a DEEP planner, the transition function and the reward function of the MDP may only be available implicitly via the simulation $sim$. In this case, the value function can be defined via the expectation of future reward.

$V(s) = \max_{a \in A} \mathbb{E}_{(s', r) \sim sim(s, a)} \left[ r + \gamma V(s') \right]$

Given a value function, a DEEP agent alters the estimation of action sequence quality by adding the value of the final state $V(s_h)$ to the cumulative reward $\sum_{i=0}^{h-1} r_i$. In some sense, the DEEP agent enhances its local planning with global information obtained via the value function.
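
Continuing the sketch from above, the planner's quality estimate could be augmented as follows; value_fn stands for any function mapping a state to a scalar value estimate, and the candidate plan selection is only one possible way to use the estimate.

def evaluate_plan_deep(sim, state, plan, value_fn, n_samples=16):
    """DEEP quality estimate: cumulative rollout reward plus the value of the final state."""
    total = 0.0
    for _ in range(n_samples):
        rewards, final_state = rollout(sim, state, plan)
        total += sum(rewards) + value_fn(final_state)
    return total / n_samples

def best_plan(sim, state, candidate_plans, value_fn):
    """Pick the candidate plan with the highest DEEP quality estimate."""
    return max(candidate_plans,
               key=lambda plan: evaluate_plan_deep(sim, state, plan, value_fn))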

Learning a Value Function Approximation from Experience

In order to leverage its planning capabilities, a DEEP agent uses a value function for improving its action quality estimates. We now discuss how an approximation of the value function can be learned from observed transitions by using a temporal difference update rule for measuring the approximation error. When modeling the value function approximation with a neural network, we can use stochastic gradient descent to reduce the temporal difference error. Let $\hat{V}$ be the current value function approximation of the agent. For a given observed transition $(s, a, r, s')$, we can now define the following tuple as a regression target:

$(s, \; r + \gamma \hat{V}(s'))$

That is, given some input state $s$, we want the value function approximation network to output $r + \gamma \hat{V}(s')$ as a rough approximation of the real value $V(s)$. Given enough observed transitions and training iterations, the network modeling $\hat{V}$ starts approximating $V$.
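
A possible PyTorch sketch of this training procedure is given below. The network architecture, the state feature encoding, and the hyperparameters (hidden size, discount factor, batch layout) are illustrative assumptions rather than choices prescribed by the approach.

import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Small MLP mapping a state feature vector to a scalar value estimate."""

    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states):
        return self.net(states).squeeze(-1)

def td_train_step(value_net, optimizer, batch, gamma=0.99):
    """One SGD step on the temporal difference error for a batch of transitions (s, r, s')."""
    states, rewards, next_states = batch
    with torch.no_grad():
        # Regression target r + gamma * V(s'), treated as a fixed label for this step.
        targets = rewards + gamma * value_net(next_states)
    loss = nn.functional.mse_loss(value_net(states), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()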

Using a Target Network to Stabilize Training

As $\hat{V}$ is a changing target, in particular in the beginning of the learning process, we use a more stable target network $\hat{V}^{-}$ to mitigate stability issues when training $\hat{V}$. Then, we train $\hat{V}$ on tuples of the following form.

$(s, \; r + \gamma \hat{V}^{-}(s'))$

The target network $\hat{V}^{-}$ is replaced with the current $\hat{V}$ after a fixed number of training iterations. Using the less volatile target network for estimating state values stabilizes the training process.
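
One way to realize this mechanism on top of the ValueNet sketch above is shown below; the synchronization interval is an arbitrary illustrative value.

import copy

import torch
import torch.nn as nn

def make_target(value_net):
    """Create the initial target network as a copy of the current value network."""
    return copy.deepcopy(value_net)

def td_train_step_with_target(value_net, target_net, optimizer, batch, gamma=0.99):
    """TD step where the bootstrap value of s' comes from the frozen target network."""
    states, rewards, next_states = batch
    with torch.no_grad():
        targets = rewards + gamma * target_net(next_states)
    loss = nn.functional.mse_loss(value_net(states), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def maybe_sync_target(value_net, target_net, step, sync_every=1000):
    """Replace the target network's weights with the current network's every sync_every training steps."""
    if step % sync_every == 0:
        target_net.load_state_dict(value_net.state_dict())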

Experimental Results

TODO

Conclusion and Outlook

TODO: