Programmatic Reinforcement Learning (April 2023 - August 2023)

Worked with Nathanaël Fijalkow in the automata theory group of the University of Warsaw on synthesizing programmatic policies for Markov decision processes (MDPs) by exploiting programmatic representations of the MDPs.

Abstract. Starting from a programmatic representation of a Markov decision process (MDP) in the PRISM syntax, we examine the task of synthesizing a policy for the MDP in the form of a program. The PRISM syntax allows us to specify MDPs concisely by partitioning the state space into regions with similar actions and transitions. Since we cannot address the complete expressive power of the PRISM syntax, we restrict ourselves to a small subclass of two-dimensional deterministic gridworlds partitioned into regions along linear predicates. Using a relaxation of this class of gridworlds, we present an algorithm to synthesize programmatic policies which exploit the symmetries present in the representation of the MDP. Our programs use memory to track subgoals and navigate between the edges of regions, providing a concise representation of a policy. Our main result is a proof of a concrete upper bound on the size of the synthesized programs. We also give a practical implementation of our synthesis algorithm, which we evaluate on randomly generated gridworld instances.
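To make the idea of a memoryful programmatic policy concrete, here is a minimal Python sketch. It is only illustrative and not the thesis implementation: the names `LinearPredicate` and `make_policy`, and the encoding of subgoals as (predicate, action) pairs, are assumptions made for this example. A single memory cell records which region edge is the current subgoal, and the policy advances it whenever the corresponding linear predicate is crossed.

```python
from dataclasses import dataclass

@dataclass
class LinearPredicate:
    """A half-plane a*x + b*y <= c; gridworld regions are cut out by such predicates."""
    a: int
    b: int
    c: int

    def holds(self, x: int, y: int) -> bool:
        return self.a * x + self.b * y <= self.c


def make_policy(subgoals):
    """subgoals: ordered list of (predicate, action) pairs; each predicate
    marks a region edge to cross, and the action heads toward that edge."""
    memory = {"goal": 0}  # single memory cell: index of the active subgoal

    def policy(x: int, y: int) -> str:
        # Advance the memory past every edge already crossed.
        while memory["goal"] < len(subgoals) and subgoals[memory["goal"]][0].holds(x, y):
            memory["goal"] += 1
        if memory["goal"] == len(subgoals):
            return "stay"  # all subgoals reached
        return subgoals[memory["goal"]][1]

    return policy


# Usage: first cross the line x >= 3, then the line y >= 2.
policy = make_policy([
    (LinearPredicate(-1, 0, -3), "right"),  # -x <= -3, i.e. x >= 3
    (LinearPredicate(0, -1, -2), "up"),     # -y <= -2, i.e. y >= 2
])
print(policy(0, 0))  # "right": still left of the line x = 3
print(policy(4, 0))  # "up": first edge crossed, heading for y = 2
print(policy(4, 2))  # "stay": both subgoals reached
```

The memory cell is what keeps such a policy concise: its description grows with the number of region edges to cross rather than with the number of grid cells, in the spirit of the concise policy representations described in the abstract.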

The video below shows a policy synthesized by our implementation for one such gridworld.