Maximum Causal Entropy Specification Inference
from Demonstrations
Marcell J. Vazquez-Chanlatte & Sanjit A. Seshia
University of California, Berkeley
Slides @
mjvc.me/CAV2020
Collaboration through Demonstrations
Demonstrations are often a natural way to relay intent.
However, it's often unclear how to leverage this information.
Structure of the talk
Prelude - Problem Setup
Act 1 - Naïve Problem Formulation
Act 2 - Efficient Encoding using Binary Decision Diagrams
Finale - Experiment
Basic definitions
- Assume some fixed sets of states and actions.
- A trace, $\xi$, is a sequence of states and actions.
- Assume all traces have the same length, \(\tau \in \mathbb{N}\).
- A (Boolean) specification, $\varphi$, is a set of traces.
- We say $\xi$ satisfies $\varphi$, written $\xi \models \varphi$, if $\xi \in \varphi$ (see the sketch below).
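To make these definitions concrete, here is a minimal Python sketch; the type aliases and the `satisfies` helper are illustrative, not from the paper.

```python
from typing import Hashable, Set, Tuple

State = Hashable
Action = Hashable
# A trace xi is a fixed-length sequence of (state, action) pairs.
Trace = Tuple[Tuple[State, Action], ...]
# A (Boolean) specification phi is simply a set of traces.
Specification = Set[Trace]

def satisfies(xi: Trace, phi: Specification) -> bool:
    """xi |= phi  iff  xi is an element of phi."""
    return xi in phi

# Tiny example with horizon tau = 2.
xi = (("s0", "right"), ("s1", "up"))
phi = {xi}                      # the specification accepting exactly this trace
assert satisfies(xi, phi)
```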
Relevant Facts about Task Specifications
- Derived from Formal Logic, Automata, Rewards ($\epsilon$-"optimal").
- No a-priori ordering between acceptable behaviors.
Actions induce ordering
- A demonstration of a task $\varphi$ is an unlabeled example where the agent tries to satisfy $\varphi$.
- Agency is key: we need a notion of action.
- Success probabilities induce an ordering (estimated in the sketch below).
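As an illustration of how success probabilities order specifications, here is a hedged Python sketch that estimates a policy's satisfaction probability by Monte Carlo; `policy`, `step`, and `accepts` are hypothetical callables, not part of the paper's API.

```python
def satisfaction_probability(policy, step, accepts, s0, tau, n_samples=10_000):
    """Monte Carlo estimate of Pr(xi |= phi) when acting under `policy`.

    policy(s)      -> action       (possibly randomized)
    step(s, a)     -> next state   (samples the stochastic environment)
    accepts(trace) -> bool         (membership test for the specification phi)
    """
    hits = 0
    for _ in range(n_samples):
        s, trace = s0, []
        for _ in range(tau):
            a = policy(s)
            s = step(s, a)
            trace.append((s, a))
        hits += accepts(tuple(trace))
    return hits / n_samples
```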
Informal Problem Statement
- Assume an agent is operating in a Markov Decision Process while trying to satisfy some unknown specification.
- Given a sequence of demonstrations and a collection of specifications, find the specification that best explains the agent's behavior.
Solution Ingredients
- Compare likelihoods. (This work)
- Search for likely specifications. (Future work)
Structure of the talk
Prelude - Problem Setup
Act 1 - Naïve Problem Formulation
Act 2 - Efficient Encoding using Binary Decision Diagrams
Finale - Experiment
Inverse Reinforcement Learning
Assume the agent is acting in a Markov Decision Process and optimizing the sum of an unknown state reward, $r(s)$, i.e.,
\[
\max_{\pi} \Big(\mathbb{E}_{s_{1:\tau}}\big(\sum_{i=1}^\tau r(s_i)~|~\pi\big)\Big)
\]
where \[\pi(a~|~s) = \Pr(a~|~s)\]
Given a series of demonstrations, what reward, $r(s)$, best explains
the behavior? (Abbeel and Ng 2004)
Inverse Reinforcement Learning
Given a series of demonstrations, what reward, $r(s)$, best explains
the behavior? (Abbeel and Ng 2004)
- Problem: There is no unique solution as posed!
\[ \Pr(r~|~\xi) = ? \]
- Idea: Disambiguate via the Principle of Maximum Causal Entropy. (Ziebart, et al. 2010)
Principle of Maximum Causal Entropy Intuition
\[\Pr(A_{1:\tau}~||~S_{1:\tau}) \triangleq \prod_{t=1}^\tau \Pr(A_t~|~A_{1:t-1}, S_{1:t})\]
Causal conditioning: current actions shouldn't depend on information from the future.
Principle of Maximum Causal Entropy Intuition
\[\Pr(A_{1:\tau}~||~S_{1:\tau}) =~?\]
Key Idea: Don't commit to any particular prediction more than the data forces you to.
Informally: Minimize surprise of actions, subject to feature matching.
Formally: Maximize the expected causal entropy
\[
H(A_{1:\tau}~||~S_{1:\tau}) \triangleq \mathbb{E}\left[\log\left(\frac{1}{\Pr(A_{1:\tau}~||~S_{1:\tau})}\right)\right]
\]
subject to feature matching (in this work: while matching satisfaction probabilities).
Principle of Maximum Causal Entropy Intuition
Maximize
\[
H(A_{1:\tau}~||~S_{1:\tau}) \triangleq \mathbb{E}\left[\log\left(\frac{1}{\Pr(A_{1:\tau}~||~S_{1:\tau})}\right)\right]
\]
while matching satisfaction probabilities.
(Figure: forecasts of a minimum entropy forecaster vs. a maximum entropy forecaster.)
Key Take Away: Maximum Causal Entropy → Robust Forecaster
Linear Rewards
Consider the case where the reward is a linear combination of state features.
\[
r(s) \triangleq \underbrace{\vec{\theta}}_{\text{weights}}\cdot \underbrace{\vec{f}(s)}_{\text{features}}\]
Suppose we observe: \(\mathbb{E}[\vec{f}]\)
Maximum Causal Entropy Policy
\begin{equation}
\log\big(\pi_{\mathbf{\theta}}(a_t~|~s_t)\big) = Q_{\mathbf{\theta}}(a_t, s_t) - V_{\mathbf{\theta}}(s_t)
\end{equation}
where
\begin{equation}\label{eq:soft_bellman_backup}
\begin{split}
&V_{\mathbf{\theta}}(s_t) \triangleq \ln \sum_{a_t} e^{Q_{\mathbf{\theta}}(a_t, s_t)}\\
&Q_{\mathbf{\theta}}(a_t, s_t) \triangleq \mathbb{E}_{s_{t+1}}\left[ V_{\mathbf{\theta}}(s_{t+1})~|~s_t, a_t\right] + \vec{\theta} \cdot \vec{f}(s_t)
\end{split}
\end{equation}
Fit \(\theta\) to match \(\mathbb{E}[\vec{f}]\); the resulting policy is the one with maximum causal entropy (Ziebart, et al. 2010). A sketch of the backup follows.
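A minimal NumPy sketch of the soft Bellman backup above, assuming a tabular MDP; the array layout and function names are my own, not the paper's implementation.

```python
import numpy as np

def smax(x, axis=0):
    """Soft maximum (log-sum-exp), numerically stabilized."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def max_causal_ent_policy(P, features, theta, horizon):
    """Finite-horizon soft Bellman backup for a linear reward r(s) = theta . f(s).

    P[a, s, s'] : transition probabilities.
    features[s] : feature vector of state s.
    Returns a list of log pi_t(a | s) arrays, one per time step.
    """
    n_states = P.shape[1]
    r = features @ theta                # state rewards
    V = np.zeros(n_states)              # V at the horizon is 0
    log_pis = []
    for _ in range(horizon):
        Q = P @ V + r                   # Q[a, s] = E[V(s') | s, a] + r(s)
        V = smax(Q, axis=0)             # soft (log-sum-exp) backup over actions
        log_pis.append(Q - V)           # log pi(a | s) = Q(a, s) - V(s)
    return log_pis[::-1]                # log_pis[t] corresponds to time t
```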
Idea: Reduce Specification
Inference to Maximum Entropy IRL.
Q: What should the reward be?
Proposal: Use an indicator reward:
\[
r(\xi) \triangleq
\begin{cases}
1 & \text{if } \xi \in \varphi\\
0 & \text{otherwise}
\end{cases}
\]
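With the indicator reward, the statistic to match is simply the fraction of demonstrations that satisfy \(\varphi\); a small sketch (the helper name is mine, not the paper's):

```python
def empirical_satisfaction_rate(demonstrations, phi):
    """Fraction of demonstrated traces that satisfy the specification phi.

    Under the indicator reward, this plays the role that feature
    expectations play in standard maximum causal entropy IRL.
    """
    return sum(xi in phi for xi in demonstrations) / len(demonstrations)
```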
Note: States are now traces. Let's define the transition dynamics exactly.
Suppose \(\varphi\) is over traces of length 2.
States now correspond to paths in the unrolled tree.
Idea: Reduce Specification
Inference to Maximum Entropy IRL.
Maximum Causal Entropy Policy
\begin{equation}
\log\big(\pi_{\mathbf{\theta}}(a_{1:t}~|~s_{1:t})\big) = Q_{\mathbf{\theta}}(a_{1:t}, s_{1:t}) - V_{\mathbf{\theta}}(s_{1:t})
\end{equation}
where
\[
V_{\mathbf{\theta}}(s_{1:t}) \triangleq \ln \sum_{a_{1:t}} e^{Q_{\mathbf{\theta}}(a_{1:t}, s_{1:t})}
\]
\[
Q_{\mathbf{\theta}}(a_{1:t}, s_{1:t}) \triangleq \mathbb{E}_{s_{1:t+1}}\left[ V_{\mathbf{\theta}}(s_{1:t+1})~|~s_{1:t}, a_{1:t}\right] + \vec{\theta} \cdot \vec{f}(s_{1:t})
\]
Idea: Reduce Specification
Inference to Maximum Entropy IRL.
Maximum Causal Entropy Policy
\[
V_{\mathbf{\theta}}(s_{1:t}) \triangleq \text{smax}_{a_{1:t}}Q_\theta(a_{1:t}, s_{1:t})
\]
\[
Q_{\mathbf{\theta}}(a_{1:t}, s_{1:t}) \triangleq \mathbb{E}_{s_{1:t+1}}\left[ V_{\mathbf{\theta}}(s_{1:t+1})~|~s_{1:t}, a_{1:t}\right] + \vec{\theta} \cdot \vec{f}(s_{1:t})
\]
Idea: Reduce Specification
Inference to Maximum Entropy IRL.
Note: The satisfaction probability grows monotonically in \(\theta\).
So we can binary search for the \(\theta\) whose satisfaction probability matches the data (sketched below).
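A sketch of that binary search, assuming access to a monotone routine `sat_prob(theta)` returning the satisfaction probability of the maximum causal entropy policy for weight `theta`; the routine and the search bounds are hypothetical.

```python
def fit_theta(sat_prob, empirical_rate, lo=0.0, hi=100.0, tol=1e-6):
    """Find theta such that sat_prob(theta) matches the empirical rate.

    Relies on sat_prob being monotonically increasing in theta;
    lo/hi are illustrative bracketing bounds.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sat_prob(mid) < empirical_rate:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```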
Idea: Reduce Specification
Inference to Maximum Entropy IRL.
Problem: Unrolled tree grows exponentially in horizon!
Idea: Reduce Specification
Inference to Maximum Entropy IRL.
Observation 1: A lot of shared structure in computation graph.
Observation 2: System and environment actions are ordered.
Idea: Reduce Specification
Inference to Maximum Entropy IRL.
Idea: Encode to binary predicate
\[
\psi: \{0, 1\}^n \to \{0,1\}
\]
and represent as Reduced Ordered Binary Decision Diagram
(Bryant 1986).
Structure of the talk
Prelude - Problem Setup
Act 1 - Naïve Problem Formulation
Act 2 - Efficient Encoding using Binary Decision Diagrams
Finale - Experiment
Summary
For time, we focus on describing the BDD encoding.
Random Bit Model
Idea: Model the Markov Decision Process as a deterministic transition system with access to $n_c$ coin flips.
Note: The principle of maximum causal entropy, together with the finite horizon, makes the approach robust to small dynamics mismatches.
Random Bit Model
Next: Assume \(\#(\text{Actions}) = 2^{n_a} \)
\[
\text{Dynamics} : S \times {\{0, 1\}}^{n_a + n_c} \to S
\]
Random Bit Model
Unrolling \(\tau\) steps and composing with the specification results in a predicate (sketched below):
\[
\psi : {\{0, 1\}}^{\tau\cdot (n_a + n_c)} \to \{0, 1\}
\]
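A hedged Python sketch of this unrolling; `step` and `accepts` are hypothetical stand-ins for the deterministic random-bit-model dynamics and the specification.

```python
def unroll(step, accepts, s0, tau, n_a, n_c):
    """Compose tau steps of the random-bit-model dynamics with a specification.

    step(s, action_bits, coin_bits) -> next state   (deterministic)
    accepts(states) -> bool                          (the specification)
    Returns psi : {0,1}^(tau * (n_a + n_c)) -> {0,1} as a Python function.
    """
    def psi(bits):
        assert len(bits) == tau * (n_a + n_c)
        s, states = s0, []
        for t in range(tau):
            chunk = bits[t * (n_a + n_c):(t + 1) * (n_a + n_c)]
            action_bits, coin_bits = chunk[:n_a], chunk[n_a:]
            s = step(s, action_bits, coin_bits)
            states.append(s)
        return accepts(states)
    return psi
```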
Random Bit Model
\[
\psi : {\{0, 1\}}^{\tau\cdot (n_a + n_c)} \to \{0, 1\}
\]
Proposal: Represent \(\psi\) as Binary Decision Diagram with bits in causal order.
Maximum Causal Entropy and BDDs
Q: Can the maximum causal entropy policy be computed on causally ordered BDDs?
A: Yes! Due to the following identities (checked numerically below):
- Associativity of \(\text{smax}\) and \(\mathbb{E}\):
\[
\text{smax}(\alpha_1, \ldots, \alpha_4) = \ln\Big(\sum_{i=1}^4 e^{\alpha_i}\Big) = \ln\big(e^{\ln(e^{\alpha_1}+ e^{\alpha_2})} + e^{\ln(e^{\alpha_3}+ e^{\alpha_4})}\big) = \text{smax}\big(\text{smax}(\alpha_1, \alpha_2), \text{smax}(\alpha_3, \alpha_4)\big)
\]
- \(\text{smax}(\alpha, \alpha) = \alpha + \ln(2)\)
- \(\mathbb{E}(\alpha, \alpha) = \alpha\)
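A quick numerical sanity check of these identities (the values are arbitrary):

```python
import math

def smax(*xs):
    """Soft maximum: smax(x1, ..., xn) = ln(sum_i exp(x_i))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

a1, a2, a3, a4 = 0.3, -1.2, 2.0, 0.7
# Associativity: a 4-way smax equals nested pairwise smax (enables BDD evaluation).
assert math.isclose(smax(a1, a2, a3, a4), smax(smax(a1, a2), smax(a3, a4)))
# Merged/duplicate children in the BDD:
assert math.isclose(smax(a1, a1), a1 + math.log(2))
assert math.isclose(0.5 * a1 + 0.5 * a1, a1)   # E over two identical children
```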
Size Bounds
Q: How big can these causal BDDs be?
\[
|BDD| \leq \overbrace{\underbrace{\tau}_{\text{horizon}} \cdot \big( \log(|A|) + \text{\#coins} \big)}^{\text{\# inputs}} \cdot \big( \underbrace{|S \times S_\varphi \times A|}_{\text{composed automaton}} \cdot 2^{\text{\#coins}} \big)
\]
See the paper for the proof.
Linear in horizon!
Note: Using function composition, the BDD can be built in polynomial time.
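For intuition, a hypothetical instantiation of the bound (the numbers below are illustrative, not from the paper):
\[
\tau = 10,\quad |A| = 4,\quad \text{\#coins} = 1,\quad |S \times S_\varphi \times A| = 96
\;\Longrightarrow\;
|BDD| \leq 10 \cdot (2 + 1) \cdot (96 \cdot 2) = 5760.
\]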
Summary
Structure of the talk
Prelude - Problem Setup
Act 1 - Naïve Reduction to Maximum Causal Entropy IRL
Act 2 - The Random Bit Model and BDD Based Encoding
Finale - Experiment
Toy Experiment
Dynamics
- The agent can attempt to move {↑, ↓, ←, →}.
- With probability \(\frac{1}{32}\), the agent slips and moves ← instead (dynamics sketched below).
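A minimal Python sketch of these slip dynamics; the grid size and move encoding are hypothetical, only the slip rule comes from the slide.

```python
import random

GRID = 8  # hypothetical grid width/height, for illustration only
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(pos, action, slip_prob=1 / 32):
    """Gridworld dynamics: with probability 1/32 the agent slips and moves left."""
    dx, dy = MOVES["left"] if random.random() < slip_prob else MOVES[action]
    x, y = pos
    # Clamp to the grid boundaries.
    return (min(max(x + dx, 0), GRID - 1), min(max(y + dy, 0), GRID - 1))
```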
Provided 6 unlabeled demonstrations for the task:
- Go to and stay at the yellow tile (recharge).
- Avoid red tiles (lava).
- If you enter a blue tile (get wet), touch a brown tile (dry off) before recharging.
- All within 10 time steps.
Note: The dashed demonstration fails to dry off due to slipping.
Toy Experiments
Spec | Policy Size (#nodes) | ROBDD build time | Relative Log Likelihood (vs. true)
true | 1 | 0.48s | 0
φ1 = Avoid Lava | 1797 | 1.5s | -22
φ2 = Recharge | 1628 | 1.2s | 5
φ3 = Don't recharge while wet | 750 | 1.6s | -10
φ4 = φ1 ∧ φ2 | 523 | 1.9s | 4
φ5 = φ1 ∧ φ3 | 1913 | 1.5s | -2
φ6 = φ2 ∧ φ3 | 1842 | 2s | 15
φ⋆ = φ1 ∧ φ2 ∧ φ3 | 577 | 1.6s | 27
(Policy size and build time: smaller is better. Relative log likelihood: bigger is better.)
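Reading off the last column, selecting the most likely specification is a simple argmax; the dictionary below just transcribes the table's relative log likelihoods.

```python
# Relative log likelihoods (vs. the true demonstrator), copied from the table above.
rel_log_likelihood = {
    "true": 0, "phi1": -22, "phi2": 5, "phi3": -10,
    "phi4": 4, "phi5": -2, "phi6": 15, "phi_star": 27,
}
best = max(rel_log_likelihood, key=rel_log_likelihood.get)
assert best == "phi_star"   # phi1 ∧ phi2 ∧ phi3 best explains the demonstrations
```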
Informal Problem Statement
Solution Ingredients
- Compare likelihoods. (This work)
- Search for likely specifications. (Future work)
Communicating via Demonstrations (Future Work)
Maximum Causal Entropy Specification Inference
from Demonstrations
Marcell J. Vazquez-Chanlatte & Sanjit A. Seshia
University of California, Berkeley
Slides @
mjvc.me/CAV2020
Motivating Questions
Consider an agent acting in the following stochastic grid world.
Q: Did the agent intend to touch the red tile?
A: Probably not.
Q: Can we automatically infer agent intent?
A: Stay tuned!
Q: How should we represent intent?
A: Formal specifications?
Q: What is the agent likely to do in the future?
Q: Can we algorithmically forecast the agent's behavior?