Maximum Causal Entropy Specification Inference
from Demonstrations
Marcell J. Vazquez-Chanlatte & Sanjit A. Seshia
University of California, Berkeley
Slides @
mjvc.me/CAV2020
Collaboration through Demonstrations
Demonstrations are often a natural way to relay intent.
However, it is often unclear how to leverage this information.
Structure of the talk
Prelude - Problem Setup
Act 1 - Naïve Problem Formulation
Act 2 - Efficient Encoding using Binary Decision Diagrams
Finale - Experiment
Basic definitions
Assume some fixed sets of states and actions.
A trace, $\xi$, is a sequence of states and actions.
Assume all traces have the same length, \(\tau \in \mathbb{N}\).
A (Boolean) specification, $\varphi$, is a set of
traces.
We say $\xi$ satisfies $\varphi$,
written $\xi \models \varphi$, if $\xi \in
\varphi$.
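As a tiny illustration (with made-up states and actions, not from the talk), a specification can literally be stored as a set of traces, with satisfaction being set membership:

```python
# A trace is a tuple of (state, action) pairs; a Boolean specification is a
# set of such traces; satisfaction is plain set membership.
xi = (("s0", "a1"), ("s1", "a0"), ("s2", "a1"))          # one length-3 trace
phi = {xi, (("s0", "a0"), ("s0", "a0"), ("s0", "a0"))}   # a specification
print(xi in phi)   # xi |= phi  ->  True
```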
Relevant Facts about Task Specifications
Derived from Formal Logic, Automata, Rewards ($\epsilon$-"optimal").
No a-priori ordering between acceptable behaviors.
Actions induce ordering
A demonstration of a task
$\varphi$ is an unlabeled example where
the agent tries to satisfy $\varphi$.
Agency is key. We need a notion of action.
Success probabilities induce an ordering.
Informal Problem Statement
Assume an agent is operating in a Markov Decision Process
while trying to satisfy some unknown specification.
Given a sequence of demonstrations
and a collection of specifications,
find the specification
that best explains the agent's behavior.
Solution Ingredients
Compare Likelihoods.
(This work)
Search for likely specifications.
(Future work)
Structure of the talk
Prelude - Problem Setup
Act 1 - Naïve Problem Formulation
Act 2 - Efficient Encoding using Binary Decision Diagrams
Finale - Experiment
Inverse Reinforcement Learning
Assume agent is acting in a Markov
Decision process and optimizing the sum
of an unknown state reward, $r(s)$, i.e.:
\[
\max_{\pi} \Big(\mathbb{E}_{s_{1:\tau}}\big(\sum_{i=1}^\tau r(s_i)~|~\pi\big)\Big)
\]
where \[\pi(a~|~s) = \Pr(a~|~s)\]
Given a series of demonstrations, what reward, $r(s)$, best explains
the behavior? (Abbeel and Ng 2004)
Problem: There is no unique solution as posed!
\[ \Pr(r~|~\xi) = ? \]
Idea:
Disambiguate via the
Principle of Maximum Causal Entropy.
(Ziebart, et al. 2010)
Principle of Maximum Causal Entropy: Intuition
\[\Pr(A_{1:\tau}~||~S_{1:\tau}) \triangleq \prod_{t=1}^\tau \Pr(A_t~|~A_{1:t-1}, S_{1:t})\]
Causal Conditioning
Current actions shouldn't depend on
information from the future.
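A minimal sketch of computing this causally conditioned probability for a (hypothetical) history-dependent policy given as a callable; the point is that the factor at time t only looks at past actions and states up to time t:

```python
# Sketch: Pr(A_{1:tau} || S_{1:tau}) as a product of per-step action
# probabilities. The policy `pi(a_t, past_actions, states_so_far)` is an
# assumed callable, not something defined in the talk.
def causally_conditioned_prob(actions, states, pi):
    p = 1.0
    for t in range(len(actions)):
        # a_t may depend on a_{1:t-1} and s_{1:t}, never on future states
        p *= pi(actions[t], actions[:t], states[:t + 1])
    return p
```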
\[\Pr(A_{1:\tau}~||~S_{1:\tau}) =~?\]
Key Idea:
Don't commit to any particular prediction more
than the data forces you to.
Informally:
Minimize surprise of actions,
subject to feature matching.
Formally:
Maximize expected causal entropy.
\[
H(A_{1:\tau}~||~S_{1:\tau}) \triangleq \mathbb{E}\left[\log\left(\frac{1}{\Pr(A_{1:\tau}~||~S_{1:\tau})}\right)\right]
\]
subject to feature matching.
For specification inference, the constraint becomes: match satisfaction probabilities rather than generic feature expectations.
Minimum Entropy Forecaster
Put all of the probability mass on one path.
Maximum Entropy Forecaster
Exponentially prefer high-value paths.
Key Take Away
Maximum Causal Entropy → Robust Forecaster
Linear Rewards
Consider the case where the reward is a linear
combination of state features.
\[
r(s) \triangleq \underbrace{\vec{\theta}}_{\text{weights}}\cdot \underbrace{\vec{f}(s)}_{\text{features}}\]
Suppose we observe: \(\mathbb{E}[\vec{f}]\)
Maximum Causal Entropy Policy
\begin{equation}
\log\big(\pi_{\mathbf{\theta}}(a_t~|~s_t)\big) = Q_{\mathbf{\theta}}(a_t, s_t) - V_{\mathbf{\theta}}(s_t)
\end{equation}
where
\begin{equation}\label{eq:soft_bellman_backup}
\begin{split}
&V_{\mathbf{\theta}}(s_t) \triangleq \ln \sum_{a_t} e^{Q_{\mathbf{\theta}}(a_t, s_t)}\\
&Q_{\mathbf{\theta}}(a_t, s_t) \triangleq \mathbb{E}_{s_{t+1}}\left[ V_{\mathbf{\theta}}(s_{t+1})~|~s_t, a_t\right] + \vec{\theta} \cdot \vec{f}(s_t)
\end{split}
\end{equation}
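A minimal tabular sketch of this soft Bellman backup for a finite horizon; the transition array, feature matrix, and weights below are illustrative assumptions, not the paper's code:

```python
import numpy as np
from scipy.special import logsumexp

def soft_backup(P, F, theta, tau):
    """P[s, a, s'] transition probs, F[s, :] state features, theta weights."""
    nS, nA, _ = P.shape
    r = F @ theta                                  # r(s) = theta . f(s)
    V = np.zeros(nS)                               # value at the horizon
    policies = []
    for _ in range(tau):
        Q = r[:, None] + P @ V                     # Q(a, s) = E[V(s')|s,a] + r(s)
        V = logsumexp(Q, axis=1)                   # V(s) = ln sum_a exp Q(a, s)
        policies.append(np.exp(Q - V[:, None]))    # log pi(a|s) = Q - V
    return V, policies[::-1]                       # policies in forward time order
```

The backup runs backward from the horizon, which is why the per-step policies are reversed into forward time order before being returned.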
Fit \(\theta\) to match \(\mathbb{E}[\vec{f}]\)
Resulting policy is the one with maximum causal entropy.
(Ziebart, et al. 2010)
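One way to see why fitting \(\theta\) to match \(\mathbb{E}[\vec{f}]\) is the right condition (standard in maximum causal entropy IRL): the gradient of the demonstrations' causal log-likelihood is the gap between observed and model feature expectations, so it vanishes exactly when they match.
\[
\nabla_{\theta} \sum_{\text{demos}} \log \Pr_{\theta}(A_{1:\tau}~||~S_{1:\tau})
\;\propto\;
\underbrace{\mathbb{E}_{\text{demos}}\Big[\sum_{t=1}^{\tau} \vec{f}(s_t)\Big]}_{\text{observed}}
\;-\;
\underbrace{\mathbb{E}_{\pi_{\theta}}\Big[\sum_{t=1}^{\tau} \vec{f}(s_t)\Big]}_{\text{model}}
\]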
Idea: Reduce Specification
Inference to Maximum Entropy IRL.
Q: What should the reward be?
Proposal: Use indicator.
\[
r(\xi) \triangleq
\begin{cases}
1 & \text{if } \xi \in \varphi\\
0 & \text{otherwise}
\end{cases}
\]
Note: States are now traces.
Let's define the transition dynamics precisely.
Suppose \(\varphi\) is over traces of length 2.
States now correspond to paths
in the unrolled tree.
Maximum Causal Entropy Policy
\begin{equation}
\log\big(\pi_{\mathbf{\theta}}(a_{1:t}~|~s_{1:t})\big) = Q_{\mathbf{\theta}}(a_{1:t}, s_{1:t}) - V_{\mathbf{\theta}}(s_{1:t})
\end{equation}
where
\[
V_{\mathbf{\theta}}(s_{1:t}) \triangleq \ln \sum_{a_{1:t}} e^{Q_{\mathbf{\theta}}(a_{1:t}, s_{1:t})}
\]
\[
Q_{\mathbf{\theta}}(a_{1:t}, s_{1:t}) \triangleq \mathbb{E}_{s_{t+1}}\left[ V_{\mathbf{\theta}}(s_{1:t+1})~|~s_{1:t}, a_{1:t}\right] + \vec{\theta} \cdot \vec{f}(s_{1:t})
\]
Equivalently, writing \(\text{smax}\) for the log-sum-exp above:
\[
V_{\mathbf{\theta}}(s_{1:t}) \triangleq \text{smax}_{a_{1:t}}Q_\theta(a_{1:t}, s_{1:t})
\]
Note: Satisfaction probability grows monotonically in \(\theta\).
Can binary search for \(\theta\) such that satisfaction probability matches data.
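A minimal sketch of that binary search, assuming a routine `sat_prob(theta)` that returns the satisfaction probability of the maximum causal entropy policy for weight theta (e.g., computed with a backup like the one sketched earlier) and an empirical rate at least that of the uniform policy:

```python
def fit_theta(sat_prob, empirical_rate, hi=1.0, tol=1e-6):
    """Bisection works because sat_prob is monotone in theta."""
    lo = 0.0
    while sat_prob(hi) < empirical_rate:   # grow the bracket until it covers
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sat_prob(mid) < empirical_rate:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```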
Problem: Unrolled tree grows exponentially in horizon!
Observation 1: A lot of shared structure in computation graph.
Observation 2: System and environment actions are ordered.
Idea: Encode to binary predicate
\[
\psi: \{0, 1\}^n \to \{0,1\}
\]
and represent it as a Reduced Ordered Binary Decision Diagram
(Bryant 1986).
Structure of the talk
Prelude - Problem Setup
Act 1 - Naïve Problem Formulation
Act 2 - Efficient Encoding using Binary Decision Diagrams
Finale - Experiment
Summary
For time, we focus on describing the BDD encoding.
Random Bit Model
Idea:
Model the Markov Decision Process as a deterministic
transition system with access to $n_c$ coin flips.
Note:
Principle of maximum causal entropy + finite horizon together are robust to small dynamics mismatches.
Random Bit Model
Next: Assume \(\#(\text{Actions}) = 2^{n_a} \)
\[
\text{Dynamics} : S \times {\{0, 1\}}^{n_a + n_c} \to S
\]
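A minimal sketch of the random bit model for the slippery gridworld used later (with probability 1/32 the agent slips and moves ←): five fair coin bits plus two action bits make the step a deterministic function. The grid, encoding, and names are illustrative assumptions, not the paper's code.

```python
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
ACTIONS = {(0, 0): "up", (0, 1): "down", (1, 0): "left", (1, 1): "right"}

def step(state, action_bits, coin_bits):
    """Deterministic dynamics: S x {0,1}^(n_a + n_c) -> S with n_a=2, n_c=5."""
    slipped = all(coin_bits)                 # all five coins = 1: prob 1/32
    move = "left" if slipped else ACTIONS[tuple(action_bits)]
    dx, dy = MOVES[move]
    x, y = state
    return (x + dx, y + dy)

print(step((3, 3), (1, 1), (1, 1, 1, 1, 1)))   # slips left -> (2, 3)
```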
Unrolling \(\tau\) steps and composing with specification results in a predicate.
\[
\psi : {\{0, 1\}}^{\tau\cdot (n_a + n_c)} \to \{0, 1\}
\]
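Continuing the sketch above (same assumed `step`, plus an assumed `spec` callable that checks a trace), unrolling just means iterating the deterministic step for \(\tau\) steps and composing with the specification:

```python
def unroll(step, spec, s0, tau, n_a, n_c):
    """Return psi : {0,1}^(tau*(n_a+n_c)) -> {0,1} for initial state s0."""
    def psi(bits):
        s, trace = s0, [s0]
        for t in range(tau):
            chunk = bits[t * (n_a + n_c):(t + 1) * (n_a + n_c)]
            s = step(s, chunk[:n_a], chunk[n_a:])
            trace.append(s)
        return spec(trace)                 # True iff the induced trace |= phi
    return psi
```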
Proposal: Represent \(\psi\) as a Binary Decision Diagram with the bits in causal order.
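A minimal sketch of building such a causally ordered BDD with the `dd` Python package (an assumption; the paper's tooling may differ). Variables are declared in causal order, action bit then coin bit per step, and the predicate here is purely illustrative:

```python
from dd.autoref import BDD   # pip install dd

bdd = BDD()
# Causal order: (action bit, coin bit) for step 0, then step 1, ...
bdd.declare("a0", "c0", "a1", "c1")

# Illustrative psi: "some step picks action 1 and its coin does not flip it".
psi = bdd.add_expr("(a0 & ~c0) | (a1 & ~c1)")
assert psi != bdd.false
```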
Maximum Causal Entropy and BDDs
Q: Can the Maximum Causal Entropy Policy be computed on causally ordered BDDs?
A: Yes! Due to:
Associativity of \(\text{smax}\) and \(\mathbb{E}\).
\[
\text{smax}(\alpha_1, \ldots, \alpha_4) = \ln(\sum_{i=1}^4 e^{\alpha_i})
\]
\[
\text{smax}(\alpha_1, \ldots, \alpha_4) = \ln(e^{\ln(e^{\alpha_1}+ e^{\alpha_2})} + e^{\ln(e^{\alpha_3}+ e^{\alpha_4})})
\]
\[
\text{smax}(\alpha_1, \ldots, \alpha_4) = \text{smax}(\text{smax}(\alpha_1, \alpha_2), \text{smax}(\alpha_3, \alpha_4))
\]
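A quick numerical check of this associativity (the pairing is arbitrary), using `numpy.logaddexp` as the two-argument smax:

```python
import numpy as np

alpha = np.array([0.3, -1.2, 2.0, 0.7])
flat = np.log(np.exp(alpha).sum())                       # smax(a1, ..., a4)
nested = np.logaddexp(np.logaddexp(alpha[0], alpha[1]),
                      np.logaddexp(alpha[2], alpha[3]))  # smax(smax, smax)
assert np.isclose(flat, nested)
```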
\(\text{smax}(\alpha, \alpha) = \alpha + \ln(2)\)
\(\mathbb{E}(\alpha, \alpha) = \alpha\)
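Putting the pieces together, here is a minimal sketch (not the authors' implementation) of evaluating the value by one pass over a causally ordered decision diagram: nodes are explicit tuples rather than a BDD library, coin flips are assumed fair, action-bit nodes combine children with smax, coin-bit nodes with expectation, and levels skipped by BDD reduction are repaired with the two identities above.

```python
import math
import numpy as np

def smax(a, b):
    return np.logaddexp(a, b)                    # two-argument soft max

def value(node, level, theta, is_action, n_levels):
    """Value of `node` reached at `level`; leaves are True/False, internal
    nodes are (node_level, low_child, high_child) in causal variable order."""
    if isinstance(node, bool):
        v, node_level = (theta if node else 0.0), n_levels   # reward theta*[sat]
    else:
        node_level, lo, hi = node
        v_lo = value(lo, node_level + 1, theta, is_action, n_levels)
        v_hi = value(hi, node_level + 1, theta, is_action, n_levels)
        v = smax(v_lo, v_hi) if is_action[node_level] else 0.5 * (v_lo + v_hi)
    # Repair skipped levels: smax(a, a) = a + ln 2 per skipped action bit;
    # E(a, a) = a per skipped coin bit (no correction needed).
    return v + math.log(2) * sum(is_action[level:node_level])

# Toy diagram: one action bit, then one coin bit; satisfied iff the action
# bit is 1 and the coin does not "slip" (coin = 0).
is_action = [True, False]
coin_node = (1, True, False)
root = (0, False, coin_node)
print(value(root, 0, 3.0, is_action, n_levels=2))
```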
Size Bounds
Q: How big can these Causal BDDs be?
\[
|BDD| \leq \overbrace{\underbrace{\tau}_{\text{horizon}} \cdot \big( \log(|A|) + \text{#coins} \big)}^{\text{# inputs}} \cdot \big( \underbrace{|S \times S_\varphi \times A|}_{\text{composed automaton}} \cdot 2^{\text{#coins}} \big)
\]
See paper for proof.
Linear in horizon!
Note: Using function composition, the BDD can be built in polynomial time.
Summary
Structure of the talk
Prelude - Problem Setup
Act 1 - Naive Reduction to Maximum Causal Entropy IRL
Act 2 - The Random Bit Model and BDD Based Encoding
Finale - Experiment
Toy Experiment
Dynamics
The agent can attempt to move {↑, ↓, ←, →}.
With probability \(\frac{1}{32}\), the agent slips and moves ←.
Provided 6 unlabeled demonstrations for the task:
Go to and stay at
the yellow
tile (recharge).
Avoid red
tiles (lava).
If you enter a blue tile (get wet),
touch a brown tile (dry off)
before recharging.
Within 10 time steps.
Note: Dashed demonstration fails to dry off due to slipping.
Spec                              Policy Size (#nodes)   ROBDD build time   Relative Log Likelihood (vs. true)
true                              1                      0.48s              0
φ1 = Avoid Lava                   1797                   1.5s               -22
φ2 = Recharge                     1628                   1.2s               5
φ3 = Don't recharge while wet     750                    1.6s               -10
φ4 = φ1 ∧ φ2                      523                    1.9s               4
φ5 = φ1 ∧ φ3                      1913                   1.5s               -2
φ6 = φ2 ∧ φ3                      1842                   2s                 15
φ⋆ = φ1 ∧ φ2 ∧ φ3                 577                    1.6s               27
(Policy size and build time: smaller is better. Relative log likelihood: bigger is better.)
Informal Problem Statement
Solution Ingredients
Compare Likelihoods.
(This work)
Search for likely specifications.
(Future work)
Communicating via Demonstrations (Future Work)
Maximum Causal Entropy Specification Inference
from Demonstrations
Marcell J. Vazquez-Chanlatte & Sanjit A. Seshia
University of California, Berkeley
Slides @
mjvc.me/CAV2020
Motivating Questions
Consider an agent acting in the following stochastic grid world.
Q: Did the agent intend to touch the red tile?
A: Probably Not.
Q: Can we automatically infer agent intent?
A: Stay tuned!
Q: How should we represent intent?
A: Formal Specifications?
Q: What is the agent likely to do in the future?
Q: Can we algorithmically forecast the agent's behavior?