Maximum Causal Entropy Specification Inference
from Demonstrations
Marcell J. Vazquez-Chanlatte & Sanjit A. Seshia
University of California, Berkeley
Slides @
mjvc.me/CAV2020
Collaboration through Demonstrations
Demonstrations are often a natural way to relay intent.
However, it's often unclear how to leverage this information.
Structure of the talk
Prelude - Problem Setup
Act 1 - Naïve Problem Formulation
Act 2 - Efficient Encoding using Binary Decision Diagrams
Finale - Experiment
Basic definitions
- Assume some fixed sets of states and actions.
- A trace, $\xi$, is a sequence of states and actions.
- Assume all traces have the same length, \(\tau \in \mathbb{N}\).
- A (Boolean) specification, $\varphi$, is a set of traces.
- We say $\xi$ satisfies $\varphi$, written $\xi \models \varphi$, if $\xi \in \varphi$ (see the sketch below).
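To make these definitions concrete, here is a minimal Python sketch; the type aliases and the `satisfies` helper are illustrative, not from the paper.

```python
from typing import Hashable, Set, Tuple

State = Hashable
Action = Hashable
# A trace xi is a fixed-length sequence of (state, action) pairs.
Trace = Tuple[Tuple[State, Action], ...]
# A (Boolean) specification phi is simply a set of traces.
Specification = Set[Trace]

def satisfies(xi: Trace, phi: Specification) -> bool:
    """xi |= phi  iff  xi is an element of phi."""
    return xi in phi

# Tiny example with horizon tau = 2.
xi = (("s0", "right"), ("s1", "up"))
phi = {xi}                      # the specification accepting exactly this trace
assert satisfies(xi, phi)
```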
Relevant Facts about Task Specifications
- Derived from Formal Logic, Automata, Rewards ($\epsilon$-"optimal").
- No a-priori ordering between acceptable behaviors.
Actions induce ordering
- A demonstration of a task $\varphi$ is an unlabeled example where the agent tries to satisfy $\varphi$.
- Agency is key: we need a notion of action.
- Success probabilities induce an ordering (estimated in the sketch below).
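As an illustration of how success probabilities order specifications, here is a hedged Python sketch that estimates a policy's satisfaction probability by Monte Carlo; `policy`, `step`, and `accepts` are hypothetical callables, not part of the paper's API.

```python
def satisfaction_probability(policy, step, accepts, s0, tau, n_samples=10_000):
    """Monte Carlo estimate of Pr(xi |= phi) when acting under `policy`.

    policy(s)      -> action       (possibly randomized)
    step(s, a)     -> next state   (samples the stochastic environment)
    accepts(trace) -> bool         (membership test for the specification phi)
    """
    hits = 0
    for _ in range(n_samples):
        s, trace = s0, []
        for _ in range(tau):
            a = policy(s)
            s = step(s, a)
            trace.append((s, a))
        hits += accepts(tuple(trace))
    return hits / n_samples
```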
Informal Problem Statement
- Assume an agent is operating in a Markov Decision Process while trying to satisfy some unknown specification.
- Given a sequence of demonstrations and a collection of specifications, find the specification that best explains the agent's behavior.
Solution Ingredients
- Compare likelihoods. (This work)
- Search for likely specifications. (Future work)
Structure of the talk
Prelude - Problem Setup
Act 1 - Naïve Problem Formulation
Act 2 - Efficient Encoding using Binary Decision Diagrams
Finale - Experiment
Inverse Reinforcement Learning
Assume the agent is acting in a Markov Decision Process and optimizing the sum of an unknown state reward, $r(s)$, i.e.,
\[
\max_{\pi} \Big(\mathbb{E}_{s_{1:\tau}}\big(\sum_{i=1}^\tau r(s_i)~|~\pi\big)\Big)
\]
where \[\pi(a~|~s) = \Pr(a~|~s)\]
Given a series of demonstrations, what reward, $r(s)$, best explains
the behavior? (Abbeel and Ng 2004)
Inverse Reinforcement Learning
Given a series of demonstrations, what reward, $r(s)$, best explains
the behavior? (Abbeel and Ng 2004)
- Problem: There is no unique solution as posed!
\[ \Pr(r~|~\xi) = ? \]
- Idea: Disambiguate via the Principle of Maximum Causal Entropy. (Ziebart, et al. 2010)
Principle of Maximum Causal Entropy Intuition
\[\Pr(A_{1:\tau}~||~S_{1:\tau}) \triangleq \prod_{t=1}^\tau \Pr(A_t~|~A_{1:t-1}, S_{1:t})\]
Causal conditioning: current actions shouldn't depend on information from the future.
Principle of Maximum Causal Entropy Intuition
\[\Pr(A_{1:\tau}~||~S_{1:\tau}) =~?\]
Key Idea: Don't commit to any particular prediction more than the data forces you to.
Informally: Minimize surprise of actions, subject to feature matching.
Formally: Maximize the expected causal entropy
\[
H(A_{1:\tau}~||~S_{1:\tau}) \triangleq \mathbb{E}\left[\log\left(\frac{1}{\Pr(A_{1:\tau}~||~S_{1:\tau})}\right)\right]
\]
subject to feature matching (in this work: while matching satisfaction probabilities).
Principle of Maximum Causal Entropy Intuition
Maximize
\[
H(A_{1:\tau}~||~S_{1:\tau}) \triangleq \mathbb{E}\left[\log\left(\frac{1}{\Pr(A_{1:\tau}~||~S_{1:\tau})}\right)\right]
\]
while matching satisfaction probabilities.
(Figure: forecasts of a minimum entropy forecaster vs. a maximum entropy forecaster.)
Key Take Away: Maximum Causal Entropy → Robust Forecaster
Linear Rewards
Consider the case where the reward is a linear combination of state features.
\[
r(s) \triangleq \underbrace{\vec{\theta}}_{\text{weights}}\cdot \underbrace{\vec{f}(s)}_{\text{features}}\]
Suppose we observe: \(\mathbb{E}[\vec{f}]\)
Maximum Causal Entropy Policy
\begin{equation}
\log\big(\pi_{\mathbf{\theta}}(a_t~|~s_t)\big) = Q_{\mathbf{\theta}}(a_t, s_t) - V_{\mathbf{\theta}}(s_t)
\end{equation}
where
\begin{equation}\label{eq:soft_bellman_backup}
\begin{split}
&V_{\mathbf{\theta}}(s_t) \triangleq \ln \sum_{a_t} e^{Q_{\mathbf{\theta}}(a_t, s_t)}\\
&Q_{\mathbf{\theta}}(a_t, s_t) \triangleq \mathbb{E}_{s_{t+1}}\left[ V_{\mathbf{\theta}}(s_{t+1})~|~s_t, a_t\right] + \vec{\theta} \cdot \vec{f}(s_t)
\end{split}
\end{equation}
Fit \(\theta\) to match \(\mathbb{E}[\vec{f}]\); the resulting policy is the one with maximum causal entropy (Ziebart, et al. 2010). A sketch of the backup follows.
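A minimal NumPy sketch of the soft Bellman backup above, assuming a tabular MDP; the array layout and function names are my own, not the paper's implementation.

```python
import numpy as np

def smax(x, axis=0):
    """Soft maximum (log-sum-exp), numerically stabilized."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def max_causal_ent_policy(P, features, theta, horizon):
    """Finite-horizon soft Bellman backup for a linear reward r(s) = theta . f(s).

    P[a, s, s'] : transition probabilities.
    features[s] : feature vector of state s.
    Returns a list of log pi_t(a | s) arrays, one per time step.
    """
    n_states = P.shape[1]
    r = features @ theta                # state rewards
    V = np.zeros(n_states)              # V at the horizon is 0
    log_pis = []
    for _ in range(horizon):
        Q = P @ V + r                   # Q[a, s] = E[V(s') | s, a] + r(s)
        V = smax(Q, axis=0)             # soft (log-sum-exp) backup over actions
        log_pis.append(Q - V)           # log pi(a | s) = Q(a, s) - V(s)
    return log_pis[::-1]                # log_pis[t] corresponds to time t
```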
Idea: Reduce Specification
Inference to Maximum Entropy IRL.
Q: What should the reward be?
Proposal: Use an indicator reward:
\[
r(\xi) \triangleq
\begin{cases}
1 & \text{if } \xi \in \varphi\\
0 & \text{otherwise}
\end{cases}
\]
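With the indicator reward, the statistic to match is simply the fraction of demonstrations that satisfy \(\varphi\); a small sketch (the helper name is mine, not the paper's):

```python
def empirical_satisfaction_rate(demonstrations, phi):
    """Fraction of demonstrated traces that satisfy the specification phi.

    Under the indicator reward, this plays the role that feature
    expectations play in standard maximum causal entropy IRL.
    """
    return sum(xi in phi for xi in demonstrations) / len(demonstrations)
```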
Note: States are now traces. Let's define the transition dynamics exactly.
Suppose \(\varphi\) is over traces of length 2.
States now correspond to paths in the unrolled tree.
Idea: Reduce Specification
Inference to Maximum Entropy IRL.
Maximum Causal Entropy Policy
\begin{equation}
\log\big(\pi_{\mathbf{\theta}}(a_{1:t}~|~s_{1:t})\big) = Q_{\mathbf{\theta}}(a_{1:t}, s_{1:t}) - V_{\mathbf{\theta}}(s_{1:t})
\end{equation}
where
\[
V_{\mathbf{\theta}}(s_{1:t}) \triangleq \ln \sum_{a_{1:t}} e^{Q_{\mathbf{\theta}}(a_{1:t}, s_{1:t})}
\]
\[
Q_{\mathbf{\theta}}(a_{1:t}, s_{1:t}) \triangleq \mathbb{E}_{s_{1:t+1}}\left[ V_{\mathbf{\theta}}(s_{1:t+1})~|~s_{1:t}, a_{1:t}\right] + \vec{\theta} \cdot \vec{f}(s_{1:t})
\]
Idea: Reduce Specification
Inference to Maximum Entropy IRL.
Maximum Causal Entropy Policy
\[
V_{\mathbf{\theta}}(s_{1:t}) \triangleq \text{smax}_{a_{1:t}}Q_\theta(a_{1:t}, s_{1:t})
\]
\[
Q_{\mathbf{\theta}}(a_{1:t}, s_{1:t}) \triangleq \mathbb{E}_{s_{1:t+1}}\left[ V_{\mathbf{\theta}}(s_{1:t+1})~|~s_{1:t}, a_{1:t}\right] + \vec{\theta} \cdot \vec{f}(s_{1:t})
\]
Idea: Reduce Specification
Inference to Maximum Entropy IRL.
Note: The satisfaction probability grows monotonically in \(\theta\).
So we can binary search for the \(\theta\) whose satisfaction probability matches the data (sketched below).
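A sketch of that binary search, assuming access to a monotone routine `sat_prob(theta)` returning the satisfaction probability of the maximum causal entropy policy for weight `theta`; the routine and the search bounds are hypothetical.

```python
def fit_theta(sat_prob, empirical_rate, lo=0.0, hi=100.0, tol=1e-6):
    """Find theta such that sat_prob(theta) matches the empirical rate.

    Relies on sat_prob being monotonically increasing in theta;
    lo/hi are illustrative bracketing bounds.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sat_prob(mid) < empirical_rate:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```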
Idea: Reduce Specification
Inference to Maximum Entropy IRL.
Problem: Unrolled tree grows exponentially in horizon!
Idea: Reduce Specification
Inference to Maximum Entropy IRL.
Observation 1: A lot of shared structure in computation graph.
Observation 2: System and environment actions are ordered.
Idea: Reduce Specification
Inference to Maximum Entropy IRL.
Idea: Encode to binary predicate
\[
\psi: \{0, 1\}^n \to \{0,1\}
\]
and represent as Reduced Ordered Binary Decision Diagram
(Bryant 1986).
Structure of the talk
Prelude - Problem Setup
Act 1 - Naïve Problem Formulation
Act 2 - Efficient Encoding using Binary Decision Diagrams
Finale - Experiment
Summary
For time, we focus on describing the BDD encoding.
Random Bit Model
Idea: Model the Markov Decision Process as a deterministic transition system with access to $n_c$ coin flips.
Note: The principle of maximum causal entropy, together with the finite horizon, makes the approach robust to small dynamics mismatches.
Random Bit Model
Next: Assume \(\#(\text{Actions}) = 2^{n_a} \)
\[
\text{Dynamics} : S \times {\{0, 1\}}^{n_a + n_c} \to S
\]
Random Bit Model
Unrolling \(\tau\) steps and composing with the specification results in a predicate (sketched below):
\[
\psi : {\{0, 1\}}^{\tau\cdot (n_a + n_c)} \to \{0, 1\}
\]
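A hedged Python sketch of this unrolling; `step` and `accepts` are hypothetical stand-ins for the deterministic random-bit-model dynamics and the specification.

```python
def unroll(step, accepts, s0, tau, n_a, n_c):
    """Compose tau steps of the random-bit-model dynamics with a specification.

    step(s, action_bits, coin_bits) -> next state   (deterministic)
    accepts(states) -> bool                          (the specification)
    Returns psi : {0,1}^(tau * (n_a + n_c)) -> {0,1} as a Python function.
    """
    def psi(bits):
        assert len(bits) == tau * (n_a + n_c)
        s, states = s0, []
        for t in range(tau):
            chunk = bits[t * (n_a + n_c):(t + 1) * (n_a + n_c)]
            action_bits, coin_bits = chunk[:n_a], chunk[n_a:]
            s = step(s, action_bits, coin_bits)
            states.append(s)
        return accepts(states)
    return psi
```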
Random Bit Model
\[
\psi : {\{0, 1\}}^{\tau\cdot (n_a + n_c)} \to \{0, 1\}
\]
Proposal: Represent \(\psi\) as Binary Decision Diagram with bits in causal order.
Maximum Causal Entropy and BDDs
Q: Can the maximum causal entropy policy be computed on causally ordered BDDs?
A: Yes! Due to the following identities (checked numerically below):
- Associativity of \(\text{smax}\) and \(\mathbb{E}\):
\[
\text{smax}(\alpha_1, \ldots, \alpha_4) = \ln\Big(\sum_{i=1}^4 e^{\alpha_i}\Big) = \ln\big(e^{\ln(e^{\alpha_1}+ e^{\alpha_2})} + e^{\ln(e^{\alpha_3}+ e^{\alpha_4})}\big) = \text{smax}\big(\text{smax}(\alpha_1, \alpha_2), \text{smax}(\alpha_3, \alpha_4)\big)
\]
- \(\text{smax}(\alpha, \alpha) = \alpha + \ln(2)\)
- \(\mathbb{E}(\alpha, \alpha) = \alpha\)
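A quick numerical sanity check of these identities (the values are arbitrary):

```python
import math

def smax(*xs):
    """Soft maximum: smax(x1, ..., xn) = ln(sum_i exp(x_i))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

a1, a2, a3, a4 = 0.3, -1.2, 2.0, 0.7
# Associativity: a 4-way smax equals nested pairwise smax (enables BDD evaluation).
assert math.isclose(smax(a1, a2, a3, a4), smax(smax(a1, a2), smax(a3, a4)))
# Merged/duplicate children in the BDD:
assert math.isclose(smax(a1, a1), a1 + math.log(2))
assert math.isclose(0.5 * a1 + 0.5 * a1, a1)   # E over two identical children
```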
Size Bounds
Q: How big can these causal BDDs be?
\[
|BDD| \leq \overbrace{\underbrace{\tau}_{\text{horizon}} \cdot \big( \log(|A|) + \text{\#coins} \big)}^{\text{\# inputs}} \cdot \big( \underbrace{|S \times S_\varphi \times A|}_{\text{composed automaton}} \cdot 2^{\text{\#coins}} \big)
\]
See the paper for the proof.
Linear in horizon!
Note: Using function composition, the BDD can be built in polynomial time.
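For intuition, a hypothetical instantiation of the bound (the numbers below are illustrative, not from the paper):
\[
\tau = 10,\quad |A| = 4,\quad \text{\#coins} = 1,\quad |S \times S_\varphi \times A| = 96
\;\Longrightarrow\;
|BDD| \leq 10 \cdot (2 + 1) \cdot (96 \cdot 2) = 5760.
\]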
Summary
Structure of the talk
Prelude - Problem Setup
Act 1 - Naïve Reduction to Maximum Causal Entropy IRL
Act 2 - The Random Bit Model and BDD Based Encoding
Finale - Experiment
Toy Experiment
Dynamics
- The agent can attempt to move {↑, ↓, ←, →}.
- With probability \(\frac{1}{32}\), the agent slips and moves ← instead (dynamics sketched below).
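A minimal Python sketch of these slip dynamics; the grid size and move encoding are hypothetical, only the slip rule comes from the slide.

```python
import random

GRID = 8  # hypothetical grid width/height, for illustration only
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(pos, action, slip_prob=1 / 32):
    """Gridworld dynamics: with probability 1/32 the agent slips and moves left."""
    dx, dy = MOVES["left"] if random.random() < slip_prob else MOVES[action]
    x, y = pos
    # Clamp to the grid boundaries.
    return (min(max(x + dx, 0), GRID - 1), min(max(y + dy, 0), GRID - 1))
```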
Provided 6 unlabeled demonstrations for the task:
- Go to and stay at the yellow tile (recharge).
- Avoid red tiles (lava).
- If you enter a blue tile (get wet), touch a brown tile (dry off) before recharging.
- All within 10 time steps.
Note: The dashed demonstration fails to dry off due to slipping.
Toy Experiments
Spec | Policy Size (#nodes) | ROBDD build time | Relative Log Likelihood (vs. true)
true | 1 | 0.48s | 0
φ1 = Avoid Lava | 1797 | 1.5s | -22
φ2 = Recharge | 1628 | 1.2s | 5
φ3 = Don't recharge while wet | 750 | 1.6s | -10
φ4 = φ1 ∧ φ2 | 523 | 1.9s | 4
φ5 = φ1 ∧ φ3 | 1913 | 1.5s | -2
φ6 = φ2 ∧ φ3 | 1842 | 2s | 15
φ⋆ = φ1 ∧ φ2 ∧ φ3 | 577 | 1.6s | 27
(Policy size and build time: smaller is better. Relative log likelihood: bigger is better.)
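Reading off the last column, selecting the most likely specification is a simple argmax; the dictionary below just transcribes the table's relative log likelihoods.

```python
# Relative log likelihoods (vs. the true demonstrator), copied from the table above.
rel_log_likelihood = {
    "true": 0, "phi1": -22, "phi2": 5, "phi3": -10,
    "phi4": 4, "phi5": -2, "phi6": 15, "phi_star": 27,
}
best = max(rel_log_likelihood, key=rel_log_likelihood.get)
assert best == "phi_star"   # phi1 ∧ phi2 ∧ phi3 best explains the demonstrations
```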
Informal Problem Statement
Solution Ingredients
- Compare likelihoods. (This work)
- Search for likely specifications. (Future work)
Communicating via Demonstrations (Future Work)
Maximum Causal Entropy Specification Inference
from Demonstrations
Marcell J. Vazquez-Chanlatte & Sanjit A. Seshia
University of California, Berkeley
Slides @
mjvc.me/CAV2020
Motivating Questions
Consider an agent acting in the following stochastic grid world.
Q: Did the agent intend to touch the red tile?
A: Probably not.
Q: Can we automatically infer agent intent?
A: Stay tuned!
Q: How should we represent intent?
A: Formal specifications?
Q: What is the agent likely to do in the future?
Q: Can we algorithmically forecast the agent's behavior?