A specification(trace
property), $\varphi$, is a set of
traces.
We say $\xi$ satisfies $\varphi$,
written $\xi \models \varphi$, if $\xi \in
\varphi$.
Communicating
Intent and Formal Specifications
Specification: What qualifies as acceptable behavior?
Derived from Formal Logic, Automata, Rewards ($\epsilon$-"optimal").
No a-priori ordering between acceptable behaviors.
Key question: How to compose specifications.
$\circ : $ Specification $\times$ Specification $\to$ Specification
Communicating
Intent and Formal Specifications
Key question: How to compose specifications.
$\circ : $ Specification $\times$ Specification $\to$ Specification
Build up complex specifications from simpler ones.
Can always compose as concrete the sets:
Preferable to compose at abstract level: $(\neg a \wedge b)$
Question: How can one compose rewards?
Utility Maximizing Agents
Assume agent is acting in a Markov
Decision process and optimizing the sum
of an unknown state reward, $r(s)$, i.e,:
\[
\begin{split}
&\pi : S \times A \to [0, 1]\\
\max_{\pi} &\Big(\mathbb{E}_{s_{1:T}}\big(\sum_{i=1}^T r(s_i)~|~\pi\big)\Big)
\end{split}
\]
Agent model reinforcement learning (and optimal control)
(Witten 1977, Sutton and Barto 1981)(Bellman 1957).
Implicit specification: Pick a trace with (approximately) optimal reward.
Inverse Reinforcement Learning
Assume agent is acting in a Markov
Decision process and optimizing the sum
of an unknown state reward, $r(s)$, i.e,:
\[
\max_{\pi} \Big(\mathbb{E}_{s_{1:T}}\big(\sum_{i=1}^T r(s_i)~|~\pi\big)\Big)
\]
Given a series of demonstrations, what reward, $r(s)$, best explains
the behavior? (Abbeel and Ng 2004)
Will revisit solution techniques.
Problems with rewards
How to safely compose in a dynamics invariant way?
Problems with rewards
Dynamic States != Reward States
Beware the curse of history (Pineau et al 2003).
Adding history can result in exponential state space explosion.
Running Example is Compositional and non-Markovian
A demonstration of a task
$\varphi$ is an unlabeled example where
the agent tries to
satisfy $\varphi$.
Agency is key. Need a notion of action.
Success probabilities induce an ordering.
Informal Problem Statement
Assume an agent is operating in a Markov Decision Process
while trying to satisfy some unknown specification.
Given a sequence of demonstrations,
$\text{demos}$, and a collection of
specifications, $\Phi$, find the specification
$\varphi \in \Phi$ that best explains the agent's behavior.
Idea: Apply analog to Inverse Reinforcement Learning.
Inverse Reinforcement Learning
Assume agent is acting in a Markov
Decision process and optimizing the sum
of an unknown state reward, $r(s)$, i.e,:
\[
\max_{\pi} \Big(\mathbb{E}_{s_{1:T}}\big(\sum_{i=1}^T r(s_i)~|~\pi\big)\Big)
\]
Given a series of demonstrations, what reward, $r(s)$, best explains
the behavior? (Abbeel and Ng 2004)
Note: There is no unique solution as posed!
\[ \Pr(\xi~|~\text{demos}, r) = ? \]
Q: What is the likelihood of generating $\text{demos}$ given $\varphi = \text{demos}$?
Mitigations:
Constrain concept class, $\Phi \subsetneq S^\omega$. Examples: simple temporal logic formula, number of states
in the automata, circuit complexity, etc.
Carefully construct priors. Examples: exponential in description length, entropy, or circuit complexity.
Avoid Maximum a Posteriori Estimation by
marginalizing over the whole distribution. (reduction to $L^*$).
Preliminary Human Studies
Structure of the talk
Prelude
Act 1 - Learning Specifications
Formal Specification Propaganda
Maximum Entropy Specification Learning
(NeurIPS 2018, current work)
Human Understandable $\neq$ Machine Understandable
Co-Op IRL / Showing vs Doing:
Teacher provides demonstrations that disambiguate objective (Hadfield-Menell et al 2016) (Ho et al 2016).
Idea: Can these differences be explained by optimizing
for our MaxEnt Specification Learner?
Teaching Problem Statement
Assume the human performs Maximum Likelihood Estimation
over a concept class $\Phi$ using
a likelihood compatible with:
\[
\Pr(\text{demos}~|~\varphi) \propto 1[p_\varphi \geq q_\varphi]\exp{\Big(n\cdot D_{KL}(P_\varphi || Q_\varphi)\Big )}.
\]
For a fixed $n \in \mathbb{N}$, find $n$
demonstrations that maximally disambiguate
$\varphi^*$ from all other specifications in
$\Phi$:
\[
\max_{\text{demos}}\min_{\substack{\varphi \in \Phi\\\varphi \neq \varphi^*}}\Big(\Pr(\text{demos}~|~\varphi^*) - \Pr(\text{demos}~|~\varphi)\Big).
\]
Select Demonstrations and World to Maximally Disambiguate
Maximum a posteriori estimation throws away all of the distribution information.
MAP considered Harmful
Maximum a posteriori estimation throws away all of the distribution information.
Without proper priors this can lead to overfitting.
Note: Marginalizing results in a "soft-decision":
\[\Pr(\xi \in \varphi_*~|~\text{demos}) \propto \int_{\substack{\varphi \in \Phi\\\varphi\ni \xi}} \Pr(\xi~|~\text{demos}, \varphi)\Pr(\varphi).\]
Set of $\kappa$-"probable" traces is a trace property.
\[\varphi_\kappa \triangleq \Big\{\xi \in S^* ~.~\Pr(\xi \in \varphi_*~|~\text{demos}) > \kappa \Pr(\xi \notin \varphi_*~|~\text{demos})\Big\}\]
where $\kappa \geq 1$.
Angluin Style Learning ($L^*$)
Learning Regular Languages (as DFAs) with minimally adequate teacher is
polytime in the number of states.
(Angluin 1987)
Minimally Adequate Teacher
Membership$(\xi):$ Is $\xi$ a member of $\varphi_*$? (Y / N)
Equivalence$(\varphi)$: Is $\varphi$ equivalent to $\varphi^*$? (Y / Counter Example)
Idea: Use $\kappa$ probable decision as membership oracle.
Benefits: Avoid overfitting + Learn outside of concept class.
Angluin Style Learning ($L^*$)
Learning Regular Languages (as DFAs) with minimally adequate teacher is
polytime in the number of states.
(Angluin 1987)