Entropic Control Improvisation
Given a dynamics model and horizon T, find a policy that satisfies:

Hard Constraint:
$$\Pr(\xi \in \psi) = 1$$
Soft Constraint:
$$\Pr(\xi \in \varphi) \geq \mathbf{p}$$ 
Causal Entropy Constraint:
$$H(\mathcal{A}_{1:T} \mid\mid \mathcal{S}_{1:T}) \geq \mathbf{h}$$
 The causal entropy (randomness) constraint ensures the policy is minimally biased.
 There is a natural trade-off between performance $\mathbf{p}$ and randomness $\mathbf{h}$.
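For a Markov policy, the causal entropy reduces to the expected sum of per-step action entropies under the state distribution the policy induces. A minimal sketch on a hypothetical toy MDP (all numbers and names below are illustrative, not from the paper):

```python
import math

# Toy 2-state, 2-action MDP: P[s][a] = {s': prob}  (illustrative only)
P = {
    0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}},
    1: {0: {0: 0.5, 1: 0.5}, 1: {1: 1.0}},
}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}  # stationary policy
T = 3  # horizon

def action_entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def causal_entropy(pi, P, s0, T):
    """H(A_{1:T} || S_{1:T}) for a Markov policy: expected sum of
    per-step action entropies under the induced state distribution."""
    dist = {s0: 1.0}
    H = 0.0
    for _ in range(T):
        H += sum(ps * action_entropy(pi[s]) for s, ps in dist.items())
        nxt = {}
        for s, ps in dist.items():
            for a, pa in pi[s].items():
                for s2, p in P[s][a].items():
                    nxt[s2] = nxt.get(s2, 0.0) + ps * pa * p
        dist = nxt
    return H

print(causal_entropy(pi, P, 0, T))  # at most T * log(2) here
```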
Dynamics Model and Applications
In the settings above, a biased policy is undesirable.
Contributions
 Symbolic approach for representing MDPs and Stochastic Games as Binary Decision Diagrams.
 Improvisation in stochastic games, supporting arbitrary combinations of probabilistic 🎲 and adversarial 👿 uncertainty.
Improv in MDPs (CAV '20)
Key Observation: Can think of soft constraint as binary reward.
$$r_\lambda(\xi) \triangleq \lambda \cdot \mathbf{1}[\xi \in \varphi]$$
By adding history to the state space, the problem reduces to Maximum Causal Entropy Inverse Reinforcement Learning.
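The reduction can be sketched as a soft Bellman backup: states satisfying φ at the horizon receive reward λ, and the backward recursion yields the maximum-causal-entropy policy. A minimal explicit-state sketch with assumed toy dynamics (the paper's contribution is running this recursion symbolically, not this naive enumeration):

```python
import math

def max_causal_entropy_policy(P, goal, lam, T):
    """P[s][a] = {s': prob}; goal = states satisfying the soft spec.
    Returns time-indexed policy pi[t][s][a] maximizing causal entropy
    plus lam * Pr(reach goal at horizon T), via soft value iteration."""
    states = list(P)
    V = {s: (lam if s in goal else 0.0) for s in states}  # terminal V_T
    pi = [None] * T
    for t in reversed(range(T)):
        # soft Bellman backup: Q_t = E[V_{t+1}], V_t = log-sum-exp(Q_t)
        Q = {s: {a: sum(p * V[s2] for s2, p in P[s][a].items())
                 for a in P[s]} for s in states}
        V = {s: math.log(sum(math.exp(q) for q in Q[s].values()))
             for s in states}
        pi[t] = {s: {a: math.exp(Q[s][a] - V[s]) for a in P[s]}
                 for s in states}
    return pi

# Toy 2-state chain: action 1 moves toward the (absorbing) goal state 1.
pi = max_causal_entropy_policy(
    {0: {0: {0: 1.0}, 1: {1: 1.0}}, 1: {0: {1: 1.0}, 1: {1: 1.0}}},
    goal={1}, lam=2.0, T=2)
```

Larger λ biases the policy toward satisfying φ; λ = 0 recovers the uniformly random policy.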
 Problem: Potential combinatorial explosion.
 Solution: Encode MDP as a Binary Decision Diagram.
 Write the composition of the dynamics and property as a circuit with access to biased coins.
 Can represent the MDP with a Binary Decision Diagram; conservative size bound: $$O(\text{horizon}\cdot S/\varphi\cdot \text{Actions}\log(\text{Actions}))$$
 We show that the maximum causal entropy policy can be computed efficiently on the compressed MDP.
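The compression a BDD offers can be seen on a toy Boolean function: a parity check over n bits has 2^n truth-table rows but only 2n BDD nodes. A self-contained sketch of a hash-consed BDD (illustrative only, not the paper's encoding of dynamics and property):

```python
class BDD:
    """Tiny reduced BDD: nodes are (var, low, high); terminals are bools."""
    def __init__(self):
        self.nodes = {}

    def mk(self, var, lo, hi):
        if lo == hi:                          # redundant test: skip node
            return lo
        key = (var, lo, hi)                   # hash-consing keeps DAG reduced
        return self.nodes.setdefault(key, key)

def parity(bdd, n):
    """BDD for 'odd number of ones among x_0..x_{n-1}', built bottom-up.
    f[p] is true iff (p + #ones of remaining bits) is odd."""
    f = {0: False, 1: True}
    for i in reversed(range(n)):
        f = {p: bdd.mk(i, f[p], f[1 - p]) for p in (0, 1)}
    return f[0]

def ev(node, bits):
    """Evaluate a BDD node on an assignment (list of 0/1 per variable)."""
    while node not in (True, False):
        var, lo, hi = node
        node = hi if bits[var] else lo
    return node

b = BDD()
root = parity(b, 8)
print(len(b.nodes))  # 2 * 8 = 16 nodes, versus 2**8 truth-table rows
```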
Application: used to learn temporal logic constraints from unlabeled demonstrations, e.g.,
φ = "Avoid Lava, eventually recharge, and don't recharge while wet."
Improv in Stochastic Games
Motivation: Often want to handle combinations of probabilistic 🎲 and adversarial 👿 uncertainty, e.g., interval MDPs, 2-player MDPs, and model compression.
 Is efficient improvisation synthesis possible?
 Yes! Via recursive entropy matching, made efficient by dynamic programming from the leaves of the BDD.
 Assume a minimum-entropy adversary 👿.
 Solve the resulting MDP to find the optimal entropy $\mathbf{h}$.
 Plan to match $\mathbf{h}$.
 Approximate the Pareto front between $\mathbf{p}$ and $\mathbf{h}$.
 The Pareto front enables local replanning.
 Resulting algorithm is efficient in practice.
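The bottom-up pass can be sketched on a game tree with agent, adversary, and chance nodes: the agent's best achievable causal entropy combines children by log-sum-exp, a minimum-entropy adversary by min, and chance by expectation. An illustrative recursion on a tuple-encoded tree (the paper runs the analogue over BDD nodes, not explicit trees):

```python
import math

# Nodes: ("leaf",), ("A", [children]), ("E", [children]),
# or ("C", [(prob, child), ...]) for chance nodes.
def max_entropy(node):
    """Best achievable causal entropy against a min-entropy adversary."""
    kind = node[0]
    if kind == "leaf":
        return 0.0
    if kind == "A":   # agent: max_p H(p) + sum_i p_i H_i = log-sum-exp
        return math.log(sum(math.exp(max_entropy(c)) for c in node[1]))
    if kind == "E":   # adversary: assumes the entropy-minimizing choice
        return min(max_entropy(c) for c in node[1])
    if kind == "C":   # chance: expectation over outcomes
        return sum(p * max_entropy(c) for p, c in node[1])
    raise ValueError(kind)

leaf = ("leaf",)
tree = ("A", [("E", [("A", [leaf, leaf]), leaf]),
              ("C", [(0.5, ("A", [leaf, leaf])), (0.5, leaf)])])
print(max_entropy(tree))  # log(1 + sqrt(2))
```

Caching these per-node values (here, per BDD node) is what makes the replanning in the bullets above cheap: only the subtrees touched by a change need recomputation.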
Future Work
 Sampling-based algorithms.
 POMDPs.
 Subset queries.
 Dynamic Scenic Constraints.