At this point you might be wondering: wait, if we have demonstrations of how to behave in an environment, can’t we just fit a model capable of generalizing (e.g. an ANN) to them? That is what Imitation Learning tries to achieve; sadly, it doesn’t work. In this post we explain why.
IDEA: Record a lot of “expert” demonstrations and apply classic supervised learning to obtain a model that maps observations to actions (a policy).
Behavioral cloning is a type of Imitation Learning (a general term for “learning from expert demonstrations”).
Behavioral cloning is just a fancy way to say “supervised learning”.
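To make the idea concrete, here is a minimal sketch of behavioral cloning as plain supervised learning. The toy dataset, the hidden expert rule and the linear least-squares model are illustrative assumptions, not part of the original lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Expert demonstrations": observations and the actions the expert took on them.
obs = rng.normal(size=(1000, 4))                          # 1000 observations, 4 features each
expert_actions = obs @ np.array([0.5, -1.0, 0.2, 0.0])    # hidden (toy) expert rule

# Supervised learning step: fit a policy by least squares, action ≈ obs · w
w, *_ = np.linalg.lstsq(obs, expert_actions, rcond=None)

def policy(observation):
    # The learned policy: maps a single observation to an action.
    return observation @ w
```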
Wrong actions change the data distribution: a small mistake makes the subsequent observation distribution differ from the training data. This makes the policy more prone to error, since it has not been trained on this new distribution (the expert did not make mistakes). This snowball effect keeps increasing the error between trajectories over time:
Idea: Collect training data from the policy’s distribution instead of the human’s distribution, using the DAgger (Dataset Aggregation) algorithm:
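The sketch below shows the core of the DAgger loop on a toy 1-D environment; the environment, the hand-coded expert and the nearest-neighbour “learner” are stand-ins chosen for brevity, not the setup used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(s):
    # Toy "expert": always steers the 1-D state back towards 0.
    return -np.sign(s)

def rollout(policy, T=20):
    # Run the current policy in a toy environment, recording the states it visits.
    s, states = rng.normal(), []
    for _ in range(T):
        states.append(s)
        s = s + policy(s) + 0.1 * rng.normal()   # noisy dynamics
    return np.array(states)

def fit_policy(states, actions):
    # "Supervised learning" step, here a 1-nearest-neighbour lookup for simplicity.
    def policy(s):
        return actions[np.argmin(np.abs(states - s))]
    return policy

# 1) Train an initial policy on expert demonstrations.
dataset_s = rng.normal(size=100)
dataset_a = expert_action(dataset_s)
policy = fit_policy(dataset_s, dataset_a)

# 2) DAgger iterations: run the *policy* to collect states, let the *expert* label them,
#    aggregate the data and retrain.
for _ in range(5):
    visited = rollout(policy)                               # states from the policy's distribution
    dataset_s = np.concatenate([dataset_s, visited])
    dataset_a = np.concatenate([dataset_a, expert_action(visited)])  # expert labels
    policy = fit_policy(dataset_s, dataset_a)
```

The key step is labeling the states the policy itself visits: the training data distribution is pulled towards the policy’s own state distribution.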
Problem: While it addresses the distributional shift problem, it is an unnatural way for humans to provide labels (we expect temporal coherence) \(\Rightarrow\) bad labels.
Most decisions humans take are non-Markovian: if we see the same thing twice we won’t act exactly the same way (given only the last time-step). What happened in previous time-steps affects our current actions, which makes training much harder.
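One common remedy (our addition here, not something stated above) is to condition the policy on a short window of past observations rather than only the last one, so the supervised learner sees at least some history. A minimal frame-stacking sketch:

```python
import numpy as np

def stack_history(observations, k=4):
    """Turn a sequence of observations of shape (T, d) into inputs of shape (T, k*d),
    where each row contains the current and the k-1 previous observations."""
    T, d = observations.shape
    padded = np.vstack([np.zeros((k - 1, d)), observations])   # zero-pad the first steps
    return np.stack([padded[t:t + k].ravel() for t in range(T)])

# The stacked inputs can be fed to the same supervised learner as before.
obs = np.random.default_rng(0).normal(size=(100, 3))
inputs = stack_history(obs, k=4)   # shape (100, 12)
```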
In a continuous action space, if the parametric distribution chosen for our policy is not multimodal (e.g. a single Gaussian), Maximum Likelihood Estimation (MLE) of the actions becomes a problem: the fit averages over the expert’s modes and can place most probability on actions the expert never takes.
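A tiny numerical illustration of the issue (the numbers are made up): suppose the expert avoids an obstacle by steering either fully left (-1) or fully right (+1) with equal probability. The MLE fit of a single Gaussian averages the two modes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Expert actions: half the time steer left (-1), half the time steer right (+1).
expert_actions = rng.choice([-1.0, 1.0], size=1000) + 0.05 * rng.normal(size=1000)

# MLE of a single Gaussian N(mu, sigma^2) is just the sample mean and std.
mu, sigma = expert_actions.mean(), expert_actions.std()
print(mu)   # ~0.0 -> the fitted policy's most likely action ("go straight")
            #        is one the expert never takes.
```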
Defining the cost function: \(c(s, a) = \delta_{a \neq \pi^*(s)}\) (1 when the action is different from the expert’s, 0 otherwise).
And assuming that the probability of making a mistake on a state sampled from the training distribution is bounded by \(\epsilon\): \(\pi_\theta (a \neq \pi^* (s) \mid s) \leq \epsilon \quad \forall s \sim p_{train}(s)\)
\begin{equation} E \left[ \sum_t c(s_t, a_t) \right] = O(\epsilon T) \end{equation}
This would be the case if the DAgger algorithm is correctly applied, so that the training data distribution converges to the trained policy’s distribution: each of the \(T\) steps then contributes at most \(\epsilon\) expected cost.
Without DAgger, however (i.e. training only on expert demonstrations), we have that: \(p_\theta (s_t) = (1-\epsilon)^t p_{train} (s_t) + (1 - (1 - \epsilon)^t) p_{mistake} (s_t)\), where \((1-\epsilon)^t\) is the probability of not having made any mistake in the first \(t\) steps.
Where \(p_{mistake} (s_t)\) is the state distribution reached after at least one mistake, different from \(p_{train} (s_t)\). In the worst case, their total variation divergence is maximal: \(\sum_{s_t} \mid p_{mistake} (s_t) - p_{train} (s_t) \mid = 2\)
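Filling in the intermediate steps (a sketch of the standard argument, using \(c_{max} = 1\) and the inequality \((1-\epsilon)^t \geq 1 - \epsilon t\)):

\begin{equation} \begin{split} \sum_t E_{p_\theta (s_t)} [c_t] & \leq \sum_t \left( E_{p_{train} (s_t)} [c_t] + \mid p_\theta (s_t) - p_{train} (s_t) \mid c_{max} \right) \\ & \leq \sum_t \left( \epsilon + 2 \left( 1 - (1-\epsilon)^t \right) \right) \leq \sum_t \left( \epsilon + 2 \epsilon t \right) \end{split} \end{equation}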
Therefore:
\begin{equation} \sum_t E_{p_\theta (s_t)} [c_t] = O(\epsilon T^2) \end{equation}
The error expectation grows quadratically over time! More details in: A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning