Idea: If we learn \(f(s_t, a_t) = s_{t+1}\) (or \(p(s_{t+1} \mid s_t, a_t)\) in stochastic environments) we can apply the techniques from the last lecture to obtain a policy.
Does it work? Yes, if we design a good base policy, hand-engineer a dynamics representation of the environment (using physics knowledge), and fit just a few parameters.
Distribution mismatch problem: When we design a policy, we might be extrapolating beyond the data distribution used to learn the physics model, which can make the results far off.
Solution: (Inspired by DAgger) Add samples from the estimated optimal trajectory to the training data.
Problem: Blindly planning on top of an imperfect learned environment model might still lead us to take actions that land in states we did not plan to be in. To improve this, we can update the plan after each step:
This algorithm works quite well: Replanning after each step compensates for imperfections in both the model and the planner. We can even plan over shorter horizons or use a very simple planner such as random shooting.
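For concreteness, here is a minimal sketch of this replanning loop (MPC) with random shooting as the planner, assuming we already have a learned deterministic model `f(s, a)`, a known reward function `r(s, a)` and a gym-style environment with the old 4-tuple `step` API; all names and hyper-parameters are illustrative, not from the lecture.

```python
import numpy as np

def random_shooting_plan(f, r, s0, horizon=15, n_candidates=1000, action_dim=2):
    """Return the first action of the best random action sequence under the learned model f."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))  # random candidate plan
        s, candidate_return = s0, 0.0
        for a in actions:
            candidate_return += r(s, a)
            s = f(s, a)  # roll the candidate plan forward through the learned dynamics
        if candidate_return > best_return:
            best_return, best_first_action = candidate_return, actions[0]
    return best_first_action

def mpc_episode(env, f, r, episode_len=200):
    """Replan after every step (MPC): execute only the first action of each plan."""
    s = env.reset()
    new_transitions = []
    for _ in range(episode_len):
        a = random_shooting_plan(f, r, s)
        s_next, _, done, _ = env.step(a)        # gym-style step (old 4-tuple API)
        new_transitions.append((s, a, s_next))  # data to refit the model with later
        s = s_next
        if done:
            break
    return new_transitions
```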
Problems:
Idea: What if we exploit models that are aware of their own uncertainty (similar to Gaussian Processes)?
There are two types of uncertainty:
We cannot just use the entropy of the ANN output, since it is not a good measure of uncertainty: the model could be overfitted and be very confident about something it is wrong about.
When training models, we usually only get a point estimate of the parameters, such as the Maximum Likelihood Estimate (MLE) \(\arg \max_\theta \log p(\mathcal{D} \mid \theta)\) (which, if the prior over \(\theta\) is uniform, coincides with the MAP estimate \(\arg \max_\theta \log p(\theta \mid \mathcal{D})\)).
Nevertheless, the full posterior \(p(\theta \mid \mathcal{D})\) would tell us the real uncertainty of the model. Moreover, we could make predictions by marginalizing over the parameters:
\begin{equation} p(s_{t+1} \mid s_t, a_t) = \int p(s_{t+1} \mid s_t, a_t, \theta) p(\theta \mid \mathcal{D}) d\theta \end{equation}
We can achieve a rough approximation of \(p(s_{t+1} \mid s_t, a_t)\) by training multiple independent models and averaging them as:
\begin{equation} \label{bootstrap} \int p(s_{t+1} \mid s_t, a_t, \theta) p(\theta \mid \mathcal{D}) d\theta \simeq \frac{1}{N} \sum_i p(s_{t+1} \mid s_t, a_t, \theta_i) \end{equation}
Bootstrap ensembling trains “independent” models on “independent” datasets: each model is trained on a bootstrap sample, obtained by sampling the full dataset uniformly with replacement. At prediction time, we approximate \(p(s_{t+1} \mid s_t, a_t)\) by averaging the model outputs as in Eq. \ref{bootstrap}.
In practice, bootstrap resampling is often not needed: SGD and random initialization are usually sufficient to make the models independent enough.
Problem: Training ANNs is expensive, so the number of fitted models is usually very small (< 10), which makes the approximation very crude.
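A rough sketch of such a bootstrap ensemble, assuming each model exposes a scikit-learn-like `fit`/`predict` interface (an illustrative assumption); the averaging at the end is the Monte Carlo approximation of Eq. \ref{bootstrap}, and `resample=False` corresponds to relying only on random initialization and SGD noise.

```python
import numpy as np

def train_bootstrap_ensemble(make_model, X, Y, n_models=5, resample=True):
    """Train N 'independent' models, optionally each on its own bootstrap sample."""
    models = []
    for _ in range(n_models):
        if resample:
            idx = np.random.randint(len(X), size=len(X))  # uniform sampling with replacement
            X_i, Y_i = X[idx], Y[idx]
        else:
            X_i, Y_i = X, Y  # rely on random init + SGD noise for diversity
        model = make_model()
        model.fit(X_i, Y_i)
        models.append(model)
    return models

def ensemble_predict(models, s, a):
    """Monte Carlo average over models: a crude approximation of the marginal prediction."""
    x = np.concatenate([s, a])[None, :]           # single (state, action) input
    preds = np.stack([m.predict(x)[0] for m in models])
    return preds.mean(axis=0), preds.std(axis=0)  # mean next state and a spread (uncertainty) estimate
```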
Solution: In later lectures we will present better methods such as the use of Bayesian NN (BNN), Mean Field approximations and variational inference.
Before, a single model predicted the next state:
\begin{equation} J(a_1, \dots, a_T) = \sum_t r(s_t, a_t) \quad \text{s.t.} \quad s_{t+1} = f(s_t, a_t) \end{equation}
Now, each model in the ensemble rolls out its own state sequence and we average the returns: \begin{equation} J(a_1, \dots, a_T) = \frac{1}{N} \sum_i \sum_t r(s_{t, i}, a_t) \quad \text{s.t.} \quad s_{t+1, i} = f_i (s_{t, i}, a_t) \end{equation}
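Scoring a candidate action sequence under this ensemble objective is then straightforward; a short sketch with deterministic models \(f_i\) (names are illustrative):

```python
def ensemble_return(models, r, s0, actions):
    """J(a_1,...,a_T): average over the ensemble of each model's own rollout return."""
    total = 0.0
    for f_i in models:          # each model f_i predicts its own state sequence s_{t,i}
        s = s0
        for a in actions:
            total += r(s, a)    # accumulate r(s_{t,i}, a_t)
            s = f_i(s, a)       # s_{t+1,i} = f_i(s_{t,i}, a_t)
    return total / len(models)  # (1/N) * sum_i sum_t r(s_{t,i}, a_t)
```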
In the case of a stochastic environment, the algorithm becomes:
There exist fancier options such as moment matching or better posterior estimation with BNNs.
Problems when estimating \(f(s_t, a_t) = s_{t+1}\) in large observation spaces such as images:
Remember that the state is the underlying structure which fully describes an instance of the environment, while the observation is only what the agent perceives.
Example: in an Atari game, the observation is the screen output, while the state can be summarized in fewer dimensions by the positions of the key elements.
Idea: Can we learn \(p(o_t \mid s_t )\) and \(p(s_{t+1} \mid s_t, a_t)\) separately?
We need to learn models for:
The state structure is not given to us: states are latent variables (helper random variables which are not observed but rather inferred from the observed variables).
Given a dataset of trajectories \(\mathcal{D} = \{(s_t, a_t, s_{t+1})_i\}\), we can train a fully observed model (\(s_t = o_t\)) by maximizing over its parameters:
\begin{equation} \max_{\phi} \frac{1}{N} \sum_i \sum_t \log p_{\phi} (s_{t+1, i} \mid s_{t, i}, a_{t, i}) \end{equation}
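For example, if we model \(p_\phi(s_{t+1} \mid s_t, a_t)\) as a Gaussian with fixed unit variance, maximizing this log-likelihood reduces to minimizing a mean squared error. A minimal PyTorch sketch (the architecture and hyper-parameters are illustrative assumptions, not from the lecture):

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """p_phi(s_{t+1} | s_t, a_t) as a unit-variance Gaussian: the network predicts its mean."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def mle_step(model, optimizer, s, a, s_next):
    """One gradient step: maximizing log p_phi = minimizing the MSE under the fixed variance."""
    loss = ((model(s, a) - s_next) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```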
For latent space models, we do not know what \(s_t\) is, so we need to take an expectation over it given the observed trajectory. Moreover, we also need to learn our observation model:
\begin{equation} \label{eq:latent_learning} \max_{\phi} \frac{1}{N} \sum_i \sum_t E_{(s_t, s_{t+1}) \sim p(s_t, s_{t+1} \mid o_{1:T}, a_{1:T})} \left[ \log p_{\phi} (s_{t+1, i} \mid s_{t, i}, a_{t, i}) + \log p_{\phi} (o_{t, i} \mid s_{t, i}) \right] \end{equation}
We can learn this approximation of the posterior in different ways:
In this lecture we will focus on single-step encoders:
We can compute the expectation in Eq. \ref{eq:latent_learning} by doing:
\(s_t \sim q_\psi (s_t \mid o_t), s_{t+1} \sim q_\psi(s_{t+1} \mid o_{t+1})\).
In the special case where \(q_\psi (s_t \mid o_t)\) is deterministic, we can encode it with an ANN which, given an observation, returns the single most probable state: \(s_t = g_\psi (o_t)\). Since it is deterministic, the expectation in Eq. \ref{eq:latent_learning} can be re-written by substituting the random variable with its value:
\begin{equation} \label{eq:latent_learning_det} \max_{\phi, \psi} \frac{1}{N} \sum_i \sum_t \log p_{\phi} (g_\psi (o_{t+1, i}) \mid g_\psi (o_{t, i}), a_{t, i}) + \log p_{\phi} (o_{t, i} \mid g_\psi (o_{t, i})) \end{equation}
Everything is differentiable so this can be trained using backprop. In addition, we can append our reward model as:
\begin{equation} \label{eq:latent_learning_det_reward} \max_{\phi, \psi} \frac{1}{N} \sum_i \sum_t \log p_{\phi} (g_\psi (o_{t+1, i}) \mid g_\psi (o_{t, i}), a_{t, i}) + \log p_{\phi} (o_{t, i} \mid g_\psi (o_{t, i})) + \log p_\phi (r_{t,i} \mid g_\psi (o_{t,i})) \end{equation}
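A minimal sketch of this objective (keeping the same term order) with a deterministic encoder \(g_\psi\), assuming Gaussian outputs with fixed unit variance so that each log-likelihood reduces to a negative squared error; `encoder`, `dynamics`, `decoder` and `reward_model` are illustrative PyTorch module names, not from the lecture.

```python
def latent_model_loss(encoder, dynamics, decoder, reward_model, o, a, o_next, r):
    """Negative of the objective above, with unit-variance Gaussians (log-likelihoods become -MSE)."""
    s = encoder(o)            # s_t     = g_psi(o_t)
    s_next = encoder(o_next)  # s_{t+1} = g_psi(o_{t+1})
    dynamics_loss = ((dynamics(s, a) - s_next) ** 2).mean()  # latent space dynamics term
    reconstruction_loss = ((decoder(s) - o) ** 2).mean()     # observation reconstruction term
    reward_loss = ((reward_model(s) - r) ** 2).mean()        # reward model term
    return dynamics_loss + reconstruction_loss + reward_loss # differentiable w.r.t. both phi and psi
```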
In this order, we have the latent space dynamics, the observation reconstruction and the reward model. The previous model-based algorithm can be adapted as:
How is this related to Hidden Markov Models (HMM) and Kalman filters (or Linear Quadratic Estimation, LQE)?
Do we really need to learn an embedding to get the underlying states?
Observation models directly learn \(p(o_{t+1} \mid o_t, a_t)\). Usually, combining CNNs with RNNs in a kind of video-prediction model achieves good results.
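As an illustration of such an architecture, a minimal PyTorch sketch of a CNN + RNN observation-space model, assuming \(64 \times 64\) RGB frames; the layer sizes, names and overall layout are illustrative choices, not the specific models referenced in the lecture.

```python
import torch
import torch.nn as nn

class ObservationSpaceModel(nn.Module):
    """Directly models o_{t+1} from past frames and actions: CNN encoder -> LSTM -> deconv decoder."""
    def __init__(self, action_dim, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(                              # 3x64x64 frame -> feature vector
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(), nn.Linear(64 * 16 * 16, hidden),
        )
        self.rnn = nn.LSTM(hidden + action_dim, hidden, batch_first=True)
        self.decoder = nn.Sequential(                              # hidden state -> predicted next frame
            nn.Linear(hidden, 64 * 16 * 16), nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),              # 32 -> 64
        )

    def forward(self, obs_seq, act_seq):
        # obs_seq: (B, T, 3, 64, 64), act_seq: (B, T, action_dim)
        B, T = obs_seq.shape[:2]
        feats = self.encoder(obs_seq.reshape(B * T, *obs_seq.shape[2:])).reshape(B, T, -1)
        h, _ = self.rnn(torch.cat([feats, act_seq], dim=-1))
        preds = self.decoder(h.reshape(B * T, -1)).reshape(B, T, *obs_seq.shape[2:])
        return preds  # predicted o_{t+1} for each t; train e.g. with MSE against the shifted frames
```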