In the previous lecture: Model-based RL, we were planning trajectories (stochastic open-loop) by maximizing the expected reward over a sequence of actions: \begin{equation} a_1, \dots, a_T = \arg \max_{a_1, \dots, a_T} E \left[ \sum_t r(s_t, a_t) \mid a_1, \dots, a_T \right] \end{equation}
Now we will build policies capable of adapting to the situation (stochastic closed-loop) by maximizing the reward expectation:
\begin{equation} \pi = \arg \max_{\pi} E_{\tau \sim p(\tau)} \left[ \sum_t r(s_t, a_t) \right] \end{equation}
Backprop the \(s_{t+1}\) prediction error into our environment model and the reward \(r_t\) through the model into our policy:
The pseudo-code would then be:
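A minimal sketch of that policy-optimization step, assuming a learned differentiable dynamics model and a known differentiable reward (all names, network sizes and the toy reward below are illustrative assumptions, not the lecture's exact pseudo-code):

```python
import torch
import torch.nn as nn

state_dim, action_dim, horizon = 4, 2, 10

# Learned dynamics model f(s_t, a_t) -> s_{t+1}, assumed already fitted on real data
f_model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                        nn.Linear(64, state_dim))
for p in f_model.parameters():
    p.requires_grad_(False)               # keep the model fixed while optimizing the policy

# Deterministic policy pi(s_t) -> a_t
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                       nn.Linear(64, action_dim))

def reward_fn(s, a):
    # Illustrative differentiable reward (quadratic cost)
    return -(s ** 2).sum() - 0.1 * (a ** 2).sum()

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for _ in range(1000):
    s = torch.zeros(state_dim)            # illustrative fixed initial state
    total_reward = torch.tensor(0.0)
    for t in range(horizon):              # unroll the learned model
        a = policy(s)
        total_reward = total_reward + reward_fn(s, a)
        s = f_model(torch.cat([s, a]))    # next state predicted by the model
    optimizer.zero_grad()
    (-total_reward).backward()            # backprop rewards through the model into the policy
    optimizer.step()
```

In practice, backpropagating through many model steps like this tends to be ill-conditioned (similar to shooting methods), which motivates the solutions below.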
The general idea of the solutions, developed further in the subsequent sections:
Model-based acceleration: use model-free RL algorithms (which do not need to differentiate through the dynamics), with our learned model generating synthetic samples for them. Even though it may seem counter-productive, it works well in practice.
Use simple policies (rather than ANNs). This allows us to use second-order optimization (a.k.a. Newton's method), which mitigates the mentioned problems. Some applications are:
We have two equivalent options to approximate \(\nabla_{\theta} J(\theta)\):
Avoids backprop through time: the likelihood-ratio trick converts the derivative of the expectation into an expectation over the derivatives of the policy's log-probabilities:
\begin{equation} \nabla_{\theta} J(\theta) \simeq \frac{1}{N} \sum_i^N \sum_t \nabla_{\theta} \log \pi_{\theta} (a_{i, t} \mid s_{i, t}) \, \hat Q^\pi (s_{i, t}, a_{i, t}) \end{equation}
\begin{equation} \nabla_{\theta} J(\theta) = \sum_t \frac{dr_t}{ds_t} \prod_{t^{\prime}=2}^{t} \frac{ds_{t^{\prime}}}{da_{t^{\prime} - 1}} \frac{da_{t^{\prime} - 1}}{ds_{t^{\prime} - 1}} \end{equation}
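As a concrete illustration of the first option, here is a minimal likelihood-ratio estimator sketch for a Gaussian policy (the batches of states, actions and \(\hat Q\) values are assumed to come from sampled rollouts; all names are illustrative):

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2
mean_net = nn.Linear(state_dim, action_dim)          # policy mean
log_std = nn.Parameter(torch.zeros(action_dim))      # state-independent log std
optimizer = torch.optim.SGD(list(mean_net.parameters()) + [log_std], lr=1e-3)

def policy_gradient_step(states, actions, q_hats):
    """states: (N, state_dim), actions: (N, action_dim), q_hats: (N,)"""
    dist = torch.distributions.Normal(mean_net(states), log_std.exp())
    log_probs = dist.log_prob(actions).sum(dim=-1)   # log pi(a_{i,t} | s_{i,t})
    surrogate = -(log_probs * q_hats).mean()         # its gradient is the PG estimate
    optimizer.zero_grad()
    surrogate.backward()                             # never differentiates the dynamics
    optimizer.step()
```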
Online Q-learning algorithm performing model-free RL with a model to help compute future expectations.
We use our learned synthetic model to make better estimations of future rewards.
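A rough sketch of this Dyna-style use of the model, assuming we already have a one-step model and some off-policy model-free update rule (all function interfaces below are assumptions for illustration):

```python
import random

def dyna_style_updates(model, q_update, sample_action, replay_buffer,
                       n_model_steps=10, rollout_len=1):
    """model(s, a) -> (s_next, r): learned one-step dynamics and reward.
    q_update(s, a, r, s_next):     any off-policy model-free update (e.g. a Q-learning step).
    sample_action(s) -> a:         current exploration policy (e.g. epsilon-greedy).
    replay_buffer:                 list of previously visited (s, a) pairs."""
    for _ in range(n_model_steps):
        s, a = random.choice(replay_buffer)       # start from a real visited state
        for _ in range(rollout_len):              # short rollouts limit compounding model error
            s_next, r = model(s, a)               # synthetic transition from the model
            q_update(s, a, r, s_next)             # feed it to the model-free learner
            s, a = s_next, sample_action(s_next)
```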
Pros:
Cons:
In the standard RL setup, the main things we lack in order to use LQR are the dynamics derivatives \(\frac{df}{dx_t}\), \(\frac{df}{du_t}\) (control notation).
Idea: fit \(\frac{df}{dx_t}\), \(\frac{df}{du_t}\) around the taken trajectories. Using LQR on these local models, we obtain a linear feedback controller which can be executed in the real world.
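For example, the derivatives can be obtained by fitting a time-varying linear model \(x_{t+1} \approx A_t x_t + B_t u_t + c_t\) to transitions collected around the current trajectory, so that \(A_t \approx \frac{df}{dx_t}\) and \(B_t \approx \frac{df}{du_t}\). A minimal least-squares sketch (array shapes and names are illustrative assumptions):

```python
import numpy as np

def fit_local_linear_dynamics(X, U, X_next):
    """X, U, X_next: arrays of shape (N, T, dx), (N, T, du), (N, T, dx)
    containing N rollouts collected around the current trajectory.
    Returns per-timestep A_t (dx, dx), B_t (dx, du), c_t (dx,)."""
    N, T, dx = X.shape
    du = U.shape[-1]
    A, B, c = [], [], []
    for t in range(T):
        # Regress [x_t, u_t, 1] -> x_{t+1} across the N rollouts
        Z = np.concatenate([X[:, t], U[:, t], np.ones((N, 1))], axis=1)
        W, *_ = np.linalg.lstsq(Z, X_next[:, t], rcond=None)   # (dx+du+1, dx)
        A.append(W[:dx].T)           # approximates df/dx_t
        B.append(W[dx:dx + du].T)    # approximates df/du_t
        c.append(W[-1])              # constant offset
    return A, B, c
```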
iLQR produces \(\hat x_t, \hat u_t, K_t, k_t\) s.t. \(u_t = K_t (x_t - \hat x_t) + k_t + \hat u_t\). But what controller should we execute?
Ideas:
Problem: most real systems are not linear, so the linear approximations are only good close to the traversed trajectories.
Solution: try to keep the new trajectory distribution \(p(\tau)\) “close” to the old one \(\hat p(\tau)\). If the distributions are close, the dynamics will be as well. By close we mean small KL divergence: \(D_{KL} (p(\tau) \| \hat p(\tau)) < \epsilon\). It turns out this is very easy to do for LQR models: we just modify the reward function of the new controller to add the log-probability of the old one. More in this paper.
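To see why modifying the reward is enough: if both trajectory distributions share the same dynamics and initial state distribution, the dynamics terms cancel in the KL divergence (a sketch of the standard argument, in the notation above):

\begin{equation} D_{KL} (p(\tau) \| \hat p(\tau)) = \sum_t E_{p(x_t, u_t)} \left[ \log p(u_t \mid x_t) - \log \hat p(u_t \mid x_t) \right] \end{equation}

So the constraint only involves the action distributions: adding \(\log \hat p(u_t \mid x_t)\) to the reward (together with an entropy term on the new controller) keeps the divergence small, and the modified objective is still solvable with LQR-style machinery.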
Still, the learned policies will only be local! We need a way to combine them:
Use a weaker learner (e.g. model-based local-policy learner) to guide the learning of a more complex global policy (e.g. ANN).
For instance, if we have an environment with different possible starting states, we can cluster them and train a separate LQR controller for each cluster (each one being responsible only for a narrow region of the state space: trajectory-centric RL). Then we can use them to train a single ANN policy (via supervised learning) which works starting from any state.
Problem: the learned controllers' behavior might not be reproducible by a single ANN. Each controller may converge to a different local optimum, while the ANN can only represent one.
Solution: after training the ANN, go back and modify the weak learners' rewards so that they also try to mimic the ANN. This way, after re-training, the global optimum should be found:
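A rough sketch of this alternation, written as a generic loop (the interfaces below are illustrative assumptions, not the exact algorithm of the paper):

```python
from typing import Callable, List, Tuple
import numpy as np

StateAction = Tuple[np.ndarray, np.ndarray]

def guided_policy_search(
    optimize_local: Callable[[int, float], None],        # re-fit controller i under a penalty weight
    sample_pairs: Callable[[int], List[StateAction]],    # (state, action) pairs from controller i
    fit_global: Callable[[List[StateAction]], None],     # supervised learning on the ANN policy
    n_controllers: int,
    n_iters: int = 10,
) -> None:
    penalty = 1.0
    for _ in range(n_iters):
        data: List[StateAction] = []
        for i in range(n_controllers):
            # 1. Each local (e.g. LQR) controller optimizes its own reward plus a
            #    penalty for deviating from the current global ANN policy.
            optimize_local(i, penalty)
            data += sample_pairs(i)
        # 2. The global ANN policy imitates all local controllers (supervised learning).
        fit_global(data)
        # 3. Gradually increase the penalty so local and global policies agree.
        penalty *= 2.0
```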
This idea of combining local policies with a single global ANN policy can be used in other settings beyond model-based RL (it also works well in model-free RL).
Distillation: given an ensemble of weaker models, we can train a single model that matches their performance by using the ensemble's outputs as “soft” targets (e.g. applying a softmax with a temperature over their averaged outputs). The intuition is that the ensemble adds knowledge to the otherwise hard labels, such as which classes can be confused with each other.
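A minimal sketch of the distillation loss, in the spirit of Hinton et al.'s soft targets (the temperature value and names are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the temperature-softened teacher distribution
    (the "soft" targets) and the student's prediction.
    For an ensemble, teacher_logits can be the averaged logits of its members."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return -(soft_targets * log_student).sum(dim=-1).mean()
```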
The distillation concept can be brought to RL. For instance, in this paper they train an agent to play all Atari games: they train a different policy for each game and then use supervised learning to train a single policy which plays all of them. This seems to be easier than multi-task RL training.
This is analogous to Guided policy search, but for multi-task learning.
We can also use the loop presented in Guided policy search in this setting, improving the task-specific policies using the global policy: