Deep RL algorithms suffer from high sample complexity and brittle convergence (extreme sensitivity to hyperparameters).
To address these issues, this paper proposes an off-policy (to deal with sample complexity), actor-critic (to work in high-dimensional continuous action spaces) algorithm in the maximum entropy framework (to enhance learning robustness).
For infinite-horizon problems, a discount factor \(\gamma\) can be used to keep both the expected rewards and the expected entropies finite.
“Succeed at the task, while behaving as randomly as possible”: the actor aims to maximize expected reward while also maximizing entropy:
The optimization objective then becomes:
\begin{equation} J(\pi) = \sum_t E_{(s_t, a_t) \sim \rho_{\pi}} \left[ r(s_t, a_t) + \alpha H(\pi(\cdot | s_t)) \right] \end{equation}
Where \(\alpha\) is a “temperature” hyperparameter that trades off reward against entropy; it can be fixed or learned, and \(\alpha = 0\) recovers standard RL.
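As a toy illustration (not from the paper), the following sketch computes the discounted maximum-entropy return of a single trajectory from hypothetical per-step rewards and policy entropies, with an arbitrary temperature:

```python
def soft_return(rewards, entropies, alpha=0.2, gamma=0.99):
    """Discounted maximum-entropy return: sum_t gamma^t * (r_t + alpha * H_t)."""
    total = 0.0
    for t, (r, h) in enumerate(zip(rewards, entropies)):
        total += gamma ** t * (r + alpha * h)
    return total

# Hypothetical per-step rewards and policy entropies; alpha = 0 would recover the standard return.
print(soft_return(rewards=[1.0, 0.5, 2.0], entropies=[1.2, 0.9, 0.3]))
```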
The Bellman (expectation) operator \(\mathcal{T}\) then becomes:
\begin{equation} \label{eq:quality} \mathcal{T}^\pi Q(s_t, a_t) = r(s_t, a_t) + \gamma E_{s_{t+1} \sim p} \left[ V (s_{t+1}) \right] \end{equation}
Where:
\begin{equation} \label{eq:value} V (s_t) = E_{a_t \sim \pi} \left[ Q (s_t, a_t) \right] - \alpha E_{a_t \sim \pi} \left[ \log \pi (a_t | s_t) \right] \end{equation}
The value of a state is the expected quality of the actions drawn from \(\pi\) at that state, plus the entropy of the policy at that state. Recall that the entropy of a distribution with density \(p\) is defined as \(H(p) = - E_{x \sim p} \left[ \log p(x) \right]\).
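As a concrete instance of this definition (an aside, not from the paper): for an unsquashed one-dimensional Gaussian policy \(\pi(\cdot | s_t) = \mathcal{N}(\mu(s_t), \sigma^2(s_t))\), the entropy has the closed form

\begin{equation} H\left(\mathcal{N}(\mu, \sigma^2)\right) = - E_{a \sim \mathcal{N}(\mu, \sigma^2)} \left[ \log \mathcal{N}(a; \mu, \sigma^2) \right] = \frac{1}{2} \log \left( 2 \pi e \sigma^2 \right) \end{equation}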
The paper proves the soft policy iteration theorem: repeatedly alternating soft policy evaluation (applying the soft Bellman operator) and soft policy improvement over a family of policies converges to the optimal policy within that family.
Since the algorithm lies in the actor-critic framework, they parametrize a state value function \(V_\psi(s_t)\), a soft Q-function \(Q_\theta(s_t, a_t)\), and a tractable policy \(\pi_\phi(a_t | s_t)\).
Learning \(V_\psi\) is redundant if \(Q_\theta\) is also being learned; nevertheless, learning both stabilizes the training process.
\(V_\psi\) is fitted to minimize the squared residual error against Eq. \ref{eq:value}:
\begin{equation} \label{eq:v} J_V (\psi) = E_{s_t \sim \mathcal{D}} \left[ \frac{1}{2} \left(V_\psi (s_t) - E_{a_t \sim \pi_\phi} \left[ Q_\theta (s_t, a_t) - \log \pi_\phi (a_t | s_t) \right] \right)^2 \right] \end{equation}
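A minimal PyTorch-style sketch of this objective (not the authors' code; `value_net`, `q_net`, and a policy with a `sample` method returning actions and their log-probabilities are assumed interfaces; the inner expectation is estimated with a single sample):

```python
import torch

def value_loss(batch_states, value_net, q_net, policy):
    """J_V: fit V_psi(s_t) to a one-sample estimate of E_{a~pi}[Q(s_t,a) - log pi(a|s_t)]."""
    v = value_net(batch_states)                    # V_psi(s_t)
    actions, log_pi = policy.sample(batch_states)  # a_t ~ pi_phi(.|s_t) and log pi_phi(a_t|s_t)
    with torch.no_grad():
        # The regression target is treated as a constant (no gradient through Q or the policy here).
        target = q_net(batch_states, actions) - log_pi
    return 0.5 * ((v - target) ** 2).mean()
```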
\(Q_\theta\) is fitted to minimize the squared residual error against the soft Bellman backup of Eq. \ref{eq:quality}:
\begin{equation} J_Q (\theta) = E_{(s_t, a_t) \sim \mathcal{D}} \left[ \frac{1}{2} \left( Q_\theta (s_t, a_t) - r(s_t, a_t) - \gamma E_{s_{t+1} \sim p} \left[ V (s_{t+1}) \right] \right)^2 \right] \end{equation}
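A matching sketch of this objective, assuming a replay batch of transitions and a target value network (the \(\bar{\psi}\) copy discussed next); the terminal-state masking is an addition for completeness:

```python
import torch

def q_loss(batch, q_net, target_value_net, gamma=0.99):
    """J_Q: regress Q_theta(s_t, a_t) onto r(s_t, a_t) + gamma * V(s_{t+1})."""
    states, actions, rewards, next_states, dones = batch
    q = q_net(states, actions)
    with torch.no_grad():
        # Bootstrap from the target value network; mask the bootstrap at terminal states.
        target = rewards + gamma * (1.0 - dones) * target_value_net(next_states)
    return 0.5 * ((q - target) ** 2).mean()
```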
It has been shown that using an exponential moving average of \(\psi\) (usually referred to as \(\bar{\psi}\)) as the weights of a target value network in the backup above stabilizes training.
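A sketch of that exponential moving average (Polyak) update, with an assumed smoothing coefficient `tau`:

```python
def update_target(value_net, target_value_net, tau=0.005):
    """psi_bar <- tau * psi + (1 - tau) * psi_bar (exponential moving average of the weights)."""
    for p, p_bar in zip(value_net.parameters(), target_value_net.parameters()):
        p_bar.data.mul_(1.0 - tau)
        p_bar.data.add_(tau * p.data)
```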
\(\pi_\phi\) is fitted to minimize the expected KL-divergence between a policy \(\pi \in \Pi\) (where \(\Pi\) is the family of tractable policies we restrict the search to, e.g. parametrized Gaussians) and the distribution induced by the exponentiated \(Q_\theta\):
\begin{equation} \label{eq:pi} J_\pi (\phi) = E_{s_t \sim \mathcal{D}} \left[ D_{KL} \left(\pi_\phi (\cdot|s_t) \mid \mid \frac{\exp( Q_{\theta} (s_t, \cdot))}{Z_\theta (s_t)} \right) \right] \end{equation}
Where \(Z_\theta (s_t)\) normalizes the exponentiated \(Q\) distribution. It is in general intractable, but since it does not contribute to the gradient with respect to the policy parameters, it can be ignored.
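In practice the paper minimizes \ref{eq:pi} with the reparametrization trick, writing \(a_t = f_\phi(\epsilon_t; s_t)\) with \(\epsilon_t\) drawn from a fixed noise distribution, so that gradients flow through the sampled action. A hedged sketch, assuming the policy exposes an `rsample` method returning such a reparametrized action and its log-probability:

```python
def policy_loss(batch_states, policy, q_net, alpha=1.0):
    """J_pi: minimize E[ alpha * log pi(a|s) - Q(s, a) ], i.e. the KL up to the constant log Z."""
    # `rsample` is an assumed interface: a reparametrized (differentiable) action sample
    # plus its log-probability, e.g. from a tanh-squashed Gaussian head.
    actions, log_pi = policy.rsample(batch_states)
    q = q_net(batch_states, actions)
    return (alpha * log_pi - q).mean()
```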
The full algorithm then alternates environment steps (storing transitions in a replay buffer \(\mathcal{D}\)) with stochastic gradient steps on \(J_V\), \(J_Q\), and \(J_\pi\); a training-loop sketch is given at the end of this section.
They use two Q-function approximators to mitigate positive bias in the policy improvement step and to speed up training; Eq. \ref{eq:v} then uses the minimum of the two Q-functions.
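Putting the pieces together, a hedged sketch of the training loop (the `env` and `replay_buffer` interfaces, the optimizer ordering, and all step counts are assumptions; the loss functions are the sketches above, with the minimum over the two Q-networks used in the value target as just described, and, in this sketch, also in the policy objective):

```python
import torch

def train(env, policy, value_net, target_value_net, q1, q2, optimizers,
          replay_buffer, steps=100_000, batch_size=256):
    """Hypothetical SAC-style loop: alternate environment steps and gradient steps."""
    state = env.reset()
    for _ in range(steps):
        # Environment step: act with the current stochastic policy, store the transition.
        action, _ = policy.sample(state)
        next_state, reward, done = env.step(action)
        replay_buffer.add(state, action, reward, next_state, done)
        state = env.reset() if done else next_state

        # Gradient step on each objective, using a minibatch from the replay buffer.
        batch = replay_buffer.sample(batch_size)
        states = batch[0]  # batch = (states, actions, rewards, next_states, dones)
        q_min = lambda s, a: torch.min(q1(s, a), q2(s, a))  # twin Q-functions, take the minimum
        losses = [
            value_loss(states, value_net, q_min, policy),   # J_V
            q_loss(batch, q1, target_value_net),            # J_Q for the first critic
            q_loss(batch, q2, target_value_net),            # J_Q for the second critic
            policy_loss(states, policy, q_min),             # J_pi
        ]
        # `optimizers` is assumed to be ordered to match `losses`.
        for loss, opt in zip(losses, optimizers):
            opt.zero_grad()
            loss.backward()
            opt.step()
        update_target(value_net, target_value_net)          # psi_bar <- EMA of psi
    return policy
```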