In the previous lecture (Actor-Critic Algorithms) we learned how to improve the policy by taking gradient steps proportional to an Advantage Function \(A^{\pi}(s_t, a_t)\) that tells us how much better action \(a_t\) is than the average action taken in state \(s_t\) according to the policy \(\pi\). We defined the Advantage Function as \begin{equation} \label{eq:advantage} A^{\pi}(s_t, a_t) = r(s_t, a_t) + E_{s_{t+1} \sim p(s_{t+1} \vert s_t, a_t)} \Big [ \gamma V^{\pi}(s_{t+1}) \Big] - V^{\pi}(s_t) \end{equation}
If we have a good estimate of the Advantage Function, we do not need an explicit policy: we can just choose actions according to \begin{equation} a_t = \arg\max_{a}A^{\pi}(s_t, a) \end{equation} i.e. we take the action that yields the highest expected return if we take it in \(s_t\) and then follow \(\pi\) afterwards.
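To make this concrete, here is a minimal sketch of the Advantage definition above together with the resulting greedy action selection for a small tabular MDP. The transition tensor `P`, the reward table `R`, the value table `V`, and the discount `gamma` are hypothetical placeholders, not objects defined in the lecture.

```python
# A tabular sketch: A^pi(s, a) = r(s, a) + gamma * E_{s'}[V^pi(s')] - V^pi(s),
# then act greedily with respect to it. All quantities are toy placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.99

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = p(s' | s, a)
R = rng.normal(size=(n_states, n_actions))                        # R[s, a] = r(s, a)
V = np.zeros(n_states)                                            # current estimate of V^pi

def advantage(s):
    # one value per action: r(s, a) + gamma * sum_{s'} p(s' | s, a) V(s') - V(s)
    return R[s] + gamma * P[s] @ V - V[s]

def greedy_action(s):
    # the argmax policy: pick the action with the highest advantage
    return int(np.argmax(advantage(s)))

print(greedy_action(0))
```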
We make some strong assumptions that help us define the problem and that we will relax later: we assume that the transition dynamics \(p(s_{t+1} \vert s_t, a_t)\) are known, and that the state and action spaces are discrete and small enough to store a value for every entry in a table.
During this lecture we will use the bold \(\pmb{s}\) notation to refer to all possible states \(s \in S\): an assignment \(V(\pmb{s}) \leftarrow \text{something}\) means that the assignment is performed for every \(s \in S\).
Following Eq. \ref{eq:advantage} we decide to evaluate \(A^{\pi}\) by evaluating \(V^{\pi}\).
We can store the whole \(V^{\pi}(\pmb{s})\) and perform a bootstrapped update \begin{equation} \label{eq:v_update} V^{\pi}(\pmb{s}) \leftarrow r(\pmb{s}, \pi(\pmb{s})) + \gamma E_{\pmb{s}_{t+1} \sim p(\pmb{s}_{t+1} \vert \pmb{s}, \pi(\pmb{s}))} \Big[ V^{\pi}(\pmb{s}_{t+1}) \Big] \end{equation}
Note that we are using the current estimate of \(V^{\pi}\) when computing the expectation over the next states \(\pmb{s}_{t+1}\). Since we assume that the transition probabilities are known, the expected value can be computed analytically. This leads us to the Policy Iteration algorithm.
Repeat:
1. Policy evaluation: compute \(A^{\pi}(\pmb{s}, \pmb{a})\), e.g. by evaluating \(V^{\pi}(\pmb{s})\) with repeated applications of Eq. \ref{eq:v_update}.
2. Policy improvement: set \(\pi \leftarrow \pi'\), where \(\pi'(\pmb{a}_t \vert \pmb{s}_t) = 1\) if \(\pmb{a}_t = \arg\max_{\pmb{a}_t} A^{\pi}(\pmb{s}_t, \pmb{a}_t)\) and \(0\) otherwise.
The policy \(\pi\) is now deterministic, as it is an \(\arg\max\) policy. The policy \(\pi\) is guaranteed to improve every time we perform the update in step 2 (unless it is already optimal) thanks to the Policy Improvement Theorem (Sutton & Barto, Sec. 4.2).
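Under the tabular assumptions above, a Policy Iteration sketch could look as follows. The MDP tables `P` and `R` are the same kind of hypothetical placeholders as before; since the dynamics are known, step 1 computes the fixed point of the bootstrapped update in Eq. \ref{eq:v_update} in closed form rather than by iterating it.

```python
# A Policy Iteration sketch for a toy tabular MDP with known dynamics.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.99
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s' | s, a)
R = rng.normal(size=(n_states, n_actions))                        # r(s, a)

pi = np.zeros(n_states, dtype=int)        # deterministic policy: pi[s] = action index
while True:
    # step 1: policy evaluation, solving V = r_pi + gamma * P_pi V exactly
    P_pi = P[np.arange(n_states), pi]     # p(s' | s, pi(s))
    r_pi = R[np.arange(n_states), pi]     # r(s, pi(s))
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # step 2: policy improvement via the argmax over A^pi (equivalently Q^pi)
    Q = R + gamma * np.einsum('sap,p->sa', P, V)
    new_pi = Q.argmax(axis=1)
    if np.array_equal(new_pi, pi):
        break                             # the policy stopped changing: it is optimal
    pi = new_pi
```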
By analyzing the deterministic \(\arg\max\) policy where \(A^{\pi}\) is given by Eq. \ref{eq:advantage}, we observe that the subtracted baseline \(V^{\pi}(s_t)\) is independent of the chosen action, and the \(\arg\max\) step is therefore equivalent to
\begin{equation} \arg\max_{\pmb{a}_t} A^{\pi}(\pmb{s}_t, \pmb{a}_t) = \arg\max_{\pmb{a}_t} Q^{\pi}(\pmb{s}_t, \pmb{a}_t) \end{equation}
since \(r(\pmb{s}_t, \pmb{a}_t) + \gamma E[V^{\pi}(\pmb{s}_{t+1})] = Q^{\pi}(\pmb{s}_t, \pmb{a}_t)\). If we stick to our assumption that we know the environment dynamics, we can store the \(Q^{\pi}\) values in a table and derive a new algorithm called Value Iteration.
The Value Iteration algorithm is now straightforward:
Repeat:
1. Set \(Q(\pmb{s}, \pmb{a}) \leftarrow r(\pmb{s}, \pmb{a}) + \gamma E\big[ V(\pmb{s}') \big]\).
2. Set \(V(\pmb{s}) \leftarrow \max_{\pmb{a}} Q(\pmb{s}, \pmb{a})\).
Until the policy \(\pi(\pmb{s}) = \arg\max_{\pmb{a}} Q(\pmb{s}, \pmb{a})\) no longer changes.
As long as we store the full table of \(Q\) values, the algorithm is guaranteed to converge to the optimal value function.
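The following is a minimal Value Iteration sketch under the same tabular assumptions; the convergence threshold is an arbitrary choice, and once the values stop changing the \(\arg\max\) policy stops changing as well.

```python
# A Value Iteration sketch for a toy tabular MDP with known dynamics.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.99
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s' | s, a)
R = rng.normal(size=(n_states, n_actions))                        # r(s, a)

V = np.zeros(n_states)
while True:
    Q = R + gamma * np.einsum('sap,p->sa', P, V)   # step 1: Q(s, a) <- r + gamma * E[V(s')]
    new_V = Q.max(axis=1)                          # step 2: V(s) <- max_a Q(s, a)
    if np.max(np.abs(new_V - V)) < 1e-8:           # values (and hence the argmax policy)
        break                                      # no longer change
    V = new_V

pi = Q.argmax(axis=1)                              # the resulting deterministic policy
```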
Storing a value for each possible state is often not feasible because the state space is too large or continuous. We can then approximate the value of a state using a Neural Network with parameters \(\phi\). We define the loss as \begin{equation} \mathcal{L}(\phi) = \frac{1}{2} \vert\vert V_{\phi}(\pmb{s}) - \max_{\pmb{a}} Q^{\pi}(\pmb{s}, \pmb{a}) \vert\vert^2 \end{equation} We can then train a network to approximate the value function and obtain the Fitted Value Iteration algorithm.
We are still assuming that we know the transition dynamics of the environment and that the state space is finite.
Repeat:
1. Set \(\pmb{y}_i \leftarrow \max_{\pmb{a}_i} \big( r(\pmb{s}_i, \pmb{a}_i) + \gamma E\big[ V_{\phi}(\pmb{s}_i') \big] \big)\).
2. Set \(\phi \leftarrow \arg\min_{\phi} \frac{1}{2} \sum_i \vert\vert V_{\phi}(\pmb{s}_i) - \pmb{y}_i \vert\vert^2\).
Note that we still need to iterate through all the possible states, although we no longer need to store a value for each of them, and we still need to know the transition dynamics to compute \(E[V_{\phi}(\pmb{s}_i')]\).
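As a sketch of what Fitted Value Iteration could look like in code, the snippet below trains a small PyTorch network \(V_{\phi}\) on one-hot encoded states, using the same kind of known tabular dynamics as in the sketches above; the network size, learning rate, and number of iterations are illustrative assumptions.

```python
# A Fitted Value Iteration sketch: finite states, known dynamics, V_phi is a network.
import numpy as np
import torch

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.99
P = torch.tensor(rng.dirichlet(np.ones(n_states), size=(n_states, n_actions)),
                 dtype=torch.float32)                       # p(s' | s, a)
R = torch.tensor(rng.normal(size=(n_states, n_actions)), dtype=torch.float32)

V_phi = torch.nn.Sequential(torch.nn.Linear(n_states, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 1))
opt = torch.optim.Adam(V_phi.parameters(), lr=1e-2)
states = torch.eye(n_states)                                # one-hot encoding of each state

for _ in range(200):
    with torch.no_grad():
        V = V_phi(states).squeeze(-1)                       # current V_phi(s') for every state
        y = (R + gamma * P @ V).max(dim=1).values           # step 1: y = max_a [r + gamma E[V_phi(s')]]
    loss = 0.5 * ((V_phi(states).squeeze(-1) - y) ** 2).mean()  # step 2: fit V_phi to the targets
    opt.zero_grad()
    loss.backward()
    opt.step()
```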
The algorithm above has an issue that prevents us from relaxing the assumptions we made: in step 1 we need to perform a \(\max\) over all possible actions of \(E[V_{\phi}(\pmb{s}_i')]\), and we cannot know which action leads to the highest value without knowing the transition dynamics. For this reason, we will learn the \(Q\) function instead. By approximating the Q values with \(Q_{\phi}\) we can take the \(\max\) of our approximation directly and use it to estimate \(E[V(\pmb{s}')] \approx \max_{\pmb{a}'} Q_{\phi}(\pmb{s}', \pmb{a}')\). By relaxing the assumption of knowing the transitions, we must now rely on sampling in order to compute the expectation. We relax the assumption of having a finite number of states by using the sampled states instead of iterating through all the possible states.
Therefore, we can define the Fitted Q Iteration algorithm.
The following is the Fitted Q Iteration algorithm along with the parameters of each step.
Repeat:
1. Collect a dataset \(\{ (\pmb{s}_i, \pmb{a}_i, \pmb{s}_i', r_i) \}\) of \(N\) transitions using some behaviour policy.
2. Set \(\pmb{y}_i \leftarrow r(\pmb{s}_i, \pmb{a}_i) + \gamma \max_{\pmb{a}_i'} Q_{\phi}(\pmb{s}_i', \pmb{a}_i')\).
3. Set \(\phi \leftarrow \arg\min_{\phi} \frac{1}{2} \sum_i \vert\vert Q_{\phi}(\pmb{s}_i, \pmb{a}_i) - \pmb{y}_i \vert\vert^2\).
Steps 2 and 3 are repeated \(K\) times before going back to the data collection of step 1.
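The sketch below is one possible implementation of this loop under the relaxed assumptions (sampled transitions, no known dynamics). The choice of environment (gymnasium's CartPole), the random behaviour policy, the network size, and the values of \(N\) and \(K\) are illustrative assumptions, not prescriptions from the lecture.

```python
# A Fitted Q Iteration sketch on sampled transitions (s, a, r, s', done).
import numpy as np
import torch
import gymnasium as gym

env = gym.make("CartPole-v1")
obs_dim, n_actions, gamma = env.observation_space.shape[0], env.action_space.n, 0.99
Q_phi = torch.nn.Sequential(torch.nn.Linear(obs_dim, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, n_actions))
opt = torch.optim.Adam(Q_phi.parameters(), lr=1e-3)
N, K = 500, 50                                           # transitions per collection, inner iterations

for _ in range(10):
    # step 1: collect N transitions with some behaviour policy (random here, for simplicity)
    batch, (s, _) = [], env.reset()
    for _ in range(N):
        a = env.action_space.sample()
        s2, r, term, trunc, _ = env.step(a)
        batch.append((s, a, r, s2, float(term)))
        (s, _) = env.reset() if (term or trunc) else (s2, None)
    S, A, Rw, S2, D = (torch.as_tensor(np.array(x), dtype=torch.float32) for x in zip(*batch))
    for _ in range(K):
        # step 2: y_i = r_i + gamma * max_a' Q_phi(s'_i, a')  (no bootstrap at terminal states)
        with torch.no_grad():
            y = Rw + gamma * (1.0 - D) * Q_phi(S2).max(dim=1).values
        # step 3: regress Q_phi(s_i, a_i) onto the targets y_i
        q = Q_phi(S).gather(1, A.long().unsqueeze(1)).squeeze(1)
        loss = 0.5 * ((q - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```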
There are a few observations that we can make to better understand what this algorithm does.
We can derive an online version of the algorithm that we can call Q Learning in which we optimize \(Q_{\phi}\) at every step.
Here we use the same name as the well-known Watkins' Q Learning algorithm, which was designed for discrete tasks with no function approximation and contains some additional concepts that we will cover in the next Lecture 8: Deep RL with Q functions. As we will see, using function approximation has serious implications, most importantly that convergence is no longer guaranteed. Therefore, keep in mind that this algorithm is not the same as Watkins', and we are using it only as an intermediate step until the next lecture.
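As a sketch, a single online Q Learning update with function approximation could look like the function below; `Q_phi` and `opt` stand for a Q network and its optimizer as in the Fitted Q Iteration sketch above, and the transition `(s, a, r, s2, done)` is assumed to have just been collected with some exploration policy.

```python
# One online Q Learning step: build the bootstrapped target, take one gradient step.
import torch

def online_q_update(Q_phi, opt, s, a, r, s2, done, gamma=0.99):
    s = torch.as_tensor(s, dtype=torch.float32)
    s2 = torch.as_tensor(s2, dtype=torch.float32)
    with torch.no_grad():
        y = r + gamma * (1.0 - float(done)) * Q_phi(s2).max()   # y = r + gamma * max_a' Q_phi(s', a')
    q = Q_phi(s)[a]                                             # current estimate Q_phi(s, a)
    loss = 0.5 * (q - y) ** 2                                   # single transition, single gradient step
    opt.zero_grad()
    loss.backward()
    opt.step()
```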
In step 1 of Fitted Q Iteration and Q Learning we collect one or more transitions. This step is really important, as it produces the data that will be used when fitting the \(Q_{\phi}\) network. We cannot directly use the \(\arg\max\) policy because it lacks exploration: it always chooses the action with the highest estimated value, preventing the exploration of the other actions. We therefore need some kind of stochastic exploration policy that allows us to explore while still taking, on average, good actions that let us reach new important states. In a videogame, for example, we need to try multiple actions in order to learn, but we also need to take good actions in order to reach new levels and thus observe new interesting states.
In \(\epsilon\)-greedy policies we take a random action with probability \(\epsilon\), and the \(\arg\max\) action with probability \(1-\epsilon\):
\[\pi(a_t \vert s_t) = \begin{cases} 1 - \epsilon, & \text{if} \ a_t = \arg\max_a Q_{\phi}(s_t, a) \\ \frac{\epsilon}{\vert A \vert -1}, & \text{otherwise} \end{cases}\]This exploration policy ensures that we explore all actions but still, on average, perform good actions, so that we obtain good samples as the policy advances towards the most interesting states.
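A small sketch of \(\epsilon\)-greedy action selection matching the formula above (the exploratory branch picks uniformly among the non-greedy actions); `Q_phi` is any Q network over a discrete action space, and \(\epsilon\) is a hyperparameter that is often annealed during training.

```python
# Epsilon-greedy: greedy action with probability 1 - epsilon,
# each non-greedy action with probability epsilon / (|A| - 1).
import numpy as np
import torch

def epsilon_greedy(Q_phi, s, epsilon=0.1):
    with torch.no_grad():
        q = Q_phi(torch.as_tensor(s, dtype=torch.float32))
    greedy = int(q.argmax())
    if np.random.rand() >= epsilon:
        return greedy                                    # exploit
    others = [a for a in range(len(q)) if a != greedy]
    return int(np.random.choice(others))                 # explore among the other actions
```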
In Boltzmann exploration we explore actions proportionally to a transformation of their Q value.
\[\pi(a_t \vert s_t) \propto \exp \big( Q_{\phi}(s_t, a_t) \big)\]This allows us to take random actions in a way that is guided by our current \(Q_{\phi}\) estimate: actions with a higher estimated value are sampled more often, while clearly bad actions are rarely taken.
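The following is a sketch of Boltzmann exploration; the temperature parameter is a common addition (an assumption here, not part of the formula above) that controls how strongly the sampling favours high-value actions.

```python
# Boltzmann (softmax) exploration: pi(a | s) proportional to exp(Q_phi(s, a) / T).
import torch

def boltzmann_action(Q_phi, s, temperature=1.0):
    with torch.no_grad():
        q = Q_phi(torch.as_tensor(s, dtype=torch.float32))
    probs = torch.softmax(q / temperature, dim=0)   # normalize exp(Q / T) into a distribution
    return int(torch.multinomial(probs, 1))         # sample an action from it
```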