Problems:
This paper combines several ideas to obtain an estimate of predictive uncertainty:
One output head predicts the mean \(\mu\) and the other the variance \(\sigma^2\) of the prediction, as in Estimating the mean and variance of the target probability distribution. Samples are treated as drawn from a heteroscedastic Gaussian distribution. The parameters \(\mu\) and \(\sigma^2\) are then fit by Maximum Likelihood Estimation (MLE), minimizing the negative log-likelihood:
\begin{equation} -\log p_\theta (y | x) = \frac{\log \sigma_\theta^2 (x)}{2} + \frac{(y - \mu_\theta (x))^2}{2 \sigma_\theta^2 (x)} + \text{constant} \end{equation}
Notice the trade-off between \(\mu\) and \(\sigma^2\): the optimizer cannot simply shrink \(\sigma^2\) faster than the squared error \((y - \mu)^2\) decreases, since that would make the second term grow.
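As a concrete reference, here is a minimal PyTorch sketch of that loss. The softplus-plus-floor parameterization of the variance mirrors the positivity constraint used in the paper; the function name and signature are my own.

```python
import torch
import torch.nn.functional as F

def gaussian_nll(mean, raw_var, target, eps=1e-6):
    # Heteroscedastic Gaussian negative log-likelihood (constant dropped).
    # Softplus plus a small floor keeps the predicted variance positive.
    var = F.softplus(raw_var) + eps
    return (0.5 * torch.log(var) + (target - mean) ** 2 / (2 * var)).mean()
```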
It is well known that an ensemble of models boosts predictive accuracy: bagging is often used to decrease variance, boosting to decrease bias. This paper shows that ensembling also improves predictive uncertainty.
Once trained, the ensemble is treated as a uniformly weighted mixture of Gaussians. The final prediction is the mean of the mixture and the uncertainty is given by its variance. With \(M\) models with parameters \(\theta_1, \dots, \theta_M\):
\begin{equation} \mu_\star (x) = \frac{1}{M} \sum_m \mu_{\theta_m} (x) \end{equation}
\begin{equation} \sigma_\star^2 (x) = \frac{1}{M} \sum_m \left( \mu_{\theta_m}^2 (x) + \sigma_{\theta_m}^2 (x) \right) - \mu_\star^2 (x) \end{equation}
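A direct translation of the two formulas above, assuming the per-model predictions are stacked into tensors of shape `[M, batch]` (the shapes and function name are assumptions):

```python
def combine_ensemble(means, variances):
    # means[m], variances[m] hold mu_{theta_m}(x) and sigma_{theta_m}^2(x).
    mu_star = means.mean(dim=0)                                # mixture mean
    var_star = (means ** 2 + variances).mean(dim=0) - mu_star ** 2  # mixture variance
    return mu_star, var_star
```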
When optimizing with adversarial training, a small perturbation of the input is created in the direction in which the network's loss increases. This augmentation of the training set smooths the predictive distribution. While adversarial training had been used before to improve prediction accuracy, this paper shows that it also improves predictive uncertainty.
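A minimal sketch of such a perturbation using the fast gradient sign method, reusing the `gaussian_nll` helper above and assuming the model returns its two heads as a tuple; the step size `epsilon` here is an illustrative value, not the paper's setting:

```python
def fgsm_example(model, x, y, epsilon=0.01):
    # Perturb the input in the direction that increases the NLL.
    x = x.clone().detach().requires_grad_(True)
    mean, raw_var = model(x)
    gaussian_nll(mean, raw_var, y).backward()
    return (x + epsilon * x.grad.sign()).detach()
```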
The previously presented ideas can be combined in the following algorithm:
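A minimal sketch of how the three pieces might fit together; the `make_model` factory, optimizer, epoch count, and ensemble size are assumptions rather than the paper's exact recipe. Each of the \(M\) networks is randomly initialised and trained independently on the NLL of both the clean and the adversarially perturbed inputs.

```python
def train_deep_ensemble(make_model, loader, M=5, epochs=40, epsilon=0.01):
    # make_model: hypothetical factory returning a fresh network whose forward
    # pass outputs (mean, raw_var); reuses gaussian_nll / fgsm_example above.
    ensemble = []
    for _ in range(M):                      # M independently initialised networks
        model = make_model()
        opt = torch.optim.Adam(model.parameters())
        for _ in range(epochs):
            for x, y in loader:
                x_adv = fgsm_example(model, x, y, epsilon)
                opt.zero_grad()             # discard gradients from the FGSM pass
                loss = gaussian_nll(*model(x), y) + gaussian_nll(*model(x_adv), y)
                loss.backward()
                opt.step()
        ensemble.append(model)
    return ensemble  # combine predictions at test time with combine_ensemble
```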
First they show on toy regression examples the benefits of the three design choices explained above. They then test the algorithm's performance on regression, classification, and uncertainty estimation.
They run their algorithm on several standard regression benchmarks (e.g. Boston Housing).
They test classification performance on the MNIST and SVHN datasets.
The paper evaluates uncertainty on out-of-distribution examples (i.e. unseen classes). To do so, they train on MNIST and examine the predictive entropy on inputs from classes never seen during training (NotMNIST).
They repeat the same experiment using the SVHN dataset for training and CIFAR-10 for testing. Results show that with a large enough ensemble their method is better calibrated than Monte Carlo Dropout (MC-Dropout) and better estimates the uncertainty of out-of-distribution inputs.
One can set a minimum confidence level (an entropy threshold) for the model and have it output “I don’t know” otherwise.
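For example, one could threshold the predictive entropy of the ensemble-averaged softmax and abstain above it (a hypothetical helper; the threshold value is arbitrary):

```python
def predict_or_abstain(probs, max_entropy=0.5):
    # probs: ensemble-averaged softmax for a single input (1-D tensor).
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    return probs.argmax().item() if entropy <= max_entropy else "I don't know"
```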
They show that their algorithm provides more reliable confidence estimates compared to MC-Dropout.