Given a dataset $\mathcal{D} = \{x^{(1)}, \dots, x^{(N)}\}$ of $n$-dimensional datapoints, we want to learn its joint distribution $p(x)$.

Autoregressive (AR) models fix an ordering of the variables and model each conditional probability $p(x_i \mid x_{<i})$, so that by the chain rule:

$$p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})$$

This decomposition converts the joint modelling problem into a sequence modelling one.
A Bayesian network that does not make any conditional independence assumptions among the variables is said to obey the autoregressive property.
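To make the factorization concrete, here is a minimal sketch (binary data and the generic `conditional` callable are assumptions for illustration, not notation from the course) of evaluating $\log p(x)$ as a sum of per-dimension terms:

```python
import numpy as np

def log_likelihood(x, conditional):
    """Evaluate log p(x) of one binary datapoint under the chain-rule
    factorization p(x) = prod_i p(x_i | x_{<i}).

    `conditional(prefix)` is any model that returns the Bernoulli
    parameter p(x_i = 1 | x_{<i}) given the already-seen prefix x_{<i}.
    """
    log_p = 0.0
    for i in range(len(x)):
        p_i = conditional(x[:i])                         # p(x_i = 1 | x_{<i})
        log_p += np.log(p_i if x[i] == 1 else 1.0 - p_i)
    return log_p

# Toy conditional that ignores the context and always predicts 0.5
print(log_likelihood(np.array([1, 0, 1]), lambda prefix: 0.5))  # = 3 * log(0.5)
```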
Simplification methods:
Independence assumption: Instead of making each variable depend on all previous ones, you can define a probabilistic graphical model that encodes only a subset of the dependencies (e.g. a Markov assumption where $x_i$ depends only on $x_{i-1}$).
Parameter reduction: To ease training, one can under-parametrize the model and apply variational inference (VI) to find the closest distribution within the restricted sub-space. For instance, you can design the conditional approximators so that their number of parameters grows only linearly with the input size (a sketch follows below).
Increase representation power: i.e. parametrize each conditional distribution with a more expressive approximator, such as a neural network whose input grows with the number of already-observed variables (Figure 2).
Figure 2: Growing ANN modelling of the conditional distributions. (Image from KTH DD2412 course)
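As a rough illustration of the parameter-reduction idea, here is a minimal numpy sketch in the spirit of NADE (the specific weight shapes `W`, `V`, `b`, `c` are assumptions for illustration, not the course's exact parametrization): every conditional reuses the same shared weights, so the total parameter count is $O(nH)$ instead of one full model per variable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def shared_conditionals(x, W, V, b, c):
    """Return p(x_i = 1 | x_{<i}) for every i with shared parameters.

    W: (H, n), V: (n, H), b: (n,), c: (H,) -- O(n*H) parameters in total,
    reused by all conditionals instead of fitting an independent model each.
    """
    n = len(x)
    probs = np.empty(n)
    for i in range(n):
        h_i = sigmoid(c + W[:, :i] @ x[:i])      # hidden state built from the prefix x_{<i}
        probs[i] = sigmoid(b[i] + V[i] @ h_i)    # i-th conditional reuses the shared weights
    return probs

n, H = 5, 8
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=n)
p = shared_conditionals(x, rng.normal(size=(H, n)), rng.normal(size=(n, H)),
                        rng.normal(size=n), rng.normal(size=H))
```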
The order in which you traverse the data matters! While temporal and sequential data have a natural order, 2D data (e.g. images) does not. One solution is to train an ensemble of models, each with a different ordering (EoNADE), and average their predictions.
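A minimal sketch of this order-ensembling (the per-model `log_likelihood` evaluation as in the first snippet and the uniform average over models are assumptions for illustration): the ensemble density is the mean of the individual densities, so we average in probability space using a log-mean-exp.

```python
import numpy as np

def ensemble_log_likelihood(x, models_and_orders, log_likelihood):
    """Average p(x) over models trained with different variable orderings.

    `models_and_orders` is a list of (model, order) pairs, where `order` is the
    permutation the model was trained with, and `log_likelihood(x_perm, model)`
    evaluates one AR model. p_ens(x) = mean_k p_k(x).
    """
    log_ps = np.array([log_likelihood(x[order], model)
                       for model, order in models_and_orders])
    m = log_ps.max()
    return m + np.log(np.mean(np.exp(log_ps - m)))  # numerically stable log-mean-exp
```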
Instead of using a separate, fixed-size model for each conditional, we can use an RNN and encode the already-seen "context" as a hidden state. RNNs work for sequences of arbitrary length and we can tune their modeling capacity. The main downsides are that they are slow to train (the computation is sequential) and can suffer from vanishing/exploding gradients.
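A rough sketch of an RNN-based AR model (PyTorch and the GRU/embedding sizes are assumptions here, not the setup of any particular paper): the hidden state summarizes $x_{<i}$, and a linear head outputs the logits of each conditional $p(x_i \mid x_{<i})$.

```python
import torch
import torch.nn as nn

class ARGRU(nn.Module):
    """Autoregressive model over sequences of discrete tokens: the GRU's hidden
    state summarizes x_{<i}, a linear head outputs the logits of p(x_i | x_{<i})."""
    def __init__(self, vocab_size=256, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        # Shift the sequence right so position i only ever sees tokens < i
        # (token 0 is reused as a start symbol for simplicity).
        start = torch.zeros(x.size(0), 1, dtype=torch.long, device=x.device)
        inp = torch.cat([start, x[:, :-1]], dim=1)
        h, _ = self.rnn(self.embed(inp))
        return self.head(h)                      # (batch, seq_len, vocab_size) logits

model = ARGRU()
x = torch.randint(0, 256, (4, 32))               # toy batch: 4 sequences of 32 tokens
logits = model(x)
loss = nn.functional.cross_entropy(logits.reshape(-1, 256), x.reshape(-1))
```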
PixelRNN applies this idea to images. It introduces tricks like multi-scale context to achieve better results than simply traversing the pixels row-wise: the model first traverses sub-scaled versions of the image and is finally fit on the whole image. If interested, check out our LMConv post. Some other interesting papers on this topic: PixelCNN, WaveNet.
Overall, AR models provide:
But: