This post covers some key concepts used when dealing with ML models.
Machine Learning studies algorithms that improve their performance through data: the more data they process, the better they perform.
To put it in simple terms:
More on this in our Why generative models? post.
An optimizer is the algorithm which allows us to find the parameters that minimize an objective function. Very simple models might have a closed-form solution, which one should (obviously) use instead of these general techniques. Optimizers are often categorized by the order of the derivative information available.
0-order optimizers are used when the gradients of the objective to optimize are not available (black-box functions). These algorithms are often used in ML to optimize model hyper-parameters, for which computing a gradient is not feasible or too expensive.
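As a minimal sketch of a 0-order approach, the snippet below runs a random search over two hypothetical hyper-parameters; `validation_loss` stands in for the expensive black-box train-and-evaluate step and is replaced here by a toy surrogate so the example runs on its own:

```python
import random

# Toy stand-in for the true black box: training a model with the given
# hyper-parameters and returning its validation loss (no gradients available).
def validation_loss(learning_rate, hidden_units):
    return (learning_rate - 0.01) ** 2 + abs(hidden_units - 128) / 1000.0

# Random search: a simple 0-order optimizer over the hyper-parameter space.
best_params, best_loss = None, float("inf")
for _ in range(50):
    params = {
        "learning_rate": 10 ** random.uniform(-4, -1),      # log-uniform sample
        "hidden_units": random.choice([32, 64, 128, 256]),
    }
    loss = validation_loss(**params)
    if loss < best_loss:
        best_params, best_loss = params, loss
```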
Another group of “fun” gradient-free optimizers are the meta-heuristics:
They use the gradient of the objective function (w.r.t. the model’s weights) in order to find a local minimum. Most of the algorithms are based on gradient descent: move in the direction of steepest descent, i.e. against the gradient. There are some variations to improve performance:
Often, very deep or recurrent networks suffer from vanishing gradient problems:
Solutions:
While less frequent, if the weights of the network are larger than 1 we can have an exploding gradient problem (completely analogous).
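As a minimal sketch (on an assumed toy quadratic objective, not a neural network), here is vanilla gradient descent with gradient-norm clipping, one common guard against exploding gradients:

```python
import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])     # toy positive-definite matrix (assumption)
b = np.array([1.0, -2.0])

def grad(w):
    # Gradient of the toy objective 0.5 * w^T A w - b^T w
    return A @ w - b

w = np.zeros(2)
lr, max_norm = 0.1, 5.0
for _ in range(200):
    g = grad(w)
    norm = np.linalg.norm(g)
    if norm > max_norm:                     # clip to keep updates from exploding
        g *= max_norm / norm
    w -= lr * g                             # step against the gradient
# w now approximates the analytical minimizer A^{-1} b
```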
They use second-order derivatives (the Hessian). A common algorithm is to look for a $0$ of the objective’s derivative using the Newton-Raphson method. As a reminder, if \(f\) is the derivative of our objective function, Newton-Raphson iteratively finds a $0$ as:
\begin{equation} x_{n+1} = x_n - \frac{f(x_n)}{f^\prime (x_n)} \end{equation}
2nd-order methods approximate the objective quadratically in the local neighborhood of the evaluated point, while 1st-order methods only approximate it linearly.
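A minimal sketch of the idea on an assumed 1-D toy objective $g(x) = x^4 - 3x^2 + 2$: we apply Newton-Raphson to its derivative $f = g^\prime$, so the second derivative $g^{\prime\prime}$ plays the role of $f^\prime$:

```python
def f(x):        # g'(x): the derivative whose zero we look for
    return 4 * x ** 3 - 6 * x

def f_prime(x):  # g''(x): second derivative of the objective
    return 12 * x ** 2 - 6

x = 2.0          # initial guess (assumption)
for _ in range(20):
    x = x - f(x) / f_prime(x)
# x converges to a stationary point of g (here sqrt(3/2) ≈ 1.22)
```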
Pros/Cons:
Since ANNs are already quite expensive to evaluate, 2nd-order optimization methods are rarely used to train them.
Techniques to improve model generality by imposing extra conditions in the training phase. These techniques reduce variance at the cost of higher bias.
These regularization techniques are based on the idea of shrinkage: force the smaller (and thus less important) model parameters towards zero to reduce variance. In regression tasks, this can be achieved by adding a norm of the weights to the loss function being optimized.
It is interesting to notice that shrinkage is implicit in Bayesian inference (dictated by the chosen prior).
The most famous methods are:
\begin{equation} \mathcal{L}^{\text{REG}} = \mathcal{L} + \lambda \Vert W \Vert_2^2 \end{equation}
\begin{equation} \mathcal{L}^{\text{REG}} = \mathcal{L} + \lambda \Vert W \Vert_1 \end{equation}
Interestingly, the L1 term is better at forcing some parameters to exactly 0. Thus, LASSO regression can also be used for feature selection: detecting which dimensions matter the most for the regression task we are working on.
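A minimal sketch of this effect with scikit-learn on synthetic data (sizes and penalty strengths are arbitrary assumptions; scikit-learn calls $\lambda$ `alpha`): only the first two of ten features actually drive the target, and LASSO zeroes out most of the rest:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all weights a bit
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: drives useless weights to exactly 0

print(ridge.coef_)                   # small but non-zero weights on irrelevant features
print(lasso.coef_)                   # exact zeros on most irrelevant features
```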
Notice that this idea of penalizing large model weights by adding a term to the loss function can also be used in more complex models such as ANNs. Most ANN packages include the option to add regularization to a Dense layer, as sketched below.
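For instance, in Keras (used here just as one example; argument names differ between packages) the penalty is attached per layer:

```python
from tensorflow import keras

# Illustrative values; lambda = 1e-4 is an arbitrary assumption.
layer = keras.layers.Dense(
    64,
    activation="relu",
    kernel_regularizer=keras.regularizers.l2(1e-4),   # adds lambda * ||W||_2^2 to the loss
)
```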
$\lambda$ controls how strong the applied regularization is:
Notice that this is just a Lagrange multiplier: we optimize the original loss with the added constraint that the weights must be small.
At each training iteration it randomly selects a subset of the neurons and sets their output to $0$. This forces the network to learn multiple ways of mapping inputs to outputs, making it more robust.
This can also be thought of as an ensemble technique, since one might argue that we are training multiple models at once: one for each set of active neurons.
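A minimal sketch of (inverted) dropout, assuming a plain NumPy activation vector; the rescaling by $1/(1-p)$ keeps the expected activation unchanged between training and test time:

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True):
    if not training or p_drop == 0.0:
        return activations                       # at test time every neuron is active
    mask = np.random.rand(*activations.shape) > p_drop
    return activations * mask / (1.0 - p_drop)   # rescale the surviving activations

h = np.array([0.2, 1.5, -0.7, 0.9])              # toy layer output (assumption)
print(dropout(h, p_drop=0.5))                    # roughly half the entries become 0
```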
Split the data into three sets: training, evaluation and testing. Update model parameters based on the training loss, but stop training as soon as the evaluation loss stops decreasing (there are multiple heuristics to detect this). Early stopping prevents the model from overfitting to the training set.
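A minimal sketch of the loop with a simple "patience" heuristic; `train_epoch` and `eval_loss` are hypothetical callables supplied by the user (one pass over the training set, and the loss on the evaluation set):

```python
def train_with_early_stopping(train_epoch, eval_loss, patience=5, max_epochs=100):
    best_loss, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch()                    # update parameters on the training set
        val_loss = eval_loss()           # loss on the held-out evaluation set
        if val_loss < best_loss:
            best_loss, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # evaluation loss stopped decreasing
                break
    return best_loss
```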
To improve the network’s generality it is often a good idea to train with small randomly-perturbed inputs. In the case of images these perturbations could be: random noise, translations, rotations, zoom, mirroring…
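A minimal sketch for images stored as NumPy arrays (noise level and flip probability are arbitrary assumptions):

```python
import numpy as np

def augment(image, noise_std=0.02):
    image = image + np.random.normal(scale=noise_std, size=image.shape)  # random noise
    if np.random.rand() < 0.5:
        image = image[:, ::-1]                                           # horizontal mirror
    return image

img = np.random.rand(28, 28)    # stand-in for a real training image
augmented = augment(img)
```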
Combine multiple weak learners to improve results. This can also be seen as a type of regularization, as averaging over multiple models boosts generality.
Mode: Simple voting mechanism. Take what the majority of learners say.
Average / Weighted Average: Assign a weight to each learner and compute the mean prediction.
BAGGING (Bootstrap AGGregatING): Multiple models of the same type are trained with random subsets of the data sampled with replacement (bootstrapping). This technique is especially effective at reducing variance.
BOOSTING (aka Arcing, Adaptive Reweighting and Combining): Trains models sequentially based on the previous models’ performance, instead of in parallel as in Bagging.
ADABOOST: Each datapoint is given an “importance weight” which is adjusted during the sequential training of multiple models: misclassified datapoints are assigned a higher weight to make subsequent models consider them more. In addition, a “reliability weight” is assigned to each model and a weighted average is used for the final guess. Although it also lowers the variance, it is mainly used to lower the bias of the models.
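A minimal AdaBoost sketch with decision stumps on synthetic data (all constants are assumptions); labels live in $\{-1, +1\}$ so the reliability-weighted vote is just a sign:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

weights = np.full(len(X), 1.0 / len(X))               # importance weights
stumps, alphas = [], []
for _ in range(10):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
    pred = stump.predict(X)
    err = np.sum(weights * (pred != y)) / np.sum(weights)
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))   # reliability weight of this model
    weights *= np.exp(-alpha * y * pred)              # up-weight misclassified points
    weights /= weights.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final guess: reliability-weighted vote of all stumps.
final = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
```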
Gradient Boosting (GB): Instead of changing the weight of each misclassified input-label pair, GB sequentially fits each model to the error left by the previous models. So if the real labels are $y_i$ and the first model, trained on $x_i$, outputs $h_i$, the second model will attempt to map $x_i$ to $y_i - h_i$. The final prediction is then given by the sum of the outputs of all models.
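A minimal gradient-boosting sketch for squared loss on synthetic data (tree depth and the number of rounds are assumptions): each new tree is fit to the residual $y_i - h_i$ of the current ensemble, and the final prediction is the sum of all outputs:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

prediction = np.zeros_like(y)
trees = []
for _ in range(20):
    residual = y - prediction                        # what previous models got wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += tree.predict(X)                    # add this model's contribution
    trees.append(tree)
```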
In this section we will take a look at the most common model evaluation criteria.
Cross-validation splits the data into $n$ equal chunks and trains the algorithm $n$ different times, each time leaving one chunk out for testing. Finally, we can get an idea of how well the model performs by averaging the losses obtained in each test.
If we have as many chunks as datapoints, we call it leave-one-out cross-validation.
It is useful to compare the performance of different models while being economical with the available data. It is also used to optimize hyper-parameters.
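A minimal $k$-fold sketch with scikit-learn on synthetic data ($n = 5$ chunks, Ridge as the model and squared error as the loss are all assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=100)

losses = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    error = y[test_idx] - model.predict(X[test_idx])
    losses.append(np.mean(error ** 2))   # test loss on the held-out chunk

print(np.mean(losses))                   # averaged estimate of model performance
```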
For binary classification tasks it is very useful to construct the binary confusion matrix and compare different model performances with the following metrics:
Receiver Operating Characteristic (ROC)
Compares the Recall (TPR) vs. the FPR (1 - Specificity) obtained with the studied model when varying its decision threshold.
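A minimal sketch of how the curve is traced, with toy labels and scores (both assumptions): sweep the decision threshold and record the (FPR, Recall) pair at each value:

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                    # toy ground truth
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.3, 0.7, 0.55, 0.6])  # toy model scores

for threshold in np.linspace(0.0, 1.0, 11):
    pred = (scores >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)       # Recall / sensitivity
    fpr = fp / np.sum(y_true == 0)       # 1 - Specificity
    print(f"threshold={threshold:.1f}  Recall={tpr:.2f}  FPR={fpr:.2f}")
```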