This post covers basic probability concepts that are useful to have in mind when working with ML models. Its main purpose is to serve as a memory refresher, so it is written as schematic notes, avoiding formal definitions.
Random variable (rv): A mapping between events (or outcomes) and real values that eases probability calculations.
Independence: Events \(X\), \(Y\) are independent when the occurrence of \(X\) does not affect the probability of \(Y\): \(p(X \cap Y) = p(X) \cdot p(Y)\)
Variance: Expectation of the squared deviation of a rv: \(V(X) = E[(X - E[X])^2]\)
Conditional probability: \(p(X \mid Y) = \frac{p(X \cap Y)}{p(Y)}\)
\begin{equation} p(Y | X) = \frac{p(X | Y) p(Y)}{p(X)} \end{equation}
Naming goes: Posterior: \(p(y \mid x)\). Likelihood: \(p(x \mid y)\). Prior: \(p(y)\). Evidence: \(p(x)\).
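To make the formula concrete, here is a minimal sketch with made-up numbers for a disease-testing scenario (all values are illustrative assumptions, not taken from the notes above):

```python
# Minimal Bayes' theorem sketch with made-up numbers:
# y = "has disease", x = "test is positive".
prior = 0.01            # p(y)
likelihood = 0.95       # p(x | y), test sensitivity
false_positive = 0.05   # p(x | not y)

# Evidence p(x) via the law of total probability.
evidence = likelihood * prior + false_positive * (1 - prior)

posterior = likelihood * prior / evidence  # p(y | x)
print(f"p(y | x) = {posterior:.3f}")       # ~0.161
```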
Both MLE (Maximum Likelihood Estimation) and MAP (Maximum A Posteriori) are used to estimate parameters in a probabilistic setting.
MLE finds the parameters which best explain the data: Given a dataset \(\mathcal{D}\) and some parametrized distribution, MLE finds the parameters which yield the highest likelihood.
\begin{equation} \theta^\star = \arg \max_\theta \mathcal{L} (\mathcal{D}, \theta) = \arg \max_\theta p(\mathcal{D} \mid \theta) \end{equation}
Remember that likelihood (as presented in the Bayes theorem) refers to the probability of having obtained the data we obtained, assuming it comes from a distribution with parameters \(\theta\).
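As a minimal illustration (not part of the original notes), the sketch below estimates the parameters of a Gaussian by numerically maximizing the log-likelihood on toy data; for a Gaussian this simply recovers the closed-form sample mean and standard deviation:

```python
# Sketch: MLE of a Gaussian's (mu, sigma) by maximizing the log-likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)  # toy dataset D

def neg_log_likelihood(params):
    mu, log_sigma = params  # optimize log(sigma) so that sigma stays positive
    return -norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)).sum()

theta_star = minimize(neg_log_likelihood, x0=[0.0, 0.0]).x
print("MLE mu, sigma:", theta_star[0], np.exp(theta_star[1]))
print("closed form  :", data.mean(), data.std())
```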
MAP works within the Bayesian framework and, instead of maximizing the likelihood, maximizes the posterior. It finds the most probable parameters given the data. For this, we need to assume some prior distribution over the parameters \(p(\theta)\).
\begin{equation} \theta^\star = \arg \max_\theta p(\theta \mid \mathcal{D}) = \arg \max_\theta \frac{p(\mathcal{D} \mid \theta) p(\theta)}{ p(\mathcal{D})} \propto \arg \max_\theta p(\mathcal{D} \mid \theta) p(\theta) \end{equation}
The evidence \(p(\mathcal{D})\) does not depend on the model parameters, so we can drop it in the MAP computation.
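A minimal sketch of MLE vs. MAP, assuming a toy coin-flip dataset and a Beta prior over the head probability (both are illustrative choices, not mentioned above):

```python
# Sketch: MAP estimate of a coin's head probability theta with a Beta(a, b) prior,
# for which the MAP estimate has a closed form.
import numpy as np

flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # toy data: 1 = heads
a, b = 2.0, 2.0                              # Beta prior pseudo-counts

heads, n = flips.sum(), len(flips)
theta_mle = heads / n                            # argmax p(D | theta)
theta_map = (heads + a - 1) / (n + a + b - 2)    # argmax p(D | theta) p(theta)

print(theta_mle, theta_map)  # the prior pulls the MAP estimate towards 0.5
```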
\begin{equation} I_P(X) = - \log (P (X)) \end{equation} How surprising an observed event is.
\begin{equation} \mathcal{H} (P) = E_P [ I_P (X)] \end{equation} How surprised you expect to be by sampling from P. It is the average information content of a probability distribution.
\begin{equation} \mathcal{H} (P, Q) = E_P [ I_Q (X)] \end{equation} The average information content of \(Q\) when events are drawn from \(P\).
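These three quantities are easy to compute for discrete distributions; the following sketch uses two made-up toy distributions:

```python
# Sketch: information content, entropy and cross-entropy for discrete distributions.
import numpy as np

P = np.array([0.5, 0.25, 0.25])   # toy "true" distribution
Q = np.array([0.8, 0.1, 0.1])     # toy model distribution

info_P = -np.log(P)                        # I_P(x): surprise of each event under P
entropy_P = np.sum(P * -np.log(P))         # H(P)   = E_P[I_P(X)]
cross_entropy = np.sum(P * -np.log(Q))     # H(P,Q) = E_P[I_Q(X)]

print(info_P, entropy_P, cross_entropy)    # H(P,Q) >= H(P)
```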
A generalized way to quantify the dissimilarity between random variables, probability distributions or samples. Some examples are the Kullback–Leibler divergence, the Jensen–Shannon divergence and the Bhattacharyya distance.
A lot of ML models attempt to minimize a statistical distance to find the best parameters of a parametrized distribution. The ultimate goal is to have a model which behaves as closely as possible to the “real” distribution. Obviously this distribution is not known and we are only given some samples. The basic approach is to choose a model family which can capture a subspace of the probability distribution space, and then optimize the model parameters to minimize the distance to the distribution given by the samples.
Often used as an approximation technique in Bayesian modelling when expressions are intractable.
\begin{equation} D_{KL} (P || Q) = E_P [ I_Q (X) - I_P (X)] \end{equation}
The extra surprise you get on average by modeling events with \(Q\) when they actually follow distribution \(P\).
Developing the expression we get:
\begin{equation} D_{KL} (P || Q) = - \sum P (x) \log (Q (x)) + \sum P (x) \log (P (x)) = \mathcal{H} (P, Q) - \mathcal{H} (P) \end{equation}
The minimum achievable value is 0. This happens when \(P=Q\); in this case \(\mathcal{H} (P, Q) = \mathcal{H} (P)\).
If P is a fixed distribution, then \(\mathcal{H} (P)\) is constant, so minimizing the KL-divergence is the same as minimizing the cross-entropy. That is why cross-entropy is often used as a loss when training NNs (P are the labels, Q the predictions).
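Continuing the toy example from above, a quick check that the KL divergence equals the cross-entropy minus the entropy (scipy's `entropy(P, Q)` also returns the KL divergence when given two arguments):

```python
# Sketch: KL divergence computed directly and as H(P,Q) - H(P),
# reusing the toy P and Q from the previous snippet.
import numpy as np
from scipy.stats import entropy

P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.8, 0.1, 0.1])

kl_direct = np.sum(P * np.log(P / Q))
kl_from_h = np.sum(P * -np.log(Q)) - np.sum(P * -np.log(P))  # H(P,Q) - H(P)

print(kl_direct, kl_from_h, entropy(P, Q))  # all three agree, and are >= 0
```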
If \(P\) is the “true” distribution and \(Q\) our approximation we have 2 types of KL divergence:
Forward KL (moment-matching): \(D_{KL} (P \mid \mid Q) = \sum P \log \frac{P}{Q}\). Where P is small, it doesn’t matter what Q is doing. E.g. if Q is Gaussian, it will only try to match the moments of P (mean and variance). Forward KL is what most ANNs optimize.
Reverse KL (mode-seeking): \(D_{KL} (Q \mid \mid P) = \sum Q \log \frac{Q}{P}\). Where Q is low, it does not matter what P is doing.
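A tiny numeric illustration of this asymmetry, using a made-up bimodal \(P\) and two candidate approximations \(Q\):

```python
# Sketch: asymmetry of the KL divergence on a toy bimodal P.
# Q_wide spreads mass over both modes (moment-matching behaviour),
# Q_mode concentrates on a single mode (mode-seeking behaviour).
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

P      = np.array([0.45, 0.05, 0.05, 0.45])  # two modes
Q_wide = np.array([0.25, 0.25, 0.25, 0.25])  # covers both modes
Q_mode = np.array([0.85, 0.05, 0.05, 0.05])  # sits on one mode

print(kl(P, Q_wide), kl(P, Q_mode))  # forward KL prefers Q_wide
print(kl(Q_wide, P), kl(Q_mode, P))  # reverse KL prefers Q_mode
```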
Statistical testing is used to determine whether a hypothesis is statistically significant. The main steps are as follows:
1. State the null hypothesis \(H_0\) and the alternative hypothesis \(H_1\).
2. Choose a significance level \(\alpha\) (e.g. 0.05).
3. Compute the appropriate test statistic from the data.
4. Obtain the p-value and reject \(H_0\) if it falls below \(\alpha\).
Common tests: z-test, t-test, ANOVA, \(\chi^2\)-test…
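As an illustrative example (the data are made up), a two-sample t-test with scipy:

```python
# Sketch: two-sample t-test on toy data (e.g. comparing two models' scores).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=50)   # baseline scores (toy)
b = rng.normal(loc=0.3, scale=1.0, size=50)   # new-model scores (toy)

stat, p_value = ttest_ind(a, b)
print(p_value)  # reject the null hypothesis (equal means) if p_value < alpha
```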