This post is more a memory-refresher list of the most famous probability distributions than a comprehensive explanation of anything in particular.
Toss of a coin with probability of heads $p$.
Moments Example: By means of example, its moments would be computed as:

$$\mathbb{E}[X] = 1 \cdot p + 0 \cdot (1 - p) = p$$

$$\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = p - p^2 = p(1 - p)$$
Toss of a die with $k$ faces, where face $i$ comes up with probability $p_i$ (and $\sum_i p_i = 1$).
Remember: Categorical Cross-Entropy is often used as a loss function for ML supervised classification tasks, which is:

$$H(p, q) = -\sum_{i} p_i \log q_i$$
In classification tasks, we see our model prediction as a categorical distribution: for a given input, a certain probability is assigned to each class. In this context:
$p$ is the “real” categorical distribution of the data. Usually the labels come as a 1-hot encoding, so the “real” distribution we want to match has probability $1$ in the correct label and $0$ in the others: $p_i$ indicates whether class $i$ is the correct one for a given input.
$q$, on the other hand, is the guess made by our model, with each $q_i$ being the probability of classifying the input into class $i$. Thus, the loss associated with a datapoint becomes:
$$L = -\sum_{i} p_i \log q_i = -\log q_c$$

(the negative log of the probability the model assigns to the correct class $c$).
Remember that the cross-entropy is often used as a loss when the entropy of the data $H(p)$ is fixed, since it then differs from the KL divergence only by a constant:

$$D_{KL}(p \,\|\, q) = H(p, q) - H(p)$$

so minimizing one is equivalent to minimizing the other.
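As a tiny sanity check, here is a NumPy sketch of this loss for a single datapoint (the class probabilities are made up for illustration):

```python
import numpy as np

# Hypothetical example: categorical cross-entropy for a single datapoint.
# p is the one-hot "real" distribution, q the model's predicted probabilities.
p = np.array([0.0, 1.0, 0.0])   # correct class is index 1
q = np.array([0.2, 0.7, 0.1])   # assumed softmax output of the model

cross_entropy = -np.sum(p * np.log(q))  # reduces to -log q[correct class]
print(cross_entropy, -np.log(q[1]))     # both ~0.357
```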
Number of successes in $n$ independent Bernoulli trials, each with success probability $p$:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$
For large $n$, it is well approximated by a Gaussian $\mathcal{N}(np,\, np(1-p))$ (a consequence of the central limit theorem discussed below).
*[Figure: Binomial PMF]*
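A quick SciPy sketch (parameters picked arbitrarily) of the binomial PMF, together with the Gaussian approximation $\mathcal{N}(np,\, np(1-p))$ that works well for large $n$:

```python
from scipy.stats import binom, norm

n, p = 100, 0.3   # illustrative parameters
k = 35

# Exact probability of k successes in n Bernoulli(p) trials
print(binom.pmf(k, n, p))

# Gaussian approximation: N(np, np(1-p))
print(norm.pdf(k, loc=n * p, scale=(n * p * (1 - p)) ** 0.5))
```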
Counts of each outcome of a $k$-sided die rolled $n$ times.
I.e.:

$$P(x_1, \dots, x_k) = \frac{n!}{x_1! \cdots x_k!} \; p_1^{x_1} \cdots p_k^{x_k}$$

Where $x_i$ is the number of times outcome $i$ was observed (with $\sum_i x_i = n$) and $p_i$ is its probability on a single roll.
Notice this is a generalization of the binomial distribution to categorical variables: instead of counting successes of a Bernoulli event, we count occurrences of each outcome of a categorical event.
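A minimal NumPy sketch of drawing multinomial counts (probabilities made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p = [0.2, 0.3, 0.5]                         # probabilities of the k = 3 categories

# Counts of each category after rolling the "die" 100 times
counts = rng.multinomial(n=100, pvals=p)
print(counts, counts.sum())                 # three counts that always sum to 100
```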
Counts the number of failures before the first success in a sequence of Bernoulli trials with success probability $p$.
Usage example: This distribution is often used to model the life expectancy of something that has a probability $p$ of dying at every time-step. For instance, in RL: if the life expectancy of an agent is $\frac{1}{1-\gamma}$ steps (i.e. it “dies” with probability $1-\gamma$ at each step), we discount each expected future reward at time $t$ by a factor of $\gamma^t$ to account for the probability of being alive at that point (there are other reasons to discount it, such as making infinite-episode reward sums finite or accounting for the fact that most actions do not have long-lasting repercussions).
*[Figure: Geometric PMF]*
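A small SciPy sketch of the RL life-expectancy picture above; the discount factor is an arbitrary choice, and note that SciPy's `geom` counts the trial on which the first “success” (here, death) happens:

```python
from scipy.stats import geom

gamma = 0.99            # assumed RL discount factor
p_die = 1 - gamma       # probability of "dying" at each step

life = geom(p_die)      # geometric distribution over the step on which death occurs
print(life.mean())      # expected lifetime = 1 / (1 - gamma) = 100 steps
print(gamma ** 10)      # probability of still being alive after 10 steps (~0.904)
```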
Counts the number of random independent events happening in a fixed period of time (or space):
Imagine that on average $\lambda$ such events happen in the considered period of time. Then the probability of observing exactly $k$ events in that period is:

$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$
*[Figure: Poisson PMF]*
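A minimal SciPy sketch (rate chosen arbitrarily):

```python
from scipy.stats import poisson

lam = 4   # assume 4 events per period on average
k = 7

print(poisson.pmf(k, mu=lam))   # probability of seeing exactly 7 events in one period
print(poisson.cdf(2, mu=lam))   # probability of seeing at most 2 events
```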
Assigns the same likelihood to all values within a range $[a, b]$.
MLE Example: By means of example, let's see how we would compute the MLE of its parameters given a dataset $D = \{x_1, \dots, x_n\}$. If we express the likelihood of the dataset we get that:

$$L(a, b) = \prod_{i=1}^{n} \frac{1}{b - a}\,\mathbb{1}\{a \le x_i \le b\} = \begin{cases} \frac{1}{(b-a)^n} & \text{if all } x_i \in [a, b] \\ 0 & \text{otherwise} \end{cases}$$

If we want to maximize the likelihood, we need to minimize $b - a$, but we need $a \le \min_i x_i$ and $b \ge \max_i x_i$ for the likelihood to be non-zero. Thus, the minimum of $b - a$ (and the maximum of the likelihood) is achieved when:

$$\hat{a} = \min_i x_i, \qquad \hat{b} = \max_i x_i$$
*[Figure: Uniform PDF]*
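A small NumPy sketch of this MLE on synthetic data (the true range is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(low=2.0, high=5.0, size=1000)   # synthetic dataset

# MLE of the Uniform(a, b) parameters: the tightest interval containing the data
a_hat, b_hat = data.min(), data.max()
print(a_hat, b_hat)   # both close to the true values 2.0 and 5.0
```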
*[Figure: Gaussian PDF]*
Weak Law of Large Numbers
The mean of a sequence of i.i.d. random variables converges in probability to the expected value of the random variable as the length of that sequence tends to infinity:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \;\xrightarrow{\;P\;}\; \mu \quad \text{as } n \to \infty$$
Converges in probability means that the probability of the sample's mean deviating from the distribution mean by more than any fixed $\varepsilon > 0$ tends to 0 as the sample size grows:

$$\lim_{n \to \infty} P\left(\left|\bar{X}_n - \mu\right| > \varepsilon\right) = 0$$
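A quick NumPy simulation of this, using an exponential distribution with an arbitrarily chosen rate:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 3.0   # exponential with rate 3, so the true mean is 1/3

for n in [10, 1_000, 100_000]:
    samples = rng.exponential(scale=1 / lam, size=n)
    print(n, samples.mean())   # the sample mean approaches 1/3 as n grows
```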
This is very nice, but it only considers the mean of a SINGLE set of samples of a random variable. But what distribution does this mean follow?
Central limit theorem
The distribution of sample means of any distribution converges in distribution to a normal distribution (after centering and scaling):

$$\sqrt{n}\left(\bar{X}_n - \mu\right) \;\xrightarrow{\;d\;}\; \mathcal{N}(0, \sigma^2)$$
This only applies to distributions with finite mean and finite variance, so forget about stuff like the Cauchy distribution.
Converges in distribution means that the CDF of the set of means converges to the CDF of a Gaussian distribution.
Combining these two theorems, we can assume that the mean of samples of any distribution will follow a normal distribution around the expected value if the sample size is big enough (usually 30 samples is considered big enough).
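A quick NumPy simulation of this, using a very non-Gaussian base distribution (an exponential with an arbitrary scale):

```python
import numpy as np

rng = np.random.default_rng(0)

# 10_000 sample means, each computed from 30 draws of a skewed exponential(1)
means = rng.exponential(scale=1.0, size=(10_000, 30)).mean(axis=1)

# The means cluster around the true expectation 1.0 with std ~ 1/sqrt(30) ~ 0.18,
# and their histogram looks approximately Gaussian.
print(means.mean(), means.std())
```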
The central limit theorem helps explain why a lot of quantities in nature seem to follow a normal distribution. Thus, it is very often the case that scientists assume normality in their observations (also because of the maximum entropy principle).
Models the sum of $k$ squared standard normals: if $Z_1, \dots, Z_k \sim \mathcal{N}(0, 1)$ are independent, then $\sum_{i=1}^{k} Z_i^2 \sim \chi^2_k$.
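A small NumPy/SciPy check of this definition (the number of degrees of freedom is arbitrary):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
k = 5

# Sum of k squared standard normals, repeated many times
samples = (rng.standard_normal(size=(100_000, k)) ** 2).sum(axis=1)

print(samples.mean(), chi2(df=k).mean())   # both ~ k = 5
print(samples.var(), chi2(df=k).var())     # both ~ 2k = 10
```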
Represents the probability distribution of the amount of time between two Poisson-type events.
It is often said that “it doesn't have memory”: this happens because the occurrences of events are independent from each other. The way I picture it is with this process:
- Throw $n$ darts at random onto a 1-dimensional target of fixed length.
- Walk through the target from one side to the other.

The probability distribution of the time to the next dart is exponential, and it doesn't matter that you just saw one: the probability of seeing another one is completely unrelated:

$$P(X > s + t \mid X > s) = P(X > t)$$
Can be thought of as a continuous version of a Geometric distribution. “Number of failures until one success” is analogous to “time until event”.
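A small SciPy sketch of the memorylessness property (rate and times chosen arbitrarily):

```python
from scipy.stats import expon

lam = 2.0                       # assumed event rate
X = expon(scale=1 / lam)        # exponential waiting time with rate lam

s, t = 1.5, 0.7
# Memorylessness: P(X > s + t | X > s) == P(X > t)
print(X.sf(s + t) / X.sf(s))    # probability of waiting t more, given we already waited s
print(X.sf(t))                  # same value (~0.247)
```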
In the same way that the exponential distribution predicts the amount of time until the first Poisson event, the Gamma distribution predicts the time until the $k$-th Poisson event, for events happening at rate $\lambda$.
It presents two common parameterizations. One with shape parameter $k$ and scale parameter $\theta$:

$$f(x) = \frac{x^{k-1} e^{-x/\theta}}{\theta^{k}\,\Gamma(k)}$$

And one with shape $\alpha = k$ and rate $\beta = 1/\theta$:

$$f(x) = \frac{\beta^{\alpha} x^{\alpha - 1} e^{-\beta x}}{\Gamma(\alpha)}$$
*[Figure: Gamma PDF]*
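A small NumPy/SciPy check of the “time until the $k$-th Poisson event” interpretation (rate and $k$ chosen arbitrarily): summing $k$ exponential waiting times gives samples that match a Gamma with shape $k$ and rate $\lambda$.

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(0)
lam, k = 2.0, 3   # assumed event rate and number of events

# Time until the 3rd Poisson event = sum of 3 independent exponential waiting times
waits = rng.exponential(scale=1 / lam, size=(100_000, k)).sum(axis=1)

# Compare against Gamma(shape=k, scale=1/lam), i.e. shape alpha = k, rate beta = lam
print(waits.mean(), gamma(a=k, scale=1 / lam).mean())   # both ~ k / lam = 1.5
print(waits.var(), gamma(a=k, scale=1 / lam).var())     # both ~ k / lam**2 = 0.75
```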