The Law of Large Numbers (LLN) is a fundamental theorem in probability that describes the result of performing the same experiment a large number of times. It states that as the number of independent, identically distributed random variables increases, their sample mean (average) approaches their theoretical mean (expected value), provided that mean exists.
In simpler terms: The more data you collect, the closer your observed average will be to the true average.
Formal Definition
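Let $X_1, X_2, \dots$ be a sequence of independent and identically distributed (i.i.d.) random variables with an expected value $\mathbb{E}[X_i] = \mu$.
The sample mean is defined as:
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$
The LLN states that as $n \to \infty$, the sample mean $\bar{X}_n$ converges to the true mean $\mu$.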
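The convergence of the sample mean can be checked numerically. The following is a minimal sketch, assuming a fair six-sided die as the experiment (true mean 3.5); the function name `running_means` and the seed are illustrative choices, not part of the theorem.

```python
import random

def running_means(n_rolls, rng):
    """Return the sample mean after each of n_rolls fair die rolls."""
    total = 0
    means = []
    for i in range(1, n_rolls + 1):
        total += rng.randint(1, 6)   # one roll of a fair die (assumed experiment)
        means.append(total / i)      # sample mean of the first i rolls
    return means

rng = random.Random(0)               # fixed seed only so the run is repeatable
means = running_means(100_000, rng)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n={n:>6}  sample mean={means[n - 1]:.4f}  (true mean 3.5)")
```

The printed sample means should drift toward 3.5 as n grows, although any single run can wander at small n.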
Weak Law of Large Numbers (Khinchin’s Law)
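The Weak LLN states that the sample mean $\bar{X}_n$ converges to the expected value $\mu$ in probability. For any strictly positive number $\varepsilon$:
$$\lim_{n \to \infty} P\left(\left|\bar{X}_n - \mu\right| > \varepsilon\right) = 0$$
This means that for a sufficiently large sample size, the probability that the sample mean differs from the true mean by more than $\varepsilon$ can be made arbitrarily close to zero.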
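Convergence in probability is a statement about how often the sample mean lands far from the true mean, so it can be estimated by repeating the whole experiment many times. The sketch below assumes fair die rolls (true mean 3.5) and a tolerance of 0.25; `deviation_probability` is a hypothetical helper written for this illustration.

```python
import random

TRUE_MEAN = 3.5  # expected value of a fair six-sided die

def deviation_probability(n, eps, trials, rng):
    """Estimate P(|sample mean of n rolls - TRUE_MEAN| > eps) from repeated trials."""
    far = 0
    for _ in range(trials):
        mean = sum(rng.randint(1, 6) for _ in range(n)) / n
        if abs(mean - TRUE_MEAN) > eps:
            far += 1
    return far / trials

rng = random.Random(1)
for n in (10, 100, 1_000):
    p = deviation_probability(n, eps=0.25, trials=500, rng=rng)
    print(f"n={n:>5}  estimated P(|mean - 3.5| > 0.25) = {p:.3f}")
```

The estimated probability should shrink toward zero as n increases, which is the behaviour the Weak LLN describes.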
Strong Law of Large Numbers (Kolmogorov’s Law)
The Strong LLN states that the sample mean $\bar{X}_n$ converges to the expected value $\mu$ almost surely (with probability 1):
$$P\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1$$
This is a stronger statement: the set of outcomes for which the sample mean fails to converge to the true mean has probability zero.
Applications in Machine Learning
The Law of Large Numbers is the bedrock upon which much of machine learning theory and practice is built. It provides the theoretical guarantee that learning from data is actually possible.
1. Empirical Risk Minimization (ERM)
In machine learning, we want to find a model that minimizes the true risk (the expected loss over the entire unseen data distribution). However, we only have access to a finite training dataset. We instead minimize the empirical risk (the average loss on our training data).
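- For a fixed model, the LLN guarantees that as the size of our training dataset grows, the empirical risk converges to the true risk. This is one reason more data generally leads to better, more generalizable models; extending the guarantee to the model selected by minimization additionally requires the convergence to hold uniformly over the candidate models.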
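The convergence of empirical risk to true risk can be illustrated on a toy problem where the true risk is known in closed form. This is a sketch under stated assumptions: inputs are uniform on [0, 1], the clean label is 1 when x > 0.5, labels are flipped with probability 0.1, and the model is a fixed threshold at 0.6 evaluated with 0-1 loss. All of these choices, and the names used, are invented for the example.

```python
import random

NOISE = 0.1            # assumed probability that a label is flipped
MODEL_THRESHOLD = 0.6  # the fixed, deliberately imperfect model

def draw_example(rng):
    """Sample (x, y) from the assumed data distribution."""
    x = rng.random()
    clean = 1 if x > 0.5 else 0
    y = 1 - clean if rng.random() < NOISE else clean
    return x, y

def predict(x):
    return 1 if x > MODEL_THRESHOLD else 0

def empirical_risk(n, rng):
    """Average 0-1 loss of the fixed model on n freshly drawn examples."""
    errors = 0
    for _ in range(n):
        x, y = draw_example(rng)
        errors += predict(x) != y
    return errors / n

# On (0.5, 0.6] (mass 0.1) the model disagrees with the clean label, so it errs
# when the label is NOT flipped; elsewhere (mass 0.9) it errs only on flips.
TRUE_RISK = 0.1 * (1 - NOISE) + 0.9 * NOISE

rng = random.Random(0)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n={n:>6}  empirical risk={empirical_risk(n, rng):.4f}  true risk={TRUE_RISK:.4f}")
```

Note that the model is held fixed here, which is the setting in which the plain LLN applies directly.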
2. Monte Carlo Methods
Many problems in ML involve calculating complex integrals or expectations (e.g., computing marginal probabilities in Bayesian networks, or value functions in Reinforcement Learning). When analytical solutions are impossible, we use Monte Carlo sampling.
- We draw $n$ random samples from the distribution and average their outcomes. The LLN guarantees that this sample average will converge to the true expected value as $n$ increases.
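As a concrete instance of Monte Carlo estimation, the sketch below approximates the integral of exp(-x^2) over [0, 1] by averaging the integrand at uniformly drawn points. The integrand is an arbitrary choice for illustration; it was picked because the exact value, (sqrt(pi) / 2) * erf(1), is available from the standard library for comparison.

```python
import math
import random

def mc_integral(n, rng):
    """Monte Carlo estimate of the integral of exp(-x^2) over [0, 1].

    With X ~ Uniform(0, 1), the integral equals E[exp(-X^2)], so the
    sample average of exp(-X^2) is the estimator.
    """
    return sum(math.exp(-rng.random() ** 2) for _ in range(n)) / n

# Closed-form value used only as a reference for the comparison.
EXACT = math.sqrt(math.pi) / 2 * math.erf(1.0)

rng = random.Random(0)
for n in (10, 100, 1_000, 10_000, 100_000):
    estimate = mc_integral(n, rng)
    print(f"n={n:>6}  estimate={estimate:.5f}  abs error={abs(estimate - EXACT):.5f}")
```

The error typically shrinks as n grows, though not monotonically, since each estimate is itself random.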
3. A/B Testing and Analytics
When evaluating a new feature (such as a UI change), we measure the conversion rate on a sample of users. The LLN ensures that as the sample grows, the observed conversion rate converges to the true population conversion rate, which reduces the risk of making decisions based on small-sample noise.
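The effect of sample size on an A/B decision can be simulated. This is a rough sketch with invented numbers: variant A is assumed to convert at 10% and variant B at 12%, and an experiment counts as a "wrong call" when B, the truly better variant, does not show the higher observed rate. The function names are hypothetical.

```python
import random

def observed_rate(true_rate, n_users, rng):
    """Fraction of n_users simulated users who convert."""
    conversions = sum(rng.random() < true_rate for _ in range(n_users))
    return conversions / n_users

def wrong_call_fraction(rate_a, rate_b, n_users, experiments, rng):
    """Share of simulated experiments in which B does not look better than A."""
    wrong = 0
    for _ in range(experiments):
        a = observed_rate(rate_a, n_users, rng)
        b = observed_rate(rate_b, n_users, rng)
        if b <= a:                 # ties are counted as a wrong call
            wrong += 1
    return wrong / experiments

rng = random.Random(7)
for n_users in (100, 1_000, 5_000):
    frac = wrong_call_fraction(0.10, 0.12, n_users, experiments=100, rng=rng)
    print(f"users per variant={n_users:>5}  wrong-call fraction={frac:.2f}")
```

With small samples the observed ordering is often wrong; with larger samples the observed rates settle near their true values and wrong calls become rare.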
4. Mini-Batch Gradient Descent
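In deep learning, we update model weights using the gradient of the loss function. Computing the true gradient over the entire dataset is expensive. Instead, we compute the gradient over a small “mini-batch”. When the mini-batch is sampled uniformly at random, its average gradient is an unbiased estimator of the full-dataset gradient; this follows from linearity of expectation. The LLN adds that as the batch size grows, the mini-batch gradient concentrates around the full-dataset gradient, so larger batches give less noisy updates.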
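Both properties of the mini-batch gradient can be observed on a small synthetic problem. The sketch below assumes one-dimensional linear regression with squared loss, data generated as y = 3x + noise, and batches drawn uniformly without replacement; the dataset, the helper names, and the batch sizes are illustrative assumptions.

```python
import random

def make_data(n, rng):
    """Synthetic dataset: y = 3x + Gaussian noise (assumed for illustration)."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0, 0.5) for x in xs]
    return xs, ys

def example_grad(w, x, y):
    """Derivative with respect to w of the squared loss (w*x - y)^2."""
    return 2.0 * (w * x - y) * x

def full_grad(w, xs, ys):
    return sum(example_grad(w, x, y) for x, y in zip(xs, ys)) / len(xs)

def minibatch_grad(w, xs, ys, batch_size, rng):
    idx = rng.sample(range(len(xs)), batch_size)  # uniform, without replacement
    return sum(example_grad(w, xs[i], ys[i]) for i in idx) / batch_size

def mean_abs_error(w, xs, ys, batch_size, repeats, rng):
    """Average distance between a mini-batch gradient and the full gradient."""
    g = full_grad(w, xs, ys)
    total = sum(abs(minibatch_grad(w, xs, ys, batch_size, rng) - g)
                for _ in range(repeats))
    return total / repeats

rng = random.Random(0)
xs, ys = make_data(2_000, rng)
w = 0.0
g_full = full_grad(w, xs, ys)

# Unbiasedness: the average of many small-batch gradients is close to g_full.
avg_small = sum(minibatch_grad(w, xs, ys, 8, rng) for _ in range(2_000)) / 2_000
print(f"full gradient={g_full:.4f}  average of batch-8 gradients={avg_small:.4f}")

# Concentration: the typical error shrinks as the batch size grows.
for b in (8, 64, 512):
    print(f"batch size={b:>3}  mean |error|={mean_abs_error(w, xs, ys, b, 200, rng):.4f}")
```

Small batches are noisy individually but correct on average, while larger batches trade extra computation for lower noise per update.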