The law of large numbers states that the sample average approaches the population’s expected value as the number of samples goes to infinity:

\[ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n X_i = E(X). \]

This holds for independent and identically distributed (i.i.d.) variables: each trial is performed without reference to previous samples, and the underlying distribution is the same each time. In card terms, you use the same deck and you don’t change the way you shuffle.

For these types of variables, the central limit theorem shows that as your sample size increases, the distribution of sample averages approaches a normal distribution. This is amazing! The underlying distribution can be anything, but the averages you compute from it still settle into a predictable bell curve, and with every trial you tighten your estimate of the true mean.

We’ll demonstrate this with a quick simulation sampling from an exponential distribution, which describes the waiting time between events that occur at a constant average rate, independently of each other. Imagine the time between clicks of a Geiger counter, or the length of time between phone calls at a call center. It has the density function

\[ f(x) = \lambda e^{-\lambda x}, \quad \text{for } x \geq 0. \]

We define our rate parameter (\(\lambda\)), calculate the theoretical population mean (\(\mu\)), standard deviation (\(\sigma\)), and variance (\(\sigma^2\)):
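A minimal sketch of that setup; since the theoretical mean quoted below is \(5\) and the exponential mean is \(1/\lambda\), the rate must be \(\lambda = 0.2\):

lambda <- 0.2         # rate parameter
mu     <- 1 / lambda  # theoretical mean: 1/lambda = 5
sigma  <- 1 / lambda  # theoretical standard deviation: 1/lambda = 5
sigma2 <- sigma^2     # theoretical variance: 1/lambda^2 = 25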

Simulation

Each sample is \(n = 50\) numbers drawn from the exponential. We repeat this \(k = 1000\) times and store the mean for each trial.

# n:      number of rexp samples for each mean
# k:      number of means
# lambda: rate parameter of the underlying exponential
generate_means <- function(n, k, lambda) {
  # collect k means, each over n exponential samples, in a data frame
  sample_means <- replicate(k, mean(rexp(n, lambda)))
  data.frame(means = sample_means)
}

n <- 50
k <- 1000
df1 <- generate_means(n, k, lambda)

Since we are sampling with R’s rexp function, we know the data are independent and identically distributed. A histogram of one sample shows that the values roughly follow an exponential curve, though not too convincingly at this small sample size.
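One way to draw that histogram, as a sketch with assumed plotting choices (base R graphics):

hist_sample <- rexp(n, lambda)  # a single sample of n = 50 draws
hist(hist_sample, breaks = 20, freq = FALSE,
     main = "One exponential sample (n = 50)", xlab = "x")
curve(dexp(x, rate = lambda), add = TRUE, col = "red")  # theoretical density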

The overall mean of our sample means is \(4.992\), close to the theoretical value of \(5\) represented by the red line on the boxplot.
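A sketch of that summary and boxplot (plotting choices assumed):

mean(df1$means)              # overall mean of the k sample means, ~5
boxplot(df1$means, ylab = "sample mean")
abline(h = mu, col = "red")  # theoretical mean mu = 5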

Assuming \(n = 50\) is large enough, the distribution of means approaches a normal with mean \(\mu\) and variance \(\sigma^2/n\), where \(\sigma^2\) is the variance of the exponential distribution we were originally sampling from. The square root of this, \(\sigma/\sqrt{n}\), is the standard error of the mean; \(\sigma^2/n\) will be our theoretical variance.
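A quick numerical check of that claim, comparing the empirical variance of the means against the theoretical value (exact numbers vary by seed):

var(df1$means)  # empirical variance of the k sample means
sigma2 / n      # theoretical variance: 25/50 = 0.5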

While the underlying samples are exponential, the distribution of means looks normal, centered around the theoretical population mean \(\mu\), with standard deviation \(\sigma/\sqrt{n}\).
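One way to visualize this is to overlay the theoretical normal on the histogram of means; a sketch with assumed plotting choices:

hist(df1$means, breaks = 30, freq = FALSE,
     main = "Distribution of sample means", xlab = "sample mean")
curve(dnorm(x, mean = mu, sd = sigma / sqrt(n)),
      add = TRUE, col = "red")  # normal implied by the CLT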

Any individual sample from the exponential distribution can have a mean far from the population mean, as seen in the cumulative means at small \(k\) on the left. As we average over more and more sample means, we converge on the theoretical population mean \(\mu = 5\).
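A sketch of that cumulative average over the k trial means:

cum_means <- cumsum(df1$means) / seq_len(k)  # running average of the first k means
plot(seq_len(k), cum_means, type = "l",
     xlab = "k (means averaged)", ylab = "cumulative mean")
abline(h = mu, col = "red")                  # converges toward mu = 5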

To extend the previous experiment we increase the sample size to \(n = 1000\) with the same \(k = 1000\) trials, which gives a tighter distribution around our theoretical mean, with standard error \(\sigma/\sqrt{n} = 0.1581\).
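Rerunning the same simulation at the larger sample size (the exact spread will vary from run to run):

df2 <- generate_means(1000, k, lambda)  # n = 1000 draws per mean
sd(df2$means)                           # empirical standard error, ~0.158
sigma / sqrt(1000)                      # theoretical standard error: 0.1581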

