Lecture 22 – The Normal Distribution, The Central Limit Theorem

DSC 10, Spring 2023

Agenda

The normal distribution

Recap: Standard units

SAT scores range from 400 to 1600. The distribution of SAT scores has a mean of 950 and a standard deviation of 300. Your friend tells you that their SAT score, in standard units, is 2.5. What do you conclude?
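One way to reason about this is to convert 2.5 standard units back to original units and compare the result with the maximum possible score. A minimal sketch, where the helper name `from_standard_units` is made up for illustration:

```python
# Values taken from the example above.
sat_mean = 950
sat_sd = 300
max_score = 1600

def from_standard_units(z, mean, sd):
    """Convert a value in standard units back to original units."""
    return mean + z * sd

claimed_score = from_standard_units(2.5, sat_mean, sat_sd)
print(claimed_score)              # 1700.0
print(claimed_score > max_score)  # True
```

A score of 1700 is above the maximum of 1600, so the claim cannot be accurate.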

Recap: The standard normal distribution

Using the normal distribution

Last time, we looked at a data set of heights and weights of 5000 adult males.

Both variables are roughly normal. What benefit is there to knowing that the two distributions are roughly normal?

Example: Proportion of heights between 65 and 70 inches

Let's suppose, as is often the case, that we don't have access to the entire distribution of heights, just the mean and SD.

Using just this information, we can estimate the proportion of heights between 65 and 70 inches:

  1. Convert 65 to standard units.
  2. Convert 70 to standard units.
  3. Use stats.norm.cdf to find the area between (1) and (2).
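The three steps above can be sketched as follows. The mean and SD below are hypothetical placeholders, not the values from the lecture's data set, and `standard_units` is an illustrative helper name.

```python
from scipy import stats

# Hypothetical summary statistics, in inches; substitute the real mean and SD.
height_mean = 69.0
height_sd = 3.0

def standard_units(x, mean, sd):
    """Number of SDs that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

left = standard_units(65, height_mean, height_sd)   # step 1
right = standard_units(70, height_mean, height_sd)  # step 2

# Step 3: area under the standard normal curve between left and right.
approx_proportion = stats.norm.cdf(right) - stats.norm.cdf(left)
print(approx_proportion)
```

Subtracting the two `stats.norm.cdf` values works because `stats.norm.cdf(z)` gives the area to the left of `z`.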

Checking the approximation

Since we have access to the entire set of heights, we can compute the true proportion of heights between 65 and 70 inches.

Pretty good for an approximation! 🤩
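A sketch of this comparison is below. Since the height data itself isn't reproduced here, the code generates synthetic heights as a stand-in (an assumption, including the mean of 69 and SD of 3); with the real data, `heights` would be the actual column of 5000 values.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the 5000 heights (hypothetical mean and SD).
rng = np.random.default_rng(42)
heights = rng.normal(loc=69.0, scale=3.0, size=5000)

# True proportion: count the heights that actually fall in the interval.
in_range = (heights >= 65) & (heights <= 70)
true_proportion = np.count_nonzero(in_range) / len(heights)

# Normal approximation, using only the mean and SD of the data.
mean, sd = heights.mean(), np.std(heights)
approx_proportion = (stats.norm.cdf((70 - mean) / sd)
                     - stats.norm.cdf((65 - mean) / sd))

print(true_proportion, approx_proportion)
```

Because the synthetic data is drawn from a normal distribution, close agreement is expected here; the interesting case is real data that is only roughly normal.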

Center and spread, revisited

Chebyshev's inequality and the normal distribution

| Range | All Distributions (via Chebyshev's inequality) | Normal Distribution |
| --- | --- | --- |
| mean $\pm \ 1$ SD | $\geq 0\%$ | $\approx 68\%$ |
| mean $\pm \ 2$ SDs | $\geq 75\%$ | $\approx 95\%$ |
| mean $\pm \ 3$ SDs | $\geq 88.8\%$ | $\approx 99.73\%$ |

68% of values are within 1 SD of the mean

Remember, the values on the $x$-axis for the standard normal curve are in standard units. So, the proportion of values within 1 SD of the mean is the area under the standard normal curve between -1 and 1.

This means that if a variable follows a normal distribution, approximately 68% of values will be within 1 SD of the mean.
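The area between -1 and 1 can be computed directly with `stats.norm.cdf`; a minimal sketch:

```python
from scipy import stats

# Area to the left of 1, minus area to the left of -1.
within_1_sd = stats.norm.cdf(1) - stats.norm.cdf(-1)
print(round(within_1_sd, 4))  # 0.6827
```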

95% of values are within 2 SDs of the mean

Recap: Proportion of values within $z$ SDs of the mean

| Range | All Distributions (via Chebyshev's inequality) | Normal Distribution |
| --- | --- | --- |
| mean $\pm \ 1$ SD | $\geq 0\%$ | $\approx 68\%$ |
| mean $\pm \ 2$ SDs | $\geq 75\%$ | $\approx 95\%$ |
| mean $\pm \ 3$ SDs | $\geq 88.8\%$ | $\approx 99.73\%$ |

Unlike the Chebyshev figures, the percentages for normal distributions above are approximations of the actual proportions, not lower bounds.

Important: They apply to all normal distributions, standardized or not. This is because all normal distributions are just stretched and shifted versions of the standard normal distribution.
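To see both points numerically, the sketch below computes the Chebyshev lower bound $1 - \frac{1}{z^2}$ alongside the normal proportion, first for the standard normal and then for a non-standard normal distribution. The function names and the mean of 69 and SD of 3 are illustrative assumptions.

```python
from scipy import stats

def chebyshev_lower_bound(z):
    """Guaranteed minimum proportion within z SDs of the mean, for any distribution."""
    return max(0.0, 1 - 1 / z ** 2)

def normal_proportion(z, mean=0, sd=1):
    """Proportion of a normal distribution within z SDs of its mean."""
    upper = stats.norm.cdf(mean + z * sd, loc=mean, scale=sd)
    lower = stats.norm.cdf(mean - z * sd, loc=mean, scale=sd)
    return upper - lower

for z in [1, 2, 3]:
    print(z,
          chebyshev_lower_bound(z),
          normal_proportion(z),                 # standard normal
          normal_proportion(z, mean=69, sd=3))  # hypothetical non-standard normal
```

The last two columns agree for every `z`, since shifting and stretching the curve does not change the area within a given number of SDs of the mean.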

Inflection points

Example: Inflection points

Remember: The distribution of heights is roughly normal, but it is not a standard normal distribution.
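For a normal curve, the inflection points sit one SD on either side of the mean. The sketch below locates them numerically for a non-standard normal curve by finding where the second derivative changes sign. The mean of 69 and SD of 3 are hypothetical, not the values from the heights data.

```python
import numpy as np
from scipy import stats

# Hypothetical mean and SD of a roughly normal variable.
mean, sd = 69.0, 3.0

# Evaluate the normal curve on a fine grid spanning 4 SDs on each side.
x = np.linspace(mean - 4 * sd, mean + 4 * sd, 2001)
density = stats.norm.pdf(x, loc=mean, scale=sd)

# Numerical second derivative; it changes sign at each inflection point.
second_derivative = np.gradient(np.gradient(density, x), x)
sign_changes = np.where(np.diff(np.sign(second_derivative)) != 0)[0]
inflection_points = x[sign_changes]

print(inflection_points)   # approximately mean - sd and mean + sd
print(mean - sd, mean + sd)
```

In original units the inflection points are near 66 and 72, which correspond to -1 and 1 in standard units.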

The Central Limit Theorem

Back to flight delays ✈️

The distribution of flight delays that we've been looking at is not roughly normal.

Empirical distribution of a sample statistic

Empirical distribution of the sample mean

Since we have access to the population of flight delays, let's remind ourselves what the distribution of the sample mean looks like by drawing samples repeatedly from the population.

Notice that this distribution is roughly normal, even though the population distribution was not! This distribution is centered at the population mean.
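The simulation can be sketched as follows. The flight delay data isn't reproduced here, so the code uses a synthetic right-skewed population as a stand-in; the sample size of 500 and the 2000 repetitions are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-in for the population of delays (in minutes).
population = rng.exponential(scale=15, size=10000)

# Repeatedly draw a sample with replacement and record its mean.
sample_means = np.array([
    rng.choice(population, size=500, replace=True).mean()
    for _ in range(2000)
])

# Check the center, and check one normal-curve property: about 95%
# of the sample means should fall within 2 SDs of their center.
center, spread = sample_means.mean(), np.std(sample_means)
within_2_sds = np.mean(np.abs(sample_means - center) <= 2 * spread)

print(population.mean(), center)
print(within_2_sds)
```

Plotting a histogram of `sample_means` would show the bell shape directly, even though a histogram of `population` is heavily skewed.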

The Central Limit Theorem

The Central Limit Theorem (CLT) says that the probability distribution of the sum or mean of a large random sample drawn with replacement will be roughly normal, regardless of the distribution of the population from which the sample is drawn.

While the formulas we're about to introduce only work for sample means, it's important to remember that the statement above also holds true for sample sums.
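To illustrate the claim for sample sums, the sketch below sums 400 rolls of a fair die, repeated 5000 times, and checks the resulting sums against the 68% and 95% figures for a normal distribution. The die example and all of the numbers in it are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
faces = np.arange(1, 7)  # population: a fair six-sided die (uniform, not normal)

# Each row is one sample of 400 rolls drawn with replacement; sum each row.
sample_sums = rng.choice(faces, size=(5000, 400), replace=True).sum(axis=1)

center, spread = sample_sums.mean(), np.std(sample_sums)
within_1_sd = np.mean(np.abs(sample_sums - center) <= spread)
within_2_sds = np.mean(np.abs(sample_sums - center) <= 2 * spread)

print(within_1_sd, within_2_sds)  # close to 0.68 and 0.95
```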

Characteristics of the distribution of the sample mean

Changing the sample size

The function sample_mean_delays takes in an integer sample_size, and:

  1. Takes a sample of size sample_size directly from the population.
  2. Computes the mean of the sample.
  3. Repeats steps 1 and 2 above 2000 times, and returns an array of the resulting means.
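One possible implementation of the steps above is sketched here. It is not necessarily how the lecture defines the function: the population is a synthetic stand-in named `population_delays`, and sampling with replacement is an assumption.

```python
import numpy as np

rng = np.random.default_rng(10)

# Synthetic stand-in for the population of flight delays (in minutes).
population_delays = rng.exponential(scale=15, size=10000)

def sample_mean_delays(sample_size, repetitions=2000):
    """Return an array of sample means, one per repetition."""
    means = np.array([])
    for _ in range(repetitions):
        # Step 1: draw a sample of the given size from the population.
        sample = rng.choice(population_delays, size=sample_size, replace=True)
        # Step 2: compute the mean of the sample and record it.
        means = np.append(means, sample.mean())
    return means

# Call the function on a few sample sizes.
means_by_size = {n: sample_mean_delays(n) for n in [5, 50, 500]}
for n, means in means_by_size.items():
    print(n, len(means), means.mean())
```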

Let's call sample_mean_delays on several values of sample_size.

Let's look at the resulting distributions.

What do you notice? 🤔

Standard deviation of the distribution of the sample mean
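For samples drawn with replacement, the SD of the distribution of the sample mean equals the population SD divided by the square root of the sample size. The sketch below checks this relationship by simulation on a synthetic population; the population, sample sizes, and repetition count are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic, right-skewed stand-in population.
population = rng.exponential(scale=15, size=10000)
population_sd = np.std(population)

results = {}
for n in [25, 100, 400]:
    # 2000 samples of size n, drawn with replacement; one mean per row.
    means = rng.choice(population, size=(2000, n), replace=True).mean(axis=1)
    simulated_sd = np.std(means)
    predicted_sd = population_sd / np.sqrt(n)
    results[n] = (simulated_sd, predicted_sd)
    print(n, simulated_sd, predicted_sd)
```

Quadrupling the sample size roughly halves the SD of the sample means, so larger samples give more tightly clustered estimates.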