In [1]:
# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
set_matplotlib_formats("svg")
plt.style.use('ggplot')

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

# Animations
import time
from IPython.display import display, HTML, IFrame, clear_output
import ipywidgets as widgets

import warnings
warnings.filterwarnings('ignore')

def normal_curve(x, mu=0, sigma=1):
    # Normal density with mean mu and SD sigma; note the 1/sigma factor.
    return 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))

def normal_area(a, b, bars=False):
    x = np.linspace(-4, 4, 1000)
    y = normal_curve(x)
    ix = (x >= a) & (x <= b)
    plt.figure(figsize=(10, 5))
    plt.plot(x, y, color='black')
    plt.fill_between(x[ix], y[ix], color='gold')
    if bars:
        plt.axvline(a, color='red')
        plt.axvline(b, color='red')
    plt.title(f'Area between {np.round(a, 2)} and {np.round(b, 2)}')
    plt.show()

def show_clt_slides():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vTcJd3U1H1KoXqBFcWGKFUPjZbeW4oiNZZLCFY8jqvSDsl4L1rRTg7980nPs1TGCAecYKUZxH5MZIBh/embed?start=false&loop=false&delayms=3000"
    width = 960
    height = 509
    display(IFrame(src, width, height))

Lecture 21 – The Normal Distribution, The Central Limit Theorem¶

DSC 10, Winter 2023¶

Announcements¶

  • Lab 6 is due Tuesday 3/7 at 11:59PM.
  • Homework 6 is due Thursday 3/9 at 11:59PM.
  • Check out the DSC Senior Capstone Showcase on Wednesday 3/15.
    • See how DSC majors are putting their skills to work on problems from a variety of domains.
    • Block 1 (11AM-12:30PM): Medicine and Bioinformatics 💊, Graphs and Deep Learning 📈, Finance and Blockchain 💰
    • Block 2 (1-2:30PM): NLP, Sentiment Analysis, and Social Media 🗣, Fairness and Causality 🤝, Other Applications ⚙️
    • RSVP here by 3/13.

Check-in ✅ – Answer at cc.dsc10.com¶

The Final Project is due on Tuesday 3/14 at 11:59PM and has 8 sections. How much progress have you made?

A. Not started or barely started ⏳
B. Finished 1 or 2 sections
C. Finished 3 or 4 sections ❤️
D. Finished 5 or 6 sections
E. Finished 7 or 8 sections 🤯

Agenda¶

  • The normal distribution.
  • The Central Limit Theorem.

Recap: Standard units¶

SAT scores range from 0 to 1600. The distribution of SAT scores has a mean of 950 and a standard deviation of 300. Your friend tells you that their SAT score, in standard units, is 2.5. What do you conclude?
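
One way to reason about this question is to convert the score back to original units. The following is a minimal sketch that uses only the numbers stated above; the variable names are illustrative.

```python
# Values stated in the recap above.
sat_mean = 950
sat_sd = 300
z = 2.5

# Invert the standard-units formula: x = mean + z * SD.
score = sat_mean + z * sat_sd
print(score)          # 1700.0
print(score > 1600)   # True
```

Since 1700 exceeds the stated maximum of 1600, a score of 2.5 in standard units is not possible for this distribution.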

The normal distribution¶

Recap: The standard normal distribution¶

  • The standard normal distribution can be thought of as a "continuous histogram."
  • Like a histogram:
    • The area between $a$ and $b$ is the proportion of values between $a$ and $b$.
    • The total area underneath the normal curve is 1.
  • The standard normal distribution's cumulative distribution function (CDF) describes the proportion of values in the distribution less than or equal to $z$, for all values of $z$.
    • In Python, we use the function scipy.stats.norm.cdf.
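
As a quick illustration of how the CDF gives areas, here is a small sketch using `scipy.stats.norm.cdf`. The endpoints `a` and `b` are arbitrary values chosen for the example.

```python
from scipy import stats

# By symmetry, half of the values are at or below z = 0.
half = stats.norm.cdf(0)

# The area between a and b is the difference of two CDF values.
a, b = -1.5, 0.5
area = stats.norm.cdf(b) - stats.norm.cdf(a)

print(half)
print(area)
```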

Using the normal distribution¶

Last time, we looked at a data set of heights and weights of 5000 adult males.

In [2]:
height_and_weight = bpd.read_csv('data/height_and_weight.csv')
height_and_weight
Out[2]:
Height Weight
0 73.85 241.89
1 68.78 162.31
2 74.11 212.74
... ... ...
4997 67.01 199.20
4998 71.56 185.91
4999 70.35 198.90

5000 rows × 2 columns

Both variables are roughly normal. What benefit is there to knowing that the two distributions are roughly normal?

Standard units and the normal distribution¶

  • Key idea: The $x$-axis in a plot of the standard normal distribution is in standard units.
    • For instance, the area between -1 and 1 is the proportion of values within 1 standard deviation of the mean.
  • Suppose a distribution is roughly normal. Then, these two quantities are approximately equal:
    • The proportion of values in the distribution between $a$ and $b$.
    • The area between $z(a)$ and $z(b)$ under the standard normal curve. (Recall, $z(x_i) = \frac{x_i - \text{mean of $x$}}{\text{SD of $x$}}$.)
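
The idea above can be sketched as a small helper. The function name `approx_proportion` and the mean of 100 and SD of 15 are hypothetical, chosen only for illustration. The second computation shows that `stats.norm.cdf` can also do the standardization itself through its `loc` and `scale` arguments.

```python
from scipy import stats

def approx_proportion(a, b, mean, sd):
    """Approximate the proportion of values in [a, b] for a roughly normal distribution."""
    z_a = (a - mean) / sd   # a in standard units
    z_b = (b - mean) / sd   # b in standard units
    return stats.norm.cdf(z_b) - stats.norm.cdf(z_a)

# Hypothetical roughly normal variable with mean 100 and SD 15.
via_standard_units = approx_proportion(85, 115, mean=100, sd=15)

# Equivalent: let scipy shift and scale the normal curve directly.
direct = stats.norm.cdf(115, loc=100, scale=15) - stats.norm.cdf(85, loc=100, scale=15)

print(via_standard_units, direct)
```

Both approaches give the same area, because 85 and 115 are exactly 1 SD below and above the mean.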

Example: Proportion of heights between 65 and 70 inches¶

Let's suppose, as is often the case, that we don't have access to the entire distribution of heights, just the mean and SD.

In [3]:
heights = height_and_weight.get('Height')
height_mean = heights.mean()
height_mean
Out[3]:
69.02634590621737
In [4]:
height_std = np.std(heights)
height_std
Out[4]:
2.863075878119538

Using just this information, we can estimate the proportion of heights between 65 and 70 inches:

  1. Convert 65 to standard units.
  2. Convert 70 to standard units.
  3. Use stats.norm.cdf to find the area between (1) and (2).
In [5]:
left = (65 - height_mean) / height_std
left
Out[5]:
-1.4063008029189459
In [6]:
right = (70 - height_mean) / height_std
right
Out[6]:
0.3400727522534686
In [7]:
normal_area(left, right)
In [8]:
from scipy import stats
approximation = stats.norm.cdf(right) - stats.norm.cdf(left)
approximation
Out[8]:
0.5532817187111865

Checking the approximation¶

Since we have access to the entire set of heights, we can compute the true proportion of heights between 65 and 70 inches.

In [9]:
# True proportion of values between 65 and 70.
height_and_weight[
    (height_and_weight.get('Height') >= 65) &
    (height_and_weight.get('Height') <= 70)
].shape[0] / height_and_weight.shape[0]
Out[9]:
0.554
In [10]:
# Approximation using the standard normal curve.
approximation
Out[10]:
0.5532817187111865

Pretty good for an approximation! 🤩

Warning: Standardization doesn't make a distribution normal!¶

Consider the distribution of delays from earlier in the lecture.

In [11]:
delays = bpd.read_csv('data/delays.csv')
delays.plot(kind='hist', y='Delay', bins=np.arange(-20.5, 210, 5), density=True, ec='w', figsize=(10, 5), title='Flight Delays')
plt.xlabel('Delay (minutes)');

The distribution above does not look normal. It won't look normal even if we standardize it. By standardizing a distribution, all we do is shift it horizontally and rescale its axes; the shape itself doesn't change.
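
To see that the shape survives standardization, here is a minimal sketch on synthetic data. The exponential sample is a hypothetical stand-in for right-skewed delays, not the actual `delays` data, and skewness (via `scipy.stats.skew`) is used as one numerical summary of shape.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical right-skewed data standing in for flight delays.
values = rng.exponential(scale=20, size=10_000)

# Convert to standard units: subtract the mean, divide by the SD.
standardized = (values - values.mean()) / np.std(values)

# Skewness measures asymmetry; shifting and rescaling do not change it.
skew_before = stats.skew(values)
skew_after = stats.skew(standardized)

print(skew_before, skew_after)
```

The standardized values have mean 0 and SD 1, but they are exactly as skewed as before.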

In [12]:
HTML('data/delay_anim.html')

Center and spread, revisited¶

Special cases¶

  • As we just discovered, the $x$-axis in the standard normal curve represents standard units.
  • Oftentimes, we want to know the proportion of values within $z$ standard deviations of the mean.

| Range | Percent of Values (Normal Distribution) |
| --- | --- |
| $\text{mean} \pm 1 \: \text{SD}$ | $\approx 68\%$ |
| $\text{mean} \pm 2 \: \text{SDs}$ | $\approx 95\%$ |
| $\text{mean} \pm 3 \: \text{SDs}$ | $\approx 99.73\%$ |

68% of values are within 1 SD of the mean¶

In [13]:
normal_area(-1, 1, bars=True)
In [14]:
stats.norm.cdf(1) - stats.norm.cdf(-1)
Out[14]:
0.6826894921370859

This means that if a variable follows a normal distribution, approximately 68% of values will be within 1 SD of the mean.

95% of values are within 2 SDs of the mean¶

In [15]:
normal_area(-2, 2, bars=True)
In [16]:
stats.norm.cdf(2) - stats.norm.cdf(-2)
Out[16]:
0.9544997361036416
  • If a variable follows a normal distribution, approximately 95% of values will be within 2 SDs of the mean.
  • Consequently, approximately 5% of values will be outside this range.
  • Since the normal curve is symmetric,
    • approximately 2.5% of values will be more than 2 SDs above the mean, and
    • approximately 2.5% of values will be more than 2 SDs below the mean.
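
The two tail areas can be checked directly with the CDF. This is a small sketch; the variable names are illustrative.

```python
from scipy import stats

# Proportion of values more than 2 SDs below the mean.
below = stats.norm.cdf(-2)

# Proportion of values more than 2 SDs above the mean.
above = 1 - stats.norm.cdf(2)

print(below, above)
```

Each tail comes out slightly under 2.5%, because the central area is closer to 95.45% than to exactly 95%.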

Chebyshev's inequality and the normal distribution¶

  • Last class, we looked at Chebyshev's inequality, which stated that the proportion of data within $z$ SDs of the mean is at least $1-\frac{1}{z^2}$.
    • This works for any distribution, and is a lower bound.
  • If we know that the distribution is normal, we can be even more specific:

| Range | All Distributions (via Chebyshev's inequality) | Normal Distribution |
| --- | --- | --- |
| mean $\pm \ 1$ SD | $\geq 0\%$ | $\approx 68\%$ |
| mean $\pm \ 2$ SDs | $\geq 75\%$ | $\approx 95\%$ |
| mean $\pm \ 3$ SDs | $\geq 88.8\%$ | $\approx 99.73\%$ |

  • The percentages you see for normal distributions above are approximate, but are not lower bounds.
    • Important: They apply to all normal distributions, standardized or not. This is because all normal distributions are just stretched and shifted versions of the standard normal distribution.
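
The comparison between Chebyshev's bound and the normal distribution can be reproduced with a short loop. This is a sketch; the list name `rows` is illustrative.

```python
from scipy import stats

rows = []
for z in [1, 2, 3]:
    # Chebyshev's lower bound, valid for any distribution.
    chebyshev_bound = 1 - 1 / z**2
    # Proportion within z SDs of the mean for a normal distribution.
    normal_proportion = stats.norm.cdf(z) - stats.norm.cdf(-z)
    rows.append((z, chebyshev_bound, normal_proportion))

for z, bound, prop in rows:
    print(f'within {z} SD(s): at least {bound:.4f} (Chebyshev), about {prop:.4f} (normal)')
```

In every row the normal proportion is at least as large as the Chebyshev bound, as it must be.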

Inflection points¶

  • Last class, we mentioned that the standard normal curve has inflection points at $z = \pm 1$.
    • An inflection point is where a curve goes from "opening down" 🙁 to "opening up" 🙂, or vice versa.
  • We know that the $x$-axis of the standard normal curve represents standard units, so the inflection points are at 1 standard deviation above and below the mean.
  • This means that if a distribution is roughly normal, we can determine its standard deviation by finding the distance between each inflection point and the mean.
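
To illustrate the link between inflection points and the SD, the sketch below locates the inflection points of a normal curve numerically. The values `mu = 69` and `sigma = 3` are hypothetical, and second differences on an evenly spaced grid stand in for the second derivative.

```python
import numpy as np

mu, sigma = 69, 3  # hypothetical mean and SD

# Evaluate the normal curve on an evenly spaced grid.
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 2001)
y = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Second differences have the same sign as the second derivative.
second_diff = y[:-2] - 2 * y[1:-1] + y[2:]
opening_up = second_diff > 0

# Inflection points are where the curve switches between opening up and opening down.
switch = np.flatnonzero(opening_up[1:] != opening_up[:-1])
inflection_points = x[1:-1][switch]

# The distance from the mean to either inflection point estimates the SD.
estimated_sd = (inflection_points[1] - inflection_points[0]) / 2
print(inflection_points, estimated_sd)
```

The two points land within one grid step of `mu - sigma` and `mu + sigma`.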

Example: Inflection points¶

Remember: The distribution of heights is roughly normal, but it is not a standard normal distribution.

In [17]:
height_and_weight.plot(kind='hist', y='Height', density=True, ec='w', bins=40, alpha=0.8, figsize=(10, 5));
plt.xticks(np.arange(60, 78, 2));
  • The center appears to be around 69.
  • The inflection points appear to be around 66 and 72.
  • So, the standard deviation is roughly 72 - 69 = 3.
In [18]:
np.std(height_and_weight.get('Height'))
Out[18]:
2.863075878119538

The Central Limit Theorem¶

Back to flight delays ✈️¶

The distribution of flight delays that we've been looking at is not roughly normal.

In [19]:
delays = bpd.read_csv('data/delays.csv')
delays.plot(kind='hist', y='Delay', bins=np.arange(-20.5, 210, 5), density=True, ec='w', figsize=(10, 5), title='Population Distribution of Flight Delays')
plt.xlabel('Delay (minutes)');
In [20]:
delays.get('Delay').describe()
Out[20]:
count    13825.00
mean        16.66
std         39.48
           ...   
50%          2.00
75%         18.00
max        580.00
Name: Delay, Length: 8, dtype: float64

Empirical distribution of a sample statistic¶

  • Before we started discussing center, spread, and the normal distribution, our focus was on bootstrapping.
  • We used the bootstrap to estimate the distribution of a sample statistic (e.g. sample mean or sample median), using just a single sample.
  • We did this to construct confidence intervals for a population parameter.
  • Important: For now, we'll suppose our parameter of interest is the population mean, so we're interested in estimating the distribution of the sample mean.

Empirical distribution of the sample mean¶

Since we have access to the population of flight delays, let's remind ourselves what the distribution of the sample mean looks like by drawing samples repeatedly from the population.

  • This is not bootstrapping.
  • This is also not practical. If we had access to a population, we wouldn't need to understand the distribution of the sample mean – we'd be able to compute the population mean directly.
In [21]:
sample_means = np.array([])
repetitions = 2000

for i in np.arange(repetitions):
    sample = delays.sample(500)
    sample_mean = sample.get('Delay').mean()
    sample_means = np.append(sample_means, sample_mean)
    
sample_means
Out[21]:
array([15.65, 17.02, 16.58, ..., 18.76, 16.87, 13.23])
In [22]:
bpd.DataFrame().assign(sample_means=sample_means).plot(kind='hist', density=True, ec='w', alpha=0.65, bins=20, figsize=(10, 5));
plt.scatter([sample_means.mean()], [-0.005], marker='^', color='green', s=250)
plt.axvline(sample_means.mean(), color='green', label=f'mean={np.round(sample_means.mean(), 2)}', linewidth=4)
plt.xlim(5, 30)
plt.ylim(-0.013, 0.26)
plt.legend();
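
For contrast, the sketch below places repeated sampling from a population next to bootstrapping from a single sample. It uses plain NumPy and a synthetic exponential population as a hypothetical stand-in for the flight delays, so the numbers will not match the `delays` data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical right-skewed population (not the real delays data).
population = rng.exponential(scale=20, size=10_000)
n, repetitions = 500, 1000

# Repeated sampling: every sample is a fresh draw from the population.
repeated_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(repetitions)
])

# Bootstrapping: draw one sample, then resample it with replacement.
original_sample = rng.choice(population, size=n, replace=False)
boot_means = np.array([
    rng.choice(original_sample, size=n, replace=True).mean()
    for _ in range(repetitions)
])

print(np.std(repeated_means), np.std(boot_means))
```

The repeated-sampling means are centered near the population mean, while the bootstrapped means are centered near the mean of the one original sample. The two spreads are similar, which is why the bootstrap is useful when only a single sample is available.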