In [1]:

```
# Set up packages for lecture. Don't worry about understanding this code,
# but make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
set_matplotlib_formats("svg")
plt.style.use('ggplot')

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

# Animations
import time
from IPython.display import display, HTML, IFrame, clear_output
import ipywidgets as widgets

import warnings
warnings.filterwarnings('ignore')

def normal_curve(x, mu=0, sigma=1):
    return 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))

def normal_area(a, b, bars=False):
    x = np.linspace(-4, 4, 1000)
    y = normal_curve(x)
    ix = (x >= a) & (x <= b)
    plt.figure(figsize=(10, 5))
    plt.plot(x, y, color='black')
    plt.fill_between(x[ix], y[ix], color='gold')
    if bars:
        plt.axvline(a, color='red')
        plt.axvline(b, color='red')
    plt.title(f'Area between {np.round(a, 2)} and {np.round(b, 2)}')
    plt.show()

def show_clt_slides():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vTcJd3U1H1KoXqBFcWGKFUPjZbeW4oiNZZLCFY8jqvSDsl4L1rRTg7980nPs1TGCAecYKUZxH5MZIBh/embed?start=false&loop=false&delayms=3000&rm=minimal"
    width = 960
    height = 509
    display(IFrame(src, width, height))
```

- Lab 6 is due on **Saturday 5/27 at 11:59PM**.
- Homework 6 – the final homework of the quarter – is due on **Tuesday 5/30 at 11:59PM**.
- The Final Project is due on **Tuesday 6/6 at 11:59PM**.
- The Grade Report has been updated – take a look on Gradescope.

- The normal distribution.
- The Central Limit Theorem.

SAT scores range from 0 to 1600. The distribution of SAT scores has a mean of 950 and a standard deviation of 300. Your friend tells you that their SAT score, in standard units, is 2.5. What do you conclude?
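A quick arithmetic sketch of how to check the claim: standard units convert back to the original scale via $\text{value} = \text{mean} + z \cdot \text{SD}$.

```python
# Convert the friend's claimed 2.5 standard units back to the SAT scale
# using value = mean + z * SD.
sat_mean = 950
sat_sd = 300
claimed_score = sat_mean + 2.5 * sat_sd
claimed_score
```

Since the resulting score of 1700 exceeds the maximum possible score of 1600, the friend's claim can't be right.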

- The standard normal distribution can be thought of as a "continuous histogram."

In [2]:

```
normal_area(-1, 1)
```

- Like a histogram:
    - The **area** between $a$ and $b$ is the **proportion** of values between $a$ and $b$.
    - The total area underneath the normal curve is 1.

- **Key idea: The $x$-axis in a plot of the _standard_ normal distribution is in _standard_ units.**
    - For instance, the area between -1 and 1 is the proportion of values within 1 standard deviation of the mean.

- The standard normal distribution's **cumulative distribution function** (CDF) describes the proportion of values in the distribution less than or equal to $z$, for all values of $z$.
- In Python, we use the function `scipy.stats.norm.cdf`.
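As a quick sketch of what `scipy.stats.norm.cdf` returns (standard scipy usage, not code from the lecture):

```python
from scipy import stats

# Proportion of values less than or equal to z = 0. The standard normal
# curve is symmetric about 0, so exactly half the area lies to the left.
half = stats.norm.cdf(0)

# The area between two points comes from subtracting two CDF values.
within_1_sd = stats.norm.cdf(1) - stats.norm.cdf(-1)
half, within_1_sd
```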

Last time, we looked at a data set of heights and weights of 5000 adult males.

In [3]:

```
height_and_weight = bpd.read_csv('data/height_and_weight.csv')
height_and_weight
```

Out[3]:

| | Height | Weight |
|---|---|---|
| 0 | 73.85 | 241.89 |
| 1 | 68.78 | 162.31 |
| 2 | 74.11 | 212.74 |
| ... | ... | ... |
| 4997 | 67.01 | 199.20 |
| 4998 | 71.56 | 185.91 |
| 4999 | 70.35 | 198.90 |

5000 rows × 2 columns

In [4]:

```
height_and_weight.plot(kind='hist', density=True, ec='w', bins=60, alpha=0.8, figsize=(10, 5));
```

Both variables are roughly normal. What *benefit* is there to knowing that the two distributions are roughly normal?

Let's suppose, as is often the case, that we don't have access to the entire distribution of heights, just the mean and SD.

In [5]:

```
heights = height_and_weight.get('Height')
height_mean = heights.mean()
height_mean
```

Out[5]:

69.02634590621741

In [6]:

```
height_std = np.std(heights)
height_std
```

Out[6]:

2.863075878119538

Using just this information, we can estimate the proportion of heights between 65 and 70 inches:

1. Convert 65 to standard units.
2. Convert 70 to standard units.
3. Use `scipy.stats.norm.cdf` to find the area between (1) and (2).

In [7]:

```
left = (65 - height_mean) / height_std
left
```

Out[7]:

-1.406300802918961

In [8]:

```
right = (70 - height_mean) / height_std
right
```

Out[8]:

0.34007275225345374

In [9]:

```
normal_area(left, right)
```

In [10]:

```
from scipy import stats
approximation = stats.norm.cdf(right) - stats.norm.cdf(left)
approximation
```

Out[10]:

0.5532817187111831

Since we have access to the entire set of heights, we can compute the true proportion of heights between 65 and 70 inches.

In [11]:

```
# True proportion of values between 65 and 70.
height_and_weight[
    (height_and_weight.get('Height') >= 65) &
    (height_and_weight.get('Height') <= 70)
].shape[0] / height_and_weight.shape[0]
```

Out[11]:

0.554

In [12]:

```
# Approximation using the standard normal curve.
approximation
```

Out[12]:

0.5532817187111831

Pretty good for an approximation! 🤩

- Last class, we looked at Chebyshev's inequality, which states that the proportion of values within $z$ SDs of the mean is **at least** $1 - \frac{1}{z^2}$.
    - This works for **any** distribution, and is a lower bound.

- If we know that the distribution is normal, we can be even more specific!

| Range | All Distributions (via Chebyshev's inequality) | Normal Distribution |
|---|---|---|
| mean $\pm \ 1$ SD | $\geq 0\%$ | $\approx 68\%$ |
| mean $\pm \ 2$ SDs | $\geq 75\%$ | $\approx 95\%$ |
| mean $\pm \ 3$ SDs | $\geq 88.8\%$ | $\approx 99.73\%$ |

Remember, the values on the $x$-axis for the standard normal curve are in standard units. So, the proportion of values within 1 SD of the mean is the area under the standard normal curve between -1 and 1.

In [13]:

```
normal_area(-1, 1, bars=True)
```

In [14]:

```
stats.norm.cdf(1) - stats.norm.cdf(-1)
```

Out[14]:

0.6826894921370859

This means that if a variable follows a normal distribution, approximately 68% of values will be within 1 SD of the mean.

In [15]:

```
normal_area(-2, 2, bars=True)
```

In [16]:

```
stats.norm.cdf(2) - stats.norm.cdf(-2)
```

Out[16]:

0.9544997361036416

- If a variable follows a normal distribution, approximately 95% of values will be within 2 SDs of the mean.
- Consequently, 5% of values will be outside this range.
- Since the normal curve is symmetric,
- 2.5% of values will be more than 2 SDs above the mean, and
- 2.5% of values will be more than 2 SDs below the mean.
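This symmetry can be verified directly with `scipy.stats.norm.cdf` (a quick check, not code from the lecture):

```python
from scipy import stats

# Tail areas beyond 2 SDs on each side. By symmetry they are equal,
# and each is roughly 2.5% of the total area.
lower_tail = stats.norm.cdf(-2)
upper_tail = 1 - stats.norm.cdf(2)
lower_tail, upper_tail
```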

| Range | All Distributions (via Chebyshev's inequality) | Normal Distribution |
|---|---|---|
| mean $\pm \ 1$ SD | $\geq 0\%$ | $\approx 68\%$ |
| mean $\pm \ 2$ SDs | $\geq 75\%$ | $\approx 95\%$ |
| mean $\pm \ 3$ SDs | $\geq 88.8\%$ | $\approx 99.73\%$ |

The percentages for normal distributions above are approximations, not lower bounds.

**Important**: They apply to all normal distributions, standardized or not. This is because all normal distributions are just stretched and shifted versions of the standard normal distribution.
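One way to see this in code (a sketch; `loc` and `scale` are scipy's parameter names for the mean and SD): `stats.norm.cdf` can work with a non-standard normal distribution directly, and the result matches what we get by first converting to standard units. The mean and SD below are the height values from earlier, rounded.

```python
from scipy import stats

# Height mean and SD from earlier in the lecture (rounded).
height_mean = 69.03
height_std = 2.86

# Area to the left of 70 inches, computed two equivalent ways.
direct = stats.norm.cdf(70, loc=height_mean, scale=height_std)
via_standard_units = stats.norm.cdf((70 - height_mean) / height_std)
direct, via_standard_units
```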

- Last class, we mentioned that the standard normal curve has inflection points at $z = \pm 1$.
- An inflection point is where a curve goes from "opening down" 🙁 to "opening up" 🙂.

In [17]:

```
normal_area(-1, 1)
```

- We know that the $x$-axis of the standard normal curve represents standard units, so the inflection points are at 1 standard deviation above and below the mean.

- This means that if a distribution is roughly normal, we can determine its standard deviation by finding the distance between each inflection point and the mean.

Remember: The distribution of heights is roughly normal, but it is *not* a *standard* normal distribution.

In [18]:

```
height_and_weight.plot(kind='hist', y='Height', density=True, ec='w', bins=40, alpha=0.8, figsize=(10, 5));
plt.xticks(np.arange(60, 78, 2));
```

- The center appears to be around 69.
- The inflection points appear to be around 66 and 72.
- So, the standard deviation is roughly 72 - 69 = 3.

In [19]:

```
np.std(height_and_weight.get('Height'))
```

Out[19]:

2.863075878119538

The distribution of flight delays that we've been looking at is *not* roughly normal.

In [20]:

```
delays = bpd.read_csv('data/delays.csv')
delays.plot(kind='hist', y='Delay', bins=np.arange(-20.5, 210, 5), density=True, ec='w', figsize=(10, 5), title='Population Distribution of Flight Delays')
plt.xlabel('Delay (minutes)');
```

In [21]:

```
delays.get('Delay').describe()
```

Out[21]:

```
count    13825.00
mean        16.66
std         39.48
             ...
50%          2.00
75%         18.00
max        580.00
Name: Delay, Length: 8, dtype: float64
```

- Before we started discussing center, spread, and the normal distribution, our focus was on bootstrapping.

- We used bootstrapping to estimate **the distribution of a sample statistic (e.g. the sample mean or sample median)**, using just a single sample.

- We did this to construct confidence intervals for a population parameter.

- **Important**: For now, we'll suppose our parameter of interest is the population mean, **so we're interested in estimating the distribution of the sample mean**.

- What we're soon going to discover is a technique for **finding the distribution of the sample mean and creating a confidence interval, without needing to bootstrap**. Think of this as a shortcut to bootstrapping.

Since we have access to the population of flight delays, let's remind ourselves what the distribution of the sample mean looks like by drawing samples repeatedly from the population.

- This is **not bootstrapping**.
- This is also **not practical**. If we had access to a population, we wouldn't need to understand the distribution of the sample mean – we'd be able to compute the population mean directly.

In [22]:

```
sample_means = np.array([])
repetitions = 2000
for i in np.arange(repetitions):
    sample = delays.sample(500)
    sample_mean = sample.get('Delay').mean()
    sample_means = np.append(sample_means, sample_mean)
sample_means
```

Out[22]:

array([17.06, 16.39, 16.58, ..., 12.78, 16.21, 15.36])

In [23]:

```
bpd.DataFrame().assign(sample_means=sample_means).plot(kind='hist', density=True, ec='w', alpha=0.65, bins=20, figsize=(10, 5));
plt.scatter([sample_means.mean()], [-0.005], marker='^', color='green', s=250)
plt.axvline(sample_means.mean(), color='green', label=f'mean={np.round(sample_means.mean(), 2)}', linewidth=4)
plt.xlim(5, 30)
plt.ylim(-0.013, 0.26)
plt.legend();
```

Notice that this distribution is roughly normal, even though the population distribution was not! This distribution is centered at the population mean.

The Central Limit Theorem (CLT) says that the probability distribution of the **sum or mean** of a large random sample drawn with replacement will be roughly normal, regardless of the distribution of the population from which the sample is drawn.

While the formulas we're about to introduce only work for sample means, it's important to remember that the statement above also holds true for sample sums.
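A minimal simulation sketch of the sample-sum version (not from the lecture; it uses a skewed exponential population as a stand-in for the flight delays):

```python
import numpy as np

np.random.seed(42)

# A very non-normal (right-skewed) population.
population = np.random.exponential(scale=16, size=100_000)

# Sums of many large samples drawn with replacement.
sample_sums = np.array([
    np.random.choice(population, 500, replace=True).sum()
    for _ in range(2000)
])

# The distribution of sample_sums is roughly bell-shaped and centered
# near 500 times the population mean, even though the population is skewed.
sample_sums.mean(), 500 * population.mean()
```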

**Shape**: The CLT says that the distribution of the sample mean is roughly normal, no matter what the population looks like.

**Center**: This distribution is centered at the population mean.

**Spread**: What is the standard deviation of the distribution of the sample mean? How is it impacted by the sample size?

The function `sample_mean_delays` takes in an integer `sample_size`, and:

1. Takes a sample of size `sample_size` directly from the population.
2. Computes the mean of the sample.
3. Repeats steps 1 and 2 above 2000 times, and returns an array of the resulting means.

In [24]:

```
def sample_mean_delays(sample_size):
    sample_means = np.array([])
    for i in np.arange(2000):
        sample = delays.sample(sample_size)
        sample_mean = sample.get('Delay').mean()
        sample_means = np.append(sample_means, sample_mean)
    return sample_means
```

Let's call `sample_mean_delays` on several values of `sample_size`.

In [25]:

```
sample_means = {}
sample_sizes = [5, 10, 50, 100, 200, 400, 800, 1600]
for size in sample_sizes:
    sample_means[size] = sample_mean_delays(size)
```

Let's look at the resulting distributions.

In [26]:

```
# Plot the resulting distributions.
bins = np.arange(5, 30, 0.5)
for size in sample_sizes:
    bpd.DataFrame().assign(data=sample_means[size]).plot(kind='hist', bins=bins, density=True, ec='w',
                                                         title=f'Distribution of the Sample Mean for Samples of Size {size}',
                                                         figsize=(8, 4))
    plt.legend('');
    plt.show()
    time.sleep(1.5)
    if size != sample_sizes[-1]:
        clear_output()
```

What do you notice? 🤔

- As we increase our sample size, the distribution of the sample mean gets narrower, and so its standard deviation decreases.
- Can we determine exactly how much it decreases by?

In [27]:

```
# Compute the standard deviation of each distribution.
sds = np.array([])
for size in sample_sizes:
    sd = np.std(sample_means[size])
    sds = np.append(sds, sd)
sds
```

Out[27]:

array([17.78, 12.25, 5.69, 3.93, 2.77, 1.99, 1.36, 0.91])

In [28]:

```
observed = bpd.DataFrame().assign(
    SampleSize=sample_sizes,
    StandardDeviation=sds
)
observed.plot(kind='scatter', x='SampleSize', y='StandardDeviation', s=70,
              title="Standard Deviation of the Distribution of the Sample Mean vs. Sample Size", figsize=(10, 5));
```