In [1]:

```
# Set up packages for lecture. Don't worry about understanding this code,
# but make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
set_matplotlib_formats("svg")
plt.style.use('ggplot')
from scipy import stats
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

# Animations
import time
from IPython.display import display, HTML, IFrame, clear_output
import ipywidgets as widgets

import warnings
warnings.filterwarnings('ignore')

def normal_curve(x, mu=0, sigma=1):
    return (1 / np.sqrt(2 * np.pi * sigma ** 2)) * np.exp((- (x - mu) ** 2) / (2 * sigma ** 2))

def normal_area(a, b, bars=False, title=None):
    x = np.linspace(-4, 4)
    y = normal_curve(x)
    ix = (x >= a) & (x <= b)
    plt.plot(x, y, color='black')
    plt.fill_between(x[ix], y[ix], color='gold')
    if bars:
        plt.axvline(a, color='red')
        plt.axvline(b, color='red')
    if title:
        plt.title(title)
    else:
        plt.title(f'Area between {np.round(a, 2)} and {np.round(b, 2)}')
    plt.show()

def area_within(z):
    title = f'Proportion of values within {z} SDs of the mean: {np.round(stats.norm.cdf(z) - stats.norm.cdf(-z), 4)}'
    normal_area(-z, z, title=title)

def show_clt_slides():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vTcJd3U1H1KoXqBFcWGKFUPjZbeW4oiNZZLCFY8jqvSDsl4L1rRTg7980nPs1TGCAecYKUZxH5MZIBh/embed?start=false&loop=false&delayms=3000&rm=minimal"
    width = 960
    height = 509
    display(IFrame(src, width, height))
```

- We reached our 80% participation goal for the Mid-Quarter Survey. As a result, your score on Gradescope for the Midterm Exam is now 2 points higher than it was before!
- Great job on the Midterm Project! Scores are available and reflected in the recently updated Grade Report.
- Quiz 3 is on **Wednesday in discussion section**.
    - It covers Lectures 14 through 17.
    - Prepare by solving relevant problems on practice.dsc10.com.

- With holidays, the schedule of due dates has shifted a bit.
    - Lab 5 is released and due on **Saturday 11/18 at 11:59PM**.
    - Homework 5 is not yet released, but will be due **Tuesday 11/21 at 11:59PM**.


- Recap: Standard units and the normal distribution.
- The Central Limit Theorem.
- Using the Central Limit Theorem to create confidence intervals.

SAT scores range from 0 to 1600. The distribution of SAT scores has a mean of 950 and a standard deviation of 300. Your friend tells you that their SAT score, in standard units, is 2.5. What do you conclude?
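One way to check the claim (a minimal sketch; the numbers come straight from the problem statement):

```python
# Undoing standardization: a value that is z standard units above the
# mean equals mean + z * SD in original units.
mean, sd = 950, 300
z = 2.5
score = mean + z * sd

# 950 + 2.5 * 300 = 1700, which exceeds the maximum possible SAT score
# of 1600 -- so the claimed score can't be right.
print(score)
```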

- Last class, we focused on the *standard* normal distribution, which has a mean of 0 and a standard deviation of 1.

In [2]:

```
plt.figure(figsize=(10, 5))
x = np.linspace(-40, 40, 10000)
pairs = [(0, 1, 'black'), (10, 1, 'blue'), (-15, 4, 'red'), (20, 0.5, 'green')]
for pair in pairs:
    y = normal_curve(x, mu=pair[0], sigma=pair[1])
    plt.plot(x, y, color=pair[2], linewidth=3, label=f'Normal(mean={pair[0]}, SD={pair[1]})')
plt.xlim(-40, 40)
plt.ylim(0, 1)
plt.title('Normal Distributions with Different Means and Standard Deviations')
plt.legend();
```

The distribution of flight delays that we've been looking at is *not* roughly normal.

In [3]:

```
delays = bpd.read_csv('data/united_summer2015.csv')
delays.plot(kind='hist', y='Delay', bins=np.arange(-20.5, 210, 5), density=True, ec='w', figsize=(10, 5), title='Population Distribution of Flight Delays')
plt.xlabel('Delay (minutes)');
```

In [4]:

```
delays.get('Delay').describe()
```

Out[4]:

count    13825.00
mean        16.66
std         39.48
             ...
50%          2.00
75%         18.00
max        580.00
Name: Delay, Length: 8, dtype: float64

- We used bootstrapping to estimate **the distribution of a sample statistic (e.g. sample mean or sample median)**, using just a single sample.

- We did this to construct confidence intervals for a population parameter.

- **Important**: For now, we'll suppose our parameter of interest is the population mean, **so we're interested in estimating the distribution of the sample mean**.

- What we're soon going to discover is a technique for **finding the distribution of the sample mean and creating a confidence interval, without needing to bootstrap**. Think of this as a shortcut to bootstrapping.

Since we have access to the population of flight delays, let's remind ourselves what the distribution of the sample mean looks like by drawing samples repeatedly from the population.

- This is **not bootstrapping**.
- This is also **not practical**. If we had access to a population, we wouldn't need to understand the distribution of the sample mean – we'd be able to compute the population mean directly.

In [5]:

```
sample_means = np.array([])
repetitions = 2000
for i in np.arange(repetitions):
    sample = delays.sample(500) # Not bootstrapping!
    sample_mean = sample.get('Delay').mean()
    sample_means = np.append(sample_means, sample_mean)
sample_means
```

Out[5]:

array([16.88, 15. , 16.11, ..., 16.29, 16.45, 15.02])

In [6]:

```
bpd.DataFrame().assign(sample_means=sample_means).plot(kind='hist', density=True, ec='w', alpha=0.65, bins=20, figsize=(10, 5));
plt.scatter([sample_means.mean()], [-0.005], marker='^', color='green', s=250)
plt.axvline(sample_means.mean(), color='green', label=f'mean={np.round(sample_means.mean(), 2)}', linewidth=4)
plt.xlim(5, 30)
plt.ylim(-0.013, 0.26)
plt.legend();
```

The Central Limit Theorem (CLT) says that the probability distribution of the *sum or mean* of a large random sample drawn with replacement will be roughly normal, regardless of the distribution of the population from which the sample is drawn.
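To see the theorem in action without the flights data, here's a minimal self-contained sketch (the exponential population is an assumption, chosen only because it's heavily skewed):

```python
import numpy as np

np.random.seed(0)

# A heavily right-skewed population (an arbitrary stand-in; any
# non-normal population illustrates the same point).
population = np.random.exponential(scale=16, size=100_000)

# Draw many large samples with replacement and record each sample's mean.
sample_means = np.array([
    np.random.choice(population, 500, replace=True).mean()
    for _ in range(2000)
])

# Center: the distribution of sample means sits at the population mean.
print(population.mean(), sample_means.mean())

# Shape: roughly symmetric (mean close to median), despite the skew
# of the population itself.
print(sample_means.mean(), np.median(sample_means))
```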

**Shape**: The CLT says that the distribution of the sample mean is roughly normal, no matter what the population looks like.

**Center**: This distribution is centered at the population mean.

**Spread**: What is the standard deviation of the distribution of the sample mean? How is it impacted by the sample size?

The function `sample_mean_delays` takes in an integer `sample_size`, and:

1. Takes a sample of size `sample_size` directly from the population.
2. Computes the mean of the sample.
3. Repeats steps 1 and 2 above 2000 times, and returns an array of the resulting means.

In [7]:

```
def sample_mean_delays(sample_size):
    sample_means = np.array([])
    for i in np.arange(2000):
        sample = delays.sample(sample_size)
        sample_mean = sample.get('Delay').mean()
        sample_means = np.append(sample_means, sample_mean)
    return sample_means
```

Let's call `sample_mean_delays` on several values of `sample_size`.

In [8]:

```
sample_means = {}
sample_sizes = [5, 10, 50, 100, 200, 400, 800, 1600]
for size in sample_sizes:
    sample_means[size] = sample_mean_delays(size)
```

Let's look at the resulting distributions.

In [9]:

```
# Plot the resulting distributions.
bins = np.arange(5, 30, 0.5)
for size in sample_sizes:
    bpd.DataFrame().assign(data=sample_means[size]).plot(kind='hist', bins=bins, density=True, ec='w', title=f'Distribution of the Sample Mean for Samples of Size {size}', figsize=(8, 4))
    plt.legend('');
    plt.show()
    time.sleep(1.5)
    if size != sample_sizes[-1]:
        clear_output()
```

What do you notice? 🤔

- As we increase our sample size, the distribution of the sample mean gets narrower, and so its standard deviation decreases.
- Can we determine exactly how much it decreases by?

In [10]:

```
# Compute the standard deviation of each distribution.
sds = np.array([])
for size in sample_sizes:
    sd = np.std(sample_means[size])
    sds = np.append(sds, sd)
sds
```

Out[10]:

array([18.01, 12.8 , 5.64, 3.86, 2.81, 1.98, 1.37, 0.95])

In [11]:

```
observed = bpd.DataFrame().assign(
    SampleSize=sample_sizes,
    StandardDeviation=sds
)
observed.plot(kind='scatter', x='SampleSize', y='StandardDeviation', s=70, title="Standard Deviation of the Distribution of the Sample Mean vs. Sample Size", figsize=(10, 5));
```

- As our sample size increases, the standard deviation of the distribution of the sample mean *decreases quickly*.

- Here's the mathematical relationship describing this phenomenon:

$$\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}$$

- This is sometimes called the **square root law**. Its proof is outside the scope of this class; you'll see it if you take an upper-division probability course.

- **Note**: This is **not** saying anything about the standard deviation of a sample itself! It is a statement about the distribution of all possible sample means. If we increase the size of the sample we're taking:
    - It **is not true** ❌ that the SD of our sample will decrease.
    - It **is true** ✅ that the SD of the distribution of all possible sample means of that size will decrease.
If we were to take many, many samples of the same size from a population, and take the mean of each sample, the distribution of the sample mean will have the following characteristics:

**Shape**: The distribution will be roughly normal, regardless of the shape of the population distribution.

**Center**: The distribution will be centered at the population mean.

**Spread**: The distribution's standard deviation will be described by the square root law:

$$\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}$$

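The square root law can be checked numerically; a minimal self-contained sketch, using a synthetic skewed population as a stand-in for the delays data:

```python
import numpy as np

np.random.seed(0)

# Synthetic skewed population (an assumed stand-in; any population works).
population = np.random.exponential(scale=16, size=100_000)
pop_sd = np.std(population)

for size in [100, 400, 1600]:
    # Empirical SD of the distribution of the sample mean...
    means = np.array([
        np.random.choice(population, size, replace=True).mean()
        for _ in range(2000)
    ])
    empirical = np.std(means)
    # ...versus the square root law's prediction, pop SD / sqrt(size).
    predicted = pop_sd / np.sqrt(size)
    print(size, round(empirical, 3), round(predicted, 3))
```

At each sample size, the empirical and predicted SDs land close together, and quadrupling the sample size halves both.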
**🚨 Practical Issue**: The mean and standard deviation of the distribution of the sample mean both depend on the original population, but we typically **don't have access to the population**!

**Idea**: The sample mean and SD are likely to be close to the population mean and SD. So, use them as approximations in the CLT!

- As a result, **we can approximate the distribution of the sample mean, given just a single sample, without ever having to bootstrap!**
- In other words, the CLT is a shortcut to bootstrapping!

Let's take a single sample of size 500 from `delays`.

In [12]:

```
np.random.seed(42)
my_sample = delays.sample(500)
my_sample.get('Delay').describe()
```

Out[12]:

count    500.00
mean      13.01
std       28.00
           ...
50%        3.00
75%       16.00
max      209.00
Name: Delay, Length: 8, dtype: float64

In [13]:

```
resample_means = np.array([])
repetitions = 2000
for i in np.arange(repetitions):
    resample = my_sample.sample(500, replace=True) # Bootstrapping!
    resample_mean = resample.get('Delay').mean()
    resample_means = np.append(resample_means, resample_mean)
resample_means
```

Out[13]:

array([12.65, 11.5 , 11.34, ..., 12.59, 11.89, 12.58])

In [14]:

```
bpd.DataFrame().assign(resample_means=resample_means).plot(kind='hist', density=True, ec='w', alpha=0.65, bins=20, figsize=(10, 5));
plt.scatter([resample_means.mean()], [-0.005], marker='^', color='green', s=250)
plt.axvline(resample_means.mean(), color='green', label=f'mean={np.round(resample_means.mean(), 2)}', linewidth=4)
plt.xlim(7, 20)
plt.ylim(-0.015, 0.35)
plt.legend();
```

The CLT tells us what this distribution will look like, without having to bootstrap!

Suppose all we have access to in practice is a single "original sample." If we were to take many, many samples of the same size from this original sample, and take the mean of each resample, the distribution of the (re)sample mean will have the following characteristics:

**Shape**: The distribution will be roughly normal, regardless of the shape of the original sample's distribution.

**Center**: The distribution will be centered at the **original sample's mean**, which should be close to the population's mean.

**Spread**: The distribution's standard deviation will be described by the square root law:

$$\text{SD of Distribution of Possible Sample Means} \approx \frac{\text{Sample SD}}{\sqrt{\text{sample size}}}$$

Let's test this out!

Using just the original sample, `my_sample`, we estimate that the distribution of the sample mean has the following mean:

In [15]:

```
sample_mean_mean = my_sample.get('Delay').mean()
sample_mean_mean
```

Out[15]:

13.008

and the following standard deviation:

In [16]:

```
sample_mean_sd = np.std(my_sample.get('Delay')) / np.sqrt(my_sample.shape[0])
sample_mean_sd
```

Out[16]:

1.2511114546674091

In [17]:

```
norm_x = np.linspace(7, 20)
norm_y = normal_curve(norm_x, mu=sample_mean_mean, sigma=sample_mean_sd)
bpd.DataFrame().assign(Bootstrapping=resample_means).plot(kind='hist', density=True, ec='w', alpha=0.65, bins=20, figsize=(10, 5));
plt.plot(norm_x, norm_y, color='black', linestyle='--', linewidth=4, label='CLT')
plt.title('Distribution of the Sample Mean, Using Two Methods')
plt.xlim(7, 20)
plt.legend();
```

**Key takeaway**: Given just a single sample, we can use the CLT to estimate the distribution of the sample mean, **without bootstrapping**.

In [18]:

```
show_clt_slides()
```

Now, we can make confidence intervals for population means **without needing to bootstrap**!

- Previously, we bootstrapped to construct confidence intervals.
    - Strategy: Collect one sample, repeatedly resample from it, calculate the statistic on each resample, and look at the middle 95% of resampled statistics.

- But, **if our statistic is the mean**, we can use the CLT.
    - Computationally cheaper – no simulation required!

- In both cases, we use just a single sample to construct our confidence interval.

We already have a single sample, `my_sample`. Let's bootstrap to generate 2000 resample means.

In [19]:

```
my_sample.get('Delay').describe()
```

Out[19]:

count    500.00
mean      13.01
std       28.00
           ...
50%        3.00
75%       16.00
max      209.00
Name: Delay, Length: 8, dtype: float64
In [20]:

```
resample_means = np.array([])
repetitions = 2000
for i in np.arange(repetitions):
    resample = my_sample.sample(500, replace=True)
    resample_mean = resample.get('Delay').mean()
    resample_means = np.append(resample_means, resample_mean)
resample_means
```

Out[20]:

array([14.37, 13.93, 11.34, ..., 16.84, 14.46, 11.4 ])

In [21]:

```
bpd.DataFrame().assign(resample_means=resample_means).plot(kind='hist', density=True, ec='w', alpha=0.65, bins=20, figsize=(10, 5));
plt.scatter([resample_means.mean()], [-0.005], marker='^', color='green', s=250)
plt.axvline(resample_means.mean(), color='green', label=f'mean={np.round(resample_means.mean(), 2)}', linewidth=4)
plt.xlim(7, 20)
plt.ylim(-0.015, 0.35)
plt.legend();
```

In [22]:

```
left_boot = np.percentile(resample_means, 2.5)
right_boot = np.percentile(resample_means, 97.5)
[left_boot, right_boot]
```

Out[22]:

[10.6359, 15.61205]

In [23]:

```
bpd.DataFrame().assign(resample_means=resample_means).plot(kind='hist', y='resample_means', alpha=0.65, bins=20, density=True, ec='w', figsize=(10, 5), title='Distribution of Bootstrapped Sample Means');
plt.plot([left_boot, right_boot], [0, 0], color='gold', linewidth=10, label='95% bootstrap-based confidence interval');
plt.xlim(7, 20);
plt.legend();
```

But we didn't *need* to bootstrap to learn what the distribution of the sample mean looks like. We could instead use the CLT, which tells us that the distribution of the sample mean is normal. Further, its mean and standard deviation are approximately:

In [24]:

```
sample_mean_mean = my_sample.get('Delay').mean()
sample_mean_mean
```

Out[24]:

13.008

In [25]:

```
sample_mean_sd = np.std(my_sample.get('Delay')) / np.sqrt(my_sample.shape[0])
sample_mean_sd
```

Out[25]:

1.2511114546674091

So, the distribution of the sample mean is approximately:

In [26]:

```
plt.figure(figsize=(10, 5))
norm_x = np.linspace(7, 20)
norm_y = normal_curve(norm_x, mu=sample_mean_mean, sigma=sample_mean_sd)
plt.plot(norm_x, norm_y, color='black', linestyle='--', linewidth=4, label='Distribution of the Sample Mean (via the CLT)')
plt.xlim(7, 20)
plt.legend();
```