In [1]:

```
# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
set_matplotlib_formats("svg")
plt.style.use('ggplot')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)
# Animations
from IPython.display import display, IFrame
def show_permutation_testing_summary():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vSovXDonR6EmjrT45h4pY1mwmcKFMWVSdgpbKHC5HNTm9sbG7dojvvCDEQCjuk2dk1oA4gmwMogr8ZL/embed?start=false&loop=false&delayms=3000"
    width = 960
    height = 569
    display(IFrame(src, width, height))

def show_bootstrapping_slides():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vS_iYHJYXSVMMZ-YQVFwMEFR6EFN3FDSAvaMyUm-YJfLQgRMTHm3vI-wWJJ5999eFJq70nWp2hyItZg/embed?start=false&loop=false&delayms=3000"
    width = 960
    height = 509
    display(IFrame(src, width, height))
```

- Lab 5 is due **Saturday 2/25 at 11:59PM**.
- Homework 5 is due **Thursday 3/2 at 11:59PM**.

- Permutation testing examples.
- Are the distributions of weight for babies 👶 born to smoking mothers vs. non-smoking mothers different?
- Are the distributions of pressure drops for footballs 🏈 from two different teams different?

- Bootstrapping 🥾.

Permutation tests help answer questions of the form:

*I have two samples, but no information about any population distributions. Do these samples look like they were drawn from the same population?*


In [2]:

```
babies = bpd.read_csv('data/baby.csv').get(['Maternal Smoker', 'Birth Weight'])
babies
```

Out[2]:

| | Maternal Smoker | Birth Weight |
|---|---|---|
| 0 | False | 120 |
| 1 | False | 113 |
| 2 | True | 128 |
| ... | ... | ... |
| 1171 | True | 130 |
| 1172 | False | 125 |
| 1173 | False | 117 |

1174 rows × 2 columns

**Null Hypothesis**: In the population, birth weights of smokers' babies and non-smokers' babies have the same distribution, and the observed differences in our samples are due to random chance.

**Alternative Hypothesis**: In the population, smokers' babies have lower birth weights than non-smokers' babies, on average. The observed differences in our samples cannot be explained by random chance alone.

- Test statistic: Difference in mean birth weight of non-smokers' babies and smokers' babies.

- Strategy:
- Create a "population" by pooling data from both samples together.
- Randomly divide this "population" into two groups of the same sizes as the original samples.
- Repeat this process, calculating the test statistic for each pair of random groups.
- Generate an empirical distribution of test statistics and see whether the observed statistic is consistent with it.

- Implementation:
- To randomly divide the "population" into two groups of the same sizes as the original samples, we'll just shuffle the group labels and use the shuffled group labels to define the two random groups.
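The strategy above can be sketched end-to-end with plain numpy on a tiny made-up dataset (the labels and values here are hypothetical, not the actual babies data):

```python
import numpy as np

np.random.seed(42)

# Hypothetical pooled data: True = one group, False = the other.
labels = np.array([True] * 6 + [False] * 6)
values = np.array([10.1, 9.8, 10.5, 9.9, 10.0, 10.2,
                   11.0, 11.3, 10.9, 11.1, 11.2, 10.8])

# Observed statistic: mean of the False group minus mean of the True group.
observed = values[~labels].mean() - values[labels].mean()

# Shuffle the labels repeatedly; each shuffle randomly splits the pooled
# "population" into two groups of the same sizes as the originals.
diffs = np.array([])
for _ in range(1000):
    shuffled = np.random.permutation(labels)
    diffs = np.append(diffs, values[~shuffled].mean() - values[shuffled].mean())

# p-value: proportion of simulated statistics at least as large as observed.
p_value = np.count_nonzero(diffs >= observed) / 1000
```

The same four steps (pool, shuffle, compute, repeat) appear in the babypandas code below; only the data container changes.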

In [3]:

```
babies_with_shuffled = babies.assign(
    Shuffled_Labels=np.random.permutation(babies.get('Maternal Smoker'))
)
babies_with_shuffled
```

Out[3]:

| | Maternal Smoker | Birth Weight | Shuffled_Labels |
|---|---|---|---|
| 0 | False | 120 | False |
| 1 | False | 113 | False |
| 2 | True | 128 | False |
| ... | ... | ... | ... |
| 1171 | True | 130 | False |
| 1172 | False | 125 | False |
| 1173 | False | 117 | False |

1174 rows × 3 columns

The `'Maternal Smoker'` column defines the original groups. The `'Shuffled_Labels'` column defines the random groups.

For the original groups:

In [4]:

```
original_groups = babies.groupby('Maternal Smoker').mean()
original_groups
```

Out[4]:

| Maternal Smoker | Birth Weight |
|---|---|
| False | 123.09 |
| True | 113.82 |

In [5]:

```
original_means = original_groups.get('Birth Weight')
observed_difference = original_means.loc[False] - original_means.loc[True]
observed_difference
```

Out[5]:

9.266142572024918

For the random groups:

In [6]:

```
def difference_in_group_means(weights_df):
    group_means = weights_df.groupby('Shuffled_Labels').mean().get('Birth Weight')
    return group_means.loc[False] - group_means.loc[True]

# Shuffle the labels again.
babies_with_shuffled = babies.assign(Shuffled_Labels=np.random.permutation(babies.get('Maternal Smoker')))
difference_in_group_means(babies_with_shuffled)
```

Out[6]:

0.9633438456967838

In [7]:

```
n_repetitions = 500 # The dataset is large, so it takes too long to run if we use 5000 or 10000.

differences = np.array([])
for i in np.arange(n_repetitions):
    # Step 1: Shuffle the labels.
    shuffled_labels = np.random.permutation(babies.get('Maternal Smoker'))

    # Step 2: Put them in a DataFrame.
    shuffled = babies_with_shuffled.assign(Shuffled_Labels=shuffled_labels)

    # Step 3: Compute the difference in group means and add the result to the differences array.
    difference = difference_in_group_means(shuffled)
    differences = np.append(differences, difference)

differences
```

Out[7]:

array([ 1.06, 0.67, 1.48, ..., 1.68, -2.14, 1.33])

In [8]:

```
(bpd.DataFrame()
 .assign(simulated_diffs=differences)
 .plot(kind='hist', bins=20, density=True, ec='w', figsize=(10, 5))
);
```

- Note that the empirical distribution of the test statistic (difference in means) is centered around 0.
- This matches our intuition – if the null hypothesis is true, there should be no difference in the group means on average.

In [9]:

```
(bpd.DataFrame()
 .assign(simulated_diffs=differences)
 .plot(kind='hist', bins=20, density=True, ec='w', figsize=(10, 5))
);
plt.axvline(observed_difference, color='black', linewidth=4, label='observed difference in means')
plt.legend();
```

In [10]:

```
smoker_p_value = np.count_nonzero(differences >= observed_difference) / 500
smoker_p_value
```

Out[10]:

0.0

- Under the null hypothesis, we rarely see differences as large as 9.26 ounces.

- Still, we can't conclude that smoking *causes* lower birth weight, because there may be other factors at play. For instance, maybe smokers are more likely to drink caffeine, and caffeine causes lower birth weight.

In [11]:

```
show_permutation_testing_summary()
```

Recall, `babies` has two columns.

In [12]:

```
babies.take(np.arange(3))
```

Out[12]:

| | Maternal Smoker | Birth Weight |
|---|---|---|
| 0 | False | 120 |
| 1 | False | 113 |
| 2 | True | 128 |

To randomly assign weights to groups, we shuffled the `'Maternal Smoker'` column. Could we have shuffled the `'Birth Weight'` column instead?

- A. Yes
- B. No
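Either shuffle produces a random re-pairing of labels and values, so the resulting distribution of the test statistic is the same. A quick check on a tiny hypothetical dataset (made-up labels and values, not the babies data):

```python
import numpy as np

np.random.seed(7)

# Tiny hypothetical dataset.
labels = np.array(['A', 'A', 'B', 'B'])
values = np.array([1.0, 2.0, 3.0, 4.0])

def diff_in_means(lab, val):
    # Difference in group means, group 'A' minus group 'B'.
    return val[lab == 'A'].mean() - val[lab == 'B'].mean()

# Statistic computed after shuffling the labels...
shuffle_labels = np.array([diff_in_means(np.random.permutation(labels), values)
                           for _ in range(5000)])

# ...and after shuffling the values instead.
shuffle_values = np.array([diff_in_means(labels, np.random.permutation(values))
                           for _ in range(5000)])

# Both approaches randomly re-pair labels with values, so the two empirical
# distributions of the statistic match (same center, same spread).
```

This is why, in the Deflategate example below, we are free to shuffle the numerical column instead of the labels.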

- On January 18, 2015, the New England Patriots played the Indianapolis Colts for a spot in the Super Bowl.
- The Patriots won, 45-7. They went on to win the Super Bowl.
- After the game, **it was alleged that the Patriots intentionally deflated footballs**, making them easier to catch. This scandal was called "Deflategate."

- Each team brings 12 footballs to the game. Teams use their own footballs while on offense.

- NFL rules stipulate that **each ball must be inflated to between 12.5 and 13.5 pounds per square inch (psi)**.

- Before the game, officials found that all of the Patriots' footballs were at about 12.5 psi, and that all of the Colts' footballs were at about 13.0 psi.
- This pre-game data was not written down.

- At halftime, two officials (Clete Blakeman and Dyrol Prioleau) independently measured the pressures of as many of the 24 footballs as they could.
- They ran out of time before they could finish.

- Note that the relevant quantity is the **change in pressure** from the start of the game to halftime.
  - The Patriots' balls *started* at a lower psi (which is not an issue on its own).
  - The allegations were that the Patriots **deflated** their balls during the game.

In [13]:

```
footballs = bpd.read_csv('data/footballs.csv')
footballs
```

Out[13]:

| | Team | Pressure | PressureDrop |
|---|---|---|---|
| 0 | Patriots | 11.65 | 0.85 |
| 1 | Patriots | 11.03 | 1.48 |
| 2 | Patriots | 10.85 | 1.65 |
| ... | ... | ... | ... |
| 11 | Colts | 12.53 | 0.47 |
| 12 | Colts | 12.72 | 0.28 |
| 13 | Colts | 12.35 | 0.65 |

14 rows × 3 columns

- There are only 14 rows (10 for Patriots footballs, 4 for Colts footballs) since the officials weren't able to record the pressures of every ball.
- The `'Pressure'` column records the average of the two officials' measurements at halftime.
- The `'PressureDrop'` column records the difference between the estimated starting pressure and the average recorded `'Pressure'` of each football.

Did the Patriots' footballs drop in pressure more than the Colts'?

- **Null hypothesis**: The drops in pressure for both teams came from the same distribution; by chance, the Patriots' footballs deflated more.

- **Alternative hypothesis**: The Patriots' footballs deflated more than one would expect due to random chance alone.

Similar to the baby weights example, our test statistic will be the difference between the teams' average pressure drops. We'll calculate the mean drop for the `'Patriots'` minus the mean drop for the `'Colts'`.

In [14]:

```
means = footballs.groupby('Team').mean().get('PressureDrop')
means
```

Out[14]:

Team Colts 0.47 Patriots 1.21 Name: PressureDrop, dtype: float64

In [15]:

```
# Calculate the observed statistic.
observed_difference = means.loc['Patriots'] - means.loc['Colts']
observed_difference
```

Out[15]:

0.7362500000000001

The average pressure drop for the Patriots was about 0.74 psi more than the Colts.

We'll run a permutation test to see if 0.74 psi is a significant difference.

- To do this, we'll need to repeatedly shuffle either the `'Team'` or the `'PressureDrop'` column.
- We'll shuffle the `'PressureDrop'` column.
- Tip: It's a good idea to simulate one value of the test statistic before putting everything in a `for`-loop.

In [16]:

```
# For simplicity, keep only the columns that are necessary for the test:
# One column of group labels and one column of numerical values.
footballs = footballs.get(['Team', 'PressureDrop'])
footballs
```

Out[16]:

| | Team | PressureDrop |
|---|---|---|
| 0 | Patriots | 0.85 |
| 1 | Patriots | 1.48 |
| 2 | Patriots | 1.65 |
| ... | ... | ... |
| 11 | Colts | 0.47 |
| 12 | Colts | 0.28 |
| 13 | Colts | 0.65 |

14 rows × 2 columns

In [17]:

```
# Shuffle one column.
# We chose to shuffle the numerical data (pressure drops), but we could have shuffled the group labels (team names) instead.
shuffled_drops = np.random.permutation(footballs.get('PressureDrop'))
shuffled_drops
```

Out[17]:

array([1.23, 1.48, 0.42, 1.18, 0.28, 1.35, 1.65, 0.47, 0.85, 0.65, 1.8 , 0.47, 0.72, 1.38])

In [18]:

```
# Add the shuffled column back to the DataFrame.
shuffled = footballs.assign(Shuffled_Drops=shuffled_drops)
shuffled
```

Out[18]:

| | Team | PressureDrop | Shuffled_Drops |
|---|---|---|---|
| 0 | Patriots | 0.85 | 1.23 |
| 1 | Patriots | 1.48 | 1.48 |
| 2 | Patriots | 1.65 | 0.42 |
| ... | ... | ... | ... |
| 11 | Colts | 0.47 | 0.47 |
| 12 | Colts | 0.28 | 0.72 |
| 13 | Colts | 0.65 | 1.38 |

14 rows × 3 columns

In [19]:

```
# Calculate the group means for the two randomly created groups.
team_means = shuffled.groupby('Team').mean().get('Shuffled_Drops')
team_means
```

Out[19]:

Team Colts 1.09 Patriots 0.96 Name: Shuffled_Drops, dtype: float64

In [20]:

```
# Calculate the difference in group means (Patriots minus Colts) for the randomly created groups.
team_means.loc['Patriots'] - team_means.loc['Colts']
```

Out[20]:

-0.13874999999999993

- Repeat the process many times by wrapping it inside a `for`-loop.
- Keep track of the difference in group means in an array, appending each time.
- Optionally, create a function to calculate the difference in group means.

In [21]:

```
def difference_in_mean_pressure_drops(pressures_df):
    team_means = pressures_df.groupby('Team').mean().get('Shuffled_Drops')
    return team_means.loc['Patriots'] - team_means.loc['Colts']
```

In [22]:

```
n_repetitions = 5000 # The dataset is much smaller than in the baby weights example, so a larger number of repetitions will still run quickly.

differences = np.array([])
for i in np.arange(n_repetitions):
    # Step 1: Shuffle the pressure drops.
    shuffled_drops = np.random.permutation(footballs.get('PressureDrop'))

    # Step 2: Put them in a DataFrame.
    shuffled = footballs.assign(Shuffled_Drops=shuffled_drops)

    # Step 3: Compute the difference in group means and add the result to the differences array.
    difference = difference_in_mean_pressure_drops(shuffled)
    differences = np.append(differences, difference)

differences
```

Out[22]:

array([ 0.13, -0.09, 0.05, ..., -0.81, 0.02, -0.37])

In [23]:

```
bpd.DataFrame().assign(SimulatedDifferenceInMeans=differences).plot(kind='hist', bins=20, density=True, ec='w', figsize=(10, 5))
plt.axvline(observed_difference, color='black', linewidth=4, label='observed difference in means')
plt.legend();
```

It doesn't look good for the Patriots. What is the p-value?

- Recall, the p-value is the probability, under the null hypothesis, of seeing a result **as or more extreme** than the observation.
- In this case, that's the probability of the difference in mean pressure drops being greater than or equal to 0.74 psi.

In [24]:

```
np.count_nonzero(differences >= observed_difference) / n_repetitions
```

Out[24]:

0.005

The p-value of 0.005 is below the 0.01 cutoff, so the observed difference is *highly* statistically significant ($p < 0.01$).

- We reject the null hypothesis, as it is unlikely that the difference in mean pressure drops is due to chance alone.
- But this doesn't establish **causation**.
- That is, we can't conclude that the Patriots **intentionally** deflated their footballs.

Quote from an investigative report commissioned by the NFL:

“[T]he average pressure drop of the Patriots game balls exceeded the average pressure drop of the Colts balls by 0.45 to 1.02 psi, depending on various possible assumptions regarding the gauges used, and assuming an initial pressure of 12.5 psi for the Patriots balls and 13.0 for the Colts balls.”

- Many different methods were used to determine whether the drops in pressure were due to chance, including methods from physics.
- We computed an observed difference of 0.74 psi, which is in line with the findings of the report.

- In the end, Tom Brady (quarterback for the Patriots at the time) was suspended 4 games and the team was fined \$1 million.
- The Deflategate Wikipedia article is extremely thorough; give it a read if you're curious!

In [25]:

```
population = bpd.read_csv('data/2021_salaries.csv')
population
```

Out[25]:

| | Year | EmployerType | EmployerName | DepartmentOrSubdivision | ... | EmployerCounty | SpecialDistrictActivities | IncludesUnfundedLiability | SpecialDistrictType |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021 | City | San Diego | Police | ... | San Diego | NaN | False | NaN |
| 1 | 2021 | City | San Diego | Police | ... | San Diego | NaN | False | NaN |
| 2 | 2021 | City | San Diego | Police | ... | San Diego | NaN | False | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12302 | 2021 | City | San Diego | Fire-Rescue | ... | San Diego | NaN | False | NaN |
| 12303 | 2021 | City | San Diego | Fleet Operations | ... | San Diego | NaN | False | NaN |
| 12304 | 2021 | City | San Diego | Fire-Rescue | ... | San Diego | NaN | False | NaN |

12305 rows × 29 columns

In [26]:

```
population.columns
```

Out[26]:

Index(['Year', 'EmployerType', 'EmployerName', 'DepartmentOrSubdivision', 'Position', 'ElectedOfficial', 'Judicial', 'OtherPositions', 'MinPositionSalary', 'MaxPositionSalary', 'ReportedBaseWage', 'RegularPay', 'OvertimePay', 'LumpSumPay', 'OtherPay', 'TotalWages', 'DefinedBenefitPlanContribution', 'EmployeesRetirementCostCovered', 'DeferredCompensationPlan', 'HealthDentalVision', 'TotalRetirementAndHealthContribution', 'PensionFormula', 'EmployerURL', 'EmployerPopulation', 'LastUpdatedDate', 'EmployerCounty', 'SpecialDistrictActivities', 'IncludesUnfundedLiability', 'SpecialDistrictType'], dtype='object')

We only need the `'TotalWages'` column, so let's `get` just that column.

In [27]:

```
population = population.get(['TotalWages'])
population
```

Out[27]:

| | TotalWages |
|---|---|
| 0 | 359138 |
| 1 | 345336 |
| 2 | 336250 |
| ... | ... |
| 12302 | 9 |
| 12303 | 9 |
| 12304 | 4 |

12305 rows × 1 columns

In [28]:

```
population.plot(kind='hist', bins=np.arange(0, 400000, 10000), density=True, ec='w', figsize=(10, 5),
                title='Distribution of Total Wages of San Diego City Employees in 2021');
```

Consider the question

What is the median salary of all San Diego city employees?

What is the right tool to answer this question?

- A. Standard hypothesis testing
- B. Permutation testing
- C. Either of the above
- D. None of the above

- We can use `.median()` to find the median salary of all city employees.
- This is **not** a random quantity.

In [29]:

```
population_median = population.get('TotalWages').median()
population_median
```

Out[29]:

74441.0

- In practice, it is costly and time-consuming to survey **all** 12,000+ employees.
  - More generally, we can't expect to survey all members of the population we care about.

- Instead, we gather salaries for a random sample of, say, 500 people.

- Hopefully, the median of the sample is close to the median of the population.

- The full DataFrame of salaries is the **population**.

- We observe a **sample** of 500 salaries from the population.

- We want to determine the **population median (a parameter)**, but we don't have the whole population, so instead we use the **sample median (a statistic)** as an estimate.

- Hopefully the sample median is close to the population median.

Let's survey 500 employees at random. To do so, we can use the `.sample` method.

In [30]:

```
np.random.seed(38) # Magic to ensure that we get the same results every time this code is run.
# Take a sample of size 500.
my_sample = population.sample(500)
my_sample
```

Out[30]:

| | TotalWages |
|---|---|
| 599 | 167191 |
| 10595 | 18598 |
| 837 | 157293 |
| ... | ... |
| 2423 | 122785 |
| 7142 | 62808 |
| 5792 | 78093 |

500 rows × 1 columns

We won't reassign `my_sample` at any point in this notebook, so it will always refer to this particular sample.

In [31]:

```
# Compute the sample median.
sample_median = my_sample.get('TotalWages').median()
sample_median
```

Out[31]:

72016.0

- Our estimate depended on a random sample.

- If our sample was different, our estimate may have been different, too.

**How different could our estimate have been?**

- Our confidence in the estimate depends on the answer to this question.

- The sample median is a random number.

- It comes from some distribution, which we don't know.

- How different could our estimate have been, if we drew a different sample?
- "Narrow" distribution $\Rightarrow$ not too different.
- "Wide" distribution $\Rightarrow$ quite different.

**What is the distribution of the sample median?**

- One idea: repeatedly collect random samples of 500 **from the population** and compute their medians.
  - This is what we did in Lecture 14 to compute an empirical distribution of the sample mean of flight delays.

In [32]:

```
sample_medians = np.array([])
for i in np.arange(1000):
    median = population.sample(500).get('TotalWages').median()
    sample_medians = np.append(sample_medians, median)
sample_medians
```

Out[32]:

array([81062.5, 77915.5, 70419.5, ..., 71840. , 73618.5, 79238. ])

In [33]:

```
(bpd.DataFrame()
 .assign(SampleMedians=sample_medians)
 .plot(kind='hist', density=True,
       bins=30, ec='w', figsize=(8, 5),
       title='Distribution of the Sample Median of 1000 Samples from the Population')
);
```

- Drawing new samples like this is impractical.
  - If we were able to do this, why not just collect more data in the first place?

- Often, we can't ask for new samples from the population.

- **Key insight:** our original sample, `my_sample`, looks a lot like the population.
  - Their distributions are similar.

In [34]:

```
fig, ax = plt.subplots(figsize=(10, 5))
bins=np.arange(10_000, 300_000, 10_000)
population.plot(kind='hist', y='TotalWages', ax=ax, density=True, alpha=.75, bins=bins, ec='w')
my_sample.plot(kind='hist', y='TotalWages', ax=ax, density=True, alpha=.75, bins=bins, ec='w')
plt.legend(['Population', 'My Sample']);
```

Note that the histogram above shows the distribution of our original sample (`my_sample`), **not** the distribution of sample medians for 1000 samples.

- **Shortcut:** Use the sample in lieu of the population.
  - The sample itself looks like the population.
  - So, resampling from the sample is kind of like sampling from the population.
  - The act of resampling from a sample is called **bootstrapping**.

- In our case specifically:
  - We have a sample of 500 salaries.
  - We want another sample of 500 salaries, but we can't draw from the population.
  - However, the original sample looks like the population.
  - So, let's just **resample from the sample!**

In [35]:

```
show_bootstrapping_slides()
```

- Our goal when bootstrapping is to create a sample of the same size as our original sample.

- Let's repeatedly resample 3 numbers **without replacement** from an original sample of [1, 2, 3].

In [36]:

```
original = [1, 2, 3]
for i in np.arange(10):
    resample = np.random.choice(original, 3, replace=False)
    print("Resample: ", resample, " Median: ", np.median(resample))
```

- Let's repeatedly resample 3 numbers **with replacement** from an original sample of [1, 2, 3].

In [37]:

```
original = [1, 2, 3]
for i in np.arange(10):
    resample = np.random.choice(original, 3, replace=True)
    print("Resample: ", resample, " Median: ", np.median(resample))
```

When we resample without replacement, resamples look just like the original sample.

When we resample with replacement, resamples can have a different mean, median, max, and min than the original sample.

- So, we need to sample
**with replacement**to ensure that our resamples can be different from the original sample.
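As a quick aside on why replacement produces different resamples: a with-replacement resample of size $n$ typically repeats some rows and, on average, leaves out a fraction $(1 - \frac{1}{n})^n \approx \frac{1}{e} \approx 37\%$ of the original rows. This standard fact is easy to check with numpy (the variable names here are ours, for illustration):

```python
import numpy as np

np.random.seed(23)

n = 500
# Row positions chosen for one bootstrap resample (with replacement).
resample_positions = np.random.choice(np.arange(n), n, replace=True)

# Fraction of distinct original rows that appear in the resample.
# On average this is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.63,
# so each resample typically omits about 37% of the rows while other
# rows appear more than once.
distinct_fraction = len(np.unique(resample_positions)) / n
```

This built-in variation across resamples is exactly what lets the bootstrap mimic drawing fresh samples.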

We can simulate the act of collecting new samples by **sampling with replacement from our original sample, my_sample**.

In [38]:

```
# Note that the population DataFrame doesn't appear anywhere here.
# This is all based on one sample.
n_resamples = 5000
boot_medians = np.array([])

for i in range(n_resamples):
    # Resample from my_sample WITH REPLACEMENT.
    resample = my_sample.sample(500, replace=True)

    # Compute the median.
    median = resample.get('TotalWages').median()

    # Store it in our array of medians.
    boot_medians = np.append(boot_medians, median)
```

In [39]:

```
boot_medians
```

Out[39]:

array([72538. , 70989.5, 71874. , ..., 71372. , 69750. , 71486.5])

In [40]:

```
bpd.DataFrame().assign(BootstrapMedians=boot_medians).plot(kind='hist', density=True, bins=np.arange(60000, 85000, 1000), ec='w', figsize=(10, 5))
plt.scatter(population_median, 0.000004, color='blue', s=100, label='population median').set_zorder(2)
plt.legend();
```

- The population median (**blue dot**) is near the middle.
- **In reality, we'd never get to see this!**

We have a sample median wage:

In [41]:

```
my_sample.get('TotalWages').median()
```

Out[41]:

72016.0

With it, we can say that the population median wage is approximately \$72,016, and not much else.

But by bootstrapping, we can generate an empirical distribution of the sample median:

In [42]:

```
(bpd.DataFrame()
 .assign(BootstrapMedians=boot_medians)
 .plot(kind='hist', density=True, bins=np.arange(60000, 85000, 1000), ec='w', figsize=(10, 5))
)
plt.legend();
```

which allows us to say things like

We think the population median wage is between \$67,000 and \$77,000.

**Question:** We could also say that we think the population median wage is between \$70,000 and \$75,000, or between \$60,000 and \$80,000. What range should we pick?

- Given a single sample, we want to estimate some population parameter.
  - In real life, you don't get access to the population, only a sample!

- One sample gives one estimate of the parameter. To get a sense of how much our estimate might have been different with a different sample, we need more samples.
  - In real life, sampling is expensive. **You only get one sample!**

- **Key idea**: The distribution of a sample looks a lot like the distribution of the population it was drawn from. So, we can treat it like the population and **resample** from it.
  - Each resample yields another estimate of the parameter. Taken together, many estimates give a sense of how much variability exists in our estimates, or how certain we are that any single estimate is accurate.

- We just learned how to approximate the distribution of a sample statistic, which means we now have a sense of how much our estimates can vary.
  - Bootstrapping lets us **quantify uncertainty**.

- Next time, we'll learn how to give an **interval** of likely values where we think a population parameter falls, based on the data in our sample.
  - The width of such an interval will reflect our uncertainty about the actual value of the parameter.