# Lecture 18 – Permutation Testing, Bootstrapping

## DSC 10, Fall 2022

### Announcements

• Lab 5 is due Saturday 11/5 at 11:59pm.
• Homework 5 is due Tuesday 11/8 at 11:59pm.

### Agenda

• Permutation testing examples.
• Are the distributions of weight for babies 👶 born to smoking mothers vs. non-smoking mothers different?
• Are the distributions of pressure drops for footballs 🏈 from two different teams different?
• Bootstrapping 🥾.

## Permutation testing

### Purpose

Permutation tests help answer questions of the form:

I have two samples, but no information about any population distributions. Do these samples look like they were drawn from the same population?

• Are the distributions of weight for babies 👶 born to smoking mothers vs. non-smoking mothers different?
• Are the distributions of pressure drops for footballs 🏈 from two different teams different?

### Setup for the hypothesis test

• Null Hypothesis: In the population, birth weights of smokers' babies and non-smokers' babies have the same distribution, and the observed differences in our samples are due to random chance.
• Alternative Hypothesis: In the population, smokers' babies have lower birth weights than non-smokers' babies, on average. The observed differences in our samples cannot be explained by random chance alone.
• Test statistic: Difference in mean birth weight of non-smokers' babies and smokers' babies.

### Strategy and implementation

• Strategy:
• Create a "population" by pooling data from both samples together.
• Randomly divide this "population" into two groups of the same sizes as the original samples.
• Repeat this process, calculating the test statistic for each pair of random groups.
• Generate an empirical distribution of test statistics and see whether the observed statistic is consistent with it.
• Implementation:
• To randomly divide the "population" into two groups of the same sizes as the original samples, we'll just shuffle the group labels and use the shuffled group labels to define the two random groups.

### Shuffling the labels

The 'Maternal Smoker' column defines the original groups. The 'Shuffled_Labels' column defines the random groups.
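As a rough illustration of the shuffling step, here is a minimal sketch in plain Python. The six rows of data are made up for illustration and are not the lecture's dataset, and the standard-library `random` module stands in for whatever DataFrame tools the notebook uses.

```python
import random

# Hypothetical toy data, NOT the lecture's babies dataset.
smoker_labels = [True, False, False, True, False, False]   # original groups
birth_weights = [108, 120, 115, 101, 128, 117]             # ounces

random.seed(18)  # fixed seed so the sketch is reproducible

# random.sample with k equal to the list length returns a shuffled copy
# and leaves the original labels untouched.
shuffled_labels = random.sample(smoker_labels, k=len(smoker_labels))

# Each weight keeps its row; only the group label attached to it changes.
for original, shuffled, weight in zip(smoker_labels, shuffled_labels, birth_weights):
    print(original, shuffled, weight)
```

Because only the labels move, the two random groups automatically have the same sizes as the original groups.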

### Calculating the test statistic

For the original groups:

For the random groups:
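A minimal sketch of this calculation, assuming the same kind of made-up toy data and a hypothetical helper named `difference_in_means`. The statistic is the non-smokers' mean minus the smokers' mean, as in the setup above.

```python
from statistics import mean

def difference_in_means(labels, weights):
    """Mean weight of the non-smoker group minus mean weight of the smoker group."""
    non_smoker = [w for is_smoker, w in zip(labels, weights) if not is_smoker]
    smoker = [w for is_smoker, w in zip(labels, weights) if is_smoker]
    return mean(non_smoker) - mean(smoker)

# Hypothetical toy data, NOT the lecture's babies dataset.
smoker_labels = [True, False, False, True, False, False]     # original groups
shuffled_labels = [False, True, False, False, True, False]   # one possible shuffle
birth_weights = [108, 120, 115, 101, 128, 117]               # ounces

observed = difference_in_means(smoker_labels, birth_weights)
simulated = difference_in_means(shuffled_labels, birth_weights)
print(observed, simulated)
```

The same function handles both cases; the only thing that changes is which column of labels defines the groups.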

### Repeating the process

• Note that the empirical distribution of the test statistic (difference in means) is centered around 0.
• This matches our intuition – if the null hypothesis is true, there should be no difference in the group means on average.
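To see the centering around 0 for yourself, here is a sketch that repeats the shuffle many times. The weights are simulated from a single distribution (an assumption made purely so that the null hypothesis holds by construction), and the group sizes of 20 and 30 are arbitrary.

```python
import random
from statistics import mean

def difference_in_means(labels, weights):
    """Mean weight of the non-smoker group minus mean weight of the smoker group."""
    non_smoker = [w for is_smoker, w in zip(labels, weights) if not is_smoker]
    smoker = [w for is_smoker, w in zip(labels, weights) if is_smoker]
    return mean(non_smoker) - mean(smoker)

random.seed(18)

# Hypothetical data: every weight comes from the SAME distribution.
smoker_labels = [True] * 20 + [False] * 30
birth_weights = [random.gauss(120, 15) for _ in smoker_labels]

simulated_stats = []
for _ in range(2000):
    shuffled = random.sample(smoker_labels, k=len(smoker_labels))
    simulated_stats.append(difference_in_means(shuffled, birth_weights))

# The average of the simulated statistics should land close to 0.
center = mean(simulated_stats)
print(round(center, 2))
```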

### Conclusion

• Under the null hypothesis, we rarely see differences as large as 9.26 ounces.
• Therefore, we reject the null hypothesis: the evidence suggests that the groups do not come from the same distribution.
• Can we conclude that smoking causes lower birth weight? Why or why not?
• Answer: No, we cannot. This was an observational study; there may be confounding factors. For instance, maybe smokers are more likely to drink caffeine, and caffeine causes lower birth weight.

### Concept Check ✅ – Answer at cc.dsc10.com

Recall that babies has two columns: 'Maternal Smoker' and 'Birth Weight'.

To randomly assign weights to groups, we shuffled the 'Maternal Smoker' column. Could we have shuffled the 'Birth Weight' column instead?

• A. Yes
• B. No

### Example: Did the New England Patriots cheat? 🏈

• On January 18, 2015, the New England Patriots played the Indianapolis Colts for a spot in the Super Bowl.
• The Patriots won, 45-7. They went on to win the Super Bowl.
• After the game, it was alleged that the Patriots intentionally deflated footballs, making them easier to catch. This scandal was called "Deflategate."

### Background

• Each team brings 12 footballs to the game. Teams use their own footballs while on offense.
• NFL rules stipulate that each ball must be inflated to between 12.5 and 13.5 pounds per square inch (psi).
• Before the game, officials found that all of the Patriots' footballs were at about 12.5 psi, and that all of the Colts' footballs were at about 13.0 psi.
• This pre-game data was not written down.
• In the second quarter, the Colts intercepted a Patriots ball and notified officials that it felt under-inflated.
• At halftime, two officials (Clete Blakeman and Dyrol Prioleau) independently measured the pressures of as many of the 24 footballs as they could.
• They ran out of time before they could finish.
• Note that the relevant quantity is the change in pressure from the start of the game to halftime.
• The Patriots' balls started at a lower psi (which is not an issue on its own).
• The allegations were that the Patriots deflated their balls during the game.

### The measurements

• There are only 15 rows (11 for Patriots footballs, 4 for Colts footballs) since the officials weren't able to record the pressures of every ball.
• The 'Pressure' column records the average of the two officials' measurements at halftime.
• The 'PressureDrop' column records the difference between the estimated starting pressure and the average recorded 'Pressure' of each football.
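The two derived columns can be sketched as follows. The starting pressures are the approximate pre-game values from the Background section; the halftime readings are invented for illustration and are not the officials' actual measurements.

```python
# Approximate pre-game pressures quoted in the Background section.
STARTING_PSI = {'Patriots': 12.5, 'Colts': 13.0}

# Hypothetical halftime readings: (team, first official, second official).
halftime_readings = [
    ('Patriots', 11.50, 11.80),
    ('Colts', 12.70, 12.40),
]

rows = []
for team, reading_1, reading_2 in halftime_readings:
    pressure = (reading_1 + reading_2) / 2          # the 'Pressure' column
    pressure_drop = STARTING_PSI[team] - pressure   # the 'PressureDrop' column
    rows.append((team, round(pressure, 3), round(pressure_drop, 3)))

for row in rows:
    print(row)
```

A larger 'PressureDrop' means the ball lost more pressure between the start of the game and halftime.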

### The question

Did the Patriots' footballs drop in pressure more than the Colts'?

• We want to test whether two samples came from the same distribution – this calls for a permutation test.
• Null hypothesis: The drop in pressures for both teams came from the same distribution.
• By chance, the Patriots' footballs deflated more.
• Alternative hypothesis: No, the Patriots' footballs deflated more than one would expect due to random chance alone.

### The test statistic

Similar to the baby weights example, our test statistic will be the difference between the teams' average pressure drops. We'll calculate the mean drop for the 'Patriots' minus the mean drop for the 'Colts'.

The average pressure drop for the Patriots was about 0.74 psi more than that of the Colts.

### Creating random groups and calculating one value of the test statistic

We'll run a permutation test to see if 0.74 psi is a significant difference.

• To do this, we'll need to repeatedly shuffle either the 'Team' or the 'PressureDrop' column.
• We'll shuffle the 'PressureDrop' column.
• Tip: It's a good idea to simulate one value of the test statistic before putting everything in a for-loop.
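Here is a sketch of simulating a single value of the test statistic by shuffling the pressure drops while leaving the team labels in place. The group sizes (11 and 4) match the data described above, but the pressure-drop values themselves are made up for illustration.

```python
import random
from statistics import mean

random.seed(18)

teams = ['Patriots'] * 11 + ['Colts'] * 4
# Hypothetical pressure drops in psi, NOT the real measurements.
pressure_drops = [1.0, 1.2, 1.4, 0.9, 1.1, 1.3, 1.5, 0.8, 1.2, 1.0, 1.3,
                  0.4, 0.6, 0.3, 0.5]

# Shuffle the values; the 'Team' labels stay where they are.
shuffled_drops = random.sample(pressure_drops, k=len(pressure_drops))

patriots_mean = mean([d for t, d in zip(teams, shuffled_drops) if t == 'Patriots'])
colts_mean = mean([d for t, d in zip(teams, shuffled_drops) if t == 'Colts'])

one_simulated_stat = patriots_mean - colts_mean
print(one_simulated_stat)
```

Shuffling the values against fixed labels produces the same kind of random regrouping as shuffling the labels against fixed values.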

### The simulation

• Repeat the process many times by wrapping it inside a for-loop.
• Keep track of the difference in group means in an array, appending each time.
• Optionally, create a function to calculate the difference in group means.
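The three steps above might be put together as in the sketch below. The helper name `difference_in_mean_drops` and the data are hypothetical, and a plain Python list stands in for the array mentioned above.

```python
import random
from statistics import mean

def difference_in_mean_drops(teams, drops):
    """Mean drop for 'Patriots' rows minus mean drop for 'Colts' rows."""
    patriots = [d for t, d in zip(teams, drops) if t == 'Patriots']
    colts = [d for t, d in zip(teams, drops) if t == 'Colts']
    return mean(patriots) - mean(colts)

random.seed(18)

teams = ['Patriots'] * 11 + ['Colts'] * 4
# Hypothetical pressure drops in psi, NOT the real measurements.
pressure_drops = [1.0, 1.2, 1.4, 0.9, 1.1, 1.3, 1.5, 0.8, 1.2, 1.0, 1.3,
                  0.4, 0.6, 0.3, 0.5]

differences = []
for _ in range(2000):
    shuffled_drops = random.sample(pressure_drops, k=len(pressure_drops))
    differences.append(difference_in_mean_drops(teams, shuffled_drops))

print(len(differences), round(mean(differences), 3))
```

Writing the statistic as a function keeps the loop body to two lines and lets the same function compute the observed statistic on the unshuffled data.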

### Conclusion

It doesn't look good for the Patriots. What is the p-value?

• Recall, the p-value is the probability, under the null hypothesis, of seeing a result as or more extreme than the observation.
• In this case, that's the probability of the difference in mean pressure drops being greater than or equal to 0.74 psi.

This p-value is low enough to consider this result to be highly statistically significant ($p<0.01$).
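In a simulation, this probability is estimated by the proportion of simulated differences that are at least as large as the observed one. The sketch below uses a short, made-up list of simulated differences purely to show the arithmetic; a real run would use thousands of simulated values.

```python
def empirical_p_value(simulated, observed):
    """Proportion of simulated statistics greater than or equal to the observed one."""
    at_least_as_extreme = sum(1 for s in simulated if s >= observed)
    return at_least_as_extreme / len(simulated)

# Made-up stand-in for the simulated differences in mean pressure drop.
simulated_differences = [-0.31, 0.12, 0.80, -0.05, 0.44, -0.52, 0.09, 0.27, -0.18, 0.02]
observed_difference = 0.74

p_value = empirical_p_value(simulated_differences, observed_difference)
print(p_value)  # 1 of the 10 made-up values is >= 0.74, so this prints 0.1
```

The comparison is `>=` because the alternative hypothesis says the Patriots' footballs deflated more, so larger differences are the more extreme ones.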

### Caution! ⚠️

• We reject the null hypothesis, as it is unlikely that the difference in mean pressure drops is due to chance alone.
• But this doesn't establish causation.
• That is, we can't conclude that the Patriots intentionally deflated their footballs.

### Aftermath

Quote from an investigative report commissioned by the NFL:

“[T]he average pressure drop of the Patriots game balls exceeded the average pressure drop of the Colts balls by 0.45 to 1.02 psi, depending on various possible assumptions regarding the gauges used, and assuming an initial pressure of 12.5 psi for the Patriots balls and 13.0 for the Colts balls.”

• Many different methods were used to determine whether the drop in pressure was due to chance, including physics.
• We computed an observed difference of 0.74, which is in line with the findings of the report.
• In the end, Tom Brady (quarterback for the Patriots at the time) was suspended 4 games and the team was fined $1 million.
• The Deflategate Wikipedia article is extremely thorough; give it a read if you're curious!

### Aside: Establishing causation

To actually establish causation, we need the following two statements to be true:

1. The data must come from a randomized controlled trial, to mitigate the effects of confounding factors.
2. A permutation test must show a statistically significant difference in the outcome between the treatment and control group.

If both of these conditions are met, then we can conclude that the treatment causes the outcome.

## Bootstrapping 🥾

### City of San Diego employee salary data

All City of San Diego employee salary data is public. We are using the latest available data.

When you load in a dataset that has so many columns that you can't see them all, it's a good idea to look at the column names.

We only need the 'TotalWages' column, so let's get just that column.

### Concept Check ✅ – Answer at cc.dsc10.com

Consider the question

What is the median salary of all San Diego city employees?

What is the right tool to answer this question?

• A. Standard hypothesis testing
• B. Permutation testing
• C. Either of the above
• D. None of the above

### The median salary

• We can use .median() to find the median salary of all city employees.
• This is not a random quantity.

### Let's be realistic...

• In practice, it is costly and time-consuming to survey all 12,000+ employees.
• More generally, we can't expect to survey all members of the population we care about.
• Instead, we gather salaries for a random sample of, say, 500 people.
• Hopefully, the median of the sample is close to the median of the population.

### In the language of statistics

• The full DataFrame of salaries is the population.
• We observe a sample of 500 salaries from the population.
• We want to determine the population median (a parameter), but we don't have the whole population, so instead we use the sample median (a statistic) as an estimate.
• Hopefully the sample median is close to the population median.

### The sample median

Let's survey 500 employees at random. To do so, we can use the .sample method.

We won't reassign my_sample at any point in this notebook, so it will always refer to this particular sample.

### How confident are we that this is a good estimate?

• Our estimate depended on a random sample.
• If our sample had been different, our estimate might have been different, too.
• How different could our estimate have been?
• Our confidence in the estimate depends on the answer to this question.

### The sample median is random

• The sample median is a random number.
• It comes from some distribution, which we don't know.
• How different could our estimate have been, if we drew a different sample?
• "Narrow" distribution $\Rightarrow$ not too different.
• "Wide" distribution $\Rightarrow$ quite different.
• What is the distribution of the sample median?
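To make the parameter-versus-statistic distinction concrete, here is a minimal sketch. The "population" is 12,000 simulated right-skewed salaries (an assumption for illustration, not the City of San Diego data), and `random.sample` plays the role of a without-replacement sampling method.

```python
import random
from statistics import median

random.seed(18)

# Hypothetical population of 12,000 salaries, NOT the real city data.
population = [round(random.lognormvariate(11, 0.6)) for _ in range(12_000)]

# The population median is a parameter: fixed, not random.
population_median = median(population)

# One random sample of 500 salaries, drawn without replacement.
my_sample = random.sample(population, k=500)

# The sample median is a statistic: it depends on which 500 people we drew.
sample_median = median(my_sample)

print(population_median, sample_median)
```

Re-running with a different seed changes `sample_median` but leaves `population_median` untouched, which is exactly why the sample median has a distribution and the population median does not.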

### An impractical approach

• One idea: repeatedly collect random samples of 500 from the population and compute the median of each sample.
• This is what we did in Lecture 14 to compute an empirical distribution of the sample mean of flight delays.
• The animation below visualizes the process of repeatedly collecting a sample and computing its median.
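In code, the repeated-sampling idea might look like the sketch below, again using a simulated population as a stand-in for the real salary data. It only works because we pretend to have access to the whole population, which is what makes the approach impractical.

```python
import random
from statistics import median

random.seed(18)

# Hypothetical population of 12,000 salaries, NOT the real city data.
population = [round(random.lognormvariate(11, 0.6)) for _ in range(12_000)]
population_median = median(population)

sample_medians = []
for _ in range(1000):
    # Each iteration "surveys" a fresh random sample of 500 people.
    new_sample = random.sample(population, k=500)
    sample_medians.append(median(new_sample))

# The spread of these values shows how much the estimate varies from sample to sample.
print(min(sample_medians), population_median, max(sample_medians))
```

A histogram of `sample_medians` would be the empirical distribution of the sample median; the narrower it is, the more we can trust any single sample's median.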