In [1]:

```
# Set up packages for lecture. Don't worry about understanding this code,
# but make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
from scipy import stats
set_matplotlib_formats("svg")
plt.style.use('ggplot')

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

# Animations
from IPython.display import display, IFrame, HTML
import ipywidgets as widgets

import warnings
warnings.filterwarnings('ignore')

def normal_curve(x, mu=0, sigma=1):
    return 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))

def normal_area(a, b, bars=False, title=None):
    x = np.linspace(-4, 4)
    y = normal_curve(x)
    ix = (x >= a) & (x <= b)
    plt.plot(x, y, color='black')
    plt.fill_between(x[ix], y[ix], color='gold')
    if bars:
        plt.axvline(a, color='red')
        plt.axvline(b, color='red')
    if title:
        plt.title(title)
    else:
        plt.title(f'Area between {np.round(a, 2)} and {np.round(b, 2)}')
    plt.show()

def area_within(z):
    title = f'Proportion of values within {z} SDs of the mean: {np.round(stats.norm.cdf(z) - stats.norm.cdf(-z), 4)}'
    normal_area(-z, z, title=title)

def show_clt_slides():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vTcJd3U1H1KoXqBFcWGKFUPjZbeW4oiNZZLCFY8jqvSDsl4L1rRTg7980nPs1TGCAecYKUZxH5MZIBh/embed?start=false&loop=false&delayms=3000&rm=minimal"
    width = 960
    height = 509
    display(IFrame(src, width, height))
```

- Quiz 3 is **today in discussion section**.
    - It covers Lectures 14 through 17.
    - Prepare by solving relevant problems on practice.dsc10.com.
- Lab 5 is due on **Saturday 11/18 at 11:59PM**.
- Homework 5 is due on **Tuesday 11/21 at 11:59PM**.

- Recap: The Central Limit Theorem (CLT).
- Choosing sample sizes.
- Models.

In [2]:

```
show_clt_slides()
```

- The Central Limit Theorem (CLT) says that the probability distribution of the **sum or mean** of a large random sample drawn with replacement will be roughly normal, regardless of the distribution of the population from which the sample is drawn.

- The distribution of the sample mean is centered at the population mean, and its standard deviation is

$$\text{SD of the distribution of the sample mean} = \frac{\text{population SD}}{\sqrt{\text{sample size}}}$$
A 95% confidence interval for the population mean is given by

$$ \left[\text{sample mean} - 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \text{sample mean} + 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right] $$

This CI doesn't require bootstrapping, and it only requires three numbers – the sample mean, the sample SD, and the sample size!
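As a quick sketch, with **hypothetical** sample statistics (the numbers below are made up for illustration), the CLT-based 95% CI only needs those three quantities:

```
# Hypothetical sample statistics, just for illustration.
sample_mean = 3.2
sample_sd = 1.5
sample_size = 400

# SD of the distribution of the sample mean.
sd_of_sample_mean = sample_sd / sample_size ** 0.5

# CLT-based 95% confidence interval for the population mean.
[sample_mean - 2 * sd_of_sample_mean, sample_mean + 2 * sd_of_sample_mean]
```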

Bootstrapping still has its uses!

| | Bootstrapping | CLT |
|---|---|---|
| **Pro** | Works for many sample statistics (mean, median, standard deviation). | Only requires 3 numbers – the sample mean, sample SD, and sample size. |
| **Con** | Very computationally expensive (requires drawing many, many samples from the original sample). | Only works for the sample mean (and sum). |

We just saw that when $z = 2$, the following is a 95% confidence interval for the population mean.

$$ \left[\text{sample mean} - z\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \text{sample mean} + z\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right] $$

**Question**: What value of $z$ should we use to create an 80% confidence interval? 90%?

In [3]:

```
z = widgets.FloatSlider(value=2, min=0,max=4,step=0.05, description='z')
ui = widgets.HBox([z])
out = widgets.interactive_output(area_within, {'z': z})
display(ui, out)
```

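One way to answer without a slider is to invert the normal CDF: for a central area of $c$ under the standard normal curve, the cutoff is the $(0.5 + c/2)$ quantile. A sketch using `scipy.stats`, which is already imported above:

```
from scipy import stats

def z_for_confidence(c):
    # The central area c leaves (1 - c) / 2 in each tail,
    # so the right cutoff is the (0.5 + c / 2) quantile.
    return stats.norm.ppf(0.5 + c / 2)

# For 0.80, 0.90, 0.95: roughly 1.28, 1.64, and 1.96.
z_for_confidence(0.80), z_for_confidence(0.90), z_for_confidence(0.95)
```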

Which one of these histograms corresponds to the distribution of the sample mean for samples of size 100 drawn from a population with mean 50 and SD 20?

- You want to estimate the proportion of UCSD students that use BeReal.

- To do so, you will ask a random sample of UCSD students whether or not they use BeReal.

- You want to create a confidence interval that has:
- A 95% confidence level.
- A width of at most 0.06.
- The interval (0.21, 0.25) would be fine, but the interval (0.21, 0.28) would not.

**Question**: How big of a sample do you need? 🤔

- The sample we collect will consist of only two unique values:
- 1, if the student uses BeReal.
- 0, if they don't.

- We're interested in the proportion of values in our sample that are 1.

- This proportion is the same as the mean of our sample!

- For instance, suppose our sample is $0, 1, 1, 0, 1$. Then $\frac{3}{5}$ of the values are $1$, and the sample mean is $\frac{0 + 1 + 1 + 0 + 1}{5} = \frac{3}{5}$.

**Key takeaway**: The CLT applies in this case as well! The distribution of the proportion of 1s in our sample is roughly normal.
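As a quick numerical check of the fact above:

```
import numpy as np

# The mean of a collection of 0s and 1s equals the proportion of 1s.
responses = np.array([0, 1, 1, 0, 1])
np.mean(responses)  # 0.6, i.e. 3/5
```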

We will:

- Collect a random sample.
- Compute the sample mean (i.e., the proportion of people who say "yes").
- Compute the sample standard deviation.
- Construct a 95% confidence interval for the population mean: $$\left[\text{sample mean} - 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \text{sample mean} + 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right]$$

Note that the width of our CI is the right endpoint minus the left endpoint:

$$ \text{width} = 4 \cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} $$

- We want a CI whose width is at most 0.06.

- The width of our confidence interval depends on two things: the sample SD and the sample size.

- If we know the sample SD, we can find the appropriate sample size by re-arranging the following inequality:

$$\sqrt{\text{sample size}} \geq 4 \cdot \frac{\text{sample SD}}{0.06} \\
\implies \boxed{\text{sample size} \geq \left( 4 \cdot \frac{\text{sample SD}}{0.06} \right)^2}$$

**Problem**: Before polling, we don't know the sample SD, because we don't have a sample! We don't know the population SD either.

**Solution**: Find an upper bound – i.e. the largest possible value – for the sample SD and use that.

- However, in this case, we know that the only values in our sample will be 0 ("no") and 1 ("yes").

- In Homework 5, we introduce a formula for the standard deviation of a collection of 0s and 1s: $$\text{SD of a collection of 0s and 1s} = \sqrt{(\text{proportion of 1s}) \cdot (1 - \text{proportion of 1s})}$$
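As a quick check with a made-up array, NumPy's `np.std` (which computes the population-style SD) matches $\sqrt{p(1-p)}$, where $p$ is the proportion of 1s:

```
import numpy as np

# 3 ones and 7 zeros, so p = 0.3.
values = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
p = values.mean()

# Both should equal sqrt(0.3 * 0.7) = sqrt(0.21).
np.std(values), np.sqrt(p * (1 - p))
```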

In [4]:

```
# Plot the SD of a collection of 0s and 1s with p proportion of 1s.
p = np.arange(0, 1.01, 0.01)
sd = np.sqrt(p * (1 - p))
plt.plot(p, sd)
plt.xlabel('p')
plt.ylabel(r'$\sqrt{p(1-p)}$');
```

**Fact**: The largest possible value of the SD of a collection of 0s and 1s is 0.5.

- This happens when half the values are 0 and half are 1.

- In the sample we will collect, the maximum possible SD is 0.5.

- Earlier, we saw that to construct a confidence interval with the desired confidence level and width, our sample size needs to satisfy: $$\text{sample size} \geq \left( 4 \cdot \frac{\text{sample SD}}{0.06} \right)^2$$

- Notice that as the sample SD increases, the required sample size increases.

- By using the maximum possible SD above,
**we ensure that we collect a large enough sample, no matter what the population and sample look like**.

By substituting 0.5 for the sample SD, we get

$$\text{sample size} \geq \left( 4 \cdot \frac{\text{0.5}}{0.06} \right)^2$$

We should use the **smallest sample size that satisfies the above inequality**.

In [5]:

```
(4 * 0.5 / 0.06) ** 2
```

Out[5]:

1111.1111111111113

**Conclusion**: We must sample **1112 people** to construct a 95% CI for the population mean that is at most 0.06 wide.
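The computation above can be packaged as a small helper. This is just a sketch of the same arithmetic: it bakes in $z = 2$ (a 95% confidence level) and the worst-case SD of 0.5 for a collection of 0s and 1s.

```
import numpy as np

def minimum_sample_size(max_width):
    # width = 4 * SD / sqrt(n) <= max_width, with worst-case SD of 0.5,
    # rearranges to n >= (4 * 0.5 / max_width) ** 2; round up to a whole number.
    return int(np.ceil((4 * 0.5 / max_width) ** 2))

minimum_sample_size(0.06)  # 1112
```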

Suppose we instead want a 95% CI for the population mean that is at most 0.03 wide. What is the smallest sample size we could collect?

- So far in the second half of this class, we've focused on the problem of parameter estimation.
- Given a single sample, we can construct a confidence interval using bootstrapping (for most statistics) or the CLT (for the sample mean).
- This confidence interval gives us a range of estimates for a parameter.

- Next, we'll turn our attention to answering yes-no questions about the relationships between samples and populations.
- Example: Does it look like this jury panel was drawn randomly from this population of eligible jurors?
- Example: Does it look like this sequence of coin tosses was generated by a fair coin?

- Both of these problems fall under the umbrella of **statistical inference** – using a sample to draw conclusions about the population.

A model is a set of assumptions about how data was generated.

- Our goal is to
**assess the quality of a model**.

- Suppose we have access to a dataset. What we'll try to do is
**determine whether a model "explains" the patterns in the dataset**.

- Robert Swain was a Black man convicted of a crime in Talladega County, Alabama.

- At the time, only men 21 years or older were allowed to serve on juries. **26%** of this eligible population was Black.

- But of the 100 men on Robert Swain's jury panel, only **8** were Black.

- About the disparity between the percentages in the eligible population and the jury panel, the Supreme Court wrote:

"... the overall percentage disparity has been small..."

- The Supreme Court denied Robert Swain’s appeal and he was sentenced to life in prison.

- We now have the tools to show **quantitatively** that the Supreme Court's claim was misguided.

- This "overall percentage disparity" turns out to be not so small, and is an example of racial bias.
- Jury panels were often made up of people in the jury commissioner's professional and social circles.
- Of the 8 Black men on the jury panel, **none** were selected to be part of the actual jury.

**Model**: Jury panels consist of 100 men, **randomly** chosen from a population that is 26% Black.

**Observation**: On the actual jury panel, only 8 out of 100 men were Black.

**Question**: Does the model explain the observation?

- We'll start by assuming that the model is true.

- We'll generate many jury panels using this assumption.

- We'll count the number of Black men in each simulated jury panel to see how likely it is for a random panel to contain 8 or fewer Black men.
- If we see 8 or fewer Black men often, then the model seems reasonable.
- If we rarely see 8 or fewer Black men, then the model may not be reasonable.

- Run an experiment once to generate one value of our chosen statistic.
    - In this case, sample 100 people randomly from a population that is 26% Black, and count **the number of Black men (statistic)**.
- Repeat the experiment many times.
- Visualize the resulting **empirical distribution of the statistic**.

- How do we randomly sample a jury panel? `np.random.choice` won't help us, because we don't know how large the eligible population is.

- The function `np.random.multinomial` helps us sample at random from a **categorical distribution**.

```
np.random.multinomial(sample_size, pop_distribution)
```

- `np.random.multinomial` samples at random from the population, **with replacement**, and returns a random array containing counts in each category.

- `pop_distribution` needs to be an array containing the probabilities of each category.

**Aside: Example usage of np.random.multinomial**

On Halloween 👻, you trick-or-treated at 35 houses, each of which had an identical candy box, containing:

- 30% Starbursts.
- 30% Sour Patch Kids.
- 40% Twix.

At each house, you selected one candy blindly from the candy box.

To simulate the act of going to 35 houses, we can use `np.random.multinomial`:

In [6]:

```
np.random.multinomial(35, [0.3, 0.3, 0.4])
```

Out[6]:

array([10, 11, 14])

In [7]:

```
demographics = [0.26, 0.74]
```

Each time we run the following cell, we'll get a new random sample of 100 people from this population.

- The first element of the resulting array is the number of Black men in the sample.
- The second element is the number of non-Black men in the sample.

In [8]:

```
np.random.multinomial(100, demographics)
```

Out[8]:

array([22, 78])

We also need to calculate the statistic, which in this case is the number of Black men in the random sample of 100.

In [9]:

```
np.random.multinomial(100, demographics)[0]
```

Out[9]:

21

- Let's run 10,000 simulations.
- We'll keep track of the number of Black men in each simulated jury panel in the array `counts`.

In [10]:

```
counts = np.array([])
for i in np.arange(10000):
    new_count = np.random.multinomial(100, demographics)[0]
    counts = np.append(counts, new_count)
counts
```

Out[10]:

array([28., 27., 24., ..., 28., 22., 28.])

Was a jury panel with 8 Black men suspiciously unusual?

In [11]:

```
(bpd.DataFrame().assign(count_black_men=counts)
 .plot(kind='hist', bins=np.arange(9.5, 45, 1),
       density=True, ec='w', figsize=(10, 5),
       title='Empirical Distribution of the Number of Black Men in Simulated Jury Panels of Size 100'));
observed_count = 8
plt.axvline(observed_count, color='black', linewidth=4, label='Observed Number of Black Men in Actual Jury Panel')
plt.legend();
```

In [12]:

```
# In 10,000 random experiments, the panel with the fewest Black men had how many?
counts.min()
```

Out[12]:

11.0

- Our simulation shows that there's essentially no chance that a random sample of 100 men drawn from a population in which 26% of men are Black will contain 8 or fewer Black men.
- As a result, it seems that the model we proposed – that the jury panel was drawn at random from the eligible population – is flawed.
- There were likely factors **other than chance** that explain why there were only 8 Black men on the jury panel.
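The conclusion can also be quantified with a self-contained re-run of the simulation, here using a single vectorized call and a hypothetical seed (chosen arbitrarily for reproducibility): the fraction of 10,000 random panels with 8 or fewer Black men is essentially zero.

```
import numpy as np

np.random.seed(10)  # hypothetical seed, just for reproducibility

# Draw 10,000 panels of 100 at once; column 0 counts Black men per panel.
counts = np.random.multinomial(100, [0.26, 0.74], size=10000)[:, 0]

# Fraction of simulated panels with 8 or fewer Black men.
np.count_nonzero(counts <= 8) / 10000
```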

- A 95% confidence interval for the population mean is given by $$ \left[\text{sample mean} - 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \text{sample mean} + 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right] $$
- If we want to construct a confidence interval of a particular width and confidence level for a population *proportion*:
    - Choose a confidence level (e.g. 95%) and maximum width (e.g. 0.06).
    - Solve for the minimum sample size that satisfies both conditions.
    - Collect a sample of that size.
    - Use the formula above to construct an interval.

- A **model** is an assumption about how data was generated. We're interested in determining the validity of a model, given some data we've collected.
- When assessing a model, we consider two viewpoints of the world: one where the model is true, and another where the model is false for some reason.

- Next time, we'll see more examples of testing models and deciding between two viewpoints.
- We'll formalize this notion, which is called hypothesis testing.