In [1]:
# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
set_matplotlib_formats("svg")
plt.style.use('ggplot')

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

Lecture 14 – Models and Viewpoints¶

DSC 10, Winter 2023¶

Announcements¶

  • Midterm Exam scores are available. See this post for details.
    • Only worth 10%. Take it as a learning experience!
  • The Midterm Project is due tomorrow at 11:59PM.
    • Slip days can be used if needed. Using one will deduct from both partners' allocations.
    • Only one partner should submit and "Add Group Member" on Gradescope.
  • Lab 5 is due Saturday 2/18 at 11:59PM.

Agenda¶

  • Statistical models.
  • Example: Jury selection.
  • Example: Genetics of peas. 🟢
  • Viewpoints and test statistics.
  • Example: Is our coin fair?

Statistical models¶

Models¶

  • A model is a set of assumptions about how data was generated.
  • We want a way to assess the quality of a given model.

Example¶

Galileo's Leaning Tower of Pisa Experiment

Example: Jury selection¶

Swain vs. Alabama, 1965¶

  • Robert Swain was a Black man convicted of a crime in Talladega County, Alabama.
  • He appealed the jury's decision all the way to the Supreme Court, on the grounds that Talladega County systematically excluded Black people from juries.
  • At the time, only men 21 years or older were allowed to serve on juries. 26% of this eligible population was Black.
  • But of the 100 men on Robert Swain's jury panel, only 8 were Black.
$\substack{\text{eligible} \\ \text{population}} \xrightarrow{\substack{\text{representative} \\ \text{sample}}} \substack{\text{jury} \\ \text{panel}} \xrightarrow{\substack{\text{selection by} \\ \text{judge/attorneys}}} \substack{\text{actual} \\ \text{jury}}$

Supreme Court ruling¶

  • About disparities between the percentages in the eligible population and the jury panel, the Supreme Court wrote:

"... the overall percentage disparity has been small..."

  • The Supreme Court denied Robert Swain’s appeal and he was sentenced to life in prison.
  • We now have the tools to show quantitatively that the Supreme Court's claim was misguided.
  • This "overall percentage disparity" turns out to be not so small, and is an example of racial bias.
    • Jury panels were often made up of people in the jury commissioner's professional and social circles.
    • Of the 8 Black men on the jury panel, none were selected to be part of the actual jury.

Our model for simulating Swain's jury panel¶

  • We will assume the jury panel consists of 100 men, randomly chosen from a population that is 26% Black.
  • Our question: is this model (i.e. assumption) right or wrong?

Our approach: simulation¶

  • We'll start by assuming that this model is true.
  • We'll generate many jury panels using this assumption.
  • We'll count the number of Black men in each simulated jury panel to see how likely it is for a random panel to contain 8 or fewer Black men.

Simulating statistics¶

Recall, a statistic is a number calculated from a sample.

  1. Run an experiment once to generate one value of a statistic.
    • In this case, sample 100 people randomly from a population that is 26% Black, and count the number of Black men (statistic).
  2. Run the experiment many times, generating many values of the statistic, and store these statistics in an array.
  3. Visualize the resulting empirical distribution of the statistic.
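
The first two steps above can be sketched as a pair of small functions. This is only an illustration under stated assumptions: the names `simulate_statistic` and `count_heads_in_100_flips` are hypothetical, and the experiment shown is 100 flips of a fair coin rather than the jury panel.

```python
import numpy as np

def simulate_statistic(run_experiment_once, repetitions):
    """Step 2: repeat an experiment and collect the statistic from each run."""
    statistics = np.array([])
    for i in np.arange(repetitions):
        statistics = np.append(statistics, run_experiment_once())
    return statistics

def count_heads_in_100_flips():
    """Step 1 (hypothetical experiment): flip a fair coin 100 times, count heads."""
    flips = np.random.choice(['Heads', 'Tails'], 100)
    return np.count_nonzero(flips == 'Heads')

# 1,000 simulated values of the statistic, stored in an array.
head_counts = simulate_statistic(count_heads_in_100_flips, 1000)
```

Step 3 would then draw a density histogram of `head_counts`, for example with one bin per possible count.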

Step 1 – Running the experiment once¶

  • How do we randomly sample a jury panel?
    • np.random.choice won't help us, because we don't know how large the eligible population is.
  • The function np.random.multinomial helps us sample at random from a categorical distribution.
np.random.multinomial(sample_size, pop_distribution)
  • np.random.multinomial samples at random from the population, with replacement, and returns a random array containing counts in each category.
    • pop_distribution needs to be an array containing the probabilities of each category.

Aside: Example usage of np.random.multinomial

On Halloween 👻 you'll trick-or-treat at 35 houses, each of which has an identical candy box, containing:

  • 30% Starbursts.
  • 30% Sour Patch Kids.
  • 40% Twix.

At each house, you'll select one candy blindly from the candy box.

To simulate the act of going to 35 houses, we can use np.random.multinomial:

In [2]:
np.random.multinomial(35, [0.3, 0.3, 0.4])
Out[2]:
array([10, 11, 14])
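
As a quick check on what the returned array means, the sketch below (with the hypothetical variable names `candy_counts` and `candy_proportions`) confirms that the counts always add up to the number of houses, and converts them to proportions that can be compared with the box's 30% / 30% / 40% split.

```python
import numpy as np

candy_probabilities = [0.3, 0.3, 0.4]  # Starbursts, Sour Patch Kids, Twix

# One simulated night of trick-or-treating at 35 houses.
candy_counts = np.random.multinomial(35, candy_probabilities)

# The counts vary from run to run, but they always total 35.
total_candies = candy_counts.sum()

# Proportion of each candy in this particular sample.
candy_proportions = candy_counts / 35
```

With only 35 houses, the sample proportions can differ noticeably from the probabilities in the box.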

Step 1 – Running the experiment once¶

In our case, a randomly selected member of our population is Black with probability 0.26 and not Black with probability 1 - 0.26 = 0.74.

In [3]:
demographics = [0.26, 0.74]

Each time we run the following cell, we'll get a new random sample of 100 people from this population.

  • The first element of the resulting array is the number of Black men in the sample.
  • The second element is the number of non-Black men in the sample.
In [4]:
np.random.multinomial(100, demographics)
Out[4]:
array([26, 74])

Step 1 – Running the experiment once¶

We also need to calculate the statistic, which in this case is the number of Black men in the random sample of 100.

In [5]:
np.random.multinomial(100, demographics)[0]
Out[5]:
22

Step 2 – Repeat the experiment many times¶

  • Let's run 10,000 simulations.
  • We'll keep track of the number of Black men in each simulated jury panel in the array counts.
In [6]:
counts = np.array([])

for i in np.arange(10000):
    new_count = np.random.multinomial(100, demographics)[0]
    counts = np.append(counts, new_count)
In [7]:
counts
Out[7]:
array([27., 28., 25., ..., 27., 20., 22.])
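
As an aside, the loop above is not the only way to get these counts. `np.random.multinomial` accepts an optional `size` argument, so a sketch of an equivalent loop-free approach (the names `panels` and `counts_vectorized` are hypothetical) might look like this:

```python
import numpy as np

demographics = [0.26, 0.74]

# Draw 10,000 panels at once: each row is one panel, and the columns are
# [number of Black men, number of non-Black men].
panels = np.random.multinomial(100, demographics, size=10000)

# Keep only the first column: the number of Black men in each panel.
counts_vectorized = panels[:, 0]
```

Unlike the array built with `np.append`, which holds floats, `counts_vectorized` holds integers. Its distribution should look the same.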

Step 3 – Visualize the resulting distribution¶

Was a jury panel with 8 Black men suspiciously unusual?

In [8]:
(bpd.DataFrame().assign(count_black_men=counts)
                .plot(kind='hist', bins = np.arange(9.5, 45, 1), 
                      density=True, ec='w', figsize=(10, 5),
                      title='Empirical Distribution of the Number of Black Men in Simulated Jury Panels of Size 100'));
observed_count = 8
plt.axvline(observed_count, color='black', linewidth=4, label='Observed Number of Black Men in Actual Jury Panel')
plt.legend();
In [9]:
# In 10,000 random experiments, the panel with the fewest Black men had how many?
counts.min()
Out[9]:
11.0

Conclusion¶

  • Our simulation shows that there's essentially no chance that a random sample of 100 men drawn from a population in which 26% of men are Black will contain 8 or fewer Black men.
  • As a result, it seems that the model we proposed – that the jury panel was drawn at random from the eligible population – is flawed.
  • There were likely factors other than chance that explain why there were only 8 Black men on the jury panel.
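
To put a number on "essentially no chance", one option is to compute the proportion of simulated panels with 8 or fewer Black men. The sketch below reruns the simulation so that it stands alone. The exact binomial calculation at the end is not part of the lecture's approach; it is an added cross-check that relies on the same model assumption (100 independent draws, each Black with probability 0.26).

```python
import numpy as np
from math import comb

demographics = [0.26, 0.74]
observed_count = 8

# Rerun the simulation: 10,000 panels of 100 men each.
counts = np.array([])
for i in np.arange(10000):
    counts = np.append(counts, np.random.multinomial(100, demographics)[0])

# Proportion of simulated panels with 8 or fewer Black men.
simulated_chance = np.count_nonzero(counts <= observed_count) / len(counts)

# Cross-check (an addition, not from the lecture): exact probability of
# 8 or fewer under a Binomial(100, 0.26) model.
exact_chance = sum(comb(100, k) * 0.26**k * 0.74**(100 - k)
                   for k in range(observed_count + 1))
```

Because such panels are so rare under the model, `simulated_chance` will typically be exactly 0 with 10,000 repetitions, while `exact_chance` is a tiny positive number.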

Example: Genetics of peas 🟢¶

Gregor Mendel, 1822-1884¶