In [1]:
# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
set_matplotlib_formats("svg")
plt.style.use('ggplot')

# Imports for animation.
from lec13 import sampling_animation
from IPython.display import display, IFrame, HTML, YouTubeVideo

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

Lecture 13 – Distributions and Sampling¶

DSC 10, Winter 2023¶

Announcements¶

  • The Midterm Exam is this Friday during lecture. See this post for lots of details, including what is covered, what to bring, and how to study.
    • Seating assignments coming soon; check your email!
  • Discussion is today. Dasha is sick, so all students in her section should join Dylan's discussion in Center 212 at 1pm or 2pm instead.
  • The Midterm Project is due Tuesday 2/14 at 11:59PM. Only one partner needs to submit.

Agenda¶

  • Probability distributions vs. empirical distributions.
  • Populations and samples.
  • Parameters and statistics.

⚠️ The second half of the course is more conceptual than the first. Reading the textbook will become more critical.

Probability distributions vs. empirical distributions¶

Probability distributions¶

  • Consider a random quantity with various possible values, each of which has some associated probability.
  • A probability distribution is a description of:
    • All possible values of the quantity.
    • The theoretical probability of each value.

Example: Probability distribution of a die roll 🎲¶

The distribution is uniform, meaning that each outcome has the same probability of occurring.

In [2]:
die_faces = np.arange(1, 7, 1)
die = bpd.DataFrame().assign(face=die_faces)
die
Out[2]:
face
0 1
1 2
2 3
3 4
4 5
5 6
In [3]:
bins = np.arange(0.5, 6.6, 1)

# Note that you can add titles to your visualizations, like this!
die.plot(kind='hist', y='face', bins=bins, density=True, ec='w', 
         title='Probability Distribution of a Die Roll',
         figsize=(5, 3))

# You can also set the y-axis label with plt.ylabel
plt.ylabel('Probability');
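
As a minimal sketch, the same distribution can be written down directly as a list of values and their theoretical probabilities. The names `faces` and `probabilities` are illustrative and are not part of the lecture code.

```python
import numpy as np

# The six possible values of a fair die roll.
faces = np.arange(1, 7)

# Uniform distribution: every face gets the same theoretical probability.
probabilities = np.full(len(faces), 1 / len(faces))

for face, p in zip(faces, probabilities):
    print(f"P(face = {face}) = {p:.4f}")

# The probabilities of all possible values must add up to 1.
total = probabilities.sum()
print("Total probability:", round(total, 10))
```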

Empirical distributions¶

  • Unlike probability distributions, which are theoretical, empirical distributions are based on observations.
  • Commonly, these observations are of repetitions of an experiment.
  • An empirical distribution describes:
    • All observed values.
    • The proportion of observations in which each value occurred.
  • Unlike probability distributions, empirical distributions represent what actually happened in practice.

Example: Empirical distribution of a die roll 🎲¶

  • Let's simulate a roll by using np.random.choice.
  • Rolling a die = sampling with replacement.
    • If you roll a 4, you can roll a 4 again.
In [4]:
num_rolls = 25
many_rolls = np.random.choice(die_faces, num_rolls)
many_rolls
Out[4]:
array([5, 5, 4, ..., 3, 5, 4])
In [5]:
(bpd.DataFrame()
 .assign(face=many_rolls) 
 .plot(kind='hist', y='face', bins=bins, density=True, ec='w',
       title=f'Empirical Distribution of {num_rolls} Dice Rolls',
       figsize=(5, 3))
)
plt.ylabel('Probability');
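
The histogram shows the empirical distribution visually. A rough sketch of computing the same proportions as numbers is below. It assumes a seeded NumPy generator (`rng`) purely for repeatability and simulates its own rolls rather than reusing `many_rolls`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Simulate 25 rolls of a fair die (sampling with replacement).
rolls = rng.choice(np.arange(1, 7), size=25)

# Count how often each observed face appeared, then convert counts to proportions.
observed_faces, counts = np.unique(rolls, return_counts=True)
proportions = counts / counts.sum()

for face, prop in zip(observed_faces, proportions):
    print(f"face {face}: proportion {prop:.2f} (theoretical probability {1/6:.2f})")
```

With only 25 rolls, the observed proportions will usually differ noticeably from 1/6.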

Many die rolls 🎲¶

In [6]:
for num_rolls in [10, 50, 100, 500, 1000, 5000, 10000]:
    # Don't worry about how .sample works just yet – we'll cover it shortly
    (die.sample(n=num_rolls, replace=True)
     .plot(kind='hist', y='face', bins=bins, density=True, ec='w', 
           title=f'Distribution of {num_rolls} Die Rolls',
           figsize=(8, 3))
    )

Why does this happen? ⚖️¶

The law of large numbers states that if a chance experiment is repeated

  • many times,
  • independently, and
  • under the same conditions,

then the proportion of times that an event occurs gets closer and closer to the theoretical probability of that event.

For example: As you roll a die repeatedly, the proportion of times you roll a 5 gets closer to $\frac{1}{6}$.
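
One way to see this numerically is to track the running proportion of 5s as the number of rolls grows. The sketch below assumes a seeded NumPy generator and a total of 100,000 rolls. Both choices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(13)  # seed chosen arbitrarily for repeatability

# Roll a fair die 100,000 times, independently and under the same conditions.
rolls = rng.integers(1, 7, size=100_000)  # the upper bound 7 is excluded

# Running proportion of 5s after each roll.
is_five = (rolls == 5)
running_proportion = np.cumsum(is_five) / np.arange(1, len(rolls) + 1)

for n in [10, 100, 1_000, 10_000, 100_000]:
    print(f"after {n:>7,} rolls: proportion of 5s = {running_proportion[n - 1]:.4f}")

print(f"theoretical probability: {1/6:.4f}")
```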

Sampling¶

Populations and samples¶

  • A population is the complete group of people, objects, or events that we want to learn something about.
  • It's often infeasible to collect information about every member of a population.
  • Instead, we can collect a sample, which is a subset of the population.
  • Goal: estimate the distribution of some numerical variable in the population, using only a sample.
    • For example, say we want to know the number of credits each UCSD student is taking this quarter.
    • It's too hard to get this information for every UCSD student – we can't find the population distribution.
    • Instead, we can collect data from a subset of UCSD students, to compute a sample distribution.

Question: How do we collect a good sample, so that the sample distribution closely approximates the population distribution?

Bad idea ❌: Survey whoever you can get ahold of (e.g. internet survey, people in line at Panda Express at PC).

  • Such a sample is known as a convenience sample.
  • Convenience samples often contain hidden sources of bias.

Good idea ✔️: Select individuals at random.

Simple random sample¶

A simple random sample (SRS) is a sample drawn uniformly at random without replacement.

  • "Uniformly" means every individual has the same chance of being selected.
  • "Without replacement" means we won't pick the same individual more than once.

Sampling from a list or array¶

To draw a simple random sample from a list or array named options, we use np.random.choice(options, n, replace=False).

In [7]:
tutors = ['Gabriel Cha', 'Eric Chen', 'Charlie Gillet', 'Vanessa Hu', 'Dylan Lee', 'Anthony Li', 
          'Jasmine Lo', 'Linda Long', 'Aishani Mohapatra', 'Harshi Saha', 'Abel Seyoum', 
          'Selim Shaalan', 'Yutian Shi', 'Tony Ta', 'Zairan Xiang', 'Diego Zavalza', 'Luran Zhang']

# Simple random sample of tutors
np.random.choice(tutors, 4, replace=False)
Out[7]:
array(['Harshi Saha', 'Anthony Li', 'Diego Zavalza', 'Aishani Mohapatra'],
      dtype='<U17')

If we use replace=True, then we're sampling uniformly at random with replacement – there's no simpler term for this.
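
To make the difference concrete, here is a small sketch that draws from a hypothetical five-element list (`options` is made up for illustration) both ways.

```python
import numpy as np

# A hypothetical list of options, used only for illustration.
options = ['A', 'B', 'C', 'D', 'E']

# Without replacement: 5 draws from 5 options is always a reshuffling of them.
without = np.random.choice(options, 5, replace=False)

# With replacement (the default for np.random.choice): repeats are allowed.
with_repl = np.random.choice(options, 5, replace=True)

print("without replacement:", without, "| distinct values:", len(set(without)))
print("with replacement:   ", with_repl, "| distinct values:", len(set(with_repl)))

# Asking for more draws than there are options only works with replacement.
try:
    np.random.choice(options, 6, replace=False)
except ValueError as error:
    print("cannot draw 6 without replacement:", error)
```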

Example: Distribution of flight delays ✈️¶

united_full contains information about all United flights leaving SFO between 6/1/15 and 8/31/15.

In [8]:
united_full = bpd.read_csv('data/united_summer2015.csv')
united_full
Out[8]:
Date Flight Number Destination Delay
0 6/1/15 73 HNL 257
1 6/1/15 217 EWR 28
2 6/1/15 237 STL -3
... ... ... ... ...
13822 8/31/15 1994 ORD 3
13823 8/31/15 2000 PHX -1
13824 8/31/15 2013 EWR -2

13825 rows × 4 columns

Sampling rows from a DataFrame¶

If we want to sample rows from a DataFrame, we can use the .sample method on a DataFrame. That is,

df.sample(n)

returns a random subset of n rows of df, drawn without replacement (i.e. the default is replace=False, unlike np.random.choice, whose default is replace=True).

In [9]:
# 5 flights, chosen randomly without replacement
united_full.sample(5)
Out[9]:
Date Flight Number Destination Delay
9564 8/3/15 1483 IAD -1
11739 8/17/15 1124 SEA -5
796 6/6/15 637 JFK 33
3887 6/26/15 1662 BOS 30
6859 7/16/15 1748 AUS 28
In [10]:
# 5 flights, chosen randomly with replacement
united_full.sample(5, replace=True)
Out[10]:
Date Flight Number Destination Delay
13218 8/27/15 1655 DEN 1
6707 7/15/15 1916 DEN 8
4749 7/2/15 1453 SEA 18
124 6/1/15 1645 IAD 7
9077 7/31/15 693 IAH -7

Observe: The probability of a repetition in our sample is quite low, since our sample is small relative to the number of rows in the DataFrame.
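
As a quick check of this claim, the probability of seeing at least one repeated row can be worked out directly. The sketch assumes 5 draws with replacement from 13825 equally likely rows, matching the example above.

```python
# Number of rows in united_full, as reported in the output above.
num_rows = 13825
sample_size = 5

# Probability that all 5 draws (with replacement) land on different rows:
# draw number k (counting from 0) must avoid the k rows already chosen.
p_all_distinct = 1.0
for k in range(sample_size):
    p_all_distinct *= (num_rows - k) / num_rows

p_repetition = 1 - p_all_distinct
print(f"P(at least one repeated row) = {p_repetition:.6f}")
```

The result is well under 0.1%, which is why with-replacement and without-replacement samples look so similar here.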

The effect of sample size¶

  • The law of large numbers states that when we repeat a chance experiment more and more times, the empirical distribution will look more and more like the true probability distribution.
  • Similarly, if we take a large simple random sample, then the sample distribution is likely to be a good approximation of the true population distribution.

Population distribution of flight delays ✈️¶

We only need the 'Delay' column, so let's select just that.

In [11]:
united = united_full.get(['Delay'])
united
Out[11]:
Delay
0 257
1 28
2 -3
... ...
13822 3
13823 -1
13824 -2

13825 rows × 1 columns

In [12]:
bins = np.arange(-20, 300, 10)
united.plot(kind='hist', y='Delay', bins=bins, density=True, ec='w', 
            title='Population Distribution of Flight Delays', figsize=(8, 3))
plt.ylabel('Proportion per minute');

Note that this distribution is fixed – nothing about it is random.

Sample distribution of flight delays ✈️¶

  • The 13825 flight delays in united constitute our population.
  • Normally, we won't have access to the entire population.
  • To replicate a real-world scenario, we will sample from united without replacement.
In [13]:
# Sample distribution
sample_size = 100
(united
 .sample(sample_size)
 .plot(kind='hist', y='Delay', bins=bins, density=True, ec='w',
       title='Sample Distribution of Flight Delays',
       figsize=(8, 3))
);

Note that as we increase sample_size, the sample distribution of delays looks more and more like the true population distribution of delays.
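
The sketch below puts a number on "looks more and more like". Since the flight data file is not included here, it assumes a synthetic right-skewed `population` as a stand-in. It measures the gap between the sample and population histograms as half the sum of the absolute differences in bin proportions. That measure is a choice made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(10)

# Hypothetical stand-in for the 13825 delays: a right-skewed synthetic population.
population = rng.exponential(scale=25, size=13825) - 10

bins = np.arange(-20, 300, 10)

def bin_proportions(values):
    """Proportion of the binned values that fall in each histogram bin."""
    counts, _ = np.histogram(values, bins=bins)
    return counts / counts.sum()

population_props = bin_proportions(population)

# Distance between each sample histogram and the population histogram.
distances = {}
for sample_size in [100, 1000, 10000]:
    sample = rng.choice(population, size=sample_size, replace=False)
    gap = 0.5 * np.abs(bin_proportions(sample) - population_props).sum()
    distances[sample_size] = gap
    print(f"sample_size = {sample_size:>5}: distance from population = {gap:.3f}")
```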

Parameters and statistics¶

Terminology¶

  • Statistical inference is the practice of making conclusions about a population, using data from a random sample.
  • Parameter: A number associated with the population.
    • Example: The population mean.
  • Statistic: A number calculated from the sample.
    • Example: The sample mean.
  • A statistic can be used as an estimate for a parameter.

To remember: parameter and population both start with p, statistic and sample both start with s.

Mean flight delay ✈️¶

Question: What is the average delay of United flights out of SFO? 🤔

  • We'd love to know the mean delay in the population (parameter), but in practice we'll only have a sample.
  • How does the mean delay in the sample (statistic) compare to the mean delay in the population (parameter)?

Population mean¶

The population mean is a parameter.

In [14]:
# Calculate the mean of the population
united_mean = united.get('Delay').mean()
united_mean
Out[14]:
16.658155515370705

This number (like the population distribution) is fixed, and is not random. In reality, we would not be able to see this number – we can only see it right now because this is a pedagogical demonstration!

Sample mean¶

The sample mean is a statistic. Since it depends on our sample, which was drawn at random, the sample mean is also random.

In [15]:
# Size 100
united.sample(100).get('Delay').mean()
Out[15]:
13.04
  • Each time we run the cell above, we are:
    • Collecting a new sample of size 100 from the population, and
    • Computing the sample mean.
  • We see a slightly different value on each run of the cell.
    • Sometimes, the sample mean is close to the population mean.
    • Sometimes, it's far away from the population mean.

The effect of sample size¶

What if we choose a larger sample size?

In [16]:
# Size 1000
united.sample(1000).get('Delay').mean()
Out[16]:
17.833
  • Each time we run this cell, the result is still slightly different.
  • However, the results seem to be much closer together – and much closer to the true population mean – than when we used a sample size of 100.
  • In general, statistics computed on larger samples tend to be more accurate than statistics computed on smaller samples.
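
A rough way to check this is to repeat each cell many times and compare how spread out the resulting sample means are. The sketch assumes a synthetic `population` in place of the real delays, and `simulate_sample_means` is a helper written for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2023)

# Hypothetical stand-in population of 13825 delays (synthetic, right-skewed).
population = rng.exponential(scale=25, size=13825) - 10
population_mean = population.mean()

def simulate_sample_means(sample_size, repetitions=500):
    """Draw `repetitions` simple random samples and return their means."""
    means = np.array([])
    for _ in range(repetitions):
        sample = rng.choice(population, size=sample_size, replace=False)
        means = np.append(means, sample.mean())
    return means

means_100 = simulate_sample_means(100)
means_1000 = simulate_sample_means(1000)

print(f"population mean: {population_mean:.2f}")
print(f"size 100:  sample means range from {means_100.min():.2f} to {means_100.max():.2f}")
print(f"size 1000: sample means range from {means_1000.min():.2f} to {means_1000.max():.2f}")
```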

Smaller samples:

Larger samples:

Probability distribution of a statistic¶

  • The value of a statistic, e.g. the sample mean, is random, because it depends on a random sample.
  • As with other random quantities, we can study the "probability distribution" of the statistic (also known as its "sampling distribution").
    • This describes all possible values of the statistic and all the corresponding probabilities.
    • Why? We want to know how different our statistic could have been, had we collected a different sample.
  • Unfortunately, this can be hard to calculate exactly.
    • Option 1: Do the math by hand.
    • Option 2: Generate all possible samples and calculate the statistic on each sample.
  • So we'll use simulation again to approximate it:
    • Generate a lot of possible samples and calculate the statistic on each sample.

Empirical distribution of a statistic¶

  • The empirical distribution of a statistic is based on simulated values of the statistic. It describes
    • all the observed values of the statistic, and
    • the proportion of times each value appeared.
  • The empirical distribution of a statistic can be a good approximation to the probability distribution of the statistic, if the number of repetitions in the simulation is large.
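
A minimal sketch of building such an empirical distribution is below. It assumes a synthetic `population` standing in for the flight delays, and arbitrary choices of 2000 repetitions and 10 histogram bins.

```python
import numpy as np

rng = np.random.default_rng(17)

# Hypothetical stand-in population of 13825 delays (synthetic, right-skewed).
population = rng.exponential(scale=25, size=13825) - 10

sample_size = 1000
repetitions = 2000

# Repeatedly draw a simple random sample and record its mean.
sample_means = np.array([])
for _ in range(repetitions):
    sample = rng.choice(population, size=sample_size, replace=False)
    sample_means = np.append(sample_means, sample.mean())

# Summarize the empirical distribution: proportion of simulated means in each bin.
counts, edges = np.histogram(sample_means, bins=10)
proportions = counts / repetitions

for left, right, prop in zip(edges[:-1], edges[1:], proportions):
    print(f"{left:6.2f} to {right:6.2f}: {prop:.3f}")
```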

Distribution of sample means¶

  • To understand the range of values the sample mean can take, let's...
    • Repeatedly draw a bunch of samples.
    • Record the mean of each.
    • Draw a histogram of these values.
  • The animation below visualizes the process of repeatedly sampling 1000 flights and computing the mean flight delay.
In [17]:
%%capture
anim, anim_means = sampling_animation(united, 1000);
In [18]:
HTML(anim.to_jshtml())
Out[18]: