In [1]:

```
# Set up packages for lecture. Don't worry about understanding this code,
# but make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
set_matplotlib_formats("svg")
plt.style.use('ggplot')
# Imports for animation.
from lec14 import sampling_animation
from IPython.display import display, HTML
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)
```

- Midterm Exam scores are available. See this post for details.
  - Only worth 10%. Take it as a learning experience. If you do better on the final, we'll replace your score!
  - Come to **discussion today** to see how to do the problems on the exam.
- The Midterm Project is due **Saturday at 11:59PM**. *If you fail to plan, you plan to fail.* Get started!
  - Slip days can be used if needed. Will detract from both partners' allocation.
  - Only **one** partner should submit and "Add Group Member" on Gradescope.

- A **probability distribution** is a description of:
  - All possible values of the quantity.
  - The **theoretical** probability of each value.

The distribution of a fair die roll is **uniform**, meaning that each outcome has the same chance of occurring.

In [2]:

```
die_faces = np.arange(1, 7, 1)
die = bpd.DataFrame().assign(face=die_faces)
die
```

Out[2]:

| | face |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |

In [3]:

```
bins = np.arange(0.5, 6.6, 1)
# Note that you can add titles to your visualizations, like this!
die.plot(kind='hist', y='face', bins=bins, density=True, ec='w',
         title='Probability Distribution of a Die Roll',
         figsize=(5, 3))
# You can also set the y-axis label with plt.ylabel.
plt.ylabel('Probability');
```

- Unlike probability distributions, which are theoretical, **empirical distributions are based on observations**.
- Commonly, these observations are of repetitions of an experiment.
- An **empirical distribution** describes:
  - All observed values.
  - The proportion of experiments in which each value occurred.

- Let's simulate a roll by using `np.random.choice`.
- To simulate the rolling of a die, we must sample **with** replacement.
  - If we roll a 4, we can roll a 4 again.

In [4]:

```
num_rolls = 25
many_rolls = np.random.choice(die_faces, num_rolls)
many_rolls
```

Out[4]:

array([3, 5, 2, ..., 6, 5, 2])

In [5]:

```
(bpd.DataFrame()
 .assign(face=many_rolls)
 .plot(kind='hist', y='face', bins=bins, density=True, ec='w',
       title=f'Empirical Distribution of {num_rolls} Dice Rolls',
       figsize=(5, 3))
)
plt.ylabel('Probability');
```

What happens as we increase the number of rolls?

In [6]:

```
for num_rolls in [10, 50, 100, 500, 1000, 5000, 10000]:
    # Don't worry about how .sample works just yet; we'll cover it shortly.
    (die.sample(n=num_rolls, replace=True)
     .plot(kind='hist', y='face', bins=bins, density=True, ec='w',
           title=f'Distribution of {num_rolls} Die Rolls',
           figsize=(8, 3))
    )
```

The **law of large numbers** states that if a chance experiment is repeated

- many times,
- independently, and
- under the same conditions,

then the **proportion** of times that an event occurs gets closer and closer to the **theoretical probability** of that event.

- The law of large numbers is **why** we can use simulations to approximate probability distributions!
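As a rough sketch of the law of large numbers in action, the cell below tracks the proportion of rolls that come up 6 as the number of rolls grows. The seed, the helper name `proportion_of_sixes`, and the choice of the face 6 are assumptions made for illustration; they are not part of the lecture code.

```python
import numpy as np

np.random.seed(23)  # Assumed seed, only so the sketch is reproducible.

die_faces = np.arange(1, 7)

def proportion_of_sixes(num_rolls):
    """Roll a fair die num_rolls times and return the proportion of 6s."""
    rolls = np.random.choice(die_faces, num_rolls)  # With replacement by default.
    return np.count_nonzero(rolls == 6) / num_rolls

for num_rolls in [10, 100, 1000, 10000, 100000]:
    prop = proportion_of_sixes(num_rolls)
    print(num_rolls, round(prop, 4), 'theoretical:', round(1 / 6, 4))
```

With few rolls the proportion can land far from 1/6; with many rolls it should settle close to it.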

- A **population** is the complete group of people, objects, or events that we want to learn something about.
- It's often infeasible to collect information about every member of a population.
- Instead, we can collect a **sample**, which is a subset of the population.

**Goal**: Estimate the distribution of some numerical variable in the population, using only a sample.

- For example, suppose we want to know the height of every single UCSD student.
- It's too hard to collect this information for every single UCSD student, so we can't find the **population distribution**.
- Instead, we can collect data from a subset of UCSD students, to form a **sample distribution**.

**Question**: How do we collect a good sample, so that the sample distribution closely resembles the population distribution?

**Bad idea ❌**: Survey whoever you can get ahold of (e.g. internet survey, people in line at Panda Express at PC).

- Such a sample is known as a convenience sample.
- Convenience samples often contain hidden sources of **bias**.

**Good idea ✔️**: Select individuals at random.

A **simple random sample (SRS)** is a sample drawn **uniformly** at random **without replacement**.

- "Uniformly" means every individual has the same chance of being selected.
- "Without replacement" means we won't pick the same individual more than once.

To perform an SRS from a list or array `options`, we use `np.random.choice(options, n, replace=False)`.

In [7]:

```
staff = ['Oren Ciolli', 'Nate Del Rosario', 'Jack Determan', 'Sophia Fang', 'Charlie Gillet',
'Ashley Ho', 'Henry Ho', 'Vanessa Hu', 'Leena Kang', 'Norah Kerendian', 'Anthony Li', 'Weiyue Li',
'Jasmine Lo', 'Arjun Malleswaran', 'Mert Ozer', 'Baby Panda', 'Arya Rahnama', 'Aaron Rasin', 'Chandiner Rishi', 'Gina Roberg',
'Harshi Saha', 'Keenan Serrao', 'Abel Seyoum', 'Suhani Sharma', 'Yutian Shi', 'Ester Tsai',
'Bill Wang', 'Ylesia Wu', 'Jason Xu', 'Diego Zavalza', 'Ciro Zhang']
# Simple random sample of 4 course staff members.
np.random.choice(staff, 4, replace=False)
```

Out[7]:

array(['Baby Panda', 'Vanessa Hu', 'Oren Ciolli', 'Sophia Fang'], dtype='<U17')

If we use `replace=True`, then we're sampling uniformly at random with replacement; there's no simpler term for this.
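To make the difference concrete, here is a small sketch contrasting the two settings of `replace`. The array `options` and the seed are made up for illustration.

```python
import numpy as np

np.random.seed(10)  # Assumed seed for reproducibility.

options = np.array(['A', 'B', 'C', 'D', 'E'])  # Hypothetical population of 5 individuals.

# Without replacement: no individual can appear more than once.
srs = np.random.choice(options, 4, replace=False)
print(srs, len(np.unique(srs)) == 4)

# With replacement: asking for more draws than individuals forces repeats.
with_repl = np.random.choice(options, 8, replace=True)
print(with_repl, len(np.unique(with_repl)) < 8)

# Without replacement, we can't draw more individuals than the population has.
try:
    np.random.choice(options, 8, replace=False)
    too_many_allowed = True
except ValueError:
    too_many_allowed = False
print(too_many_allowed)
```

The last call raises a `ValueError`, which is why the sketch wraps it in `try`/`except`.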

`united_full` contains information about all United flights leaving SFO between 6/1/15 and 8/31/15.

For this lecture, treat this dataset as our **population**.

In [8]:

```
united_full = bpd.read_csv('data/united_summer2015.csv')
united_full
```

Out[8]:

| | Date | Flight Number | Destination | Delay |
|---|---|---|---|---|
| 0 | 6/1/15 | 73 | HNL | 257 |
| 1 | 6/1/15 | 217 | EWR | 28 |
| 2 | 6/1/15 | 237 | STL | -3 |
| ... | ... | ... | ... | ... |
| 13822 | 8/31/15 | 1994 | ORD | 3 |
| 13823 | 8/31/15 | 2000 | PHX | -1 |
| 13824 | 8/31/15 | 2013 | EWR | -2 |

13825 rows × 4 columns

If we want to sample rows from a DataFrame, we can use the `.sample` method on a DataFrame. That is,

```
df.sample(n)
```

returns a random subset of `n` rows of `df`, drawn **without replacement** (i.e. the default is `replace=False`, unlike `np.random.choice`, whose default is `replace=True`).

In [9]:

```
# 5 flights, chosen randomly without replacement.
united_full.sample(5)
```

Out[9]:

| | Date | Flight Number | Destination | Delay |
|---|---|---|---|---|
| 7024 | 7/17/15 | 1774 | IAD | -3 |
| 11637 | 8/16/15 | 1780 | SEA | 6 |
| 5631 | 7/8/15 | 1754 | EWR | 66 |
| 79 | 6/1/15 | 1171 | DEN | 17 |
| 10683 | 8/10/15 | 1671 | SAN | -2 |

In [10]:

```
# 5 flights, chosen randomly with replacement.
united_full.sample(5, replace=True)
```

Out[10]:

| | Date | Flight Number | Destination | Delay |
|---|---|---|---|---|
| 382 | 6/3/15 | 1199 | MCO | 3 |
| 5637 | 7/8/15 | 1922 | EWR | 19 |
| 8925 | 7/30/15 | 824 | JFK | 0 |
| 12188 | 8/20/15 | 1276 | LAX | -5 |
| 6326 | 7/13/15 | 1182 | LAX | 3 |

**Note**: The probability of seeing the same row multiple times when sampling with replacement is quite low, since our sample size (5) is small relative to the size of the population (13825).
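The note above can be checked with a short calculation. When drawing 5 rows with replacement from 13825, the second draw avoids the first with chance 13824/13825, the third avoids the first two with chance 13823/13825, and so on. The sketch below multiplies these chances and takes the complement; the variable names are illustrative.

```python
import numpy as np

population_size = 13825
sample_size = 5

# Chance that each successive draw avoids all previously drawn rows.
chances_no_repeat = (population_size - np.arange(sample_size)) / population_size
prob_all_distinct = chances_no_repeat.prod()

# Chance that at least one row shows up more than once.
prob_some_repeat = 1 - prob_all_distinct
print(prob_some_repeat)
```

The result is well under 0.1%, which matches the intuition that repeats are rare at this sample size.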

We only need the `'Delay'`s, so let's select just that column.

In [11]:

```
united = united_full.get(['Delay'])
united
```

Out[11]:

| | Delay |
|---|---|
| 0 | 257 |
| 1 | 28 |
| 2 | -3 |
| ... | ... |
| 13822 | 3 |
| 13823 | -1 |
| 13824 | -2 |

13825 rows × 1 columns

In [12]:

```
bins = np.arange(-20, 300, 10)
united.plot(kind='hist', y='Delay', bins=bins, density=True, ec='w',
            title='Population Distribution of Flight Delays', figsize=(8, 3))
plt.ylabel('Proportion per minute');
```

Note that this distribution is **fixed** – nothing about it is random.

- The 13825 flight delays in `united` constitute our population.
- Normally, we won't have access to the entire population.
- To replicate a real-world scenario, we will sample from `united` **without replacement**.

In [13]:

```
sample_size = 100 # Change this and see what happens!
(united
 .sample(sample_size)
 .plot(kind='hist', y='Delay', bins=bins, density=True, ec='w',
       title=f'Distribution of Flight Delays in a Sample of Size {sample_size}',
       figsize=(8, 3))
);
```

As we increase `sample_size`, the sample distribution of delays looks more and more like the true population distribution of delays.

**Statistical inference** is the practice of making conclusions about a population, using data from a random sample.

**Parameter**: A number associated with the population.

- Example: The population mean.

**Statistic**: A number calculated from the sample.

- Example: The sample mean.

A statistic can be used as an **estimate** for a parameter.

*To remember: parameter and population both start with p, statistic and sample both start with s.*

**Question**: What was the average delay of *all* United flights out of SFO in Summer 2015? 🤔

- We'd love to know the **mean delay in the population (parameter)**, but in practice we'll only have a **sample**.
- How does the **mean delay in the sample (statistic)** compare to the **mean delay in the population (parameter)**?

The **population mean** is a **parameter**.

In [14]:

```
# Calculate the mean of the population.
united_mean = united.get('Delay').mean()
united_mean
```

Out[14]:

16.658155515370705

The **sample mean** is a **statistic**. Since it depends on our sample, which was drawn at random, the sample mean is **also random**.

In [15]:

```
# Size 100.
united.sample(100).get('Delay').mean()
```

Out[15]:

12.03

- Each time we run the cell above, we are:
  - Collecting a new sample of size 100 from the population, and
  - Computing the sample mean.
- We see a slightly different value on each run of the cell.
  - Sometimes, the sample mean is close to the population mean.
  - Sometimes, it's far away from the population mean.

What if we choose a larger sample size?

In [16]:

```
# Size 1000.
united.sample(1000).get('Delay').mean()
```

Out[16]:

14.975

- Each time we run the above cell, the result is still slightly different.
- However, the results seem to be much closer together, and much closer to the true population mean, than when we used a sample size of 100.

**In general**, statistics computed on **larger** samples tend to be **better** estimates of population parameters than statistics computed on smaller samples.
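One way to see this without rerunning a cell by hand is to repeat the sampling many times for each sample size and measure how far the sample means typically fall from the population mean. The sketch below does this on a made-up, right-skewed population rather than the real `united` data; the seed, the exponential stand-in, the helper name `sample_means`, and the 500 repetitions are all assumptions for illustration.

```python
import numpy as np

np.random.seed(42)  # Assumed seed for reproducibility.

# Stand-in population: 13825 made-up, right-skewed "delays" (not the real United data).
population = np.random.exponential(scale=20, size=13825)
population_mean = population.mean()

def sample_means(sample_size, repetitions=500):
    """Return an array of means of repeated simple random samples."""
    means = np.array([])
    for _ in np.arange(repetitions):
        sample = np.random.choice(population, sample_size, replace=False)
        means = np.append(means, sample.mean())
    return means

for sample_size in [100, 1000]:
    means = sample_means(sample_size)
    # Typical distance between a sample mean and the population mean.
    print(sample_size, np.abs(means - population_mean).mean())
```

The typical distance should come out noticeably smaller for samples of size 1000 than for samples of size 100.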

**Smaller samples**:

**Larger samples**:

- The value of a statistic, e.g. the sample mean, is random, because it depends on a random sample.

- Like other random quantities, we can study the "probability distribution" of the statistic (also known as its "sampling distribution").
  - This describes all possible values of the statistic and all the corresponding probabilities.
  - Why? **We want to know how different our statistic *could have* been, had we collected a different sample.**

- Unfortunately, this can be hard to calculate exactly.
  - Option 1: Do the math by hand.
  - Option 2: Generate **all** possible samples and calculate the statistic on each sample.
- So, we'll instead use a simulation to approximate the distribution of the sample statistic.
  - We'll need to generate **a lot of** possible samples and calculate the statistic on each sample.
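To get a feel for why generating all possible samples is out of reach, the sketch below counts how many distinct samples of size 100 could be drawn without replacement from a population of 13825. The sample size of 100 is simply the one used earlier in this lecture, and `math.comb` assumes Python 3.8 or later.

```python
import math

population_size = 13825
sample_size = 100

# Number of distinct samples of 100 flights drawn without replacement (order ignored).
num_possible_samples = math.comb(population_size, sample_size)

# Print how many digits that count has, rather than the count itself.
print(len(str(num_possible_samples)))
```

The count has more than 250 digits, so computing the statistic on every possible sample is not practical.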

- The empirical distribution of a statistic is based on simulated values of the statistic. It describes:
  - All observed values of the statistic.
  - The proportion of samples in which each value occurred.
- The empirical distribution of a statistic can be a good approximation to the probability distribution of the statistic, **if the number of repetitions in the simulation is large**.

- To understand how different the sample mean can be in different samples, we'll:
  - Repeatedly draw many samples.
  - Record the mean of each.
  - Draw a histogram of these values.
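The three steps above can be sketched as a plain loop. This is not the `sampling_animation` helper used in the next cell; it is a minimal stand-in that runs on a made-up population, with an assumed seed, sample size, and number of repetitions. To keep it free of plotting dependencies, it computes the histogram counts with `np.histogram` instead of drawing them.

```python
import numpy as np

np.random.seed(14)  # Assumed seed for reproducibility.

# Stand-in population of made-up delays (not the real United data).
population = np.random.exponential(scale=20, size=13825)

sample_size = 1000
repetitions = 2000

sample_means = np.array([])
for _ in np.arange(repetitions):
    # 1. Draw a sample without replacement.
    sample = np.random.choice(population, sample_size, replace=False)
    # 2. Record its mean.
    sample_means = np.append(sample_means, sample.mean())

# 3. Bin the recorded means; these counts are what a histogram would display.
counts, bin_edges = np.histogram(sample_means, bins=20)
print(counts)
print(sample_means.mean(), population.mean())
```

Passing `sample_means` to a histogram plotting call would draw the same picture; the average of the recorded means should sit close to the population mean.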

In [17]:

```
%%capture
anim, anim_means = sampling_animation(united, 1000);
```

In [18]:

```
HTML(anim.to_jshtml())
```

Out[18]: