# Lecture 14 – Distributions and Sampling

## DSC 10, Fall 2022

### Announcements

• Homework 4 is due tomorrow at 11:59PM.
• The Midterm Project is due Tuesday 11/1 at 11:59PM. Use pair programming 👯. See this post for clarifications.
• The Midterm Exam is on Friday 10/28 during lecture. See this post for lots of details, including how to find your assigned seat, what to bring, and how to study.
• We've added 10+ more weekly office hours, starting this week!

### Agenda

• Probability distributions vs. empirical distributions.
• Populations and samples.
• Parameters and statistics.

⚠️ The second half of the course is more conceptual than the first. Reading the textbook will become more critical.

## Probability distributions vs. empirical distributions

### Probability distributions

• Consider a random quantity with various possible values, each of which has some associated probability.
• A probability distribution is a description of:
    • All possible values of the quantity.
    • The theoretical probability of each value.

### Example: Probability distribution of a die roll 🎲

The distribution is uniform, meaning that each outcome has the same probability of occurring.
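
As a minimal sketch (not from the original notebook), this distribution can be written as an array of possible values and an array of theoretical probabilities. The names `faces` and `probabilities` are illustrative.

```python
import numpy as np

# The six faces of a fair die and their theoretical probabilities.
faces = np.arange(1, 7)
probabilities = np.ones(6) / 6

for face, p in zip(faces, probabilities):
    print(f"P(roll = {face}) = {p:.4f}")

# The probabilities in any probability distribution sum to 1.
total_probability = probabilities.sum()
print("total probability:", total_probability)
```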

### Empirical distributions

• Unlike probability distributions, which are theoretical, empirical distributions are based on observations.
• Commonly, these observations are of repetitions of an experiment.
• An empirical distribution describes:
    • All observed values.
    • The proportion of observations in which each value occurred.
• Unlike probability distributions, empirical distributions represent what actually happened in practice.

### Example: Empirical distribution of a die roll 🎲

• Let's simulate a roll by using np.random.choice.
• Rolling a die = sampling with replacement.
    • If you roll a 4, you can roll a 4 again.
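
One way this simulation might look is sketched below. The number of rolls (600) and the seed are arbitrary choices made for illustration, not values from the lecture.

```python
import numpy as np

np.random.seed(14)  # arbitrary seed, only so the sketch is reproducible

faces = np.arange(1, 7)
num_rolls = 600  # hypothetical number of rolls

# np.random.choice samples with replacement by default,
# so the same face can come up more than once.
rolls = np.random.choice(faces, num_rolls)

# The empirical distribution: each observed value and the
# proportion of rolls in which it occurred.
values, counts = np.unique(rolls, return_counts=True)
empirical = counts / num_rolls

for value, prop in zip(values, empirical):
    print(f"face {value}: proportion {prop:.3f}")
```

Increasing `num_rolls` and rerunning should make the proportions look more and more uniform.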

### Why does this happen? ⚖️

The law of large numbers states that if a chance experiment is repeated

• many times,
• independently, and
• under the same conditions,

then the proportion of times that an event occurs gets closer and closer to the theoretical probability of that event.

For example: As you roll a die repeatedly, the proportion of times you roll a 5 gets closer to $\frac{1}{6}$.
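
The die example can be checked with a short simulation. This is a sketch under assumed settings (100,000 rolls, an arbitrary seed); it tracks the running proportion of 5s as the number of rolls grows.

```python
import numpy as np

np.random.seed(14)  # arbitrary seed for reproducibility

num_rolls = 100_000
rolls = np.random.choice(np.arange(1, 7), num_rolls)

# running_prop[i] is the proportion of 5s among the first i + 1 rolls.
is_five = (rolls == 5)
running_prop = np.cumsum(is_five) / np.arange(1, num_rolls + 1)

for n in [10, 100, 1_000, 10_000, 100_000]:
    print(f"after {n:>6} rolls: proportion of 5s = {running_prop[n - 1]:.4f}")

print(f"theoretical probability: {1 / 6:.4f}")
```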

## Sampling

### Populations and samples

• A population is the complete group of people, objects, or events that we want to learn something about.
• It's often infeasible to collect information about every member of a population.
• Instead, we can collect a sample, which is a subset of the population.
• Goal: estimate the distribution of some numerical variable in the population, using only a sample.
    • For example, say we want to know the number of credits each UCSD student is taking this quarter.
    • It's too hard to get this information for every UCSD student – we can't find the population distribution.
    • Instead, we can collect data from a subset of UCSD students, to compute a sample distribution.

Question: How do we collect a good sample, so that the sample distribution closely approximates the population distribution?

Bad idea ❌: Survey whoever you can get ahold of (e.g. internet survey, people in line at Panda Express at PC).

• Such a sample is known as a convenience sample.
• Convenience samples often contain hidden sources of bias.

### Probability sample (aka random sample)

• In order for a sample to be a probability sample, you must be able to calculate the probability of selecting any subset of the population.
• Not all individuals need to have an equal chance of being selected.

### A probability sample

• Scheme: Start at a random row number between 0 and 9, then take every tenth row thereafter.
• This is a probability sample!
• Any given row is equally likely to be picked, with probability $\frac{1}{10}$.
• It is not true that every subset of rows has the same probability of being selected.
    • There are only 10 possible samples: rows (0, 10, 20, 30, ..., 190), rows (1, 11, 21, ..., 191), and so on.
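
The scheme above can be sketched as follows, assuming a table with 200 rows (numbered 0 through 199), which is what the row lists above imply. The variable names are illustrative.

```python
import numpy as np

num_rows = 200  # assumed table size: rows 0 through 199

# Draw one sample: random starting row from 0 to 9, then every tenth row.
start = np.random.choice(np.arange(10))
sample_rows = np.arange(start, num_rows, 10)
print("selected rows:", sample_rows)

# Enumerate all 10 possible samples, one per starting row.
all_samples = [np.arange(s, num_rows, 10) for s in range(10)]

# Each row belongs to exactly one of the 10 equally likely samples,
# so each row is selected with probability 1/10.
appearances = np.zeros(num_rows, dtype=int)
for sample in all_samples:
    appearances[sample] += 1
row_probabilities = appearances / 10

# Rows 0 and 1 never appear in the same sample, so not every
# subset of rows is equally likely.
both_0_and_1 = any(0 in s and 1 in s for s in all_samples)
print("can rows 0 and 1 be picked together?", both_0_and_1)
```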

### Simple random sample

• A simple random sample (SRS) is a sample drawn uniformly at random without replacement.
• In an SRS...
    • Every individual has the same chance of being selected.
    • Every pair has the same chance of being selected.
    • Every triplet has the same chance of being selected.
    • And so on...
• To perform an SRS of size n from a list or array options, we use np.random.choice(options, n, replace=False).
• If we use replace=True, then we're sampling uniformly at random with replacement – there's no simpler term for this.
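
A small sketch of both kinds of sampling is below. The array `options` is a made-up population, and the sample size `n` is passed as the second argument to `np.random.choice`.

```python
import numpy as np

options = np.array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'])  # hypothetical population
n = 5

# Simple random sample: uniformly at random, without replacement.
srs = np.random.choice(options, n, replace=False)
print("SRS:             ", srs)   # no individual appears twice

# Uniformly at random, with replacement: repeats are possible.
with_repl = np.random.choice(options, n, replace=True)
print("with replacement:", with_repl)
```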

### Sampling rows from a DataFrame

If we want to sample rows from a DataFrame, we can use the .sample method on a DataFrame. That is,

df.sample(n)


returns a random subset of n rows of df, drawn without replacement (i.e. the default is replace=False, unlike np.random.choice).
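
A minimal sketch of `.sample` follows. It assumes the pandas library and a tiny made-up table; the column names and values are illustrative and are not the lecture's data.

```python
import pandas as pd

# A tiny hypothetical table, invented for this sketch.
df = pd.DataFrame({
    'Flight': [101, 102, 103, 104, 105, 106],
    'Delay': [5, -2, 30, 0, 12, 47],
})

# Default behavior: 3 distinct rows, drawn without replacement.
without_repl = df.sample(n=3)
print(without_repl)

# Passing replace=True allows the same row to be drawn more than once.
with_repl = df.sample(n=3, replace=True)
print(with_repl)
```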

### The effect of sample size

• The law of large numbers states that when we repeat a chance experiment more and more times, the empirical distribution will look more and more like the true probability distribution.
• Similarly, if we take a large simple random sample, then the sample distribution is likely to be a good approximation of the true population distribution.

### Example: Distribution of flight delays ✈️

united_full contains information about all United flights leaving SFO between 6/1/15 and 8/31/15.

We only need the 'Delay' column, so let's select just that.

### Population distribution of flight delays ✈️

Note that this distribution is fixed – nothing about it is random.

### Sample distribution of flight delays ✈️

• The 13825 flight delays in united constitute our population.
• To replicate a real-world scenario, we will sample from united without replacement.

Note that as we increase sample_size, the sample distribution of delays looks more and more like the true population distribution of delays.
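
The flight data itself is not reproduced here, so the sketch below uses a synthetic stand-in population of 13825 right-skewed "delays" (an assumption made purely for illustration). It measures how far each sample's histogram proportions are from the population's, for a few values of `sample_size`.

```python
import numpy as np

np.random.seed(14)  # arbitrary seed for reproducibility

# Synthetic stand-in for the population of delays (NOT the real data).
population = np.random.exponential(scale=20, size=13825) - 5
bins = np.arange(-10, 310, 10)

def proportions(values):
    """Proportion of values falling in each histogram bin."""
    counts, _ = np.histogram(values, bins=bins)
    return counts / len(values)

pop_props = proportions(population)

gaps = {}
for sample_size in [100, 1_000, 10_000]:
    # Sample without replacement, as in the text.
    sample = np.random.choice(population, sample_size, replace=False)
    # Total absolute gap between sample and population bin proportions.
    gaps[sample_size] = np.abs(proportions(sample) - pop_props).sum()
    print(f"sample_size = {sample_size:>6}: gap = {gaps[sample_size]:.3f}")
```

A smaller gap means the sample histogram sits closer to the population histogram.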

## Parameters and statistics

### Terminology

• Statistical inference is the practice of drawing conclusions about a population, using data from a random sample.
• Parameter: A number associated with the population.
    • Example: The population mean.
• Statistic: A number calculated from the sample.
    • Example: The sample mean.
• A statistic can be used as an estimate for a parameter.

### Mean flight delay ✈️

Question: What is the average delay of United flights out of SFO? 🤔

• We'd love to know the mean delay in the population (parameter), but in practice we'll only have a sample.
• How does the mean delay in the sample (statistic) compare to the mean delay in the population (parameter)?

### Population mean

The population mean is a parameter.

This number (like the population distribution) is fixed, and is not random. In reality, we would not be able to see this number – we can only see it right now because this is a pedagogical demonstration!

### Sample mean

The sample mean is a statistic. Since it depends on our sample, which was drawn at random, the sample mean is also random.

• Each time we run the cell above, we are:
    • Collecting a new sample of size 100 from the population, and
    • Computing the sample mean.
• We see a slightly different value on each run of the cell.
    • Sometimes, the sample mean is close to the population mean.
    • Sometimes, it's far away from the population mean.

### The effect of sample size

What if we choose a larger sample size?

• Each time we run this cell, the result is still slightly different.
• However, the results seem to be much closer together – and much closer to the true population mean – than when we used a sample size of 100.
• In general, statistics computed on larger samples tend to be more accurate than statistics computed on smaller samples.
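
To see this numerically, here is a sketch that again uses a synthetic stand-in population (not the real flight data). It draws 200 samples at each of two sample sizes and compares how far the sample means typically land from the population mean. The helper `sample_means` and the choice of 200 repetitions are illustrative.

```python
import numpy as np

np.random.seed(14)  # arbitrary seed for reproducibility

# Synthetic stand-in for the population of delays (NOT the real data).
population = np.random.exponential(scale=20, size=13825) - 5
population_mean = population.mean()  # the parameter: fixed, not random

def sample_means(sample_size, repetitions=200):
    """Means of `repetitions` samples, each drawn without replacement."""
    return np.array([
        np.random.choice(population, sample_size, replace=False).mean()
        for _ in range(repetitions)
    ])

means_small = sample_means(100)
means_large = sample_means(1_000)

# Average distance between the statistic and the parameter.
error_small = np.abs(means_small - population_mean).mean()
error_large = np.abs(means_large - population_mean).mean()

print(f"population mean: {population_mean:.2f}")
print(f"sample size  100: typical error = {error_small:.2f}")
print(f"sample size 1000: typical error = {error_large:.2f}")
```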

Smaller samples:

Larger samples:

### Probability distribution of a statistic

• The value of a statistic, e.g. the sample mean, is random, because it depends on a random sample.
• Like other random quantities, we can study the "probability distribution" of the statistic (also known as its "sampling distribution").
• This describes all possible values of the statistic and all the corresponding probabilities.
• Why? We want to know how different our statistic could have been, had we collected a different sample.
• Unfortunately, this can be hard to calculate exactly.
    • Option 1: Do the math by hand.
    • Option 2: Generate all possible samples and calculate the statistic on each sample.
• So we'll use simulation again to approximate:
    • Generate a lot of possible samples and calculate the statistic on each sample.

### Empirical distribution of a statistic

• The empirical distribution of a statistic is based on simulated values of the statistic. It describes
    • all the observed values of the statistic, and
    • the proportion of times each value appeared.
• The empirical distribution of a statistic can be a good approximation to the probability distribution of the statistic, if the number of repetitions in the simulation is large.

### Distribution of sample means

• Let's...
    • Repeatedly draw a bunch of samples.
    • Record the mean of each.
    • Draw a histogram of the resulting distribution.
• Try different sample sizes and look at the resulting histogram!
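
The steps above can be sketched as follows. The population is a synthetic stand-in (not the real flight data), and the plot is replaced by printed histogram proportions so that the example needs only NumPy. Change `sample_size` and rerun to see how the distribution responds.

```python
import numpy as np

np.random.seed(14)  # arbitrary seed for reproducibility

# Synthetic stand-in for the population of delays (NOT the real data).
population = np.random.exponential(scale=20, size=13825) - 5

sample_size = 100    # try other values here
repetitions = 2_000

# Repeatedly draw a sample and record its mean.
means = np.array([])
for _ in range(repetitions):
    sample = np.random.choice(population, sample_size, replace=False)
    means = np.append(means, sample.mean())

# Empirical distribution of the sample mean, summarized in 10 bins.
counts, edges = np.histogram(means, bins=10)
bin_proportions = counts / repetitions

for left, right, prop in zip(edges[:-1], edges[1:], bin_proportions):
    print(f"[{left:6.2f}, {right:6.2f}): {prop:.3f}")

print(f"population mean: {population.mean():.2f}")
```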

### What's the point?

• In practice, we will only be able to collect one sample and calculate one statistic.
• Sometimes, that sample will be very representative of the population, and the statistic will be very close to the parameter we are trying to estimate.
• Other times, that sample will not be as representative of the population, and the statistic will not be very close to the parameter we are trying to estimate.
• The empirical distribution of the sample mean helps us answer the question "what would the sample mean have looked like if we drew a different sample?"

### Concept Check ✅ – Answer at cc.dsc10.com

We just sampled one thousand flights, two thousand times. If we now sample one hundred flights, two thousand times, how will the histogram change?

• A. narrower
• B. wider
• C. shifted left
• D. shifted right
• E. unchanged

### How we sample matters!

• So far, we've taken large simple random samples, without replacement, from the full population.
• If the population is much larger than the sample, then it makes little difference whether we sample with or without replacement.
• The sample mean, for samples like this, is a good approximation of the population mean.
• But this is not always the case if we sample differently.

## Summary, next time

### Summary

• The probability distribution of a random quantity describes the values it takes on along with the probability of each value occurring.
• An empirical distribution describes the values and frequencies of the results of a random experiment.
• With more trials of an experiment, the empirical distribution gets closer to the probability distribution.
• A population distribution describes the values and frequencies of some characteristic of a population.
• A sample distribution describes the values and frequencies of some characteristic of a sample, which is a subset of a population.
• When we take a simple random sample, as we increase our sample size, the sample distribution gets closer and closer to the population distribution.
• A parameter is a number associated with a population, and a statistic is a number associated with a sample.
• We can use statistics calculated on a random sample to estimate population parameters.
• For example, to estimate the mean of a population, we can calculate the mean of the sample.
• Larger samples tend to lead to better estimates.

### Next time

Next, we'll start talking about statistical models, which will lead us towards hypothesis testing.