In [1]:

```
# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
from scipy import stats
import otter
set_matplotlib_formats("svg")
plt.style.use('ggplot')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)
# Setup to start where we left off last time
keep_cols = ['business_name', 'inspection_date', 'inspection_score', 'risk_category', 'Neighborhoods', 'Zip Codes']
restaurants_full = bpd.read_csv('data/restaurants_full.csv').get(keep_cols)
bakeries = restaurants_full[restaurants_full.get('business_name').str.lower().str.contains('bake')]
bakeries = bakeries[bakeries.get('inspection_score') >= 0] # Keeping only the rows where we know the inspection score
# Animation
from IPython.display import IFrame, display
def show_clt_slides():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vTcJd3U1H1KoXqBFcWGKFUPjZbeW4oiNZZLCFY8jqvSDsl4L1rRTg7980nPs1TGCAecYKUZxH5MZIBh/embed?start=false&loop=false&delayms=3000"
    width = 960
    height = 509
    display(IFrame(src, width, height))
```

- The Final Exam is **tomorrow at 3PM in Galbraith Hall 242**.
  - See this EdStem announcement for details.
  - Assigned seats will be emailed to you by Friday.
  - We will check IDs.
- You'll have **2 hours, 50 minutes** to work on the exam.
  - No questions during the exam.
- The DSC 10 Reference Sheet will be provided. No calculators or other notes.
- Practice with old exam problems at practice.dsc10.com.

- If at least 80% of the class fills out both CAPEs and the End of Quarter Survey, the entire class earns 0.5% extra credit on their overall grade.
- We value your feedback!

- More review.
- Working on personal projects.
- Demo: Gapminder 🌎.
- Some parting thoughts.

Consider this **population** of bakeries in San Francisco.

In [2]:

```
bakeries
```

Out[2]:

| | business_name | inspection_date | inspection_score | risk_category | Neighborhoods | Zip Codes |
|---|---|---|---|---|---|---|
| 327 | Le Marais Bakery Castro | 2018-08-06T00:00:00.000 | 90.0 | Moderate Risk | NaN | NaN |
| 365 | Pho Luen Fat Bakery & Restaurant | 2019-04-08T00:00:00.000 | 76.0 | Low Risk | NaN | NaN |
| 372 | Brioche Bakery & Cafe | 2019-01-31T00:00:00.000 | 88.0 | Low Risk | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 53954 | Fancy Wheatfield Bakery | 2019-03-04T00:00:00.000 | 83.0 | Moderate Risk | NaN | NaN |
| 54102 | New Hollywood Bakery & Restaurant | 2016-08-30T00:00:00.000 | 74.0 | High Risk | NaN | NaN |
| 54171 | Speciality's Cafe and Bakery | 2019-04-29T00:00:00.000 | 89.0 | Moderate Risk | NaN | NaN |

1216 rows × 6 columns

In [3]:

```
score_bins = np.arange(50, 102, 2)
bakeries.plot(kind='hist', y='inspection_score', density=True, bins=score_bins, ec='w', figsize=(10, 5),
              title='Population Distribution');
```

For reference, the mean and standard deviation of the population distribution are calculated below.

In [4]:

```
bakeries.get('inspection_score').describe()
```

Out[4]:

count    1216.00
mean       84.20
std         8.35
            ...
50%        86.00
75%        90.00
max       100.00
Name: inspection_score, Length: 8, dtype: float64

In this case we happen to have the inspection scores for all members of the population, but in reality we won't. So let's instead take a random **sample** of 200 bakeries from the population.

In [5]:

```
np.random.seed(23) # Ignore this
sample_of_bakeries = bakeries.sample(200) # SOLUTION
sample_of_bakeries
```

Out[5]:

| | business_name | inspection_date | inspection_score | risk_category | Neighborhoods | Zip Codes |
|---|---|---|---|---|---|---|
| 33359 | Universal Bakery Inc. | 2019-01-28T00:00:00.000 | 83.0 | Low Risk | 2.0 | 28859.0 |
| 19980 | Cherry Blossom Bakery 2 | 2016-06-28T00:00:00.000 | 90.0 | Moderate Risk | NaN | NaN |
| 29825 | Waterfront Bakery | 2018-06-07T00:00:00.000 | 94.0 | Low Risk | 32.0 | 308.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 4835 | Marla Bakery | 2018-09-10T00:00:00.000 | 91.0 | High Risk | NaN | NaN |
| 26932 | PRINCESS BAKERY | 2016-08-16T00:00:00.000 | 79.0 | Low Risk | 5.0 | 28861.0 |
| 34201 | Castro Tarts Cafe and Bakery Inc. | 2017-08-23T00:00:00.000 | 82.0 | Low Risk | NaN | NaN |

200 rows × 6 columns

In [6]:

```
sample_of_bakeries.plot(kind='hist', y='inspection_score', density=True, bins=score_bins, ec='w', figsize=(10, 5),
                        title='Sample Distribution');
```

Note that since we took a large, random sample of the population, we expect our sample to look similar to the population and to have a similar mean and SD.

In [7]:

```
sample_of_bakeries.get('inspection_score').describe()
```

Out[7]:

count    200.00
mean      84.67
std        8.38
           ...
50%       87.00
75%       91.25
max       98.00
Name: inspection_score, Length: 8, dtype: float64

Indeed, the sample mean is quite close to the population mean, and the sample standard deviation is quite close to the population standard deviation.

Let's suppose we want to estimate the population mean (that is, the mean inspection score of all bakeries in SF).

One estimate of the population mean is the mean of our sample.

In [8]:

```
sample_of_bakeries.get('inspection_score').mean()
```

Out[8]:

84.665

However, our sample was random and could have been different, meaning our sample mean could also have been different.

**Question**: What's a reasonable range of possible values for the sample mean? **What is the distribution of the sample mean?**

The Central Limit Theorem (CLT) says that the probability distribution of the **sum or mean** of a large random sample drawn with replacement will be roughly normal, regardless of the distribution of the population from which the sample is drawn.

In [9]:

```
show_clt_slides()
```

To see an empirical distribution of the sample mean, let's take a large number of samples directly from the population and compute the mean of each one.

Remember, in real life we wouldn't be able to do this, since we wouldn't have access to the population.

In [10]:

```
sample_means = np.array([])
# BEGIN SOLUTION
for i in np.arange(5000):
    sample_mean = bakeries.sample(200).get('inspection_score').mean()
    sample_means = np.append(sample_means, sample_mean)
# END SOLUTION
```

In [11]:

```
sample_means
```

Out[11]:

array([84.34, 85.02, 83.79, ..., 84.64, 84.49, 84.17])

In [12]:

```
bpd.DataFrame().assign(sample_means=sample_means).plot(kind='hist', density=True, ec='w', bins=25, figsize=(10, 5));
```

Unsurprisingly, the distribution of the sample mean is bell-shaped. The CLT told us that!

The CLT also tells us that

$$\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}$$

Let's try this out.

In [13]:

```
np.std(bakeries.get('inspection_score')) / np.sqrt(200)
```

Out[13]:

0.5904894545352809

In [14]:

```
np.std(sample_means)
```

Out[14]:

0.5469232018985846

Pretty close! Remember that `sample_means` is an array of simulated sample means; the more samples we simulate, the closer `np.std(sample_means)` will get to the SD described by the CLT.
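
As a quick visual check (a sketch, not one of the original cells), we can overlay the normal curve that the CLT predicts, centered at the population mean with the SD computed above, on the empirical distribution of `sample_means`:

```
# Sketch: overlay the CLT-predicted normal curve on the simulated sample means.
# Uses `bakeries`, `sample_means`, and the `stats` module imported at the top.
pop_mean = bakeries.get('inspection_score').mean()
clt_sd = np.std(bakeries.get('inspection_score')) / np.sqrt(200)

bpd.DataFrame().assign(sample_means=sample_means).plot(
    kind='hist', density=True, ec='w', bins=25, figsize=(10, 5))
x = np.linspace(pop_mean - 4 * clt_sd, pop_mean + 4 * clt_sd, 200)
plt.plot(x, stats.norm.pdf(x, pop_mean, clt_sd), color='black', lw=3, label='CLT prediction')
plt.legend();
```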

Note that in practice, we won't have the SD of the population, since we'll usually just have a single sample. In such cases, we can use the SD of the sample as an estimate of the SD of the population:

In [15]:

```
np.std(sample_of_bakeries.get('inspection_score')) / np.sqrt(200)
```

Out[15]:

0.5909855116667413

Using the CLT, we have that the distribution of the sample mean:

- is roughly normal,
- is centered at the population mean (for which the sample mean is an estimate), and
- has a standard deviation of $\frac{\text{Population SD}}{\sqrt{\text{sample size}}}$ (which can be estimated using $\frac{\text{Sample SD}}{\sqrt{\text{sample size}}}$).

Using this information, we can build a confidence interval for where we think the population mean might be. A 95% confidence interval for the population mean is given by

$$ \left[ \text{sample mean} - 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}}, \ \text{sample mean} + 2\cdot \frac{\text{sample SD}}{\sqrt{\text{sample size}}} \right] $$

In [16]:

```
sample_mean = sample_of_bakeries.get('inspection_score').mean()
sample_std = np.std(sample_of_bakeries.get('inspection_score'))
```

In [17]:

```
[sample_mean - 2 * sample_std / np.sqrt(200), sample_mean + 2 * sample_std / np.sqrt(200)] # SOLUTION
```

Out[17]:

[83.48302897666652, 85.8469710233335]

Using a single sample of 200 bakeries, how can we estimate the **median** inspection score of all bakeries in San Francisco with an inspection score? What technique should we use?

A. Standard hypothesis testing

B. Permutation testing

C. Bootstrapping

D. The Central Limit Theorem

There is no CLT for sample medians, so instead we'll have to resort to bootstrapping to estimate the distribution of the sample median.

Recall, bootstrapping is the act of **sampling from the original sample, with replacement**. This is also called **resampling**.

In [18]:

```
# The median of our original sample – this is just one number
sample_of_bakeries.get('inspection_score').median() # SOLUTION
```

Out[18]:

87.0

In [19]:

```
# The median of a single bootstrap resample – this is just one number
sample_of_bakeries.sample(200, replace=True).get('inspection_score').median() # SOLUTION
```

Out[19]:

86.0

Let's resample repeatedly.

In [20]:

```
np.random.seed(23) # Ignore this
boot_medians = np.array([])
# BEGIN SOLUTION
for i in np.arange(5000):
    boot_median = sample_of_bakeries.sample(200, replace=True).get('inspection_score').median()
    boot_medians = np.append(boot_medians, boot_median)
# END SOLUTION
```

In [21]:

```
boot_medians
```

Out[21]:

array([87. , 85. , 86.5, ..., 87.5, 88. , 86. ])

In [22]:

```
bpd.DataFrame().assign(boot_medians=boot_medians).plot(kind='hist', density=True, ec='w', bins=10, figsize=(10, 5));
```

Note that this distribution is not at all normal.

To compute a 95% confidence interval, we take the middle 95% of the bootstrapped medians.

In [23]:

```
# BEGIN SOLUTION
left = np.percentile(boot_medians, 2.5)
right = np.percentile(boot_medians, 97.5)
[left, right]
# END SOLUTION
```

Out[23]:

[85.0, 88.0]

Which of the following interpretations of this confidence interval are valid?

- 95% of SF bakeries have an inspection score between 85 and 88.
- 95% of the resamples have a median inspection score between 85 and 88.
- There is a 95% chance that our sample has a median inspection score between 85 and 88.
- There is a 95% chance that the median inspection score of all SF bakeries is between 85 and 88.
- If we had taken 100 samples from the same population, about 95 of these samples would have a median inspection score between 85 and 88.
- If we had taken 100 samples from the same population, about 95 of the confidence intervals created would contain the median inspection score of all SF bakeries.
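
Since we happen to have the full population in this example, we can roughly check the last interpretation by simulation: repeatedly draw a fresh sample of 200 bakeries, build a bootstrap confidence interval from it, and count how often the interval contains the true population median. This is a sketch only (it is slow, and the counts will vary from run to run):

```
# Sketch: empirically check the coverage of the bootstrap confidence interval.
# This uses the population, which we normally wouldn't have access to.
pop_median = bakeries.get('inspection_score').median()
contained = 0
for i in np.arange(100):                      # 100 fresh samples of 200 bakeries
    new_sample = bakeries.sample(200)
    medians = np.array([])
    for j in np.arange(500):                  # 500 bootstrap resamples per sample
        resample_median = new_sample.sample(200, replace=True).get('inspection_score').median()
        medians = np.append(medians, resample_median)
    left_end = np.percentile(medians, 2.5)
    right_end = np.percentile(medians, 97.5)
    if left_end <= pop_median <= right_end:
        contained = contained + 1
contained                                     # should be roughly 95 out of 100
```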

You work as a family physician. You collect data and you find that in 6354 patients, 3115 were children and 3239 were adults.

You want to test the following hypotheses:

**Null Hypothesis**: Family physicians see an equal number of children and adults.

**Alternative Hypothesis**: Family physicians see more adults than they see children.

Which test statistic(s) could be used for this hypothesis test? Which values of the test statistic point towards the alternative?

A. Proportion of children seen

B. Number of children seen

C. Number of children minus number of adults seen

D. Absolute value of number of children minus number of adults seen

**There may be multiple correct answers; choose one.**

Let's use option B, the number of children seen, as a test statistic. Small values of this statistic favor the alternative hypothesis.

How do we generate a single value of the test statistic?

In [24]:

```
np.random.multinomial(6354, [0.5, 0.5])[0] # SOLUTION
```

Out[24]:

3172

As usual, let's simulate the test statistic many, many times (10,000).

In [25]:

```
test_stats = np.array([])
# BEGIN SOLUTION
for i in np.arange(10000):
    stat = np.random.multinomial(6354, [0.5, 0.5])[0]
    test_stats = np.append(test_stats, stat)
# END SOLUTION
```

In [26]:

```
test_stats
```

Out[26]:

array([3204., 3213., 3172., ..., 3150., 3198., 3213.])

In [27]:

```
bpd.DataFrame().assign(test_stats=test_stats) \
.plot(kind='hist', density=True, ec='w', figsize=(10, 5), bins=20);
plt.axvline(3115, lw=3, color='black', label='observed statistic')
plt.legend();
```

Recall that you collected data and found that in 6354 patients, 3115 were children and 3239 were adults.

What goes in blank (a)?

```
p_value = np.count_nonzero(test_stats __(a)__ 3115) / 10000
```

A. `>=`

B. `>`

C. `<=`

D. `<`

`<=`

In [28]:

```
# Calculate the p-value.
p_value = np.count_nonzero(test_stats <= 3115) / 10000
p_value
```

What do we do, assuming that we're using a 5% p-value cutoff?

A. Reject the null

B. Fail to reject the null

C. It depends

Note that while we used `np.random.multinomial` to simulate the test statistic, we could have used `np.random.choice`, too:

In [29]:

```
choices = np.random.choice(['adult', 'child'], p=[0.5, 0.5], size=6354, replace=True) # SOLUTION
choices
```

Out[29]:

array(['adult', 'adult', 'adult', ..., 'child', 'child', 'adult'], dtype='<U5')

In [30]:

```
np.count_nonzero(choices == 'child') # SOLUTION
```

Out[30]:

3142

Is this an example of bootstrapping?

A. Yes, because we are sampling with replacement.

B. No, this is not bootstrapping.

- You may be interested in working on data science projects of your own.
- In this video, we show you how to make blank notebooks and upload datasets of your own to DataHub.
- Depending on the classes you're in, you may not have access to DataHub. Eventually, you'll want to install Jupyter Notebooks on your computer.
- Anaconda is a great way to do that, as it also installs many commonly used packages.
- You may want to download your work from DataHub so you can refer to it after the course ends.
- Remember, all `babypandas` code is regular `pandas` code, too! (See the small sketch below.)
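
For instance (a small sketch, not from the lecture), the same chain of method calls runs under plain `pandas`, since every `babypandas` method is also a `pandas` method:

```
# Sketch: babypandas-style code, run with plain pandas (imported above as pd).
scores = pd.read_csv('data/restaurants_full.csv')
scores[scores.get('inspection_score') >= 0].get('inspection_score').mean()
```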

These sites allow you to search for datasets (in CSV format) from a variety of different domains. Some may require you to sign up for an account; these are generally reputable sources.

Note that all of these links are also available at rampure.org/find-datasets.

- Data is Plural
- FiveThirtyEight
- CORGIS
- Kaggle Datasets
- Google’s dataset search
- DataHub.io
- Data.world
- R datasets
- Wikipedia (use this site to extract and download tables as CSVs)
- Awesome Public Datasets GitHub repo
- Links to even more sources

- Sports: Basketball Reference, Baseball Reference, etc.
- US Government Sources: census.gov, data.gov, data.ca.gov, data.sfgov.org, FBI’s Crime Data Explorer, Centers for Disease Control and Prevention.
- Global Development: data.worldbank.org, databank.worldbank.org, WHO.
- Transportation: New York Taxi trips, Bureau of Transportation Statistics, SFO Air Traffic Statistics.
- Music: Spotify Charts.
- COVID: Johns Hopkins.
- Any Google Forms survey you’ve administered! (Go to the results spreadsheet, then go to “File > Download > Comma-separated values”.)

Tip: if a site only allows you to download a file as an Excel file, not a CSV file, you can download it, open it in a spreadsheet viewer (Excel, Numbers, Google Sheets), and export it to a CSV.
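
If you'd rather do the conversion in code, `pandas` can handle it too (a sketch; the file name below is a placeholder, and reading `.xlsx` files requires the `openpyxl` package to be installed):

```
# Sketch: convert a (hypothetical) Excel file to a CSV with pandas.
pd.read_excel('my_data.xlsx').to_csv('my_data.csv', index=False)
```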

The Data Science Student Society organizes project groups, which are a great way to get experience and build your resume. Keep your eye out for applications!

- Students have the opportunity to join a team to pursue a unique data science project that will last two quarters.
- At the end of the project, teams will have developed a polished, complete personal project which they will showcase to their peers, faculty, and companies in the data science industry.
- Great for if you don't yet have a lot of experience and are looking for that first data science project to get you started.

`plotly`

- All of the visualizations (scatter plots, histograms, etc.) in this course were created using a library called `matplotlib`.
  - This library was called under the hood every time we wrote `df.plot`.
- `plotly` is a different visualization library that allows us to create **interactive** visualizations.
- You may learn about it in a future course, but we'll briefly show you some cool visualizations you can make with it.

In [31]:

```
import plotly.express as px
```

Gapminder Foundation is a non-profit venture registered in Stockholm, Sweden, that promotes sustainable global development and achievement of the United Nations Millennium Development Goals by increased use and understanding of statistics and other information about social, economic and environmental development at local, national and global levels. - Gapminder Wikipedia

In [32]:

```
gapminder = px.data.gapminder()
gapminder
```

Out[32]:

| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.80 | 8425333 | 779.45 | AFG | 4 |
| 1 | Afghanistan | Asia | 1957 | 30.33 | 9240934 | 820.85 | AFG | 4 |
| 2 | Afghanistan | Asia | 1962 | 32.00 | 10267083 | 853.10 | AFG | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1701 | Zimbabwe | Africa | 1997 | 46.81 | 11404948 | 792.45 | ZWE | 716 |
| 1702 | Zimbabwe | Africa | 2002 | 39.99 | 11926563 | 672.04 | ZWE | 716 |
| 1703 | Zimbabwe | Africa | 2007 | 43.49 | 12311143 | 469.71 | ZWE | 716 |

1704 rows × 8 columns

The dataset contains information for each country for several different years.

In [33]:

```
gapminder.get('year').unique()
```

Out[33]:

array([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007], dtype=int64)

Let's start by just looking at 2007 data (the most recent year in the dataset).

In [34]:

```
gapminder_2007 = gapminder[gapminder.get('year') == 2007]
gapminder_2007
```

Out[34]:

| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 11 | Afghanistan | Asia | 2007 | 43.83 | 31889923 | 974.58 | AFG | 4 |
| 23 | Albania | Europe | 2007 | 76.42 | 3600523 | 5937.03 | ALB | 8 |
| 35 | Algeria | Africa | 2007 | 72.30 | 33333216 | 6223.37 | DZA | 12 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1679 | Yemen, Rep. | Asia | 2007 | 62.70 | 22211743 | 2280.77 | YEM | 887 |
| 1691 | Zambia | Africa | 2007 | 42.38 | 11746035 | 1271.21 | ZMB | 894 |
| 1703 | Zimbabwe | Africa | 2007 | 43.49 | 12311143 | 469.71 | ZWE | 716 |

142 rows × 8 columns

We can plot life expectancy vs. GDP per capita. If you hover over a point, you will see the name of the country.

In [35]:

```
px.scatter(gapminder_2007, x='gdpPercap', y='lifeExp', hover_name='country')
```