In [1]:

```
# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
set_matplotlib_formats("svg")
plt.style.use('ggplot')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)
# Animations
import time
from IPython.display import display, HTML, IFrame, clear_output
import ipywidgets as widgets
import warnings
warnings.filterwarnings('ignore')
def normal_curve(x, mu=0, sigma=1):
    # Normal density; the 1/sigma factor keeps the total area at 1 for any sigma.
    return 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))

def normal_area(a, b, bars=False):
    # Shade the area under the standard normal curve between a and b.
    x = np.linspace(-4, 4, 1000)
    y = normal_curve(x)
    ix = (x >= a) & (x <= b)
    plt.figure(figsize=(10, 5))
    plt.plot(x, y, color='black')
    plt.fill_between(x[ix], y[ix], color='gold')
    if bars:
        plt.axvline(a, color='red')
        plt.axvline(b, color='red')
    plt.title(f'Area between {np.round(a, 2)} and {np.round(b, 2)}')
    plt.show()

def show_clt_slides():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vTcJd3U1H1KoXqBFcWGKFUPjZbeW4oiNZZLCFY8jqvSDsl4L1rRTg7980nPs1TGCAecYKUZxH5MZIBh/embed?start=false&loop=false&delayms=3000"
    width = 960
    height = 509
    display(IFrame(src, width, height))
```

- Lab 6 is due **Tuesday 3/7 at 11:59PM**.
- Homework 6 is due **Thursday 3/7 at 11:59PM**.
- Check out the DSC Senior Capstone Showcase on Wednesday 3/15.
    - See how DSC majors are putting their skills to work on problems from a variety of domains.
    - Block 1 (11AM-12:30PM): Medicine and Bioinformatics 💊, Graphs and Deep Learning 📈, Finance and Blockchain 💰
    - Block 2 (1-2:30PM): NLP, Sentiment Analysis, and Social Media 🗣, Fairness and Causality 🤝, Other Applications ⚙️
    - RSVP here by 3/13.

The final project is due on **Tuesday 3/14 at 11:59PM** and has 8 sections. How much progress have you made?

A. Not started or barely started ⏳

B. Finished 1 or 2 sections

C. Finished 3 or 4 sections ❤️

D. Finished 5 or 6 sections

E. Finished 7 or 8 sections 🤯

- The normal distribution.
- The Central Limit Theorem.

SAT scores range from 0 to 1600. The distribution of SAT scores has a mean of 950 and a standard deviation of 300. Your friend tells you that their SAT score, in standard units, is 2.5. What do you conclude?
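One way to reason about this question is to convert the standard-units value back to an original score and compare it to the maximum possible score. The sketch below uses only the numbers stated above; the variable names are illustrative.

```python
sat_mean = 950
sat_sd = 300
max_score = 1600

# A value of z in standard units corresponds to mean + z * SD in original units.
z = 2.5
implied_score = sat_mean + z * sat_sd
print(implied_score)              # 1700.0

# Is the implied score even possible?
print(implied_score > max_score)  # True
```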

- The standard normal distribution can be thought of as a "continuous histogram."
- Like a histogram:
    - The **area** between $a$ and $b$ is the **proportion** of values between $a$ and $b$.
    - The total area underneath the normal curve is 1.
- The standard normal distribution's **cumulative distribution function** (CDF) describes the proportion of values in the distribution less than or equal to $z$, for all values of $z$.
    - In Python, we use the function `scipy.stats.norm.cdf`.
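As a small illustration of what the CDF returns, the sketch below evaluates `stats.norm.cdf` at a few points. The variable names are only for illustration.

```python
from scipy import stats

# Proportion of values less than or equal to 0: by symmetry, exactly half.
at_zero = stats.norm.cdf(0)

# Proportion of values less than or equal to 1 standard unit.
at_one = stats.norm.cdf(1)

# Because the total area is 1, the proportion greater than 1 is the complement.
above_one = 1 - at_one

print(at_zero, at_one, above_one)
```

Subtracting two CDF values gives the area between two points, which is how the CDF is used in the rest of the lecture.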

Last time, we looked at a data set of heights and weights of 5000 adult males.

In [2]:

```
height_and_weight = bpd.read_csv('data/height_and_weight.csv')
height_and_weight
```

Out[2]:

| | Height | Weight |
|---|---|---|
| 0 | 73.85 | 241.89 |
| 1 | 68.78 | 162.31 |
| 2 | 74.11 | 212.74 |
| ... | ... | ... |
| 4997 | 67.01 | 199.20 |
| 4998 | 71.56 | 185.91 |
| 4999 | 70.35 | 198.90 |

5000 rows × 2 columns

What *benefit* is there to knowing that the two distributions are roughly normal?

- **Key idea: The $x$-axis in a plot of the *standard* normal distribution is in *standard* units.**
    - For instance, the area between -1 and 1 is the proportion of values within 1 standard deviation of the mean.
- Suppose a distribution is roughly normal. Then, these two quantities are approximately equal:
    - The proportion of values in the distribution between $a$ and $b$.
    - The area between $z(a)$ and $z(b)$ under the standard normal curve. (Recall, $z(x_i) = \frac{x_i - \text{mean of $x$}}{\text{SD of $x$}}$.)

Let's suppose, as is often the case, that we don't have access to the entire distribution of heights, just the mean and SD.

In [3]:

```
heights = height_and_weight.get('Height')
height_mean = heights.mean()
height_mean
```

Out[3]:

69.02634590621737

In [4]:

```
height_std = np.std(heights)
height_std
```

Out[4]:

2.863075878119538

Using just this information, we can estimate the proportion of heights between 65 and 70 inches:

1. Convert 65 to standard units.
2. Convert 70 to standard units.
3. Use `stats.norm.cdf` to find the area between (1) and (2).

In [5]:

```
left = (65 - height_mean) / height_std
left
```

Out[5]:

-1.4063008029189459

In [6]:

```
right = (70 - height_mean) / height_std
right
```

Out[6]:

0.3400727522534686

In [7]:

```
normal_area(left, right)
```

In [8]:

```
from scipy import stats
approximation = stats.norm.cdf(right) - stats.norm.cdf(left)
approximation
```

Out[8]:

0.5532817187111865

Since we have access to the entire set of heights, we can compute the true proportion of heights between 65 and 70 inches.

In [9]:

```
# True proportion of values between 65 and 70.
height_and_weight[
    (height_and_weight.get('Height') >= 65) &
    (height_and_weight.get('Height') <= 70)
].shape[0] / height_and_weight.shape[0]
```

Out[9]:

0.554

In [10]:

```
# Approximation using the standard normal curve.
approximation
```

Out[10]:

0.5532817187111865

Pretty good for an approximation! 🤩
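The three steps used for the approximation can be packaged into one function. This is a minimal sketch: `normal_proportion` is a hypothetical helper name, not something from the lecture or from `scipy`, and the mean and SD are copied from the outputs above.

```python
from scipy import stats

def normal_proportion(a, b, mean, sd):
    """Approximate the proportion of values between a and b,
    assuming the distribution is roughly normal."""
    z_a = (a - mean) / sd   # a in standard units
    z_b = (b - mean) / sd   # b in standard units
    return stats.norm.cdf(z_b) - stats.norm.cdf(z_a)

# Mean and SD of heights, as computed earlier in the lecture.
height_mean = 69.02634590621737
height_std = 2.863075878119538

estimate = normal_proportion(65, 70, height_mean, height_std)
print(estimate)
```

The function needs only the mean and SD, which is the point of the approximation: it works even without the full data set.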

Consider the distribution of delays from earlier in the lecture.

In [11]:

```
delays = bpd.read_csv('data/delays.csv')
delays.plot(kind='hist', y='Delay', bins=np.arange(-20.5, 210, 5), density=True, ec='w', figsize=(10, 5), title='Flight Delays')
plt.xlabel('Delay (minutes)');
```

In [12]:

```
HTML('data/delay_anim.html')
```


- As we just discovered, the $x$-axis in the standard normal curve represents standard units.
- Often, we want to know the proportion of values within $z$ standard deviations of the mean.

| Range | Percent in Range (Normal Distribution) |
|---|---|
| $\text{mean} \pm 1 \: \text{SD}$ | $\approx 68\%$ |
| $\text{mean} \pm 2 \: \text{SDs}$ | $\approx 95\%$ |
| $\text{mean} \pm 3 \: \text{SDs}$ | $\approx 99.73\%$ |

In [13]:

```
normal_area(-1, 1, bars=True)
```

In [14]:

```
stats.norm.cdf(1) - stats.norm.cdf(-1)
```

Out[14]:

0.6826894921370859

In [15]:

```
normal_area(-2, 2, bars=True)
```

In [16]:

```
stats.norm.cdf(2) - stats.norm.cdf(-2)
```

Out[16]:

0.9544997361036416
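The same computation for 3 SDs, which is listed in the table but not computed in the cells above, looks like this:

```python
from scipy import stats

# Proportion of values within 3 standard deviations of the mean.
within_3 = stats.norm.cdf(3) - stats.norm.cdf(-3)
print(round(within_3, 4))  # 0.9973
```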

- If a variable follows a normal distribution, approximately 95% of values will be within 2 SDs of the mean.
- Consequently, about 5% of values will be outside this range.
- Since the normal curve is symmetric,
    - about 2.5% of values will be more than 2 SDs above the mean, and
    - about 2.5% of values will be more than 2 SDs below the mean.
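The two tail proportions can be checked directly with the CDF. This sketch shows that the tails are equal by symmetry, and that 2.5% is a rounded figure, just as 95% is.

```python
from scipy import stats

# Proportion of values more than 2 SDs below the mean.
below = stats.norm.cdf(-2)

# Proportion of values more than 2 SDs above the mean (complement of the CDF at 2).
above = 1 - stats.norm.cdf(2)

print(below, above)  # both are a little under 0.025
```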

- Last class, we looked at Chebyshev's inequality, which stated that the proportion of data within $z$ SDs of the mean is **at least** $1-\frac{1}{z^2}$.
    - This works for any distribution, and is a lower bound.
- If we know that the distribution is normal, we can be even more specific:

| Range | All Distributions (via Chebyshev's inequality) | Normal Distribution |
|---|---|---|
| mean $\pm \ 1$ SD | $\geq 0\%$ | $\approx 68\%$ |
| mean $\pm \ 2$ SDs | $\geq 75\%$ | $\approx 95\%$ |
| mean $\pm \ 3$ SDs | $\geq 88.8\%$ | $\approx 99.73\%$ |

- The percentages you see for normal distributions above are approximate, but are not lower bounds.
- **Important**: They apply to all normal distributions, standardized or not. This is because all normal distributions are just stretched and shifted versions of the standard normal distribution.
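Both columns of the table can be reproduced in a few lines. The sketch below also checks the "stretched and shifted" claim on one non-standard normal distribution; the mean of 69 and SD of 3 are arbitrary illustrative values, not estimates from the data.

```python
from scipy import stats

chebyshev_bounds = []
normal_props = []
for z in [1, 2, 3]:
    # Chebyshev: at least 1 - 1/z^2 of values, for any distribution.
    chebyshev_bounds.append(1 - 1 / z**2)
    # Normal: area under the standard normal curve between -z and z.
    normal_props.append(stats.norm.cdf(z) - stats.norm.cdf(-z))
    print(z, round(chebyshev_bounds[-1], 4), round(normal_props[-1], 4))

# A non-standard normal distribution (illustrative mean and SD).
mu, sigma = 69, 3
within_1_sd = (stats.norm.cdf(mu + sigma, loc=mu, scale=sigma)
               - stats.norm.cdf(mu - sigma, loc=mu, scale=sigma))
print(within_1_sd)  # matches the standard normal value for z = 1
```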

- Last class, we mentioned that the standard normal curve has inflection points at $z = \pm 1$.
- An inflection point is where a curve goes from "opening down" 🙁 to "opening up" 🙂.
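The location of the inflection points can be checked numerically. The sketch below is one possible approach, not the lecture's method: it evaluates the standard normal density on a fine grid and looks for where the second difference (a stand-in for the second derivative) changes sign.

```python
import numpy as np
from scipy import stats

# Fine grid over the standard normal curve.
z = np.linspace(-3, 3, 6000)
density = stats.norm.pdf(z)

# Second differences have the same sign as the second derivative:
# positive where the curve "opens up", negative where it "opens down".
second_diff = np.diff(density, n=2)      # aligned with z[1:-1]

# Grid points where the sign of the second difference flips.
flips = np.where(np.diff(np.sign(second_diff)) != 0)[0]
inflection_points = z[1:-1][flips]
print(inflection_points)                 # two values, close to -1 and 1
```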

Remember: The distribution of heights is roughly normal, but it is *not* a *standard* normal distribution.

In [17]:

```
height_and_weight.plot(kind='hist', y='Height', density=True, ec='w', bins=40, alpha=0.8, figsize=(10, 5));
plt.xticks(np.arange(60, 78, 2));
```

- The center appears to be around 69.
- The inflection points appear to be around 66 and 72.
- So, the standard deviation is roughly 72 - 69 = 3.

In [18]:

```
np.std(height_and_weight.get('Height'))
```

Out[18]:

2.863075878119538

The distribution of flight delays that we've been looking at is *not* roughly normal.

In [19]:

```
delays = bpd.read_csv('data/delays.csv')
delays.plot(kind='hist', y='Delay', bins=np.arange(-20.5, 210, 5), density=True, ec='w', figsize=(10, 5), title='Population Distribution of Flight Delays')
plt.xlabel('Delay (minutes)');
```

In [20]:

```
delays.get('Delay').describe()
```

Out[20]:

| | Delay |
|---|---|
| count | 13825.00 |
| mean | 16.66 |
| std | 39.48 |
| ... | ... |
| 50% | 2.00 |
| 75% | 18.00 |
| max | 580.00 |

Name: Delay, Length: 8, dtype: float64

- Before we started discussing center, spread, and the normal distribution, our focus was on bootstrapping.
- We used the bootstrap to estimate **the distribution of a sample statistic (e.g. sample mean or sample median)**, using just a single sample.
- We did this to construct confidence intervals for a population parameter.
- **Important**: For now, we'll suppose our parameter of interest is the population mean, **so we're interested in estimating the distribution of the sample mean**.

Since we have access to the population of flight delays, let's remind ourselves what the distribution of the sample mean looks like by drawing samples repeatedly from the population.

- This is **not bootstrapping**.
- This is also **not practical**. If we had access to a population, we wouldn't need to understand the distribution of the sample mean; we'd be able to compute the population mean directly.
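For contrast, here is a minimal sketch of what bootstrapping looks like: resampling, with replacement, from a single sample rather than from the population. It uses plain `numpy` and a made-up skewed population so that it runs on its own. The synthetic data, the seed, and the variable names are assumptions for illustration, not the lecture's flight delays.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical skewed population (NOT the lecture's delays data).
population = rng.exponential(scale=15, size=10_000)

# In practice we only get one sample from the population.
one_sample = rng.choice(population, size=500, replace=False)

# Bootstrapping: resample from the SAMPLE, with replacement,
# using the same size as the original sample.
boot_means = np.array([
    rng.choice(one_sample, size=one_sample.size, replace=True).mean()
    for _ in range(1000)
])

# 95% bootstrap confidence interval for the population mean.
left, right = np.percentile(boot_means, [2.5, 97.5])
print(left, right)
```

The cell below does something different: it draws many fresh samples from the population itself.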

In [21]:

```
sample_means = np.array([])
repetitions = 2000
for i in np.arange(repetitions):
    sample = delays.sample(500)
    sample_mean = sample.get('Delay').mean()
    sample_means = np.append(sample_means, sample_mean)
sample_means
```

Out[21]:

array([15.65, 17.02, 16.58, ..., 18.76, 16.87, 13.23])

In [22]:

```
bpd.DataFrame().assign(sample_means=sample_means).plot(kind='hist', density=True, ec='w', alpha=0.65, bins=20, figsize=(10, 5));
plt.scatter([sample_means.mean()], [-0.005], marker='^', color='green', s=250)
plt.axvline(sample_means.mean(), color='green', label=f'mean={np.round(sample_means.mean(), 2)}', linewidth=4)
plt.xlim(5, 30)
plt.ylim(-0.013, 0.26)
plt.legend();
```