In [1]:

```
# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
set_matplotlib_formats("svg")
plt.style.use('ggplot')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)
# Animations
import time
from IPython.display import display, HTML, IFrame, clear_output
import ipywidgets as widgets
def normal_curve(x, mu=0, sigma=1):
    # Density of the normal distribution with mean mu and SD sigma.
    return 1 / (sigma * np.sqrt(2*np.pi)) * np.exp(-(x - mu)**2/(2 * sigma**2))

def normal_area(a, b, bars=False):
    # Shade the area under the standard normal curve between a and b.
    x = np.linspace(-4, 4, 1000)
    y = normal_curve(x)
    ix = (x >= a) & (x <= b)
    plt.figure(figsize=(10, 5))
    plt.plot(x, y, color='black')
    plt.fill_between(x[ix], y[ix], color='gold')
    if bars:
        plt.axvline(a, color='red')
        plt.axvline(b, color='red')
    plt.title(f'Area between {np.round(a, 2)} and {np.round(b, 2)}')
    plt.show()

def show_clt_slides():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vTcJd3U1H1KoXqBFcWGKFUPjZbeW4oiNZZLCFY8jqvSDsl4L1rRTg7980nPs1TGCAecYKUZxH5MZIBh/embed?start=false&loop=false&delayms=3000"
    width = 960
    height = 509
    display(IFrame(src, width, height))
```

- The normal distribution.
- The Central Limit Theorem.

SAT scores range from 0 to 1600. The distribution of SAT scores has a mean of 950 and a standard deviation of 300. Your friend tells you that their SAT score, in standard units, is 2.5. What do you conclude?
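One way to reason about this question is to convert the score from standard units back to the original scale. The sketch below assumes only the mean and SD stated above; the helper name `from_standard_units` is hypothetical and not part of any library.

```python
def from_standard_units(z, mean, sd):
    """Convert a value in standard units back to original units."""
    return mean + z * sd

sat_mean, sat_sd = 950, 300

# A score of 2.5 standard units is 2.5 SDs above the mean.
implied_score = from_standard_units(2.5, sat_mean, sat_sd)
print(implied_score)          # 1700.0
print(implied_score > 1600)   # True: above the maximum possible score
```

Since the implied score exceeds the maximum of 1600, a score of 2.5 in standard units is not possible for this distribution.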

- The standard normal distribution can be thought of as a "continuous histogram."

- Like a histogram:
    - The **area** between $a$ and $b$ is the **proportion** of values between $a$ and $b$.
    - The total area underneath the normal curve is 1.

- The standard normal distribution's **cumulative distribution function** (CDF) describes the proportion of values in the distribution less than or equal to $z$, for all values of $z$.
    - In Python, we use the function `scipy.stats.norm.cdf`.

What does `scipy.stats.norm.cdf(0)` evaluate to? Why?

In [2]:

```
normal_area(-np.inf, 0)
```

In [3]:

```
from scipy import stats
stats.norm.cdf(0)
```

Out[3]:

0.5
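The "continuous histogram" idea can also be checked numerically. The following is a minimal sketch that approximates areas under the standard normal curve with a sum of many thin rectangles; the grid from -8 to 8 and the rectangle width are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import stats

# A fine grid of x-values; almost all of the area lies between -8 and 8.
x = np.linspace(-8, 8, 160_001)
dx = x[1] - x[0]
density = stats.norm.pdf(x)

# Approximate areas as (height * width) summed over thin rectangles.
total_area = np.sum(density) * dx
left_half = np.sum(density[x <= 0]) * dx

print(round(total_area, 3))  # approximately 1
print(round(left_half, 3))   # approximately 0.5, by symmetry about 0
```

Because the curve is symmetric about 0, half of the total area lies to the left of 0, which agrees with `stats.norm.cdf(0)`.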

Suppose we want to find the area to the **right** of 2 under the standard normal curve.

In [4]:

```
normal_area(2, np.inf)
```

The following expression gives us the area to the **left** of 2.

In [5]:

```
stats.norm.cdf(2)
```

Out[5]:

0.9772498680518208

In [6]:

```
normal_area(-np.inf, 2)
```

However, since the total area under the standard normal curve is 1:

$$\text{area right of $2$} = 1 - (\text{area left of $2$})$$

In [7]:

```
1 - stats.norm.cdf(2)
```

Out[7]:

0.02275013194817921

How might we use `stats.norm.cdf` to compute the area between -1 and 0?

In [8]:

```
normal_area(-1, 0)
```

Strategy:

$$\text{area from $-1$ to $0$} = (\text{area left of $0$}) - (\text{area left of $-1$})$$

In [9]:

```
stats.norm.cdf(0) - stats.norm.cdf(-1)
```

Out[9]:

0.3413447460685429

The area under the standard normal curve in the interval $[a, b]$ is

```
stats.norm.cdf(b) - stats.norm.cdf(a)
```
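The rule above can be wrapped in a small helper. This is only a sketch; the name `area_between` is hypothetical and is not a function provided by `scipy`.

```python
from scipy import stats

def area_between(a, b):
    """Area under the standard normal curve in the interval [a, b]."""
    return stats.norm.cdf(b) - stats.norm.cdf(a)

# Area between -1 and 0.
print(area_between(-1, 0))

# Area to the right of 2, treating the right endpoint as infinity.
print(area_between(2, float('inf')))
```

The two calls should reproduce, up to floating-point rounding, the areas computed earlier for the interval from -1 to 0 and for the region to the right of 2.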

What can we do with this? We're about to see!

Let's return to our data set of heights and weights.

In [10]:

```
height_and_weight = bpd.read_csv('data/height_and_weight.csv')
height_and_weight
```

Out[10]:

| | Height | Weight |
|---|---|---|
| 0 | 73.85 | 241.89 |
| 1 | 68.78 | 162.31 |
| 2 | 74.11 | 212.74 |
| ... | ... | ... |
| 4997 | 67.01 | 199.20 |
| 4998 | 71.56 | 185.91 |
| 4999 | 70.35 | 198.90 |

5000 rows × 2 columns

As we saw before, both variables are roughly normal. What *benefit* is there to knowing that the two distributions are roughly normal?

**Key idea: The $x$-axis in a plot of the *standard* normal distribution is in *standard* units.**

- For instance, the area between -1 and 1 is the proportion of values within 1 standard deviation of the mean.

- Suppose a distribution is roughly normal. Then, these two are approximately equal:
    - The proportion of values in the distribution between $a$ and $b$.
    - The area between $z(a)$ and $z(b)$ under the standard normal curve. (Recall, $z(x_i) = \frac{x_i - \text{mean of $x$}}{\text{SD of $x$}}$.)

Let's suppose, as is often the case, that we don't have access to the entire distribution of weights, just the mean and SD.

In [11]:

```
weights = height_and_weight.get('Weight')
weight_mean = weights.mean()
weight_mean
```

Out[11]:

187.0206206581932

In [12]:

```
weight_std = np.std(weights)
weight_std
```

Out[12]:

19.779176302396458

Using just this information, we can estimate the proportion of weights between 200 and 225 pounds:

1. Convert 200 to standard units.
2. Convert 225 to standard units.
3. Use `stats.norm.cdf` to find the area between (1) and (2).

In [13]:

```
left = (200 - weight_mean) / weight_std
left
```

Out[13]:

0.656214351061435

In [14]:

```
right = (225 - weight_mean) / weight_std
right
```

Out[14]:

1.9201699181580782

In [15]:

```
normal_area(left, right)
```

In [16]:

```
approximation = stats.norm.cdf(right) - stats.norm.cdf(left)
approximation
```

Out[16]:

0.22842488819306406

Since we have access to the entire set of weights, we can compute the true proportion of weights between 200 and 225 pounds.

In [17]:

```
# True proportion of values between 200 and 225.
height_and_weight[
    (height_and_weight.get('Weight') >= 200) &
    (height_and_weight.get('Weight') <= 225)
].shape[0] / height_and_weight.shape[0]
```

Out[17]:

0.2294

In [18]:

```
# Approximation using the standard normal curve.
approximation
```

Out[18]:

0.22842488819306406

Pretty good for an approximation! 🤩
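The three steps used here (standardize both endpoints, then subtract CDF values) can be collected into one reusable function. This is a sketch under the assumption that the distribution is roughly normal; the name `approx_proportion_between` is hypothetical, and the mean and SD below are the values computed earlier for the weights.

```python
from scipy import stats

def approx_proportion_between(a, b, mean, sd):
    """Estimate the proportion of values in [a, b] for a roughly normal
    distribution, using only its mean and SD."""
    left = (a - mean) / sd    # a in standard units
    right = (b - mean) / sd   # b in standard units
    return stats.norm.cdf(right) - stats.norm.cdf(left)

weight_mean = 187.0206206581932
weight_sd = 19.779176302396458

estimate = approx_proportion_between(200, 225, weight_mean, weight_sd)
print(round(estimate, 4))  # 0.2284
```

The estimate is only as good as the normality assumption: for a distribution that is far from bell-shaped, the same calculation can be badly off.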

Consider the distribution of delays from earlier in the lecture.

In [19]:

```
delays = bpd.read_csv('data/delays.csv')
delays.plot(kind='hist', y='Delay', bins=np.arange(-20.5, 210, 5), density=True, ec='w', figsize=(10, 5), title='Flight Delays')
plt.xlabel('Delay (minutes)');
```

The distribution above does not look normal, and it won't look normal even if we standardize it. Standardizing only shifts a distribution horizontally and rescales its axes; the shape itself doesn't change.
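To see that standardizing leaves the shape alone, we can compare a measure of asymmetry before and after converting to standard units. The sketch below uses simulated right-skewed data as a stand-in (an assumption for illustration, not the flight delays data) and `scipy.stats.skew` as the measure of asymmetry.

```python
import numpy as np
from scipy import stats

# Simulated right-skewed values; NOT the actual flight delays.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=15, size=10_000)

# Convert to standard units: subtract the mean, divide by the SD.
standardized = (sample - sample.mean()) / np.std(sample)

print(standardized.mean(), np.std(standardized))  # roughly 0 and 1

# Skewness is unchanged, so the distribution is just as lopsided as before.
print(stats.skew(sample), stats.skew(standardized))
```

The mean and SD change to 0 and 1, but the skewness stays the same, so a skewed distribution remains skewed after standardization.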

In [20]:

```
HTML('data/delay_anim.html')
```

Out[20]:

- As we just discovered, the $x$-axis in the standard normal curve represents standard units.
- Oftentimes, we want to know the proportion of values within $z$ standard deviations of the mean.

Range | Percent of Values (Normal Distribution) |
---|---|
$\text{mean} \pm 1 \: \text{SD}$ | $\approx 68\%$ |
$\text{mean} \pm 2 \: \text{SDs}$ | $\approx 95\%$ |
$\text{mean} \pm 3 \: \text{SDs}$ | $\approx 99.73\%$ |
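All three rows of the table can be verified with `stats.norm.cdf`. This is a short sketch; the dictionary name `proportions` is just an illustrative choice.

```python
from scipy import stats

# Proportion of values within z SDs of the mean, for z = 1, 2, 3.
proportions = {z: stats.norm.cdf(z) - stats.norm.cdf(-z) for z in [1, 2, 3]}

for z, proportion in proportions.items():
    print(f"within {z} SD(s) of the mean: {proportion:.4f}")
# within 1 SD(s) of the mean: 0.6827
# within 2 SD(s) of the mean: 0.9545
# within 3 SD(s) of the mean: 0.9973
```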

In [21]:

```
normal_area(-1, 1, bars=True)
```

In [22]:

```
stats.norm.cdf(1) - stats.norm.cdf(-1)
```

Out[22]:

0.6826894921370859

This means that if a variable follows a normal distribution, approximately 68% of values will be within 1 SD of the mean.

In [23]:

```
normal_area(-2, 2, bars=True)
```