In [1]:
# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
set_matplotlib_formats("svg")
plt.style.use('ggplot')

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

# Animations
import ipywidgets as widgets
from IPython.display import display, HTML

Lecture 20 – Spread, The Normal Distribution¶

DSC 10, Winter 2023¶

Announcements¶

  • Discussion section is this afternoon.
  • Homework 5 is due tomorrow at 11:59PM.
  • Lab 6 is due Tuesday 3/7 at 11:59PM.
  • The Final Project is due on Tuesday 3/14 at 11:59PM.

Agenda¶

  • Recap: Mean and median.
  • Standard deviation.
  • Standardization.
  • The normal distribution.

Recap: Mean and median¶

Example: Flight delays ✈️¶

In [2]:
delays = bpd.read_csv('data/delays.csv')
delays.plot(kind='hist', y='Delay', bins=np.arange(-20.5, 210, 5), density=True, ec='w', figsize=(10, 5), title='Flight Delays')
plt.xlabel('Delay (minutes)');
2023-02-28T20:40:45.446868 image/svg+xml Matplotlib v3.5.2, https://matplotlib.org/

Question: Which is larger – the mean or the median?

In [3]:
delays.get('Delay').mean()
Out[3]:
16.658155515370705
In [4]:
delays.get('Delay').median()
Out[4]:
2.0
In [5]:
delays.plot(kind='hist', y='Delay', bins=np.arange(-20.5, 210, 5), density=True, ec='w', alpha=0.65, figsize=(10, 5), title='Flight Delays')
plt.plot([delays.get('Delay').mean(), delays.get('Delay').mean()], [0, 1], color='green', label='Mean', linewidth=2)
plt.scatter([delays.get('Delay').mean()], [-0.0017], color='green', marker='^', s=250)
plt.plot([delays.get('Delay').median(), delays.get('Delay').median()], [0, 1], color='purple', label='Median', linewidth=2)
plt.xlabel('Delay (minutes)')
plt.ylim(-0.005, 0.065)
plt.legend();
2023-02-28T20:40:45.681088 image/svg+xml Matplotlib v3.5.2, https://matplotlib.org/

Comparing the mean and median¶

  • Mean: Balance point of the histogram.
    • Numerically: the sum of the differences between all data points and the mean is 0.
    • Physically: Think of a see-saw.
  • Median: Half-way point of the data.
    • Half of the area of a histogram is to the left of the median, and half is to the right.
  • If the distribution is symmetric about a value, then that value is both the mean and the median.
  • If the distribution is skewed, then the mean is pulled away from the median in the direction of the tail.
  • Key property: The median is more robust (less sensitive) to outliers.

Standard deviation¶

Question: How "wide" is a distribution?¶

  • One idea:
    • The range quantifies how far the extreme values are from one another (max - min).
    • Issue: This doesn’t tell us much about the shape of the distribution.
  • Another idea:
    • The mean is at the center.
    • The standard deviation quantifies how far the data points typically are from the center.

Deviations from the mean¶

In [6]:
data = np.array([2, 3, 3, 9])
np.mean(data)
Out[6]:
4.25
In [7]:
deviations = data - np.mean(data)
deviations
Out[7]:
array([-2.25, -1.25, -1.25,  4.75])

Each entry in deviations describes how far the corresponding element in data is from 4.25.

What is the average deviation?

In [8]:
np.mean(deviations)
Out[8]:
0.0
  • This is true of any dataset – the average deviation from the mean is always 0.
  • This implies that the average deviation itself is not useful in measuring the spread of data.

Average squared deviation¶

In [9]:
# Square all the deviations:
deviations ** 2
Out[9]:
array([ 5.06,  1.56,  1.56, 22.56])
In [10]:
variance = np.mean(deviations ** 2)
variance
Out[10]:
7.6875

This quantity, the average squared deviation from the mean, is called the variance.

Standard deviation¶

  • Our data usually has units, e.g. dollars.
  • The variance is in "squared" units, e.g. $\text{dollars}^2$.
  • To account for this, we can take the square root of the variance, and the result is called the standard deviation.
In [11]:
# Standard deviation (SD) is the square root of the variance.
sd = variance ** 0.5
sd
Out[11]:
2.7726341266023544

Standard deviation¶

  • The standard deviation (SD) measures something about how far the data values are from their average.
    • It is not directly interpretable because of the squaring and square rooting.
    • But generally, larger SD = more spread out.
  • The standard deviation has the same units as the original data.
  • numpy has a function, np.std, that calculates the standard deviation for us.
In [12]:
# Note that this evaluates to the same number we found on the previous slide.
np.std(data)
Out[12]:
2.7726341266023544

Variance and standard deviation¶

To summarize:

$$\begin{align*}\text{variance} &= \text{average squared deviation from the mean}\\ &= \frac{(\text{value}_1 - \text{mean})^2 + ... + (\text{value}_n - \text{mean})^2}{n}\\ \text{standard deviation} &= \sqrt{\text{variance}} \end{align*}$$

where $n$ is the number of observations.

What can we do with the standard deviation?¶

It turns out, in any numerical distribution, the bulk of the data are in the range “mean ± a few SDs”.

Let's make this more precise.

Chebyshev’s inequality¶

Fact: In any numerical distribution, the proportion of values in the range “mean ± $z$ SDs” is at least

$$1 - \frac{1}{z^2}$$
Range Proportion
mean ± 2 SDs at least $1 - \frac{1}{4}$ (75%)
mean ± 3 SDs at least $1 - \frac{1}{9}$ (88.88..%)
mean ± 4 SDs at least $1 - \frac{1}{16}$ (93.75%)
mean ± 5 SDs at least $1 - \frac{1}{25}$ (96%)

Flight delays, revisited¶

In [13]:
delays.plot(kind='hist', y='Delay', bins=np.arange(-20.5, 210, 5), density=True, ec='w', figsize=(10, 5), title='Flight Delays')
plt.xlabel('Delay (minutes)');
2023-02-28T20:40:45.933540 image/svg+xml Matplotlib v3.5.2, https://matplotlib.org/
In [14]:
delay_mean = delays.get('Delay').mean()
delay_mean
Out[14]:
16.658155515370705
In [15]:
delay_std = np.std(delays.get('Delay')) # There is no .std() method in babypandas!
delay_std
Out[15]:
39.480199851609314

Mean and standard deviation¶

Chebyshev's inequality tells us that

  • At least 75% of delays are in the following interval:
In [16]:
delay_mean - 2 * delay_std, delay_mean + 2 * delay_std
Out[16]:
(-62.30224418784792, 95.61855521858934)
  • At least 88.88% of delays are in the following interval:
In [17]:
delay_mean - 3 * delay_std, delay_mean + 3 * delay_std
Out[17]:
(-101.78244403945723, 135.09875507019865)

Let's visualize these intervals!

In [18]:
delays.plot(kind='hist', y='Delay', bins=np.arange(-20.5, 210, 5), density=True, alpha=0.65, ec='w', figsize=(10, 5), title='Flight Delays')
plt.axvline(delay_mean - 2 * delay_std, color='maroon', label='± 2 SD')
plt.axvline(delay_mean + 2 * delay_std, color='maroon')

plt.axvline(delay_mean + 3 * delay_std, color='blue',  label='± 3 SD')
plt.axvline(delay_mean - 3 * delay_std, color='blue')

plt.axvline(delay_mean, color='green', label='Mean')
plt.scatter([delay_mean], [-0.0017], color='green', marker='^', s=250)
plt.ylim(-0.0038, 0.06)
plt.legend();
2023-02-28T20:40:46.184457 image/svg+xml Matplotlib v3.5.2, https://matplotlib.org/

Chebyshev's inequality provides lower bounds!¶

Remember, Chebyshev's inequality states that at least $1 - \frac{1}{z^2}$ of values are within $z$ SDs from the mean, for any numerical distribution.

For instance, it tells us that at least 75% of delays are in the following interval:

In [19]:
delay_mean - 2 * delay_std, delay_mean + 2 * delay_std
Out[19]:
(-62.30224418784792, 95.61855521858934)

However, in this case, a much larger fraction of delays are in that interval.

In [20]:
within_2_sds = delays[(delays.get('Delay') >= delay_mean - 2 * delay_std) & 
                      (delays.get('Delay') <= delay_mean + 2 * delay_std)]

within_2_sds.shape[0] / delays.shape[0]
Out[20]:
0.9560940325497288

If we know more about the shape of the distribution, we can provide better guarantees for the proportion of values within $z$ SDs of the mean.

Activity¶

For a particular set of data points, Chebyshev's inequality states that at least $\frac{8}{9}$ of the data points are between $-20$ and $40$. What is the standard deviation of the data?

Click here to see the answer after you've tried it yourself. - Chebyshev's inequality states that at least $1 - \frac{1}{z^2}$ of values are within $z$ standard deviations of the mean. - When $z = 3$, $1 - \frac{1}{z^2} = \frac{8}{9}$. - So, $-20$ is $3$ standard deviations below the mean, and $40$ is $3$ standard deviations above the mean. - $10$ is in the middle of $-20$ and $40$, so the mean is $10$. - $3$ standard deviations are between $10$ and $40$, so $1$ standard deviation is $\frac{30}{3} = 10$.

Standardization¶

Heights and weights 📏¶

We'll work with a data set containing the heights and weights of 5000 adult males.

In [21]:
height_and_weight = bpd.read_csv('data/height_and_weight.csv')
height_and_weight
Out[21]:
Height Weight
0 73.85 241.89
1 68.78 162.31
2 74.11 212.74
... ... ...
4997 67.01 199.20
4998 71.56 185.91
4999 70.35 198.90

5000 rows × 2 columns

Distributions of height and weight¶

Let's look at the distributions of both numerical variables.

In [22]:
height_and_weight.plot(kind='hist', y='Height', density=True, ec='w', bins=30, alpha=0.8, figsize=(10, 5));
2023-02-28T20:40:46.440671 image/svg+xml Matplotlib v3.5.2, https://matplotlib.org/
In [23]:
height_and_weight.plot(kind='hist', y='Weight', density=True, ec='w', bins=30, alpha=0.8, color='C1', figsize=(10, 5));
2023-02-28T20:40:46.674191 image/svg+xml Matplotlib v3.5.2, https://matplotlib.org/
In [24]:
height_and_weight.plot(kind='hist', density=True, ec='w', bins=60, alpha=0.8, figsize=(10, 5));
2023-02-28T20:40:47.000879 image/svg+xml Matplotlib v3.5.2, https://matplotlib.org/

Observation: The two distributions look like shifted and stretched versions of the same basic shape, called a bell curve 🔔.

Standard units¶

Suppose $x$ is a numerical variable, and $x_i$ is one value of that variable. The function $$z(x_i) = \frac{x_i - \text{mean of $x$}}{\text{SD of $x$}}$$

converts $x_i$ to standard units, which represents the number of standard deviations $x_i$ is above the mean.

Example: Suppose someone weighs 225 pounds. What is their weight in standard units?

In [25]:
weights = height_and_weight.get('Weight')
(225 - weights.mean()) / np.std(weights)
Out[25]:
1.920169918158094
  • Interpretation: 225 is 1.92 standard deviations above the mean weight.
  • 225 becomes 1.92 in standard units.

Standardization¶

The process of converting all values of a variable (i.e. a column) to standard units is known as standardization, and the resulting values are considered to be standardized.

In [26]:
def standard_units(col):
    return (col - col.mean()) / np.std(col)
In [27]:
standardized_height = standard_units(height_and_weight.get('Height'))
standardized_height
Out[27]:
0       1.68
1      -0.09
2       1.78
        ... 
4997   -0.70
4998    0.88
4999    0.46
Name: Height, Length: 5000, dtype: float64
In [28]:
standardized_weight = standard_units(height_and_weight.get('Weight'))
standardized_weight
Out[28]:
0       2.77
1      -1.25
2       1.30
        ... 
4997    0.62
4998   -0.06
4999    0.60
Name: Weight, Length: 5000, dtype: float64

The effect of standardization¶

Standardized variables have:

  • A mean of 0.
  • An SD of 1.

We often standardize variables to bring them to the same scale.

Aside: To quickly see summary statistics for a numerical Series, use the .describe() Series method.

In [29]:
# e-14 means 10^(-14), which is a very small number, effectively zero.
standardized_height.describe()
Out[29]:
count    5.00e+03
mean     1.64e-14
std      1.00e+00
           ...   
50%      4.76e-04
75%      6.85e-01
max      3.48e+00
Name: Height, Length: 8, dtype: float64
In [30]:
standardized_weight.describe()
Out[30]:
count    5.00e+03
mean     1.64e-14
std      1.00e+00
           ...   
50%      6.53e-04
75%      6.74e-01
max      4.19e+00
Name: Weight, Length: 8, dtype: float64

Let's look at how the process of standardization works visually.

In [31]:
HTML('data/height_anim.html')
Out[31]:
Your browser does not support the video tag.
In [32]:
HTML('data/weight_anim.html')
Out[32]: