Lecture 21 – Spread, The Normal Distribution

DSC 10, Fall 2022

Announcements

Agenda

Recap: Mean and median

Example: Flight delays ✈️

Question: Which is larger – the mean or the median?

Comparing the mean and median

Standard deviation

Question: How "wide" is a distribution?

Deviations from the mean

Each entry in deviations describes how far the corresponding element in data is from the mean, 4.25.

What is the average deviation?
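One way to explore this question is to compute the deviations directly. The sketch below assumes a small made-up array whose mean happens to be 4.25; the names `data` and `deviations` follow the text, but the values themselves are an assumption, not necessarily the lecture's data.

```python
import numpy as np

# Hypothetical data, chosen so that the mean is 4.25.
data = np.array([2, 3, 3, 9])
mean = data.mean()

# How far each element is from the mean (negative means below the mean).
deviations = data - mean

# Positive and negative deviations cancel out, so their average is 0.
average_deviation = deviations.mean()

print(deviations)
print(average_deviation)
```

Because the deviations always cancel like this, their plain average says nothing about spread, which is why the deviations are squared before averaging.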

Average squared deviation

This quantity, the average squared deviation from the mean, is called the variance.

Standard deviation

Variance and standard deviation

To summarize:

$$\begin{align*}\text{variance} &= \text{average squared deviation from the mean}\\ &= \frac{(\text{value}_1 - \text{mean})^2 + ... + (\text{value}_n - \text{mean})^2}{n}\\ \text{standard deviation} &= \sqrt{\text{variance}} \end{align*}$$

where $n$ is the number of observations.
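As a rough illustration, the two formulas above can be computed step by step and compared against NumPy's built-in functions. The array is a made-up example, not data from the lecture.

```python
import numpy as np

# Hypothetical data for illustration only.
data = np.array([2, 3, 3, 9])
n = len(data)

mean = data.sum() / n

# Variance: the average squared deviation from the mean.
variance = ((data - mean) ** 2).sum() / n

# Standard deviation: the square root of the variance.
sd = variance ** 0.5

print(variance, sd)

# np.var and np.std divide by n by default (ddof=0), matching the formulas above.
print(np.var(data), np.std(data))
```

Taking the square root brings the result back to the same units as the original data, which is what makes the SD easier to interpret than the variance.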

What can we do with the standard deviation?

It turns out, in any numerical distribution, the bulk of the data are in the range “mean ± a few SDs”.

Let's make this more precise.

Chebyshev’s inequality

Fact: In any numerical distribution, the proportion of values in the range “mean ± $z$ SDs” is at least

$$1 - \frac{1}{z^2}$$

| Range | Proportion |
| --- | --- |
| mean ± 2 SDs | at least $1 - \frac{1}{4}$ (75%) |
| mean ± 3 SDs | at least $1 - \frac{1}{9}$ (88.88...%) |
| mean ± 4 SDs | at least $1 - \frac{1}{16}$ (93.75%) |
| mean ± 5 SDs | at least $1 - \frac{1}{25}$ (96%) |
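The proportions listed above follow directly from the formula. A minimal sketch, where `chebyshev_lower_bound` is a helper name introduced here for illustration:

```python
def chebyshev_lower_bound(z):
    """Minimum proportion of values within z SDs of the mean (requires z > 0)."""
    return 1 - 1 / z ** 2


for z in [2, 3, 4, 5]:
    print(f"mean ± {z} SDs: at least {chebyshev_lower_bound(z):.2%}")
```

Note that for $z \leq 1$ the bound is zero or negative, so the inequality only says something useful when $z > 1$.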

Flight delays, revisited

Mean and standard deviation

Chebyshev's inequality tells us that

Let's visualize these intervals!

Chebyshev's inequality provides lower bounds!

Remember, Chebyshev's inequality states that at least $1 - \frac{1}{z^2}$ of values are within $z$ SDs of the mean, for any numerical distribution.

For instance, it tells us that at least 75% of delays are in the following interval:

However, in this case, a much larger fraction of delays are in that interval.

If we know more about the shape of the distribution, we can provide better guarantees for the proportion of values within $z$ SDs of the mean.
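To see how loose the lower bound can be, the sketch below compares Chebyshev's guarantee with the actual proportion of values within 2 SDs of the mean. It uses simulated right-skewed values as a stand-in; this is an assumption for illustration and is not the lecture's flight delay data.

```python
import numpy as np

# Simulated right-skewed "delays" in minutes; NOT the lecture's flight data.
rng = np.random.default_rng(42)
delays = rng.exponential(scale=15, size=10_000)

mean = delays.mean()
sd = delays.std()  # np.std divides by n by default

z = 2
# Boolean array: True where a value lies within z SDs of the mean.
within = (delays >= mean - z * sd) & (delays <= mean + z * sd)

# The mean of a boolean array is the proportion of True values.
actual_proportion = within.mean()
chebyshev_bound = 1 - 1 / z ** 2

print(f"Chebyshev guarantees at least {chebyshev_bound:.0%}")
print(f"Actual proportion within {z} SDs: {actual_proportion:.1%}")
```

The actual proportion can never fall below the bound, but it is often well above it.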

Activity

For a particular set of data points, Chebyshev's inequality states that at least $\frac{8}{9}$ of the data points are between -20 and 40. What is the standard deviation of the data?

Answer (after you've tried it yourself):

- Chebyshev's inequality states that at least $1 - \frac{1}{z^2}$ of values are within $z$ standard deviations of the mean.
- When $z = 3$, $1 - \frac{1}{z^2} = \frac{8}{9}$.
- So, -20 is 3 standard deviations below the mean, and 40 is 3 standard deviations above the mean.
- 10 is in the middle of -20 and 40, so the mean is 10.
- 3 standard deviations are between 10 and 40, so 1 standard deviation is $\frac{30}{3} = 10$.
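The same reasoning can be checked numerically. This is only a sketch of the arithmetic in the answer; the variable names are chosen here for illustration.

```python
import math

lower, upper = -20, 40
proportion = 8 / 9

# Solve 1 - 1/z^2 = proportion for z.
z = math.sqrt(1 / (1 - proportion))

# The interval is centered at the mean and extends z SDs in each direction.
mean = (lower + upper) / 2
sd = (upper - mean) / z

print(round(z, 6), mean, round(sd, 6))
```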

Standardization

Heights and weights 📏

We'll work with a data set containing the heights and weights of 5000 adult males.

Distributions of height and weight

Let's look at the distributions of both numerical variables.

Observation: The two distributions look like shifted and stretched versions of the same basic shape, called a bell curve 🔔.

Standard units

Suppose $x$ is a numerical variable, and $x_i$ is one value of that variable. The function $$z(x_i) = \frac{x_i - \text{mean of $x$}}{\text{SD of $x$}}$$

converts $x_i$ to standard units, which measure the number of standard deviations $x_i$ is above the mean. A negative value means $x_i$ is below the mean.

Example: Suppose someone weighs 225 pounds. What is their weight in standard units?
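A minimal sketch of the conversion is shown below. The mean and SD are placeholder values assumed for illustration only; the real answer depends on the mean and SD of the weights in the data set.

```python
def standard_units(value, mean, sd):
    """Number of SDs that value lies above the mean (negative if below)."""
    return (value - mean) / sd


# Hypothetical summary statistics, NOT computed from the lecture's data set.
hypothetical_mean_weight = 185.0
hypothetical_sd_weight = 20.0

z = standard_units(225, hypothetical_mean_weight, hypothetical_sd_weight)
print(z)
```

Under these assumed values, a weight of 225 pounds is 2 standard units, meaning 2 SDs above the mean.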

Standardization

The process of converting all values of a variable (i.e. a column) to standard units is known as standardization, and the resulting values are considered to be standardized.

The effect of standardization

Standardized variables have:

- a mean of 0, and
- an SD of 1.

We often standardize variables to bring them to the same scale.
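To illustrate, the sketch below standardizes two small made-up arrays measured in different units. The values and the `standardize` helper are assumptions introduced here, not the lecture's data or code.

```python
import numpy as np


def standardize(values):
    """Convert every value to standard units."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


# Hypothetical heights (inches) and weights (pounds).
heights = np.array([64.0, 66.5, 69.0, 71.5, 74.0])
weights = np.array([150.0, 165.0, 187.0, 200.0, 225.0])

heights_su = standardize(heights)
weights_su = standardize(weights)

# After standardization, both variables have mean 0 and SD 1 (up to rounding).
print(heights_su.mean(), heights_su.std())
print(weights_su.mean(), weights_su.std())
```

Once both variables are in standard units, their values can be compared directly even though the original units differ.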

Aside: To quickly see summary statistics for a numerical Series, use the .describe() Series method.
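A small sketch of `.describe()`, assuming plain pandas and a made-up Series. One detail worth knowing: the `std` reported by `.describe()` divides by $n - 1$ rather than $n$, so it differs slightly from the SD defined earlier in this lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only.
values = pd.Series([2, 3, 3, 9])

# Returns count, mean, std, min, quartiles, and max.
summary = values.describe()
print(summary)

# The SD as defined in this lecture divides by n (np.std's default).
population_sd = np.std(values.to_numpy())
print(population_sd)
```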

Let's look at how the process of standardization works visually.