# Lecture 21 – Spread, The Normal Distribution

## DSC 10, Fall 2022

### Announcements

• Homework 6 is due tomorrow at 11:59pm.
• Lab 7 is due Saturday 11/19 at 11:59pm.
• The Final Project is released, and has two deadlines:
  • The checkpoint is due Thursday 11/17 at 11:59pm. No slip days!
  • The final submission is due Tuesday 11/29 at 11:59pm. Slip days allowed.
• Tomorrow from 10-11am in the SDSC Auditorium, come talk to Janine, Suraj, and other HDSI faculty at the HDSI faculty/student mixer!

### Agenda

• Recap: Mean and median.
• Standard deviation.
• Standardization.
• The normal distribution.

## Recap: Mean and median

### Example: Flight delays ✈️

Question: Which is larger – the mean or the median?

### Comparing the mean and median

• Mean: Balance point of the histogram.
  • Numerically: the sum of the differences between all data points and the mean is 0.
  • Physically: Think of a see-saw.
• Median: Half-way point of the data.
  • Half of the area of a histogram is to the left of the median, and half is to the right.
• If the distribution is symmetric about a value, then that value is both the mean and the median.
• If the distribution is skewed, then the mean is pulled away from the median in the direction of the tail.
• Key property: The median is more robust (less sensitive) to outliers.
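To see the robustness property concretely, here is a minimal sketch using a small made-up array of delays (the values are hypothetical, not the flight data from the lecture). Adding one extreme value moves the mean a lot, while the median barely changes.

```python
import numpy as np

# Hypothetical delays in minutes; NOT the lecture's flight data.
delays = np.array([0, 2, 3, 5, 10])

# Add a single extreme delay (an outlier in the right tail).
delays_with_outlier = np.append(delays, 300)

mean_before, median_before = np.mean(delays), np.median(delays)
mean_after, median_after = np.mean(delays_with_outlier), np.median(delays_with_outlier)

print("Without outlier: mean =", mean_before, " median =", median_before)
print("With outlier:    mean =", mean_after, " median =", median_after)
```

The outlier sits in the right tail, so the mean is pulled to the right of the median, matching the skewness rule above.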

## Standard deviation

### Question: How "wide" is a distribution?

• One idea:
  • The range quantifies how far the extreme values are from one another (max - min).
  • Issue: this doesn’t tell us much about the shape of the distribution.
• Another idea:
  • The mean is at the center.
  • The standard deviation quantifies how far the data points typically are from the center.

### Deviations from the mean

Each entry in deviations describes how far the corresponding element in data is from 4.25.

What is the average deviation?

• The average deviation is 0. This is true of any dataset: the positive and negative deviations cancel, so the average deviation from the mean is always 0.
• This implies that the average deviation itself is not useful in measuring the spread of data.
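A minimal sketch of this computation, assuming a hypothetical array chosen so that its mean is 4.25 (the names `data` and `deviations` follow the text above; the values themselves are an assumption):

```python
import numpy as np

# Hypothetical data chosen so that the mean is 4.25.
data = np.array([2, 3, 3, 9])

# Each deviation is (value - mean): negative below the mean, positive above.
deviations = data - np.mean(data)

print(np.mean(data))        # the center of the data
print(deviations)           # how far each value is from the center
print(np.mean(deviations))  # positives and negatives cancel out
```

The cancellation is not a coincidence: the deviations sum to (sum of the values) minus (number of values × mean), and those two quantities are equal by the definition of the mean.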

### Average squared deviation

This quantity, the average squared deviation from the mean, is called the variance.
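As a sketch, the variance can be computed by squaring each deviation before averaging, so that negative and positive deviations no longer cancel. The array below is hypothetical.

```python
import numpy as np

# Hypothetical data (mean 4.25).
data = np.array([2, 3, 3, 9])

deviations = data - np.mean(data)

# Squaring makes every deviation non-negative, so the average is
# 0 only when all values are identical.
variance = np.mean(deviations ** 2)

print(variance)
```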

### Standard deviation

• Our data usually has units, e.g. dollars.
• The variance is in "squared" units, e.g. $\text{dollars}^2$.
• To account for this, we can take the square root of the variance, and the result is called the standard deviation.

### Standard deviation

• The standard deviation (SD) measures roughly how far the data values are from their average.
  • It is not directly interpretable because of the squaring and square rooting.
  • But generally, larger SD = more spread out.
• The standard deviation has the same units as the original data.
• numpy has a function, np.std, that calculates the standard deviation for us.
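The sketch below checks, on a hypothetical array, that taking the square root of the average squared deviation gives the same number as np.std. By default np.std divides by the number of observations (its `ddof` argument defaults to 0).

```python
import numpy as np

# Hypothetical data, e.g. amounts in dollars.
data = np.array([2, 3, 3, 9])

# Variance is in squared units (dollars^2)...
variance = np.mean((data - np.mean(data)) ** 2)

# ...so the square root brings us back to the original units (dollars).
sd_by_hand = np.sqrt(variance)

# np.std performs the same calculation in one call.
sd_numpy = np.std(data)

print(sd_by_hand, sd_numpy)
```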

### Variance and standard deviation

To summarize:

\begin{align*}\text{variance} &= \text{average squared deviation from the mean}\\ &= \frac{(\text{value}_1 - \text{mean})^2 + ... + (\text{value}_n - \text{mean})^2}{n}\\ \text{standard deviation} &= \sqrt{\text{variance}} \end{align*}

where $n$ is the number of observations.

### What can we do with the standard deviation?

It turns out, in any numerical distribution, the bulk of the data are in the range “mean ± a few SDs”.

Let's make this more precise.

### Chebyshev’s inequality

Fact: In any numerical distribution, the proportion of values in the range “mean ± $z$ SDs” is at least

$$1 - \frac{1}{z^2}$$

| Range | Proportion |
| --- | --- |
| mean ± 2 SDs | at least $1 - \frac{1}{4}$ (75%) |
| mean ± 3 SDs | at least $1 - \frac{1}{9}$ (88.88...%) |
| mean ± 4 SDs | at least $1 - \frac{1}{16}$ (93.75%) |
| mean ± 5 SDs | at least $1 - \frac{1}{25}$ (96%) |
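The proportions above follow directly from the formula. A small sketch, where the function name `chebyshev_lower_bound` is our own and not from any library:

```python
def chebyshev_lower_bound(z):
    """Minimum proportion of values within z SDs of the mean (for z > 0)."""
    return 1 - 1 / z ** 2

# Reproduce the bounds for z = 2, 3, 4, 5.
for z in [2, 3, 4, 5]:
    print(f"mean ± {z} SDs: at least {chebyshev_lower_bound(z):.4f}")
```

Note that for $z \leq 1$ the formula gives a value of 0 or less, so the inequality only says something useful when $z > 1$.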

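### Mean and standard deviation

Chebyshev's inequality tells us that

• At least 75% of delays are within 2 SDs of the mean, i.e. in the interval from mean − 2 SDs to mean + 2 SDs.
• At least 88.88% of delays are within 3 SDs of the mean, i.e. in the interval from mean − 3 SDs to mean + 3 SDs.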

Let's visualize these intervals!

### Chebyshev's inequality provides lower bounds!

Remember, Chebyshev's inequality states that at least $1 - \frac{1}{z^2}$ of values are within $z$ SDs from the mean, for any numerical distribution.

For instance, it tells us that at least 75% of delays are in the interval from mean − 2 SDs to mean + 2 SDs.

However, in this case, a much larger fraction of delays are in that interval.

If we know more about the shape of the distribution, we can provide better guarantees for the proportion of values within $z$ SDs of the mean.
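To illustrate how loose the bound can be, the sketch below simulates a right-skewed dataset as a stand-in for the delays (the simulated values and the seed are assumptions, not the lecture's data) and compares the actual proportion within 2 SDs of the mean to the guaranteed 75%.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated right-skewed "delays" in minutes; a stand-in for real data.
delays = rng.exponential(scale=15, size=10_000)

mean = np.mean(delays)
sd = np.std(delays)

# Proportion of values in the interval [mean - 2 SDs, mean + 2 SDs].
in_interval = (delays >= mean - 2 * sd) & (delays <= mean + 2 * sd)
within_2_sds = np.mean(in_interval)

print("Chebyshev guarantees at least:", 1 - 1 / 2 ** 2)
print("Actual proportion:            ", within_2_sds)
```

For this particular shape the actual proportion should come out around 95%, well above the guaranteed 75%. Chebyshev's bound has to hold for every possible distribution, so it is often pessimistic for any specific one.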

### Activity

For a particular set of data points, Chebyshev's inequality states that at least $\frac{8}{9}$ of the data points are between -20 and 40. What is the standard deviation of the data?

Answer (try it yourself first):

• Chebyshev's inequality states that at least $1 - \frac{1}{z^2}$ of values are within $z$ standard deviations of the mean.
• When $z = 3$, $1 - \frac{1}{z^2} = \frac{8}{9}$.
• So, -20 is 3 standard deviations below the mean, and 40 is 3 standard deviations above the mean.
• 10 is in the middle of -20 and 40, so the mean is 10.
• 3 standard deviations are between 10 and 40, so 1 standard deviation is $\frac{30}{3} = 10$.
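The same reasoning can be written as a short calculation. This is just a sketch of the arithmetic in the answer, with variable names of our own choosing:

```python
import math

low, high = -20, 40   # endpoints of the interval from the activity
proportion = 8 / 9    # the guaranteed proportion

# Solve 1 - 1/z**2 = proportion for z.
z = math.sqrt(1 / (1 - proportion))

# The interval "mean ± z SDs" is centered at the mean.
mean = (low + high) / 2

# The distance from the mean to an endpoint is z SDs.
sd = (high - mean) / z

print(round(z, 6), mean, round(sd, 6))
```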

## Standardization

### Heights and weights 📏

We'll work with a data set containing the heights and weights of 5000 adult males.

### Distributions of height and weight

Let's look at the distributions of both numerical variables.

Observation: The two distributions look like shifted and stretched versions of the same basic shape, called a bell curve 🔔.

### Standard units

Suppose $x$ is a numerical variable, and $x_i$ is one value of that variable. The function $$z(x_i) = \frac{x_i - \text{mean of x}}{\text{SD of x}}$$

converts $x_i$ to standard units. A value in standard units is the number of standard deviations $x_i$ is above the mean; it is negative when $x_i$ is below the mean.

Example: Suppose someone weighs 225 pounds. What is their weight in standard units?

• Interpretation: 225 is 1.92 standard deviations above the mean weight.
• 225 becomes 1.92 in standard units.
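A minimal sketch of this conversion is below. The function name `standard_units` and the small array of weights are assumptions for illustration; the lecture's actual dataset of 5000 weights is not reproduced here, so the numbers will differ from the 1.92 above.

```python
import numpy as np

def standard_units(values):
    """Convert every value to standard units: (value - mean) / SD."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)

# Hypothetical weights in pounds; NOT the lecture's dataset.
weights = np.array([150, 170, 190, 210, 230])

weights_su = standard_units(weights)
print(weights_su)
```

Here the mean is 190, so 190 becomes 0 in standard units, values below 190 become negative, and values above 190 become positive.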

### Standardization

The process of converting all values of a variable (i.e. a column) to standard units is known as standardization, and the resulting values are considered to be standardized.

### The effect of standardization

Standardized variables have:

• A mean of 0.
• An SD of 1.

We often standardize variables to bring them to the same scale.
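The sketch below checks both properties on two simulated variables with very different scales. The simulated heights and weights, their made-up parameters, and the helper `standard_units` are assumptions for illustration only.

```python
import numpy as np

def standard_units(values):
    """Convert every value to standard units: (value - mean) / SD."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)

rng = np.random.default_rng(0)

# Simulated columns with made-up parameters; NOT the lecture's dataset.
heights = rng.normal(loc=69, scale=3, size=5000)    # inches
weights = rng.normal(loc=187, scale=20, size=5000)  # pounds

for name, column in [("height", heights), ("weight", weights)]:
    su = standard_units(column)
    # After standardization, both variables are on the same scale.
    print(name, "mean:", round(np.mean(su), 6), "SD:", round(np.std(su), 6))
```

Subtracting the mean shifts the center to 0, and dividing by the SD rescales the spread to 1, regardless of the original units.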

Aside: To quickly see summary statistics for a numerical Series, use the .describe() Series method.
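A short sketch of .describe(), assuming the Series is a pandas Series and using hypothetical weights. One detail to be aware of: the `std` entry reported by .describe() divides by $n - 1$ rather than $n$, so it is slightly larger than the value np.std returns by default.

```python
import numpy as np
import pandas as pd

# Hypothetical weights in pounds.
weights = pd.Series([150, 170, 190, 210, 230])

# count, mean, std, min, quartiles, and max in one call.
summary = weights.describe()
print(summary)

# describe() uses the n - 1 denominator; np.std uses n by default.
print("describe std:", summary["std"])
print("np.std:      ", np.std(weights.to_numpy()))
```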

Let's look at how the process of standardization works visually.