In [1]:

```
# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter
from scipy import stats

set_matplotlib_formats("svg")
plt.style.use('ggplot')

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

# Animations
from IPython.display import display, HTML, IFrame, clear_output
import ipywidgets as widgets

# Demonstration code
def r_scatter(r):
    """Generate a scatter plot with a correlation approximately r."""
    x = np.random.normal(0, 1, 1000)
    z = np.random.normal(0, 1, 1000)
    y = r * x + (np.sqrt(1 - r ** 2)) * z
    plt.scatter(x, y)
    plt.xlim(-4, 4)
    plt.ylim(-4, 4)

def show_scatter_grid():
    """Show a grid of scatter plots for several values of r."""
    plt.subplots(1, 4, figsize=(10, 2))
    for i, r in enumerate([-1, -2/3, -1/3, 0]):
        plt.subplot(1, 4, i + 1)
        r_scatter(r)
        plt.title(f'r = {np.round(r, 2)}')
    plt.show()
    plt.subplots(1, 4, figsize=(10, 2))
    for i, r in enumerate([1, 2/3, 1/3]):
        plt.subplot(1, 4, i + 1)
        r_scatter(r)
        plt.title(f'r = {np.round(r, 2)}')
    plt.subplot(1, 4, 4)
    plt.axis('off')
    plt.show()
```

- Homework 7 is due **tomorrow at 11:59pm**.
- Lab 8 is due on **Saturday 11/26 at 11:59pm**.
- The Final Project is due on **Tuesday 11/29 at 11:59pm**.
- Suraj's lecture section (C00) is not meeting on Wednesday 11/23 or Monday 11/28.
    - Attend and/or watch the podcasts of any other lecture section on those days.
    - Try to attend the earlier sections, since they have more space.

- Recap: Statistical inference.
- Association.
- Correlation.
- Regression.

Every statistical test and simulation we've run in the second half of the class is related to one of the following four ideas. To solidify your understanding of what we've done, it's a good idea to review past lectures and assignments and see how what we did in each section relates to one of these four ideas.

- To test whether a sample came from a known population distribution, use "standard" hypothesis testing.

- To test whether two samples came from the same unknown population distribution, use permutation testing.

- To estimate a population parameter given a single sample, construct a confidence interval using bootstrapping (for most statistics) or the CLT (for the sample mean).

- To test whether a population parameter is equal to a particular value, $x$, construct a confidence interval using bootstrapping (for most statistics) or the CLT (for the sample mean), and check whether $x$ is in the interval.
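As a refresher, the bootstrap procedure from the last two bullets can be sketched in a few lines of numpy. This is an illustrative sketch, not code from a past lecture; the sample below is made up.

```python
import numpy as np

def bootstrap_ci(sample, statistic, n_resamples=5000, level=95):
    """Approximate a confidence interval for a population parameter by
    resampling from the sample with replacement and re-computing the statistic."""
    rng = np.random.default_rng(42)
    boot_stats = np.array([
        statistic(rng.choice(sample, size=len(sample), replace=True))
        for _ in range(n_resamples)
    ])
    tail = (100 - level) / 2
    return np.percentile(boot_stats, tail), np.percentile(boot_stats, 100 - tail)

# Made-up sample of 100 observations from a population with unknown mean.
sample = np.random.default_rng(0).normal(10, 2, 100)
lo, hi = bootstrap_ci(sample, np.mean)
```

To test whether the population mean equals some particular value $x$ at the 5% significance level, we'd check whether $x$ falls between `lo` and `hi`.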

- Is the set of Twitter users that follow Elon Musk a random sample of all Twitter users?

- What does statistical significance mean in this context? How is this related to our "Choosing sample sizes" example from Lecture 23?

- Suppose we have a dataset with at least two numerical variables.

- We're interested in **predicting** one variable based on another:
    - Given my education level, what is my income?
    - Given my height, how tall will my kid be as an adult?
    - Given my age, how many countries have I visited?

- To do this effectively, we need to first observe a pattern between the two numerical variables.

- To see if a pattern exists, we'll need to draw a scatter plot.

- In Lecture 2, we said "association is another term for 'any relation' or 'link' 🔗."

- In this context, an **association** is any relationship or link between two variables in a **scatter plot**. Associations can be linear or non-linear.

- If two variables have a positive association ↗️, then as one variable increases, the other tends to increase.
- If two variables have a negative association ↘️, then as one variable increases, the other tends to decrease.

- As we saw in Lecture 2, association $\neq$ causation!
- However, association is enough to let us make predictions.

In [2]:

```
hybrid = bpd.read_csv('data/hybrid.csv')
hybrid
```

Out[2]:

| | vehicle | year | price | acceleration | mpg | class |
|---|---|---|---|---|---|---|
| 0 | Prius (1st Gen) | 1997 | 24509.74 | 7.46 | 41.26 | Compact |
| 1 | Tino | 2000 | 35354.97 | 8.20 | 54.10 | Compact |
| 2 | Prius (2nd Gen) | 2000 | 26832.25 | 7.97 | 45.23 | Compact |
| ... | ... | ... | ... | ... | ... | ... |
| 150 | C-Max Energi Plug-in | 2013 | 32950.00 | 11.76 | 43.00 | Midsize |
| 151 | Fusion Energi Plug-in | 2013 | 38700.00 | 11.76 | 43.00 | Midsize |
| 152 | Chevrolet Volt | 2013 | 39145.00 | 11.11 | 37.00 | Compact |

153 rows × 6 columns

### `'acceleration'` and `'price'`

Is there an association between these two variables? If so, what kind?

In [3]:

```
hybrid.plot(kind='scatter', x='acceleration', y='price', figsize=(10, 5));
```

### `'mpg'` and `'price'`

Is there an association between these two variables? If so, what kind?

In [4]:

```
hybrid.plot(kind='scatter', x='mpg', y='price', figsize=(10, 5));
```

**Observations:**

- There is an association – cars with better fuel economy tended to be cheaper.
- Why do we think that is? 🤔

- The association looks more curved than linear.
- It may roughly follow $y \approx \frac{1}{x}$.

- A linear change in units doesn't change the shape of the plot, it only changes the scale of the plot.
- Linear change means adding or subtracting a constant, and multiplying or dividing by a constant.

- In other words, instead of plotting price in *dollars* and fuel economy in *miles per gallon*, we can plot price in *yen (🇯🇵)* and fuel economy in *kilometers per liter*, and the plot would look the same, just with different axes:

In [5]:

```
hybrid.assign(
    km_per_liter=hybrid.get('mpg') * 0.425144,
    yen=hybrid.get('price') * 140.34
).plot(kind='scatter', x='km_per_liter', y='yen', figsize=(10, 5));
```

- Recall: Suppose $x$ is a numerical variable, and $x_i$ is one value of that variable. To convert $x_i$ to standard units, $$x_{i \: \text{(su)}} = \frac{x_i - \text{mean of $x$}}{\text{SD of $x$}}$$

- Converting columns to standard units makes different scatter plots comparable, by putting the $x$ and $y$ axes on the **same scale**.
    - Both axes measure the number of standard deviations above the mean.
- Converting columns to standard units doesn't change the shape of the scatter plot, because the conversion is linear.

In [6]:

```
def standard_units(any_numbers):
    """Convert any array of numbers to standard units."""
    any_numbers = np.array(any_numbers)
    return (any_numbers - any_numbers.mean()) / np.std(any_numbers)
```

In [7]:

```
def standardize(df):
    """Return a DataFrame in which all columns of df are converted to standard units."""
    df_su = bpd.DataFrame()
    for column in df.columns:
        df_su = df_su.assign(**{column + ' (su)': standard_units(df.get(column))})
    return df_su
```

For a given pair of variables:

- Which cars are average in both variables?
- Which cars are well above or well below average in both variables?

In [8]:

```
hybrid_su = standardize(hybrid.get(['price', 'acceleration', 'mpg'])).assign(vehicle=hybrid.get('vehicle'))
hybrid_su
```

Out[8]:

| | price (su) | acceleration (su) | mpg (su) | vehicle |
|---|---|---|---|---|
| 0 | -6.94e-01 | -1.54 | 0.59 | Prius (1st Gen) |
| 1 | -1.86e-01 | -1.28 | 1.76 | Tino |
| 2 | -5.85e-01 | -1.36 | 0.95 | Prius (2nd Gen) |
| ... | ... | ... | ... | ... |
| 150 | -2.98e-01 | -0.07 | 0.75 | C-Max Energi Plug-in |
| 151 | -2.90e-02 | -0.07 | 0.75 | Fusion Energi Plug-in |
| 152 | -8.17e-03 | -0.29 | 0.20 | Chevrolet Volt |

153 rows × 4 columns

### `'acceleration'` and `'price'`

In [9]:

```
hybrid_su.plot(kind='scatter', x='acceleration (su)', y='price (su)', figsize=(10, 5));
```

Which cars have `'acceleration'`s and `'price'`s that are more than 2 SDs above average?

In [10]:

```
hybrid_su[(hybrid_su.get('acceleration (su)') > 2) &
          (hybrid_su.get('price (su)') > 2)]
```

Out[10]:

| | price (su) | acceleration (su) | mpg (su) | vehicle |
|---|---|---|---|---|
| 47 | 2.71 | 2.05 | -1.46 | ActiveHybrid X6 |
| 60 | 3.04 | 2.88 | -1.16 | ActiveHybrid 7 |
| 95 | 2.96 | 2.12 | -1.35 | ActiveHybrid 7i |
| 146 | 2.11 | 2.12 | -0.90 | ActiveHybrid 7L |
| 147 | 2.66 | 2.24 | -0.90 | Panamera S |

### `'mpg'` and `'price'`

In [11]:

```
hybrid_su.plot(kind='scatter', x='mpg (su)', y='price (su)', figsize=(10, 5));
```

Which cars have close to average `'mpg'`s and close to average `'price'`s?

In [12]:

```
hybrid_su[(hybrid_su.get('mpg (su)') <= 0.3) &
          (hybrid_su.get('mpg (su)') >= -0.3) &
          (hybrid_su.get('price (su)') <= 0.3) &
          (hybrid_su.get('price (su)') >= -0.3)]
```

Out[12]:

| | price (su) | acceleration (su) | mpg (su) | vehicle |
|---|---|---|---|---|
| 10 | -1.24e-01 | -0.56 | -0.26 | Escape |
| 22 | -2.13e-01 | -1.02 | -0.17 | Mercury Mariner |
| 57 | -8.47e-02 | 0.72 | -0.11 | Audi Q5 |
| ... | ... | ... | ... | ... |
| 70 | -2.14e-01 | -0.07 | 0.02 | HS 250h |
| 102 | -2.69e-03 | -0.29 | 0.20 | Chevrolet Volt |
| 152 | -8.17e-03 | -0.29 | 0.20 | Chevrolet Volt |

8 rows × 4 columns

- If two variables are positively associated ↗️,
    - their high, positive values in standard units are typically seen together, and
    - their low, negative values are typically seen together as well.
- If two variables are negatively associated ↘️,
    - high, positive values of one are typically coupled with low, negative values of the other.
- If two variables aren't associated, there should be no such pattern.

When there is a positive association, most data points fall in the lower left and upper right quadrants.

In [13]:

```
hybrid_su.plot(kind='scatter', x='acceleration (su)', y='price (su)', figsize=(10, 5))
plt.axvline(0, color='black');
plt.axhline(0, color='black');
```

When there is a negative association, most data points fall in the upper left and lower right quadrants.

In [14]:

```
hybrid_su.plot(kind='scatter', x='mpg (su)', y='price (su)', figsize=(10, 5))
plt.axvline(0, color='black');
plt.axhline(0, color='black');
```
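We can check this quadrant intuition numerically on synthetic data (not the hybrid dataset): for a positively associated pair, most points in standard units have $x$ and $y$ with the same sign, i.e. a positive product.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 1000)
y = 0.7 * x + rng.normal(0, 1, 1000)  # positively associated by construction

# Convert both variables to standard units.
x_su = (x - x.mean()) / np.std(x)
y_su = (y - y.mean()) / np.std(y)

# Fraction of points in the lower-left or upper-right quadrants,
# i.e. points where x_su and y_su have the same sign.
same_sign = np.mean(x_su * y_su > 0)
```

For this simulated data, `same_sign` comes out well above one half; for a negatively associated pair it would be well below one half.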

The correlation coefficient $r$ of two variables $x$ and $y$ is defined as the **average** value of the **product** of $x$ and $y$, when both are measured in **standard units**.

If `x` and `y` are two Series or arrays,

```
r = (x_su * y_su).mean()
```

where `x_su` and `y_su` are `x` and `y` converted to standard units.
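As a sanity check, this definition agrees with numpy's built-in `np.corrcoef`. The arrays here are synthetic, just for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 500)
y = -0.5 * x + rng.normal(0, 1, 500)

# Convert to standard units, then average the products.
x_su = (x - x.mean()) / np.std(x)
y_su = (y - y.mean()) / np.std(y)
r = (x_su * y_su).mean()

# np.corrcoef returns the full 2x2 correlation matrix;
# the off-diagonal entry is r.
r_builtin = np.corrcoef(x, y)[0, 1]
```

The two values match up to floating-point error, since scaling choices cancel in the correlation's numerator and denominator.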

Let's calculate $r$ for `'acceleration'` and `'price'`.

In [15]:

```
hybrid_su
```

Out[15]:

| | price (su) | acceleration (su) | mpg (su) | vehicle |
|---|---|---|---|---|
| 0 | -6.94e-01 | -1.54 | 0.59 | Prius (1st Gen) |
| 1 | -1.86e-01 | -1.28 | 1.76 | Tino |
| 2 | -5.85e-01 | -1.36 | 0.95 | Prius (2nd Gen) |
| ... | ... | ... | ... | ... |
| 150 | -2.98e-01 | -0.07 | 0.75 | C-Max Energi Plug-in |
| 151 | -2.90e-02 | -0.07 | 0.75 | Fusion Energi Plug-in |
| 152 | -8.17e-03 | -0.29 | 0.20 | Chevrolet Volt |

153 rows × 4 columns

In [16]:

```
r_acc_price = (hybrid_su.get('acceleration (su)') * hybrid_su.get('price (su)')).mean()
r_acc_price
```

Out[16]:

0.6955778996913978

In [17]:

```
hybrid_su.plot(kind='scatter', x='acceleration (su)', y='price (su)', figsize=(10, 5))
plt.axvline(0, color='black');
plt.axhline(0, color='black');
```

Note that the correlation is positive, and most data points fall in the lower left and upper right quadrants!

Let's now calculate $r$ for `'mpg'` and `'price'`.

In [18]:

```
hybrid_su
```

Out[18]:

| | price (su) | acceleration (su) | mpg (su) | vehicle |
|---|---|---|---|---|
| 0 | -6.94e-01 | -1.54 | 0.59 | Prius (1st Gen) |
| 1 | -1.86e-01 | -1.28 | 1.76 | Tino |
| 2 | -5.85e-01 | -1.36 | 0.95 | Prius (2nd Gen) |
| ... | ... | ... | ... | ... |
| 150 | -2.98e-01 | -0.07 | 0.75 | C-Max Energi Plug-in |
| 151 | -2.90e-02 | -0.07 | 0.75 | Fusion Energi Plug-in |
| 152 | -8.17e-03 | -0.29 | 0.20 | Chevrolet Volt |

153 rows × 4 columns

In [19]:

```
r_mpg_price = (hybrid_su.get('mpg (su)') * hybrid_su.get('price (su)')).mean()
r_mpg_price
```

Out[19]:

-0.5318263633683786

In [20]:

```
hybrid_su.plot(kind='scatter', x='mpg (su)', y='price (su)', figsize=(10, 5));
plt.axvline(0, color='black');
plt.axhline(0, color='black');
```

Note that the correlation is negative, and most data points fall in the upper left and lower right quadrants!

- $r$ measures how clustered points are around a straight line – **it measures linear association**.
    - If two variables are correlated, it means they are linearly associated.
- $r$ is always between $-1$ and $1$.
    - If $r = 1$, the scatter plot is a perfect straight line sloping upwards (in standard units, a line of slope 1).
    - If $r = -1$, the scatter plot is a perfect straight line sloping downwards (in standard units, a line of slope -1).
    - If $r = 0$, there is no linear association (*uncorrelated*).
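These properties are easy to verify with the standard-units recipe from earlier; the data below are made up. Note that for perfectly linear data, $r$ comes out to exactly $\pm 1$ regardless of the line's actual slope:

```python
import numpy as np

def correlation(x, y):
    """The average of the product of x and y, measured in standard units."""
    x, y = np.array(x), np.array(y)
    x_su = (x - x.mean()) / np.std(x)
    y_su = (y - y.mean()) / np.std(y)
    return (x_su * y_su).mean()

x = np.arange(10)
r_up = correlation(x, 3 * x + 7)     # perfectly linear, positive slope → r = 1
r_down = correlation(x, -2 * x + 1)  # perfectly linear, negative slope → r = -1
```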

In [21]:

```
show_scatter_grid()
```