Lecture 24 – Correlation

DSC 10, Fall 2022

Announcements

Agenda

Recap: Statistical inference

Four big ideas in statistical inference

Every statistical test and simulation we've run in the second half of the class is related to one of the following four ideas. To solidify your understanding of what we've done, it's a good idea to review past lectures and assignments and see how what we did in each section relates to one of these four ideas.

Recent events

Questions to think about:

Association

Prediction

Association

Example: Hybrid cars 🚗

'acceleration' and 'price'

Is there an association between these two variables? If so, what kind?

'mpg' and 'price'

Is there an association between these two variables? If so, what kind?

Observations:

Linear changes in units

Converting columns to standard units
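
As a minimal sketch of the conversion, the function below subtracts the mean and divides by the standard deviation. The name `standard_units` and the sample values are assumptions made for illustration; they are not taken from the hybrid cars data. The sketch also assumes the population SD, which is what `np.std` computes by default.

```python
import numpy as np

def standard_units(values):
    """Convert a sequence of numbers to standard units (mean 0, SD 1)."""
    values = np.asarray(values, dtype=float)
    # np.std uses the population SD (ddof=0) by default.
    return (values - values.mean()) / values.std()

# Hypothetical values, for illustration only.
example = np.array([2.0, 4.0, 6.0, 8.0])
example_su = standard_units(example)
```

After conversion, a value of 1.5 means "1.5 SDs above average", whatever the original units were.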

Standard units for hybrid cars

For a given pair of variables:

'acceleration' and 'price'

Which cars have an 'acceleration' and a 'price' that are both more than 2 SDs above average?

'mpg' and 'price'

Which cars have both an 'mpg' and a 'price' that are close to average?

Observation on associations in standard units

When there is a positive association, most data points fall in the lower left and upper right quadrants.

When there is a negative association, most data points fall in the upper left and lower right quadrants.
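
One way to check the quadrant observation numerically is to look at the sign of the product of the two variables in standard units: it is positive in the lower left and upper right quadrants, and negative in the other two. The sketch below uses made-up arrays (not the hybrid cars data), and the helper names are hypothetical.

```python
import numpy as np

def standard_units(values):
    """Convert a sequence of numbers to standard units (population SD)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def fraction_in_positive_quadrants(x, y):
    """Fraction of points in the lower left or upper right quadrant."""
    product = standard_units(x) * standard_units(y)
    return np.mean(product > 0)

# Hypothetical data: y_up tends to rise with x, y_down tends to fall.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_up = np.array([1.5, 1.0, 3.5, 3.0, 6.5, 5.0])
y_down = -y_up

frac_up = fraction_in_positive_quadrants(x, y_up)
frac_down = fraction_in_positive_quadrants(x, y_down)
```

For the rising data, more than half of the points land in the lower left and upper right quadrants; for the falling data, fewer than half do.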

Correlation

Definition: Correlation coefficient

The correlation coefficient $r$ of two variables $x$ and $y$ is defined as the average of the product of $x$ and $y$, when both are measured in standard units.

If x and y are two Series or arrays,

r = (x_su * y_su).mean()

where x_su and y_su are x and y converted to standard units.
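
The formula above can be sketched as a small function. The names `standard_units` and `correlation` and the sample arrays are assumptions for illustration; the arrays are not from the hybrid cars data. The sketch assumes the population SD (the `np.std` default); with that choice the result agrees with `np.corrcoef`.

```python
import numpy as np

def standard_units(values):
    """Convert a sequence of numbers to standard units (population SD)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def correlation(x, y):
    """Mean of the product of x and y, both in standard units."""
    return (standard_units(x) * standard_units(y)).mean()

# Hypothetical values, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
r = correlation(x, y)
```

Note that the `.std()` method of a pandas Series uses the sample SD (`ddof=1`) by default, so converting with it instead would give a slightly different value.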

Let's calculate $r$ for 'acceleration' and 'price'.

Note that the correlation is positive, and most data points fall in the lower left and upper right quadrants!

Let's now calculate $r$ for 'mpg' and 'price'.

Note that the correlation is negative, and most data points fall in the upper left and lower right quadrants!

The correlation coefficient, $r$