Lecture 25 – Regression and Least Squares¶

DSC 10, Fall 2022¶

Announcements¶

• Lab 8 is due on Saturday 11/26 at 11:59pm.
• The Final Project is due on Tuesday 11/29 at 11:59pm.
• No class or office hours on Thursday or Friday. Happy Thanksgiving! 🦃
• There are several study sessions/group office hours in Week 10, which should be helpful as you complete the final project and study for the final exam. These are marked in green on the calendar.
• Monday 11/28 from 12-2pm in PCNYH 122.
• Tuesday 11/29 from 7-9pm in SDSC Auditorium (with no heat 🥶; dress warmly 🧣).
• Wednesday 11/30 from 3-7pm in SDSC Auditorium (with no heat 🥶; dress warmly 🧣).
• Friday 12/2 from 5-9pm in WLH 2205.
• Lecture section C00 is not meeting today or Monday 11/28 – Suraj is in India 🇮🇳.

Agenda¶

• The regression line, in standard units.
• The regression line, in original units.
• Outliers.
• Errors in prediction.

The regression line, in standard units¶

Example: Predicting heights 👪 📏¶

Recall, in the last lecture, we aimed to use a mother's height to predict her adult son's height.

Correlation¶

Recall, the correlation coefficient $r$ of two variables $x$ and $y$ is defined as the

• average value of the
• product of $x$ and $y$
• when both are measured in standard units.
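
The definition above can be sketched in a few lines of Python. This is only an illustration using the standard library: the function names `standard_units` and `correlation` and the height values are assumptions made up for this example, not the lecture's own code or data.

```python
from statistics import fmean, pstdev

def standard_units(values):
    """Convert a list of numbers to standard units (using the population SD)."""
    mean = fmean(values)
    sd = pstdev(values)
    return [(v - mean) / sd for v in values]

def correlation(x, y):
    """Average of the product of x and y, when both are in standard units."""
    x_su = standard_units(x)
    y_su = standard_units(y)
    return fmean(a * b for a, b in zip(x_su, y_su))

# Made-up heights in inches, for illustration only.
mothers = [60, 62, 64, 65, 67]
sons = [66, 69, 68, 71, 72]
r = correlation(mothers, sons)
print(r)
```

Because both variables are converted to standard units first, the result does not depend on whether heights are recorded in inches or centimeters.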

The regression line¶

• The regression line is the line through $(0,0)$ with slope $r$, when both variables are measured in standard units.
• We use the regression line to make predictions!

Making predictions in standard units¶

• If $r = 0.32$, and the given $x$ is $2$ in standard units, then the prediction for $y$ is $0.64$ standard units.
• The regression line predicts that a mother whose height is $2$ SDs above average has a son whose height is $0.64$ SDs above average.
• Measured in standard units, we always predict that a son will be somewhat closer to average in height than his mother.
• This is a consequence of the slope $r$ having magnitude less than 1.
• This effect is called regression to the mean.
• The regression line passes through the origin $(0, 0)$ in standard units. This means that, no matter what $r$ is, for an average $x$ value, we predict an average $y$ value.
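
As a small sketch of the points above, the snippet below applies the rule "prediction = $r$ times $x$" to a few values of $x$ in standard units, using the $r = 0.32$ from the example. The function name `predict_su` is a hypothetical helper introduced here.

```python
def predict_su(x_su, r):
    """Predict y in standard units from x in standard units."""
    return r * x_su

r = 0.32  # correlation from the example above

# Each prediction is closer to 0 (the average) than the x it came from.
for x_su in [-2, -1, 0, 1, 2]:
    print(x_su, predict_su(x_su, r))
```

An average $x$ (0 in standard units) always leads to an average predicted $y$, whatever the value of $r$.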

Making predictions in original units¶

Of course, we'd like to be able to predict a son's height in inches, not just in standard units. Given a mother's height in inches, here's how we'll predict her son's height in inches:

1. Convert the mother's height from inches to standard units.
$$x_{i \: \text{(su)}} = \frac{x_i - \text{mean of } x}{\text{SD of } x}$$
2. Multiply by the correlation coefficient to predict the son's height in standard units.
$$\text{predicted } y_{i \: \text{(su)}} = r \cdot x_{i \: \text{(su)}}$$
3. Convert the son's predicted height from standard units back to inches.
$$\text{predicted } y_i = \text{predicted } y_{i \: \text{(su)}} \cdot \text{SD of } y + \text{mean of } y$$
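
The three steps above can be sketched as one function. The name `predict_original_units` and the summary values passed to it are hypothetical, chosen only to make the arithmetic easy to follow; they are not the values from the heights dataset.

```python
def predict_original_units(x, mean_x, sd_x, r, mean_y, sd_y):
    """Predict y in original units from x in original units."""
    # Step 1: convert x to standard units.
    x_su = (x - mean_x) / sd_x
    # Step 2: multiply by the correlation coefficient.
    predicted_y_su = r * x_su
    # Step 3: convert the prediction back to original units.
    return predicted_y_su * sd_y + mean_y

# Hypothetical summary values, for illustration only.
prediction = predict_original_units(x=66, mean_x=64, sd_x=2, r=0.5, mean_y=70, sd_y=3)
print(prediction)  # 71.5
```

Here the input is 1 SD above average, so the prediction is 0.5 SDs above average: $0.5 \cdot 3 + 70 = 71.5$.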

Concept Check ✅ – Answer at cc.dsc10.com¶

A course has a midterm (mean 80, standard deviation 15) and a really hard final (mean 50, standard deviation 12).

If the scatter plot comparing midterm & final scores for students looks linearly associated with correlation 0.75, then what is the predicted final exam score for a student who received a 90 on the midterm?

• A. 54
• B. 56
• C. 58
• D. 60
• E. 62

The regression line, in original units¶

Reflection¶

Each time we wanted to predict the height of an adult son given the height of a mother, we had to:

1. Convert the mother's height from inches to standard units.
2. Multiply by the correlation coefficient to predict the son's height in standard units.
3. Convert the son's predicted height from standard units back to inches.

This is inconvenient – wouldn't it be great if we could express the regression line itself in inches?

From standard units to original units¶

When $x$ and $y$ are in standard units, the regression line is given by

$$\text{predicted } y_{\text{(su)}} = r \cdot x_{\text{(su)}}$$

What is the regression line when $x$ and $y$ are in their original units?

The regression line in original units¶

• We can work backwards from the relationship $$\text{predicted } y_{\text{(su)}} = r \cdot x_{\text{(su)}}$$ to find the line in original units.
$$\frac{\text{predicted } y - \text{mean of }y}{\text{SD of }y} = r \cdot \frac{x - \text{mean of } x}{\text{SD of }x}$$
• Note that $r$, $\text{mean of } x$, $\text{mean of } y$, $\text{SD of } x$, and $\text{SD of } y$ are constants – if you have a DataFrame with two columns, you can determine all five values.
• Re-arranging the above equation into the form $\text{predicted } y = mx + b$ yields the formulas:
$$m = r \cdot \frac{\text{SD of } y}{\text{SD of }x}, \: \: b = \text{mean of } y - m \cdot \text{mean of } x$$
• $m$ is the slope of the regression line and $b$ is the intercept.
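
For reference, here is one way to carry out the rearrangement. Multiply both sides of the standard-units relationship by $\text{SD of } y$:

$$\text{predicted } y - \text{mean of } y = r \cdot \frac{\text{SD of } y}{\text{SD of } x} \cdot (x - \text{mean of } x)$$

Then expand the right-hand side and add $\text{mean of } y$ to both sides:

$$\text{predicted } y = \underbrace{r \cdot \frac{\text{SD of } y}{\text{SD of } x}}_{m} \cdot x + \underbrace{\text{mean of } y - r \cdot \frac{\text{SD of } y}{\text{SD of } x} \cdot \text{mean of } x}_{b}$$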

Let's implement these formulas in code and try them out.
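
A minimal sketch of such an implementation, using only Python's standard library, might look like the following. The function names `correlation`, `slope`, and `intercept` are assumptions for this example, and the toy data is made up so that the answer is easy to check by eye.

```python
from statistics import fmean, pstdev

def correlation(x, y):
    """Average product of x and y in standard units (population SDs)."""
    mean_x, mean_y = fmean(x), fmean(y)
    sd_x, sd_y = pstdev(x), pstdev(y)
    return fmean(((a - mean_x) / sd_x) * ((b - mean_y) / sd_y) for a, b in zip(x, y))

def slope(x, y):
    """m = r * (SD of y) / (SD of x)."""
    return correlation(x, y) * pstdev(y) / pstdev(x)

def intercept(x, y):
    """b = (mean of y) - m * (mean of x)."""
    return fmean(y) - slope(x, y) * fmean(x)

# Toy data lying exactly on the line y = 2x + 1, for illustration only.
toy_x = [1, 2, 3, 4]
toy_y = [3, 5, 7, 9]
print(slope(toy_x, toy_y), intercept(toy_x, toy_y))
```

When the points lie exactly on a line, the formulas recover that line's slope and intercept (up to floating-point rounding), which is a useful sanity check.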

Below, we compute the slope and intercept of the regression line between mothers' heights and sons' heights (in inches).

So, the regression line is

$$\text{predicted son's height} = 0.365 \cdot \text{mother's height} + 45.858$$

Making predictions¶

What's the predicted height of a son whose mother is 62 inches tall?

What if the mother is 55 inches tall? 73 inches tall?
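
As a sketch, these predictions can be computed by plugging each height into the regression line stated above. The function name `predict_son_height` is hypothetical, and the coefficients are the rounded values 0.365 and 45.858 quoted in the text, so results may differ slightly from ones computed with unrounded coefficients.

```python
def predict_son_height(mother_height):
    """Predicted son's height in inches, using the rounded slope and intercept."""
    return 0.365 * mother_height + 45.858

for mother_height in [62, 55, 73]:
    print(mother_height, round(predict_son_height(mother_height), 2))
```

With these rounded coefficients, the predictions are roughly 68.5, 65.9, and 72.5 inches, respectively.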

Outliers¶

The effect of outliers on correlation¶

Consider the dataset below. What is the correlation between $x$ and $y$?

Removing the outlier¶

Takeaway: Even a single outlier can have a massive impact on the correlation, and hence the regression line. Look for these before performing regression. Always visualize first!
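
To see how large the effect can be, here is a small sketch with made-up data (not the dataset from the lecture). Ten points lie exactly on a line, and then a single far-away point is added. The `correlation` helper is a hypothetical function written for this example.

```python
from statistics import fmean, pstdev

def correlation(x, y):
    """Average product of x and y in standard units (population SDs)."""
    mean_x, mean_y = fmean(x), fmean(y)
    sd_x, sd_y = pstdev(x), pstdev(y)
    return fmean(((a - mean_x) / sd_x) * ((b - mean_y) / sd_y) for a, b in zip(x, y))

# Made-up data: ten points exactly on the line y = x.
x = list(range(1, 11))
y = list(range(1, 11))
r_without = correlation(x, y)

# The same data plus one outlier at (30, -20).
x_with = x + [30]
y_with = y + [-20]
r_with = correlation(x_with, y_with)

print(r_without, r_with)
```

In this example, the correlation is essentially 1 without the outlier and becomes negative once the single outlier is included.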

Errors in prediction¶

Motivation¶

• We've presented the regression line in standard units as the line through the origin with slope $r$, given by $\text{predicted } y_{\text{(su)}} = r \cdot x_{\text{(su)}}$. Then, we used this equation to find a formula for the regression line in original units.
• In examples we've seen so far, the regression line seems to fit our data pretty well.
• But how well?
• What makes the regression line good?
• Would another line be better?

Example: Without the outlier¶

We think our regression line is pretty good because most data points are pretty close to the regression line. The red lines are quite short.

Measuring the error in prediction¶

$$\text{error} = \text{actual value} - \text{prediction}$$
• Typically, some errors are positive and some negative.
• What does a positive error mean? What about a negative error?
• To measure the rough size of the errors, for a particular set of predictions:
1. Square the errors so that they don't cancel each other out.
2. Take the mean of the squared errors.
3. Take the square root to fix the units.
• This is called the root mean squared error (RMSE).
• Notice the similarities to computing the SD!
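
The three steps above can be sketched as a short function. The name `rmse` and the example values are assumptions made up for illustration.

```python
from math import sqrt
from statistics import fmean

def rmse(actual, predicted):
    """Root mean squared error of a set of predictions."""
    # error = actual value - prediction
    errors = [a - p for a, p in zip(actual, predicted)]
    # Step 1: square the errors so that they don't cancel out.
    squared_errors = [e ** 2 for e in errors]
    # Step 2: take the mean. Step 3: take the square root.
    return sqrt(fmean(squared_errors))

# Made-up actual and predicted heights in inches.
actual = [70, 68, 72, 65]
predicted = [69, 70, 71, 67]
print(rmse(actual, predicted))
```

Here the errors are 1, -2, 1, and -2, so the mean squared error is 2.5 and the RMSE is $\sqrt{2.5}$, or about 1.58 inches.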

Root mean squared error (RMSE) of the regression line's predictions¶

The RMSE of the regression line's predictions is about 2.2. Is this big or small, relative to the predictions of other lines? 🤔

Root mean squared error (RMSE) in an arbitrary line's predictions¶

• We've been using the regression line to make predictions. But we could use a different line!
• To make a prediction for x using an arbitrary line defined by slope and intercept, compute x * slope + intercept.
• For this dataset, if we choose a different line, we will end up with different predictions, and hence a different RMSE.

Let's compute the RMSEs of several different lines on the same dataset.
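
One way to sketch this comparison is shown below. The data, the candidate lines, and the helper names `rmse_of_line` and `regression_line` are all made up for this example; they are not the dataset or code from the lecture.

```python
from math import sqrt
from statistics import fmean, pstdev

def rmse_of_line(x, y, slope, intercept):
    """RMSE of the predictions x * slope + intercept."""
    errors = [yi - (xi * slope + intercept) for xi, yi in zip(x, y)]
    return sqrt(fmean(e ** 2 for e in errors))

def regression_line(x, y):
    """Slope and intercept of the regression line, in original units."""
    mean_x, mean_y = fmean(x), fmean(y)
    sd_x, sd_y = pstdev(x), pstdev(y)
    r = fmean(((xi - mean_x) / sd_x) * ((yi - mean_y) / sd_y) for xi, yi in zip(x, y))
    m = r * sd_y / sd_x
    b = mean_y - m * mean_x
    return m, b

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

m, b = regression_line(x, y)
candidate_lines = {
    "regression line": (m, b),
    "steeper line": (1.5, 0),
    "flatter line": (0.5, 3),
    "flat line at mean of y": (0, fmean(y)),
}

rmses = {name: rmse_of_line(x, y, s, i) for name, (s, i) in candidate_lines.items()}
for name, value in rmses.items():
    print(name, round(value, 3))
```

For this made-up dataset, none of the other candidate lines has a smaller RMSE than the regression line.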