From basic concepts to not so basic concepts
Linear regression is the first step to understanding machine learning. We start our exploration of linear regression by considering the simplest possible situation: linear functions.
A linear function
New York University (NYU), like most schools, charges its part-time students by credit-hour. Suppose we have three part-time students at NYU, Isaac, Octavia, and Margaret.
- Isaac is taking 5 credit hours this semester and paying $7,060.
- Octavia is taking 12 credit hours and paying $16,944.
- Margaret is taking 3 credit hours and paying $4,236.
Let’s graph the three students’ credit-hours versus their costs.
These points are on the same line. The equation of a line with slope m and y-intercept b is y = mx + b.
We can calculate the line’s slope from any two points. For example, take Octavia and Isaac. We divide their change in y (change in cost) by their change in x (change in credit-hours) to get the slope of the line. So the slope is (16944 - 7060)/(12 - 5) = 1412 dollars per credit-hour. The slope corresponds to the cost of each credit hour.
The y-intercept of this line is 0 since, as we would expect, a student taking 0 credit hours should pay $0.
Now that we have an equation we can make predictions. Suppose we want to know the cost of taking 8 credit-hours. y = 1412(8) = $11,296. A linear function leads to perfectly accurate predictions. If our data is not related by a linear function we can still make predictions but accuracy will vary.
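The slope and the prediction above can be checked with a few lines of Python. This is only a minimal sketch: the names `slope` and `predict_tuition` are illustrative and not part of the original example.

```python
# Tuition data from the example: (credit hours, cost in dollars).
students = {
    "Isaac": (5, 7060),
    "Octavia": (12, 16944),
    "Margaret": (3, 4236),
}


def slope(p, q):
    """Slope of the line through points p and q: change in y over change in x."""
    (x1, y1), (x2, y2) = p, q
    return (y2 - y1) / (x2 - x1)


# Any two students give the same slope because all three points are collinear.
m = slope(students["Isaac"], students["Octavia"])
b = 0  # a student taking 0 credit hours pays $0


def predict_tuition(credit_hours):
    """Predicted cost y = m*x + b for a given number of credit hours."""
    return m * credit_hours + b


print(m)                   # dollars per credit hour
print(predict_tuition(8))  # predicted cost of 8 credit hours
```

Swapping in a different pair of students should leave the slope unchanged, which is a quick way to confirm that the three points really are on one line.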
A statistical relationship
Suppose, instead of tuition costs, we consider textbook costs. Unlike tuition, textbook costs are not defined by a mathematical formula based on credit hours. Textbook costs depend on the classes the individual students are taking.
- Isaac is taking 5 credit hours this semester and paid $500 for textbooks.
- Octavia is taking 12 credit hours and paid $700.
- Margaret is taking 3 credit hours and paid $250.
As we can see from the graph below, the points do not lie on the same line.
If we want to make a prediction about textbook costs, what line should we use? The line connecting Isaac and Octavia over-estimates costs for students with low credit hours. Margaret and Octavia’s line under-estimates costs for students with credit-hours that fall in the middle. Linear regression is a strategy for choosing a line that best fits the data. Consider the following two lines.
Which line, blue or red, is a better predictor of textbook cost given credit-hours? Which line better fits our data? For each point, we will consider its vertical distance to a line, its error. For several reasons it is more convenient to consider the square of each error rather than the error itself.
We use the sum of the squares of the errors as a measure of how well a line fits the data.
For the red line, the first and last points (Margaret and Octavia) are both exactly on the line. However, the middle point (Isaac) is above the line by $150. So the sum of the squared errors is 0 + 150² + 0 = 22500.
The blue line is similarly situated, with two points (Isaac and Octavia) exactly on the line and one point away from it. But this time the distance between the point and the line is greater: about $192.86. The sum of the squared errors is 192.86² + 0 + 0 ≈ 37195. Since the sum of the squared errors is less for the red line, we conclude that the red line better fits our data. Consider the following line.
For this line, the vertical distances between the points and the line are 70.5, 90.7, and 20.1. The sum of the squared errors is 70.5² + 90.7² + 20.1² = 13600.75. This number is less than the corresponding value for both the red and blue lines. Therefore the black line is a better fit for our data. In fact, out of all possible lines, the black line has the minimal sum of squared errors. Such a line is called the least squares regression line.
Before we learn how to find the least squares regression line, let’s review the key concepts.
- The error of a point with respect to a line is the vertical distance between the point and the line.
- The least squares regression line is the line that has the minimal sum of squared errors.
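The two definitions above can be sketched in Python using the textbook data. The sketch assumes the red line passes exactly through Margaret and Octavia and the blue line exactly through Isaac and Octavia, as the error values in the text suggest; the helper names `sse` and `line_through` are illustrative. Because nothing is rounded along the way, the blue line's result may differ slightly from a hand-rounded figure.

```python
# Textbook data from the example: (credit hours, textbook cost in dollars).
points = [(3, 250), (5, 500), (12, 700)]  # Margaret, Isaac, Octavia


def sse(points, m, b):
    """Sum of squared vertical distances between the points and y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)


def line_through(p, q):
    """Slope and y-intercept of the line through two points."""
    (x1, y1), (x2, y2) = p, q
    m = (y2 - y1) / (x2 - x1)
    return m, y1 - m * x1


red = line_through(points[0], points[2])   # assumed: Margaret and Octavia
blue = line_through(points[1], points[2])  # assumed: Isaac and Octavia

red_sse = sse(points, *red)
blue_sse = sse(points, *blue)
print(red_sse, blue_sse)
print("red fits better" if red_sse < blue_sse else "blue fits better")
```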
Calculating the least squares regression line
Given two columns of data, our goal is to find the least squares regression line.
Our data: Old Faithful (pictured at the top of this article) is a geyser located in Yellowstone National Park, WY. It erupts on a surprisingly regular schedule. We will use the Old Faithful dataset as an example throughout the rest of this article. It is a popular two-column dataset that comes with the R programming language. Here are a few rows of the dataset.
Each row corresponds to an eruption of the Old Faithful Geyser. The first number in each row is the length of the eruption in minutes. The second number is the waiting time until the next eruption. There are a total of 272 rows. Our goal is to predict the waiting time for the next eruption based on the eruption length. First we plot each row of the data.
There are a lot of interesting features in this dataset. In particular, notice that eruptions seem to fall into two categories: long ones with long waiting times and short ones with short waiting times. But for now let’s ignore these two clusters and concentrate on finding the least squares regression line.
Each of our candidate lines will have a number corresponding to its sum of square errors (SSE). Since there are 272 points, our sum will have 272 terms.
- For each row, i, let xᵢ be the eruption length and yᵢ the waiting time.
- Suppose our candidate line is ŷ = mx + b, where m is its slope and b its y-intercept.
For example, consider the first point in the data set, x₁=3.6 and y₁ = 79. We have ŷ(x₁) =m x₁ + b = m(3.6) + b. We calculate the square error of this point as (y₁ - ŷ(x₁))² = (y₁ - (m x₁ + b ))² = (79 - m (3.6) - b )².
We add up the square errors for each point to obtain the sum of square errors (SSE).
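Written out for all 272 points, and using the notation defined above, that sum takes the following form (the typesetting is a reconstruction from the definitions in the text):

```latex
\[
\mathrm{SSE}(m, b)
  = \sum_{i=1}^{272} \bigl(y_i - \hat{y}(x_i)\bigr)^2
  = \sum_{i=1}^{272} \bigl(y_i - (m x_i + b)\bigr)^2
\]
```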
The SSE depends only on the slope and y-intercept of the line.
Our goal is to find a line (m and b) that has minimal SSE.
There are infinitely many possible lines, but we don’t need to calculate the SSE for every one of them. To see how we can avoid that, consider all the lines with y-intercept b = 20. Here are 11 of them.
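Supposing our line has a y-intercept (b) of 20, the SSE simplifies to SSE(m) = Σ (yᵢ - m xᵢ - 20)², where the sum runs over all 272 points.
Notice that the SSE now depends only on the slope of the line. It is a quadratic function of m with a positive leading coefficient, so its graph is a parabola that opens upward. We can therefore find the slope that gives the minimum SSE by finding the vertex of the parabola.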
So among all lines with a y-intercept of 20, the line with slope 14.22 has the lowest SSE. By considering all lines with a given y-intercept you should now have an idea of how it is possible to find a line that minimizes SSE.
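To see where the vertex comes from, expand the SSE for a fixed intercept b as a quadratic in m: SSE(m) = (Σxᵢ²)m² - 2(Σxᵢ(yᵢ - b))m + Σ(yᵢ - b)². Its vertex is at m = Σxᵢ(yᵢ - b) / Σxᵢ². The sketch below applies this to a small excerpt, the first six rows of R's `faithful` dataset, so the slope it prints will differ from the value obtained from all 272 rows. The function names are illustrative.

```python
# First six rows of R's `faithful` dataset: (eruption length, waiting time), in minutes.
# This is only an excerpt, so results differ from those for the full 272 rows.
rows = [(3.600, 79), (1.800, 54), (3.333, 74), (2.283, 62), (4.533, 85), (2.872, 55)]


def sse(rows, m, b):
    """Sum of squared errors of the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in rows)


def best_slope_for_intercept(rows, b):
    """Vertex of the parabola SSE(m) when the y-intercept b is held fixed."""
    sum_xx = sum(x * x for x, _ in rows)
    sum_xr = sum(x * (y - b) for x, y in rows)
    return sum_xr / sum_xx


b = 20
m_best = best_slope_for_intercept(rows, b)
print(m_best, sse(rows, m_best, b))

# Nudging the slope in either direction can only increase the SSE.
for m in (m_best - 0.5, m_best + 0.5):
    print(m, sse(rows, m, b))
```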
Using calculus, we can solve this problem in general: we can always find the slope and y-intercept of the line of best fit. The symbol x̄ denotes the mean of the x values, that is, the sum of all the x values divided by the number of values; y̅ is the mean of the y values. The line of best fit, y = mx + b, has slope m = Σ(xᵢ - x̄)(yᵢ - y̅) / Σ(xᵢ - x̄)² and y-intercept b = y̅ - m x̄.
Example: Let’s find the line of best fit for our Old Faithful data. Our x values are the eruption lengths. The y values are the waiting times. Of course, with 272 terms this calculation must be done on a computer.
So we have m = 10.73 and b = 33.47.
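As a hedged illustration of the calculation, the sketch below implements the standard closed-form least squares formulas, m = Σ(xᵢ - x̄)(yᵢ - y̅) / Σ(xᵢ - x̄)² and b = y̅ - m x̄, in plain Python. It runs on the three textbook-cost points from earlier rather than on the Old Faithful data, which is not reproduced here. The function name `fit_line` is an assumption made for illustration.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for paired data xs, ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    m = s_xy / s_xx
    b = y_bar - m * x_bar
    return m, b


credit_hours = [3, 5, 12]        # Margaret, Isaac, Octavia
textbook_cost = [250, 500, 700]  # dollars

m, b = fit_line(credit_hours, textbook_cost)
errors = [y - (m * x + b) for x, y in zip(credit_hours, textbook_cost)]
print(round(m, 2), round(b, 2))
print([round(abs(e), 1) for e in errors])  # compare with 70.5, 90.7, 20.1
```

The absolute errors agree with the distances quoted earlier for the black line, which is consistent with that line being the least squares fit for the textbook data.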
Even though we typically use a computer or a calculator to find the least squares regression line, it is important to know how to perform the calculation. There are some interesting facts we can derive from the equations for m and b. For example, we can show that the point (x̄, y̅) is on the least squares regression line. To see this, note that b = y̅ - mx̄, so x̄m + b = x̄m + y̅ - mx̄ = y̅.
Example: Use the least squares regression line, ŷ = 10.73x + 33.47, to predict the waiting time until the next eruption if the current eruption lasts 4 minutes. 10.73(4) + 33.47 = 76.39 minutes.
We know how to find the least squares regression line for two columns of data, and we can use this line to make predictions. But how reliable are our predictions?
The correlation coefficient (R) gives us a way to measure the reliability of our predictions.
Let’s start with the definitions of standard deviation (S) and variance (S²).
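One common way to write these definitions for a variable x, assuming the sample convention that divides by n - 1 (some texts divide by n instead), is:

```latex
\[
S_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
S_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
\]
```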
The variable n is the size of our data set. So for our Old Faithful data n = 272. Both standard deviation and variance describe the amount that the variable x differs from its mean. How reliable are our least squares regression predictions? To answer this, we should consider the following question.
For our data, how well do variations in the variable x correspond to variations in the variable y? That is, how do the variables co-vary?
We define covariance in the following way.
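A standard form of the definition, assuming the same n - 1 sample convention as for the variance, is:

```latex
\[
\operatorname{Cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
\]
```

Note that the covariance of x with itself is just the variance of x.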
Covariance is a good start towards our goal of measuring the reliability of our predictions. But it has one problem: covariance depends on the units of x and y. If we measure x in hours instead of minutes, the covariance shrinks by a factor of 60. Dividing by the standard deviation of each variable takes care of this problem. The Pearson Correlation Coefficient, R, is defined as follows.
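With S_x and S_y denoting the standard deviations of x and y, the usual form of the definition is shown below. The second expression holds whichever divisor (n or n - 1) is used, because the factor cancels.

```latex
\[
R = \frac{\operatorname{Cov}(x, y)}{S_x \, S_y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]
```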
We can also express R in terms of the slope of the least squares regression line.
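One way to see this, assuming the covariance and the standard deviations S_x and S_y use the same divisor, is to note that the least squares slope is the covariance divided by the variance of x:

```latex
\[
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
  = \frac{\operatorname{Cov}(x, y)}{S_x^2}
\quad\Longrightarrow\quad
R = \frac{\operatorname{Cov}(x, y)}{S_x \, S_y}
  = \frac{m \, S_x^2}{S_x \, S_y}
  = m \, \frac{S_x}{S_y}
\]
```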
Here are some facts about R.
- -1 ≤ R ≤ 1 (This fact is a bit challenging to prove)
- The correlation is positive if R > 0. It is negative if R < 0.
- If R is close to 1 or -1 then the correlation is strong.
- If R is close to 0 then the correlation is weak.
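The definitions and facts above can be checked numerically. The following sketch assumes the sample (n - 1) convention for covariance and standard deviation and uses only the first six rows of R's `faithful` dataset, so the R it prints is not the value for the full 272 rows. All function names are illustrative.

```python
from math import sqrt

# First six rows of R's `faithful` dataset (an excerpt, not the full 272 rows).
eruptions = [3.600, 1.800, 3.333, 2.283, 4.533, 2.872]  # minutes
waiting = [79, 54, 74, 62, 85, 55]                       # minutes


def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)


def std_dev(xs):
    """Sample standard deviation: square root of the covariance of xs with itself."""
    return sqrt(covariance(xs, xs))


def pearson_r(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))


r = pearson_r(eruptions, waiting)
print(r)

# Changing units rescales the covariance but leaves R unchanged.
hours = [x / 60 for x in eruptions]
print(covariance(eruptions, waiting), covariance(hours, waiting))
print(pearson_r(hours, waiting))

# R also equals the least squares slope times S_x / S_y.
m = covariance(eruptions, waiting) / covariance(eruptions, eruptions)
print(m * std_dev(eruptions) / std_dev(waiting))
```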
The stronger the correlation between two variables, the more reliable the predictions we get from the least squares regression line.
Example: Our Old Faithful data set has R = 0.901. This is a strong correlation.
We learned about the sum of the squared errors (SSE) between points and a line. The least squares regression line is the line that minimizes the SSE. We used this line to make predictions. The reliability of these predictions can be measured by the correlation coefficient (R). We conclude with some examples of least squares regression lines and correlations.