
Linear regression is used to find linear relationships in data. Given a set of data, it produces the line (or plane) of best fit. You have probably seen linear regression in a scientific or financial graph where there is a set of plotted points and a line drawn through the points.

To use linear regression, you need data with one or more features that are used to predict a \(1\)-dimensional outcome. For example, you could use location (\(2\) coordinates), square footage, and number of bathrooms to predict the price of a house. You will need at least as many data points as features.
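As a concrete sketch (the numbers here are made up purely for illustration), the house example could be stored as a feature matrix with one row per house and one column per feature, together with a one-dimensional vector of outcomes:

```python
import numpy as np

# Hypothetical houses: each row holds the features
# (latitude, longitude, square footage, number of bathrooms).
features = np.array([
    [40.7, -74.0, 1500, 2],
    [40.8, -73.9, 2100, 3],
    [40.6, -74.1,  900, 1],
    [40.7, -73.9, 1800, 2],
])

# One-dimensional outcome: price of each house, in thousands of dollars.
prices = np.array([650, 900, 400, 780])
```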

[Interactive demo: click in the box to add points, then run the demo to draw the linear regression line and its \(r\) score. In this example, the feature is \(x\) and the outcome is \(y;\) at least \(2\) points are needed.]

Math behind Linear Regression

Suppose you have data with \(n\) features that you use to predict a real-valued outcome, and suppose you have \(m \geq n\) data points. Data point \(d_i\) will have feature values \((a_{i1}, a_{i2}, \dots, a_{in})\) and outcome \(b_i.\) To approximate the data points with a linear function, you need values \(x_1, x_2, \dots, x_n\) such that \begin{align} & a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n = b_1 \\ & a_{21}x_1 + a_{22}x_2 + \dots + a_{2n}x_n = b_2 \\ & \vdots \\ & a_{m1}x_1 + a_{m2}x_2 + \dots + a_{mn}x_n = b_m \\ \end{align} Once you have found \(x_1, x_2, \dots, x_n,\) given a new data point \(d = (c_1, c_2, \dots, c_n),\) the predicted outcome is \[c_1x_1 + c_2x_2 + \dots + c_nx_n\]

Rewriting the formula for the known data in matrix notation, we want to solve for \(x_1, x_2, \dots, x_n\) in \[ \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & & & \\ a_{m1} & a_{m2} & \dots & a_{mn} \\ \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix} \] Writing the \((a_{ij})\) matrix as \(A,\) the \((x_i)\) vector as \(\overrightarrow{x},\) and the \((b_i)\) vector as \(\overrightarrow{b},\) we can write this as \[A\overrightarrow{x}=\overrightarrow{b}\] Since \(m \geq n,\) if the columns of \(A\) are linearly independent there will be \(0\) or \(1\) solutions, and when \(m > n\) there will most likely be \(0\) solutions.
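As a small NumPy sketch (with made-up values), here is the system \(A\overrightarrow{x} = \overrightarrow{b}\) for \(m = 3\) data points and \(n = 2\) features, along with the prediction for a new data point:

```python
import numpy as np

# Made-up example with m = 3 data points and n = 2 features.
A = np.array([
    [1.0, 2.0],   # features of data point d_1
    [3.0, 1.0],   # features of data point d_2
    [2.0, 4.0],   # features of data point d_3
])
b = np.array([5.0, 5.0, 10.0])  # outcomes b_1, b_2, b_3

# Here x = (1, 2) happens to solve the system exactly: A @ x equals b.
x = np.array([1.0, 2.0])
print(A @ x)                    # [ 5.  5. 10.]

# The predicted outcome for a new data point c = (c_1, c_2) is c . x.
c = np.array([4.0, 1.0])
print(c @ x)                    # 6.0
```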

In the case that there is \(1\) solution \(\overrightarrow{x},\) that is the solution you should use to estimate outcomes.

In the case that there are \(0\) solutions, the linear regression algorithm says you should use the vector \(\hat{x}\) which minimizes the squared error. Given a candidate vector \(\overrightarrow{x},\) the error term for the \(i\)th data point is \(e_i(\overrightarrow{x}),\) defined by \[e_i(\overrightarrow{x}) = (a_{i1}x_1 + a_{i2}x_2 + \dots + a_{in}x_n) - b_i\] and the squared error of the vector \(\overrightarrow{x} = (x_1, x_2, \dots, x_n)\) is the sum of the squared errors over all of the data points: \[e_1(\overrightarrow{x})^2 + e_2(\overrightarrow{x})^2 + \dots + e_m(\overrightarrow{x})^2\] Since the squared error is a convex quadratic function of \(\overrightarrow{x}\) (strictly convex when the columns of \(A\) are linearly independent), there is exactly one minimum, \(\hat{x}.\) It is known from linear algebra that this minimum is \[\hat{x} = (A^\intercal A)^{-1} A^\intercal \overrightarrow{b} \]
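Here is a minimal NumPy sketch of the formula above, assuming the columns of \(A\) are linearly independent so that \(A^\intercal A\) is invertible. In practice `np.linalg.lstsq` computes the same minimizer in a more numerically stable way, so the sketch also checks against it:

```python
import numpy as np

# Made-up overdetermined system: m = 4 data points, n = 2 features.
A = np.array([
    [1.0, 1.0],
    [2.0, 1.0],
    [3.0, 1.0],
    [4.0, 1.0],
])
b = np.array([2.1, 2.9, 4.2, 4.8])

# Least-squares solution from the normal equations: (A^T A)^{-1} A^T b.
x_hat = np.linalg.inv(A.T @ A) @ A.T @ b

# The same minimizer, computed more stably.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_hat)     # [0.94 1.15]
print(x_lstsq)   # agrees with x_hat up to rounding
```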

2D Case

In the case that you have one input and one output, you are trying to find the line of best fit. You need \(m \geq 2\) data points. Data point \(d_i\) will have feature value \(a_i\) and outcome \(b_i.\)

The slope-intercept form of the equation of a line is \[y = mx + b\] To match the notation used above, we will replace \(m\) by \(x_1\) and \(b\) by \(x_2.\) So, to find an outcome that best fits the data, we need to find a slope \(x_1\) and a \(y\)-intercept \(x_2.\) To incorporate the constant into the equation, we can add a constant feature that has no effect on the outcome; this feature always has the value \(1.\) So, data point \(d_i\) will have features \((a_i, 1)\) and outcome \(b_i.\) (You can use this trick to incorporate a constant in any number of dimensions.)
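In code, the constant-feature trick just means appending a column of ones to the matrix of features. A minimal sketch with made-up \(a_i\) values:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])         # feature values a_1, ..., a_m
A = np.column_stack([a, np.ones_like(a)])  # each row is (a_i, 1)
print(A)
# [[1. 1.]
#  [2. 1.]
#  [3. 1.]
#  [4. 1.]]
```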

We want to find values \(x_1\) and \(x_2\) such that \begin{align} & x_1a_1 + x_2 = b_1 \\ & x_1a_2 + x_2 = b_2 \\ & \vdots \\ & x_1a_m + x_2 = b_m \\ \end{align} Written in matrix form, this is \[ \begin{bmatrix} a_1 & 1 \\ a_2 & 1 \\ \vdots & \\ a_m & 1 \\ \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix} \] The least squares solution is \(\hat{x} = (A^\intercal A)^{-1} A^\intercal \overrightarrow{b}.\) In this case, \[A^\intercal A = \begin{bmatrix} a_1 & a_2 & \dots & a_m \\ 1 & 1 & \dots & 1\\ \end{bmatrix} \begin{bmatrix} a_1 & 1 \\ a_2 & 1 \\ \vdots & \\ a_m & 1 \\ \end{bmatrix} = \begin{bmatrix} a_1^2+a_2^2+\dots+a_m^2 & a_1+a_2+\dots+a_m \\ a_1+a_2+\dots+a_m & m \\ \end{bmatrix} \]

We shall multiply and divide by \(m\) so that we can write the notation more concisely. Namely, the mean of the squares of the data is \[\overline{a^2} = \frac{a_1^2+a_2^2+\dots+a_m^2}{m}\] and the mean of the data is \[\overline{a} = \frac{a_1+a_2+\dots+a_m}{m}\] Using this notation, we have \[A^\intercal A = m\begin{bmatrix} \overline{a^2} & \overline{a} \\ \overline{a} & 1 \\ \end{bmatrix}\] Therefore the inverse is \[(A^\intercal A)^{-1} = \frac{1}{m(\overline{a^2}-\overline{a}^2)} \begin{bmatrix} 1 & -\overline{a} \\ -\overline{a} & \overline{a^2} \\ \end{bmatrix} \] Let \(\sigma_a^2\) be the variance of the data, so \[\sigma_a^2 = \overline{a^2}-\overline{a}^2\] Then \[(A^\intercal A)^{-1} = \frac{1}{m\sigma_a^2} \begin{bmatrix} 1 & -\overline{a} \\ -\overline{a} & \overline{a^2} \\ \end{bmatrix} \]

The rest of the expression is \[A^\intercal \overrightarrow{b} = \begin{bmatrix} a_1 & a_2 & \dots & a_m \\ 1 & 1 & \dots & 1\\ \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix} = \begin{bmatrix} a_1 b_1 + a_2 b_2 + \dots + a_m b_m \\ b_1 + b_2 + \dots + b_m \\ \end{bmatrix} = m \begin{bmatrix} \overline{ab} \\ \overline{b} \\ \end{bmatrix} \] where \(\overline{ab}\) is the mean of the products \(a_ib_i\) and \(\overline{b}\) is the mean of the outcomes. Writing \(\text{Cov}(a,b) = \overline{ab} - \overline{a}\cdot\overline{b}\) for the covariance of \(a\) and \(b\) (so that \(\overline{ab} = \text{Cov}(a,b) + \overline{a}\cdot\overline{b}\) and \(\overline{a^2} = \sigma_a^2 + \overline{a}^2\)), we get \begin{align} \hat{x} & = (A^\intercal A)^{-1} A^\intercal \overrightarrow{b} \\ & = \frac{1}{m\sigma_a^2} \begin{bmatrix} 1 & -\overline{a} \\ -\overline{a} & \overline{a^2} \\ \end{bmatrix} \cdot m \begin{bmatrix} \overline{ab} \\ \overline{b} \\ \end{bmatrix} \\ & = \frac{1}{\sigma_a^2} \begin{bmatrix} \overline{ab}-\overline{a}\cdot\overline{b} \\ -\overline{a}\cdot\overline{ab} + \overline{a^2}\cdot\overline{b} \\ \end{bmatrix} \\ & = \frac{1}{\sigma_a^2} \begin{bmatrix} \text{Cov}(a,b) \\ -\overline{a}(\text{Cov}(a,b)+\overline{a}\cdot\overline{b}) + (\sigma_a^2 + \overline{a}^2)\cdot\overline{b} \\ \end{bmatrix} \\ & = \begin{bmatrix} \text{Cov}(a,b)/\sigma_a^2 \\ \overline{b} - \overline{a}\text{Cov}(a,b)/\sigma_a^2 \\ \end{bmatrix} \\ \end{align} This solves for the slope and \(y\)-intercept: \begin{align} & \text{Slope: } x_1 = \text{Cov}(a,b)/\sigma_a^2 & \text{y-intercept: } x_2 = \overline{b} - \overline{a}\,\text{Cov}(a,b)/\sigma_a^2 \end{align}
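The closed-form slope and intercept can be checked against the general least-squares solution. The sketch below uses made-up data; note that the means, variance, and covariance follow the divide-by-\(m\) convention used in the derivation above:

```python
import numpy as np

# Made-up data points (a_i, b_i).
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 2.8, 4.1, 4.9, 6.2])

# Slope and y-intercept from the closed-form expressions.
cov_ab = np.mean(a * b) - np.mean(a) * np.mean(b)   # Cov(a, b)
var_a = np.mean(a**2) - np.mean(a)**2               # sigma_a^2
slope = cov_ab / var_a
intercept = np.mean(b) - np.mean(a) * slope

# The same result from the general formula, using the column-of-ones trick.
A = np.column_stack([a, np.ones_like(a)])
x_hat = np.linalg.lstsq(A, b, rcond=None)[0]

print(slope, intercept)   # 1.05 0.85 (approximately)
print(x_hat)              # matches (slope, intercept)
```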

Pearson's Correlation Coefficient - \(r\) Score

One way to measure how well the line fits the data is with Pearson's Correlation Coefficient, denoted \(r.\) The value of \(r\) measures the linear correlation between the \(x\) and \(y\) values and always lies between \(-1\) and \(1\): values near \(\pm 1\) indicate a strong linear relationship, while values near \(0\) indicate a weak one. For \(n\) data points it can be computed as \[r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n\sum y_i^2 - \left(\sum y_i\right)^2\right)}}\] Equivalently, \(r = \text{Cov}(x,y)/(\sigma_x \sigma_y),\) the covariance of the data divided by the product of the standard deviations.
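A minimal sketch of computing \(r\) for made-up data, using both the sum formula above and the equivalent covariance form:

```python
import numpy as np

# Made-up data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.8, 4.1, 4.9, 6.2])
n = len(x)

# Sum formula for Pearson's correlation coefficient.
num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
den = np.sqrt((n * np.sum(x**2) - np.sum(x)**2)
              * (n * np.sum(y**2) - np.sum(y)**2))
r = num / den

# Equivalent covariance form: Cov(x, y) / (sigma_x * sigma_y).
r_cov = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (np.std(x) * np.std(y))

print(r, r_cov)   # both about 0.997, a strong positive linear relationship
```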