In order to understand multiple linear regression, it’s useful to first examine a simpler case. Let’s imagine we have n data points, with a response y to be modelled by one independent variable, \(x_1\). Let’s further assume that y is well modelled by a straight line.
There is some error in our response, so the data points are not perfectly modelled by the straight line. For any individual point, we can say
\[y_i=(mx_i+b)+\epsilon_i\]
Where m is the slope of our fitted line and b is the intercept. \(mx_i+b\) gives the value our model predicts at \(x_i\), and \(\epsilon_i\) gives the error.
One of the assumptions of regression is that our error is normally distributed with mean 0 and constant variance.
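As a quick illustration, here is a minimal sketch in R (the slope, intercept, noise level, and sample size are made up for the example) that simulates data satisfying these assumptions and fits a straight line to it:

```r
# A minimal sketch: simulate y = m*x1 + b + eps, with eps ~ Normal(0, sigma^2)
set.seed(42)
n     <- 50
m     <- 2.5   # true slope (made up for illustration)
b     <- 1.0   # true intercept
sigma <- 0.8   # constant error standard deviation

x1  <- runif(n, 0, 10)
eps <- rnorm(n, mean = 0, sd = sigma)
y   <- m * x1 + b + eps

# The fitted intercept and slope should land close to b and m
coef(lm(y ~ x1))
```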
Now, in reality, y, \(x_1\), and \(\epsilon\) are vectors with n values, so we can express them as such
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = m\begin{bmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{1n} \end{bmatrix} + b\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} \]
Notice that we multiplied b by a vector of 1’s. It’s still true that \(y_i=mx_i+b+\epsilon_i\) for all i.
Now bear with me as I change some variables and move things around
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \beta_0\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} + \beta_1\begin{bmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{1n} \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} \]
All we’ve done is change b to \(\beta_0\), change m to \(\beta_1\), and switch their order. This form can easily be generalized to the case with multiple independent variables predicting the response y.
Let’s look at a situation where we have two independent variables, \(x_1\) and \(x_2\), predicting the response y
Here our best fit for any point \((x_1,x_2)\) is a plane, and our equation for that plane is
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \beta_0\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} + \beta_1\begin{bmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{1n} \end{bmatrix} + \beta_2\begin{bmatrix} x_{21} \\ x_{22} \\ \vdots \\ x_{2n} \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} \]
We can see how this generalizes to data with k independent variables
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \beta_0\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} + \beta_1\begin{bmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{1n} \end{bmatrix} + \beta_2\begin{bmatrix} x_{21} \\ x_{22} \\ \vdots \\ x_{2n} \end{bmatrix} + \cdots + \beta_k\begin{bmatrix} x_{k1} \\ x_{k2} \\ \vdots \\ x_{kn} \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} \]
In trying to fit the data, the only thing we can manipulate is our \(\beta\) parameters. That is to say, in modelling our response, we can only take linear combinations of our independent variables.
We can represent this linear combination in terms of matrix multiplication.
\[Y=X\beta+\epsilon\]
Where X, known as the design matrix, has shape \(n\times(k+1)\)
\[ X=\begin{bmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \end{bmatrix} \]
and \(\beta\) is a \((k+1)\times 1\) vector
\[ \beta=\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} \]
We want to find a vector \(\hat{\beta}\) that gives us the best approximation of the data, given the limitation that it must be a linear combination of the independent variables (or a matrix multiplication of the design matrix).
\[\hat{Y}=X\hat{\beta}\]
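To make the matrix form concrete, here is a small sketch in R (the predictors and the coefficient values are arbitrary, just for illustration) showing that the matrix product \(X\beta\) is exactly the linear combination \(\beta_0\cdot 1+\beta_1 x_1+\beta_2 x_2\):

```r
# A sketch with two arbitrary predictors and arbitrary coefficients
set.seed(1)
n  <- 5
x1 <- rnorm(n)
x2 <- rnorm(n)

X    <- cbind(1, x1, x2)   # design matrix: a column of 1's, then the predictors
beta <- c(2, -1, 0.5)      # (beta0, beta1, beta2), chosen arbitrarily

# Matrix multiplication gives the same result as the explicit linear combination
as.vector(X %*% beta)
beta[1] * rep(1, n) + beta[2] * x1 + beta[3] * x2
```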
Let’s think about the column space of our design matrix. The column space of X, denoted col(X), is the set of all possible linear combinations of the column vectors of X. We can immediately see that both \(X\beta\) and \(X\hat{\beta}\) are in col(X), since each is, by construction, a linear combination of the columns of X. Notice that col(X) is not the same as our plane of best fit for the case where k=2. The column space lives in n-dimensional space, one dimension per observation, and (assuming the columns of X are linearly independent) it is a (k+1)-dimensional subspace of that space.
To explain this, let’s imagine the case where n=3. In this case, Y must have shape \(3\times1\). Let’s say we have one independent variable, with example data
\[ Y=\begin{bmatrix} 4 \\ 9 \\ 20 \end{bmatrix}, \qquad X=\begin{bmatrix} 1 & 2 \\ 1 & 4 \\ 1 & 6 \end{bmatrix} \]
Let’s take some random linear combinations of the columns of X and plot them in 3 dimensions. Notice that any such linear combination will have shape \(3\times1\), so any individual linear combination of the columns of X gives you a vector in the plane col(X).
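Here is a quick numerical sketch of that idea in R (skipping the 3D plot): every random linear combination of the two columns is perpendicular to the normal vector of the plane they span, which is exactly what it means to lie in col(X).

```r
X <- cbind(1, c(2, 4, 6))   # the 3 x 2 design matrix from above

# Normal vector of the plane spanned by the two columns (their cross product)
u <- X[, 1]; v <- X[, 2]
normal <- c(u[2] * v[3] - u[3] * v[2],
            u[3] * v[1] - u[1] * v[3],
            u[1] * v[2] - u[2] * v[1])

# Five random linear combinations of the columns of X (each column is one 3 x 1 vector)
set.seed(7)
coefs  <- matrix(rnorm(2 * 5), nrow = 2)
combos <- X %*% coefs

# Every combination is perpendicular to the normal vector (zero up to rounding),
# i.e. it lies in the plane col(X)
t(combos) %*% normal
```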
Next we’ll plot our vector Y along with col(X). But first let’s think about where Y will live.
\[Y=X\beta+\epsilon\]
Because of the error term, Y in general is not a linear combination of the column vectors of X. Therefore we would expect Y to be outside of col(X)
So \(\hat{Y}=X\hat{\beta}\) is the best approximation of Y given the restriction that it must be in col(X). This is the same as saying \(\hat{Y}\) is the projection of Y onto col(X)
In general
\[Y=X\beta+\epsilon\]
Since \(X\hat{\beta}\) is the vector in col(X) that best approximates Y, its corresponding error vector \(\hat{\epsilon}\) will have the shortest length. That is, \(\hat{\epsilon}\) will be perpendicular to the column space of X, while still satisfying an equation of the form above.
\[Y=X\hat{\beta}+\hat{\epsilon}\]
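We can check this perpendicularity numerically on the toy data above; this sketch uses R’s built-in lm for the fit since we haven’t derived \(\hat{\beta}\) yet:

```r
Y <- c(4, 9, 20)
x <- c(2, 4, 6)
X <- cbind(1, x)

fit     <- lm(Y ~ x)
eps_hat <- resid(fit)   # the residual vector Y - X(beta_hat)

# Perpendicular to col(X): the residuals are orthogonal to every column of X
t(X) %*% eps_hat        # both entries are numerically zero
```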
Now that we understand the column space of the design matrix, we are ready to derive an expression for \(\hat{\beta}\)
We can actually get \(\hat{\beta}\) with simple matrix algebra.
\[Y=X\hat{\beta}+\hat{\epsilon}\]
\[Y-X\hat{\beta}=\hat{\epsilon}\]
\[X'(Y-X\hat{\beta})=X'\hat{\epsilon}\]
Where \('\) denotes the transpose and we have left-multiplied both sides by \(X'\).
But, because we know that \(\hat{\epsilon}\) is perpendicular to the column space of X, every column of X has a dot product of zero with \(\hat{\epsilon}\), so \(X'\hat{\epsilon}=0\)
\[X'(Y-X\hat{\beta})=0\]
\[X'Y-X'X\hat{\beta}=0\]
\[X'Y=X'X\hat{\beta}\]
Assuming \(X'X\) is invertible (which it is when the columns of X are linearly independent), we can left-multiply both sides by \((X'X)^{-1}\)
\[(X'X)^{-1}X'Y=(X'X)^{-1}X'X\hat{\beta}\]
\[(X'X)^{-1}X'Y=I_{k+1}\hat{\beta}\]
\[(X'X)^{-1}X'Y=\hat{\beta}\]
Now we have \(\hat{\beta}\) in terms of observables only.
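As a sanity check, computing \((X'X)^{-1}X'Y\) directly on the toy data gives the same coefficients as R’s lm (a minimal sketch):

```r
Y <- c(4, 9, 20)
x <- c(2, 4, 6)
X <- cbind(1, x)   # design matrix with an intercept column

# beta_hat = (X'X)^{-1} X'Y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y
beta_hat

coef(lm(Y ~ x))    # same intercept and slope
```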
We say that \(\hat{Y}\) is the projection of Y onto the column space of X. We also know that \(\hat{Y}=X\hat{\beta}=X(X'X)^{-1}X'Y\). In general, left-multiplying by \(X(X'X)^{-1}X'\) projects a vector onto col(X). We call this matrix \(H\), the hat matrix.
\[\hat{Y}=HY\]
\[H=X(X'X)^{-1}X'\]
Because it will be useful later, let’s prove that H is symmetric and idempotent
We need to show \(H'=H\). Using the identity \((ABC)'=C'B'A'\)
\[H'=[X(X'X)^{-1}X']'=(X')'((X'X)^{-1})'X'=X((X'X)^{-1})'X'\]
Another identity is that \((A^{-1})'=(A')^{-1}\)
\[H'=X((X'X)')^{-1}X'=X(X'X)^{-1}X'=H\]
We need to show \(H^2=H\), using \(A^{-1}A=I\)
\[H^2=HH=X(X'X)^{-1}X'X(X'X)^{-1}X'=X(X'X)^{-1}X'=H\]
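Numerically, with the toy data, both properties hold up to floating-point error, and HY reproduces the fitted values (a quick sketch in R):

```r
Y <- c(4, 9, 20)
x <- c(2, 4, 6)
X <- cbind(1, x)

H <- X %*% solve(t(X) %*% X) %*% t(X)   # the hat matrix

max(abs(H - t(H)))      # ~0: H is symmetric
max(abs(H %*% H - H))   # ~0: H is idempotent

H %*% Y                 # the same values as fitted(lm(Y ~ x))
```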
Next we’ll find an expression for \(s^2\), an estimate of the variance.
Finding an estimate for the variance is a bit more complicated than finding \(\hat{\beta}\). We’ll need to start with the residual sum of squares (which we’ll denote SSE)
\[SSE=\sum_{i=1}^nr_i^2\]
Where \(r_i\) is the observed value minus the model’s fitted value at point i.
We can see intuitively how this statistic could be useful, as it tells us how well the model approximates the data. The residuals are squared so that under- and over-predictions at different points don’t cancel each other out.
Returning to our picture of Y and col(X), we can see how \(\hat{\epsilon}\) is related to the residuals. From \(\hat{\epsilon}=Y-X\hat{\beta}\) we can see that \(\hat{\epsilon}\) is just the vector of residuals. So we might think the following equation makes sense
\[SSE=\hat{\epsilon}^2\]
But that doesn’t quite make sense, as we’d be multiplying two vectors of shape \(n\times1\), which is impossible. Instead we say
\[SSE=\hat{\epsilon}'\hat{\epsilon}\]
As an aside, let’s determine an easy way of calculating the residual sum of squares.
\[SSE=\hat{\epsilon}'\hat{\epsilon}\]
Since \(\hat{Y}=HY\), we have \(\hat{\epsilon}=Y-HY\)
\[SSE=(Y-HY)'(Y-HY)\]
Using the identity \((AB)'=B'A'\)
\[SSE=(Y'-Y'H')(Y-HY)\]
\[SSE=Y'Y-Y'HY-Y'H'Y+Y'H'HY\]
Because H is both symmetric and idempotent
\[SSE=Y'Y-Y'HY\]
\[SSE=Y'(I-H)Y\]
Since these are all observables (or products of observables) we can calculate this directly.
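As a quick check in R on the same toy data, the three ways of writing the residual sum of squares agree:

```r
Y <- c(4, 9, 20)
x <- c(2, 4, 6)
X <- cbind(1, x)

H       <- X %*% solve(t(X) %*% X) %*% t(X)
eps_hat <- Y - H %*% Y

sum(resid(lm(Y ~ x))^2)          # sum of squared residuals
t(eps_hat) %*% eps_hat           # SSE = eps_hat' eps_hat
t(Y) %*% (diag(3) - H) %*% Y     # SSE = Y'(I - H)Y
```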
Returning now to our attempt to find an estimator of the variance, let’s write \(\hat{\epsilon}\) in a different way
\[\hat{\epsilon}=Y-HY\]
We know, in general, that \(Y=X\beta+\epsilon\)
\[\hat{\epsilon}=(X\beta+\epsilon)-H(X\beta+\epsilon)\]
\[\hat{\epsilon}=X\beta+\epsilon-HX\beta-H\epsilon\]
As a quick aside, we’ll prove that \(HX=X\), using \(A^{-1}A=I\) with \(A=X'X\), which we assume is invertible
\[HX=X(X'X)^{-1}X'X=X\]
Therefore
\[\hat{\epsilon}=X\beta+\epsilon-X\beta-H\epsilon\]
\[\hat{\epsilon}=\epsilon-H\epsilon\]
\[\hat{\epsilon}=(I-H)\epsilon\]
Now we can return to our residual sum of squares
\[SSE=\hat{\epsilon}'\hat{\epsilon}\]
\[SSE=[(I-H)\epsilon]'(I-H)\epsilon\]
\[SSE=\epsilon'(I-H)'(I-H)\epsilon\]
To get this in the form we want, we’ll also have to prove that (I-H) is both symmetric and idempotent.
\[(I-H)'=I'-H'=I-H\] Since H is symmetric
\[(I-H)^2=(I-H)(I-H)=I-H-H+H^2=I-H\] Since H is idempotent.
So finally we can express our residual sum of squares as
\[SSE=\epsilon'(I-H)\epsilon\]
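Since this last expression involves the true, unobservable errors, we can only check it on simulated data where we know \(\epsilon\). A sketch in R, with made-up coefficients and noise level:

```r
set.seed(3)
n   <- 30
x   <- runif(n)
eps <- rnorm(n, 0, 0.5)      # true errors, known only because we simulated them
Y   <- 1 + 2 * x + eps       # made-up true coefficients

X <- cbind(1, x)
H <- X %*% solve(t(X) %*% X) %*% t(X)
eps_hat <- Y - H %*% Y

t(eps_hat) %*% eps_hat             # SSE from the residuals
t(eps) %*% (diag(n) - H) %*% eps   # the same value: eps'(I - H)eps
```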
To work on this further we’ll need to understand spectral decomposition
If you need a refresher on this or any of the other linear algebra discussed here, 3Blue1Brown’s YouTube series Essence of Linear Algebra is an excellent resource. Chapter 14 covers eigenvectors and eigenvalues. I’ll give a short review below.
For a matrix A, the eigenvectors (\(\vec{e}\), which must be nonzero) and eigenvalues (\(\lambda\)) are any values which make the following true
\[A\vec{e}=\lambda\vec{e}\]
You can express any symmetric square matrix M as a weighted sum of outer products of its orthonormal eigenvectors
\[M=\sum_{i=1}^n \lambda_i e_ie_i'\]
Where \(\lambda_i\) are eigenvalues and \(e_i\) are eigenvectors of M. Because our basis vectors are orthonormal, we know that \(e_i'e_j=0\) for \(i\neq j\) and \(e_i'e_i=1\)
Now if M is both symmetric and idempotent (like (I-H)), then all of its eigenvalues will be either 0 or 1, and the number of eigenvalues equal to 1 is given by the trace of M.
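We can see this numerically for an \(I-H\) built from simulated data (a sketch in R; the design matrix is made up):

```r
set.seed(3)
n <- 30
x <- runif(n)
X <- cbind(1, x)
H <- X %*% solve(t(X) %*% X) %*% t(X)
M <- diag(n) - H   # symmetric and idempotent

ev <- eigen(M, symmetric = TRUE)$values
table(round(ev))   # every eigenvalue is (numerically) 0 or 1
sum(diag(M))       # the trace matches the count of 1's: n - (k + 1) = 28
```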
Let’s apply this to the residual sum of squares
\[SSE=\epsilon'(I-H)\epsilon\]
We know that (I-H) is a symmetric square matrix
\[SSE=\epsilon'(\sum_{i=1}^n\lambda_ie_ie_i')\epsilon\]
If we can calculate the trace of (I-H), we will know how many of the eigenvalues are 1, and therefore which terms survive in our sum
\[tr(I-H)=tr(I)-tr(H)\]
We know that the trace of I is n, so let’s find the trace of H
\[tr(H)=tr(X(X'X)^{-1}X')\]
Using the identity \(tr(AB)=tr(BA)\)
\[tr(H)=tr((X'X)^{-1}X'X)\]
This will be equivalent to an identity matrix, but what is the shape of that matrix? Since X has shape \(n\times(k+1)\), \(X'X\) will have shape \((k+1)\times(k+1)\), and \((X'X)^{-1}\) will also have shape \((k+1)\times(k+1)\). Therefore our identity matrix will have shape \((k+1)\times(k+1)\)
\[tr(H)=tr(I_{k+1})=k+1\]
\[tr(I-H)=n-(k+1)\]
So \(n-(k+1)\) of our eigenvalues will be 1, and the remainder will be zero. Reordering the eigenvectors so that the ones with \(\lambda_i=1\) come first, we can reflect that in the sum
\[SSE=\epsilon'(\sum_{i=1}^{n-(k+1)}e_ie_i')\epsilon\]
We can bring our \(\epsilon\)’s inside the sum
\[SSE=\sum_{i=1}^{n-(k+1)}\epsilon'e_ie_i'\epsilon\]
We can simplify this using the identity \((AB)'=B'A'\); think of \(\epsilon'e_i\) as \(B'A'\), and note that \(e_i'\epsilon\) is a scalar
\[\epsilon'e_ie_i'\epsilon=(e_i'\epsilon)'(e_i'\epsilon)=(e_i'\epsilon)^2\]
\[SSE=\sum_{i=1}^{n-(k+1)}(e_i'\epsilon)^2\]
Now let’s let \(v_i=e_i'\epsilon\). Remember that the \(e_i\) are orthonormal basis vectors, and recall our assumption that \(\epsilon\) is normally distributed with mean 0 and constant variance \(\sigma^2\). This assumption is why we wanted to get our residual sum of squares in terms of \(\epsilon\) in the first place: it tells us that each \(v_i\) is normally distributed with mean 0 and variance \(\sigma^2\), and because the \(e_i\) are orthonormal, the \(v_i\) are independent!
\[SSE=\sum_{i=1}^{n-(k+1)}v_i^2\]
To turn each \(v_i\) into a standard normal (Z) variable, we divide both sides by \(\sigma^2\)
\[\frac{SSE}{\sigma^2}=\sum_{i=1}^{n-(k+1)}(\frac{v_i}{\sigma})^2\]
\[\frac{SSE}{\sigma^2}=\sum_{i=1}^{n-(k+1)}z_i^2\]
Now the sum from i=1 to n-(k+1) of independent \(z^2\)s has a \(\chi^2\) distribution with n-(k+1) degrees of freedom.
\[\frac{SSE}{\sigma^2}\sim\chi^2_{n-(k+1)}\]
To get an estimator for \(\sigma^2\), which we’ll call \(s^2\), we take the expected value of both sides. The expected value of a \(\chi^2\) random variable is its degrees of freedom, so
\[E\left(\frac{SSE}{\sigma^2}\right)=n-(k+1)\]
\[E(SSE)=\sigma^2(n-(k+1))\]
which motivates the unbiased estimator
\[s^2=\frac{SSE}{n-(k+1)}\]
There we have it! And since we have a way of calculating SSE in terms of observables we can easily find \(s^2\).
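Finally, a quick check in R (reusing the simulated data from the earlier sketch) that this matches the residual variance reported by lm:

```r
set.seed(3)
n <- 30
k <- 1
x <- runif(n)
Y <- 1 + 2 * x + rnorm(n, 0, 0.5)

X   <- cbind(1, x)
H   <- X %*% solve(t(X) %*% X) %*% t(X)
SSE <- as.numeric(t(Y) %*% (diag(n) - H) %*% Y)

SSE / (n - (k + 1))          # s^2, our estimate of the variance
summary(lm(Y ~ x))$sigma^2   # the same value, as reported by lm
```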