## Wednesday, 4 September 2013

### Linear Regression

The linear regression model is a fairly simple model, nevertheless it is used in many applications.

What follows is a fairly mathematical derivation of the linear regression parameters.

Let us denote the observations as $y[n]$ and $x[n]$ for $n=0,1,..,N-1$ for the model

$y[n] = ax[n]+b$

We assume that we can measure both $x[n]$ and $y[n]$ - the unknowns are $a$ and $b$, the gradient and intercept respectively. We can use the least squares error criterion to find estimates for $a$ and $b$, from data $x[n]$ and $y[n]$.

Let  $e[n] = y[n] - ax[n]-b$ be the error. We can represent the samples $e[n]$, $x[n]$ and $y[n]$ as length $N$ column vectors $\underline{e}$,$\underline{x}$ and $\underline{y}$ respectively.

Thus the error sum of squares (which can be regarded as a cost function $C$ to minimise over) will be

$\underline{e}^T\underline{e}=(\underline{y}-a\underline{x}-b\underline{1})^T(\underline{y}-a\underline{x}-b\underline{1}$).

where $\underline{1}$ is a column vector with all elements set to 1.

Expanding this cost function results in:-

$C=\underline{y}^T\underline{y}-2a\underline{x}^T\underline{y}-2b\underline{y}^T\underline{1}+a^2(\underline{x}^T\underline{x})+2ab(\underline{x}^T\underline{1})+b^2(\underline{1}^T\underline{1})$

Differentiating $C$ w.r.t. to $a$ and $b$:-

$\frac{\partial C}{\partial a}=2a(\underline{x}^T\underline{x})+2b(\underline{x}^T\underline{1})-2\underline{x}^T\underline{y}$

$\frac{\partial C}{\partial b}=2a(\underline{x}^T\underline{1})+2b(\underline{1}^T\underline{1})-2\underline{y}^T\underline{1}$

In order to estimate $\hat{a}$ and $\hat{b}$ (the least squares estimates of $a$ and $a$ respectively) we need to set the above two partial differential equations to 0, and solve for $a$ and $b$. We need to solve the following matrix equation

$\left[\begin{array}{cc}\underline{x}^T\underline{x}&\underline{x}^T\underline{1}\\ \underline{x}^T\underline{1}&\underline{1}^T\underline{1}\end{array}\right]$$\left[\begin{array}{c}\hat{a}\\ \hat{b}\end{array}\right] = \left[\begin{array}{c}\underline{x}^T\underline{y} \\ \underline{y}^T\underline{1} \end{array}\right] Inverting the matrix and pre multiplying both sides of the above matrix equation by the inverse results in: \left[\begin{array}{c}\hat{a} \\ \hat{b} \end{array}\right] = \frac{1}{N(\underline{x}^T\underline{x})-(\underline{x}^T\underline{1})^2}$$\left[\begin{array}{cc}\underline{1}^T\underline{1}&-\underline{x}^T\underline{1}\\ -\underline{x}^T\underline{1}&\underline{x}^T\underline{x}\end{array}\right]$$\left[\begin{array}{c}\underline{x}^T\underline{y} \\ \underline{y}^T\underline{1} \end{array}\right]$

If we revert to a more explicit notation in terms of the samples, we have

$\hat{a}=\frac{1}{N(\sum_{n=0}^{N-1}(x[n])^2)-(\sum_{n=0}^{N-1}x[n])^2}[N(\sum_{n=0}^{N-1}x[n]y[n])-(\sum_{n=0}^{N-1}y[n])(\sum_{n=0}^{N-1}x[n])]$

$\hat{b}=\frac{1}{N(\sum_{n=0}^{N-1}(x[n])^2)-(\sum_{n=0}^{N-1}x[n])^2}[(\sum_{n=0}^{N-1}(x[n])^2)(\sum_{n=0}^{N-1}y[n])-(\sum_{n=0}^{N-1}x[n])(\sum_{n=0}^{N-1}x[n]y[n])]$

where the denominator of the fraction in the above two equations is the matrix determinant.

Coefficient of Determination

Once we have calculated $\hat{a}$ and $\hat{b}$, we can see how well the linear model accounts for the data. This is achieved by finding the Coefficient of Determination ($r^2$). This is the ratio of the residual sum of squares ($rss$) to the total sum of squares ($tss$),

$\Large r^2=\frac{rss}{tss}$

and has a maximum value of 1, which is attained when the model is exactly linear. The larger $r^2$ is, the better the linear model fits the data. We can think of $r^2$ as the proportion of variation in data explained by the linear model, or the "goodness of fit" of the linear model.

The total sum of squares is given by

$tss=\sum_{n=0}^{N-1}(y[n]-\bar{y})^2$

where

$\bar{y}=\frac{1}{N}\sum_{n=0}^{N-1}y[n]$ is the mean of $y$.

The residual sum of squares ($rss$) is the total sum of squares ($tss$) minus the sum of squared errors ($sse$)

where

$sse=\sum_{n=0}^{N-1}(y[n]-\hat{a}x[n]-\hat{b})^2$

If the data perfectly fits the linear model, $sse$ will equal 0, so that $tss$ will equal $rss$ - the coefficient of determination $r^2$ will be $1$.

Summarising, we have

$\Large r^2=\frac{\sum_{n=0}^{N-1}(y[n]-\bar{y})^2-\sum_{n=0}^{N-1}(y[n]-\hat{a}x[n]-\hat{b})^2}{\sum_{n=0}^{N-1}(y[n]-\bar{y})^2}$

A worked example

Now, let us consider the following example:-

$x=[1,2,3,4]$, $y=[1.5,6,7,8]$

The matrix determinant will be $(4\times (1 + 2^2+3^2+4^2))-(1+2+3+4)^2=20$.

The least squares estimate of $a$ is

$\hat{a}=[4\times((1\times 1.5) + (2\times 6) + (3\times 7) + (4\times 8)) \\ -( (1.5+6+7+8)\times(1+2+3+4))]/20 \\ =2.05$

The least squares estimate of $b$ is

$\hat{b}=[(1+2^2+3^2+4^2)\times(1.5+6+7+8)-((1+2+3+4)\times((1\times 1.5) \\+ (2\times 6) + (3\times 7) +(4\times 8)))]/20 \\ = (675 - 665)/20 \\ =0.5$

The $tss$ is

$tss=(1.5-5.625)^2+(6-5.625)^2+(7-5.625)^2+(8-5.625)^2=24.688$

where the average of $y$ is $5.625$.

The $sse$ is

$(1.5 - (2.05\times 1) - 0.5)^2+(6 - (2.05\times 2) - 0.5)^2\\+(7 - (2.05\times 3) - 0.5)^2+(8 - (2.05\times 4) - 0.5)^2=3.675$

So that the $rss$ is $24.688-3.675=21.013$.

The coefficient of determination is

$r^2=21.013/24.688=0.8511$

This result is quite high, and is fairly indicative of a linear model.

There is a Linear Regression Calculator in this blog, which can be found here