The linear regression model is fairly simple; nevertheless, it is used in many applications.
What follows is a fairly mathematical derivation of the linear regression parameters.
Let us denote the observations as $y[n]$ and $x[n]$ for $n=0,1,\ldots,N-1$ for the model
$y[n] = ax[n]+b$
We assume that we can measure both $x[n]$ and $y[n]$; the unknowns are $a$ and $b$, the gradient and intercept respectively. We can use the least squares error criterion to find estimates of $a$ and $b$ from the data $x[n]$ and $y[n]$.
Let $e[n] = y[n] - ax[n]-b$ be the error. We can represent the samples $e[n]$, $x[n]$ and $y[n]$ as length $N$ column vectors $\underline{e}$,$\underline{x}$ and $\underline{y}$ respectively.
Thus the error sum of squares (which can be regarded as a cost function $C$ to minimise over) will be
$\underline{e}^T\underline{e}=(\underline{y}-a\underline{x}-b\underline{1})^T(\underline{y}-a\underline{x}-b\underline{1})$,
where $\underline{1}$ is a column vector with all elements set to 1.
Expanding this cost function results in:-
$C=\underline{y}^T\underline{y}-2a\underline{x}^T\underline{y}-2b\underline{y}^T\underline{1}+a^2(\underline{x}^T\underline{x})+2ab(\underline{x}^T\underline{1})+b^2(\underline{1}^T\underline{1})$
Differentiating $C$ with respect to $a$ and $b$:-
$\frac{\partial C}{\partial a}=2a(\underline{x}^T\underline{x})+2b(\underline{x}^T\underline{1})-2\underline{x}^T\underline{y}$
$\frac{\partial C}{\partial b}=2a(\underline{x}^T\underline{1})+2b(\underline{1}^T\underline{1})-2\underline{y}^T\underline{1}$
To obtain $\hat{a}$ and $\hat{b}$ (the least squares estimates of $a$ and $b$ respectively) we set the above two partial derivatives to zero and solve for $a$ and $b$. That is, we need to solve the following matrix equation
$\left[\begin{array}{cc}\underline{x}^T\underline{x}&\underline{x}^T\underline{1}\\ \underline{x}^T\underline{1}&\underline{1}^T\underline{1}\end{array}\right]$$\left[\begin{array}{c}\hat{a}\\ \hat{b}\end{array}\right]$ = $\left[\begin{array}{c}\underline{x}^T\underline{y} \\ \underline{y}^T\underline{1} \end{array}\right]$
Inverting the matrix and pre-multiplying both sides of the above matrix equation by the inverse (noting that $\underline{1}^T\underline{1}=N$) results in:
$\left[\begin{array}{c}\hat{a} \\ \hat{b} \end{array}\right]$ = $\frac{1}{N(\underline{x}^T\underline{x})-(\underline{x}^T\underline{1})^2}$$\left[\begin{array}{cc}\underline{1}^T\underline{1}&-\underline{x}^T\underline{1}\\ -\underline{x}^T\underline{1}&\underline{x}^T\underline{x}\end{array}\right]$$\left[\begin{array}{c}\underline{x}^T\underline{y} \\ \underline{y}^T\underline{1} \end{array}\right]$
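As an aside, the $2\times 2$ system above can also be solved numerically rather than by hand. Below is a minimal Python/NumPy sketch, assuming NumPy is available; the function name fit_line_normal_equations is just an illustrative choice, not part of the derivation.

```python
import numpy as np

def fit_line_normal_equations(x, y):
    """Fit y = a*x + b by forming and solving the 2x2 normal equations above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ones = np.ones_like(x)
    # Left-hand matrix [[x'x, x'1], [x'1, 1'1]] and right-hand side [x'y, y'1]
    A = np.array([[x @ x, x @ ones],
                  [x @ ones, ones @ ones]])
    rhs = np.array([x @ y, y @ ones])
    # Solving the linear system is numerically preferable to forming the inverse explicitly
    a_hat, b_hat = np.linalg.solve(A, rhs)
    return a_hat, b_hat
```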
If we revert to a more explicit notation in terms of the samples, we have
$\hat{a}=\frac{1}{N(\sum_{n=0}^{N-1}(x[n])^2)-(\sum_{n=0}^{N-1}x[n])^2}[N(\sum_{n=0}^{N-1}x[n]y[n])-(\sum_{n=0}^{N-1}y[n])(\sum_{n=0}^{N-1}x[n])]$
$\hat{b}=\frac{1}{N(\sum_{n=0}^{N-1}(x[n])^2)-(\sum_{n=0}^{N-1}x[n])^2}[(\sum_{n=0}^{N-1}(x[n])^2)(\sum_{n=0}^{N-1}y[n])-(\sum_{n=0}^{N-1}x[n])(\sum_{n=0}^{N-1}x[n]y[n])]$
where the denominator of the fraction in the above two equations is the matrix determinant.
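These sample-level formulas translate directly into code. The following is a short sketch in plain Python, with no dependencies; the helper name fit_line_closed_form is illustrative rather than part of the derivation.

```python
def fit_line_closed_form(x, y):
    """Closed-form least squares estimates of the gradient a and intercept b."""
    N = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    det = N * sum_xx - sum_x ** 2            # the matrix determinant
    a_hat = (N * sum_xy - sum_y * sum_x) / det
    b_hat = (sum_xx * sum_y - sum_x * sum_xy) / det
    return a_hat, b_hat
```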
Coefficient of Determination
Once we have calculated $\hat{a}$ and $\hat{b}$, we can see how well the linear model accounts for the data. This is achieved by finding the Coefficient of Determination ($r^2$). This is the ratio of the residual sum of squares ($rss$) to the total sum of squares ($tss$),
$\Large r^2=\frac{rss}{tss}$
and has a maximum value of 1, which is attained when the data lie exactly on a straight line. The larger $r^2$ is, the better the linear model fits the data. We can think of $r^2$ as the proportion of the variation in the data explained by the linear model, or as the "goodness of fit" of the linear model.
The total sum of squares is given by
$tss=\sum_{n=0}^{N-1}(y[n]-\bar{y})^2$
where
$\bar{y}=\frac{1}{N}\sum_{n=0}^{N-1}y[n]$ is the mean of $y$.
The residual sum of squares ($rss$), as used here, is the total sum of squares ($tss$) minus the sum of squared errors ($sse$); in other words, it is the part of the variation that the fitted line accounts for. The sum of squared errors is
$sse=\sum_{n=0}^{N-1}(y[n]-\hat{a}x[n]-\hat{b})^2$
If the data perfectly fits the linear model, $sse$ will equal $0$, so that $rss$ will equal $tss$ and the coefficient of determination $r^2$ will be $1$.
Summarising, we have
$\Large r^2=\frac{\sum_{n=0}^{N-1}(y[n]-\bar{y})^2-\sum_{n=0}^{N-1}(y[n]-\hat{a}x[n]-\hat{b})^2}{\sum_{n=0}^{N-1}(y[n]-\bar{y})^2}$
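In code, the coefficient of determination takes only a few lines. Here is a minimal sketch, assuming the fitted $\hat{a}$ and $\hat{b}$ are already available; the function name is again just illustrative.

```python
def coefficient_of_determination(x, y, a_hat, b_hat):
    """r^2 = (tss - sse) / tss for the fitted line y = a_hat*x + b_hat."""
    N = len(y)
    y_bar = sum(y) / N
    tss = sum((yi - y_bar) ** 2 for yi in y)                           # total sum of squares
    sse = sum((yi - a_hat * xi - b_hat) ** 2 for xi, yi in zip(x, y))  # sum of squared errors
    return (tss - sse) / tss
```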
A worked example
Now, let us consider the following example:-
$x=[1,2,3,4]$, $y=[1.5,6,7,8]$
The matrix determinant will be $(4\times (1^2 + 2^2+3^2+4^2))-(1+2+3+4)^2=(4\times 30)-100=20$.
The least squares estimate of $a$ is
$\hat{a}=[4\times((1\times 1.5) + (2\times 6) + (3\times 7) + (4\times 8)) \\ -( (1.5+6+7+8)\times(1+2+3+4))]/20 \\ = (266 - 225)/20 \\ =2.05$
The least squares estimate of $b$ is
$\hat{b}=[(1^2+2^2+3^2+4^2)\times(1.5+6+7+8)-((1+2+3+4)\times((1\times 1.5) \\+ (2\times 6) + (3\times 7) +(4\times 8)))]/20 \\ = (675 - 665)/20 \\ =0.5$
The $tss$ is
$tss=(1.5-5.625)^2+(6-5.625)^2+(7-5.625)^2+(8-5.625)^2=24.688$
where the average of $y$ is $5.625$.
The $sse$ is
$(1.5 - (2.05\times 1) - 0.5)^2+(6 - (2.05\times 2) - 0.5)^2\\+(7 - (2.05\times 3) - 0.5)^2+(8 - (2.05\times 4) - 0.5)^2=3.675$
So the $rss$ is $24.688-3.675=21.013$.
The coefficient of determination is
$r^2=21.013/24.688=0.8511$
This value is quite high, indicating that a linear model describes the data reasonably well.
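As a quick check, reusing the illustrative helpers sketched earlier reproduces the numbers in this worked example:

```python
x = [1, 2, 3, 4]
y = [1.5, 6, 7, 8]

a_hat, b_hat = fit_line_closed_form(x, y)
r2 = coefficient_of_determination(x, y, a_hat, b_hat)

print(a_hat, b_hat, round(r2, 4))   # 2.05 0.5 0.8511
```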
There is a Linear Regression Calculator in this blog, which can be found here.