Basic Linear Regression
This post serves as an introduction to econometric theory. Unlike other resources available on the web, I would like to go through all the basic concepts with the R software. All of the linear/multiple regression code will be reinvented by yourself, rather than using the built-in function lm(). This approach will familiarize you with R, as well as help you understand the theory in a more concrete way.
What is Linear Regression?
The following linear equation defines a linear relationship between the variable \( y \) and the variables \( x_1, x_2, x_3 \):
\[ y = 12 + 6x_1 + 4x_2 + 9x_3 \]
From the above equation, we can compute the value of \( y \) for any given set \( x=\{x_1, x_2, x_3\} \). For example, if \( x_1 = 10, x_2=20, x_3 = 30 \), then
\( y = 12 + 6\times10 + 4\times20 + 9\times30 = 422 \)
Let's define the linear equation in R code:
equation01 <- function(x) {
    b = c(12, 6, 4, 9)
    x = c(1, x)  # prepend 1 for the constant term
    result = t(x) %*% b
    as.numeric(result)  # convert the datatype from matrix to numeric
}
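As a quick check, the example worked out above evaluates as expected (the function is restated here so the snippet runs on its own):

```r
# Same linear equation as above, restated so this snippet is self-contained
equation01 <- function(x) {
    b = c(12, 6, 4, 9)
    x = c(1, x)  # prepend 1 for the constant term
    as.numeric(t(x) %*% b)  # convert from 1x1 matrix to numeric
}
equation01(c(10, 20, 30))  # returns 422
```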
Next, let's generate 100 observations of \( y \) by setting different values of the vector \( x=\{x_1, x_2, x_3\} \). For simplicity, let's limit the values of \( x \) to between 1 and 10.
y = rep(0, 100)
x = matrix(sample(1:10, 300, replace = T), nrow = 100, ncol = 3, byrow = T)
for (i in 1:100) {
    y[i] = equation01(x[i, ])
}
Now, the population data for the linear equation are:
df = data.frame(y = y, x1 = x[, 1], x2 = x[, 2], x3 = x[, 3])
df
## y x1 x2 x3
## 1 161 8 5 9
## 2 118 2 10 6
## 3 120 3 9 6
## 4 107 2 5 7
## 5 98 2 5 6
## 6 118 5 1 8
## 7 81 5 3 3
## 8 73 3 4 3
## 9 102 10 3 2
## 10 131 7 8 5
## 11 172 5 10 10
## 12 101 7 5 3
## 13 58 1 1 4
## 14 102 10 3 2
## 15 120 9 9 2
## 16 106 8 7 2
## 17 155 7 5 9
## 18 186 8 9 10
## 19 127 10 7 3
## 20 87 5 9 1
## 21 63 1 9 1
## 22 109 6 4 5
## 23 152 7 2 10
## 24 185 10 8 9
## 25 113 5 2 7
## 26 77 6 5 1
## 27 172 7 7 10
## 28 123 8 9 3
## 29 116 3 8 6
## 30 153 8 3 9
## 31 71 7 2 1
## 32 139 4 10 7
## 33 157 7 10 7
## 34 173 8 8 9
## 35 133 6 10 5
## 36 102 4 3 6
## 37 121 3 7 7
## 38 116 8 5 4
## 39 119 10 5 3
## 40 141 9 3 7
## 41 156 9 9 6
## 42 131 9 5 5
## 43 57 1 3 3
## 44 139 5 4 9
## 45 61 2 7 1
## 46 113 4 8 5
## 47 186 8 9 10
## 48 120 2 6 8
## 49 174 6 9 10
## 50 126 2 3 10
## 51 141 8 9 5
## 52 137 9 2 7
## 53 134 3 8 8
## 54 120 9 9 2
## 55 124 3 10 6
## 56 135 8 3 7
## 57 52 1 4 2
## 58 77 1 8 3
## 59 114 7 6 4
## 60 166 10 10 6
## 61 108 3 6 6
## 62 107 2 5 7
## 63 129 8 6 5
## 64 150 2 9 10
## 65 122 4 8 6
## 66 137 2 8 9
## 67 176 9 5 10
## 68 124 3 1 10
## 69 163 10 7 7
## 70 64 2 1 4
## 71 104 8 2 4
## 72 141 9 3 7
## 73 100 5 1 6
## 74 79 9 1 1
## 75 67 3 7 1
## 76 139 1 10 9
## 77 65 1 5 3
## 78 152 3 8 10
## 79 180 9 6 10
## 80 116 1 2 10
## 81 72 3 6 2
## 82 61 3 1 3
## 83 55 1 7 1
## 84 184 10 10 8
## 85 125 6 8 5
## 86 99 2 3 7
## 87 120 2 6 8
## 88 92 4 5 4
## 89 118 9 4 4
## 90 150 2 9 10
## 91 142 10 4 6
## 92 150 9 3 8
## 93 165 9 9 7
## 94 135 3 6 9
## 95 178 8 7 10
## 96 123 1 6 9
## 97 60 1 6 2
## 98 126 6 6 6
## 99 105 7 6 3
## 100 115 9 10 1
Estimating the coefficients
Now, suppose that we do NOT know the parameters of the linear equation; we only have the 100 observations of \( y \) and the \( x \)s. Is there any method for us to estimate the coefficients? Yes, the simplest and easiest method is Ordinary Least Squares.
Ordinary Least Squares (OLS)
Suppose \( y \) and \( x_{input} \) have a linear relationship between them, and the real linear coefficients form a vector \( \beta \). Here \( x_{input} \) is a \( k \)-dimensional column vector \( x_{input}=(x_1, x_2, \cdots, x_k)' \). The linear combination of \( x \) for arbitrary coefficients \( \beta = (\beta_0, \beta_1, \beta_2, \cdots, \beta_k )' \) is:
\[ \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \]
where \( \beta_0 \) represents a constant, and the dimension of vector \( \beta \) is \( k+1 \). The relationship can be represented in the following matrix form:
\[ x = (1, x_{input}') \]
\[ y = x' \beta \]
To estimate the true coefficients \( \beta \), we define a vector \( \hat{\beta} = (\hat{\beta_0}, \hat{\beta_1}, \cdots, \hat{\beta_k})' \)
For the \( i \)-th observation, the estimation error \( e_i \) (the residual) is defined as:
\[ e_i = y_i - x_i ' \hat{\beta} \]
A good estimate \( \hat{\beta} \) should yield a minimum value of the squared estimation errors (the residual sum of squares) over all observations. OLS estimates the linear coefficients by minimizing the following objective function:
\[ S(\hat{\beta}) = \sum_{i=1}^{N}{e_{i}^2} \]
\[ S(\hat{\beta}) = \sum_{i=1}^N{(y_i - x_i ' \hat{\beta})^2} \]
where \( N \) is the total number of observations.
The minimization can be done by differentiating the objective function \( S(\hat{\beta}) \) with respect to \( \beta \):
\[ -2 \sum_{i=1}^N x_i(y_i - x_i ' \hat{\beta}) = 0 \]
or
\[ (\sum_{i=1}^N x_i x_i ') \hat{\beta} = \sum_{i=1}^N x_i y_i \]
\[ \hat{\beta} = (\sum_{i=1}^N x_i x_i ')^{-1} \sum_{i=1}^N x_i y_i \]
The above equation has a unique solution when the \( (K+1) \times (K+1) \) matrix \( \sum_{i=1}^N x_i x_i ' \) is invertible, i.e. when no perfect multicollinearity exists in the data \( x \) and the matrix has full rank.
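As an illustrative sketch, the summation form of the normal equations can be solved directly in R (the data are regenerated as in the text; the seed is an arbitrary choice):

```r
# Sketch: solve the normal equations in summation form
set.seed(1)
x <- matrix(sample(1:10, 300, replace = TRUE), ncol = 3)
y <- 12 + 6 * x[, 1] + 4 * x[, 2] + 9 * x[, 3]
X <- cbind(1, x)                  # prepend the constant term
XtX <- matrix(0, ncol(X), ncol(X))
Xty <- rep(0, ncol(X))
for (i in 1:nrow(X)) {
    xi  <- X[i, ]
    XtX <- XtX + xi %*% t(xi)     # accumulate the sum of x_i x_i'
    Xty <- Xty + xi * y[i]        # accumulate the sum of x_i y_i
}
solve(XtX, Xty)                   # recovers (12, 6, 4, 9)
```

Since the data here are noise-free, the solution reproduces the true coefficients exactly.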
A Complete Example Using Matrix Notation
For a linear system, the \( N \times (K+1) \) matrix \( X \) represents the input data, with \( N \) observations and \( K \) explanatory (independent) variables.
\[ X = \left( \begin{array}{cccc} 1 & x_{12} & \cdots & x_{1K} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{N2} & \cdots & x_{NK} \end{array} \right) = \left( \begin{array}{c} x_1 ' \\ \vdots \\ x_N ' \end{array} \right) , \quad y = \left( \begin{array}{c} y_1 \\ \vdots \\ y_N \end{array} \right) \]
The \( N \)-dimensional column vector \( e \) contains the residuals of the linear regression.
\[ e = y - X\hat{\beta} \]
The objective function in matrix notation is:
\[ S(\hat{\beta}) = e'e \] \[ S(\hat{\beta})= (y - X\hat{\beta})' (y - X\hat{\beta}) = y'y - 2y'X\hat{\beta} + \hat{\beta}'X'X\hat{\beta} \]
Differentiating the objective function with respect to \( \hat{\beta} \) yields:
\[ \frac{\partial S(\hat{\beta})}{\partial \hat{\beta}} = -2(X'y - X'X\hat{\beta}) = 0 \]
\[ \hat{\beta} = (X'X)^{-1}X'y \]
Let's build a function that carries out linear regression in R.
lr <- function(X, y, method = "OLS", intercept = T) {
    if (intercept == T) {
        # add a column of ones for the intercept term
        X = cbind(1, X)
    }
    beta = solve(t(X) %*% X) %*% t(X) %*% y
    beta
}
Now we run a test of the lr() function. The data “df” we have was generated by the linear equation \( y = 12 + 6x_1 + 4x_2 + 9x_3 \) with randomly generated values of \( x_1, x_2 \) and \( x_3 \):
head(df)
## y x1 x2 x3
## 1 161 8 5 9
## 2 118 2 10 6
## 3 120 3 9 6
## 4 107 2 5 7
## 5 98 2 5 6
## 6 118 5 1 8
The true values of \( \beta \) are \( \beta = (12, 6, 4, 9) \). Suppose that we do not know the underlying relationship between \( y \) and \( x \); we only have the data “df”, and we believe that the relationship between the variables \( y \) and \( x \) is linear.
\[ y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 +\beta_3 x_3 \]
The coefficients \( \hat{\beta} \) estimated by OLS are:
X = as.matrix(df[, 2:4])
y = df[, 1]
dimnames(X) <- list(NULL, NULL)
lr(X = X, y = y)
## [,1]
## [1,] 12
## [2,] 6
## [3,] 4
## [4,] 9
The true \( \beta \) and the estimated \( \hat{\beta} \) are the same.
Linear Regression with a Stochastic Component
In real life, it is hard to find a perfect linear system whose output follows a linear relationship exactly, without errors. For example, you may think that an individual's wage has a linear relationship with factors such as education level and age. However, even for two people with the same education level and age, the wages are hardly ever the same. The differences may arise from different job natures, different years of service, etc. This is referred to as the omitted variables problem. Further, even when all the factors are the same, there may be a random error in the model that cannot be explained, caused for instance by sampling errors or measurement errors. Therefore, a stochastic component \( \epsilon \) should be added to the linear equation:
\[ y = X \beta + \epsilon \]
Sometimes the stochastic component \( \epsilon \) is referred to as the unobservable error term. The unobservable error term should be independent of every explanatory variable \( x \), i.e. \( E\{\epsilon | X\} = 0 \). The variables \( x \) are then considered exogenous.
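To see the effect of the error term, here is a minimal sketch (the noise level sd = 2 and the seed are arbitrary choices) that adds Gaussian noise to the data and re-runs the OLS formula:

```r
# Sketch: OLS on data with a stochastic component
set.seed(42)
N <- 100
X <- cbind(1, matrix(sample(1:10, N * 3, replace = TRUE), ncol = 3))
beta <- c(12, 6, 4, 9)
eps <- rnorm(N, mean = 0, sd = 2)  # stochastic component epsilon
y <- X %*% beta + eps
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat  # close to, but no longer exactly, (12, 6, 4, 9)
```

With noise present, the estimates only approximate the true coefficients; the approximation improves as \( N \) grows.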
Gauss-Markov Assumptions
OLS is a good estimator if and only if the following assumptions hold:
1. \( E\{\epsilon_i\} = 0, i=1,\cdots,N \)
2. \( x_i \) and \( \epsilon_i \) are independent
3. \( V\{\epsilon_i\} = \sigma^2 \) (homoskedasticity)
4. \( cov\{\epsilon_i, \epsilon_j\} = 0, i \neq j \)
The above four assumptions are called the Gauss-Markov conditions.
How Well Does OLS Do Its Job?
A good estimator should be unbiased and consistent (my old post). Let us examine whether OLS provides a good estimate or not.
1. Unbiasedness
Unbiasedness means that the expected value of the estimator \( \hat{\beta} \) equals the true value \( \beta \) in repeated sampling.
In OLS, \( \hat{\beta} \) is estimated as:
\[ \hat{\beta} = (X'X)^{-1}X'y \]
The expected value of \( \hat{\beta} \): \[ E\{\hat{\beta}\} = E\{(X'X)^{-1}X'y \} \] \[ E\{\hat{\beta}\} = E\{(X'X)^{-1}X' (X \beta + \epsilon) \} \] \[ E\{\hat{\beta}\} = E\{(X'X)^{-1}X'X \beta + (X'X)^{-1}X'\epsilon \} \] \[ E\{\hat{\beta}\} = E\{ \beta + (X'X)^{-1}X'\epsilon \} \] (from assumption 2) \[ E\{\hat{\beta}\} = \beta + E\{ (X'X)^{-1}X'\} E\{\epsilon \} \] (from assumption 1, \( E\{\epsilon\}=0 \)) \[ E\{\hat{\beta}\} = \beta \]
The variance of \( \hat{\beta} \):
\[ V\{\hat{\beta}\} = E\{(\hat{\beta} - \beta)(\hat{\beta} - \beta)'\} \] \[ V\{\hat{\beta}\} = E\{(X'X)^{-1}X'\epsilon \epsilon'X(X'X)^{-1}\} \] (from assumptions 3 and 4) \[ V\{\hat{\beta}\} = (X'X)^{-1}X'\sigma^2 I_N X(X'X)^{-1} \] \[ V\{\hat{\beta}\} = \sigma^2 (X'X)^{-1} \]
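A quick Monte-Carlo sketch illustrates that the sampling variance of \( \hat{\beta} \) matches \( \sigma^2 (X'X)^{-1} \) (the design matrix, \( \sigma = 2 \), the seed, and the 5000 replications are all arbitrary choices):

```r
# Monte-Carlo check of V{beta_hat} = sigma^2 (X'X)^{-1}
set.seed(7)
N <- 100
sigma <- 2
X <- cbind(1, matrix(sample(1:10, N * 3, replace = TRUE), ncol = 3))
beta <- c(12, 6, 4, 9)
draws <- replicate(5000, {
    y <- X %*% beta + rnorm(N, sd = sigma)  # new error draw each time
    drop(solve(t(X) %*% X, t(X) %*% y))     # OLS estimate
})
apply(draws, 1, var)               # simulated variances of beta_hat
diag(sigma^2 * solve(t(X) %*% X))  # theoretical variances
```

The two sets of variances should agree up to simulation noise.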
So, next, what is the value of \( \sigma^2 \)? An unbiased estimator of it is the sample variance of the residuals \( e_i = y_i - x_i ' \hat{\beta} \):
\[ s^2 = \frac{1}{N-(K+1)} \sum_{i=1}^{N} e_{i}^2 \]
If the regression contains no intercept, the degrees-of-freedom adjustment should be \( N-K \) instead of \( N-(K+1) \).
The estimated variance of an element \( \hat{\beta_k} \) is given by \( s^2 c_{kk} \), where \( c_{kk} \) is the \( (k,k) \) element of the matrix \( (X'X)^{-1} \). The square root of the estimated variance is called the standard error of \( \hat{\beta_k} \), denoted \( se(\hat{\beta_k}) \).
Now, let's modify our lr() function in R to incorporate the variance and standard error measures:
lr <- function(X, y, method = "OLS", intercept = T) {
    X = as.matrix(X)
    K = ncol(X)
    N = nrow(X)
    result = list()
    if (intercept == T) {
        # add a column of ones and adjust the parameter count
        K = K + 1
        X = cbind(1, X)
    }
    beta = solve(t(X) %*% X) %*% t(X) %*% y
    # calculate the residuals
    resid = y - X %*% beta
    # calculate the variance of the residuals
    residvar = (t(resid) %*% resid)/(N - K)
    # convert it from a 1x1 matrix to a numeric value
    residvar = residvar[1, 1]
    betacovar = solve(t(X) %*% X) * residvar
    # assign values to the list result and return it
    result$beta$val = beta
    result$resid$val = resid
    result$resid$var = residvar
    result$beta$covar = betacovar
    result$beta$var = diag(betacovar)
    result$beta$se = sqrt(diag(betacovar))
    result$hasIntercept = intercept
    result
}
lr.summary <- function(result) {
    displayAdjust = 0
    K = length(result$beta$val)
    cat("Regression Results:\n\n")
    cat("Residuals Summary:\n")
    print(summary(as.data.frame(result$resid$val)))
    cat("\n")
    cat("Coefficients:\n")
    cat(" \t Values \t \t Std. Error \t \n")
    for (i in 1:K) {
        if (result$hasIntercept && (i == 1)) {
            text = sprintf("(Intercept)\t \t %-.6f \t \t %-.6f \t \n",
                result$beta$val[1], result$beta$se[1])
            cat(text)
            displayAdjust = 1
        } else {
            text = sprintf("beta%-5d \t \t %-.6f \t \t %-.6f \t \n",
                i - displayAdjust, result$beta$val[i], result$beta$se[i])
            cat(text)
        }
    }
}
Let's try out the modified version of the lr() function:
results = lr(X, y)
lr.summary(results)
## Regression Results:
##
## Residuals Summary:
## V1
## Min. :-8.53e-14
## 1st Qu.: 2.49e-14
## Median : 5.68e-14
## Mean : 5.90e-14
## 3rd Qu.: 9.95e-14
## Max. : 2.02e-13
##
## Coefficients:
## Values Std. Error
## (Intercept) 12.000000 0.000000
## beta1 6.000000 0.000000
## beta2 4.000000 0.000000
## beta3 9.000000 0.000000
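As a sanity check (regenerating noise-free data with an arbitrary seed, since the original draws above are random), the same estimates can be obtained from R's built-in lm():

```r
# Sanity check against the built-in lm()
set.seed(1)
x <- matrix(sample(1:10, 300, replace = TRUE), ncol = 3)
df <- data.frame(y = 12 + 6 * x[, 1] + 4 * x[, 2] + 9 * x[, 3],
                 x1 = x[, 1], x2 = x[, 2], x3 = x[, 3])
fit <- lm(y ~ x1 + x2 + x3, data = df)
coef(fit)  # recovers (12, 6, 4, 9)
```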
I think that's enough for today. Let's grab a coffee and take a break!