Linear Regression - Part 2
Have you had your coffee yet? Okay, let's continue.
2. Goodness-of-fit
To measure how well our linear regression model performs, we compute the coefficient of determination, \( R^2 \):
\[ R^2 = \frac{\hat{V}\{\hat{y}\}} {\hat{V}\{y\}} \]
where \( \hat{V}(\cdot) \) denotes the sample variance:
\[ R^2=\frac{\frac{1}{N-1} \sum_{i=1}^{N}(\hat{y}_i-\bar{y})^2}{\frac{1}{N-1} \sum_{i=1}^{N}(y_i-\bar{y})^2} \]
Define:
\( TSS \) = Total Sum of Squares = \( \sum_{i=1}^{N}(y_i - \bar{y})^2 \)
\( SSR \) = Regression Sum of Squares = \( \sum_{i=1}^{N}(\hat{y}_i - \bar{y})^2 \)
\( SSE \) = Error Sum of Squares = \( \sum_{i=1}^{N}(y_i - \hat{y}_i)^2 \)
With an intercept in the regression, the decomposition \( TSS = SSR + SSE \) holds, so
\[ R^2 = \frac{SSR}{TSS} \]
\[ R^2 = 1 - \frac{SSE}{TSS} \]
\[ R^2 = 1 - \frac{\hat{V}\{e\}}{\hat{V}\{y\}} = 1 - \frac{\frac{1}{N-1} \sum_{i=1}^{N} e_{i}^{2}}{\frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y})^2} \]
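Before wiring this into our function, a quick numerical check is reassuring. The sketch below is purely illustrative (simulated data and R's built-in lm(); x, y.sim and fit are made-up names): it confirms the decomposition \( TSS = SSR + SSE \) and that the different forms of \( R^2 \) agree:
# illustrative sanity check on simulated data
set.seed(42)
n <- 100
x <- rnorm(n)
y.sim <- 2 + 3 * x + rnorm(n)
fit <- lm(y.sim ~ x)
yhat <- fitted(fit)
TSS <- sum((y.sim - mean(y.sim))^2)
SSR <- sum((yhat - mean(y.sim))^2)
SSE <- sum((y.sim - yhat)^2)
all.equal(TSS, SSR + SSE)                         # TRUE: the decomposition holds
c(SSR/TSS, 1 - SSE/TSS, summary(fit)$r.squared)   # all three forms agree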
The \( R^2 \) measure has a drawback: it never decreases when another regressor is added, even one with no explanatory power. To compensate, the adjusted \( R^2 \), written \( \bar{R}^2 \), penalizes the residual variance by the number of estimated coefficients:
\[ \bar{R}^2 = 1 - \frac{\frac{1}{N-K} \sum_{i=1}^{N} e_{i}^{2}}{\frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y})^2} \]
where \( K \) counts all estimated coefficients, including the intercept (this matches the code below).
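To see the penalty at work, here is a small illustrative sketch (simulated data, made-up names): adding a pure-noise regressor never lowers \( R^2 \), but it can lower \( \bar{R}^2 \):
set.seed(1)
n <- 50
x1 <- rnorm(n)
noise <- rnorm(n)                # a regressor with no explanatory power
y.sim <- 1 + 2 * x1 + rnorm(n)
fit1 <- lm(y.sim ~ x1)
fit2 <- lm(y.sim ~ x1 + noise)
c(summary(fit1)$r.squared, summary(fit2)$r.squared)          # R^2 never decreases
c(summary(fit1)$adj.r.squared, summary(fit2)$adj.r.squared)  # adjusted R^2 can drop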
Now, let's modify the regression and summary functions to incorporate the \( R^2 \) and \( \bar{R}^2 \) measures:
lr <- function(X, y, method = "OLS", intercept = TRUE) {
    X = as.matrix(X)
    K = ncol(X)
    N = nrow(X)
    result = list()
    if (intercept) {
        # prepend a column of ones and count the intercept as a coefficient
        K = K + 1
        X = cbind(1, X)
    }
    # OLS estimate: beta = (X'X)^{-1} X'y
    beta = solve(t(X) %*% X) %*% t(X) %*% y
    # calculate the residuals
    resid = y - X %*% beta
    # residual variance, adjusted for the number of coefficients
    residvar = (t(resid) %*% resid)/(N - K)
    # convert the 1x1 matrix to a scalar
    residvar = residvar[1, 1]
    # calculate R-squared and adjusted R-squared; note: despite the names,
    # SSE stores SSE/(N-1) and TSS stores TSS/(N-1), so the ratios below
    # are unchanged
    SSE = (t(resid) %*% resid)/(N - 1)
    SSE = SSE[1, 1]
    adjSSE = residvar
    TSS = t(y - mean(y)) %*% (y - mean(y))/(N - 1)
    TSS = TSS[1, 1]
    Rsquared = 1 - (SSE/TSS)
    adjRsquared = 1 - (adjSSE/TSS)
    betacovar = solve(t(X) %*% X) * residvar
    # assign values to the list result and return it
    result$beta$val = beta
    result$resid$val = resid
    result$resid$var = residvar
    result$beta$covar = betacovar
    result$beta$var = diag(betacovar)
    result$beta$se = sqrt(diag(betacovar))
    result$hasIntercept = intercept
    result$R = Rsquared
    result$adjR = adjRsquared
    result
}
lr.summary <- function(result) {
    displayAdjust = 0
    K = length(result$beta$val)
    cat("Regression Results:\n\n")
    cat("Residuals Summary:\n")
    residsummary = summary(result$resid$val)
    print(residsummary)
    cat("\n")
    cat("Coefficients:\n")
    text = sprintf("%30s%15s\n", "Values", "Std. Error")
    cat(text)
    for (i in 1:K) {
        if (result$hasIntercept && (i == 1)) {
            text = sprintf("%-15s%15.6f%15.6f\n", "(Intercept)",
                result$beta$val[1], result$beta$se[1])
            cat(text)
            displayAdjust = 1
        } else {
            tcoef = sprintf("beta%d", i - displayAdjust)
            text = sprintf("%-15s%15.6f%15.6f\n", tcoef,
                result$beta$val[i], result$beta$se[i])
            cat(text)
        }
    }
    text = sprintf("\nR-squared: %19.4f\n", result$R)
    cat(text)
    text = sprintf("Adjusted R-squared: %10.4f\n", result$adjR)
    cat(text)
}
Run again:
results = lr(X, y)
lr.summary(results)
## Regression Results:
##
## Residuals Summary:
##        V1
##  Min.   :-8.53e-14
##  1st Qu.: 2.49e-14
##  Median : 5.68e-14
##  Mean   : 5.90e-14
##  3rd Qu.: 9.95e-14
##  Max.   : 2.02e-13
##
## Coefficients:
##                         Values     Std. Error
## (Intercept)          12.000000       0.000000
## beta1                 6.000000       0.000000
## beta2                 4.000000       0.000000
## beta3                 9.000000       0.000000
##
## R-squared:              1.0000
## Adjusted R-squared:     1.0000
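As a quick cross-check (a minimal sketch, assuming X and y are still the data objects from Part 1), these figures should agree with R's built-in lm():
# cross-check against R's built-in lm()
fit <- lm(as.vector(y) ~ as.matrix(X))
summary(fit)$r.squared       # should equal results$R
summary(fit)$adj.r.squared   # should equal results$adjR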
3. Testing the coefficients
Remember that we estimated the variance of the residuals \( \sigma^2 \) by \( s^2 \):
\[ s^2 = \frac{1}{N-K-1} \sum_{i=1}^{N}e_i^2 \]
(In this section \( K \) counts the regressors excluding the intercept, so \( N-K-1 \) here equals the \( N-K \) used in the code, where \( K \) includes the intercept.)
If, in addition to the Gauss-Markov assumptions, the errors are normally distributed, then the scaled sum of squared residuals \( \sum_{i=1}^{N}e_i^2 / \sigma^2 \) follows a Chi-squared distribution with \( N-K-1 \) degrees of freedom:
\[ \frac{(N-K-1) s^2}{\sigma^2} \sim \chi_{N-K-1}^2 \]
Recall the defining property of the Chi-squared distribution: if \( e_j \sim N(\mu, \sigma^2) \) independently for \( j = 1, \dots, J \), then \( \sum_{j=1}^J \frac{(e_j-\mu)^2}{\sigma^2} \sim \chi_J^2 \), with mean \( E\left(\sum_{j=1}^J \frac{(e_j-\mu)^2}{\sigma^2}\right) = J \) and variance \( V\left(\sum_{j=1}^J \frac{(e_j-\mu)^2}{\sigma^2}\right) = 2J \).
Thus, since the residuals \( e \) have zero mean (i.e. \( \mu = 0 \)), it follows that
\[ \xi = \frac{(N-K-1) s^2}{\sigma^2} \sim \chi_{N-K-1}^2 \]
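This claim is easy to verify by simulation. The following is a minimal sketch with made-up values of N, K and sigma: it draws many error samples on a fixed design matrix, refits the regression each time, and compares the empirical mean and variance of \( \xi \) with the theoretical \( J \) and \( 2J \):
set.seed(7)
N <- 30; K <- 2; sigma <- 1.5
Xs <- cbind(1, matrix(rnorm(N * K), N, K))  # intercept plus K regressors, held fixed
J <- N - K - 1                              # degrees of freedom
xi <- replicate(5000, {
    e <- rnorm(N, sd = sigma)
    ys <- Xs %*% c(1, 2, 3) + e
    r <- ys - Xs %*% solve(t(Xs) %*% Xs, t(Xs) %*% ys)  # OLS residuals
    sum(r^2)/sigma^2                        # = (N-K-1) s^2 / sigma^2
})
c(mean(xi), J)       # empirical mean should be close to J
c(var(xi), 2 * J)    # empirical variance should be close to 2J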
Now recall the definition of the t-distribution: if a variable \( X \) has a standard normal distribution, \( X \sim N(0,1) \), and is independent of \( \xi \sim \chi_J^2 \), then the following ratio has a t-distribution with \( J \) degrees of freedom:
\[ t = \frac{X}{\sqrt{\xi/J}} \]
In our case, to test whether the coefficient \( \beta_k \) is significant or not, we let \( X = \frac{\hat{\beta}_k - \beta_k}{\sqrt{V(\hat{\beta}_k)}} \). Since \( V(\hat{\beta}) = \sigma^2 (X'X)^{-1} \), let \( c_{kk} \) denote the \( k \)-th diagonal element of the matrix \( (X'X)^{-1} \). The variance of the estimator \( \hat{\beta}_k \) is then \( \sigma^2 c_{kk} \) and its standard deviation is \( \sigma \sqrt{c_{kk}} \).
So, \( X = \frac{\hat{\beta}_k - \beta_k}{\sigma \sqrt{c_{kk}}} \), and
\[ t = \frac{X}{\sqrt{\xi/(N-K-1)}} \]
\[ t = \frac{\hat{\beta}_k - \beta_k}{\sigma \sqrt{c_{kk}}} \Big/ \sqrt{\frac{s^2}{\sigma^2}} \]
\[ t = \frac{\hat{\beta}_k - \beta_k}{s \sqrt{c_{kk}}} \]
\[ t = \frac{\hat{\beta}_k - \beta_k}{se(\hat{\beta}_k)} \]
The value \( t \) has a t-distribution with \( N-K-1 \) degrees of freedom.
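Again, a quick Monte Carlo sketch (all names and sample sizes here are illustrative) shows that, at the true \( \beta_k \), this ratio behaves like a t-distribution with \( N-K-1 \) degrees of freedom:
set.seed(11)
N <- 25; K <- 2
Xs <- cbind(1, matrix(rnorm(N * K), N, K))
beta.true <- c(1, 2, 3)
tstat <- replicate(5000, {
    ys <- drop(Xs %*% beta.true) + rnorm(N)
    fit <- lm(ys ~ Xs - 1)   # Xs already contains the intercept column
    (coef(fit)[2] - beta.true[2]) / summary(fit)$coefficients[2, "Std. Error"]
})
quantile(tstat, c(0.025, 0.975))      # simulated quantiles...
qt(c(0.025, 0.975), df = N - K - 1)   # ...should match the t quantiles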
To test whether the coefficient \( \beta_k \) is significantly different from zero (i.e. \( H_0: \beta_k = 0 \)), we compute:
\[ t_k = \frac{\hat{\beta}_k-0}{se(\hat{\beta}_k)} \]
with dof \( = N-K-1 \), and reject \( H_0 \) when \( |t_k| \) exceeds the critical value of that t-distribution.
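In R, the corresponding two-sided p-value comes straight from the t cumulative distribution function pt(); here is a minimal sketch with hypothetical numbers:
# two-sided p-value for a computed t statistic (illustrative values)
tk <- 2.31          # hypothetical t value for beta_k
dof <- 20           # N - K - 1 for the data at hand
p.value <- 2 * pt(-abs(tk), df = dof)
p.value             # reject H0: beta_k = 0 at the 5% level if p.value < 0.05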
lr <- function(X, y, method = "OLS", intercept = TRUE) {
    X = as.matrix(X)
    K = ncol(X)
    N = nrow(X)
    result = list()
    if (intercept) {
        # prepend a column of ones and count the intercept as a coefficient
        K = K + 1
        X = cbind(1, X)
    }
    # OLS estimate: beta = (X'X)^{-1} X'y
    beta = solve(t(X) %*% X) %*% t(X) %*% y
    # calculate the residuals
    resid = y - X %*% beta
    # residual variance, adjusted for the number of coefficients
    residvar = (t(resid) %*% resid)/(N - K)
    # convert the 1x1 matrix to a scalar
    residvar = residvar[1, 1]
    # calculate R-squared and adjusted R-squared; note: despite the names,
    # SSE stores SSE/(N-1) and TSS stores TSS/(N-1), so the ratios below
    # are unchanged
    SSE = (t(resid) %*% resid)/(N - 1)
    SSE = SSE[1, 1]
    adjSSE = residvar
    TSS = t(y - mean(y)) %*% (y - mean(y))/(N - 1)
    TSS = TSS[1, 1]
    Rsquared = 1 - (SSE/TSS)
    adjRsquared = 1 - (adjSSE/TSS)
    betacovar = solve(t(X) %*% X) * residvar
    # assign values to the list result and return it
    result$beta$val = beta
    result$resid$val = resid
    result$resid$var = residvar
    result$beta$covar = betacovar
    result$beta$var = diag(betacovar)
    result$beta$se = sqrt(diag(betacovar))
    # calculate the t value for all the coefficients
    result$beta$t = beta/sqrt(diag(betacovar))
    result$hasIntercept = intercept
    result$R = Rsquared
    result$adjR = adjRsquared
    result
}
lr.summary <- function(result) {
    displayAdjust = 0
    K = length(result$beta$val)
    cat("Regression Results:\n\n")
    cat("Residuals Summary:\n")
    residsummary = summary(result$resid$val)
    print(residsummary)
    cat("\n")
    cat("Coefficients:\n")
    text = sprintf("%30s%15s%15s\n", "Values", "Std. Error", "t")
    cat(text)
    for (i in 1:K) {
        # print t in scientific notation so very large values
        # do not overflow the column width
        if (result$hasIntercept && (i == 1)) {
            text = sprintf("%-15s%15.6f%15.6f%15.6e\n", "(Intercept)",
                result$beta$val[1], result$beta$se[1], result$beta$t[1])
            cat(text)
            displayAdjust = 1
        } else {
            tcoef = sprintf("beta%d", i - displayAdjust)
            text = sprintf("%-15s%15.6f%15.6f%15.6e\n", tcoef,
                result$beta$val[i], result$beta$se[i], result$beta$t[i])
            cat(text)
        }
    }
    text = sprintf("\nR-squared: %19.4f\n", result$R)
    cat(text)
    text = sprintf("Adjusted R-squared: %10.4f\n", result$adjR)
    cat(text)
}
Run again:
results = lr(X, y)
lr.summary(results)
## Regression Results:
##
## Residuals Summary:
##        V1
##  Min.   :-8.53e-14
##  1st Qu.: 2.49e-14
##  Median : 5.68e-14
##  Mean   : 5.90e-14
##  3rd Qu.: 9.95e-14
##  Max.   : 2.02e-13
##
## Coefficients:
##                         Values     Std. Error              t
## (Intercept)          12.000000       0.000000   4.154457e+14
## beta1                 6.000000       0.000000   2.012432e+15
## beta2                 4.000000       0.000000   1.328501e+15
## beta3                 9.000000       0.000000   3.244400e+15
##
## R-squared:              1.0000
## Adjusted R-squared:     1.0000
Note the enormous t statistics: our toy data is an exact linear function of X, so the residuals sit at machine precision, the standard errors are effectively zero, and the t values blow up.
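On noisy data the statistics are better behaved. As a final sanity check (a sketch on simulated data; Xn and yn are illustrative names), our lr() should reproduce the coefficient table of R's built-in lm():
set.seed(3)
Xn <- matrix(rnorm(200 * 3), 200, 3)
yn <- 12 + Xn %*% c(6, 4, 9) + rnorm(200)  # same coefficients, now with real noise
ours <- lr(Xn, yn)
lr.summary(ours)
# Values, Std. Error and t should match column for column,
# since both use the residual variance with N - K degrees of freedom
summary(lm(as.vector(yn) ~ Xn))$coefficients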