Wednesday, 28 August 2013

Basic Linear Regression - Part 2

Have you had your coffee yet? Okay, let's continue.

2. Goodness-of-fit

To measure how well our linear regression model fits the data, we compute the coefficient of determination \( R^2 \):

\[ R^2 = \frac{\hat{V}\{\hat{y}\}} {\hat{V}\{y\}} \]

where \( \hat{V}(\cdot) \) denotes the sample variance:

\[ R^2=\frac{1/(N-1) \sum_{i=1}^{N}(\hat{y_i}-\bar{y})^2}{1/(N-1) \sum_{i=1}^{N}(y_i-\bar{y})^2} \]

Define:
\( TSS \) = total sum of squares = \( \sum_{i=1}^{N}(y_i - \bar{y})^2 \)
\( SSR \) = regression sum of squares = \( \sum_{i=1}^{N}(\hat{y_i} - \bar{y})^2 \)
\( SSE \) = error sum of squares = \( \sum_{i=1}^{N}(y_i - \hat{y_i})^2 \)

When the model includes an intercept, \( TSS = SSR + SSE \), so

\[ R^2 = \frac{SSR}{TSS} \]
\[ R^2 = 1 - \frac{SSE}{TSS} \]
\[ R^2 = 1 - \frac{\hat{V}\{e\}}{\hat{V}\{y\}} = 1 - \frac{\frac{1}{N-1} \sum_{i=1}^{N} e_{i}^{2}}{\frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y})^2} \]
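The three expressions for \( R^2 \) are equivalent, and it is easy to check this numerically. A quick sketch, on made-up data (the values of `x` and `y` below are purely illustrative), using base R's `lm()` as the reference:

```r
# Hypothetical toy data, just to verify the R-squared identities
set.seed(1)
x <- 1:20
y <- 2 + 3 * x + rnorm(20)

fit  <- lm(y ~ x)
yhat <- fitted(fit)
e    <- residuals(fit)

TSS <- sum((y - mean(y))^2)      # total sum of squares
SSR <- sum((yhat - mean(y))^2)   # regression sum of squares
SSE <- sum(e^2)                  # error sum of squares

isTRUE(all.equal(TSS, SSR + SSE))                     # decomposition holds
isTRUE(all.equal(SSR / TSS, 1 - SSE / TSS))           # SSR/TSS = 1 - SSE/TSS
isTRUE(all.equal(SSR / TSS, summary(fit)$r.squared))  # matches lm()'s R-squared
```

All three checks return `TRUE`; the \( \frac{1}{N-1} \) factors cancel, which is why the sum-of-squares and sample-variance forms agree.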

The \( R^2 \) measure has a drawback: it never decreases when another regressor is added, even if that regressor has no explanatory power. To compensate, the adjusted \( R^2 \), denoted \( \bar{R}^2 \), penalises the residual variance for the number of estimated coefficients:

\[ \bar{R}^2 = 1 - \frac{\frac{1}{N-K} \sum_{i=1}^{N} e_{i}^{2}}{\frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y})^2} \]

where \( K \) is the number of estimated coefficients, including the intercept.
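The adjusted \( \bar{R}^2 \) can also be checked against R's built-in `lm()`. A sketch on made-up data (the variables `x1`, `x2` and the sample size are purely illustrative; `x2` is deliberately irrelevant):

```r
# Hypothetical data; x2 has no true explanatory power
set.seed(2)
N  <- 30
x1 <- rnorm(N); x2 <- rnorm(N)
y  <- 1 + 2 * x1 + rnorm(N)

fit <- lm(y ~ x1 + x2)
e   <- residuals(fit)
p   <- length(coef(fit))   # estimated parameters, including the intercept

# adjusted R-squared: residual variance scaled by 1/(N-p),
# total variance scaled by 1/(N-1)
adjR2 <- 1 - (sum(e^2) / (N - p)) / (sum((y - mean(y))^2) / (N - 1))

isTRUE(all.equal(adjR2, summary(fit)$adj.r.squared))   # matches lm()
```

Because the numerator is divided by \( N-K \) rather than \( N-1 \), adding a useless regressor (which barely reduces \( \sum e_i^2 \) but does reduce \( N-K \)) can push \( \bar{R}^2 \) down.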

Now, let's modify the regression function and its summary function to incorporate the \( R^2 \) and \( \bar{R}^2 \) measures:


lr <- function(X, y, method = "OLS", intercept = T) {
    X = as.matrix(X)
    K = ncol(X)
    N = nrow(X)
    beta = matrix()
    result = list()


    if (intercept == T) {
        # reset the length of vector beta
        K = K + 1
        X = cbind(1, X)
    }

    beta = solve(t(X) %*% X) %*% t(X) %*% y

    # calculate the residuals
    resid = y - X %*% beta

    # calculate the variance of residuals
    residvar = (t(resid) %*% resid)/(N - K)
    # convert it to value
    residvar = residvar[1, 1]

    # calculate R-squared and adjusted R-squared: first the variance of the
    # residuals, not adjusted for the number of regressors
    SSE = (t(resid) %*% resid)/(N - 1)
    SSE = SSE[1, 1]
    adjSSE = residvar
    TSS = t(y - mean(y)) %*% (y - mean(y))/(N - 1)
    TSS = TSS[1, 1]
    Rsquared = 1 - (SSE/TSS)
    adjRsquared = 1 - (adjSSE/TSS)

    betacovar = solve(t(X) %*% X) * residvar

    # assign value to the list result and return it
    result$beta$val = beta
    result$resid$val = resid
    result$resid$var = residvar
    result$beta$covar = betacovar
    result$beta$var = diag(betacovar)
    result$beta$se = sqrt(diag(betacovar))
    result$hasIntercept = intercept
    result$R = Rsquared
    result$adjR = adjRsquared
    result
}
lr.summary <- function(result) {
    displayAdjust = 0
    K = length(result$beta$val)

    cat("Regression Results:\n\n")
    cat("Residuals Summary:\n")
    residsummary = summary(result$resid$val)
    print(residsummary)
    cat("\n")
    cat("Coefficients:\n")

    text = sprintf("%30s%15s\n", "Values", "Std. Error")
    cat(text)

    for (i in 1:K) {
        if (result$hasIntercept && (i == 1)) {
            text = sprintf("%-15s%15.6f%15.6f\n", "(Intercept)", result$beta$val[1], 
                result$beta$se[1])
            cat(text)
            displayAdjust = 1
        } else {
            tcoef = sprintf("beta%d", i - displayAdjust)
            text = sprintf("%-15s%15.6f%15.6f\n", tcoef, result$beta$val[i], 
                result$beta$se[i])
            cat(text)
        }

    }

    text = sprintf("\nR-squared: %19.4f\n", result$R)
    cat(text)
    text = sprintf("Adjusted R-squared: %10.4f\n", result$adjR)
    cat(text)
}

Run again:

results = lr(X, y)
lr.summary(results)
## Regression Results:
## 
## Residuals Summary:
##        V1           
##  Min.   :-8.53e-14  
##  1st Qu.: 2.49e-14  
##  Median : 5.68e-14  
##  Mean   : 5.90e-14  
##  3rd Qu.: 9.95e-14  
##  Max.   : 2.02e-13  
## 
## Coefficients:
##                         Values     Std. Error
## (Intercept)          12.000000       0.000000
## beta1                 6.000000       0.000000
## beta2                 4.000000       0.000000
## beta3                 9.000000       0.000000
## 
## R-squared:              1.0000
## Adjusted R-squared:     1.0000

3. Testing the coefficients

Remember that we estimated the variance \( \sigma^2 \) of the residuals by \( s^2 \):

\[ s^2 = \frac{1}{N-K-1} \sum_{i=1}^{N}e_i^2 \]
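A quick check, on made-up data (the design matrix and coefficients below are purely illustrative), that this estimator is exactly the squared residual standard error reported by `lm()`:

```r
# Hypothetical data: K = 2 regressors plus an intercept
set.seed(5)
N <- 40; K <- 2
X <- cbind(rnorm(N), rnorm(N))
y <- drop(1 + X %*% c(2, -1) + rnorm(N))

fit <- lm(y ~ X)
s2  <- sum(residuals(fit)^2) / (N - K - 1)   # 1/(N-K-1) * sum of squared residuals

isTRUE(all.equal(s2, summary(fit)$sigma^2))  # matches lm()'s residual variance
```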

Under the Gauss-Markov assumptions, together with normally distributed errors, the sum of squared residuals scaled by \( \sigma^2 \) follows a Chi-squared distribution with \( N-K-1 \) degrees of freedom:

\[ \frac{(N-K-1) s^2}{\sigma^2} \sim \chi_{N-K-1}^2 \]

This rests on a standard property of the Chi-squared distribution: if \( e_1, \ldots, e_J \) are independent \( N(\mu, \sigma^2) \) variables, then \( \sum_{j=1}^J \frac{(e_j-\mu)^2}{\sigma^2} \sim \chi_J^2 \), with mean \( E\left(\sum_{j=1}^J \frac{(e_j-\mu)^2}{\sigma^2}\right) = J \) and variance \( V\left(\sum_{j=1}^J \frac{(e_j-\mu)^2}{\sigma^2}\right) = 2J \)
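The mean \( J \) and variance \( 2J \) are easy to confirm by simulation (the choice of \( J = 5 \) and the number of draws are arbitrary):

```r
# Simulate sums of J squared standard normals and check the
# chi-squared moments: mean should be near J, variance near 2J
set.seed(3)
J <- 5
draws <- replicate(200000, sum(rnorm(J)^2))

mean(draws)   # close to J = 5
var(draws)    # close to 2 * J = 10
```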

Thus, since the residuals have zero mean (i.e. \( \mu = 0 \)), it follows that
\[ \xi = \frac{(N-K-1) s^2}{\sigma^2} \sim \chi_{N-K-1}^2 \]

Now turn to the definition of the t-distribution: if a variable \( X \) has a standard normal distribution, \( X \sim N(0,1) \), and \( \xi \sim \chi_J^2 \) is independent of \( X \), then the following ratio has a t-distribution with \( J \) degrees of freedom:

\[ t = \frac{X}{\sqrt{\xi/J}} \]

In our case, to test whether the coefficient \( \beta_k \) is significant or not, we let \( X = \frac{\hat{\beta_k} - \beta_k}{\sqrt{V(\hat{\beta_k})}} \). As \( V(\hat{\beta}) = \sigma^2 (X'X)^{-1} \), let \( c_{kk} \) be the \( k \)-th diagonal element of the matrix \( (X'X)^{-1} \). The variance of \( \hat{\beta_k} \) is then \( \sigma^2 c_{kk} \), and its standard deviation is \( \sigma \sqrt{c_{kk}} \)

So, \( X = \frac{\hat{\beta_k} - \beta_k}{\sigma \sqrt{c_{kk}}} \)

\[ t = \frac{X}{\sqrt{\xi/(N-K-1)}} \]
\[ t = \frac{\hat{\beta_k} - \beta_k}{\sigma \sqrt{c_{kk}}} \Big/ \sqrt{\frac{s^2}{\sigma^2}} \]
\[ t = \frac{\hat{\beta_k} - \beta_k}{s \sqrt{c_{kk}}} \]
\[ t = \frac{\hat{\beta_k} - \beta_k}{se(\hat{\beta_k})} \]

The value \( t \) has a t-distribution with \( N-K-1 \) degrees of freedom.

To test whether the coefficient \( \beta_k \) is significantly different from zero, we will test:

\[ t_k = \frac{\hat{\beta_k}-0}{se(\hat{\beta_k})} \]
with \( N-K-1 \) degrees of freedom.
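Given a t statistic, the significance test is completed by looking up the t-distribution. A sketch with purely hypothetical numbers (a coefficient estimate of 1.8, standard error 0.7, \( N = 50 \) observations and \( K = 3 \) regressors), using base R's `pt()`:

```r
# Hypothetical values: beta_hat = 1.8, se(beta_hat) = 0.7, N = 50, K = 3
t_k <- (1.8 - 0) / 0.7          # t statistic for H0: beta_k = 0
dof <- 50 - 3 - 1               # N - K - 1 degrees of freedom

# two-sided p-value: probability mass beyond |t_k| in both tails
pval <- 2 * pt(-abs(t_k), df = dof)

c(t = t_k, p.value = pval)
```

If `pval` falls below the chosen significance level (say 0.05), we reject \( H_0: \beta_k = 0 \).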


lr <- function(X, y, method = "OLS", intercept = T) {
    X = as.matrix(X)
    K = ncol(X)
    N = nrow(X)
    beta = matrix()
    result = list()


    if (intercept == T) {
        # reset the length of vector beta
        K = K + 1
        X = cbind(1, X)
    }

    beta = solve(t(X) %*% X) %*% t(X) %*% y

    # calculate the residuals
    resid = y - X %*% beta

    # calculate the variance of residuals
    residvar = (t(resid) %*% resid)/(N - K)
    # convert it to value
    residvar = residvar[1, 1]

    # calculate R-squared and adjusted R-squared: first the variance of the
    # residuals, not adjusted for the number of regressors
    SSE = (t(resid) %*% resid)/(N - 1)
    SSE = SSE[1, 1]
    adjSSE = residvar
    TSS = t(y - mean(y)) %*% (y - mean(y))/(N - 1)
    TSS = TSS[1, 1]
    Rsquared = 1 - (SSE/TSS)
    adjRsquared = 1 - (adjSSE/TSS)

    betacovar = solve(t(X) %*% X) * residvar

    # assign value to the list result and return it
    result$beta$val = beta
    result$resid$val = resid
    result$resid$var = residvar
    result$beta$covar = betacovar
    result$beta$var = diag(betacovar)
    result$beta$se = sqrt(diag(betacovar))
    # calculate t value for all the coefficients
    result$beta$t = beta/sqrt(diag(betacovar))
    result$hasIntercept = intercept
    result$R = Rsquared
    result$adjR = adjRsquared
    result
}
lr.summary <- function(result) {
    displayAdjust = 0
    K = length(result$beta$val)

    cat("Regression Results:\n\n")
    cat("Residuals Summary:\n")
    residsummary = summary(result$resid$val)
    print(residsummary)
    cat("\n")
    cat("Coefficients:\n")

    text = sprintf("%30s%15s%15s\n", "Values", "Std. Error", "t")
    cat(text)

    for (i in 1:K) {
        if (result$hasIntercept && (i == 1)) {
            text = sprintf("%-15s%15.6f%15.6f%15.6f\n", "(Intercept)", result$beta$val[1], 
                result$beta$se[1], result$beta$t[1])
            cat(text)
            displayAdjust = 1
        } else {
            tcoef = sprintf("beta%d", i - displayAdjust)
            text = sprintf("%-15s%15.6f%15.6f%15.6f\n", tcoef, result$beta$val[i], 
                result$beta$se[i], result$beta$t[i])
            cat(text)
        }

    }

    text = sprintf("\nR-squared: %19.4f\n", result$R)
    cat(text)
    text = sprintf("Adjusted R-squared: %10.4f\n", result$adjR)
    cat(text)
}

Run again:

results = lr(X, y)
lr.summary(results)
## Regression Results:
## 
## Residuals Summary:
##        V1           
##  Min.   :-8.53e-14  
##  1st Qu.: 2.49e-14  
##  Median : 5.68e-14  
##  Mean   : 5.90e-14  
##  3rd Qu.: 9.95e-14  
##  Max.   : 2.02e-13  
## 
## Coefficients:
##                         Values     Std. Error              t
## (Intercept)          12.000000       0.000000415445722959532.375000
## beta1                 6.000000       0.0000002012432297973165.250000
## beta2                 4.000000       0.0000001328501284142088.000000
## beta3                 9.000000       0.0000003244399974637064.500000
## 
## R-squared:              1.0000
## Adjusted R-squared:     1.0000
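The enormous t values above (which overflow the `%15.6f` field and run into the Std. Error column) are an artifact of the demo data fitting exactly: the residuals are on the order of 1e-14, so the standard errors are essentially zero and the ratios explode. With noisy data the t statistics are finite, and the manual computation agrees with R's `lm()`. A sketch on made-up data (the coefficients 12, 6 and 4 echo the demo but the noise is added by me):

```r
# Hypothetical noisy data; compare manual t statistics with lm()'s
set.seed(4)
N <- 100
X <- cbind(rnorm(N), rnorm(N))
y <- 12 + 6 * X[, 1] + 4 * X[, 2] + rnorm(N)

Xi   <- cbind(1, X)                           # design matrix with intercept
beta <- solve(t(Xi) %*% Xi) %*% t(Xi) %*% y   # OLS estimates
e    <- y - Xi %*% beta                       # residuals
s2   <- sum(e^2) / (N - ncol(Xi))             # residual variance estimate
se   <- sqrt(diag(solve(t(Xi) %*% Xi)) * s2)  # coefficient standard errors
tval <- as.vector(beta) / se                  # t statistics for H0: beta = 0

isTRUE(all.equal(tval, unname(summary(lm(y ~ X))$coefficients[, "t value"])))
```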
