Monday, 7 October 2013

A Brief Introduction on Vector Autoregression Model

Vector Autoregressions (VAR)

What is VAR?

The vector autoregression (VAR) model is the extension of the univariate autoregressive model to the multivariate case. It is one of the most widely used models for describing the dynamic interactions between the variables in a system.
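More generally, a VAR of order \( p \) with \( k \) variables stacks the variables into a \( k \times 1 \) vector \( y_t \):

\[ y_t = a + A_1 y_{t-1} + A_2 y_{t-2} + \cdots + A_p y_{t-p} + e_t \]

where each \( A_i \) is a \( k \times k \) coefficient matrix and \( e_t \) is a \( k \times 1 \) vector of error terms. The rest of this post works with the bivariate first-order case (\( k=2 \), \( p=1 \)).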

VAR Representations

Suppose you want to study the interactions between GDP (\( y_1 \)) and money supply (\( y_2 \)), and you believe that they have a contemporaneous effect on each other and that their relationship can be modelled as:

Structural Form:

\[ y_{1,t} = a_1 - b_{12}y_{2,t} + c_{11}y_{1,t-1} + c_{12}y_{2,t-1} + \epsilon_{1,t} \] \[ y_{2,t} = a_2 - b_{21}y_{1,t} + c_{21}y_{1,t-1} + c_{22}y_{2,t-1} + \epsilon_{2,t} \]

The above 2 equations are in structural form, as each equation contains the contemporaneous value of the other variable (coefficients \( b_{12} \) and \( b_{21} \)) as a regressor. The error terms of the two equations are uncorrelated (\( cov(\epsilon_{1,t}, \epsilon_{2,t}) = 0 \)): the source of a shock to GDP should not be related to the source of a shock to the money supply.

The above 2 equations can be written in matrix form:

\[ \left[ \begin{array}{cc}1 & b_{12} \\ b_{21} & 1 \end{array} \right] \left[ \begin{array}{c}y_{1,t} \\ y_{2,t} \end{array} \right] = \left[ \begin{array}{c}a_1 \\ a_2 \end{array} \right] + \left[ \begin{array}{cc}c_{11} & c_{12} \\ c_{21} & c_{22} \end{array} \right] \left[ \begin{array}{c}y_{1,t-1} \\ y_{2,t-1} \end{array} \right] + \left[ \begin{array}{c}\epsilon_{1,t} \\ \epsilon_{2,t} \end{array} \right] \]

\[ By_t = a + C y_{t-1} + \epsilon_t \]

What do the coefficients mean?

The coefficients \( b_{12} \) and \( b_{21} \) represent the interactions between \( y_1 \) and \( y_2 \): if \( b_{12} \neq 0 \), then \( y_2 \) has a contemporaneous effect on \( y_1 \). The same applies to \( b_{21} \).

Now, a shock in GDP (\( \epsilon_{1,t} \)) affects \( y_{1,t} \) and subsequently affects \( y_{2,t} \), so \( \epsilon_{1,t} \) is correlated with \( y_{2,t} \). Because of this, the two equations above can NOT be estimated by OLS, as this violates one of the Gauss-Markov assumptions - the error term and the regressors must be independent. (Please refer to my previous post on OLS.)
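To see this concretely, here is a small simulation sketch (illustrative coefficients of my own choosing, with \( b_{12}=0 \) for simplicity) showing that \( y_{2,t} \) is correlated with the shock \( \epsilon_{1,t} \), so an OLS regression of \( y_{1,t} \) on \( y_{2,t} \) would be inconsistent:

```r
# illustrative structural coefficients (not the DGP used later in this post)
set.seed(1)
n <- 10000
b21 <- 0.5; c11 <- 0.5; c12 <- 0.2; c21 <- 0.2; c22 <- 0.3
e1 <- rnorm(n); e2 <- rnorm(n)
y1 <- numeric(n); y2 <- numeric(n)
for (t in 2:n) {
    # b12 = 0 here, so y1 depends only on lags and its own shock
    y1[t] <- c11 * y1[t - 1] + c12 * y2[t - 1] + e1[t]
    # y2 reacts to y1 within the same period, so e1 feeds into y2 contemporaneously
    y2[t] <- -b21 * y1[t] + c21 * y1[t - 1] + c22 * y2[t - 1] + e2[t]
}
cor(y2, e1)  # clearly non-zero: the regressor y2 is endogenous in equation 1
```

The sample correlation comes out clearly negative, confirming the endogeneity problem described above.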

Rewrite the equations in Reduced Form

The problem with the structural form VAR is that it cannot be estimated by OLS. You may therefore be tempted to rewrite it as a reduced form VAR, which can be estimated by OLS:

Assuming the matrix \( B = \left[ \begin{array}{cc}1 & b_{12} \\ b_{21} & 1 \end{array} \right] \) can be inverted: \( B^{-1} = \frac{1}{\Delta}\left[\begin{array}{cc}1 & -b_{12} \\ -b_{21} & 1 \end{array}\right] \), \( \Delta = 1-b_{12}b_{21} \)

\[ By_t = a + C y_{t-1} + \epsilon_t \] \[ y_t = B^{-1}a + B^{-1} C y_{t-1} + B^{-1}\epsilon_t \]

\[ \left[ \begin{array}{c}y_{1,t} \\ y_{2,t} \end{array} \right] = \left[ \begin{array}{cc}1 & b_{12} \\ b_{21} & 1 \end{array} \right]^{-1} \left[ \begin{array}{c}a_1 \\ a_2 \end{array} \right] + \left[ \begin{array}{cc}1 & b_{12} \\ b_{21} & 1 \end{array} \right]^{-1} \left[ \begin{array}{cc}c_{11} & c_{12} \\ c_{21} & c_{22} \end{array} \right] \left[ \begin{array}{c}y_{1,t-1} \\ y_{2,t-1} \end{array} \right] + \left[ \begin{array}{cc}1 & b_{12} \\ b_{21} & 1 \end{array} \right]^{-1} \left[ \begin{array}{c}\epsilon_{1,t} \\ \epsilon_{2,t} \end{array} \right] \]

We let \( B^{-1}a = \gamma \), \( B^{-1}C = \Gamma \) and \( B^{-1}\epsilon_t = e_t \). The equation becomes:

\[ y_t = B^{-1}a + B^{-1} C y_{t-1} + B^{-1}\epsilon_t \] \[ y_t = \gamma+ \Gamma y_{t-1} + e_t \]

\[ \left[ \begin{array}{c}y_{1,t} \\ y_{2,t} \end{array} \right] = \left[ \begin{array}{c}\gamma_1 \\ \gamma_2 \end{array} \right] + \left[ \begin{array}{cc}\gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{array} \right] \left[ \begin{array}{c}y_{1,t-1} \\ y_{2,t-1} \end{array} \right] + \left[ \begin{array}{c}e_{1,t} \\ e_{2,t} \end{array} \right] \]

The equations are now in reduced form, as each variable is expressed in terms of lagged values only.

The residual terms \( e_t \) are correlated with each other. Originally, the covariance matrix of the error terms (\( \epsilon_t \)) in structural form is: \[ Var(\epsilon_t) = E(\epsilon_t \epsilon_t') = \left[\begin{array}{cc}\sigma_{\epsilon_1}^2 & 0 \\ 0 & \sigma_{\epsilon_2}^2 \end{array} \right] = D \]

Since \( B^{-1}\epsilon_t = e_t \), the covariances of error terms (\( e_t \)) in reduced form are: \[ Var(e_t) = E(e_t e_t') = E(B^{-1}\epsilon_t (B^{-1}\epsilon_t)') = E(B^{-1} \epsilon_t \epsilon_t' B^{-1'} ) = B^{-1} D B^{-1'} = \Sigma \]

While the elements in \( \Sigma \) are : \( \left[ \begin{array}{cc}\sigma_{e_{11}}^2 & \sigma_{e_{12}}^2 \\ \sigma_{e_{21}}^2 & \sigma_{e_{22}}^2 \end{array}\right] \) = \( \frac{1}{\Delta^2} \left[\begin{array}{cc}\sigma_{\epsilon_1}^2+b_{12}^2\sigma_{\epsilon_2}^2 & -(b_{21}\sigma_{\epsilon_1}^2 + b_{12}\sigma_{\epsilon_2}^2) \\ -(b_{21}\sigma_{\epsilon_1}^2 + b_{12}\sigma_{\epsilon_2}^2) & \sigma_{\epsilon_2}^2+b_{21}^2\sigma_{\epsilon_1}^2 \end{array}\right] \)

The covariance between \( e_1 \) and \( e_2 \) is zero only when both \( b_{12} \) and \( b_{21} \) are zero, and if that is the case, it also implies that \( y_1 \) and \( y_2 \) have no contemporaneous effect on each other.
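The relation between the structural and reduced form covariance matrices is easy to verify numerically; a quick sketch with illustrative parameter values:

```r
# illustrative values for b12, b21 and the structural variances
b12 <- 0.4; b21 <- 0.5
s1 <- 1.5; s2 <- 2                       # sigma_eps1^2, sigma_eps2^2
B <- matrix(c(1, b12, b21, 1), 2, 2, byrow = TRUE)
D <- diag(c(s1, s2))
# reduced-form covariance: Sigma = B^{-1} D B^{-1}'
Sigma <- solve(B) %*% D %*% t(solve(B))
# closed-form expression from the text
Delta <- 1 - b12 * b21
closed <- (1 / Delta^2) * matrix(c(s1 + b12^2 * s2, -(b21 * s1 + b12 * s2),
                                   -(b21 * s1 + b12 * s2), s2 + b21^2 * s1),
                                 2, 2, byrow = TRUE)
all.equal(Sigma, closed)  # TRUE: the two agree element by element
```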

Stationarity of VAR

In lag operator notation, the reduced form VAR model can be written as: \[ y_t = \gamma+ \Gamma y_{t-1} + e_t \] \[ y_t - \Gamma y_{t-1} = \gamma + e_t \] \[ (I-\Gamma L) y_t = \gamma + e_t \] \[ \Gamma(L)y_t = \gamma + e_t \]

For stationarity, it should be possible to express the model in moving average (MA), or Wold, representation, which means the lag polynomial \( \Gamma(L) \) should be invertible. \( \Gamma(L) \) is invertible if the roots of \( det(I - \Gamma z) = 0 \) lie outside the unit circle or, equivalently, if the eigenvalues of the companion matrix \( F \) have modulus less than one (more about the companion matrix can be found in my other post):

\[ F = \left[\begin{array}{cccc} \Gamma_1 & \Gamma_2 & \cdots & \Gamma_n \\ I_n & 0 & \cdots & 0 \\ 0 & \ddots & 0 & \vdots \\ 0 & 0 & I_n & 0 \end{array} \right] \]

In our case, the companion matrix \( F \) of a VAR(1) is simply:

\[ F = \Gamma \]
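With the companion matrix in hand, a stationarity check for a VAR(1) is essentially a one-liner in R (illustrative \( \Gamma \), not tied to the example below):

```r
# illustrative VAR(1) coefficient matrix
Gamma <- matrix(c(0.5, 0.3, 0.2, 0.4), 2, 2, byrow = TRUE)
# moduli of the eigenvalues of the companion matrix (here F = Gamma)
mods <- Mod(eigen(Gamma, only.values = TRUE)$values)
all(mods < 1)  # TRUE -> the VAR(1) is stationary
```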

Estimation of VAR

So now the problem is, how can we estimate the structural VAR model? The steps are:

  1. Estimate the reduced form VAR model first.
  2. Find the Covariance matrix of the residuals of the estimated equations in 1.
  3. Apply the Choleski decomposition to the covariance matrix to obtain an estimate of the matrix \( B^{-1} \).
  4. Multiply the reduced form VAR model by the matrix \( B \) to get the structural form back.

The True model is:

\[ y_t = \gamma+ \Gamma y_{t-1} + e_t \] \[ y_t = B^{-1}a + B^{-1} C y_{t-1} + B^{-1}\epsilon_t \]

If we can estimate the matrix \( B \), then we can multiply the above equation by it to retrieve the TRUE model. But in practice, there is no way to estimate the matrix \( B \) without imposing any restriction. Why? Let's consider the following method:

\[ \left[ \begin{array}{c}y_{1,t} \\ y_{2,t} \end{array} \right] = \left[ \begin{array}{c}\gamma_1 \\ \gamma_2 \end{array} \right] + \left[ \begin{array}{cc}\gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{array} \right] \left[ \begin{array}{c}y_{1,t-1} \\ y_{2,t-1} \end{array} \right] + \left[ \begin{array}{c}e_{1,t} \\ e_{2,t} \end{array} \right] \]

We can estimate the equations by OLS to obtain:

\[ y_t = \hat{\gamma}+ \hat{\Gamma} y_{t-1} + \hat{e_t} \]

The covariance matrix of the residuals is: \[ E(\hat{e_t} \hat{e_t}') = \hat{\Sigma} \]

Next, we perform Choleski decomposition for the matrix \( \hat{\Sigma} \) \[ \hat{\Sigma} = L \Lambda L' \]

As mentioned before, the True covariance matrix is \( \Sigma = B^{-1}DB^{-1'} \), and the elements of \( \Sigma \) are:

\[ \left[ \begin{array}{cc}\sigma_{e_{11}}^2 & \sigma_{e_{12}}^2 \\ \sigma_{e_{21}}^2 & \sigma_{e_{22}}^2 \end{array}\right] = \frac{1}{\Delta^2} \left[\begin{array}{cc}\sigma_{\epsilon_1}^2+b_{12}^2\sigma_{\epsilon_2}^2 & -(b_{21}\sigma_{\epsilon_1}^2 + b_{12}\sigma_{\epsilon_2}^2) \\ -(b_{21}\sigma_{\epsilon_1}^2 + b_{12}\sigma_{\epsilon_2}^2) & \sigma_{\epsilon_2}^2+b_{21}^2\sigma_{\epsilon_1}^2 \end{array}\right] \]

The problem we encounter now is that even if we have the true covariance matrix \( \Sigma \), we cannot estimate the 4 parameters (\( b_{12}, b_{21}, \sigma_{\epsilon_1}, \sigma_{\epsilon_2} \)) from only 3 observed values (\( \sigma_{e11} \), \( \sigma_{e22} \), \( \sigma_{e12}=\sigma_{e21} \)). This implies that some restrictions must be imposed in order to estimate the model. The most commonly used restriction is to set \( b_{12} = 0 \), which assumes that \( y_2 \) has no contemporaneous effect on \( y_1 \) but \( y_1 \) does have a contemporaneous effect on \( y_2 \). When making this assumption, you should focus on the economic meaning rather than the statistical meaning of the equation.
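To see how the restriction delivers identification, set \( b_{12} = 0 \) (so \( \Delta = 1 \)) in the expression for \( \Sigma \) above; the three distinct elements then pin down the three remaining parameters exactly:

\[ \sigma_{e_{11}}^2 = \sigma_{\epsilon_1}^2, \qquad \sigma_{e_{12}}^2 = -b_{21}\sigma_{\epsilon_1}^2, \qquad \sigma_{e_{22}}^2 = \sigma_{\epsilon_2}^2 + b_{21}^2\sigma_{\epsilon_1}^2 \]

\[ \Rightarrow \qquad \sigma_{\epsilon_1}^2 = \sigma_{e_{11}}^2, \qquad b_{21} = -\frac{\sigma_{e_{12}}^2}{\sigma_{e_{11}}^2}, \qquad \sigma_{\epsilon_2}^2 = \sigma_{e_{22}}^2 - b_{21}^2\sigma_{e_{11}}^2 \]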

Try it out! Generating the data

To see how the estimation works, let's get our hands dirty now. We generate the data \( y_1 \) and \( y_2 \) from the following bivariate SVAR(1) model:

a1 <- 1.6
b12 <- 0
c11 <- 0.5
c12 <- 0.8

a2 <- -0.8
b21 <- 0.5
c21 <- 0.8
c22 <- 0.3

\[ y_{1,t} = 1.6 - 0y_{2,t} + 0.5y_{1,t-1} + 0.8y_{2,t-1} + \epsilon_{1,t} \] \[ y_{2,t} = -0.8 - 0.5y_{1,t} + 0.8y_{1,t-1} + 0.3y_{2,t-1} + \epsilon_{2,t} \]

\[ \left[ \begin{array}{cc}1 & 0 \\ 0.5 & 1 \end{array} \right] \left[ \begin{array}{c}y_{1,t} \\ y_{2,t} \end{array} \right] = \left[ \begin{array}{c}1.6 \\ -0.8 \end{array} \right] + \left[ \begin{array}{cc}0.5 & 0.8 \\ 0.8 & 0.3 \end{array} \right] \left[ \begin{array}{c}y_{1,t-1} \\ y_{2,t-1} \end{array} \right] + \left[ \begin{array}{c}\epsilon_{1,t} \\ \epsilon_{2,t} \end{array} \right] \]

*As we will impose the restriction of \( b_{12} =0 \) in the estimation process, the parameter \( b_{12} \) is set to zero in the data generating process.

To generate \( y_1 \) and \( y_2 \), we first rewrite the structural form equations through substitution.

\[ y_{1,t} = a_1 - b_{12}y_{2,t} + c_{11}y_{1,t-1} + c_{12}y_{2,t-1} + \epsilon_{1,t} \] \[ y_{2,t} = a_2 - b_{21}y_{1,t} + c_{21}y_{1,t-1} + c_{22}y_{2,t-1} + \epsilon_{2,t} \]

Let's substitute \( y_{2,t} \) into \( y_{1,t} \):

\[ y_{1,t} = a_1 - a_2 b_{12} + b_{12}b_{21}y_{1,t} + (c_{11}-b_{12}c_{21})y_{1,t-1}+(c_{12}-b_{12}c_{22})y_{2,t-1} - b_{12}\epsilon_{2,t}+\epsilon_{1,t} \]

\[ y_{1,t} = \frac{1}{1-b_{12}b_{21}} (a_1 - a_2 b_{12} + (c_{11}-b_{12}c_{21})y_{1,t-1}+(c_{12}-b_{12}c_{22})y_{2,t-1} -b_{12}\epsilon_{2,t}+\epsilon_{1,t}) \]

Once we have \( y_{1,t} \), we can substitute it back to get \( y_{2,t} \).

Let's generate the data in R.


# generate 1,000 independent surprise terms for y1 and y2
set.seed(13)
y1.innov <- rnorm(1000, 0, 1)
y2.innov <- rnorm(1000, 0, 1)

y1.data <- rep(0, 1000)
y2.data <- rep(0, 1000)
y1.data[1] <- 0
y2.data[1] <- 0

for (i in 2:1000) {
    y1.data[i] <- (1/(1 - b12 * b21)) * (a1 - a2 * b12 + (c11 - b12 * c21) * 
        y1.data[i - 1] + (c12 - b12 * c22) * y2.data[i - 1] - b12 * y2.innov[i] + 
        y1.innov[i])

    y2.data[i] <- a2 - b21 * y1.data[i] + c21 * y1.data[i - 1] + c22 * y2.data[i - 
        1] + y2.innov[i]
}

Let's plot the last 100 observations of the 2 series :

library(ggplot2)
library(reshape2)

x <- 901:1000
df <- data.frame(x, y1 = y1.data[901:1000], y2 = y2.data[901:1000])

df.melted <- melt(df, id.vars = "x")

g <- ggplot(data = df.melted, aes(x = x, y = value, color = variable)) + geom_line()

g

(plot: the last 100 observations of y1 and y2)

Ok, now we have the data \( y_1 \) and \( y_2 \), and they seem correlated with each other. Let's try out the estimation procedure and see whether we can retrieve the structural form.

Estimation of VAR by Brute Force

As mentioned above, the estimation procedures of VAR are:

  1. Estimate the reduced form VAR model first.
  2. Find the Covariance matrix of the residuals of the estimated equations in 1.
  3. Apply the Choleski decomposition to the covariance matrix to obtain the matrix \( L \), which is the estimator of \( B^{-1} \).
  4. Multiply the estimated reduced form VAR model by the matrix \( L^{-1} \) to get the structural form back.

Step 1 - Estimate the reduced form VAR model

So, let's do step 1:

n <- 1000

t <- 2:n

df <- data.frame(y1 = y1.data, y2 = y2.data)
# reduced form
y1.r.fit <- lm(y1[t] ~ y1[t - 1] + y2[t - 1], df)
y2.r.fit <- lm(y2[t] ~ y1[t - 1] + y2[t - 1], df)

Let's check the regression results:

y1.r.fit
## 
## Call:
## lm(formula = y1[t] ~ y1[t - 1] + y2[t - 1], data = df)
## 
## Coefficients:
## (Intercept)    y1[t - 1]    y2[t - 1]  
##       1.592        0.505        0.777
y2.r.fit
## 
## Call:
## lm(formula = y2[t] ~ y1[t - 1] + y2[t - 1], data = df)
## 
## Coefficients:
## (Intercept)    y1[t - 1]    y2[t - 1]  
##      -1.575        0.546       -0.110

And the matrices in reduced form are:

gamma1 <- y1.r.fit$coef[1]
gamma2 <- y2.r.fit$coef[1]

gamma.hat <- matrix(c(gamma1, gamma2), ncol = 1, nrow = 2, byrow = F)

gamma11 <- y1.r.fit$coef[2]
gamma12 <- y1.r.fit$coef[3]
gamma21 <- y2.r.fit$coef[2]
gamma22 <- y2.r.fit$coef[3]

Gamma.hat <- matrix(c(gamma11, gamma12, gamma21, gamma22), ncol = 2, byrow = T)

Now, let's check the stationarity of the series. For a VAR(1) to be stationary, the eigenvalues of the matrix \( \Gamma \) must lie inside the unit circle:

eigen(Gamma.hat, only.values = T)$values
## [1]  0.9179 -0.5228

Step 2 - get the covariance matrix \( \hat{\Sigma} \) of the estimated residuals:

res.mat <- cbind(y1.resid = y1.r.fit$residuals, y2.resid = y2.r.fit$residuals)
Sigma.hat <- var(res.mat)
colnames(Sigma.hat) <- NULL
rownames(Sigma.hat) <- NULL
Sigma.hat
##         [,1]    [,2]
## [1,]  1.0003 -0.5039
## [2,] -0.5039  1.2332

Step 3 - Do the Choleski decomposition to obtain the matrix \( L \), which is the estimator of \( B^{-1} \).

ch <- chol(Sigma.hat)
dd <- diag(ch)
L <- t(ch/dd)
D <- diag(dd^2)

Linv = solve(L)
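To see what the code above is doing, here is a self-contained sketch of the same \( L \Lambda L' \) factorisation on an illustrative 2x2 covariance matrix (R's chol() returns the upper triangular factor \( U \) with \( \Sigma = U'U \)):

```r
# an illustrative symmetric positive definite covariance matrix
Sigma.toy <- matrix(c(1.0, -0.5, -0.5, 1.25), 2, 2)
ch <- chol(Sigma.toy)        # upper triangular U with Sigma.toy = U'U
dd <- diag(ch)
L <- t(ch / dd)              # unit lower triangular factor
Lambda <- diag(dd^2)         # diagonal matrix of variances
all.equal(L %*% Lambda %*% t(L), Sigma.toy)  # TRUE: Sigma = L Lambda L'
```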

Step 4 - Multiply \( L^{-1} \) to get back the structural form

a.hat = Linv %*% gamma.hat
C.hat = Linv %*% Gamma.hat

Now, let's compare the true values with the estimated values:

\( B=\left[ \begin{array}{cc}1 & 0 \\ 0.5 & 1 \end{array} \right] \), \( \hat{B} = \left[ \begin{array}{cc}1 & 0 \\ 0.5037 & 1 \end{array} \right] \)

\( a=\left[ \begin{array}{c}1.6 \\ -0.8 \end{array} \right] \), \( \hat{a}=\left[ \begin{array}{c}1.5921 \\ -0.7733 \end{array} \right] \),

\( C=\left[ \begin{array}{cc}0.5 & 0.8 \\ 0.8 & 0.3 \end{array} \right] \), \( \hat{C}=\left[ \begin{array}{cc}0.5047 & 0.7768 \\ 0.8007 & 0.2817 \end{array} \right] \)

Check the covariance matrix of \( \hat{\epsilon} \) in the structural form; there should not be any covariance between the 2 shocks.

epsilon.hat = list()
# only 999 residuals are available, as one observation is lost to the lag
for (i in 1:999) {
    epsilon.hat[[i]] = Linv %*% as.matrix(c(y1.r.fit$residuals[i], y2.r.fit$residuals[i]))
}

e.mat = matrix(0, nrow = 999, ncol = 2)
for (i in 1:999) {
    e.mat[i, 1] = epsilon.hat[[i]][1]
    e.mat[i, 2] = epsilon.hat[[i]][2]
}
cov(e.mat)
##           [,1]       [,2]
## [1,]  1.00e+00 -5.540e-17
## [2,] -5.54e-17  9.794e-01

Everything seems pretty good. Now, let's talk about how to make analysis by using the VAR model.

Moving Average (MA) or Wold Representation

When discussing stationarity, I mentioned that the reduced form VAR model can be rewritten in the Wold representation. Now consider a VAR(1) model:

\[ y_t = \gamma+ \Gamma y_{t-1} + e_t \] \[ y_t - \Gamma y_{t-1} = \gamma + e_t \] \[ (I-\Gamma L) y_t = \gamma + e_t \] \[ y_t = (I-\Gamma L)^{-1} \gamma + (I-\Gamma L)^{-1} e_t \] \[ y_t = \phi + \Phi(L)e_t \]

where: \[ \Phi(L) = (I-\Gamma L)^{-1} = \sum_{i=0}^{\infty} \Gamma^i L^i \]

So, \[ y_t = \phi + e_t + \Gamma^1 e_{t-1} + \Gamma^2 e_{t-2} + \cdots \]

Since the elements of \( e_t \) are correlated with each other, the above equation does not give much insight. But recall that \( B^{-1}\epsilon_t = e_t \), so it can be further written as:

\[ y_t = \phi + B^{-1}\epsilon_t + \Gamma^1 B^{-1}\epsilon_{t-1} + \Gamma^2 B^{-1}\epsilon_{t-2} + \cdots \] \[ y_t = \phi + \Upsilon(L)\epsilon_t \]

where

\[ \Upsilon(L) = \sum_{k=0}^{\infty} \Upsilon_k L^k \] \[ \Upsilon(L) = B^{-1} + \Gamma^1 B^{-1} L + \Gamma^2 B^{-1} L^2 + \cdots \]

Since the \( \epsilon_{t-i} \) are the structural innovations, the above equation is called the structural moving average (SMA) form.

Let's have a look at the SMA representation for the bivariate system:

\[ \left[\begin{array}{c} y_{1,t} \\ y_{2,t} \end{array}\right] = \left[\begin{array}{c} \phi_1 \\ \phi_2 \end{array}\right] + \left[\begin{array}{cc} \upsilon_{11}^{(0)} & \upsilon_{12}^{(0)}\\ \upsilon_{21}^{(0)} & \upsilon_{22}^{(0)} \end{array}\right] \left[\begin{array}{c} \epsilon_{1,t} \\ \epsilon_{2,t} \end{array}\right] + \left[\begin{array}{cc} \upsilon_{11}^{(1)} & \upsilon_{12}^{(1)}\\ \upsilon_{21}^{(1)} & \upsilon_{22}^{(1)} \end{array}\right] \left[\begin{array}{c} \epsilon_{1,t-1} \\ \epsilon_{2,t-1} \end{array}\right] + \cdots \]

The coefficient \( \upsilon_{ij}^{(k)} \) represents the dynamic multiplier, or impulse response, of \( y_{i,t} \) to a change in \( \epsilon_{j,t-k} \).

The Impulse Response Function (IRF)

Consider the SMA representation of the equation at time \( t+s \):

\[ \left[\begin{array}{c} y_{1,t+s} \\ y_{2,t+s} \end{array}\right] = \left[\begin{array}{c} \phi_1 \\ \phi_2 \end{array}\right] + \left[\begin{array}{cc} \upsilon_{11}^{(0)} & \upsilon_{12}^{(0)}\\ \upsilon_{21}^{(0)} & \upsilon_{22}^{(0)} \end{array}\right] \left[\begin{array}{c} \epsilon_{1,t+s} \\ \epsilon_{2,t+s} \end{array}\right] + \cdots + \left[\begin{array}{cc} \upsilon_{11}^{(s)} & \upsilon_{12}^{(s)}\\ \upsilon_{21}^{(s)} & \upsilon_{22}^{(s)} \end{array}\right] \left[\begin{array}{c} \epsilon_{1,t} \\ \epsilon_{2,t} \end{array}\right] \]

If we differentiate the above equation with respect to the individual structural innovations, we obtain the structural dynamic multipliers:

\[ \begin{array}{cc} \frac{\partial y_{1,t+s}}{\partial \epsilon_{1,t}} = \upsilon_{11}^{(s)} & \frac{\partial y_{1,t+s}}{\partial \epsilon_{2,t}} = \upsilon_{12}^{(s)} \\ \frac{\partial y_{2,t+s}}{\partial \epsilon_{1,t}} = \upsilon_{21}^{(s)} & \frac{\partial y_{2,t+s}}{\partial \epsilon_{2,t}} = \upsilon_{22}^{(s)} \end{array} \]

And if the 2 series in \( y_t \) are covariance stationary:

\[ \lim_{s\rightarrow \infty} \upsilon_{ij}^{(s)} = 0, \text{ for } i, j \in \{1,2\} \]

Thus, the long run cumulative impact of the structural innovations is captured by the long-run impact matrix:

\[ \Upsilon(L) = \left[\begin{array}{cc} \sum_{s=0}^{\infty}\upsilon_{11}^{(s)}L^s & \sum_{s=0}^{\infty}\upsilon_{12}^{(s)} L^s\\ \sum_{s=0}^{\infty}\upsilon_{21}^{(s)} L^s & \sum_{s=0}^{\infty}\upsilon_{22}^{(s)} L^s \end{array}\right] \]

Now, let's calculate and plot the 4 cumulative IRFs in R, up to 100 lags.

nLags = 100
Upsilon.list <- list()

for (i in 1:nLags) {

    tempGamma <- Gamma.hat
    if (i == 1) 
        tempGamma <- diag(2) else if (i > 2) {
        for (j in 3:i) {
            tempGamma <- tempGamma %*% Gamma.hat
        }
    }
    # Upsilon_k = Gamma^k %*% B^{-1}; here tempGamma = Gamma^(i-1) and L estimates B^{-1}
    Upsilon.list[[i]] <- tempGamma %*% L
}

Culsum.Upsilon.list <- Upsilon.list
for (i in 1:nLags) {
    j <- 1
    while (j < i) {
        Culsum.Upsilon.list[[i]] <- Culsum.Upsilon.list[[i]] + Upsilon.list[[j]]
        j <- j + 1
    }
}

upsilon11 = array()
upsilon12 = array()
upsilon21 = array()
upsilon22 = array()

for (i in 1:nLags) {
    upsilon11[i] = Culsum.Upsilon.list[[i]][1, 1]
    upsilon12[i] = Culsum.Upsilon.list[[i]][1, 2]
    upsilon21[i] = Culsum.Upsilon.list[[i]][2, 1]
    upsilon22[i] = Culsum.Upsilon.list[[i]][2, 2]
}

par(mfrow = c(2, 2))
# tex = 'Shock in '+expression(epsilon_(t))
plot(upsilon11, type = "l", xlab = "Time", ylab = expression(paste("Shock to ", 
    epsilon[1])), main = "Response of Y1")
plot(upsilon12, type = "l", xlab = "Time", ylab = expression(paste("Shock to ", 
    epsilon[2])), main = "Response of Y1")
plot(upsilon21, type = "l", xlab = "Time", ylab = expression(paste("Shock to ", 
    epsilon[1])), main = "Response of Y2")
plot(upsilon22, type = "l", xlab = "Time", ylab = expression(paste("Shock to ", 
    epsilon[2])), main = "Response of Y2")

(plot: cumulative impulse responses of y1 and y2 to the two structural shocks)

Impulse responses tell you the response of the current and future values of the variables to a unit increase in the current value of one of the VAR structural innovations.

Forecast Error Variance Decompositions

Forecast error variance decompositions help us determine the proportion of the forecast error variance of \( y_1 \) and \( y_2 \) at time \( t+s \) that is due to each of the structural shocks \( \epsilon_1 \) and \( \epsilon_2 \), based on the information available at time \( t \).

Again, the Wold representation of \( y_{t+s} \) is:

\[ y_t = \phi + B^{-1}\epsilon_t + \Gamma^1 B^{-1}\epsilon_{t-1} + \Gamma^2 B^{-1}\epsilon_{t-2} + \cdots \] \[ y_{t+s} = \phi + \Upsilon_0\epsilon_{t+s} + \Upsilon_1\epsilon_{t+s-1} + \Upsilon_2\epsilon_{t+s-2} + \cdots +\Upsilon_s\epsilon_{t} + \Upsilon_{s+1}\epsilon_{t-1} + \cdots \]

Based on the information available at time \( t \), the best linear forecast of \( y_{t+s} \) is:

\[ \hat{y_{t+s}} = \phi + \Upsilon_s\epsilon_{t} + \Upsilon_{s+1}\epsilon_{t-1} + \Upsilon_{s+2}\epsilon_{t-2} + \cdots \]

The forecast error is:

\[ y_{t+s} -\hat{y_{t+s}} = \Upsilon_0\epsilon_{t+s} + \Upsilon_1\epsilon_{t+s-1} + \Upsilon_2\epsilon_{t+s-2} + \cdots +\Upsilon_{s-1}\epsilon_{t+1} \]

Thus, the forecast error, equation by equation, is:

\[ \left[\begin{array}{c} y_{1,t+s} - \hat{y_{1,t+s}} \\ y_{2,t+s} - \hat{y_{2,t+s}} \end{array}\right] = \left[\begin{array}{cc} \upsilon_{11}^{(0)} & \upsilon_{12}^{(0)}\\ \upsilon_{21}^{(0)} & \upsilon_{22}^{(0)} \end{array}\right] \left[\begin{array}{c} \epsilon_{1,t+s} \\ \epsilon_{2,t+s} \end{array}\right] + \cdots + \left[\begin{array}{cc} \upsilon_{11}^{(s-1)} & \upsilon_{12}^{(s-1)}\\ \upsilon_{21}^{(s-1)} & \upsilon_{22}^{(s-1)} \end{array}\right] \left[\begin{array}{c} \epsilon_{1,t+1} \\ \epsilon_{2,t+1} \end{array}\right] \]

For the first equation: \[ y_{1,t+s} - \hat{y}_{1,t+s} = \begin{array}{l} \upsilon_{11}^{(0)}\epsilon_{1,t+s} + \cdots + \upsilon_{11}^{(s-1)}\epsilon_{1,t+1} \\ + \upsilon_{12}^{(0)}\epsilon_{2,t+s} + \cdots + \upsilon_{12}^{(s-1)}\epsilon_{2,t+1} \end{array} \]

\[ \sigma(s)_{1}^{2} = Var(y_{1,t+s} - \hat{y}_{1,t+s}) = \sigma_{\epsilon_1}^2( (\upsilon_{11}^{(0)})^2 + (\upsilon_{11}^{(1)})^2 + \dots + (\upsilon_{11}^{(s-1)})^2) + \sigma_{\epsilon_2}^2( (\upsilon_{12}^{(0)})^2 + (\upsilon_{12}^{(1)})^2 + \dots + (\upsilon_{12}^{(s-1)})^2) \]

Thus, the proportion of the total variance of the forecast of series \( y_1 \), \( s \) periods ahead, that is due to the shock \( \epsilon_1 \) is:

\[ \frac{\sigma_{\epsilon_1}^2( (\upsilon_{11}^{(0)})^2 + (\upsilon_{11}^{(1)})^2 + \dots + (\upsilon_{11}^{(s-1)})^2)}{\sigma(s)_{1}^2} \]

The same idea applies to series \( y_2 \) and to the shock \( \epsilon_2 \) as well.
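The whole decomposition can be sketched in R (illustrative \( \Gamma \) and \( B^{-1} \) with unit structural variances; these are not the estimates from this post):

```r
# illustrative reduced-form coefficients and impact matrix (b12 = 0)
Gamma <- matrix(c(0.5, 0.2, 0.2, 0.3), 2, 2, byrow = TRUE)
Binv  <- matrix(c(1, 0, -0.5, 1), 2, 2, byrow = TRUE)
s <- 10                                  # forecast horizon
# Upsilon_k = Gamma^k %*% B^{-1}, k = 0, ..., s-1
Upsilon <- lapply(0:(s - 1), function(k) {
    Gk <- diag(2)
    if (k > 0) for (i in 1:k) Gk <- Gk %*% Gamma
    Gk %*% Binv
})
# s-step forecast error variance of y1 attributable to eps1 and eps2
num1 <- sum(sapply(Upsilon, function(U) U[1, 1]^2))
num2 <- sum(sapply(Upsilon, function(U) U[1, 2]^2))
shares <- c(eps1 = num1, eps2 = num2) / (num1 + num2)
shares  # the two shares sum to one by construction
```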

My next post will talk about the VAR package available in R, stay tuned.

Friday, 6 September 2013

Introduction to Principal Component Analysis (PCA)

Principal Component Analysis

Principal Component Analysis (PCA) is very useful in finance for reducing the number of risk factors to a manageable dimension.

Alright, what is PCA? Let's have a look at the following example.

Let X and Y be random variables where \( X \sim N(0,2) \), \( Y \sim N(0,4) \) (means zero, variances 2 and 4), \( corr(X,Y)=0.8 \) and \( cov(X,Y) = corr(X,Y)\times\sqrt{2}\times\sqrt{4} \).

We generate 1000 observations of X and Y in R using the function 'mvrnorm' from the package 'MASS'.

library("MASS")
set.seed(7)
mu.x <- 0
mu.y <- 0
sigma.x <- sqrt(2)
sigma.y <- sqrt(4)
corr.xy <- 0.8
cov.xy <- corr.xy * sigma.x * sigma.y
Sigma.xy <- matrix(c(sigma.x^2, cov.xy, cov.xy, sigma.y^2), nrow = 2, byrow = T)
# generate the data
data <- mvrnorm(1000, mu = c(mu.x, mu.y), Sigma = Sigma.xy)
colnames(data) <- c("X", "Y")
head(data)
##            X       Y
## [1,]  2.1630  4.9775
## [2,] -1.0629 -2.6492
## [3,] -1.0180 -1.2755
## [4,] -0.3050 -0.9525
## [5,] -0.7407 -2.2278
## [6,] -1.5636 -1.6265
df = data.frame(data)

Let's have a quick check on the generated data

# mean and variance of X, should be close to 0 and 2
mean(data[, 1])
## [1] -0.007433
var(data[, 1])
## [1] 1.918

# mean and variance of Y, should be close to 0 and 4
mean(data[, 2])
## [1] 0.01335
var(data[, 2])
## [1] 3.915

# correlation matrix of X and Y, off-diagonals should be close to 0.8
cor(data)
##       X     Y
## X 1.000 0.783
## Y 0.783 1.000

Now, let's plot the generated random variables X and Y.

library(ggplot2)
library(grid)

g = ggplot(df, aes(X, Y)) + geom_point() + geom_point(data = NULL, aes(x = 0, 
    y = 0, colour = "red"), show_guide = FALSE)
plot(g)

(plot: scatter of the generated X and Y, with the origin marked in red)

Assume that we don't know the underlying distribution of X and Y, and we want to find out the common risk factors of X and Y. PCA is one solution.

The first step in performing PCA is to find the covariance matrix \( \Sigma \) of the variables (or the correlation matrix), followed by finding its eigenvalues and eigenvectors. In R, the covariance of X and Y can be obtained by:

Sigmahat = cov(data)
e = eigen(Sigmahat)
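In matrix terms, what eigen() returns is the spectral decomposition of the covariance matrix:

\[ \Sigma = W \Lambda W', \qquad \Lambda = \left[\begin{array}{cc} \lambda_1 & 0 \\ 0 & \lambda_2 \end{array}\right], \qquad W'W = I \]

where the columns of \( W \) are the eigenvectors and \( \lambda_1 \ge \lambda_2 \ge 0 \) are the eigenvalues.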

The two eigenvalues (\( \lambda_1 \), \( \lambda_2 \)) and their eigenvectors (\( w_1, w_2 \)) are:

e$values[1]
## [1] 5.282
e$values[2]
## [1] 0.5498

e$vectors[, 1]
## [1] 0.5376 0.8432
e$vectors[, 2]
## [1] -0.8432  0.5376

Next, we rank the eigenvalues from largest to smallest and sort the corresponding eigenvectors to create a matrix \( W \). (Actually, R has sorted the result for you already; the following code is NOT necessary and is for illustration purposes only.)

s = sort(e$values, decreasing = T, index.return = T)
# find out the dimension of eignvectors
n = length(e$vectors[, 1])
W = matrix(nrow = n, ncol = n)
for (i in 1:n) {
    W[, i] = e$vectors[, s$ix[i]]
}
W
##        [,1]    [,2]
## [1,] 0.5376 -0.8432
## [2,] 0.8432  0.5376

Principal Component Representation

We collect the observations of X and Y in a matrix \( D \); the principal components with respect to the covariance matrix \( \Sigma \) are defined by: \[ P = DW \]

D <- data
P <- D %*% W

The total variation in the data \( D \) is the sum of the eigenvalues of \( \Sigma \), which is \( \lambda_1+\lambda_2 \). The proportion of total variation explained by the \( m \)th principal component is \( \frac{\lambda_m}{\sum_i \lambda_i} \). In our case, the 1st principal component explains about 90.57 percent of the total variation of the data.

e$values[1]/(e$values[1] + e$values[2])
## [1] 0.9057

So, to reduce the dimension of our data, we can reconstruct the data \( D \) by multiplying \( P \) by the transpose of the orthogonal eigenvector matrix \( W \) (\( W^{-1} = W' \)): \[ D = PW^{-1} =PW' \] We take the first \( k \) columns of \( P \) as \( P^* \) and the first \( k \) columns of \( W \) as \( W^* \): \[ D \approx P^* W^{*'} \]

In our example:

\( D = PW' \): consider the 1st and 2nd rows of \( D \):

\[ [\begin{array}{cc} 2.1630 & 4.9774\end{array}] = [\begin{array}{cc} 5.3598 & 0.8519 \end{array}] \left[\begin{array}{cc} 0.5375 & 0.8432 \\ -0.8432 & 0.5375 \end{array} \right] \]

\[ [\begin{array}{cc} -1.0629 & -2.6492\end{array}] = [\begin{array}{cc} -2.8052 & -0.5279 \end{array}] \left[\begin{array}{cc} 0.5375 & 0.8432 \\ -0.8432 & 0.5375 \end{array} \right] \]

\( D \approx P^*W^{*'} \): consider the 1st and 2nd rows of \( D \):

\[ [\begin{array}{cc} 2.1630 & 4.9774\end{array}] \approx [5.3598] * [\begin{array}{cc} 0.5375 & 0.8432 \end{array} ] \approx [\begin{array}{cc} 2.8809 & 4.5194 \end{array} ] \]

\[ [\begin{array}{cc} -1.0629 & -2.6492\end{array}] \approx [-2.8052] * [\begin{array}{cc} 0.5375 & 0.8432 \end{array}] \approx [\begin{array}{cc} -1.5078 & -2.3653 \end{array} ] \]

The matrices W and W* are:

W
##        [,1]    [,2]
## [1,] 0.5376 -0.8432
## [2,] 0.8432  0.5376
as.matrix(W[, 1])
##        [,1]
## [1,] 0.5376
## [2,] 0.8432
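The reconstruction identities above can also be checked numerically; a self-contained sketch with its own small illustrative data set (not the X/Y series above):

```r
# small illustrative data matrix with correlated columns
set.seed(42)
D <- matrix(rnorm(20), nrow = 10, ncol = 2) %*% matrix(c(1, 0.8, 0, 0.6), 2, 2)
e <- eigen(cov(D))
W <- e$vectors               # orthogonal: W %*% t(W) = I
P <- D %*% W                 # principal components
all.equal(P %*% t(W), D)     # TRUE: exact reconstruction, since W^{-1} = W'
# rank-1 approximation keeping only the first component
D.approx <- P[, 1, drop = FALSE] %*% t(W[, 1, drop = FALSE])
```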

PCA for 10 Random Variables with 1,000 Observations

PCA on 2 variables may not convince you that it is a useful technique, so let's have a look at an example with 10 RVs; you may change your mind :)

Since PCA is most useful for highly correlated data, we generate 10 RVs with random variances and high correlations among them.

library("MASS")
set.seed(7)

# limit sigma between 2 to 6, mu between -5 to 5
sigma.10 <- runif(10, 2, 6)
mu.10 <- runif(10, -5, 5)
# correlations are set to be high, from 0.9 to 1
corr.data <- runif((10 * 9)/2, 0.9, 1)


corr.mat <- diag(10)
corr.mat[lower.tri(corr.mat)] <- corr.data
corr.mat <- corr.mat %*% t(corr.mat)  # build a positive definite matrix from its lower-triangular factor
var.mat <- diag(10) * sigma.10
cov.mat <- var.mat %*% corr.mat %*% var.mat

D <- mvrnorm(1000, mu = mu.10, Sigma = cov.mat)
colnames(D) <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10")

We now have 1000 observations of random variables X1, X2, …, X10, the first 6 rows of the observations are:

head(D)
##            X1      X2      X3       X4      X5       X6     X7       X8
## [1,]  -4.6118  -8.206 -3.7575 -12.4149 -8.3256 -13.3406 -3.575 -10.7387
## [2,]   3.1745  -2.198  6.5436   0.2722  5.9752   3.1262 -1.500 -14.9713
## [3,]  -0.2836   6.053  9.8119   0.4989  5.4662  -0.3006  2.764   0.9408
## [4,] -10.4104 -15.783 -1.2564  -9.9002 -8.9553 -16.8600 -4.604   3.3858
## [5,]   3.9650  -5.453 -1.1502  -6.4966 -7.8201 -11.3640 -3.997  -7.4090
## [6,]  -7.3194  -6.590 -0.4325  -6.3764  0.4998  -3.2336 -5.920  -9.4978
##         X9       X10
## [1,] 4.446   4.46822
## [2,] 3.341   2.96328
## [3,] 1.167 -11.08713
## [4,] 6.911  -1.26148
## [5,] 3.730  -0.89635
## [6,] 6.476   0.04366

Now, we apply PCA to reduce the dimension of the data

Sigmahat = cov(D)
e = eigen(Sigmahat)
W = cbind(e$vectors)

Let's examine the eigenvalues

e$values
##  [1] 691.743  58.670  30.988  16.657  11.989   6.396   4.847   3.054
##  [9]   3.007   1.593

With the first 4 eigenvectors, we can explain about 96.27 percent of the variation of the data D.

sum(e$values[1:4])/sum(e$values) * 100
## [1] 96.27

Let's take the first 4 columns of \( P \) and \( W \) as \( P^* \) and \( W^* \):

P = D %*% W

D.hat = P[, 1:4] %*% t(W[, 1:4])
colnames(D.hat) <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10")

Now, let's look at the difference between the original data \( D \) and the reconstructed data \( \hat{D} \):

head(D)
##            X1      X2      X3       X4      X5       X6     X7       X8
## [1,]  -4.6118  -8.206 -3.7575 -12.4149 -8.3256 -13.3406 -3.575 -10.7387
## [2,]   3.1745  -2.198  6.5436   0.2722  5.9752   3.1262 -1.500 -14.9713
## [3,]  -0.2836   6.053  9.8119   0.4989  5.4662  -0.3006  2.764   0.9408
## [4,] -10.4104 -15.783 -1.2564  -9.9002 -8.9553 -16.8600 -4.604   3.3858
## [5,]   3.9650  -5.453 -1.1502  -6.4966 -7.8201 -11.3640 -3.997  -7.4090
## [6,]  -7.3194  -6.590 -0.4325  -6.3764  0.4998  -3.2336 -5.920  -9.4978
##         X9       X10
## [1,] 4.446   4.46822
## [2,] 3.341   2.96328
## [3,] 1.167 -11.08713
## [4,] 6.911  -1.26148
## [5,] 3.730  -0.89635
## [6,] 6.476   0.04366
head(D.hat)
##             X1     X2     X3     X4      X5      X6     X7      X8      X9
## [1,]  -7.51019 -6.290 -5.389 -5.777  -8.319 -13.837 -7.337  -8.501  0.4882
## [2,]   1.83168  1.640  1.991  2.595   3.742   4.632 -2.706 -13.709 -1.4306
## [3,]   4.55053  3.337  2.220  1.886   2.290   3.053  1.767   1.893 -3.1759
## [4,] -12.21744 -9.699 -7.920 -8.310 -11.048 -15.698 -3.805   4.080  1.6485
## [5,]  -0.02942 -1.028 -2.337 -3.491  -6.366 -12.421 -6.164  -5.554 -0.9276
## [6,]  -7.54581 -5.473 -3.311 -2.639  -2.574  -3.151 -3.451  -8.756 -0.6881
##          X10
## [1,]  5.6071
## [2,]  4.6427
## [3,] -9.4474
## [4,]  0.6940
## [5,]  0.6069
## [6,]  2.6988

And let's plot the first 4 eigenvectors

Final remark

Actually, R has a built-in function 'princomp' for performing PCA, but I think it is worthwhile to learn PCA by brute force programming first.

p = princomp(D)
plot(p)

(plot: variances of the principal components from princomp)

summary(p)
## Importance of components:
##                         Comp.1  Comp.2  Comp.3  Comp.4  Comp.5   Comp.6
## Standard deviation     26.2879 7.65580 5.56387 4.07930 3.46075 2.527669
## Proportion of Variance  0.8345 0.07078 0.03738 0.02009 0.01446 0.007715
## Cumulative Proportion   0.8345 0.90527 0.94265 0.96274 0.97720 0.984920
##                          Comp.7   Comp.8   Comp.9  Comp.10
## Standard deviation     2.200474 1.746596 1.733065 1.261686
## Proportion of Variance 0.005847 0.003684 0.003627 0.001922
## Cumulative Proportion  0.990767 0.994451 0.998078 1.000000

What we have done so far: we reduced the dimensionality of the data while preserving around 96% of the variability of the original data.
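The ~96% figure is not specific to how we measured it: for centered data, the fraction of squared variation retained by a rank-4 reconstruction equals exactly the ratio of the first 4 eigenvalues to their total. A small self-contained check on synthetic data (names mirror the post's \( D \), \( W \), \( P \)):

```r
set.seed(1)
D <- scale(matrix(rnorm(100 * 10), 100, 10), scale = FALSE)  # centered data
e <- eigen(cov(D))
W <- e$vectors                        # orthonormal eigenvectors as columns
P <- D %*% W                          # principal component scores
D.hat <- P[, 1:4] %*% t(W[, 1:4])     # rank-4 reconstruction

# fraction of squared variation retained by the reconstruction
retained <- 1 - sum((D - D.hat)^2) / sum(D^2)

# equals the eigenvalue ratio used in the post
all.equal(retained, sum(e$values[1:4]) / sum(e$values))  # TRUE
```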

I'll show you how PCA can be used in finance in a future post. Let's take a break now.

Wednesday, 28 August 2013

Basic Linear Regression - Part 2

Linear Regression - Introduction

Linear Regression - Part 2

Have you had your coffee yet? Okay, let's continue.

2. Goodness-of-fit

To measure how well our linear regression model fits the data, we compute the R-squared, \( R^2 \):

\[ R^2 = \frac{\hat{V}\{\hat{y}\}} {\hat{V}\{y\}} \]

where \( \hat{V}() \) denotes the sample variance

\[ R^2=\frac{1/(N-1) \sum_{i=1}^{N}(\hat{y_i}-\bar{y})^2}{1/(N-1) \sum_{i=1}^{N}(y_i-\bar{y})^2} \]

Define:
\( TSS \) = Total Sum of Squares = \( \sum_{i=1}^{N}(y_i - \bar{y})^2 \)
\( SSR \) = Regression Sum of Squares = \( \sum_{i=1}^{N}(\hat{y_i} - \bar{y})^2 \)
\( SSE \) = Error Sum of Squares = \( \sum_{i=1}^{N}(y_i - \hat{y_i})^2 \)

and \( TSS = SSR + SSE \)
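The identity \( TSS = SSR + SSE \) holds because, when the regression includes an intercept, the OLS residuals sum to zero and are orthogonal to the fitted values:

\[ \sum_{i=1}^{N}(y_i - \bar{y})^2 = \sum_{i=1}^{N}(\hat{y_i} - \bar{y} + e_i)^2 = \sum_{i=1}^{N}(\hat{y_i} - \bar{y})^2 + \sum_{i=1}^{N}e_i^2 + 2\sum_{i=1}^{N}e_i(\hat{y_i} - \bar{y}) \]

and the cross term vanishes since \( \sum_{i=1}^{N} e_i \hat{y_i} = 0 \) and \( \sum_{i=1}^{N} e_i = 0 \). Hence: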

\[ R^2 = \frac{SSR}{TSS} \]
\[ R^2 = 1 - \frac{SSE}{TSS} \]
\[ R^2 = 1 - \frac{\hat{V}\{e\}}{\hat{V}\{y\}} = 1 - \frac{\frac{1}{N-1} \sum_{i=1}^{N} e_{i}^{2}}{\frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y})^2} \]

The \( R^2 \) measure has a drawback: it never decreases when a regressor is added, even if the new regressor has no explanatory power. To compensate, the adjusted \( R^2 \), denoted \( \bar{R}^2 \), is used:

\[ \bar{R}^2 = 1 - \frac{\frac{1}{N-K-1} \sum_{i=1}^{N} e_{i}^{2}}{\frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y})^2} \]

where \( K \) is the number of regressors (excluding the intercept).
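These formulas can be sanity-checked against R's built-in lm() on a toy data set (a hypothetical x and y, not the post's data):

```r
set.seed(1)
N <- 50; K <- 1                       # one regressor plus an intercept
x <- rnorm(N)
y <- 2 + 3 * x + rnorm(N)

fit <- lm(y ~ x)
e <- resid(fit)
TSS <- sum((y - mean(y))^2)

R2    <- 1 - sum(e^2) / TSS
adjR2 <- 1 - (sum(e^2) / (N - K - 1)) / (TSS / (N - 1))

# both differences should be essentially zero
c(R2 - summary(fit)$r.squared, adjR2 - summary(fit)$adj.r.squared)
```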

Now, let's modify the regression and its summary functions by incorporating the \( R^2 \) and \( \bar{R}^2 \) measures:


lr <- function(X, y, method = "OLS", intercept = T) {
    X = as.matrix(X)
    K = ncol(X)
    N = nrow(X)
    beta = matrix()
    result = list()


    if (intercept == T) {
        # reset the length of vector beta
        K = K + 1
        X = cbind(1, X)
    }

    beta = solve(t(X) %*% X) %*% t(X) %*% y

    # calculate the residuals
    resid = y - X %*% beta

    # calculate the variance of residuals
    residvar = (t(resid) %*% resid)/(N - K)
    # convert it to value
    residvar = residvar[1, 1]

    # calculate R-squared and adjusted R-squared; first the variance of the
    # residuals (not adjusted for the no. of regressors)
    SSE = (t(resid) %*% resid)/(N - 1)
    SSE = SSE[1, 1]
    adjSSE = residvar
    TSS = t(y - mean(y)) %*% (y - mean(y))/(N - 1)
    TSS = TSS[1, 1]
    Rsquared = 1 - (SSE/TSS)
    adjRsquared = 1 - (adjSSE/TSS)

    betacovar = solve(t(X) %*% X) * residvar

    # assign value to the list result and return it
    result$beta$val = beta
    result$resid$val = resid
    result$resid$var = residvar
    result$beta$covar = betacovar
    result$beta$var = diag(betacovar)
    result$beta$se = sqrt(diag(betacovar))
    result$hasIntercept = intercept
    result$R = Rsquared
    result$adjR = adjRsquared
    result
}
lr.summary <- function(result) {
    displayAdjust = 0
    K = length(result$beta$val)

    cat("Regression Results:\n\n")
    cat("Residuals Summary:\n")
    residsummary = summary(result$resid$val)
    print(residsummary)
    cat("\n")
    cat("Coefficients:\n")

    text = sprintf("%30s%15s\n", "Values", "Std. Error")
    cat(text)

    for (i in 1:K) {
        if (result$hasIntercept && (i == 1)) {
            text = sprintf("%-15s%15.6f%15.6f\n", "(Intercept)", result$beta$val[1], 
                result$beta$se[1])
            cat(text)
            displayAdjust = 1
        } else {
            tcoef = sprintf("beta%d", i - displayAdjust)
            text = sprintf("%-15s%15.6f%15.6f\n", tcoef, result$beta$val[i], 
                result$beta$se[i])
            cat(text)
        }

    }

    text = sprintf("\nR-squared: %19.4f\n", result$R)
    cat(text)
    text = sprintf("Adjusted R-squared: %10.4f\n", result$adjR)
    cat(text)
}

Run again:

results = lr(X, y)
lr.summary(results)
## Regression Results:
## 
## Residuals Summary:
##        V1           
##  Min.   :-8.53e-14  
##  1st Qu.: 2.49e-14  
##  Median : 5.68e-14  
##  Mean   : 5.90e-14  
##  3rd Qu.: 9.95e-14  
##  Max.   : 2.02e-13  
## 
## Coefficients:
##                         Values     Std. Error
## (Intercept)          12.000000       0.000000
## beta1                 6.000000       0.000000
## beta2                 4.000000       0.000000
## beta3                 9.000000       0.000000
## 
## R-squared:              1.0000
## Adjusted R-squared:     1.0000

Testing the coefficients

Remember that we estimated the variance of the residuals \( \sigma^2 \) by \( s^2 \):

\[ s^2 = \frac{1}{N-K-1} \sum_{i=1}^{N}e_i^2 \]

From the Gauss-Markov assumptions, the residuals are normally distributed, and the sum of squared residuals scaled by \( \sigma^2 \) follows a Chi-squared distribution with \( N-K-1 \) degrees of freedom:

\[ \frac{(N-K-1) s^2}{\sigma^2} \sim \chi_{N-K-1}^2 \]

This comes from the properties of the Chi-squared distribution: if \( z_1, \ldots, z_J \) are independent normal variables with mean \( \mu \) and variance \( \sigma^2 \), then \( \sum_{j=1}^J \frac{(z_j-\mu)^2}{\sigma^2} \sim \chi_J^2 \), with \( E\left(\sum_{j=1}^J \frac{(z_j-\mu)^2}{\sigma^2}\right) = J \) and \( V\left(\sum_{j=1}^J \frac{(z_j-\mu)^2}{\sigma^2}\right) = 2J \).

Thus, with the residuals having zero mean (i.e. \( \mu = 0 \)), and \( K+1 \) degrees of freedom used up in estimating the coefficients, it follows that
\[ \xi = \frac{(N-K-1) s^2}{\sigma^2} = \sum_{i=1}^{N}\frac{e_i^2}{\sigma^2} \sim \chi_{N-K-1}^2 \]
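A small Monte Carlo simulation illustrates this result: the mean of \( \xi \) should be close to \( N-K-1 \) and its variance close to \( 2(N-K-1) \). This is a sketch with hypothetical values of N, K and \( \sigma \):

```r
set.seed(42)
N <- 30; K <- 2; sigma <- 1.5
X <- cbind(1, matrix(rnorm(N * K), N, K))   # design matrix with intercept

xi <- replicate(5000, {
  y <- X %*% c(1, 2, -1) + rnorm(N, sd = sigma)
  e <- lm.fit(X, y)$residuals               # OLS residuals
  sum(e^2) / sigma^2                        # = (N-K-1) * s^2 / sigma^2
})

# should be close to N-K-1 = 27 and 2*(N-K-1) = 54 respectively
c(mean(xi), var(xi))
```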

Now let's turn to the definition of the t-distribution: if a variable \( X \) has a standard normal distribution, \( X \sim N(0,1) \), and \( \xi \sim \chi_J^2 \) is independent of \( X \), then the following ratio has a t-distribution with \( J \) degrees of freedom:

\[ t = \frac{X}{\sqrt{\xi/J}} \]

In our case, to test whether the coefficient \( \beta_k \) is significant or not, we let \( X = \frac{\hat{\beta}_k - \beta_k}{\sqrt{V(\hat{\beta}_k)}} \). As \( V(\hat{\beta}) = \sigma^2 (X'X)^{-1} \), let \( c_{kk} \) be the \( (k,k) \) element of the matrix \( (X'X)^{-1} \). The variance of the estimator \( \hat{\beta}_k \) is then \( \sigma^2 c_{kk} \), and its standard deviation is \( \sigma \sqrt{c_{kk}} \).

So, \( X = \frac{\hat{\beta}_k - \beta_k}{\sigma \sqrt{c_{kk}}} \), and with \( \xi \) from above and \( J = N-K-1 \):

\[ t = \frac{X}{\sqrt{\xi/(N-K-1)}} \]
\[ t = \frac{\hat{\beta}_k - \beta_k}{\sigma \sqrt{c_{kk}}} \Big/ \frac{s}{\sigma} \]
\[ t = \frac{\hat{\beta}_k - \beta_k}{s \sqrt{c_{kk}}} \]
\[ t = \frac{\hat{\beta}_k - \beta_k}{se(\hat{\beta}_k)} \]

The value \( t \) has a t-distribution with \( N-K-1 \) degrees of freedom.

To test whether the coefficient \( \beta_k \) is significantly different from zero, we will test:

\[ t_k = \frac{\hat{\beta_k}-0}{se(\hat{\beta_k})} \]
with \( N-K-1 \) degrees of freedom.
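This t statistic is exactly what R's summary.lm reports, and the corresponding two-sided p-value follows from the t-distribution with \( N-K-1 \) degrees of freedom. A quick check on toy data (again hypothetical, not the post's X and y):

```r
set.seed(7)
x <- rnorm(40)
y <- 1 + 0.5 * x + rnorm(40)

fit <- lm(y ~ x)
ct <- summary(fit)$coefficients

tval <- ct[, "Estimate"] / ct[, "Std. Error"]     # beta_hat / se(beta_hat)
pval <- 2 * pt(-abs(tval), df = fit$df.residual)  # two-sided p-value

all.equal(unname(tval), unname(ct[, "t value"]))  # TRUE
```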


lr <- function(X, y, method = "OLS", intercept = T) {
    X = as.matrix(X)
    K = ncol(X)
    N = nrow(X)
    beta = matrix()
    result = list()


    if (intercept == T) {
        # reset the length of vector beta
        K = K + 1
        X = cbind(1, X)
    }

    beta = solve(t(X) %*% X) %*% t(X) %*% y

    # calculate the residuals
    resid = y - X %*% beta

    # calculate the variance of residuals
    residvar = (t(resid) %*% resid)/(N - K)
    # convert it to value
    residvar = residvar[1, 1]

    # calculate R-squared and adjusted R-squared; first the variance of the
    # residuals (not adjusted for the no. of regressors)
    SSE = (t(resid) %*% resid)/(N - 1)
    SSE = SSE[1, 1]
    adjSSE = residvar
    TSS = t(y - mean(y)) %*% (y - mean(y))/(N - 1)
    TSS = TSS[1, 1]
    Rsquared = 1 - (SSE/TSS)
    adjRsquared = 1 - (adjSSE/TSS)

    betacovar = solve(t(X) %*% X) * residvar





    # assign value to the list result and return it
    result$beta$val = beta
    result$resid$val = resid
    result$resid$var = residvar
    result$beta$covar = betacovar
    result$beta$var = diag(betacovar)
    result$beta$se = sqrt(diag(betacovar))
    # calculate t value for all the coefficients
    result$beta$t = beta/sqrt(diag(betacovar))
    result$hasIntercept = intercept
    result$R = Rsquared
    result$adjR = adjRsquared
    result
}
lr.summary <- function(result) {
    displayAdjust = 0
    K = length(result$beta$val)

    cat("Regression Results:\n\n")
    cat("Residuals Summary:\n")
    residsummary = summary(result$resid$val)
    print(residsummary)
    cat("\n")
    cat("Coefficients:\n")

    text = sprintf("%30s%15s%15s\n", "Values", "Std. Error", "t")
    cat(text)

    for (i in 1:K) {
        if (result$hasIntercept && (i == 1)) {
            text = sprintf("%-15s%15.6f%15.6f%15.6f\n", "(Intercept)", result$beta$val[1], 
                result$beta$se[1], result$beta$t[1])
            cat(text)
            displayAdjust = 1
        } else {
            tcoef = sprintf("beta%d", i - displayAdjust)
            text = sprintf("%-15s%15.6f%15.6f%15.6f\n", tcoef, result$beta$val[i], 
                result$beta$se[i], result$beta$t[i])
            cat(text)
        }

    }

    text = sprintf("\nR-squared: %19.4f\n", result$R)
    cat(text)
    text = sprintf("Adjusted R-squared: %10.4f\n", result$adjR)
    cat(text)
}

Run again:

results = lr(X, y)
lr.summary(results)
## Regression Results:
## 
## Residuals Summary:
##        V1           
##  Min.   :-8.53e-14  
##  1st Qu.: 2.49e-14  
##  Median : 5.68e-14  
##  Mean   : 5.90e-14  
##  3rd Qu.: 9.95e-14  
##  Max.   : 2.02e-13  
## 
## Coefficients:
##                         Values     Std. Error              t
## (Intercept)          12.000000       0.000000 415445722959532.375000
## beta1                 6.000000       0.000000 2012432297973165.250000
## beta2                 4.000000       0.000000 1328501284142088.000000
## beta3                 9.000000       0.000000 3244399974637064.500000
## 
## R-squared:              1.0000
## Adjusted R-squared:     1.0000