Saturday, 18 January 2014

Difference Equation and Stationarity of Time Series Process

Many time series processes are expressed as difference equations. So, what is a difference equation?

Let's have a look at the following difference equation:

\[ z_t - \phi_1 z_{t-1} - \phi_2 z_{t-2} = e_t \]

The difference equation itself describes a time series process in which the present value depends on its own past values. Using the lag operator \( L \), where \( L y_t = y_{t-1} \), we can rewrite the difference equation compactly by defining \( \phi(L)=1-\phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p \).

Now, for the difference equation \( \phi(L)z_t = e_t \), we have the following to remember:

Lemma 1:

If \( z_{t}^{(1)} \) and \( z_{t}^{(2)} \) are homogeneous solutions (which means \( e_t = 0 \)),

then \( (b_{1}z_{t}^{(1)}+b_2 z_{t}^{(2)}) \) is also a solution for arbitrary constants \( b_1 \) and \( b_2 \).

Lemma 2:

If \( z_{t}^{(P)} \) is the particular solution and \( z_{t}^{(H)} \) is the homogeneous solution,

then \( (z_{t}^{(P)}+ z_{t}^{(H)}) \) is the general solution.

Lemma 3:

If \( (1-L)^m z_{t}= 0 \),

then a solution is: \( z_t = b t^j \)

where \( b \) is an arbitrary constant and \( 0 \leqq j < m \).

The general solution is: \( z_t = \sum_{j=0}^{m-1}b_j t^j \)

Lemma 4:

If \( (1-RL)^m z_t = 0 \)

then a solution is: \( z_t = b\, t^j R^t \), where \( b \) is an arbitrary constant and \( 0 \leqq j < m \).

The general solution is \( z_t = (\sum_{j=0}^{m-1}b_j t^j) R^t \)
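These lemmas are easy to verify numerically. Below is a minimal R sketch (the constants b0 = 2, b1 = 0.7, b = 3 and R = 0.9 are arbitrary choices) checking Lemma 3 with \( m = 2 \) and Lemma 4 with \( m = 1 \):

t <- 1:20
z3 <- 2 + 0.7 * t  # Lemma 3: z_t = b0 + b1*t should solve (1-L)^2 z_t = 0
diff(z3, differences = 2)  # second differences are all zero
R <- 0.9
z4 <- 3 * R^t  # Lemma 4: z_t = b*R^t should solve (1 - R*L) z_t = 0
z4[-1] - R * z4[-length(z4)]  # z_t - R*z_{t-1} is zero (up to rounding)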

Stationarity Conditions:

Now, it's time to talk about the conditions for a time series difference equation to be stationary.

Consider the following AR(2) time series process:

\[ z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + e_t \] \[ (1-\phi_1 L - \phi_2 L^2)z_t = e_t \]

If \( e_t \) is a random error with mean zero and we take the expectation of both sides, the process becomes:

\[ (1-\phi_1 L - \phi_2 L^2)z_t = 0 \]

Which is a difference equation.

If \( z_t \) is not equal to zero, the first factor should be:

\[ (1-\phi_1 L-\phi_2 L^2) = 0 \] \[ (\phi_2 L^2 + \phi_1 L - 1) = 0 \]

The roots of the equation \( \phi(L)=0 \) are:

\[ L = \frac{-\phi_1 \pm \sqrt{\phi_{1}^2 + 4 \phi_2}}{2\phi_2} \]

let the roots be \( r_1 \) and \( r_2 \), then: \[ (\phi_2 L^2 + \phi_1 L - 1) = \phi_2(L-r_1)(L-r_2)=0 \]

Dividing both sides by \( r_1 \) and \( r_2 \): \[ (\frac{L}{r_1}-1)(\frac{L}{r_2}-1)=0 \] \[ (1-\frac{L}{r_1})(1-\frac{L}{r_2})=0 \]

From lemma 4, if the difference equation expressed in the form of \( (1-RL)^m z_t = 0 \), then a solution is: \( z_t = t^j R^t \)

Now, let's think about stationarity: if \( |R|>1 \), the series is explosive, since the solution \( z_t = t^j R^t \) blows up as \( t \) grows.

So the necessary condition for a stationary series is \( |R|<1 \), which means the inverses of the roots of the polynomial \( \phi(L) \) should be smaller than 1 in modulus; put another way, the roots should be greater than 1 in modulus.

Let two roots be \( r_1 \) and \( r_2 \).

If the roots are complex:

\[ r_1 = a + bi \] \[ r_2 = a - bi \]

We can express the complex number in polar form:

\[ a + bi = r(\cos\theta + i\sin\theta) \] \[ a - bi = r(\cos\theta - i\sin\theta) \]

Graphically,

[Figure: the complex conjugate roots \( a \pm bi \) shown in the complex plane]

If the roots are complex, from lemma 1 and 4:

\[ r_1 = a + bi = r(\cos\theta + i\sin\theta) \] \[ r_2 = a - bi = r(\cos\theta - i\sin\theta) \]

\[ z_t = b_{01}\frac{1}{r_1^t} + b_{02}\frac{1}{r_2^t} \]

Because \( (r(\cos\theta - i\sin\theta))^k = r^k(\cos k\theta - i\sin k\theta) \)

So, \( r \) must be larger than 1 in order to make the process stationary, which is why the stationarity condition is often stated as: the roots of the equation \( \phi(L)=0 \) must lie outside the unit circle.

Now, consider the roots again: \[ L = \frac{-\phi_1 \pm \sqrt{\phi_{1}^2 + 4 \phi_2}}{2\phi_2} \]

Since \( |r_1| \) and \( |r_2| \) should be greater than 1:

\[ |\frac{1}{r_1} \times \frac{1}{r_2}| = |\phi_2| < 1 \]

and

\[ |\frac{1}{r_1} + \frac{1}{r_2}| = |\phi_1| < 2 \]

So, regardless of whether the roots are real or complex, two necessary conditions for stationarity are:

  1. \( -1<\phi_2<1 \)
  2. \( -2<\phi_1<2 \)

In addition, the roots must not fall on or inside the unit circle. Since \( \phi(0)=1>0 \), ruling out a real root in \( [-1,1] \) means \( \phi(L) \) must stay positive at the boundary points \( L = 1 \) and \( L = -1 \):

\[ \phi(1) = 1 - \phi_1 - \phi_2 > 0 \quad \Rightarrow \quad \phi_1 + \phi_2 < 1 \]

\[ \phi(-1) = 1 + \phi_1 - \phi_2 > 0 \quad \Rightarrow \quad \phi_2 - \phi_1 < 1 \]

For real roots, \( \phi_{1}^2 + 4 \phi_2 \geq 0 \); for complex roots, \( \phi_{1}^2 + 4 \phi_2 < 0 \).

[Figure: the stationarity region in the \( (\phi_1, \phi_2) \) plane]
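As a quick numerical illustration (the coefficients \( \phi_1 = 0.5 \) and \( \phi_2 = 0.3 \) are arbitrary choices picked inside the stationary region), the roots of \( \phi(L)=0 \) can be checked in R with polyroot:

phi1 <- 0.5
phi2 <- 0.3
roots <- polyroot(c(1, -phi1, -phi2))  # solves 1 - phi1*L - phi2*L^2 = 0
Mod(roots)  # both moduli should exceed 1 for stationarity
phi1^2 + 4 * phi2  # positive: real roots; negative: complex roots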

Examples:

The following examples are extracted from Wei's textbook:

E.g.1

\[ z_t-2 z_{t-1}+z_{t-2}=0 \] \[ (1-2L+L^2)z_t = 0 \] \[ (1-L)^2 z_t = 0 \]

From lemma 3, the general solution is:

\[ z_t = b_0 + b_1 t \]

E.g.2

\[ z_t - 2z_{t-1}+1.5z_{t-2}-0.5z_{t-3}=0 \] \[ (1-2L+1.5L^2 -0.5L^3)z_t = 0 \] \[ (1-L+0.5L^2)(1-L)z_t=0 \]

The roots of \( (1-L+0.5L^2)=0 \) are: \[ r_1 = 1 + 1i \] \[ r_2 = 1 - 1i \]

Inverse of the roots are:

\[ R = 0.5 \pm 0.5 i \]

\[ r = \sqrt{0.5^2 + 0.5^2} = \sqrt{0.5} \] \[ \theta = \tan^{-1}(0.5/0.5) = \frac{\pi}{4} \]

The general solution is:

\[ (1-L+0.5L^2)(1-L)z_t=0 \]

From lemma 3 and lemma 4:

\[ z_t = b_0 + b_1 \sqrt{0.5}^t \cos(\tfrac{\pi}{4}t) + b_2 \sqrt{0.5}^t \sin(\tfrac{\pi}{4}t) \]
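A quick numerical check of E.g.2 in R (just a sketch to confirm the roots and their polar form):

polyroot(c(1, -1, 0.5))  # roots of 1 - L + 0.5*L^2: 1 + 1i and 1 - 1i
R <- 1/polyroot(c(1, -1, 0.5))  # inverse roots, 0.5 -/+ 0.5i
Mod(R)  # modulus r = sqrt(0.5)
Arg(R)  # argument theta = -/+ pi/4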

Condition for Stationarity Using the Companion Matrix

Using the companion matrix is another convenient way to check whether a time series is stationary.

Consider the following AR(p) series:

\[ y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + e_t \]

We can rewrite the above AR(p) series as an AR(1) series by using the companion matrix \( F \).

Let \( z_t \) be the \( p\times 1 \) column vector stacking the current and lagged values of \( y_t \), \( F \) be the \( p \times p \) companion matrix, and \( v_t \) be the \( p \times 1 \) column vector containing the shock \( e_t \) in its first element and zeros elsewhere.

\[ z_t = \left[ \begin{array}{c} y_{t} \\ y_{t-1} \\ \vdots \\ y_{t-(p-1)} \end{array} \right] \]

\[ F = \left[ \begin{array}{ccccc} \phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_{p} \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{array} \right] \]

\[ v_t = \left[ \begin{array}{c} e_{t} \\ 0 \\ \vdots \\ 0 \end{array} \right] \]

The AR(p) process can then be expressed as an AR(1) process by using the companion matrix \( F \):

\[ z_t = F z_{t-1} + v_t \]

\[ \left[ \begin{array}{c} y_{t} \\ y_{t-1} \\ \vdots \\ y_{t-(p-2)} \\ y_{t-(p-1)} \end{array} \right] = \left[ \begin{array}{ccccc} \phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_{p} \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{array} \right] \left[ \begin{array}{c} y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-(p-1)} \\ y_{t-p} \end{array} \right] + \left[ \begin{array}{c} e_{t} \\ 0 \\ \vdots \\ 0 \\ 0 \end{array} \right] \]

When we recursively substitute for \( z_{t} \):

\[ z_t = F (F z_{t-2} + v_{t-1}) + v_t \] \[ z_t = F^2 z_{t-2} + F v_{t-1} + v_t \] \[ z_t = F^2 (F z_{t-3} + v_{t-2}) + F v_{t-1} + v_t \] \[ z_t = F^3 z_{t-3} + F^2 v_{t-2} + F v_{t-1} + v_t \] \[ z_t = F^t z_{0} + F^{t-1} v_{1} + \cdots + F v_{t-1} + v_t \] \[ z_t = F^{t+1} z_{-1} + F^t v_0 + F^{t-1} v_{1} + \cdots + F v_{t-1} + v_t \]

So, \( z_{t+j} \) can be expressed in:

\[ z_{t+j} = F^{j+1} z_{t-1} + F^j v_t + F^{j-1} v_{t+1} + \cdots + F v_{t+j-1} + v_{t+j} \]

Let us now consider the first row of the above matrix equation:

\[ y_{t+j} = f_{11}^{(j+1)} y_{t-1} + f_{12}^{(j+1)} y_{t-2} + f_{13}^{(j+1)} y_{t-3} + \cdots + f_{1p}^{(j+1)} y_{t-p} + f_{11}^{(j)} e_{t} + f_{11}^{(j-1)} e_{t+1} + \cdots + f_{11}^{(1)} e_{t+j-1} + f_{11}^{(0)} e_{t+j} \]

where \( f_{ij}^{(k)} \) is the element in row \( i \) and column \( j \) of the matrix \( F^k \).

Now, consider how a random shock \( e_t \) affects the time series in the long run:

\[ \frac{\partial y_{t+j}}{\partial e_t} = f_{11}^{(j)} \]

If the shock is permanent (it never dies out), \( f_{11}^{(j)} \) does not shrink to zero as \( j \) grows. So, for a stationary series, \( f_{11}^{(j)} \) should decay towards zero as \( j \) increases.

Calculating the matrix power \( F^j \) can be tedious, but if the eigenvalues of the matrix \( F \) are distinct, we can decompose the matrix \( F \) as:

\[ F = T\Lambda T^{-1} \]

where

\( T \) is a nonsingular matrix (its columns are the eigenvectors of \( F \))

\( \Lambda \) is a (\( p \times p \)) diagonal matrix with the eigenvalues of \( F \) on the principal diagonal \[ \Lambda = \left[ \begin{array}{llll}\lambda_1 & 0 & \cdots &0 \\ 0 & \lambda_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{array} \right] \]

\[ F^2 = T \Lambda T^{-1} T \Lambda T^{-1} = T \Lambda^2 T^{-1} \]

\[ F^j = T \left[ \begin{array}{llll}\lambda_1^j & 0 & \cdots &0 \\ 0 & \lambda_2^j & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_p^j \end{array} \right] T^{-1} \]

Now, let's consider \( f_{11}^{(j)} \) again:

\[ f_{11}^{(j)} = [t_{11} t^{11}] \lambda_1^j + [t_{12} t^{21}] \lambda_2^j + \cdots + [t_{1p} t^{p1}] \lambda_p^j \]

\[ f_{11}^{(j)} = c_1 \lambda_1^j + c_2 \lambda_2^j + \cdots + c_p \lambda_p^j \]

Since \( T T^{-1} \) is an identity matrix, thus

\[ c_1 + c_2 + \cdots + c_p = 1 \]

The effect of a random shock \( e_t \) on the series is a weighted average of the eigenvalues raised to the \( j \)th power:

\[ \frac{\partial y_{t+j}}{\partial e_t} = c_1 \lambda_1^j + c_2 \lambda_2^j + \cdots + c_p \lambda_p^j \]

So, the necessary condition for a time series to be stationary is that all the eigenvalues \( \lambda_i \) are smaller than one in modulus.
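The relation \( f_{11}^{(j)} = c_1 \lambda_1^j + \cdots + c_p \lambda_p^j \) can be verified numerically. Here is a small R sketch for an AR(2) companion matrix (the coefficients \( \phi_1 = 0.5 \), \( \phi_2 = 0.3 \) and the horizon \( j = 5 \) are arbitrary choices):

phi1 <- 0.5
phi2 <- 0.3
F.mat <- matrix(c(phi1, phi2, 1, 0), nrow = 2, byrow = TRUE)  # companion matrix
eig <- eigen(F.mat)
T.mat <- eig$vectors
lambda <- eig$values  # real and distinct for these coefficients
c.wts <- T.mat[1, ] * solve(T.mat)[, 1]  # c_i = t_{1i} * t^{i1}
sum(c.wts)  # equals 1
sum(c.wts * lambda^5)  # weighted average of the eigenvalues to the 5th power
F5 <- F.mat %*% F.mat %*% F.mat %*% F.mat %*% F.mat
F5[1, 1]  # f11^(5) computed directly; should match the line above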

Summary

Now, we have two methods for checking whether a time series process is stationary:

  1. All the roots of the equation \( \phi(L)=0 \) lie outside the unit circle.
  2. All the eigenvalues of its companion matrix \( F \) are smaller than one in modulus.

As finding the roots of a high-degree polynomial equation can be tedious, checking that the eigenvalues satisfy \( |\lambda_i| < 1 \) is the easier way to check stationarity.
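The two checks are really the same check: for an AR process, the eigenvalues of the companion matrix \( F \) are the reciprocals of the roots of \( \phi(L)=0 \). A small R sketch, using the same arbitrary AR(2) coefficients as in the earlier sketch:

phi1 <- 0.5
phi2 <- 0.3
sort(Mod(1/polyroot(c(1, -phi1, -phi2))))  # inverse roots of phi(L)
sort(Mod(eigen(matrix(c(phi1, phi2, 1, 0), 2, byrow = TRUE))$values))  # eigenvalues of F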

Monday, 7 October 2013

A Brief Introduction on Vector Autoregression Model

Vector Autoregressions (VAR)

What is VAR?

The vector autoregression (VAR) model extends the univariate autoregressive model to the multivariate case. It is one of the most widely used models for describing the dynamic interactions between the variables in a system.

VAR Representations

Suppose you want to study the interactions between GDP (\( y_1 \)) and money supply (\( y_2 \)), and you believe that they have contemporaneous effects on each other and that their relationship can be modelled as:

Structural Form:

\[ y_{1,t} = a_1 - b_{12}y_{2,t} + c_{11}y_{1,t-1} + c_{12}y_{2,t-1} + \epsilon_{1,t} \] \[ y_{2,t} = a_2 - b_{21}y_{1,t} + c_{21}y_{1,t-1} + c_{22}y_{2,t-1} + \epsilon_{2,t} \]

The above 2 equations are in structural form, as each equation contains the contemporaneous value of the other variable (through the coefficients \( b_{12} \) and \( b_{21} \)) as a regressor. The error terms of the two equations are uncorrelated (\( cov(\epsilon_{1,t}, \epsilon_{2,t}) = 0 \)): the source of a shock to GDP should not be related to the source of a shock to money supply.

The above 2 equations can be written in matrix form:

\[ \left[ \begin{array}{cc}1 & b_{12} \\ b_{21} & 1 \end{array} \right] \left[ \begin{array}{c}y_{1,t} \\ y_{2,t} \end{array} \right] = \left[ \begin{array}{c}a_1 \\ a_2 \end{array} \right] + \left[ \begin{array}{cc}c_{11} & c_{12} \\ c_{21} & c_{22} \end{array} \right] \left[ \begin{array}{c}y_{1,t-1} \\ y_{2,t-1} \end{array} \right] + \left[ \begin{array}{c}\epsilon_{1,t} \\ \epsilon_{2,t} \end{array} \right] \]

\[ By_t = a + C y_{t-1} + \epsilon_t \]

What do the coefficients mean?

The coefficients \( b_{12} \) and \( b_{21} \) represent the interactions between \( y_1 \) and \( y_2 \): if \( b_{12} \neq 0 \), then \( y_2 \) has a contemporaneous effect on \( y_1 \). The same applies to \( b_{21} \).

Now, a shock to GDP (\( \epsilon_{1,t} \)) affects \( y_{1,t} \) and subsequently affects \( y_{2,t} \), so \( \epsilon_{1,t} \) is correlated with \( y_{2,t} \). Because of this, the two equations above can NOT be estimated by OLS, as that violates one of the Gauss-Markov assumptions - that the error term and the regressors are independent. (Please refer to my previous post on OLS.)

Rewrite the equations in Reduced Form

The problem with the structural form VAR is that it cannot be estimated by OLS; thus, you may be tempted to rewrite it as a reduced form VAR that can be estimated by OLS:

Assuming the matrix \( B = \left[ \begin{array}{cc}1 & b_{12} \\ b_{21} & 1 \end{array} \right] \) can be inverted: \( B^{-1} = \frac{1}{\Delta}\left[\begin{array}{cc}1 & -b_{12} \\ -b_{21} & 1 \end{array}\right] \), \( \Delta = 1-b_{12}b_{21} \)

\[ By_t = a + C y_{t-1} + \epsilon_t \] \[ y_t = B^{-1}a + B^{-1} C y_{t-1} + B^{-1}\epsilon_t \]

\[ \left[ \begin{array}{c}y_{1,t} \\ y_{2,t} \end{array} \right] = \left[ \begin{array}{cc}1 & b_{12} \\ b_{21} & 1 \end{array} \right]^{-1} \left[ \begin{array}{c}a_1 \\ a_2 \end{array} \right] + \left[ \begin{array}{cc}1 & b_{12} \\ b_{21} & 1 \end{array} \right]^{-1} \left[ \begin{array}{cc}c_{11} & c_{12} \\ c_{21} & c_{22} \end{array} \right] \left[ \begin{array}{c}y_{1,t-1} \\ y_{2,t-1} \end{array} \right] + \left[ \begin{array}{cc}1 & b_{12} \\ b_{21} & 1 \end{array} \right]^{-1} \left[ \begin{array}{c}\epsilon_{1,t} \\ \epsilon_{2,t} \end{array} \right] \]

We let \( B^{-1}a = \gamma \), \( B^{-1}C = \Gamma \), and \( B^{-1}\epsilon_t = e_t \). The equation becomes:

\[ y_t = B^{-1}a + B^{-1} C y_{t-1} + B^{-1}\epsilon_t \] \[ y_t = \gamma+ \Gamma y_{t-1} + e_t \]

\[ \left[ \begin{array}{c}y_{1,t} \\ y_{2,t} \end{array} \right] = \left[ \begin{array}{c}\gamma_1 \\ \gamma_2 \end{array} \right] + \left[ \begin{array}{cc}\gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{array} \right] \left[ \begin{array}{c}y_{1,t-1} \\ y_{2,t-1} \end{array} \right] + \left[ \begin{array}{c}e_{1,t} \\ e_{2,t} \end{array} \right] \]

The equation is now in reduced form, as each variable is expressed only in terms of lagged variables.

The residual terms \( e_t \) are correlated with each other. Originally, the covariance matrix of the error terms (\( \epsilon_t \)) in structural form is: \[ Var(\epsilon_t) = E(\epsilon_t \epsilon_t') = \left[\begin{array}{cc}\sigma_{\epsilon_1}^2 & 0 \\ 0 & \sigma_{\epsilon_2}^2 \end{array} \right] = D \]

Since \( B^{-1}\epsilon_t = e_t \), the covariances of error terms (\( e_t \)) in reduced form are: \[ Var(e_t) = E(e_t e_t') = E(B^{-1}\epsilon_t (B^{-1}\epsilon_t)') = E(B^{-1} \epsilon_t \epsilon_t' B^{-1'} ) = B^{-1} D B^{-1'} = \Sigma \]

While the elements in \( \Sigma \) are : \( \left[ \begin{array}{cc}\sigma_{e_{11}}^2 & \sigma_{e_{12}}^2 \\ \sigma_{e_{21}}^2 & \sigma_{e_{22}}^2 \end{array}\right] \) = \( \frac{1}{\Delta^2} \left[\begin{array}{cc}\sigma_{\epsilon_1}^2+b_{12}^2\sigma_{\epsilon_2}^2 & -(b_{21}\sigma_{\epsilon_1}^2 + b_{12}\sigma_{\epsilon_2}^2) \\ -(b_{21}\sigma_{\epsilon_1}^2 + b_{12}\sigma_{\epsilon_2}^2) & \sigma_{\epsilon_2}^2+b_{21}^2\sigma_{\epsilon_1}^2 \end{array}\right] \)

The covariance between \( e_1 \) and \( e_2 \) is zero only when both \( b_{12} \) and \( b_{21} \) are zero, and in that case it also implies that \( y_1 \) and \( y_2 \) have no contemporaneous effect on each other.
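Here is a small numerical illustration of \( \Sigma = B^{-1} D B^{-1'} \) in R (the values of \( b_{12} \), \( b_{21} \), \( \sigma_{\epsilon_1} \) and \( \sigma_{\epsilon_2} \) below are made up purely for illustration):

b12.ex <- 0.4
b21.ex <- 0.5
sig1.ex <- 1
sig2.ex <- 1.5
B.ex <- matrix(c(1, b12.ex, b21.ex, 1), nrow = 2, byrow = TRUE)
D.ex <- diag(c(sig1.ex^2, sig2.ex^2))  # diagonal covariance matrix of the structural errors
Sigma.ex <- solve(B.ex) %*% D.ex %*% t(solve(B.ex))
Sigma.ex  # off-diagonal elements are non-zero because b12 and b21 are non-zero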

Stationarity of VAR

In lag operator notation, the reduced form VAR model can be written as: \[ y_t = \gamma+ \Gamma y_{t-1} + e_t \] \[ y_t - \Gamma y_{t-1} = \gamma + e_t \] \[ (I-\Gamma L) y_t = \gamma + e_t \] \[ \Gamma(L)y_t = \gamma + e_t \]

For stationarity, the process should have a moving average (MA) or Wold representation, which means the lag polynomial \( \Gamma(L) \) should be invertible. If \( \Gamma(L) \) is invertible, the characteristic roots of \( det(I - \Gamma z) = 0 \) lie outside the unit circle or, equivalently, the eigenvalues of the companion matrix F have modulus less than one (more about the companion matrix can be found in my other post):

\[ F = \left[\begin{array}{cccc} \Gamma_1 & \Gamma_2 & \cdots & \Gamma_n \\ I_n & 0 & \cdots & 0 \\ 0 & \ddots & 0 & \vdots \\ 0 & 0 & I_n & 0 \end{array} \right] \]

In our case, the companion matrix \( F \) for \( n=1 \) is:

\[ F = \Gamma \]

Estimation of VAR

So now the problem is, how can we estimate the structural VAR model? The steps are:

  1. Estimate the reduced form VAR model first.
  2. Find the Covariance matrix of the residuals of the estimated equations in 1.
  3. Apply Choleski decomposition for the covariance matrix to obtain matrix \( B^{-1} \).
  4. Multiply the reduced form VAR model by the matrix \( B \) to get the structural form back.

The True model is:

\[ y_t = \gamma+ \Gamma y_{t-1} + e_t \] \[ y_t = B^{-1}a + B^{-1} C y_{t-1} + B^{-1}\epsilon_t \]

If we can estimate the matrix \( B \), then we can multiply the above equation by it to retrieve the TRUE model. But in practice, there is no way to estimate the matrix \( B \) without imposing any restriction. Why? Let's consider the following:

\[ \left[ \begin{array}{c}y_{1,t} \\ y_{2,t} \end{array} \right] = \left[ \begin{array}{c}\gamma_1 \\ \gamma_2 \end{array} \right] + \left[ \begin{array}{cc}\gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{array} \right] \left[ \begin{array}{c}y_{1,t-1} \\ y_{2,t-1} \end{array} \right] + \left[ \begin{array}{c}e_{1,t} \\ e_{2,t} \end{array} \right] \]

We can estimate the equations by OLS to obtain:

\[ y_t = \hat{\gamma}+ \hat{\Gamma} y_{t-1} + \hat{e_t} \]

The covariance matrix of the residuals is: \[ E(\hat{e_t} \hat{e_t}') = \hat{\Sigma} \]

Next, we perform Choleski decomposition for the matrix \( \hat{\Sigma} \) \[ \hat{\Sigma} = L \Lambda L' \]

As mentioned before, the True covariance matrix is \( \Sigma = B^{-1}DB^{-1'} \), and the elements of \( \Sigma \) are:

\[ \left[ \begin{array}{cc}\sigma_{e_{11}}^2 & \sigma_{e_{12}}^2 \\ \sigma_{e_{21}}^2 & \sigma_{e_{22}}^2 \end{array}\right] = \frac{1}{\Delta^2} \left[\begin{array}{cc}\sigma_{\epsilon_1}^2+b_{12}^2\sigma_{\epsilon_2}^2 & -(b_{21}\sigma_{\epsilon_1}^2 + b_{12}\sigma_{\epsilon_2}^2) \\ -(b_{21}\sigma_{\epsilon_1}^2 + b_{12}\sigma_{\epsilon_2}^2) & \sigma_{\epsilon_2}^2+b_{21}^2\sigma_{\epsilon_1}^2 \end{array}\right] \]

The problem we encounter now is that even if we had the true covariance matrix \( \Sigma \), we could not estimate the 4 parameters (\( b_{12}, b_{21}, \sigma_{\epsilon_1}, \sigma_{\epsilon_2} \)) from only 3 observed values (\( \sigma_{e11} \), \( \sigma_{e22} \), \( \sigma_{e12}=\sigma_{e21} \)). That implies that some restrictions must be imposed in order to make estimation possible. The most commonly used restriction is to set \( b_{12} = 0 \), which assumes that \( y_2 \) has no contemporaneous effect on \( y_1 \) but \( y_1 \) does have a contemporaneous effect on \( y_2 \). When making this assumption, you should focus on the economic meaning rather than the statistical meaning of the equation.

Try it out! Generating the data

To see how the estimation works, let's get our hands dirty now. We generate the data \( y_1 \) and \( y_2 \) from the following bivariate SVAR(1) model:

a1 <- 1.6
b12 <- 0
c11 <- 0.5
c12 <- 0.8

a2 <- -0.8
b21 <- 0.5
c21 <- 0.8
c22 <- 0.3

\[ y_{1,t} = 1.6 - 0y_{2,t} + 0.5y_{1,t-1} + 0.8y_{2,t-1} + \epsilon_{1,t} \] \[ y_{2,t} = -0.8 - 0.5y_{1,t} + 0.8y_{1,t-1} + 0.3y_{2,t-1} + \epsilon_{2,t} \]

\[ \left[ \begin{array}{cc}1 & 0 \\ 0.5 & 1 \end{array} \right] \left[ \begin{array}{c}y_{1,t} \\ y_{2,t} \end{array} \right] = \left[ \begin{array}{c}1.6 \\ -0.8 \end{array} \right] + \left[ \begin{array}{cc}0.5 & 0.8 \\ 0.8 & 0.3 \end{array} \right] \left[ \begin{array}{c}y_{1,t-1} \\ y_{2,t-1} \end{array} \right] + \left[ \begin{array}{c}\epsilon_{1,t} \\ \epsilon_{2,t} \end{array} \right] \]

*As we will impose the restriction of \( b_{12} =0 \) in the estimation process, the parameter \( b_{12} \) is set to zero in the data generating process.

For the data generating process of \( y_1 \) and \( y_2 \), we first rewrite the structural form equations through substitution.

\[ y_{1,t} = a_1 - b_{12}y_{2,t} + c_{11}y_{1,t-1} + c_{12}y_{2,t-1} + \epsilon_{1,t} \] \[ y_{2,t} = a_2 - b_{21}y_{1,t} + c_{21}y_{1,t-1} + c_{22}y_{2,t-1} + \epsilon_{2,t} \]

Let's substitute \( y_{2,t} \) into the equation for \( y_{1,t} \):

\[ y_{1,t} = a_1 - a_2 b_{12} + b_{12}b_{21}y_{1,t} + (c_{11}-b_{12}c_{21})y_{1,t-1}+(c_{12}-b_{12}c_{22})y_{2,t-1} - b_{12}\epsilon_{2,t}+\epsilon_{1,t} \]

\[ y_{1,t} = \frac{1}{1-b_{12}b_{21}} (a_1 - a_2 b_{12} + (c_{11}-b_{12}c_{21})y_{1,t-1}+(c_{12}-b_{12}c_{22})y_{2,t-1} -b_{12}\epsilon_{2,t}+\epsilon_{1,t}) \]

Once we have \( y_{1,t} \), we can substitute it back to get \( y_{2,t} \).

Let's generate the data in R.


# generate 1,000 independent surprise terms in y1 and y2
set.seed(13)
y1.innov <- rnorm(1000, 0, 1)
y2.innov <- rnorm(1000, 0, 1)

y1.data <- rep(0, 1000)
y2.data <- rep(0, 1000)
y1.data[1] <- 0
y2.data[1] <- 0

for (i in 2:1000) {
    y1.data[i] <- (1/(1 - b12 * b21)) * (a1 - a2 * b12 + (c11 - b12 * c21) * 
        y1.data[i - 1] + (c12 - b12 * c22) * y2.data[i - 1] - b12 * y2.innov[i] + 
        y1.innov[i])

    y2.data[i] <- a2 - b21 * y1.data[i] + c21 * y1.data[i - 1] + c22 * y2.data[i - 
        1] + y2.innov[i]
}

Let's plot the last 100 observations of the 2 series :

library(ggplot2)
library(reshape2)

x <- 901:1000  # indices of the last 100 observations
df <- data.frame(x, y1 = y1.data[901:1000], y2 = y2.data[901:1000])

df.melted <- melt(df, id.vars = "x")

g <- ggplot(data = df.melted, aes(x = x, y = value, color = variable)) + geom_line()

g

[Figure: the last 100 observations of y1 and y2]

OK, now we have the data \( y_1 \) and \( y_2 \) and they seem correlated with each other. Let's try out the estimation procedure and see whether we can retrieve the structural form or not.

Estimation of VAR by Brute Force

As mentioned above, the estimation procedures of VAR are:

  1. Estimate the reduced form VAR model first.
  2. Find the Covariance matrix of the residuals of the estimated equations in 1.
  3. Apply Choleski decomposition for the covariance matrix to obtain matrix \( L \), which is the estimator of \( B^{-1} \).
  4. Multiply the estimated reduced form VAR model by \( L^{-1} \) to get the structural form back.

Step 1 - Estimate the reduced form VAR model

So, let's do step 1:

n <- 1000

t <- 2:n

df <- data.frame(y1 = y1.data, y2 = y2.data)
# reduced form
y1.r.fit <- lm(y1[t] ~ y1[t - 1] + y2[t - 1], df)
y2.r.fit <- lm(y2[t] ~ y1[t - 1] + y2[t - 1], df)

Let's check the regression results:

y1.r.fit
## 
## Call:
## lm(formula = y1[t] ~ y1[t - 1] + y2[t - 1], data = df)
## 
## Coefficients:
## (Intercept)    y1[t - 1]    y2[t - 1]  
##       1.592        0.505        0.777
y2.r.fit
## 
## Call:
## lm(formula = y2[t] ~ y1[t - 1] + y2[t - 1], data = df)
## 
## Coefficients:
## (Intercept)    y1[t - 1]    y2[t - 1]  
##      -1.575        0.546       -0.110

And the matrices in reduced form are:

gamma1 <- y1.r.fit$coef[1]
gamma2 <- y2.r.fit$coef[1]

gamma.hat <- matrix(c(gamma1, gamma2), ncol = 1, nrow = 2, byrow = F)

gamma11 <- y1.r.fit$coef[2]
gamma12 <- y1.r.fit$coef[3]
gamma21 <- y2.r.fit$coef[2]
gamma22 <- y2.r.fit$coef[3]

Gamma.hat <- matrix(c(gamma11, gamma12, gamma21, gamma22), ncol = 2, byrow = T)

Now, let's check the stationarity of the series. For a VAR(1) to be stationary, the eigenvalues of the matrix \( \Gamma \) must lie inside the unit circle:

eigen(Gamma.hat, only.values = T)$values
## [1]  0.9179 -0.5228

Step 2 - get the covariance matrix \( \hat{\Sigma} \) of the estimated residuals:

res.mat <- cbind(y1.resid = y1.r.fit$residuals, y2.resid = y2.r.fit$residuals)
Sigma.hat <- var(res.mat)
colnames(Sigma.hat) <- NULL
rownames(Sigma.hat) <- NULL
Sigma.hat
##         [,1]    [,2]
## [1,]  1.0003 -0.5039
## [2,] -0.5039  1.2332

Step 3 - Do the Choleski decomposition to obtain the matrix \( L \), which is the estimator of \( B^{-1} \).

ch <- chol(Sigma.hat)
dd <- diag(ch)
L <- t(ch/dd)
D <- diag(dd^2)

Linv = solve(L)

Step 4 - Multiply \( L^{-1} \) to get back the structural form

a.hat = Linv %*% gamma.hat
C.hat = Linv %*% Gamma.hat

Now, let's compare the true values with the estimated values:

\( B=\left[ \begin{array}{cc}1 & 0 \\ 0.5 & 1 \end{array} \right] \), \( \hat{B} = \left[ \begin{array}{cc}1 & 0 \\ 0.5037 & 1 \end{array} \right] \)

\( a=\left[ \begin{array}{c}1.6 \\ -0.8 \end{array} \right] \), \( \hat{a}=\left[ \begin{array}{c}1.5921 \\ -0.7733 \end{array} \right] \),

\( C=\left[ \begin{array}{cc}0.5 & 0.8 \\ 0.8 & 0.3 \end{array} \right] \), \( \hat{C}=\left[ \begin{array}{cc}0.5047 & 0.7768 \\ 0.8007 & 0.2817 \end{array} \right] \)

Check the covariance matrix of \( \hat{\epsilon} \) in the structural form; there should not be any covariance between the 2 shocks.

epsilon.hat = list()
# the fitted models have 999 residuals (observations 2 to 1,000)
for (i in 1:999) {
    epsilon.hat[[i]] = Linv %*% as.matrix(c(y1.r.fit$residuals[i], y2.r.fit$residuals[i]))
}

e.mat = matrix(0, nrow = 999, ncol = 2)
for (i in 1:999) {
    e.mat[i, 1] = epsilon.hat[[i]][1]
    e.mat[i, 2] = epsilon.hat[[i]][2]
}
cov(e.mat)
##           [,1]       [,2]
## [1,]  1.00e+00 -5.540e-17
## [2,] -5.54e-17  9.794e-01

Everything seems pretty good. Now, let's talk about how to make analysis by using the VAR model.

Moving Average (MA) or Wold Representation

When discussing stationarity, I mentioned that the reduced form VAR model can be rewritten in its Wold representation. Now consider a VAR(1) model:

\[ y_t = \gamma+ \Gamma y_{t-1} + e_t \] \[ y_t - \Gamma y_{t-1} = \gamma + e_t \] \[ (I-\Gamma L) y_t = \gamma + e_t \] \[ y_t = (I-\Gamma L)^{-1} \gamma + (I-\Gamma L)^{-1} e_t \] \[ y_t = \phi + \Phi(L)e_t \]

where: \[ \Phi(L) = (I-\Gamma L)^{-1} = \sum_{i=0}^{\infty} \Gamma^i L^i \]

So, \[ y_t = \phi + e_t + \Gamma^1 e_{t-1} + \Gamma^2 e_{t-2} + \cdots \]

Since the elements of the vector \( e_{t} \) are correlated with each other, the above equation does not give much insight. But recall that \( B^{-1}\epsilon_t = e_t \), so it can be further written as:

\[ y_t = \phi + B^{-1}\epsilon_t + \Gamma^1 B^{-1}\epsilon_{t-1} + \Gamma^2 B^{-1}\epsilon_{t-2} + \cdots \] \[ y_t = \phi + \Upsilon(L)\epsilon_t \]

where

\[ \Upsilon(L) = \sum_{k=0}^{\infty} \Upsilon_k L^k \] \[ \Upsilon(L) = B^{-1} + \Gamma^1 B^{-1} L + \Gamma^2 B^{-1} L^2 + \cdots \]

Since the \( \epsilon_{t-i} \) are the structural innovations, the above equation is called the structural moving average (SMA) form.

Let's have a look at the SMA representation for the bivariate system:

\[ \left[\begin{array}{c} y_{1,t} \\ y_{2,t} \end{array}\right] = \left[\begin{array}{c} \phi_1 \\ \phi_2 \end{array}\right] + \left[\begin{array}{cc} \upsilon_{11}^{(0)} & \upsilon_{12}^{(0)}\\ \upsilon_{21}^{(0)} & \upsilon_{22}^{(0)} \end{array}\right] \left[\begin{array}{c} \epsilon_{1,t} \\ \epsilon_{2,t} \end{array}\right] + \left[\begin{array}{cc} \upsilon_{11}^{(1)} & \upsilon_{12}^{(1)}\\ \upsilon_{21}^{(1)} & \upsilon_{22}^{(1)} \end{array}\right] \left[\begin{array}{c} \epsilon_{1,t-1} \\ \epsilon_{2,t-1} \end{array}\right] + \cdots \]

The coefficient \( \upsilon_{ij}^{(k)} \) represents the dynamic multiplier, or impulse response, of \( y_{i,t} \) to changes in \( \epsilon_{j,t} \).

The Impulse Response Function (IRF)

Consider the SMA representation of the equation at time \( t+s \):

\[ \left[\begin{array}{c} y_{1,t+s} \\ y_{2,t+s} \end{array}\right] = \left[\begin{array}{c} \phi_1 \\ \phi_2 \end{array}\right] + \left[\begin{array}{cc} \upsilon_{11}^{(0)} & \upsilon_{12}^{(0)}\\ \upsilon_{21}^{(0)} & \upsilon_{22}^{(0)} \end{array}\right] \left[\begin{array}{c} \epsilon_{1,t+s} \\ \epsilon_{2,t+s} \end{array}\right] + \cdots + \left[\begin{array}{cc} \upsilon_{11}^{(s)} & \upsilon_{12}^{(s)}\\ \upsilon_{21}^{(s)} & \upsilon_{22}^{(s)} \end{array}\right] \left[\begin{array}{c} \epsilon_{1,t} \\ \epsilon_{2,t} \end{array}\right] \]

If we differentiate the above equation with respect to the individual structural innovations, we obtain the structural dynamic multipliers:

\[ \begin{array}{cc} \frac{\partial y_{1,t+s}}{\partial \epsilon_{1,t}} = \upsilon_{11}^{(s)} & \frac{\partial y_{1,t+s}}{\partial \epsilon_{2,t}} = \upsilon_{12}^{(s)} \\ \frac{\partial y_{2,t+s}}{\partial \epsilon_{1,t}} = \upsilon_{21}^{(s)} & \frac{\partial y_{2,t+s}}{\partial \epsilon_{2,t}} = \upsilon_{22}^{(s)} \end{array} \]

And if the 2 series \( y_t \) are covariance stationary:

\[ \lim_{s\rightarrow \infty} \upsilon_{ij}^{(s)} = 0, \quad \text{for } i, j \in \{1,2\} \]

Thus, the long run cumulative impact of the structural innovations is captured by the long-run impact matrix:

\[ \Upsilon(L) = \left[\begin{array}{cc} \sum_{s=0}^{\infty}\upsilon_{11}^{(s)}L^s & \sum_{s=0}^{\infty}\upsilon_{12}^{(s)} L^s\\ \sum_{s=0}^{\infty}\upsilon_{21}^{(s)} L^s & \sum_{s=0}^{\infty}\upsilon_{22}^{(s)} L^s \end{array}\right] \]

Now, let's calculate and plot the 4 cumulative IRFs in R, up to 100 lags.

nLags = 100
Upsilon.list <- list()

for (i in 1:nLags) {

    tempGamma <- Gamma.hat
    if (i == 1) 
        tempGamma <- diag(2) else if (i > 2) {
        for (j in 3:i) {
            tempGamma <- tempGamma %*% Gamma.hat
        }
    }

    Upsilon.list[[i]] <- tempGamma %*% L  # Upsilon_k = Gamma^k %*% B^{-1}, with L estimating B^{-1}
}

Culsum.Upsilon.list <- Upsilon.list
for (i in 1:nLags) {
    j <- 1
    while (j < i) {
        Culsum.Upsilon.list[[i]] <- Culsum.Upsilon.list[[i]] + Upsilon.list[[j]]
        j <- j + 1
    }
}

upsilon11 = array()
upsilon12 = array()
upsilon21 = array()
upsilon22 = array()

for (i in 1:nLags) {
    upsilon11[i] = Culsum.Upsilon.list[[i]][1, 1]
    upsilon12[i] = Culsum.Upsilon.list[[i]][1, 2]
    upsilon21[i] = Culsum.Upsilon.list[[i]][2, 1]
    upsilon22[i] = Culsum.Upsilon.list[[i]][2, 2]
}

par(mfrow = c(2, 2))
# tex = 'Shock in '+expression(epsilon_(t))
plot(upsilon11, type = "l", xlab = "Time", ylab = expression(paste("Shock to ", 
    epsilon[1])), main = "Response of Y1")
plot(upsilon12, type = "l", xlab = "Time", ylab = expression(paste("Shock to ", 
    epsilon[2])), main = "Response of Y1")
plot(upsilon21, type = "l", xlab = "Time", ylab = expression(paste("Shock to ", 
    epsilon[1])), main = "Response of Y2")
plot(upsilon22, type = "l", xlab = "Time", ylab = expression(paste("Shock to ", 
    epsilon[2])), main = "Response of Y2")

[Figure: cumulative impulse responses of Y1 and Y2 to shocks in epsilon1 and epsilon2 (2 x 2 panel)]

Impulse responses tell you the response of the current and future values of each variable to a unit increase in the current value of one of the VAR structural innovations.

Forecast Error Variance Decompositions

Forecast error variance decompositions help us determine the proportion of the forecast error variance due to each of the structural shocks \( \epsilon_1 \) and \( \epsilon_2 \) when forecasting \( y_1 \) and \( y_2 \) at time \( t+s \), based on the information available at time \( t \).

Again, the Wold representation of \( y_{t+s} \) is:

\[ y_t = \phi + B^{-1}\epsilon_t + \Gamma^1 B^{-1}\epsilon_{t-1} + \Gamma^2 B^{-1}\epsilon_{t-2} + \cdots \] \[ y_{t+s} = \phi + \Upsilon_0\epsilon_{t+s} + \Upsilon_1\epsilon_{t+s-1} + \Upsilon_2\epsilon_{t+s-2} + \cdots +\Upsilon_s\epsilon_{t} + \Upsilon_{s+1}\epsilon_{t-1} + \cdots \]

Based on the information available at time \( t \), the best linear forecast of \( y_{t+s} \) is:

\[ \hat{y_{t+s}} = \phi + \Upsilon_s\epsilon_{t} + \Upsilon_{s+1}\epsilon_{t-1} + \Upsilon_{s+2}\epsilon_{t-2} + \cdots \]

The forecast error is:

\[ y_{t+s} -\hat{y_{t+s}} = \Upsilon_0\epsilon_{t+s} + \Upsilon_1\epsilon_{t+s-1} + \Upsilon_2\epsilon_{t+s-2} + \cdots +\Upsilon_{s-1}\epsilon_{t+1} \]

Thus, the forecast error, equation by equation, is:

\[ \left[\begin{array}{c} y_{1,t+s} - \hat{y_{1,t+s}} \\ y_{2,t+s} - \hat{y_{2,t+s}} \end{array}\right] = \left[\begin{array}{cc} \upsilon_{11}^{(0)} & \upsilon_{12}^{(0)}\\ \upsilon_{21}^{(0)} & \upsilon_{22}^{(0)} \end{array}\right] \left[\begin{array}{c} \epsilon_{1,t+s} \\ \epsilon_{2,t+s} \end{array}\right] + \cdots + \left[\begin{array}{cc} \upsilon_{11}^{(s-1)} & \upsilon_{12}^{(s-1)}\\ \upsilon_{21}^{(s-1)} & \upsilon_{22}^{(s-1)} \end{array}\right] \left[\begin{array}{c} \epsilon_{1,t+1} \\ \epsilon_{2,t+1} \end{array}\right] \]

For the first equation: \[ y_{1,t+s} - \hat{y_{1,t+s}} = \begin{array}{l} \upsilon_{11}^{(0)}\epsilon_{1,t+s} + \cdots + \upsilon_{11}^{(s-1)}\epsilon_{1,t+1} \\ + \upsilon_{12}^{(0)}\epsilon_{2,t+s} + \cdots + \upsilon_{12}^{(s-1)}\epsilon_{2,t+1} \end{array} \]

\[ \sigma(s)_{1}^{2} = Var(y_{1,t+s} - \hat{y_{1,t+s}}) = \sigma_{\epsilon_1}^2( (\upsilon_{11}^{(0)})^2 + (\upsilon_{11}^{(1)})^2 + \dots + (\upsilon_{11}^{(s-1)})^2) + \sigma_{\epsilon_2}^2( (\upsilon_{12}^{(0)})^2 + (\upsilon_{12}^{(1)})^2 + \dots + (\upsilon_{12}^{(s-1)})^2) \]

Thus, the proportion of the total variance of the forecast of series \( y_1 \), \( s \) periods ahead, that is due to shocks in \( \epsilon_1 \) is:

\[ \frac{\sigma_{\epsilon_1}^2( (\upsilon_{11}^{(0)})^2 + (\upsilon_{11}^{(1)})^2 + \dots + (\upsilon_{11}^{(s-1)})^2)}{\sigma(s)_{1}^2} \]

The same idea applies to series \( y_2 \) and to shock \( \epsilon_2 \) as well.
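As a rough sketch, the decomposition for \( y_1 \) can be computed in R from the (non-cumulative) \( \Upsilon_k \) matrices and the structural variances estimated earlier (this reuses Upsilon.list and D from the code above; the 20-period horizon is an arbitrary choice):

fevd.y1 <- matrix(NA, nrow = 20, ncol = 2,
    dimnames = list(NULL, c("due to eps1", "due to eps2")))
for (s in 1:20) {
    u11 <- sapply(Upsilon.list[1:s], function(U) U[1, 1])
    u12 <- sapply(Upsilon.list[1:s], function(U) U[1, 2])
    v1 <- D[1, 1] * sum(u11^2)  # part of the s-step forecast error variance due to eps1
    v2 <- D[2, 2] * sum(u12^2)  # part due to eps2
    fevd.y1[s, ] <- c(v1, v2)/(v1 + v2)
}
head(fevd.y1)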

My next post will talk about the VAR package available in R, stay tuned.

Friday, 6 September 2013

Introduction to Principal Component Analysis (PCA)

Principal Component Analysis

Principal Component Analysis (PCA) is very useful in finance for reducing the number of risk factors to a manageable dimension.

Alright, what is PCA? Let's have a look at the following examples.

Let X and Y be random variables where \( X \sim N(0,2) \), \( Y \sim N(0,4) \) (i.e. variances 2 and 4), \( corr(X,Y)=0.8 \), and \( cov(X,Y) = corr(X,Y)\times\sqrt{2}\times\sqrt{4} \).

We generate 1000 observations of X and Y in R by using the function 'mvrnorm' from the package 'MASS'.

library("MASS")
set.seed(7)
mu.x <- 0
mu.y <- 0
sigma.x <- sqrt(2)
sigma.y <- sqrt(4)
corr.xy <- 0.8
cov.xy <- corr.xy * sigma.x * sigma.y
Sigma.xy <- matrix(c(sigma.x^2, cov.xy, cov.xy, sigma.y^2), nrow = 2, byrow = T)
# generate the data
data <- mvrnorm(1000, mu = c(mu.x, mu.y), Sigma = Sigma.xy)
colnames(data) <- c("X", "Y")
head(data)
##            X       Y
## [1,]  2.1630  4.9775
## [2,] -1.0629 -2.6492
## [3,] -1.0180 -1.2755
## [4,] -0.3050 -0.9525
## [5,] -0.7407 -2.2278
## [6,] -1.5636 -1.6265
df = data.frame(data)

Let's have a quick check on the generated data

# mean and variance of X, should be about 0 and 2
mean(data[, 1])
## [1] -0.007433
var(data[, 1])
## [1] 1.918

# mean and variance of Y, should be about 0 and 4
mean(data[, 2])
## [1] 0.01335
var(data[, 2])
## [1] 3.915

# Correlation matrix of X and Y, off-diagonal elements should be about 0.8
cor(data)
##       X     Y
## X 1.000 0.783
## Y 0.783 1.000

Now, let's plot the generated random variables X and Y.

library(ggplot2)
library(grid)

g = ggplot(df, aes(X, Y)) + geom_point() + geom_point(data = NULL, aes(x = 0, 
    y = 0, colour = "red"), show_guide = FALSE)
plot(g)

[Figure: scatter plot of the generated X and Y]

Assume that we don't know the underlying distribution of X and Y, and we want to find the common risk factors of X and Y. PCA is one of the solutions.

The first step of performing PCA is to find the covariance matrix \( \Sigma \) of the variables (or the correlation matrix), followed by finding its eigenvalues and eigenvectors. In R, the covariance matrix of X and Y can be obtained by:

Sigmahat = cov(data)
e = eigen(Sigmahat)

The two eigenvalues (\( \lambda_1 \), \( \lambda_2 \)) and their eigenvectors (\( w_1, w_2 \)) are:

e$values[1]
## [1] 5.282
e$values[2]
## [1] 0.5498

e$vectors[, 1]
## [1] 0.5376 0.8432
e$vectors[, 2]
## [1] -0.8432  0.5376

Next, we rank the eigenvalues from largest to smallest and sort the corresponding eigenvectors to create a matrix \( W \). (Actually, R has already sorted the result for you; the following code is NOT necessary and is for illustration purposes only.)

s = sort(e$values, decreasing = T, index.return = T)
# find out the dimension of eignvectors
n = length(e$vectors[, 1])
W = matrix(nrow = n, ncol = n)
for (i in 1:n) {
    W[, i] = e$vectors[, s$ix[i]]
}
W
##        [,1]    [,2]
## [1,] 0.5376 -0.8432
## [2,] 0.8432  0.5376

Principal Component Representation

We denote the values of X and Y by a matrix \( D \); the principal components with respect to the covariance matrix \( \Sigma \) are defined by: \[ P = DW \]

D <- data
P <- D %*% W

The total variation in the data \( D \) is the sum of the eigenvalues of \( \Sigma \), which is \( \lambda_1+\lambda_2 \). The proportion of total variation explained by the \( m \)th principal component is \( \frac{\lambda_m}{\sum \lambda_i} \). In our case, the 1st principal component explains about 90.57 percent of the total variation of X and Y.

e$values[1]/(e$values[1] + e$values[2])
## [1] 0.9057

So, to reduce the dimension of our data, we may reconstruct the data \( D \) by multiplying \( P \) by the transpose of the orthogonal eigenvector matrix \( W \) (\( W^{-1} = W' \)): \[ D = PW^{-1} =PW' \] We take the first \( k \) columns of \( P \) (those with the largest eigenvalues) as \( P^* \) and the corresponding \( k \) columns of \( W \) as \( W^* \). \[ D \approx P^* W^{*'} \]

In our example:

\( D = PW' \): consider the 1st and 2nd rows of \( D \):

\[ [\begin{array}{cc} 2.1630 & 4.9774\end{array}] = [\begin{array}{cc} 5.3598 & 0.8519 \end{array}] \left[\begin{array}{cc} 0.5375 & 0.8432 \\ -0.8432 & 0.5375 \end{array} \right] \]

\[ [\begin{array}{cc} -1.0629 & -2.6492\end{array}] = [\begin{array}{cc} -2.8052 & -0.5279 \end{array}] \left[\begin{array}{cc} 0.5375 & 0.8432 \\ -0.8432 & 0.5375 \end{array} \right] \]

\( D \approx P^*W^{*'} \): consider the 1st and 2nd rows of \( D \):

\[ [\begin{array}{cc} 2.1630 & 4.9774\end{array}] \approx [5.3598] \times [\begin{array}{cc} 0.5375 & 0.8432 \end{array} ] \approx [\begin{array}{cc} 2.8809 & 4.5194 \end{array} ] \]

\[ [\begin{array}{cc} -1.0629 & -2.6492\end{array}] \approx [-2.8052] \times [\begin{array}{cc} 0.5375 & 0.8432 \end{array}] \approx [\begin{array}{cc} -1.5078 & -2.3653 \end{array} ] \]

The matrices W and W* are:

W
##        [,1]    [,2]
## [1,] 0.5376 -0.8432
## [2,] 0.8432  0.5376
as.matrix(W[, 1])
##        [,1]
## [1,] 0.5376
## [2,] 0.8432
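In R, the rank-1 reconstruction above can be reproduced directly (a small sketch reusing the P, W and data objects created earlier; D.approx is just an illustrative name):

P.star <- P[, 1, drop = FALSE]  # scores on the first principal component
W.star <- W[, 1, drop = FALSE]  # loadings (first eigenvector)
D.approx <- P.star %*% t(W.star)  # rank-1 approximation of the data
head(D.approx)  # compare with head(data)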

PCA for 10 Random Variables with 1,000 Observations

PCA on 2 variables may not convince you that it is a useful technique; have a look at the following example with 10 RVs and you may change your mind :)

Since PCA is most useful for highly correlated data, we generate 10 RVs with random variances and high correlations among them.

library("MASS")
set.seed(7)

# limit sigma between 2 to 6, mu between -5 to 5
sigma.10 <- runif(10, 2, 6)
mu.10 <- runif(10, -5, 5)
# correlations are set to be high, from 0.9 to 1
corr.data <- runif((10 * 9)/2, 0.9, 1)


corr.mat <- diag(10)
corr.mat[lower.tri(corr.mat)] <- corr.data
corr.mat <- corr.mat %*% t(corr.mat)  # build a symmetric positive-definite matrix from the lower-triangular factor
var.mat <- diag(10) * sigma.10
cov.mat <- var.mat %*% corr.mat %*% var.mat

D <- mvrnorm(1000, mu = mu.10, Sigma = cov.mat)
colnames(D) <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10")

We now have 1000 observations of random variables X1, X2, …, X10, the first 6 rows of the observations are:

head(D)
##            X1      X2      X3       X4      X5       X6     X7       X8
## [1,]  -4.6118  -8.206 -3.7575 -12.4149 -8.3256 -13.3406 -3.575 -10.7387
## [2,]   3.1745  -2.198  6.5436   0.2722  5.9752   3.1262 -1.500 -14.9713
## [3,]  -0.2836   6.053  9.8119   0.4989  5.4662  -0.3006  2.764   0.9408
## [4,] -10.4104 -15.783 -1.2564  -9.9002 -8.9553 -16.8600 -4.604   3.3858
## [5,]   3.9650  -5.453 -1.1502  -6.4966 -7.8201 -11.3640 -3.997  -7.4090
## [6,]  -7.3194  -6.590 -0.4325  -6.3764  0.4998  -3.2336 -5.920  -9.4978
##         X9       X10
## [1,] 4.446   4.46822
## [2,] 3.341   2.96328
## [3,] 1.167 -11.08713
## [4,] 6.911  -1.26148
## [5,] 3.730  -0.89635
## [6,] 6.476   0.04366

Now, we apply PCA to reduce the dimension of the data

Sigmahat = cov(D)
e = eigen(Sigmahat)
W = cbind(e$vectors)

Let's examine the eigenvalues

e$values
##  [1] 691.743  58.670  30.988  16.657  11.989   6.396   4.847   3.054
##  [9]   3.007   1.593

With the first 4 eigenvectors, we can explain about 96.27 percent of the variation of the data D.

sum(e$values[1:4])/sum(e$values) * 100
## [1] 96.27

Let's take the first 4 columns of \( P \) and \( W \) as \( P^* \) and \( W^* \) and reconstruct the data:

P = D %*% W

D.hat = P[, 1:4] %*% t(W[, 1:4])
colnames(D.hat) <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10")

Now, let's look at the difference between the original data \( D \) and the reconstructed data \( \hat{D} \):

head(D)
##            X1      X2      X3       X4      X5       X6     X7       X8
## [1,]  -4.6118  -8.206 -3.7575 -12.4149 -8.3256 -13.3406 -3.575 -10.7387
## [2,]   3.1745  -2.198  6.5436   0.2722  5.9752   3.1262 -1.500 -14.9713
## [3,]  -0.2836   6.053  9.8119   0.4989  5.4662  -0.3006  2.764   0.9408
## [4,] -10.4104 -15.783 -1.2564  -9.9002 -8.9553 -16.8600 -4.604   3.3858
## [5,]   3.9650  -5.453 -1.1502  -6.4966 -7.8201 -11.3640 -3.997  -7.4090
## [6,]  -7.3194  -6.590 -0.4325  -6.3764  0.4998  -3.2336 -5.920  -9.4978
##         X9       X10
## [1,] 4.446   4.46822
## [2,] 3.341   2.96328
## [3,] 1.167 -11.08713
## [4,] 6.911  -1.26148
## [5,] 3.730  -0.89635
## [6,] 6.476   0.04366
head(D.hat)
##             X1     X2     X3     X4      X5      X6     X7      X8      X9
## [1,]  -7.51019 -6.290 -5.389 -5.777  -8.319 -13.837 -7.337  -8.501  0.4882
## [2,]   1.83168  1.640  1.991  2.595   3.742   4.632 -2.706 -13.709 -1.4306
## [3,]   4.55053  3.337  2.220  1.886   2.290   3.053  1.767   1.893 -3.1759
## [4,] -12.21744 -9.699 -7.920 -8.310 -11.048 -15.698 -3.805   4.080  1.6485
## [5,]  -0.02942 -1.028 -2.337 -3.491  -6.366 -12.421 -6.164  -5.554 -0.9276
## [6,]  -7.54581 -5.473 -3.311 -2.639  -2.574  -3.151 -3.451  -8.756 -0.6881
##          X10
## [1,]  5.6071
## [2,]  4.6427
## [3,] -9.4474
## [4,]  0.6940
## [5,]  0.6069
## [6,]  2.6988

And let's plot the first 4 eigenvectors
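Here is a minimal sketch of such a plot using matplot (the axis labels and title are illustrative):

matplot(W[, 1:4], type = "l", lty = 1, lwd = 2, xlab = "Variable index (X1..X10)", 
    ylab = "Loading", main = "First 4 eigenvectors")
legend("topright", legend = paste0("PC", 1:4), col = 1:4, lty = 1, lwd = 2)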

Final remark

Actually, R has a built-in function 'princomp' for performing PCA, but I think it is worthwhile to learn PCA by brute-force programming.

p = princomp(D)
plot(p)

[Figure: variances of the principal components from princomp]

summary(p)
## Importance of components:
##                         Comp.1  Comp.2  Comp.3  Comp.4  Comp.5   Comp.6
## Standard deviation     26.2879 7.65580 5.56387 4.07930 3.46075 2.527669
## Proportion of Variance  0.8345 0.07078 0.03738 0.02009 0.01446 0.007715
## Cumulative Proportion   0.8345 0.90527 0.94265 0.96274 0.97720 0.984920
##                          Comp.7   Comp.8   Comp.9  Comp.10
## Standard deviation     2.200474 1.746596 1.733065 1.261686
## Proportion of Variance 0.005847 0.003684 0.003627 0.001922
## Cumulative Proportion  0.990767 0.994451 0.998078 1.000000

What we have done so far: we reduced the data dimensions while preserving around 96% of the variability of the original data.

I'll show you how PCA can be used in finance in a future post. Let's take a break now.