# Data Analysis of Monthly CO2

## 1 Pre-analysis

The data are the monthly CO2 levels at Alert, Canada (01/1994 - 12/2004). We first plot the series to observe its overall trend over this period.

Judging from the plot, there are no obvious outliers, and there is no need to transform the data. The plot shows a clear seasonal component: we can count 11 complete cycles over the 132 monthly observations, so each cycle spans 132 / 11 = 12 time points.

## 2 Remove Seasonal Components

We remove the seasonal component by differencing at lag 12, so D = 1 in the seasonal ARIMA model. We can then plot the residuals that remain after removing the seasonal component.
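The lag-12 differencing step can be sketched in plain Python, independent of the software used for the analysis. The series below is a hypothetical stand-in (a linear trend plus a period-12 sine wave), not the Alert CO2 data, and the helper name `difference` is introduced here only for illustration.

```python
import math


def difference(series, lag=1):
    """Return series[t] - series[t - lag] for t = lag, ..., len(series) - 1."""
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical stand-in for the CO2 series: 132 monthly points (11 years)
# made of a linear trend plus a period-12 sine wave. Not the real data.
n_months = 132
synthetic = [
    360.0 + 0.15 * t + 6.0 * math.sin(2 * math.pi * t / 12)
    for t in range(n_months)
]

# Differencing at lag 12 corresponds to D = 1 in the seasonal ARIMA model.
seasonal_diff = difference(synthetic, lag=12)

print(len(seasonal_diff))  # 120: the first 12 observations are lost
print(min(seasonal_diff), max(seasonal_diff))
```

On this toy series the seasonal pattern cancels exactly and only the yearly increase of the trend remains. Real data will leave a noisier residual series, which is why the stationarity checks that follow are still needed.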

The residuals do not look stationary: their mean does not appear to be constant. We therefore conduct the ADF and KPSS tests to check whether the residuals are stationary.

The null hypothesis of the Dickey-Fuller test is that the series is not stationary (the time series has a unit root). The p-value of the Dickey-Fuller test is 0.1575, so we fail to reject the null hypothesis that the residuals are not stationary.

The null hypothesis of the KPSS test is that the time series does not have a unit root, i.e. that it is stationary. The p-value is 0.1, so we fail to reject the null hypothesis that the residuals are stationary.

Neither test rejects its null hypothesis, so the two results do not agree with each other. Combining them with the plot, we tend to believe that some non-stationarity is left in the residuals.

## 3 Find Stationary Series

We therefore take the lag-1 difference of the residuals that remain after removing the seasonal component, so d = 1 in the seasonal ARIMA model. We then conduct the ADF and KPSS tests again on the differenced residuals.
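The two differencing steps can be written as a single operation using the backshift operator $B$, where $B x_t = x_{t-1}$. The symbols $x_t$ (the original series) and $w_t$ (the doubly differenced series) are notation introduced here for illustration:

$$
w_t = (1 - B)(1 - B^{12})\,x_t = x_t - x_{t-1} - x_{t-12} + x_{t-13}
$$

Each value of $w_t$ depends on the observation 13 months earlier, so 13 observations are lost at the start of the series.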

The null hypothesis of the Dickey-Fuller test is that the series is not stationary (the time series has a unit root). The p-value of the Dickey-Fuller test is now 0.01, so we reject the null hypothesis and conclude that the residuals are stationary.

The null hypothesis of the KPSS test is that the time series does not have a unit root, i.e. that it is stationary. The p-value is 0.1, so we fail to reject the null hypothesis that the residuals are stationary.

Both tests now point the same way, so we conclude that the residuals after differencing at lag 1 are stationary. We can then fit the seasonal ARIMA model to these residuals.
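The way the two tests are combined can be summarized as a small decision rule. The function below is a hypothetical helper written for illustration, assuming a 5% significance level; it only encodes the opposite null hypotheses of the two tests and is applied to the p-values reported above.

```python
def joint_stationarity_verdict(adf_p, kpss_p, alpha=0.05):
    """Combine ADF and KPSS p-values into a rough verdict.

    ADF null: the series has a unit root (not stationary).
    KPSS null: the series is stationary.
    """
    adf_rejects_unit_root = adf_p < alpha
    kpss_rejects_stationarity = kpss_p < alpha

    if adf_rejects_unit_root and not kpss_rejects_stationarity:
        return "stationary"
    if kpss_rejects_stationarity and not adf_rejects_unit_root:
        return "non-stationary"
    if not adf_rejects_unit_root and not kpss_rejects_stationarity:
        return "inconclusive"  # neither test rejects its null
    return "conflicting"  # both tests reject their nulls


# p-values after seasonal differencing only (lag 12)
print(joint_stationarity_verdict(0.1575, 0.1))  # inconclusive

# p-values after the additional lag-1 differencing
print(joint_stationarity_verdict(0.01, 0.1))  # stationary
```

An inconclusive verdict does not prove non-stationarity by itself; in the analysis above it is the plot of the residuals that tips the decision towards differencing once more.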

## 4 Identify the Model

We plot the ACF and PACF of the stationary residuals to identify a model for them.
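For reference, the sample ACF and the approximate 95% significance bounds that such plots usually draw can be computed as sketched below. This is a minimal standard-library sketch run on a toy period-12 series, not on the CO2 residuals; the function names are introduced here for illustration, and the bound of 1.96 / sqrt(n) assumes the usual large-sample approximation for white noise.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)
    if c0 == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]


def significance_bound(n):
    """Approximate 95% bound for the sample ACF of white noise."""
    return 1.96 / math.sqrt(n)


# Toy series with a pure period-12 pattern (120 points, 10 full cycles).
toy = [math.sin(2 * math.pi * t / 12) for t in range(120)]
acf = sample_acf(toy, max_lag=24)
bound = significance_bound(len(toy))

# Lags whose sample autocorrelation falls outside the bounds.
significant_lags = [k for k in range(1, 25) if abs(acf[k]) > bound]
print(significant_lags)
```

Roughly 1 in 20 lags is expected to fall outside these bounds by chance even for white noise, which is the basis for treating isolated significant lags as Type I errors.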

First, we look at the lags that are multiples of the seasonal period s = 12 (i.e. 12, 24, 36, 48, 60) to find an ARMA model for the seasonal component. As the plot shows, the ACF is significant at lag 12, the first seasonal lag, and drops to 0 afterwards. The PACF appears to tail off to 0, with significant values at lags 12, 24 and 36. We can therefore fit an MA(1) model to the seasonal component, so the seasonal ARIMA model so far is $ARIMA(p,1,q)(0,1,1)_{12}$.

Next, we look at the non-seasonal part of the ACF and PACF plots, i.e. the lags that are not multiples of 12. The ACF drops to 0 after lag 1, with two other significant lags at 11 and 13, which we regard as Type I errors. The PACF drops to 0 after lag 2; it is significant at lags 1, 2, 11 and 22, but we regard lags 11 and 22 as Type I errors as well. The possible models for the non-seasonal part are therefore MA(1) or AR(2). We fit both models and compare which one is better for forecasting.
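Written with the backshift operator $B$ (where $B x_t = x_{t-1}$), the two candidate models take the following form. The symbols $x_t$, $\varepsilon_t$, $\phi_i$, $\theta_1$ and $\Theta_1$ are notation introduced here, and the plus sign on the moving-average terms is one common convention that some texts reverse:

$$
\text{ARIMA}(2,1,0)(0,1,1)_{12}: \quad (1 - \phi_1 B - \phi_2 B^2)(1 - B)(1 - B^{12})\,x_t = (1 + \Theta_1 B^{12})\,\varepsilon_t
$$

$$
\text{ARIMA}(0,1,1)(0,1,1)_{12}: \quad (1 - B)(1 - B^{12})\,x_t = (1 + \theta_1 B)(1 + \Theta_1 B^{12})\,\varepsilon_t
$$

Both models share the same differencing and the same seasonal MA(1) term; they differ only in whether the short-term dependence is described by two autoregressive coefficients or one moving-average coefficient.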

## 5 Model Selection

### 5.1 ARIMA(2,1,0)(0,1,1)[12]

We use an AR(2) model for the non-seasonal component. The fit results are shown below.

We then plot the ACF and PACF of the residuals of the fitted ARIMA(2,1,0)(0,1,1)[12] model and conduct the Ljung-Box test.

The ACF and PACF are not significant except for the PACF at lag 18, which might be a Type I error. We therefore conclude that no dependence structure remains in the residuals: they are uncorrelated. The Ljung-Box test gives a p-value of 0.005462 < 0.05, so we should reject its null hypothesis and say that the noise terms are not independent. This conclusion does not fully contradict the ACF and PACF results, since uncorrelatedness does not imply independence. As long as the residuals are uncorrelated, the white noise assumption holds. Therefore, we accept that the residuals of the fitted seasonal ARIMA model are white noise.
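To make the test concrete, the Ljung-Box statistic pools the first h squared sample autocorrelations into Q = n(n+2) Σ r_k² / (n − k) and compares Q with a chi-square distribution. The sketch below uses only the standard library and is run on a toy alternating series, not on the model residuals. The function names, the choice of h and the `fitted_params` adjustment to the degrees of freedom are assumptions made for illustration, since the settings used for the reported p-value are not stated.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]


def chi2_sf(x, df):
    """Upper-tail chi-square probability via the series expansion of the
    regularized lower incomplete gamma function (fine for moderate x)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-16 * abs(total) and n < 10000:
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower))


def ljung_box(residuals, max_lag, fitted_params=0):
    """Return (Q, p-value) of the Ljung-Box test up to max_lag."""
    n = len(residuals)
    acf = sample_acf(residuals, max_lag)
    q = n * (n + 2) * sum(acf[k] ** 2 / (n - k) for k in range(1, max_lag + 1))
    df = max_lag - fitted_params
    if df < 1:
        raise ValueError("max_lag must exceed the number of fitted parameters")
    return q, chi2_sf(q, df)


# Toy example: a strictly alternating series is strongly autocorrelated,
# so the test should reject the null hypothesis.
alternating = [(-1) ** t for t in range(100)]
q_stat, p_value = ljung_box(alternating, max_lag=1)
print(q_stat, p_value)
```

Because Q adds up many small autocorrelations, the test can reject even when no single lag crosses the significance bounds in the ACF plot.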

Then we conduct the Shapiro-Wilk test to check the normality assumption.

The p-value is 0.1235 > 0.05, so we fail to reject the null hypothesis that the noise terms are normally distributed.

### 5.2 ARIMA(0,1,1)(0,1,1)[12]

We use an MA(1) model for the non-seasonal component. The fit results are shown below.

We then plot the ACF and PACF of the residuals of the fitted ARIMA(0,1,1)(0,1,1)[12] model and conduct the Ljung-Box test.

The ACF and PACF are not significant except for the PACF at lag 18, which might be a Type I error. We therefore conclude that no dependence structure remains in the residuals: they are uncorrelated. The Ljung-Box test gives a p-value of 0.01112 < 0.05, so we should reject its null hypothesis and say that the noise terms are not independent. This conclusion does not fully contradict the ACF and PACF results, since uncorrelatedness does not imply independence. As long as the residuals are uncorrelated, the white noise assumption holds. Therefore, we accept that the residuals of the fitted seasonal ARIMA model are white noise.

Then we conduct the Shapiro-Wilk test to check the normality assumption.

The p-value is 0.04 < 0.05, so we reject the null hypothesis: normality does not hold for the noise terms of the MA(1) model.

### 5.3 “auto.arima” Function Model

We use the “auto.arima” function to fit the CO2 data. The results are shown below.

The selected model is ARIMA(2,0,1)(1,1,0)$_{12}$. We then plot the ACF and PACF of its residuals and conduct the Ljung-Box test.

The ACF and PACF are not significant, so we conclude that no dependence structure remains in the residuals: they are uncorrelated. The Ljung-Box test gives a p-value of 0.001398 < 0.05, so we should reject its null hypothesis and say that the noise terms are not independent. This conclusion does not fully contradict the ACF and PACF results, since uncorrelatedness does not imply independence. As long as the residuals are uncorrelated, the white noise assumption holds. Therefore, we accept that the residuals of the fitted seasonal ARIMA model are white noise.

Then we conduct the Shapiro-Wilk test to check the normality assumption.

The p-value is 0.04276 < 0.05, so we reject the null hypothesis: normality does not hold for the noise terms of the ARIMA(2,0,1)(1,1,0)$_{12}$ model.

### 5.4 Comparison

We compare these three models in terms of AIC, the white noise assumption of the residuals and the normality assumption of the residuals.

• In terms of AIC, the AIC of the last model is much larger than that of the first two models.

• Their residuals are all white noise.

• In terms of normality, which is of great importance for forecasting, only the first model satisfies the normality assumption.
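For reference, the AIC used in the first criterion is commonly defined as follows, where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood of the fitted model (both symbols are notation introduced here):

$$
\mathrm{AIC} = 2k - 2\ln\hat{L}
$$

A smaller AIC indicates a better balance between goodness of fit and model complexity.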

After weighing these trade-offs, we choose the first model for forecasting because its normality assumption holds. When the normality assumption holds, the forecast intervals in the next step are more reliable, with less bias.

Hence we choose ARIMA(2,1,0)(0,1,1)$_{12}$.

## 6 Forecast for 2005

We use the “forecast” function to forecast the CO2 level in 2005. The forecast plot, point forecasts and forecast intervals are shown below.

The 95% forecast intervals are reliable since the normality assumption of the residuals holds.
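The link between normality and the intervals can be illustrated as follows: under normally distributed forecast errors, a 95% interval is the point forecast plus or minus about 1.96 standard errors. This is a minimal sketch with hypothetical numbers (the point forecast and standard error below are made up for illustration, not results of the fitted model), and the function name is introduced here.

```python
from statistics import NormalDist


def normal_forecast_interval(point_forecast, std_error, level=0.95):
    """Forecast interval assuming normally distributed forecast errors."""
    if not 0 < level < 1:
        raise ValueError("level must be strictly between 0 and 1")
    if std_error < 0:
        raise ValueError("std_error must be non-negative")
    # Two-sided normal quantile, about 1.96 for level = 0.95.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * std_error
    return point_forecast - half_width, point_forecast + half_width


# Hypothetical values, for illustration only.
lower, upper = normal_forecast_interval(point_forecast=380.0, std_error=0.5)
print(lower, upper)
```

If the residuals were clearly non-normal, the 1.96 multiplier would no longer match the true error quantiles, and the nominal 95% coverage could not be trusted.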