Data Analysis of Monthly CO2
1 Pre-analysis
The data set is the monthly CO2 level at Alert, Canada (01/1994 - 12/2004). First, we observe the overall trend of the CO2 level over this period.
Judging from the plot, there are no significant outliers, and there is no need to transform the data. There is an obvious seasonal component in the picture: we can count 11 complete cycles over this period, so with 132 monthly observations there are 132 / 11 = 12 time points in each cycle, as the quick check below confirms.
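A minimal sketch of this check, using the co2 data set from the TSA package as in the Appendix:

library(TSA) # monthly CO2 level at Alert, Canada
data(co2)
length(co2) # 132 observations over 11 years
frequency(co2) # 12 time points per cycle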
2 Remove Seasonal Components
We remove the seasonal component by taking the difference at lag = 12. The plot shows the residuals after removing the seasonal component; in the seasonal ARIMA model, this corresponds to D = 1.
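In backshift-operator notation, this seasonal differencing is
$$\nabla_{12} X_t = X_t - X_{t-12} = (1 - B^{12})X_t,$$
i.e. D = 1 with seasonal period 12.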
The residuals do not look stationary; there seems to be no constant mean. So we conduct the ADF and KPSS tests to check whether the residuals are stationary.
##
## Augmented Dickey-Fuller Test
##
## data: x1
## Dickey-Fuller = -3.0091, Lag order = 4, p-value = 0.1575
## alternative hypothesis: stationary
##
## KPSS Test for Level Stationarity
##
## data: x1
## KPSS Level = 0.11656, Truncation lag parameter = 2, p-value = 0.1
The Dickey-Fuller test’s null hypothesis is that the series is not stationary (the time series has a unit root). The p-value of the Dickey-Fuller test is 0.1575, so we fail to reject the null hypothesis: the residuals may not be stationary.
The KPSS test’s null hypothesis is that the time series does not have a unit root, i.e. that the series is stationary. The p-value is 0.1, so we fail to reject the null hypothesis that the residuals are stationary.
Combining these two results, we tend to believe that some non-stationarity might be left in the residuals.
3 Find a Stationary Series
So we take the first difference (lag = 1) of the residuals left after removing the seasonal component; in the seasonal ARIMA model, d = 1 here. Then we conduct the ADF and KPSS tests again on the differenced series; for reference, the combined differencing is written out below.
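In backshift-operator notation, the two differencing steps combine to
$$\nabla \nabla_{12} X_t = (1 - B)(1 - B^{12})X_t,$$
i.e. d = 1 and D = 1 in the seasonal ARIMA model.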
##
## Augmented Dickey-Fuller Test
##
## data: x2
## Dickey-Fuller = -5.9106, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
##
## KPSS Test for Level Stationarity
##
## data: x2
## KPSS Level = 0.035736, Truncation lag parameter = 2, p-value = 0.1
The Dickey-Fuller test’s null hypothesis is again that the series has a unit root. The p-value is 0.01 here, so we reject the null hypothesis and conclude that the differenced series is stationary.
The KPSS test’s null hypothesis is that the series is stationary. The p-value is 0.1, so we fail to reject it.
Therefore, the series after the additional difference at lag = 1 is stationary, and we can now fit a seasonal ARIMA model.
4 Find the Model
We plot the ACF and PACF to identify a model for the stationary series.
First of all, we look at the lags that are multiples of the seasonal period s = 12 (i.e. 12, 24, 36, 48, 60) to find the ARMA model for the seasonal component. As we can see from the plot, the ACF is significant at lag 12, the first seasonal lag, and then drops to 0, while the PACF seems to tail off to 0, with the seasonal lags 12, 24 and 36 significant. So we can fit an MA(1) model to the seasonal component, and the seasonal ARIMA model so far is $ARIMA(p,1,q)(0,1,1)_{12}$.
Then we look at the non-seasonal part of the ACF and PACF plots, i.e. the lags that are not multiples of 12. In the ACF, the correlations drop to 0 after lag 1; the two other significant lags at 11 and 13 we regard as Type I errors. In the PACF, the correlations drop to 0 after lag 2; lags 1, 2, 11 and 22 are significant, but we regard lags 11 and 22 as Type I errors. So the possible models for the non-seasonal part are MA(1) or AR(2), and we fit both to compare which forecasts better; their general form is written out below.
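In backshift-operator notation, both candidates are special cases of the multiplicative model
$$\phi(B)(1 - B)(1 - B^{12})X_t = \theta(B)(1 + \Theta_1 B^{12})Z_t, \qquad Z_t \sim WN(0, \sigma^2),$$
with $\phi(B) = 1 - \phi_1 B - \phi_2 B^2$, $\theta(B) = 1$ for the AR(2) candidate and $\phi(B) = 1$, $\theta(B) = 1 + \theta_1 B$ for the MA(1) candidate (we follow R’s sign convention, in which the fitted MA coefficients enter the polynomials with a plus sign).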
5 Model Selection
5.1 ARIMA(2,1,0)(0,1,1)[12]
We fit an AR(2) model for the non-seasonal component, i.e. ARIMA(2,1,0)(0,1,1)[12]. The fit results are shown below.
## Series: co2
## ARIMA(2,1,0)(0,1,1)[12]
##
## Coefficients:
## ar1 ar2 sma1
## -0.5279 -0.1837 -0.7713
## s.e. 0.0926 0.0898 0.1081
##
## sigma^2 estimated as 0.6041: log likelihood=-141.45
## AIC=290.89 AICc=291.24 BIC=302.01
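Written out with the estimated coefficients (using R’s sign convention), the fitted model is
$$(1 + 0.5279B + 0.1837B^2)(1 - B)(1 - B^{12})X_t = (1 - 0.7713B^{12})Z_t, \qquad \hat{\sigma}^2 = 0.6041.$$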
Then we plot the ACF and PACF of the residuals after we fit the ARIMA(2,1,0)(0,1,1)[12] model and conduct the Ljung-Box test.
##
## Box-Ljung test
##
## data: y1
## X-squared = 28.038, df = 12, p-value = 0.005462
The ACF and PACF of the residuals are not significant except for the PACF at lag 18, which might be a Type I error. So we can conclude that there is no dependence structure remaining in the residuals: they are uncorrelated. The Ljung-Box test gives a p-value of 0.005462 < 0.05, so we reject its null hypothesis and say the noises are not independent. This is not totally contradictory to the ACF and PACF results, since uncorrelatedness does not imply independence, and as long as the residuals are uncorrelated, the white noise assumption holds. Therefore, we accept that the residuals of this seasonal ARIMA model are white noise.
Then we conduct the Shapiro-Wilk test to check the normality assumption.
##
## Shapiro-Wilk normality test
##
## data: y1
## W = 0.98397, p-value = 0.1235
The p-value is 0.1235 > 0.05, so we fail to reject the null hypothesis: normality holds for the noises.
5.2 ARIMA(0,1,1)(0,1,1)[12]
We fit an MA(1) model for the non-seasonal component, i.e. ARIMA(0,1,1)(0,1,1)[12]. The fit results are shown below.
## Series: co2
## ARIMA(0,1,1)(0,1,1)[12]
##
## Coefficients:
## ma1 sma1
## -0.5792 -0.8206
## s.e. 0.0791 0.1137
##
## sigma^2 estimated as 0.5683: log likelihood=-139.54
## AIC=285.08 AICc=285.29 BIC=293.41
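Written out with the estimated coefficients, the fitted model is
$$(1 - B)(1 - B^{12})X_t = (1 - 0.5792B)(1 - 0.8206B^{12})Z_t, \qquad \hat{\sigma}^2 = 0.5683.$$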
Then we plot the ACF and PACF of the residuals after we fit the ARIMA(0,1,1)(0,1,1)[12] model and conduct the Ljung-Box test.
##
## Box-Ljung test
##
## data: y2
## X-squared = 25.891, df = 12, p-value = 0.01112
The ACF and PACF of the residuals are not significant except for the PACF at lag 18, which might be a Type I error. So we can conclude that there is no dependence structure remaining in the residuals: they are uncorrelated. The Ljung-Box test gives a p-value of 0.01112 < 0.05, so we reject its null hypothesis and say the noises are not independent. As before, this is not totally contradictory to the ACF and PACF results, since uncorrelatedness does not imply independence, and as long as the residuals are uncorrelated, the white noise assumption holds. Therefore, we accept that the residuals of this seasonal ARIMA model are white noise.
Then we conduct the Shapiro-Wilk test to check the normality assumption.
##
## Shapiro-Wilk normality test
##
## data: y2
## W = 0.97914, p-value = 0.04
The p-value is 0.04 < 0.05, so we reject the null hypothesis: normality does not hold for the noises of the MA(1) model.
5.3 “auto.arima” Function Model
We use the “auto.arima” function to fit the CO2 data. The results are shown below.
## Series: co2
## ARIMA(2,0,1)(1,1,0)[12]
##
## Coefficients:
## ar1 ar2 ma1 sar1
## 0.7832 0.2024 -0.4252 -0.4547
## s.e. 0.1641 0.1592 0.1548 0.0817
##
## sigma^2 estimated as 0.7286: log likelihood=-150.53
## AIC=311.06 AICc=311.59 BIC=325
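The selected model is ARIMA(2,0,1)(1,1,0)[12]; written out with the estimated coefficients, it is
$$(1 - 0.7832B - 0.2024B^2)(1 + 0.4547B^{12})(1 - B^{12})X_t = (1 - 0.4252B)Z_t, \qquad \hat{\sigma}^2 = 0.7286.$$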
Then we plot the ACF and PACF of the residuals after we fit this model and conduct the Ljung-Box test.
##
## Box-Ljung test
##
## data: y3
## X-squared = 31.971, df = 12, p-value = 0.001398
The ACF and PACF of the residuals are not significant, so we can conclude that there is no dependence structure remaining in the residuals: they are uncorrelated. The Ljung-Box test gives a p-value of 0.001398 < 0.05, so we reject its null hypothesis and say the noises are not independent. As before, this is not totally contradictory to the ACF and PACF results, since uncorrelatedness does not imply independence, and as long as the residuals are uncorrelated, the white noise assumption holds. Therefore, we accept that the residuals of this model are white noise.
Then we conduct the Shapiro-Wilk test to check the normality assumption.
##
## Shapiro-Wilk normality test
##
## data: y3
## W = 0.97943, p-value = 0.04276
The p-value is 0.04276 < 0.05, so we reject the null hypothesis: normality does not hold for the noises of the ARIMA(2,0,1)(1,1,0)[12] model.
5.4 Comparison
We compare these three models in terms of AIC, the white noise assumption on the residuals, and the normality assumption on the residuals.
## AIC RESIDUAL NORMALITY
## ARIMA(2,1,0)*(0,1,1) 290.8930 White Noise Hold
## ARIMA(0,1,1)(0,1,1) 285.0769 White Noise Not Hold
## ARIMA(2,0,1)(1,1,0) 311.0631 White Noise Not Hold
- In terms of AIC, the last model’s AIC is much larger than that of the first two models.
- The residuals of all three models are white noise.
- In terms of normality, which is of great importance for forecasting, only the first model satisfies the normality assumption.
After some trade-offs, we choose the first model for forecasting because its normality assumption holds; when normality holds, the forecast intervals in the next step are more reliable and less biased.
Hence we choose ARIMA(2,1,0)(0,1,1)[12].
6 Forecast in 2005
We use the “forecast” function to forecast the CO2 level in 2005. The forecast plot, point forecasts and forecast intervals are shown below.
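A minimal sketch of this step (the full code is in the Appendix; fit2 is the chosen ARIMA(2,1,0)(0,1,1)[12] fit from Section 5.1):

library(forecast)
fc <- forecast(fit2, h = 12, level = c(80, 95)) # 12-step-ahead forecasts
as.data.frame(fc) # point forecasts with the 80% and 95% intervals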
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2005 382.9487 381.9516 383.9458 381.4238 384.4737
## Feb 2005 383.6491 382.5465 384.7517 381.9629 385.3354
## Mar 2005 384.0068 382.7678 385.2459 382.1119 385.9018
## Apr 2005 384.6997 383.3178 386.0817 382.5863 386.8132
## May 2005 385.2176 383.7228 386.7124 382.9314 387.5037
## Jun 2005 383.2848 381.6801 384.8895 380.8306 385.7390
## Jul 2005 376.4233 374.7157 378.1310 373.8117 379.0350
## Aug 2005 370.4264 368.6225 372.2304 367.6676 373.1853
## Sep 2005 371.1600 369.2644 373.0557 368.2609 374.0592
## Oct 2005 375.7957 373.8126 377.7788 372.7628 378.8287
## Nov 2005 380.4828 378.4159 382.5496 377.3218 383.6437
## Dec 2005 383.2483 381.1010 385.3956 379.9642 386.5323
The 95% forecast intervals are reliable since the normality assumption of the residuals holds.
Appendix
library(TSA)
library(forecast)
library(tseries) # for adf.test() and kpss.test()
data(co2)
t<- as.vector(time(co2))
xt <- as.vector(co2)
n <- length(xt)
plot(co2, main = "Monthly CO2 Level at Alert, Canada (01/1994 - 12/2004)",ylab = "CO2 Level")
#seasonal differencing
d = n / 11 # 132 observations / 11 cycles = 12 time points per cycle
x1 <- diff(xt,lag = d) #remove seasonal component
plot(x1,type = "l", main = "Residuals after Removing the Seasonal Component",ylab ="",xlab = "")
adf.test(x1) # fail to reject H0: x1 may not be stationary
kpss.test(x1) # fail to reject H0: x1 is stationary
#not stationary
x2 = diff(x1, lag = 1)
plot(x2,type = "l", main = "Residuals after taking the difference of Lag = 1",ylab ="",xlab = "")
adf.test(x2) # reject H0: x2 is stationary
kpss.test(x2) # fail to reject H0: x2 is stationary
par(mfrow=c(2,1))
acf(x2,lag.max = 60) # at the seasonal lags, the ACF cuts off after lag 12
pacf(x2,lag.max = 60) # at the seasonal lags, the PACF tails off
#MA(1) for the seasonal component
# Choice of AR(2) or MA(1)
#arima model with AR(2)
fit2 <- Arima(co2,order = c(2,1,0),seasonal = c(0,1,1))
fit2
y1 <- resid(fit2)
par(mfrow = c(2,1))
acf(y1)
pacf(y1)
Box.test(y1,lag = min(d*2,n/5),type = "Ljung-Box",fitdf = 12) #not independent
shapiro.test(y1) # normal dist'n
#arima model with MA(1)
fit3 <- Arima(co2,order = c(0,1,1),seasonal = c(0,1,1))
fit3
y2 <- resid(fit3)
acf(y2)
pacf(y2)
Box.test(y2,lag = min(d*2,n/5),type = "Ljung-Box",fitdf = 12) #not independent
shapiro.test(y2) # not normal
fit4 <- auto.arima(co2,allowdrift = FALSE,stepwise = F)
fit4
y3 <- resid(fit4)
par(mfrow = c(2,1))
acf(y3)
pacf(y3)
Box.test(y3,lag = min(d*2,n/5),type = "Ljung-Box",fitdf = 12) #not independent
#This model, ARIMA(2,0,1)(1,1,0)[12], has a higher AIC than the previous two models.
shapiro.test(y3) # not normal
#model selection
##AIC
comp<- data.frame(row.names = c("AIC","RESIDUAL","NORMALITY"))
rownames(comp) <- c("ARIMA(2,1,0)*(0,1,1)","ARIMA(0,1,1)(0,1,1)","ARIMA(2,0,1)(1,1,0)")
comp$AIC <- c(fit2$aic,fit3$aic,fit4$aic)
comp$RESIDUAL <- c("White Noise","White Noise","White Noise")
comp$NORMALITY <- c("Hold","Not Hold","Not Hold")
comp
#In terms of AIC, the last model is much larger than the previous two models.
#Their residuals are all white noise.
#In terms of normality, which is of great importance for forecasting, only the first model holds the normality assumption.
#After some trade-offs, we choose the first model to forecast.
#Hence we choose ARIMA(2,1,0)(0,1,1)[12]
fc = forecast(fit2)
plot(fc)
fc1 = as.data.frame(fc)
fc1 = fc1[1:12,]
fc1
#use this model to forecast 2005
par(mfrow = c(1,1))
plot(130:132, xt[130:132], type = "l", xlim = c(130,145), ylim = c(366, 390), main = "Forecast CO2 Level in 2005 (12 months)", xlab = "Time", ylab = "CO2 Level")
lines(132:144, c(xt[132], fc$mean[1:12]), col = "red")
lines(132:144, c(xt[132], fc$lower[1:12, "95%"]), col = "darkgrey", lty = 4) # 95% lower bound
lines(132:144, c(xt[132], fc$upper[1:12, "95%"]), col = "darkgrey", lty = 4) # 95% upper bound
axis(1, at = 133:144, labels = rownames(fc1)) # label the 12 forecast months
legend(x = 130, y = 370, legend = c("Forecast Mean", "95% Forecast Interval"),
col = c("red", "darkgrey"), lty = c(1,4))