Loading the dataset:
library(curl)
f <- curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/bioanth-stats/module-F21-Group2/Dataset_R_2.csv")
d <- read.csv(f, header = TRUE)
head(d)
## Year_AD Fe Zn Ga Th Rb Sr Y Zr Nb
## 1 120 9079 36 22 12 117 122 19 131 11
## 2 130 10145 71 35 19 173 6 50 218 50
## 3 140 8908 63 27 19 165 8 49 197 42
## 4 150 8849 50 27 16 165 4 49 222 39
## 5 160 8842 59 30 20 171 3 52 208 39
## 6 170 9033 49 16 21 165 7 46 195 44
names(d)
## [1] "Year_AD" "Fe" "Zn" "Ga" "Th" "Rb" "Sr"
## [8] "Y" "Zr" "Nb"
Graphing the dataset
plot(Rb ~ Year_AD, data = d, type = "l",
     ylab = "ppm", xlab = "Year")
library(ggplot2)
ggplot(data = d, aes(x = Year_AD, y = Rb)) + geom_line()
# If you get the error "object 'Year_AD' not found", try "X.U.FEFF.Year_AD" instead.
# The first column name can pick up a byte-order mark (BOM) when the CSV is read;
# we think this differs between Mac and Windows.
Analysis of the autocorrelation in the data
The pacf() function computes the partial autocorrelation function, which is used to detect autocorrelation in the data at each lag.
pacf1 <- pacf(d$Rb, main="PACF")
As you can see, autocorrelation is present in the data: it shows up as data points that extend beyond the blue confidence bounds. Specifically, the partial autocorrelation is outside the confidence interval at lags 1 and 2. This will guide the later analysis.
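To see how this kind of PACF plot behaves, here is a small sketch on simulated data (not the obsidian dataset): an AR(2) series should show significant partial autocorrelation at lags 1 and 2 and little beyond, just like the pattern described above.

```r
# Simulated illustration: an AR(2) process with coefficients 0.4 and 0.3
set.seed(1)
sim <- arima.sim(model = list(ar = c(0.4, 0.3)), n = 500)
p.sim <- pacf(sim, plot = FALSE)
# The blue dashed lines in the PACF plot sit at roughly +/- 1.96 / sqrt(n)
bound <- 1.96 / sqrt(length(sim))
round(p.sim$acf[1:4], 3)  # lags 1 and 2 well outside the bound, lags 3-4 near zero
```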
Creating a simple linear model of the data
The next step in the analysis is to make a simple linear model of the data. We know this model does not accurately capture the variation in the data. However, it is helpful to look at the error present in this model to confirm that autocorrelation is present.
lm1 <- lm(Rb ~ Year_AD, data=d)
summary(lm1)
##
## Call:
## lm(formula = Rb ~ Year_AD, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.593 -12.086 3.933 13.544 121.477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.579e+02 4.832e+00 32.683 <2e-16 ***
## Year_AD 1.823e-03 5.265e-03 0.346 0.73
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.53 on 140 degrees of freedom
## Multiple R-squared: 0.0008551, Adjusted R-squared: -0.006282
## F-statistic: 0.1198 on 1 and 140 DF, p-value: 0.7298
Creating ARIMA Models
ARIMA stands for autoregressive integrated moving average. The values p, d, and q are parameters that describe the model we are trying to fit to the data.
p is the number of autoregressive terms: the number of lag observations in the model, also known as the lag order. Lag is defined as the shift of a time series back by a given number of observations. Mathematically, this means the current value is predicted from a linear combination of the p previous values.
AR(p): Y(t) = u + B(1) x Y(t-1) + B(2) x Y(t-2) + … + B(p) x Y(t-p) + e
u = the mean of the time series, Y = a given value in the time series, B = a weight given to each value in the time series, e = the error
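As a quick sketch of what the weights B(1), …, B(p) mean in practice, we can simulate an AR(2) series with known coefficients and check that arima() recovers them (simulated data, not the obsidian dataset):

```r
# Simulate an AR(2) series where B(1) = 0.5 and B(2) = 0.2
set.seed(2)
y.ar <- arima.sim(model = list(ar = c(0.5, 0.2)), n = 1000)
fit.ar <- arima(y.ar, order = c(2, 0, 0))
coef(fit.ar)  # ar1 and ar2 estimates land near 0.5 and 0.2; "intercept" is u
```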
d is the number of non-seasonal differences: the number of times the raw observations are differenced, also called the degree of differencing. Differencing (e.g., subtracting each observation from the observation at the previous time step) is used to make the time series stationary.
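Differencing is easy to do by hand in R with diff(), which is what d = 1 (or d = 2) corresponds to inside arima():

```r
# First-order differencing: each value becomes its change from the previous step
x <- c(5, 7, 10, 9)
diff(x)                    # 2, 3, -1  (x[t] - x[t-1])
diff(x, differences = 2)   # differencing applied twice: 1, -4
```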
q is the number of moving-average terms. Mathematically, this means each value in the time series is predicted from a linear combination of the q previous errors.
MA(q): Y(t) = u + B(1) x e(t-1) + B(2) x e(t-2) + … + B(q) x e(t-q) + e
u = the mean of the time series, Y = a given value in the time series, B = a weight given to each value in the time series, e = the error
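The same sanity check works for the moving-average side: simulate an MA(2) series with known weights and confirm arima() recovers them (again simulated data, not the obsidian dataset):

```r
# Simulate an MA(2) series where B(1) = 0.6 and B(2) = 0.3
set.seed(3)
y.ma <- arima.sim(model = list(ma = c(0.6, 0.3)), n = 1000)
fit.ma <- arima(y.ma, order = c(0, 0, 2))
coef(fit.ma)  # ma1 and ma2 estimates land near 0.6 and 0.3
```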
Source: R in Action by Rob Kabacoff
The arima() function in R fits the model using the parameters suggested by pacf(). Specifically, since the lags outside the confidence interval were 1 and 2, we will use autoregressive and moving-average orders near 1-2. In general, it is good to explore slightly beyond what the PACF suggests, because pacf() is not perfect.
# Store the residuals of the linear model; the ARIMA models below are fit to these
d$mod.resid <- resid(lm1)
# AR
ar.1 <- arima(d$mod.resid, order = c(1, 0, 0))
ar.2 <- arima(d$mod.resid, order = c(2, 0, 0))
ar.3 <- arima(d$mod.resid, order = c(3, 0, 0))
ar.4 <- arima(d$mod.resid, order = c(4, 0, 0))
# MA
ma.1 <- arima(d$mod.resid, order = c(0, 0, 1))
ma.2 <- arima(d$mod.resid, order = c(0, 0, 2))
ma.3 <- arima(d$mod.resid, order = c(0, 0, 3))
ma.4 <- arima(d$mod.resid, order = c(0, 0, 4))
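A common way to choose among candidate ARIMA models like these is to compare their AIC values (lower is better). Here is a sketch on a simulated residual series; the tutorial's d$mod.resid would be compared the same way:

```r
# Simulated residual-like series with AR(2) structure, similar length to the dataset
set.seed(4)
resid.sim <- arima.sim(model = list(ar = c(0.25, 0.25)), n = 142)
fits <- list(ar1 = arima(resid.sim, order = c(1, 0, 0)),
             ar2 = arima(resid.sim, order = c(2, 0, 0)),
             ma1 = arima(resid.sim, order = c(0, 0, 1)),
             ma2 = arima(resid.sim, order = c(0, 0, 2)))
sapply(fits, AIC)  # compare: the lowest AIC is the preferred model
```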
Checking our work
Let’s fit a generalized least squares (GLS) model that builds in the AR(2) (i.e., ARMA(2,0)) correlation structure suggested by the PACF, look at its summary, and then see if the residuals still have autocorrelation using pacf(). Based on the output below, the model is fit with gls() from the nlme package.
library(nlme)
gls.mod1 <- gls(Rb ~ Year_AD, data = d, correlation = corARMA(p = 2))
summary(gls.mod1)
## Generalized least squares fit by REML
## Model: Rb ~ Year_AD
## Data: d
## AIC BIC logLik
## 1314.565 1329.273 -652.2823
##
## Correlation Structure: ARMA(2,0)
## Formula: ~1
## Parameter estimate(s):
## Phi1 Phi2
## 0.2320468 0.2550797
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 157.47854 8.562384 18.391902 0.0000
## Year_AD 0.00186 0.009292 0.200333 0.8415
##
## Correlation:
## (Intr)
## Year_AD -0.893
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -2.2820376 -0.4497547 0.1684956 0.5384068 4.7031405
##
## Residual standard error: 25.91925
## Degrees of freedom: 142 total; 140 residual
confint(gls.mod1)
## 2.5 % 97.5 %
## (Intercept) 140.69657166 174.26050139
## Year_AD -0.01635131 0.02007449
d$new.residuals <- resid(gls.mod1, type="normalized")
pacf3 <- pacf(d$new.residuals, main="PACF - GLS model residuals")
Poof! The autocorrelation is gone! Notice that there are now no values outside the confidence intervals. That means the data have been modeled in such a way that there is no autocorrelation left unaccounted for.
To make sure that the model is appropriate, we check whether the residuals are normally distributed using a Q-Q plot.
qqnorm(d$new.residuals)
qqline(d$new.residuals)
The residuals follow the line in the Q-Q plot fairly well, which suggests they are approximately normally distributed.
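A formal complement to the visual Q-Q check is the Shapiro-Wilk normality test; in the tutorial you would pass d$new.residuals, while the sketch below uses simulated normal data of the same length:

```r
# Shapiro-Wilk test: a p-value above 0.05 gives no evidence against normality
set.seed(5)
res.sim <- rnorm(142)
shapiro.test(res.sim)
```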