Introduction:

Time-series analysis is a mathematical method for predicting future values from past measurements. We use it when we have a single measured variable (sales of a product, for example) together with a time variable. Ex: If last month I sold 20 cars and this month I have sold 25 cars, how many cars will I sell next month? My only variables are car sales and time, measured in months.
Thus, a time series is a series of data points ordered in time (commonly a sequence of successive, equally spaced observations). We use time-series algorithms to build a model, and once we have our model we can use it to predict future values.
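For instance, the car-sales example above could be stored in R as a ts object (a tiny illustrative sketch; the start date and values are hypothetical):

car_sales <- ts(c(20, 25), start = c(2021, 1), frequency = 12) # hypothetical monthly sales starting January 2021
car_sales # a monthly time series with two observations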


Look for the best model -> introduce the “past” inputs -> predict future values


We don’t use time-series analysis when the values are constant (if I sold 20 cars last month and 20 cars this month, it makes no sense to do an analysis for next month). Moreover, we can only use it when the data is stationary. When is the data stationary? There are three conditions:
  • The mean should be constant over time.
  • The variance should be constant over time: the spread of the points around the mean should be similar at different time intervals.
  • The covariance between values should depend only on the lag between them, not on the point in time at which it is measured.
If these 3 conditions are fulfilled, the series is stationary and we can use time-series analysis.
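As a rough sketch of how you might eyeball the first two conditions in R (not part of the original module; it uses the AirPassengers dataset introduced later), you can compare the mean and variance of the first and second halves of the series:

x <- as.numeric(AirPassengers)
half <- length(x) %/% 2
c(mean(x[1:half]), mean(x[(half + 1):length(x)])) # very different means -> the mean is not constant
c(var(x[1:half]), var(x[(half + 1):length(x)]))   # very different variances -> the variance is not constant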


Components of time-series analysis:

  • General trend: the overall behavior of our values (where they are going); are they increasing (upward trend) or decreasing (downward trend)?
  • Seasonal component: How many peaks do we see? Example: We run an airline. We know that at certain times of the year, around celebrations such as Christmas, Thanksgiving or the summer holidays, more people take a plane. So, in a year, how many peaks do we have? If there are 3 important celebrations per year, we are going to see 3 peaks. This is called the seasonal component of the time series.
  • Irregular fluctuation: the random component (uncontrolled circumstances). Example: during winter there is snow and strong wind in Boston, and some flights will need to be cancelled. However, these conditions are not the same every year (one year there might be many cancellations and another year far fewer). This factor can make our analysis less reliable because it is not under our control (hard or impossible to predict).
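A quick way to visualize all three components at once (a small sketch not in the original module, using the AirPassengers dataset introduced below) is the decompose() function from base R, which splits a series into trend, seasonal and random parts:

plot(decompose(AirPassengers)) # panels: observed series, trend, seasonal component, random (irregular) component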


How do we make our data stationary?

  • Variance: applying the log function ( log(variable) ) stabilizes the variance so that it is roughly constant over time.
  • Mean: applying the diff() function (which takes differences between consecutive observations) makes the mean constant over time. It is possible that the data needs no transformation (it is already stationary), or that only one of these two functions is needed!


Let’s see what this means using the R dataset “AirPassengers” from the “datasets” package, which contains the number of people who took a plane each month from 1949 to 1960. For this and the following analyses, make sure you have installed the following packages: nlme, MuMIn, MASS, ggplot2, curl and forecast.


Loading libraries

library(nlme) # gls (generalized least squares)
library(MuMIn) #AICc
library(MASS) # acf(), pacf(), arima()
library(ggplot2) # for graphing
library(curl) # for loading the data
## Using libcurl 7.64.1 with LibreSSL/2.8.3
library(forecast) # for choosing arima coefficients
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## 
## Attaching package: 'forecast'
## The following object is masked from 'package:nlme':
## 
##     getResponse


First, let’s take a look at our data:
AirPassengers #Each value represents the number of people ("Air Passengers") that took the plane the month indicated by the column and the year indicated by the row.
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432


Now let’s plot the data:
plot(AirPassengers)
abline(reg=lm(AirPassengers~time(AirPassengers)))

With the previous line of code we plot the fitted regression (trend) line over the data.
We see the mean is increasing every year. In addition, the dispersion is increasing with time, which means the variance is not equal at different time points. Thus, our data is not stationary. Let’s transform our data using “log” and “diff”, as we learned before, so we can apply a time-series analysis to it!
plot(log(AirPassengers)) #Using "log" function to change the variance.
abline(reg=lm(log(AirPassengers)~time(log(AirPassengers)))) 

Now the variance is roughly constant over time (we can see that the distance from each peak to the line is similar).


plot(diff(log(AirPassengers))) #Using "diff" function to change the mean.
abline(reg=lm(diff(log(AirPassengers))~time(diff(log(AirPassengers))))) 

We can see that the mean is now constant over time.


Now that the data is stationary, we can apply different time-series models, such as the ARIMA model (AutoRegressive Integrated Moving Average), to make predictions based on our stationary data.
We will do so at the end of the module, but first we will learn how the ARIMA model can also be used to eliminate the autocorrelation in a dataset.


ARIMA Model:


Now, we are going to use ARIMA models to get rid of autocorrelation in the residuals of models fitted to a given dataset, so that our errors are random and independent from each other. For this exercise, we are going to use the hypothetical case of an obsidian analysis by portable X-ray fluorescence (pXRF). The samples were taken from one archaeological site in Central Mexico that was active from AD 120 to 1520 (Classic to Postclassic).
The values are presented in parts per million (ppm). Each value represents the mean of all obsidians analyzed with pXRF from the same site that were recovered from very well-dated contexts grouped into ten-year intervals.


Loading the dataset:

f <- curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/bioanth-stats/module-F21-Group2/Dataset_R_2.csv")
d <- read.csv(f, header = TRUE)
head(d)
##   Year_AD    Fe Zn Ga Th  Rb  Sr  Y  Zr Nb
## 1     120  9079 36 22 12 117 122 19 131 11
## 2     130 10145 71 35 19 173   6 50 218 50
## 3     140  8908 63 27 19 165   8 49 197 42
## 4     150  8849 50 27 16 165   4 49 222 39
## 5     160  8842 59 30 20 171   3 52 208 39
## 6     170  9033 49 16 21 165   7 46 195 44
names(d)
##  [1] "Year_AD" "Fe"      "Zn"      "Ga"      "Th"      "Rb"      "Sr"     
##  [8] "Y"       "Zr"      "Nb"


Graphing the dataset

plot(Rb ~ Year_AD, data=d, type="l", pch=16, 
     ylab="ppm", xlab="Year")

ggplot(data = d, aes(x=Year_AD, y= Rb)) + geom_line()

#If you obtain the error "object 'Year_AD' not found", try writing "X.U.FEFF.Year_AD" instead. This prefix comes from a byte-order mark (BOM) that some systems add when reading the CSV, which is probably why it appears on some Mac/Windows setups but not others.


Analysis of the autocorrelation in the data

We start with the pacf() function. This stands for partial autocorrelation function, which is used to find autocorrelation in the data.
pacf1 <- pacf(d$Rb, main="PACF")

As you can see, there is autocorrelation present in the data. It is shown by the bars that extend beyond the blue confidence bounds. Specifically, the autocorrelation is outside the confidence interval at lags 1 and 2. This will guide the later analysis.
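For comparison (a small sketch not in the original module), the ordinary autocorrelation function can be plotted with acf(), which shows the correlation at every lag rather than the partial correlation:

acf(d$Rb, main="ACF - Rb") # compare with the PACF above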

Creating a simple linear model of the data

The next step in the analysis is to fit a simple linear model to the data. We know this model does not accurately capture the variation in the data. However, it is helpful to look at the error present in this model to confirm that autocorrelation is present.
lm1 <- lm(Rb ~ Year_AD, data=d)
summary (lm1)
## 
## Call:
## lm(formula = Rb ~ Year_AD, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.593 -12.086   3.933  13.544 121.477 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.579e+02  4.832e+00  32.683   <2e-16 ***
## Year_AD     1.823e-03  5.265e-03   0.346     0.73    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.53 on 140 degrees of freedom
## Multiple R-squared:  0.0008551,  Adjusted R-squared:  -0.006282 
## F-statistic: 0.1198 on 1 and 140 DF,  p-value: 0.7298


Analyzing autocorrelation in the residuals

After that, we will extract the residuals (errors) from this model and check whether these errors are autocorrelated as well.
d$mod.resid <- residuals(lm1)
pacf2 <- pacf(d$mod.resid, main="PACF - Rb residuals")

pacf2
## 
## Partial autocorrelations of series 'd$mod.resid', by lag
## 
##      1      2      3      4      5      6      7      8      9     10     11 
##  0.277  0.235  0.053 -0.102  0.013  0.028  0.044 -0.134  0.059 -0.015 -0.054 
##     12     13     14     15     16     17     18     19     20     21 
## -0.046  0.009 -0.123  0.066 -0.019  0.077 -0.131 -0.134 -0.072  0.010
Residuals should be random. The values in this graph that fall outside the confidence interval tell us that the residuals are not random. The goal of this analysis is to correct for these non-random residuals.


Creating ARIMA Models

ARIMA stands for autoregressive integrated moving average model. The values p, d and q describe the model we are trying to fit to the data.
p is the number of autoregressive terms. This is the number of lag observations in the model, also known as the lag order. Lag is defined as the shift of a time series back by a given number of observations. Mathematically, an autoregressive model predicts the current value from a linear combination of the p previous values:
AR(p): Y(t) = u + B(1) x Y(t-1) + B(2) x Y(t-2) + … + B(p) x Y(t-p) + e
u = the mean of the time series, Y = a given value in the time series, B = a weight given to each value in the time series, e = the error
d is the number of non-seasonal differences: the number of times that the raw observations are differenced, also called the degree of differencing. (Differencing the raw observations, e.g. subtracting an observation from the observation at the previous time step, is used to make the time series stationary.)
q is the number of moving-average terms. Mathematically, this means each value in a time series is predicted from a linear combination of the q previous errors:
MA(q): Y(t) = u + B(1) x e(t-1) + B(2) x e(t-2) + … + B(q) x e(t-q) + e
u = the mean of the time series, B = a weight given to each error term, e = the error
Source: R in Action by Rob Kabacoff
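To get an intuition for what the p and q orders mean, you can simulate AR and MA processes with arima.sim() from base R (a small sketch not part of the original module; the coefficients 0.5 and 0.3 are arbitrary):

set.seed(1)
sim.ar2 <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 200) # AR(2): each value built from the 2 previous values
sim.ma2 <- arima.sim(model = list(ma = c(0.5, 0.3)), n = 200) # MA(2): each value built from the 2 previous errors
par(mfrow = c(1, 2)); plot(sim.ar2, main = "Simulated AR(2)"); plot(sim.ma2, main = "Simulated MA(2)"); par(mfrow = c(1, 1))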
The arima() function in R is what we will use to fit models based on the parameters suggested by pacf(). Specifically, since the values outside the confidence interval were at lags 1 and 2, we will try autoregressive and moving-average orders near 1 and 2. In general, it is good to explore a little beyond what the confidence interval suggests, because pacf() is not perfect.


#AR
ar.1 <- arima(d$mod.resid, order=c(1,0,0)) 
ar.2 <- arima(d$mod.resid, order=c(2,0,0))
ar.3 <- arima(d$mod.resid, order=c(3,0,0))
ar.4 <- arima(d$mod.resid, order=c(4,0,0))

# MA
ma.1 <- arima(d$mod.resid, order=c(0,0,1)) 
ma.2 <- arima(d$mod.resid, order=c(0,0,2)) 
ma.3 <- arima(d$mod.resid, order=c(0,0,3)) 
ma.4 <- arima(d$mod.resid, order=c(0,0,4)) 


Challenge 1: rank the ARIMA models using AICc

After fitting multiple ARIMAs, we will use AICc() to rank the models from best to worst. The lower the AICc value, the better the model.


AICc(ar.1)
## [1] 1315.643
AICc(ar.2)
## [1] 1309.723
AICc(ar.3)
## [1] 1311.535
AICc(ar.4)
## [1] 1312.069
AICc(ma.1)
## [1] 1319.483
AICc(ma.2)
## [1] 1311.925
AICc(ma.3)
## [1] 1309.614
AICc(ma.4)
## [1] 1311.605


After running AICc(), we find that the model with 2 autoregressive terms (ar.2) and the model with 3 moving-average terms (ma.3) fit the data best.
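A more compact way to rank all eight candidate models at once (a small convenience sketch, not in the original module) is to collect them in a list and apply AICc() to each:

mods <- list(ar.1 = ar.1, ar.2 = ar.2, ar.3 = ar.3, ar.4 = ar.4,
             ma.1 = ma.1, ma.2 = ma.2, ma.3 = ma.3, ma.4 = ma.4)
sort(sapply(mods, AICc)) # lowest AICc (best model) first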


Creating linear models which account for non-random error

We will now use the gls() function to incorporate these two best-fitting correlation structures into a linear model. gls stands for generalized least squares. This type of linear model allows us to incorporate ARIMA correlation terms into the modeling.


gls.mod1 <- gls(Rb ~ Year_AD, data=d, correlation = corARMA(p=2, q=0))
gls.mod2 <- gls(Rb ~ Year_AD, data=d, correlation = corARMA(p=0, q=3))

Challenge 2: rank the models using AICc

AICc(gls.mod1)
## [1] 1315.006
AICc(gls.mod2)
## [1] 1315.296


While the AICc value for model 1 is lower than that of model 2, they are so close that we can consider them equivalent. This means the data is about equally well modeled with 2 autoregressive terms or with 3 moving-average terms.

Checking our work

Let’s look at a summary of the first gls model and check whether the residuals still show autocorrelation using pacf().
summary(gls.mod1)
## Generalized least squares fit by REML
##   Model: Rb ~ Year_AD 
##   Data: d 
##        AIC      BIC    logLik
##   1314.565 1329.273 -652.2823
## 
## Correlation Structure: ARMA(2,0)
##  Formula: ~1 
##  Parameter estimate(s):
##      Phi1      Phi2 
## 0.2320468 0.2550797 
## 
## Coefficients:
##                 Value Std.Error   t-value p-value
## (Intercept) 157.47854  8.562384 18.391902  0.0000
## Year_AD       0.00186  0.009292  0.200333  0.8415
## 
##  Correlation: 
##         (Intr)
## Year_AD -0.893
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -2.2820376 -0.4497547  0.1684956  0.5384068  4.7031405 
## 
## Residual standard error: 25.91925 
## Degrees of freedom: 142 total; 140 residual
confint(gls.mod1)
##                    2.5 %       97.5 %
## (Intercept) 140.69657166 174.26050139
## Year_AD      -0.01635131   0.02007449
d$new.residuals <- resid(gls.mod1, type="normalized")

pacf3 <- pacf(d$new.residuals, main="PACF - GLS model residuals")

Poof! The autocorrelation is gone! Notice that there are now no values outside the confidence intervals. That means the data has been modeled in such a way that there is no autocorrelation in the errors that we are not accounting for.
To make sure that the model is appropriate, we check whether the residuals are normally distributed using a Q-Q plot.


qqnorm(d$new.residuals)
qqline(d$new.residuals)

The residuals follow the Q-Q line fairly well, which suggests they are approximately normally distributed.
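If you also want a numerical check (a small addition not in the original module), the Shapiro-Wilk test can be applied to the normalized residuals:

shapiro.test(d$new.residuals) # a large p-value is consistent with normally distributed residuals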


Forecasting with ARIMA

In this last part of the module we are going to learn how to predict future values of a time series. For this task, we are going to use the “LakeHuron” dataset from the “datasets” package, which contains measurements of the level of Lake Huron (in feet) from 1875 to 1972.



Loading the dataset

LakeHuron
## Time Series:
## Start = 1875 
## End = 1972 
## Frequency = 1 
##  [1] 580.38 581.86 580.97 580.80 579.79 580.39 580.42 580.82 581.40 581.32
## [11] 581.44 581.68 581.17 580.53 580.01 579.91 579.14 579.16 579.55 579.67
## [21] 578.44 578.24 579.10 579.09 579.35 578.82 579.32 579.01 579.00 579.80
## [31] 579.83 579.72 579.89 580.01 579.37 578.69 578.19 578.67 579.55 578.92
## [41] 578.09 579.37 580.13 580.14 579.51 579.24 578.66 578.86 578.05 577.79
## [51] 576.75 576.75 577.82 578.64 580.58 579.48 577.38 576.90 576.94 576.24
## [61] 576.84 576.85 576.90 577.79 578.18 577.51 577.23 578.42 579.61 579.05
## [71] 579.26 579.22 579.38 579.10 577.95 578.12 579.75 580.85 580.41 579.96
## [81] 579.61 578.76 578.18 577.21 577.13 579.10 578.25 577.91 576.89 575.96
## [91] 576.80 577.68 578.38 578.52 579.74 579.31 579.89 579.96


Plotting it

plot(LakeHuron)

After plotting the time series, we can see that the data isn’t stationary, so we difference it once to make it stationary.

plot(diff(LakeHuron))

h <- diff(LakeHuron)


Now we are going to define the p, d and q coefficients for our ARIMA model. auto.arima() is a function from the “forecast” package that estimates the best orders for an ARIMA model fitted to your data.

auto.arima(LakeHuron) 
## Series: LakeHuron 
## ARIMA(0,1,0) 
## 
## sigma^2 estimated as 0.5588:  log likelihood=-109.11
## AIC=220.22   AICc=220.26   BIC=222.79


It returns (0,1,0). d = 1 means that the data is differenced once, which is exactly what we found we had to do to make the data stationary.
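As a cross-check (a small sketch not in the original module), the forecast package also provides ndiffs(), which estimates how many differences are needed to make a series stationary:

ndiffs(LakeHuron) # compare with the d = 1 chosen by auto.arima()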


Now we are going to do the predictions using the predict() function:

a <- arima(LakeHuron, c(0, 1, 0), list(order = c(0, 1, 0))) # We make our arima model with the coefficients obtained before

pred <- predict(a, n.ahead = 10) # We predict the levels of the next 10 years.

pred
## $pred
## Time Series:
## Start = 1973 
## End = 1982 
## Frequency = 1 
##  [1] 580.03 580.10 580.17 580.24 580.31 580.38 580.45 580.52 580.59 580.66
## 
## $se
## Time Series:
## Start = 1973 
## End = 1982 
## Frequency = 1 
##  [1]  0.9752452  2.1807145  3.6490333  5.3416377  7.2326117  9.3032459
##  [7] 11.5392564 13.9292870 16.4640332 19.1356919
#If your dataset has nested time units (e.g. months within years, or days within weeks), you can combine them with "*" in "n.ahead". Ex: n.ahead = 5*12 would predict the next 5 years of monthly values.
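To visualize the forecast next to the observed series (a small sketch not in the original module), the predictions can be plotted with ts.plot():

ts.plot(LakeHuron, pred$pred, lty = c(1, 2)) # solid line = observed levels, dashed line = predicted levels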


Challenge 3:

Using the “AirPassengers” dataset, predict the number of people who are going to take the plane in the following 3 years.
Hint: Make sure your data is stationary before applying the model!
stat_air <- log(AirPassengers)

#We are not applying "diff" here because "auto.arima" will tell us "d", i.e. the number of times we need to difference the data.

auto.arima(stat_air)
## Series: stat_air 
## ARIMA(0,1,1)(0,1,1)[12] 
## 
## Coefficients:
##           ma1     sma1
##       -0.4018  -0.5569
## s.e.   0.0896   0.0731
## 
## sigma^2 estimated as 0.001371:  log likelihood=244.7
## AIC=-483.4   AICc=-483.21   BIC=-474.77
a2 <- arima(AirPassengers, c(0, 1, 1), seasonal = list(order = c(0, 1, 1))) #We make our ARIMA model with the coefficients obtained above.

pred2 <- predict(a2, n.ahead = 3*12) #Prediction of the number of passengers that will take the plane in the following 3 years.

pred2
## $pred
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1961 447.0532 421.8774 453.5262 489.9008 502.1835 564.2246 649.7953 636.7146
## 1962 479.5818 454.4060 486.0548 522.4294 534.7120 596.7532 682.3238 669.2432
## 1963 512.1103 486.9345 518.5833 554.9579 567.2406 629.2817 714.8524 701.7717
##           Sep      Oct      Nov      Dec
## 1961 538.9209 491.0672 422.8243 464.7526
## 1962 571.4495 523.5958 455.3528 497.2811
## 1963 603.9780 556.1243 487.8814 529.8097
## 
## $se
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1961 11.63717 14.14733 16.27484 18.15472 19.85742 21.42523 22.88589 24.25875
## 1962 34.45453 37.66879 40.62956 43.38876 45.98269 48.43791 50.77454 53.00828
## 1963 66.28101 69.96584 73.46609 76.80699 80.00850 83.08675 86.05495 88.92413
##           Sep      Oct      Nov      Dec
## 1961 25.55798 26.79429 27.97601 29.10980
## 1962 55.15161 57.21472 59.20597 61.13240
## 1963 91.70359 94.40125 97.02393 99.57755
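As with the Lake Huron example, the forecast can be plotted against the observed series (a small sketch not in the original module):

ts.plot(AirPassengers, pred2$pred, lty = c(1, 2)) # solid line = observed passengers, dashed line = predicted passengers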