---
title: "Business Data Analytics. Practice Session"
subtitle: Time Series Analysis
author: University of Tartu
output:
  prettydoc::html_pretty:
    highlight: github
    theme: cayman
  html_document: default
  html_notebook: default
  github_document: default
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
```
## Introduction
In today's lab session we will perform time series analysis on stock market data. The provided dataset contains historical stock prices (for the last year) for 29 of the 30 DJIA companies.
## Load libraries and set global parameters
```{r message=FALSE, warning=FALSE}
library("tidyverse")
library("data.table")
library("gridExtra")
#New packages:
library("dygraphs") # for creating interactive plots
library("xts") # for time series
library("tseries") # for Dickey-Fuller Test
library("forecast") # for ACF, PACF
library("plotly") # for candleplot
library("quantmod") # for candleplot
```
## Read Data
```{r}
s_data <- read.csv(file.choose()) # all_stocks.csv
```
## Data overview
We will use R's 'summary' function to get an overview of the data.
```{r}
summary(s_data)
```
### Dataset description
* Date - in format: yyyy-mm-dd
* Open - price of the stock at market opening (this is NYSE data so all in USD)
* Close - price of the stock at market closing
* High - Highest price reached in the day
* Low - Lowest price reached in the day
* Volume - Number of shares traded
* Name - the stock's ticker name
Since we have plenty of data, rows with missing values will be removed.
Also, the 'Date' feature is listed as a factor, so we will convert it to the 'Date' type.
### Data transformation
```{r eval=F}
# remove NA
# change type of date
```
```{r include=F}
s_data <- na.omit(s_data)
s_data$Date <- as.Date(s_data$Date, format = "%Y-%m-%d")
str(s_data)
```
## Let's look at the price distributions
```{r eval=F}
# create 4 plots (for Open, High, Low, Close)
# x - value
# y - density
```
```{r echo=F}
p1 = ggplot(s_data, aes(Open)) + geom_density()
p2 = ggplot(s_data, aes(High)) + geom_density()
p3 = ggplot(s_data, aes(Low)) + geom_density()
p4 = ggplot(s_data, aes(Close)) + geom_density()
grid.arrange(p1,p2,p3,p4, nrow=2,ncol=2)
```
From the plots above we can see that the distributions of all four stock prices are right-skewed: most of the mass sits at lower prices, with a long tail toward higher prices.
Furthermore, for each company there is no big difference between the four price variants (Open, High, Low, Close).
## Time Series Analysis
First, let's select only 5 companies as an example and look at their stocks and time series (Open, Close, High, Low).
```{r}
s_data5 <- s_data %>% filter(Name %in% c("IBM", # IBM
"BA", # Boeing Co
"AAPL", # Apple
"GS", # Goldman Sachs
"GOOGL" # Google
))
```
### Create and plot Time Series - High
There are 5 time series in the provided data (High, Low, Open, Close, Volume).
We will look at the High values first.
```{r}
xts_list <- vector(mode = "list", length = length(unique(s_data5$Name))) # prepare data
names(xts_list) = unique(s_data5$Name)
for (company in unique(s_data5$Name)){
tmp_data <- s_data5 %>% filter(Name == company) # select data
xts = xts(tmp_data$High, order.by=tmp_data$Date) # create extensible time series
attr(xts, 'frequency') <- length(xts)/12 # set frequency
xts_list[[company]] <- xts
}
xts_table = do.call(cbind, xts_list)
dygraph(xts_table, xlab = "Time", ylab = "High value", main = "Time Series") %>%
dyRangeSelector()
```
### Create and plot Time Series - Low, Open, Close
Now try to take any other feature (Low, Open, Close or Volume) and prepare a plot similar to the previous one.
```{r eval=F}
```
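One possible solution sketch for the Close series, reusing the same pipeline as the High plot above (it assumes the `s_data5` data frame from earlier; left as `eval=F` so you can try the exercise yourself first):

```{r eval=F}
# Sketch: same steps as for High, but for Close (assumes s_data5 exists)
xts_close <- vector(mode = "list", length = length(unique(s_data5$Name)))
names(xts_close) <- unique(s_data5$Name)
for (company in unique(s_data5$Name)) {
  tmp <- s_data5 %>% filter(Name == company)        # select one company
  x <- xts(tmp$Close, order.by = tmp$Date)          # extensible time series
  attr(x, 'frequency') <- length(x) / 12            # same frequency convention as above
  xts_close[[company]] <- x
}
dygraph(do.call(cbind, xts_close),
        xlab = "Time", ylab = "Close value", main = "Time Series - Close") %>%
  dyRangeSelector()
```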
Next we will demonstrate the time series modeling process on the 'GOOGL' (Google) stock.
### Candle plot
```{r}
tmp_data <- s_data5 %>% filter(Name == "GOOGL") # select data
# custom colors
i <- list(line = list(color = '#FFD700'))
d <- list(line = list(color = '#0000ff'))
tmp_data %>%
plot_ly(x = ~Date, type="candlestick",
open = ~Open, close = ~Close,
high = ~High, low = ~Low,
increasing = i, decreasing = d)
```
## Stationary test
A stationary process has a mean and variance that do not change over time, and it has no trend.
The time series above are not stationary.
To confirm this, we will use the (augmented) Dickey-Fuller test for stationarity.
Dickey-Fuller test for the GOOGL series:
```{r}
xts_GOOGL <- xts_list[["GOOGL"]]
adf.test(xts_GOOGL, alternative = "stationary", k = 0)
```
p-value > 0.05: we fail to reject the null hypothesis (H0), so the data has a unit root and is non-stationary.
## Decomposing Time Series
Decomposing a time series means separating it into its trend, seasonal, and irregular components.
We test both the additive model and the multiplicative model.
```{r}
ts <- as.ts(xts_GOOGL)
plot(decompose(ts, type = "additive"), col = "darkgreen")
plot(decompose(ts, type = "multiplicative"), col = "darkblue")
tscomponents <- decompose(ts, type = "additive")
```
## Time Series Analysis
Before we start with the time series analysis, let's go through the theory briefly.
What is an AutoRegressive (AR) model:
Autoregressive (AR) models are models where the value of a variable in one period is related to its values in previous periods.
AR(p) is an autoregressive model with p lags.
What is a Moving Average (MA) model:
A moving average (MA) model accounts for the possibility of a relationship between a variable and the residuals from previous periods.
MA(q) is a moving average model with q lags.
What is an ARIMA model:
An autoregressive integrated moving average model combines p autoregressive terms and q moving average terms; together with d differencing steps it is written ARIMA(p,d,q).
'Integrated' refers to the differencing used to make the data stationary, represented by the parameter d.
ARIMA requires a stationary time series, so we have to prepare our series for it.
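To build intuition for the AR and MA processes described above, we can simulate them with base R's `arima.sim` (illustrative only; the coefficients 0.7 and 0.5 are arbitrary choices, not estimates from our data):

```{r eval=F}
set.seed(42)
ar1 <- arima.sim(model = list(ar = 0.7), n = 250)  # AR(1): x_t = 0.7*x_{t-1} + e_t
ma1 <- arima.sim(model = list(ma = 0.5), n = 250)  # MA(1): x_t = e_t + 0.5*e_{t-1}
par(mfrow = c(2, 1))
plot.ts(ar1, main = "Simulated AR(1), phi = 0.7")
plot.ts(ma1, main = "Simulated MA(1), theta = 0.5")
```

Note how the AR(1) series wanders (values carry over between periods), while the MA(1) series reverts quickly because only the previous shock matters.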
## Differencing a Time Series
In order to make the time series stationary, we have to remove seasonality and trend.
First, let's remove seasonality:
```{r}
ts_noseason <- ts - tscomponents$seasonal
plot.ts(ts_noseason, col = "darkgreen")
adf.test(ts_noseason, alternative = "stationary", k = 0)
```
The p-value decreased, but is still > 0.05.
Let's remove trend:
```{r}
tsdiff <- diff(ts_noseason, differences=1)
plot.ts(tsdiff, col = "darkgreen")
adf.test(tsdiff, alternative = "stationary", k = 0)
```
The p-value is < 0.05.
The differenced time series (above) appears to be stationary.
## Selecting a Candidate ARIMA Model (ACF and PACF)
The next step is to select an appropriate ARIMA model, which means finding the most appropriate values of p and q for an ARIMA(p,d,q) model. You usually need to examine the correlogram and partial correlogram of the stationary time series for this.
To plot a correlogram and partial correlogram, we can use the acf() and pacf() functions in R, respectively.
```{r}
Acf(tsdiff, lag.max=60) # plot a correlogram
Acf(tsdiff, lag.max=60, plot=FALSE) # get the autocorrelation values
```
```{r}
Pacf(tsdiff, lag.max=60) # plot a partial correlogram
Pacf(tsdiff, lag.max=60, plot=FALSE) # get the partial autocorrelation values
```
Now, we could compare the sample ACF and PACF to those of various theoretical ARMA models. Use the properties of the ACF and PACF as a guide to estimate plausible models and select appropriate p, d and q. An alternative to this is discussed next.
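If the ACF/PACF plots suggest a particular order, the model can also be fitted by hand with `forecast::Arima` (the order (1, 1, 1) below is only a placeholder example, not the "right" answer for this series):

```{r eval=F}
# Fit a hand-picked ARIMA(p, d, q) on the GOOGL series; (1, 1, 1) is illustrative
fit_manual <- Arima(ts, order = c(1, 1, 1))
summary(fit_manual)
fit_manual$aicc  # compare candidate models by AICc (lower is better)
```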
## Fitting an ARIMA Model (auto.arima)
R provides the function auto.arima, which returns the best ARIMA model according to either the AIC, AICc or BIC value. The function conducts a search over possible models within the order constraints provided.
We train 3 models on different training data.
For example, the model 'tsarima120' is trained on the whole time series excluding the last 120 daily observations.
```{r}
tsarima120 <- auto.arima(head(ts, -120), max.p = 3, max.q = 3, max.d = 3, seasonal=FALSE, trace=TRUE) #120
```
```{r}
tsarima60 <- auto.arima(head(ts, -60), max.p = 3, max.q = 3, max.d = 3, seasonal=FALSE) #60
tsarima30 <- auto.arima(head(ts, -30), max.p = 3, max.q = 3, max.d = 3, seasonal=FALSE) #30
```
## Forecasting using an ARIMA Model
```{r}
tsforecasts120 <- forecast(tsarima120, h = 120) # forecast the next 120 time steps
tsforecasts60 <- forecast(tsarima60, h = 60) # forecast the next 60 time steps
tsforecasts30 <- forecast(tsarima30, h = 30) # forecast the next 30 time steps
```
```{r}
autoplot(tsforecasts120)
accuracy(tsforecasts120, head(tail(ts, 120), 120))
```
```{r}
autoplot(tsforecasts60)
accuracy(tsforecasts120, head(tail(ts, 120), 60))
accuracy(tsforecasts60, head(tail(ts, 60), 60))
```
```{r}
autoplot(tsforecasts30)
accuracy(tsforecasts120, head(tail(ts, 120), 30))
accuracy(tsforecasts60, head(tail(ts, 60), 30))
accuracy(tsforecasts30, head(tail(ts, 30), 30))
```
```{r}
print('tsforecasts120')
checkresiduals(tsforecasts120)
print('tsforecasts60')
checkresiduals(tsforecasts60)
print('tsforecasts30')
checkresiduals(tsforecasts30)
```
The forecast errors seem to be normally distributed with mean zero and constant variance, so the ARIMA model does appear to provide an adequate predictive model.