---
title: "Business Data Analytics"
subtitle: "Lab 2. Descriptive analysis and visualization"
author: "University of Tartu"
output:
prettydoc::html_pretty: null
highlight: github
html_document: default
html_notebook: default
github_document: default
theme: cayman
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
```
Descriptive analysis (sometimes referred to as Exploratory analysis) is a
crucial part in any kind of analysis. The goal is to "get to know your dataset" -
to find outliers, irregularities and identify what transformations to do for the
further stages (e.g. for modelling). Exploratory phase is largely an iterative
process, where you develop understanding of the data by
* collecting questions about the data
* answering them by transforming the data, summarizing and plotting it,
* which leads to new questions :)
##Prerequisites (recap)
The power of R comes from the vast universe of packages. Every time you
need a new package you install it:
```{r eval=FALSE}
install.packages("ggplot2")
install.packages("dplyr")
```
In your script you will load packages by writing:
```{r}
library("ggplot2")
library("dplyr")
```
##Loading datasets
Now we load a dataset into R. Today we are analyzing data (artificial) of total
sales on a weekly basis for two particular products in different stores.
There are many ways things can be done in R. For example, you can read in
dataset using basic functionality:
There are a planty of ways to load data from .csv files.
The one that we will use:
```{r}
df <- read.csv(file.choose()) # lab2_Data.csv
```
Let's print some rows and look at the data that we loaded:
```{r}
head(df,3)
```
Let us understand the dataset.
```{r eval=FALSE}
View(df)
```
We can use another functions to study our data:
```{r}
str(df)
```
```{r}
dim(df)
```
Now find all the countries where stores are located.
## Removing NA values
Let's check for a NA:
```{r}
rows <- which(is.na(df), arr.ind=TRUE)[,1]
df[rows,]
```
We can use next function to remove rows with NA:
```{r}
df <- na.omit(df)
rows <- which(is.na(df), arr.ind=TRUE)[,1]
df[rows,]
```
##Descriptive statistics
Summary statistics is displayed by:
```{r}
summary(df)
```
We see that country is a factor and represents country where store is located. There are 7 different countries and we can see amount of observations with each country after ":".
We can study columns using next functions:
```{r}
unique(df$country)# displays all unique values
```
```{r}
length(unique(df$storeNum)) # number of unique countries in our dataset
```
##Discrete variables
Another useful inspection for discrete features (variables) is using frequency
counts. This can be achieved by table function:
```{r}
table(df$p1prom)
```
```{r}
table(promotion_p1=df$p1prom, promotion_p2=df$p2prom) # two-way frequency table
```
The next question we might ask is how much varies price of the first and
second product:
```{r}
summary(df$p1price)
```
```{r}
summary(df$p2price)
```
We also would like to know what are the prices when the product's price is
reduced and is not. Price is a continuous variable. However, we already
checked that it does not vary much.
We can display frequencies using table function:
```{r}
price_table <- table(promotion=df$p1prom, price=df$p1price) # how many times each product was promoted at each price level
price_table
```
But let's see the percentages and margins instead of the counts. This way, we can compare it to the second product.
```{r}
# promotions are in rows, prices are in columns
# summing over rows
margin.table(price_table, 1)
```
```{r}
# summing over columns
margin.table(price_table, 2)
```
```{r}
prop.table(price_table) # cell %
```
```{r}
prop.table(price_table, 1) # row percentages
```
```{r}
prop.table(price_table, 2) # column percentages
```
Find now the percentages of discounted products for different price levels for product 2.
##Continuous variables
For continuous variables, where data varies a lot, table is not feasible nor
meaningful. Try for example running table(dt$p1sales). The following table
summarizes the functions of descriptive statistics that we discussed during
the lecture.
* min(x) - Minimum value
* max(x) - Maximum value
* mean(x) - Mean
* median(x) - Median
* var(x) - Variance
* sd(x) - Standard deviation
* quantile(x, probs=c(0.25, 0.5, 0.75)) - Percentiles/Quartiles
* cor(x, y) Correlation function
Observe statistics for sales of product1. Is there anything interesting?
##Data Transformations
Often exploring data involves a narrower approach. For example, instead of
average sales of the first product we want to calculate average sales for each
country, or for each shop. Or we want to filter the data and take a look at
average sales when the product was promoted. If you have experience with
other programming languages, your first approach for such analysis would
be to run for loops. However, the power of R comes from
the vectorized approach.
Vectorization is the process of rewriting a loop so that instead of processing
a single element of an array N times, it processes (for example) 4 elements of the
array simultaneously N/4 times.
A bit too technical, but let's see what it means on practice. Let's first try to
calculate average sales per country using classical approach:
```{r}
countries = unique(df$country) # choose unique countries in our dataset
for(cn in countries){ # create a loop to pick a country from our country vector
country_df <- filter(df, country==cn) # filter the data, so that only data for this country is chosen
country_mean <- mean(country_df$p1sales) # calculate the mean for this country
print(c(cn, country_mean)) # print the result one-by-one
}
```
Now the vectorized approach would be:
```{r}
df %>%
group_by(country) %>%
summarise(sales_mean = mean(p1sales))
```
Let's measure the time taken for each of two approaches by
writing Sys.time around our functions (the first is 'for loop' and the second
is vectorized approach):
```{r}
start <- Sys.time()
countries = unique(df$country) # choose unique countries in our dataset
for(cn in countries){ # create a loop to pick a country from our country vector
country_df <- filter(df, country==cn) # filter the data, so that only data for this country is chosen
country_mean <- mean(country_df$p1sales) # calculate the mean for this country
print(c(cn, country_mean)) # print the result one-by-one
}
Sys.time() - start
```
```{r}
start <- Sys.time()
df %>%
group_by(country) %>%
summarise(sales_mean = mean(p1sales))
Sys.time() - start
```
And this is on a very tiny dataset! On bigger sizes the difference becomes
much more crucial.
##Visualization
Let's try to practice how to do charts for our later analysis, as we will do a lot
of graphs along the course. It is important to remember what is the purpose
of your plots. There are plots with the main goal to communicate the
message for the decision-makers or consumers. These plots have to clearly
convey the message that was found and investigated earlier. For the
exploratory analysis the graphs facilitate the discovery. We have to learn how
to detect interesting and unexpected patterns and what questions to ask
next.
###Bar chart. Multi-set bar chart
The following chart displays how many transactions per country were made.
```{r}
ggplot(data = df) +
geom_bar(mapping = aes(x = country))
```
Compare it to the results of the table df %>% count(country):
```{r}
df %>% count(country)
```
The ggplot2 plots are very flexible. Let's try to beautify it a bit:
```{r}
ggplot(data = df) +
geom_bar(mapping = aes(x = country), fill='#f1e8ca',
color='#745151', alpha=0.8) + theme_bw(base_size=28)
```
Multi-chart bar:
```{r}
df %>%
mutate(p1prom = as.factor(p1prom)) %>%
ggplot() +
geom_bar(mapping = aes(x = country, fill=p1prom),
position='dodge')
```
How do you interpret this plot? Try to customize it by changing colors (check help for scale_fill_manual())
###Histogram. Density
Histogram shows the distribution of continuous variable. In a histogram the
x-axis is partitioned into equal bins. The height of a bar shows how many
observations fall into each bin. The results of the histogram heavily depend
on the chosen bin size.
```{r}
ggplot(data = df) +
geom_histogram(mapping = aes(x = p1sales))
ggplot(data = df) +
geom_histogram(mapping = aes(x = p1sales), binwidth=50)
ggplot(data = df) +
geom_histogram(mapping = aes(x = p1sales), binwidth=1)
```
Another plot for the distribution is density plot. It is a smoothed histogram
that provide a general estimate of the distribution. The area under the
distribution is always sums to 1. It helps to estimate general tendency,
however it comes at a cost, as y-axis is difficult to interpret.
```{r}
ggplot(data = df) +
geom_density(mapping = aes(x = p1sales), fill='#f1e8ca')
```
It is more helpful to compare several distributions via density plots:
```{r}
ggplot(data = df) +
geom_density(mapping = aes(x = p1sales), fill='#f1e8ca',
alpha=0.3) +
geom_density(aes(x=p2sales), fill = '#745151', alpha=0.3) + xlab('sales')
```
###Boxplot
While histogram is a good way to show the distribution of one continuous
variable, if we want to compare several distributions, it may become tricky.
Densities are good, if there are not too many either. We could overlay several
histograms, but if the scale between two histograms vary a lot, it is difficult to
analyze. A better approach would be to use boxplots:
```{r}
ggplot(data = df, mapping = aes(x = as.factor(storeNum), y = p1sales)) + geom_boxplot() + theme_bw() + xlab("storeNum")
```
Boxplots help to quickly grasp the information about several distributions
and compare it.
Try to change geom_boxplot to geom_violin. Use help to check (type ?geom_violin).
###Line chart
Line charts are often the easiest to understand :). Usually, the most common
way is to use it for time-series, e.g. we want to observe a behavior over time.
Often line charts are "misused". Look at the following graph and explain what is wrong with it:
```{r}
df %>%
group_by(country) %>%
summarise(p1sales=sum(p1sales)) %>%
ggplot(aes(x=country, y=p1sales, group=1)) + geom_line() +
theme_bw()
```
###Scatter plot. Correlation
Two continuous variables can be visualized using scatterplot. It shows the relationship between these variables. We will later observe, how such relationship helps in models. Let's first observe what are the basic patterns we want (or don't want) to see in the data:

The relationship between teo variables can be expressed with the correlation coefficient. However, it only shows whether there is a linear relationship.
Make a scatterplot using product 1 and product 2 sales data. What patterns do you observe?
From graph alone it might be hard to define relationship between variable, so we can use cor() functon to have more determination:
```{r}
cor(df$p1sales, df$p2sales)
```
The sign in correlation means in what way first feature affects second (+ for increases; - for decreses).
##Melting/Casting
There is a particular concept of long and wide data formats, as we discussed
during the lecture. For example, when we made plots of two density plots for
product 1 and product 2, it was not the optimal way to do it. We can tidy our
data.
To do so, we will need tidyr package:
```{r eval=FALSE}
install.packages("tidyr")
```
```{r}
library("tidyr")
```
```{r}
gathered_df <- df %>%
gather(p1sales, p2sales, key = "product", value = "sales")
head(gathered_df)
```
Now we can build the plot with densities in more natural way:
```{r}
ggplot(data = gathered_df) +
geom_density(mapping = aes(x = sales, fill=product),
alpha=0.3) +
scale_fill_manual(values = c('#f1e8ca', '#745151'))
```
##Heatmap
A heat map (or heatmap) is a graphical representation of data where the
individual values contained in a matrix are represented as colors. It is really
useful to display a general view of numerical data, not to extract specific data
point.
In our example, we will add one more column that will display the quality of the shop:
```{r}
set.seed(13)
df <- cbind(df, quality = sample(as.factor(c("low", "medium", "high")), nrow(df), replace = TRUE))
```
Now we can build the heatmap:
```{r}
df %>%
select(quality,country) %>%
group_by(country, quality) %>%
summarize(rowCount= n()) %>%
ggplot() + geom_tile(aes(x = quality, y = country, fill = rowCount))
```
```{r include=FALSE}
df %>%
select(quality,country) %>%
group_by(country, quality) %>%
summarize(rowCount= n()) %>%
ggplot() + geom_tile(aes(x = quality, y = country, fill = as.factor(rowCount)))
```