--- title: "Business Data Analytics" subtitle: "Lab 2. Descriptive analysis and visualization" author: "University of Tartu" output: prettydoc::html_pretty: null highlight: github html_document: default html_notebook: default github_document: default theme: cayman --- ```{r global_options, include=FALSE} knitr::opts_chunk$set(warning=FALSE, message=FALSE) ``` Descriptive analysis (sometimes referred to as Exploratory analysis) is a crucial part in any kind of analysis. The goal is to "get to know your dataset" - to find outliers, irregularities and identify what transformations to do for the further stages (e.g. for modelling). Exploratory phase is largely an iterative process, where you develop understanding of the data by * collecting questions about the data * answering them by transforming the data, summarizing and plotting it, * which leads to new questions :) ##Prerequisites (recap) The power of R comes from the vast universe of packages. Every time you need a new package you install it: ```{r eval=FALSE} install.packages("ggplot2") install.packages("dplyr") ``` In your script you will load packages by writing: ```{r} library("ggplot2") library("dplyr") ``` ##Loading datasets Now we load a dataset into R. Today we are analyzing data (artificial) of total sales on a weekly basis for two particular products in different stores. There are many ways things can be done in R. For example, you can read in dataset using basic functionality: There are a planty of ways to load data from .csv files. The one that we will use: ```{r} df <- read.csv(file.choose()) # lab2_Data.csv ``` Let's print some rows and look at the data that we loaded: ```{r} head(df,3) ``` Let us understand the dataset. ```{r eval=FALSE} View(df) ``` We can use another functions to study our data: ```{r} str(df) ``` ```{r} dim(df) ```
Now find all the countries where stores are located.
## Removing NA values Let's check for a NA: ```{r} rows <- which(is.na(df), arr.ind=TRUE)[,1] df[rows,] ``` We can use next function to remove rows with NA: ```{r} df <- na.omit(df) rows <- which(is.na(df), arr.ind=TRUE)[,1] df[rows,] ``` ##Descriptive statistics Summary statistics is displayed by: ```{r} summary(df) ``` We see that country is a factor and represents country where store is located. There are 7 different countries and we can see amount of observations with each country after ":". We can study columns using next functions: ```{r} unique(df$country)# displays all unique values ``` ```{r} length(unique(df$storeNum)) # number of unique countries in our dataset ``` ##Discrete variables Another useful inspection for discrete features (variables) is using frequency counts. This can be achieved by table function: ```{r} table(df$p1prom) ``` ```{r} table(promotion_p1=df$p1prom, promotion_p2=df$p2prom) # two-way frequency table ``` The next question we might ask is how much varies price of the first and second product: ```{r} summary(df$p1price) ``` ```{r} summary(df$p2price) ``` We also would like to know what are the prices when the product's price is reduced and is not. Price is a continuous variable. However, we already checked that it does not vary much. We can display frequencies using table function: ```{r} price_table <- table(promotion=df$p1prom, price=df$p1price) # how many times each product was promoted at each price level price_table ``` But let's see the percentages and margins instead of the counts. This way, we can compare it to the second product. ```{r} # promotions are in rows, prices are in columns # summing over rows margin.table(price_table, 1) ``` ```{r} # summing over columns margin.table(price_table, 2) ``` ```{r} prop.table(price_table) # cell % ``` ```{r} prop.table(price_table, 1) # row percentages ``` ```{r} prop.table(price_table, 2) # column percentages ```
Find now the percentages of discounted products for different price levels for product 2.
##Continuous variables For continuous variables, where data varies a lot, table is not feasible nor meaningful. Try for example running table(dt$p1sales). The following table summarizes the functions of descriptive statistics that we discussed during the lecture. * min(x) - Minimum value * max(x) - Maximum value * mean(x) - Mean * median(x) - Median * var(x) - Variance * sd(x) - Standard deviation * quantile(x, probs=c(0.25, 0.5, 0.75)) - Percentiles/Quartiles * cor(x, y) Correlation function
Observe statistics for sales of product1. Is there anything interesting?
##Data Transformations Often exploring data involves a narrower approach. For example, instead of average sales of the first product we want to calculate average sales for each country, or for each shop. Or we want to filter the data and take a look at average sales when the product was promoted. If you have experience with other programming languages, your first approach for such analysis would be to run for loops. However, the power of R comes from the vectorized approach. Vectorization is the process of rewriting a loop so that instead of processing a single element of an array N times, it processes (for example) 4 elements of the array simultaneously N/4 times. A bit too technical, but let's see what it means on practice. Let's first try to calculate average sales per country using classical approach: ```{r} countries = unique(df$country) # choose unique countries in our dataset for(cn in countries){ # create a loop to pick a country from our country vector country_df <- filter(df, country==cn) # filter the data, so that only data for this country is chosen country_mean <- mean(country_df$p1sales) # calculate the mean for this country print(c(cn, country_mean)) # print the result one-by-one } ``` Now the vectorized approach would be: ```{r} df %>% group_by(country) %>% summarise(sales_mean = mean(p1sales)) ``` Let's measure the time taken for each of two approaches by writing Sys.time around our functions (the first is 'for loop' and the second is vectorized approach): ```{r} start <- Sys.time() countries = unique(df$country) # choose unique countries in our dataset for(cn in countries){ # create a loop to pick a country from our country vector country_df <- filter(df, country==cn) # filter the data, so that only data for this country is chosen country_mean <- mean(country_df$p1sales) # calculate the mean for this country print(c(cn, country_mean)) # print the result one-by-one } Sys.time() - start ``` ```{r} start <- Sys.time() df %>% group_by(country) %>% summarise(sales_mean = mean(p1sales)) Sys.time() - start ``` And this is on a very tiny dataset! On bigger sizes the difference becomes much more crucial. ##Visualization Let's try to practice how to do charts for our later analysis, as we will do a lot of graphs along the course. It is important to remember what is the purpose of your plots. There are plots with the main goal to communicate the message for the decision-makers or consumers. These plots have to clearly convey the message that was found and investigated earlier. For the exploratory analysis the graphs facilitate the discovery. We have to learn how to detect interesting and unexpected patterns and what questions to ask next. ###Bar chart. Multi-set bar chart The following chart displays how many transactions per country were made. ```{r} ggplot(data = df) + geom_bar(mapping = aes(x = country)) ``` Compare it to the results of the table df %>% count(country): ```{r} df %>% count(country) ``` The ggplot2 plots are very flexible. Let's try to beautify it a bit: ```{r} ggplot(data = df) + geom_bar(mapping = aes(x = country), fill='#f1e8ca', color='#745151', alpha=0.8) + theme_bw(base_size=28) ``` Multi-chart bar: ```{r} df %>% mutate(p1prom = as.factor(p1prom)) %>% ggplot() + geom_bar(mapping = aes(x = country, fill=p1prom), position='dodge') ```
How do you interpret this plot? Try to customize it by changing colors (check help for scale_fill_manual())
###Histogram. Density Histogram shows the distribution of continuous variable. In a histogram the x-axis is partitioned into equal bins. The height of a bar shows how many observations fall into each bin. The results of the histogram heavily depend on the chosen bin size. ```{r} ggplot(data = df) + geom_histogram(mapping = aes(x = p1sales)) ggplot(data = df) + geom_histogram(mapping = aes(x = p1sales), binwidth=50) ggplot(data = df) + geom_histogram(mapping = aes(x = p1sales), binwidth=1) ``` Another plot for the distribution is density plot. It is a smoothed histogram that provide a general estimate of the distribution. The area under the distribution is always sums to 1. It helps to estimate general tendency, however it comes at a cost, as y-axis is difficult to interpret. ```{r} ggplot(data = df) + geom_density(mapping = aes(x = p1sales), fill='#f1e8ca') ``` It is more helpful to compare several distributions via density plots: ```{r} ggplot(data = df) + geom_density(mapping = aes(x = p1sales), fill='#f1e8ca', alpha=0.3) + geom_density(aes(x=p2sales), fill = '#745151', alpha=0.3) + xlab('sales') ``` ###Boxplot While histogram is a good way to show the distribution of one continuous variable, if we want to compare several distributions, it may become tricky. Densities are good, if there are not too many either. We could overlay several histograms, but if the scale between two histograms vary a lot, it is difficult to analyze. A better approach would be to use boxplots: ```{r} ggplot(data = df, mapping = aes(x = as.factor(storeNum), y = p1sales)) + geom_boxplot() + theme_bw() + xlab("storeNum") ``` Boxplots help to quickly grasp the information about several distributions and compare it.
Try to change geom_boxplot to geom_violin. Use help to check (type ?geom_violin).
###Line chart Line charts are often the easiest to understand :). Usually, the most common way is to use it for time-series, e.g. we want to observe a behavior over time.
Often line charts are "misused". Look at the following graph and explain what is wrong with it:
```{r} df %>% group_by(country) %>% summarise(p1sales=sum(p1sales)) %>% ggplot(aes(x=country, y=p1sales, group=1)) + geom_line() + theme_bw() ``` ###Scatter plot. Correlation Two continuous variables can be visualized using scatterplot. It shows the relationship between these variables. We will later observe, how such relationship helps in models. Let's first observe what are the basic patterns we want (or don't want) to see in the data: ![](D:\cor.png) The relationship between teo variables can be expressed with the correlation coefficient. However, it only shows whether there is a linear relationship.
Make a scatterplot using product 1 and product 2 sales data. What patterns do you observe?
From graph alone it might be hard to define relationship between variable, so we can use cor() functon to have more determination: ```{r} cor(df$p1sales, df$p2sales) ``` The sign in correlation means in what way first feature affects second (+ for increases; - for decreses). ##Melting/Casting There is a particular concept of long and wide data formats, as we discussed during the lecture. For example, when we made plots of two density plots for product 1 and product 2, it was not the optimal way to do it. We can tidy our data. To do so, we will need tidyr package: ```{r eval=FALSE} install.packages("tidyr") ``` ```{r} library("tidyr") ``` ```{r} gathered_df <- df %>% gather(p1sales, p2sales, key = "product", value = "sales") head(gathered_df) ``` Now we can build the plot with densities in more natural way: ```{r} ggplot(data = gathered_df) + geom_density(mapping = aes(x = sales, fill=product), alpha=0.3) + scale_fill_manual(values = c('#f1e8ca', '#745151')) ``` ##Heatmap A heat map (or heatmap) is a graphical representation of data where the individual values contained in a matrix are represented as colors. It is really useful to display a general view of numerical data, not to extract specific data point. In our example, we will add one more column that will display the quality of the shop: ```{r} set.seed(13) df <- cbind(df, quality = sample(as.factor(c("low", "medium", "high")), nrow(df), replace = TRUE)) ``` Now we can build the heatmap: ```{r} df %>% select(quality,country) %>% group_by(country, quality) %>% summarize(rowCount= n()) %>% ggplot() + geom_tile(aes(x = quality, y = country, fill = rowCount)) ``` ```{r include=FALSE} df %>% select(quality,country) %>% group_by(country, quality) %>% summarize(rowCount= n()) %>% ggplot() + geom_tile(aes(x = quality, y = country, fill = as.factor(rowCount))) ```