---
title: "Business Data Analytics. Practice Session"
subtitle: A/B testing
author: "University of Tartu"
output:
  prettydoc::html_pretty:
    highlight: github
    theme: cayman
  html_document: default
  html_notebook: default
  github_document: default
editor_options:
  chunk_output_type: console
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
```
```{r setup, echo=FALSE}
library(knitr)
```
## Introduction
During today's practice we will investigate how to perform A/B testing. A/B testing is used in numerous ways: to test different versions of web pages, UX, surveys and questionnaires, changes in policies, different marketing campaigns, emails and so on.
Broadly speaking, A/B tests are mostly run on two types of data:

- continuous or discrete numbers, for example the average number of clicks or the time spent on a page;
- proportions or percentages, for example conversion rates.

For the first data type, the t-test is most frequently used. For the second, Pearson's chi-squared test is the obvious choice. Let's take a look at both cases.
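To get a first feel for the two cases, here is a minimal sketch on made-up, simulated numbers (not the course dataset): a t-test on a continuous metric and a test of proportions on conversion counts.

```{r}
set.seed(1)
# Continuous metric, e.g. time on page (seconds) for versions A and B
time_a <- rnorm(100, mean = 60, sd = 10)
time_b <- rnorm(100, mean = 70, sd = 10)
t.test(time_a, time_b) # compares the two means

# Proportions, e.g. conversions out of 1000 visitors per version
prop.test(x = c(100, 150), n = c(1000, 1000)) # compares the two rates
```

Both calls return a p-value that answers the same kind of question: how surprising would this difference be if the two versions were actually identical?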
## Libraries
```{r message=FALSE}
library(data.table)
library(dplyr)
library(ggplot2)
library(nortest) # install.packages("nortest")
library(pwr) # install.packages("pwr")
```
## Loading the data
Let us take a look at the following data:
```{r}
dt <- fread(file.choose()) # AB_clicks.csv
```
We should look at the data:
```{r}
View(dt)
```
The second column in our dataset contains the names of HTML tags.
HTML is the markup language used for building web pages.
![](HTML_source_code_example.svg.png)
Let's look at the unique values in the data.
Number of different HTML elements:
```{r}
length(unique(dt$Element_ID))
```
What different tags are there:
```{r}
unique(dt$Tag_name)
```
Values of the feature "Visible":
```{r}
unique(dt$Visible)
```
Values of the feature "Version":
```{r}
unique(dt$Version)
```
This is a cleaned version of the data from https://scholarworks.montana.edu/xmlui/handle/1/3507. Montana State University found that the "Interact" button on their library page was heavily underused. They investigated the problem with questionnaires and realized that the name itself might be one of the reasons, being too intimidating. They came up with several alternative versions:
![](1.jpg)
![](2.jpg)
![](3.jpg)
![](4.jpg)
![](5.jpg)
First, let's look at the total number of clicks on each version of the site:
```{r eval=F}
# Plot total amount of the clicks on the site
# X - versions of the site
# Y - amount
```
```{r include=F}
dt %>%
  group_by(Version) %>%
  summarise(total = sum(No_clicks)) %>%
  ggplot(aes(x = Version, y = total)) +
  geom_col(fill = "#7ba367") +
  theme_bw()
```
Based on this plot, we can say that in our data the version of the site with the "Interact" button received more clicks in total than the others.
Next, let's take the number of clicks on the button of interest in each version and plot it.
First, we have to find the names of the elements used for this button in the different versions:
```{r eval=F}
# Plot amount of the clicks on the button we are interested in (for all versions)
# X - version of the site
# Y - amount of the clicks
# button names: "INTERACT", "LEARN", "CONNECT", "HELP", "SERVICES"
```
```{r include=F}
btn_names <- c("INTERACT", "LEARN", "CONNECT", "HELP", "SERVICES")
btns <- dt %>%
  filter(Name %in% btn_names)
ggplot(btns, aes(x = Version, y = No_clicks)) +
  geom_col(fill = "#7ba367", color = "white") +
  theme_bw()
```
## t-testing
Let's try to perform a t-test the way many people do (A is the default "Interact" version and B is the "Connect" version):
```{r}
dt_cleaned <- filter(dt, Tag_name!='area')
dt_interact_connect <- filter(dt_cleaned, Version %in% c("Interact", "Connect"))
t.test(No_clicks ~ Version, data=dt_interact_connect)
```
What can we conclude from the results above? Did we perform the test correctly?
Let's check whether the number of clicks follows a Gaussian (normal) distribution.
```{r}
ggplot(dt_cleaned, aes(x=No_clicks, fill=Version)) + geom_density(alpha=0.3) + theme_bw()
```
```{r}
ggplot(dt_cleaned, aes(x=No_clicks, fill=Version)) + geom_density(alpha=0.3) + theme_bw() + scale_x_log10()
```
```{r}
#nortest
qqnorm(filter(dt_cleaned, Version=='Interact')$No_clicks, cex=0.5)
qqline(filter(dt_cleaned, Version=='Interact')$No_clicks, col = 2)
```
Statistical tests for normality can be easily formulated in the framework of hypothesis testing. The null hypothesis is that the data is normally distributed. Here is the catch: normality is assumed by default, the test does not try to prove it. We can only reject the null hypothesis that the data is normally distributed; if the result is not significant, the data is either normally distributed or we simply do not have enough data to detect a deviation.
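A quick illustration of this asymmetry on simulated (made-up) data: a small sample from a clearly non-normal distribution may well pass a normality test, while a large sample from the same distribution does not.

```{r}
set.seed(42)
small <- rlnorm(10)   # 10 draws from a skewed, log-normal distribution
large <- rlnorm(5000) # 5000 draws from the same distribution
shapiro.test(small)   # may fail to reject: too little data
shapiro.test(large)   # rejects: enough data to detect non-normality
```

Note that `shapiro.test()` accepts at most 5000 observations, which is one reason tests like `ad.test()` from the nortest package are useful for larger samples.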
```{r}
ad.test(filter(dt_cleaned, Version=='Interact')$No_clicks)
```
```{r}
shapiro.test(filter(dt_cleaned, Version=='Interact')$No_clicks)
```
What should we do when the data is not Gaussian (not normally distributed)? There are two large groups of tests: parametric tests, which make assumptions about the underlying distribution, and non-parametric tests, which let you drop those assumptions. The catch is that non-parametric tests often have lower power. Since the normality assumption is not fulfilled here, we can use the Wilcoxon rank-sum test instead of the t-test:
```{r}
wilcox.test(No_clicks ~ Version, data=dt_interact_connect)
```
Or we can log10-transform No_clicks:
```{r}
dt_interact_connect <- dt_interact_connect %>%
mutate(log_clicks = log10(No_clicks))
```
```{r}
wilcox.test(log_clicks ~ Version, data=dt_interact_connect) # same result: a monotonic transformation does not change the ranks
```
Let's try changing the hypothesis: let's check only the clicks on objects where Visible == TRUE. What are the results?
```{r}
dt_interact_connect_visible <- dt_interact_connect %>%
filter(Visible==TRUE)
wilcox.test(log_clicks ~ Version, data=dt_interact_connect_visible)
```
In this exercise we did not have the chance to plan the experiment and decide on sample sizes in advance. In real life, you first need to calculate how many samples you need for a given significance level, power and effect size:
```{r}
pwr.t.test(d = 0.2, power=0.8)
```
## Test of proportions
The hypothesis that, in general, one version of the page gets more clicks than another is quite optimistic. Let's narrow down our hypothesis. What if we want to check whether the proportion of clicks on this component ("Connect") out of all clicks on the page is significantly better (or worse) than the same proportion of clicks in the default version ("Interact")? Here we test proportions.
```{r}
total_clicks <- dt %>%
  group_by(Version) %>%
  summarise(total = sum(No_clicks)) # total number of clicks per version
dt_button <- dt %>%
  filter(Name %in% c("SERVICES", "HELP", "LEARN", "CONNECT", "INTERACT")) %>% # keep only the buttons we focus on
  left_join(total_clicks, by = "Version") %>%
  mutate(proportions = No_clicks / total)
dt_button
```
The assumption is that the two groups are mutually exclusive. In other words, a user can only see one of the versions. Null hypothesis is that the proportions are equal in both groups.
```{r}
prop.test(x=dt_button$No_clicks[c(1,2)], n=dt_button$total[c(1,2)])
```
And the power of the test given these sample sizes:
```{r}
power.prop.test(n=dt_button$total[c(1,2)], p1=0.01130856, p2=0.03339635)
```
```{r}
pwr.2p2n.test(h=0.1, power=0.8, n1=3714)
```
There are many tests, but all of them fit into the general framework!
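To close the loop with the introduction: Pearson's chi-squared test can be run directly on a contingency table of button clicks vs. other clicks. Here is a minimal sketch on made-up counts. For a 2x2 table, `chisq.test()` and `prop.test()` give the same p-value, since both use the chi-squared statistic with Yates' continuity correction:

```{r}
# Made-up counts: clicks on the button vs. all other clicks, per version
button_clicks <- c(100, 150) # versions A and B
other_clicks  <- c(900, 850)
tab <- rbind(button_clicks, other_clicks)
chisq.test(tab) # Pearson's chi-squared test on the 2x2 table

# Equivalent formulation as a test of two proportions
prop.test(x = button_clicks, n = button_clicks + other_clicks)
```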