--- title: "Introduction to Business Data Analytics" subtitle: "Fraud Detection" author: "University of Tartu" output: prettydoc::html_pretty: null highlight: github html_document: default html_notebook: default github_document: default theme: cayman --- ```{r global_options, include=FALSE} knitr::opts_chunk$set(warning=FALSE, message=FALSE) ``` ```{r setup, echo=FALSE} library(knitr) ``` ## Libraries Today we will work with the following libraries: ```{r warning=FALSE} library(dplyr) library(ggplot2) library(ROSE) # Balancing library(pROC) # AUC library(reshape) #melt library(stringr) # work with strings library(rpart) # decision tree library(xgboost) # for xgboost library(caret) # for confusion matrix ``` Additional packages that may need to run the code: ```{r, eval=F} # packages required to create a map install.packages("maps") install.packages("mapproj") # package required printing important features for XGBoost install.packages("Ckmeans.1d.dp") # for confusion matrix install.packages("e1071") ``` ## Data preparation ```{r} df <- read.csv(file.choose()) # fraud.csv ``` Let's explore the data: ```{r} str(df) ``` We have next features: * step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. * type - type of transaction that was executed. * amount - amount of the transaction in local currency * nameOrig - customer who started the transaction. * oldbalanceOrg - initial balance before the transaction. * newbalanceOrig - customer's balance after the transaction. * nameDest - recipient ID of the transaction. * oldbalanceDest - initial recipient balance before the transaction. * newbalanceDest - recipient's balance after the transaction. * isFraud - identifies a fraudulent transaction (1) and non fraudulent (0). ## Data exploration Let's explore our data by looking at how much fraud attempts there were for each type of transaction. ```{r} df %>% group_by(type) %>% summarise(isFraud = sum(isFraud)) ``` Based on result we can conclude that there aren't many atempts in fraud. Now let's try to find amount of money that could be lost due to the fraud. First, let's calculate total amount of cash: ```{r} df %>% group_by(type) %>% summarise(amount = sum(amount)) ``` Now let's calculate the percentage of money that could be lost: ```{r} df %>% group_by(type) %>% mutate(totAmount = sum(amount)) %>% filter(isFraud > 0) %>% summarise(amount = sum(amount), percent = amount / last(totAmount)) ``` #Plotting coordinates on geomap Now, let's try to plot locations on the map, where transactions were executed. Let's look at coordinates: ```{r} head(df$coord) ``` There are some empty values in coord column: ```{r} data_with_coord <- df[df$coord != "",] ``` Next, to plot coordinates on the map, we have to separate them. 
# Plotting coordinates on a geomap

Now, let's try to plot the locations where the transactions were executed on a map. Let's look at the coordinates:

```{r}
head(df$coord)
```

There are some empty values in the coord column, so we keep only the rows that have coordinates:

```{r}
data_with_coord <- df[df$coord != "",]
```

Next, to plot the coordinates on the map, we have to split them into latitude and longitude.

```{r}
tmp <- data.frame(str_split_fixed(data_with_coord$coord, ", ", 2))
tmp$X1 <- gsub('^.', '', tmp$X1)           # remove leading [
tmp$X2 <- gsub('.$', '', tmp$X2)           # remove trailing ]
colnames(tmp) <- c("lat", "lon")           # set column names
data_with_coord$lat <- as.numeric(tmp$lat) # add latitude column to data_with_coord
data_with_coord$lon <- as.numeric(tmp$lon) # add longitude column to data_with_coord
```

Now let's plot the map:

```{r}
# World polygons
WorldData <- map_data('world')
WorldData <- fortify(WorldData)

# Base world map
geomap <- ggplot() +
  geom_map(data = WorldData, map = WorldData,
           aes(x = long, y = lat, group = group, map_id = region),
           fill = "white", colour = "#7f7f7f", size = 0.5)

# Map borders
geomap <- geomap + coord_map("rectangular", lat0 = 0, xlim = c(-180, 180), ylim = c(-60, 90))

# Transaction coordinates
geomap <- geomap + geom_point(aes(x = lon, y = lat, color = type),
                              alpha = 1, data = data_with_coord, size = 2)

# One color per transaction type
geomap <- geomap + scale_colour_manual(values = c("orange", "purple", "red", "darkgreen", "blue"))

geomap

# The coordinates are not needed for prediction, so we drop the column
df$coord <- NULL
```

# Prediction

## Data preparation

Let's look at our data once more:

```{r}
str(df)
```

## Balancing data

The first issue we have to address is class imbalance: currently only a small fraction of the observations are fraudulent. Let's combine over- and undersampling (method "both" in ROSE) and check the result.

```{r}
df_b <- ovun.sample(isFraud ~ ., data = df, method = "both", p = 0.5, seed = 1000)$data

ggplot(df_b, aes(x = isFraud, fill = as.factor(isFraud))) +
  geom_bar()
```

### Data preparation

To use XGBoost and logistic regression we need to convert the features to numeric values and scale them.

```{r}
df_b$type <- as.numeric(df_b$type)
df_b$nameOrig <- as.numeric(df_b$nameOrig)
df_b$nameDest <- as.numeric(df_b$nameDest)
```

Next, we have to rescale the features:

```{r}
scaled_df <- df_b %>%
  mutate(step = scale(step),
         type = scale(type),
         amount = scale(amount),
         nameOrig = scale(nameOrig),
         oldbalanceOrg = scale(oldbalanceOrg),
         newbalanceOrig = scale(newbalanceOrig),
         nameDest = scale(nameDest),
         oldbalanceDest = scale(oldbalanceDest),
         newbalanceDest = scale(newbalanceDest))
```

## Logistic regression

Let's try logistic regression on our data.

Splitting the data:

```{r}
samples <- sample.int(n = nrow(scaled_df), size = floor(0.8 * nrow(scaled_df)), replace = F)
train <- scaled_df[samples, ]
test <- scaled_df[-samples, ]
```

Now, let's check how logistic regression performs on our data. First, we train the model:

```{r}
glm <- glm(isFraud ~ ., train, family = binomial)
```

And finally evaluate the result:

```{r}
pred_glm_scores <- predict(glm, test, type = "response") # predicted probabilities
threshold <- 0.5                                         # threshold for the obtained probabilities
glm_predict_boolean <- ifelse(pred_glm_scores >= threshold, 1, 0)

glm_cm <- confusionMatrix(as.factor(glm_predict_boolean), as.factor(test$isFraud))
glm_cm
```

```{r}
print(paste("Recall of Logistic Regression is:", glm_cm$byClass["Recall"]))
print(paste("Precision of Logistic Regression is:", glm_cm$byClass["Precision"]))
print(paste("F1 of Logistic Regression is:", glm_cm$byClass["F1"]))
```

We can also create the ROC curve and find the AUC:

```{r}
roc_glm <- roc(test$isFraud, pred_glm_scores)
plot.roc(roc_glm, col = "orange")
text(0.36, 0.53, labels = sprintf("AUC(GLM): %0.3f", auc(roc_glm)), col = "orange")
```
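As a side note, pROC can also suggest a classification threshold directly from the ROC curve. The sketch below uses `coords()` with the "best" criterion (Youden's J); we keep the fixed 0.5 threshold for the rest of the lab:

```{r, eval=FALSE}
# Threshold that maximises sensitivity + specificity on the test set
coords(roc_glm, "best", ret = c("threshold", "sensitivity", "specificity"))
```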
## Decision Tree

Let's try a decision tree on our data and see the result.

```{r}
tree <- rpart(isFraud ~ ., train)
```

And finally evaluate the result:

```{r}
pred_tree_scores <- predict(tree, test)
threshold <- 0.5 # threshold for the obtained scores
tree_predict_boolean <- ifelse(pred_tree_scores >= threshold, 1, 0)

tree_cm <- confusionMatrix(as.factor(tree_predict_boolean), as.factor(test$isFraud))
tree_cm
```

```{r}
print(paste("Recall of Decision Tree is:", tree_cm$byClass["Recall"]))
print(paste("Precision of Decision Tree is:", tree_cm$byClass["Precision"]))
print(paste("F1 of Decision Tree is:", tree_cm$byClass["F1"]))
```

We can also create the ROC curve and find the AUC:

```{r}
roc_dt <- roc(test$isFraud, pred_tree_scores)
plot.roc(roc_dt, col = "green")
text(0.36, 0.53, labels = sprintf("AUC(DT): %0.3f", auc(roc_dt)), col = "green")
```

## XGBoost

Let's split the data:

```{r}
samples <- sample.int(n = nrow(scaled_df), size = floor(0.8 * nrow(scaled_df)), replace = F)
train <- scaled_df[samples, ]
test <- scaled_df[-samples, ]
```

### Preparing the matrices

```{r}
matrix_train <- xgb.DMatrix(as.matrix(train %>% select(-isFraud)), label = train$isFraud)
matrix_test <- xgb.DMatrix(as.matrix(test %>% select(-isFraud)), label = test$isFraud)
```

### Run the model with XGBoost

```{r}
watchlist <- list(train = matrix_train, cv = matrix_test)

params <- list(
  "objective" = "binary:logistic", # learning task and objective; predictions are probabilities
  "eval_metric" = "auc"
)

model_xgb <- xgb.train(params = params,
                       data = matrix_train,
                       maximize = TRUE,            # the larger the evaluation score the better
                       nrounds = 50,               # max number of boosting iterations
                       nthread = 3,                # number of threads
                       early_stopping_rounds = 10, # stop if the performance doesn't improve for 10 rounds
                       watchlist = watchlist,      # named list of xgb.DMatrix datasets used to evaluate performance
                       print_every_n = 5)          # print metrics for every 5th iteration
```

### Variable importance

```{r}
importance <- xgb.importance(colnames(matrix_train), model = model_xgb)
xgb.ggplot.importance(importance)
```

### Plotting AUC

Let's see how the AUC changes over the training iterations:

```{r}
melted <- melt(model_xgb$evaluation_log, id.vars = "iter")
ggplot(data = melted, aes(x = iter, y = value, group = variable, color = variable)) +
  geom_line()
```

### Prediction

Now let's predict fraud on the test data:

```{r}
xgb_predict_scores <- predict(model_xgb, matrix_test)
threshold <- 0.5 # threshold for the predicted probabilities
xgb_predict_boolean <- ifelse(xgb_predict_scores >= threshold, 1, 0)
```

And evaluate the result by creating a confusion matrix:

```{r}
xgb_cm <- confusionMatrix(as.factor(xgb_predict_boolean), as.factor(test$isFraud))
xgb_cm
```

We can access different metrics from our confusion matrix:

```{r}
print(paste("Recall of XGBoost is:", xgb_cm$byClass["Recall"]))
print(paste("Precision of XGBoost is:", xgb_cm$byClass["Precision"]))
print(paste("F1 of XGBoost is:", xgb_cm$byClass["F1"]))
```

We can also create the ROC curve and find the AUC:

```{r}
roc_xgb <- roc(test$isFraud, xgb_predict_scores)
plot.roc(roc_xgb)
text(0.36, 0.53, labels = sprintf("AUC(XGB): %0.3f", auc(roc_xgb)), col = "black")
```

## Results

Let's compare the ROC curves and AUCs of all three models and select the best one.

```{r}
plot.roc(roc_glm, col = "green")
text(0.36, 0.53, labels = sprintf("AUC(GLM): %0.3f", auc(roc_glm)), col = "green")

lines(roc_dt, col = "orange")
text(0.36, 0.33, labels = sprintf("AUC(DT): %0.3f", auc(roc_dt)), col = "orange")

lines(roc_xgb, col = "black")
text(0.36, 0.43, labels = sprintf("AUC(XGB): %0.3f", auc(roc_xgb)), col = "black")
```
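To make the comparison easier to read, the AUC values can also be collected into a single table. This is a minimal sketch using the ROC objects created above:

```{r, eval=FALSE}
# Gather the AUC of each model into one comparison table
data.frame(
  model = c("Logistic Regression", "Decision Tree", "XGBoost"),
  AUC = c(as.numeric(auc(roc_glm)), as.numeric(auc(roc_dt)), as.numeric(auc(roc_xgb)))
)
```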