Libraries

Today we will work with the following libraries:
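A plausible set, inferred from the analysis below (the exact list is an assumption):

```r
library(dplyr)    # data manipulation: group_by(), summarise(), mutate()
library(caret)    # data splitting and confusion matrices
library(rpart)    # decision trees
library(xgboost)  # gradient boosting
library(pROC)     # ROC curves and AUC
```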

Additional packages may be needed to run the code:
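For example (the package list here is a hypothetical superset):

```r
# One-time installation (hypothetical package list)
install.packages(c("dplyr", "tidyr", "stringr", "caret", "rpart",
                   "xgboost", "pROC", "ROSE", "leaflet"))
```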

Data preparation

Let’s explore the data:
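A minimal loading sketch (the file name and read options are assumptions):

```r
# Load the transactions and inspect the structure (file name is hypothetical)
df <- read.csv("transactions.csv", stringsAsFactors = TRUE)
str(df)
```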

## 'data.frame':    108717 obs. of  11 variables:
##  $ step          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ type          : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 4 4 5 2 4 4 4 4 4 3 ...
##  $ amount        : num  9840 1864 181 181 11668 ...
##  $ nameOrig      : Factor w/ 108717 levels "C1000028246",..: 12889 37314 16984 99757 58719 103085 30706 51110 14751 92379 ...
##  $ oldbalanceOrg : num  170136 21249 181 181 41554 ...
##  $ newbalanceOrig: num  160296 19385 0 0 29886 ...
##  $ nameDest      : Factor w/ 55434 levels "C1000156006",..: 34340 35755 9983 8874 17883 45921 42245 47239 16753 6419 ...
##  $ oldbalanceDest: num  0 0 0 21182 0 ...
##  $ newbalanceDest: num  0 0 0 0 0 ...
##  $ isFraud       : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ coord         : Factor w/ 833 levels "","[-33.87144962, 151.20821275]",..: 97 663 765 34 87 709 129 565 272 458 ...

We have the following features:

- step: time step at which the transaction occurred
- type: transaction type (CASH_IN, CASH_OUT, DEBIT, PAYMENT, TRANSFER)
- amount: transaction amount
- nameOrig: ID of the originating account
- oldbalanceOrg, newbalanceOrig: originator's balance before and after the transaction
- nameDest: ID of the destination account
- oldbalanceDest, newbalanceDest: recipient's balance before and after the transaction
- isFraud: 1 if the transaction is fraudulent, 0 otherwise
- coord: geographic coordinates of the transaction

Data exploration

Let’s explore our data by looking at how many fraud attempts there were for each type of transaction.
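A dplyr summary of this kind produces the table below (the data frame name df is an assumption):

```r
# Number of fraud attempts per transaction type
df %>%
  group_by(type) %>%
  summarise(isFraud = sum(isFraud))
```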

## # A tibble: 5 x 2
##   type     isFraud
##   <fct>      <int>
## 1 CASH_IN        0
## 2 CASH_OUT      61
## 3 DEBIT          0
## 4 PAYMENT        0
## 5 TRANSFER      59

Based on these results, we can conclude that there are relatively few fraud attempts, and that they occur only in CASH_OUT and TRANSFER transactions.

Now let’s estimate how much money could be lost due to fraud.

First, let’s calculate the total transaction amount for each type:
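For example:

```r
# Total transaction amount per type
df %>%
  group_by(type) %>%
  summarise(amount = sum(amount))
```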

## # A tibble: 5 x 2
##   type          amount
##   <fct>          <dbl>
## 1 CASH_IN  3846102487.
## 2 CASH_OUT 6615077732.
## 3 DEBIT       4375667.
## 4 PAYMENT   466499907.
## 5 TRANSFER 8170484901.

Now let’s calculate the percentage of money that could be lost:
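One way to get the table below is to sum the fraudulent amounts per type and divide by the totals computed above:

```r
# Fraudulent amount per type as a share of that type's total volume
totals <- df %>%
  group_by(type) %>%
  summarise(total = sum(amount))

df %>%
  filter(isFraud == 1) %>%
  group_by(type) %>%
  summarise(amount = sum(amount)) %>%
  left_join(totals, by = "type") %>%
  mutate(percent = amount / total) %>%
  select(type, amount, percent)
```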

## # A tibble: 2 x 3
##   type        amount percent
##   <fct>        <dbl>   <dbl>
## 1 CASH_OUT 33551538. 0.00507
## 2 TRANSFER 34615462. 0.00424

Plotting coordinates on a geomap

Now let’s plot the locations where the transactions were executed on a map.

Let’s look at the coordinates:
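For example:

```r
# Peek at the first few coordinate values
head(df$coord)
```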

## [1] [29.98384194, -95.33664018] [40.80718573, -73.95477259]
## [3] [42.3658858, -71.01423374]  [24.55401318, -81.80300774]
## [5] [29.7420124, -95.5606921]   [41.86591215, -87.6231126] 
## 833 Levels:  [-33.87144962, 151.20821275] ... [59.38247253, 18.00789007]

There are some empty values in the coord column:
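We can count them directly:

```r
# Number of rows with an empty coordinate string
sum(df$coord == "")
```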

Next, to plot the coordinates on the map, we have to separate them into latitude and longitude.
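A sketch with tidyr and stringr (the package choice is an assumption):

```r
library(tidyr)    # separate()
library(stringr)  # str_remove_all()

# Drop empty coordinates, strip the brackets, and split into lat/lon
coords <- df %>%
  filter(coord != "") %>%
  mutate(coord = str_remove_all(coord, "\\[|\\]")) %>%
  separate(coord, into = c("lat", "lon"), sep = ",\\s*", convert = TRUE)
```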

Now let’s plot the map:
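One option is leaflet (an assumption; any mapping package would do):

```r
library(leaflet)

# Plot transaction locations, highlighting fraud attempts in red
leaflet(coords) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lon, lat = ~lat, radius = 3,
                   color = ~ifelse(isFraud == 1, "red", "blue"))
```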

Prediction

Data preparation

Let’s look at our data once more:
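If the coord column was dropped before modeling, as the structure below suggests, the step might look like:

```r
# Drop the coordinate column, which is not used for prediction
df <- df %>% select(-coord)
str(df)
```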

## 'data.frame':    108717 obs. of  10 variables:
##  $ step          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ type          : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 4 4 5 2 4 4 4 4 4 3 ...
##  $ amount        : num  9840 1864 181 181 11668 ...
##  $ nameOrig      : Factor w/ 108717 levels "C1000028246",..: 12889 37314 16984 99757 58719 103085 30706 51110 14751 92379 ...
##  $ oldbalanceOrg : num  170136 21249 181 181 41554 ...
##  $ newbalanceOrig: num  160296 19385 0 0 29886 ...
##  $ nameDest      : Factor w/ 55434 levels "C1000156006",..: 34340 35755 9983 8874 17883 45921 42245 47239 16753 6419 ...
##  $ oldbalanceDest: num  0 0 0 21182 0 ...
##  $ newbalanceDest: num  0 0 0 0 0 ...
##  $ isFraud       : int  0 0 1 1 0 0 0 0 0 0 ...

Logistic regression

Let’s try logistic regression on our data:

Splitting the data:
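The roughly 50/50 class prevalence in the confusion matrices below suggests the classes were balanced first; a sketch assuming ROSE (which generates a balanced synthetic sample of the same size as the input) and an 80/20 split:

```r
library(ROSE)   # class balancing (an assumption)
set.seed(123)   # hypothetical seed

# Balance the classes, then hold out 20% for testing
balanced <- ROSE(isFraud ~ ., data = df, seed = 123)$data
idx <- createDataPartition(balanced$isFraud, p = 0.8, list = FALSE)
train <- balanced[idx, ]
test  <- balanced[-idx, ]
```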

Now, let’s check how logistic regression performs on our data.

First, we train the model:
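A minimal sketch; the exact feature set is an assumption:

```r
# Binary logistic regression on the transaction and balance features
model_glm <- glm(isFraud ~ type + amount + oldbalanceOrg + newbalanceOrig +
                   oldbalanceDest + newbalanceDest,
                 data = train, family = binomial)
```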

And finally evaluate the result:
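Thresholding the predicted probabilities at 0.5 and comparing them against the true labels:

```r
# Classify at a 0.5 threshold and build the confusion matrix
prob <- predict(model_glm, test, type = "response")
pred <- factor(ifelse(prob > 0.5, 1, 0))
confusionMatrix(pred, factor(test$isFraud))
```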

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 10427  1391
##          1   599  9327
##                                           
##                Accuracy : 0.9085          
##                  95% CI : (0.9046, 0.9123)
##     No Information Rate : 0.5071          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8167          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9457          
##             Specificity : 0.8702          
##          Pos Pred Value : 0.8823          
##          Neg Pred Value : 0.9397          
##              Prevalence : 0.5071          
##          Detection Rate : 0.4795          
##    Detection Prevalence : 0.5435          
##       Balanced Accuracy : 0.9079          
##                                           
##        'Positive' Class : 0               
## 
## [1] "Recall of Logistic Regression is: 0.945673861781244"
## [1] "Precision of Logistic Regression is: 0.882298189202911"
## [1] "F1 of Logistic Regression is: 0.9128874102609"

We can also plot the ROC curve and compute the AUC.
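With pROC, for example:

```r
# ROC curve and AUC for the logistic model
roc_glm <- roc(test$isFraud, prob)
plot(roc_glm)
auc(roc_glm)
```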

Decision Tree

Let’s try a decision tree on our data and see the result.
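A CART sketch with rpart (the default hyperparameters and the exclusion of the account IDs are assumptions):

```r
# Decision tree on all features except the high-cardinality account IDs
model_tree <- rpart(isFraud ~ . - nameOrig - nameDest,
                    data = train, method = "class")
```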

And finally evaluate the result:
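As with the logistic model:

```r
# Predict classes on the test set and build the confusion matrix
pred_tree <- predict(model_tree, test, type = "class")
confusionMatrix(pred_tree, factor(test$isFraud))
```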

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 10579     0
##          1   447 10718
##                                           
##                Accuracy : 0.9794          
##                  95% CI : (0.9775, 0.9813)
##     No Information Rate : 0.5071          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9589          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9595          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9600          
##              Prevalence : 0.5071          
##          Detection Rate : 0.4865          
##    Detection Prevalence : 0.4865          
##       Balanced Accuracy : 0.9797          
##                                           
##        'Positive' Class : 0               
## 
## [1] "Recall of Decision Tree is: 0.959459459459459"
## [1] "Precision of Decision Tree is: 1"
## [1] "F1 of Decision Tree is: 0.979310344827586"

We can also plot the ROC curve and compute the AUC, following the same pROC approach as above.

XGBoost

Let’s split the data:
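XGBoost needs a numeric matrix, so the factor columns are one-hot encoded first (dropping the ID columns is an assumption):

```r
# One-hot encode the features and build xgboost's DMatrix objects
feats <- balanced %>% select(-nameOrig, -nameDest)
X <- model.matrix(isFraud ~ . - 1, data = feats)
y <- as.numeric(as.character(feats$isFraud))

set.seed(123)  # hypothetical seed
idx <- sample(nrow(X), floor(0.8 * nrow(X)))
dtrain <- xgb.DMatrix(X[idx, ], label = y[idx])
dcv    <- xgb.DMatrix(X[-idx, ], label = y[-idx])
```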

Now let’s run the model with XGBoost:
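A sketch of the training call; the watchlist names match the train-auc / cv-auc columns in the log, and early stopping watches the second entry:

```r
# Train with AUC evaluated on both sets; stop if cv AUC stalls for 10 rounds
watchlist <- list(train = dtrain, cv = dcv)
model_xgb <- xgb.train(
  params = list(objective = "binary:logistic", eval_metric = "auc"),
  data = dtrain,
  nrounds = 100,              # upper bound; early stopping cuts training short
  watchlist = watchlist,
  early_stopping_rounds = 10,
  print_every_n = 5
)
```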

## [1]  train-auc:0.994021  cv-auc:0.993619 
## Multiple eval metrics are present. Will use cv_auc for early stopping.
## Will train until cv_auc hasn't improved in 10 rounds.
## 
## [6]  train-auc:0.998888  cv-auc:0.998773 
## [11] train-auc:0.999747  cv-auc:0.999532 
## [16] train-auc:0.999971  cv-auc:0.999984 
## [21] train-auc:0.999992  cv-auc:0.999999 
## [26] train-auc:0.999996  cv-auc:0.999999 
## [31] train-auc:0.999998  cv-auc:1.000000 
## Stopping. Best iteration:
## [22] train-auc:0.999992  cv-auc:1.000000

Prediction

Now let’s predict fraud using test data:
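For example:

```r
# Predicted fraud probabilities on the held-out set, thresholded at 0.5
pred_prob <- predict(model_xgb, dcv)
pred_xgb  <- as.numeric(pred_prob > 0.5)
```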

And evaluate the result by creating a confusion matrix:
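Using caret again:

```r
# Compare predictions against the true labels stored in the DMatrix
cm_xgb <- confusionMatrix(factor(pred_xgb), factor(getinfo(dcv, "label")))
cm_xgb
```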

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 10916     0
##          1    16 10812
##                                           
##                Accuracy : 0.9993          
##                  95% CI : (0.9988, 0.9996)
##     No Information Rate : 0.5028          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9985          
##                                           
##  Mcnemar's Test P-Value : 0.0001768       
##                                           
##             Sensitivity : 0.9985          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9985          
##              Prevalence : 0.5028          
##          Detection Rate : 0.5020          
##    Detection Prevalence : 0.5020          
##       Balanced Accuracy : 0.9993          
##                                           
##        'Positive' Class : 0               
## 

We can access different metrics from our confusion matrix.
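For instance, recall, precision, and F1 live in the byClass slot of caret's confusionMatrix object:

```r
# Pull individual metrics out of the confusion matrix object
paste("Recall of XGBoost is:",    cm_xgb$byClass["Recall"])
paste("Precision of XGBoost is:", cm_xgb$byClass["Precision"])
paste("F1 of XGBoost is:",        cm_xgb$byClass["F1"])
```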

## [1] "Recall of XGBoost is: 0.998536406878888"
## [1] "Precision of XGBoost is: 1"
## [1] "F1 of XGBoost is: 0.999267667521055"

We can also plot the ROC curve and compute the AUC, as before.