The results from TPOT point to using a Decision Tree algorithm.
Once we've selected our algorithm:
Train a randomForest model in R.
Predict fraudulent credit card transactions.
The model used in this lab is randomForest, an ensemble of decision trees provided by the R randomForest package.
To listen to the videos, copy and paste the website URL into the Chrome browser on your host machine, as there is no sound card in the Lab environment.
Train the randomForest algorithm with the same dataset.
In Spoon, open the following main job:
/home/pentaho/Workshop--Data-Integration/Labs/Module 7 - Workflows/Machine Learning/Credit Card Fraud/solution/jb_fraud_main_job.kjb
Right-click on the train_model transformation and select Open Referenced Object -> Transformation.
R Script Executor
Double-click on the rscrpt-train_model_randomForest step to bring up the configuration settings.
Under the Configure tab, ensure that Input Frames points to the step sv-convert_booleans_to_numbers and that the R Frame name is train.
Set Row Handling to Number of Rows to Process: All.
Select the R script tab. Copy and paste each code snippet under its matching comment.
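# Load the randomForest package.
library(randomForest)
# Convert the incoming 'train' frame to a data frame.
train.df <- as.data.frame(train)
# Train a forest of 8 trees on all other columns and record variable importance.
rf <- randomForest(reported_as_fraud_historic ~ ., data = train.df, ntree = 8, importance = TRUE)
# Save the trained model for the prediction step.
save(rf, file = "/home/pentaho/rf.rdata")
# Return a one-row status frame as the step output.
ok <- "Finished"
ok.df <- as.data.frame(ok)
ok.df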
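Before training, it can help to confirm that the incoming frame has the shape the script expects. The following is a minimal sketch in base R, not part of the lab solution. It assumes the response is coded 0/1 after the boolean conversion; the toy data and every column name other than reported_as_fraud_historic are made up for illustration.

```r
# Toy stand-in for the 'train' frame passed in by the R Script Executor step.
# Only reported_as_fraud_historic is a real column name from the lab;
# amount and card_present are hypothetical.
train <- data.frame(
  amount = c(12.5, 300.0, 45.9),
  card_present = c(1, 0, 1),
  reported_as_fraud_historic = c(0, 1, 0)
)
train.df <- as.data.frame(train)

# Names of columns that are not numeric (expected to be empty after the
# booleans have been converted to numbers).
non_numeric <- names(train.df)[!vapply(train.df, is.numeric, logical(1))]

# TRUE when the response only contains 0 and 1 (assumed coding).
response_ok <- all(train.df$reported_as_fraud_historic %in% c(0, 1))

# Number of rows with at least one missing value.
rows_with_na <- sum(!complete.cases(train.df))

print(non_numeric)
print(response_ok)
print(rows_with_na)
```

The missing-value count matters because the formula interface of randomForest uses na.fail by default, so training stops with an error if any row contains NA.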
Use the trained model to predict fraudulent credit card activity.
In Spoon, open the following main job:
/home/pentaho/Workshop--Data-Integration/Labs/Module 7 - Workflows/Machine Learning/Credit Card Fraud/solution/jb_fraud_main_job.kjb
Right-click on the train_model transformation and select Open Referenced Object -> Transformation.
R Script Executor
Double-click on the rscrpt-predict step to bring up the configuration settings.
Under the Configure tab, ensure that Input Frames points to the step sv-convert_booleans_to_numbers and that the R Frame name is test.
Set Row Handling to Number of Rows to Process: All.
Select the R script tab. Copy and paste each code snippet under its matching comment.
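# Load the randomForest package so predict() can handle the model.
library(randomForest)
# Convert the incoming 'test' frame to a data frame.
test.df <- as.data.frame(test)
# Load the model saved by the training step; this restores the object rf.
load(file = "/home/pentaho/rf.rdata")
# Score each test row.
pred <- predict(rf, newdata = test.df)
pred.df <- as.data.frame(pred)
# Append the predictions to the test data and return the result.
submission <- data.frame(cbind(test.df, pred.df))
submission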
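The train, save, load and predict sequence can also be tried outside Spoon. The sketch below is a standalone illustration on synthetic data, not the lab's dataset: the columns amount and is_foreign, the rule that generates the response, and the temporary file path are all assumptions. It requires the randomForest package to be installed.

```r
library(randomForest)

set.seed(42)  # make the synthetic data and the forest reproducible

# Synthetic stand-in for the lab data. Only reported_as_fraud_historic is a
# real column name from the lab; the other columns and the rule below are made up.
n <- 200
toy <- data.frame(
  amount = runif(n, 1, 500),
  is_foreign = rbinom(n, 1, 0.2)
)
toy$reported_as_fraud_historic <- as.numeric(toy$amount > 300)

train.df <- toy[1:150, ]
test.df <- toy[151:200, ]

# A numeric 0/1 response is fitted as a regression forest, so randomForest
# warns that the response has few unique values.
rf <- randomForest(reported_as_fraud_historic ~ ., data = train.df,
                   ntree = 8, importance = TRUE)

# Save to a temporary file (the lab uses /home/pentaho/rf.rdata instead).
model_path <- tempfile(fileext = ".rdata")
save(rf, file = model_path)
rm(rf)

# load() restores the object under its original name, rf.
load(file = model_path)
pred <- predict(rf, newdata = test.df)

# One row per test record, with the prediction as an extra column.
submission <- cbind(test.df, pred = pred)
print(head(submission))
```

Because the response is numeric, each prediction is an average over the trees and falls between 0 and 1, so it can be read as a score rather than a hard class label.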
A spreadsheet formula is used to calculate a percentage, which can be used to trigger various actions. The output workbook is:
/home/pentaho/Workshop--Data-Integration/Labs/Module 7 - Workflows/Machine Learning/Credit Card Fraud/solution/output/credit_card_predict.xlsx
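The workbook's formula is not reproduced here. As a rough sketch of the idea, the following base R code turns prediction scores into percentages and maps them to an action. The scores, the 70% cut-off and the action labels are all hypothetical and are not taken from the lab.

```r
# Hypothetical prediction scores, standing in for the pred column of the output.
pred <- c(0.02, 0.10, 0.75, 0.50, 0.98)

# Express each score as a percentage, rounded to one decimal place.
fraud_pct <- round(pred * 100, 1)

# Assumed cut-off: scores at or above 70% are sent for review.
threshold_pct <- 70
action <- ifelse(fraud_pct >= threshold_pct, "review", "accept")

result <- data.frame(pred, fraud_pct, action)
print(result)
```

In a workflow, the action column is the kind of value a downstream step could branch on, for example to route flagged transactions to a separate output.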