Pentaho Data Integration

Credit Card

ML - randomForest



The results from TPOT point to using a Decision Tree algorithm.

Once we've selected our algorithm:

  • Train a randomForest model in R.

  • Deploy your model.

  • Predict fraudulent credit card transactions.

The model used in this lab is randomForest, an R package that builds an ensemble of decision trees.

To listen to the videos, copy and paste the website URL into the Chrome browser on your host machine, as there's no sound card in the Lab environment.

Train the randomForest algorithm with the same dataset.

  1. In Spoon, open the following main job:

/home/pentaho/Workshop--Data-Integration/Labs/Module 7 - Workflows/Machine Learning/Credit Card Fraud/solution/jb_fraud_main_job.kjb

  2. Right-click on the train_model transformation and select Open Referenced Object -> Transformation.


R Script Executor

  1. Double-click on the rscrpt-train_model_randomForest step to bring up the configuration settings.

  2. Under the Configure tab, ensure that Input Frames points to the step sv-convert_booleans_to_numbers and that the R Frame name is train.

  3. Set Row Handling to Number of Rows to Process: All.

  4. Select the R script tab. Copy and paste the code snippets according to the comments.

library(randomForest)

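# Convert the incoming R frame to a data frame.
train.df <- as.data.frame(train)
# Name the response column directly so that "." expands to the remaining columns only.
rf <- randomForest(reported_as_fraud_historic ~ ., data = train.df, ntree = 8, importance = TRUE)
# Save the fitted model so the prediction step can load it.
save(rf, file = "/home/pentaho/rf.rdata")

# Return a small data frame as the result of the script.
ok <- "Finished"
ok.df <- as.data.frame(ok)
ok.df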
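
Because the training script sets importance = TRUE, the fitted forest carries per-variable importance scores that can be inspected after training. The following is a minimal sketch rather than part of the lab solution: it assumes synthetic data in place of the train frame supplied by PDI, and the predictor names amount and hour are hypothetical. Since the response is numeric (booleans converted to numbers), randomForest fits a regression forest and may warn that the response has five or fewer unique values.

```r
library(randomForest)

# Hypothetical stand-in for the `train` frame passed in by PDI;
# the predictor names (amount, hour) are illustrative only.
set.seed(42)
train.df <- data.frame(
  amount = runif(200, 1, 500),
  hour   = sample(0:23, 200, replace = TRUE)
)
train.df$reported_as_fraud_historic <- as.numeric(train.df$amount > 400)

# Call modeled on the lab script: numeric response, so a regression forest is fitted.
rf <- randomForest(reported_as_fraud_historic ~ ., data = train.df,
                   ntree = 8, importance = TRUE)

# One row per predictor; for a regression forest with importance = TRUE
# the columns are %IncMSE and IncNodePurity.
imp <- importance(rf)
print(imp)
```

With only 8 trees the scores are noisy, so they are better read as a rough ranking of predictors than as precise measurements.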

Use the trained model to predict fraudulent credit card activity.

  1. In Spoon, open the following main job:

/home/pentaho/Workshop--Data-Integration/Labs/Module 7 - Workflows/Machine Learning/Credit Card Fraud/solution/jb_fraud_main_job.kjb

  2. Right-click on the prediction transformation and select Open Referenced Object -> Transformation.


R Script Executor

  1. Double-click on the rscrpt-predict step to bring up the configuration settings.

  2. Under the Configure tab, ensure that Input Frames points to the step sv-convert_booleans_to_numbers and that the R Frame name is test.

  3. Set Row Handling to Number of Rows to Process: All.

  4. Select the R script tab. Copy and paste the code snippets according to the comments.

library(randomForest)

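# Convert the incoming R frame to a data frame.
test.df <- as.data.frame(test)
# Restore the model object rf saved by the training step.
load(file = "/home/pentaho/rf.rdata")
# Score every row of the test data.
pred <- predict(rf, newdata = test.df)
pred.df <- as.data.frame(pred)

# Append the predictions to the input rows and return the result.
submission <- data.frame(cbind(test.df, pred.df))
submission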
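
The training and prediction scripts are linked only by the saved file and the object name rf: load() restores the object under the name it was saved with. The round trip can be sketched as follows, under stated assumptions: the data is synthetic, the predictor names amount and hour and the helper make_rows are hypothetical, and a temporary file stands in for /home/pentaho/rf.rdata.

```r
library(randomForest)

# Hypothetical stand-ins for the `train` and `test` frames; column names
# other than reported_as_fraud_historic are illustrative only.
set.seed(1)
make_rows <- function(n) {
  df <- data.frame(amount = runif(n, 1, 500),
                   hour   = sample(0:23, n, replace = TRUE))
  df$reported_as_fraud_historic <- as.numeric(df$amount > 400)
  df
}
train.df <- make_rows(200)
test.df  <- make_rows(50)

# Training side: fit the model and save it under the object name `rf`.
rf <- randomForest(reported_as_fraud_historic ~ ., data = train.df, ntree = 8)
model_path <- tempfile(fileext = ".rdata")  # the lab writes to /home/pentaho/rf.rdata
save(rf, file = model_path)
rm(rf)

# Prediction side: load() recreates `rf` and returns the restored object names.
restored <- load(model_path)
pred <- predict(rf, newdata = test.df)

# Append the predictions to the input rows.
submission <- cbind(test.df, pred = pred)
head(submission)
```

If the object were saved under a different name, the prediction script would have to refer to that name instead, which is why both scripts use rf.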

A spreadsheet formula is used to calculate a percentage, which can be used to trigger various actions.

  1. Navigate to:

/home/pentaho/Workshop--Data-Integration/Labs/Module 7 - Workflows/Machine Learning/Credit Card Fraud/solution/output/credit_card_predict.xlsx
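
The text does not say which percentage the workbook formula computes. One plausible reading is the share of transactions whose predicted score reaches a cutoff. A minimal sketch of that calculation in R, assuming made-up prediction values and a hypothetical cutoff of 0.5 (neither is taken from the lab workbook):

```r
# Hypothetical predictions standing in for the pred column of the output file.
pred <- c(0.02, 0.91, 0.10, 0.67, 0.00, 0.45, 0.88, 0.05)

threshold <- 0.5              # assumed cutoff, not taken from the lab workbook
flagged <- pred >= threshold  # TRUE for rows treated as likely fraud

# Share of flagged transactions, as a percentage.
pct_flagged <- 100 * mean(flagged)
pct_flagged
# 3 of the 8 values reach the cutoff, so this gives 37.5
```

A downstream job could compare a value like pct_flagged against a limit to decide whether to raise an alert.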
