AutoML
Automated Machine Learning (AutoML) is about producing Machine Learning solutions for the data scientist without endless manual iteration over data preparation, model selection and model parameters.
Imagine that a direct retailer wants to reduce losses due to orders involving fraudulent use of credit cards. They accept orders via phone and their web site, and ship goods directly to the customer.
Basic customer details, such as customer name, date of birth, billing address and preferred shipping address, are stored in a relational database.
Orders, as they come in, are stored in a database. There is also a report of historical instances of fraud contained in a CSV file.
To listen to the video, please copy and paste the website URL into your host's Chrome browser, as there is no sound card in the Lab environment.
In this lab you will:
Prepare Data - Data Wrangling
Set up Feature Engineering
TPOT - automated ML to determine the best algorithm.
Colab - Build and Train a Decision Tree Model.
Deploy and Test the model.
With the goal of preparing a dataset for ML, we can use PDI to combine these disparate data sources and engineer some features to learn from. The following figure shows a transformation that demonstrates exactly that, including some steps for deriving new fields.
To begin with, customer data is joined from several data sources, and then blended with transactional data and historical fraud occurrences contained in a CSV file.
Start PDI
cd
cd ~/Pentaho/design-tools/data-integration
./spoon.sh
Open the following autoML.ktr
~/Workshop--Data-Integration/Labs/Module 6 - Machine Learning/autoML.ktr
Browse the various customer data sources:
customer transaction details
feature engineering for ship_to_zip
Transaction details (the x variables) are used by the decision trees to determine whether the transaction is fraudulent (the y variable). The Boolean values will need to be converted into numbers for the randomForest algorithm, as sketched below.
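In the transformation this conversion is handled by a PDI step; purely as an illustration, and assuming hypothetical field names such as web_order and first_time_customer, the equivalent in pandas might look like this:

import pandas as pd

# Hypothetical rows with Boolean flags; the field names are assumptions for illustration.
df = pd.DataFrame({'web_order': [True, False], 'first_time_customer': [False, True]})

# Convert the Boolean flags to 0/1 so the modelling algorithm receives numeric inputs.
df[['web_order', 'first_time_customer']] = df[['web_order', 'first_time_customer']].astype(int)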
Feature engineering is a machine learning technique that leverages data to create new variables that aren't in the training set.
It can produce new features for both supervised and unsupervised learning, with the goal of simplifying and speeding up data transformations while also enhancing model accuracy.
billing_shipping_zip_equal: [customer_billing_zip]=[ship_to_zip]
There are steps for deriving additional fields that might be useful for predictive modelling. These include computing the customer's age, extracting the hour of the day the order was placed, and setting a flag to indicate whether the shipping and billing addresses have the same zip code.
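For illustration, the same derivations can be sketched in pandas; the column names customer_dob and order_timestamp are assumptions, not necessarily the exact PDI field names:

import pandas as pd

# Hypothetical order records; the field names and values are assumptions for illustration only.
orders = pd.DataFrame({
    'customer_dob': pd.to_datetime(['1985-03-14', '1999-11-02']),
    'order_timestamp': pd.to_datetime(['2023-06-01 14:22:00', '2023-06-01 02:05:00']),
    'customer_billing_zip': ['30301', '94105'],
    'ship_to_zip': ['30301', '10001'],
})

# Customer age in whole years at the time of the order.
orders['age'] = (orders['order_timestamp'] - orders['customer_dob']).dt.days // 365

# Hour of the day the order was placed.
orders['hour_of_day'] = orders['order_timestamp'].dt.hour

# Flag indicating whether the billing and shipping zip codes match (0/1 for modelling).
orders['billing_shipping_zip_equal'] = (orders['customer_billing_zip'] == orders['ship_to_zip']).astype(int)

print(orders[['age', 'hour_of_day', 'billing_shipping_zip_equal']])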
So, what does the data scientist do at this point?
Typically, they will want to get a feel for the data by examining simple summary statistics and visualizations, followed by applying quick techniques for assessing the relationship between individual attributes (fields) and the target of interest which, in this example, is the reported_as_fraud_historic field.
Following that, if there are attributes that look promising, quick tests with common supervised classification algorithms will be next on the list. This comprises the initial stages of experimental data mining – that is, the process of determining which predictive techniques are going to give the best result for a given problem.
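A minimal sketch of that kind of quick test, using a synthetic stand-in for the engineered dataset (in the lab, the data would come from the PDI output instead):

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the engineered dataset; the column names mirror the lab's fields.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'order_dollar_amount': rng.uniform(10, 500, 200),
    'hour_of_day': rng.integers(0, 24, 200),
    'billing_shipping_zip_equal': rng.integers(0, 2, 200),
    'reported_as_fraud_historic': rng.integers(0, 2, 200),
})

X = df.drop(columns=['reported_as_fraud_historic'])
y = df['reported_as_fraud_historic']

# Quick 5-fold cross-validated accuracy for two common baseline classifiers.
for name, clf in [('decision tree', DecisionTreeClassifier(max_depth=5)),
                  ('logistic regression', LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print(f'{name}: mean accuracy {scores.mean():.3f}')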
The Tree-based Pipeline Optimization Tool (TPOT) is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best fit for your data.
Sign into Colab.
Connect to a hosted runtime.
Select File -> Open Notebook
Upload:
/home/pentaho/Workshop--Data-Integration/Labs/Module 7 - Workflows/Machine Learning/AutoML/data/credit_card_fraud.ipynb
These are the code sections for the Jupyter file credit_card_fraud.ipynb:
Install the TPOT libraries:
# Installs TPOT libraries.
!pip install tpot
Import libraries:
# import libraries
import numpy as np
import pandas as pd
from tpot import TPOTClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
Import dataset:
from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
dataset = pd.read_csv('TPOT.csv', sep=';', header=None)
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 8].values
Path to TPOT.csv
/home/pentaho/Workshop--Data-Integration/Labs/Module 7 - Workflows/Machine Learning/AutoML/data/TPOT.csv:
Add column headers:
dataset.columns = ['first_time_customer', 'order_dollar_amount', 'num_items', 'age',
                   'web_order', 'total_transactions_to_date', 'hour_of_day',
                   'billing_shipping_zip_equal', 'reported_as_fraud_historic']
Convert dataset to numpy array and fit data (optional):
# Scale the feature columns to the [0, 1] range; the last column is the target.
x = dataset.iloc[:, 0:-1].values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
X = np.asarray(x_scaled)
y = np.asarray(dataset.iloc[:, -1])
Split the dataset; 75% is used for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.75, random_state=None)
Run the TPOT Classifier:
tpot = TPOTClassifier(generations=1, verbosity=2, population_size=100, scoring='accuracy', n_jobs = -1, config_dict='TPOT light')
tpot.fit(X_train, y_train)
output_score=str(tpot.score(X_test, y_test))
print(tpot.fitted_pipeline_)
Export Pipeline as Python script:
tpot.export('tpot_exported_credit_card_pipeline.py')
from google.colab import files
files.download('tpot_exported_credit_card_pipeline.py')
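The exported file is a standalone Python script that rebuilds the winning pipeline. Its exact contents depend on the TPOT version and on the pipeline TPOT found; assuming the winner is a DecisionTreeClassifier, the general shape is something like this sketch (the path, separator and hyperparameters below are placeholders):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder path and separator: substitute the real data file before running.
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep=';', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=None)

# The hyperparameters here are placeholders; TPOT writes the ones it actually selected.
exported_pipeline = DecisionTreeClassifier(criterion='gini', max_depth=5, min_samples_leaf=5)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)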
Once the Python script has been tested, it can be added to our autoML transformation, and the algorithms suggested in its output can be tested.
Enable the rest of the hops in the transformation, except: Model Catalogue (table).
Open the step: auto machine learning
Ensure you set the path to Python 3.10.
To ensure the script does not take a long time to process, the following TPOT parameters have been set:
tpot = TPOTClassifier(generations=1, verbosity=2, population_size=100, config_dict='TPOT light')
Click on the Input tab.
a. Use this tab to make selections for moving data from PDI fields to Python variables.
b. The All rows option is commonly used for data frames. A data frame is used for storing data tables and is composed of a list of vectors of equal length.
c. Select the All rows option to process all your data at once, for example, using the Python list of dictionaries.
Available variables
Use the plus sign button to add a Python variable to the input mapping for the script used in the transformation. You can remove the Python variable by clicking the X icon.
Variable name
Enter the name of the Python variable. The list of available variables will update automatically.
Step
Specify the name of the input step to map from. It can be any step in the parent transformation with an outgoing hop connected to the Python Executor step.
Data structure
Specify the data structure from which you want to pull the fields for mapping. You can select one of the following (see the sketch after this list):
· Pandas data frame: the tabular data structure for Python/Pandas.
· NumPy array: the table of values, all the same type, which is indexed by a tuple of positive integers.
· Python List of Dictionaries: each row in the PDI stream becomes a Python dictionary. All the dictionaries are put into a Python list.
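To make the three options concrete, here is a small sketch of how the same two PDI rows could arrive in Python under each data structure (the values are made up for illustration):

import numpy as np
import pandas as pd

# Two example rows as they might leave a PDI step (illustrative values only).
rows = [
    {'order_dollar_amount': 120.5, 'hour_of_day': 14, 'billing_shipping_zip_equal': 1},
    {'order_dollar_amount': 89.0, 'hour_of_day': 2, 'billing_shipping_zip_equal': 0},
]

# Python List of Dictionaries: one dictionary per PDI row, collected in a list.
list_of_dicts = rows

# Pandas data frame: a tabular structure with named columns.
frame = pd.DataFrame(rows)

# NumPy array: a table of values of a single type, indexed by integer positions.
array = frame.to_numpy(dtype=float)

print(list_of_dicts[0]['hour_of_day'])  # 14
print(frame['hour_of_day'].tolist())    # [14, 2]
print(array.shape)                      # (2, 3)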
The Mapping table contains the following field properties:
Data structure field
The value of the Python data structure field to which you want to map the PDI field.
Data structure type
The value of the data structure type assigned to the data structure field to which you want to map the PDI field.
PDI field
The name of the PDI field which contains the vector data stored in the mapped Python variable.
PDI data type
The value of the data type assigned to the PDI field, such as a date, a number, or a timestamp.
The cust variable defines the dataframe in the Python script using iloc:
x = dataset.iloc[:,1:-1].values
The dataframe is pulled from the PDI step sv-changes_to_numbers.
From this list, for the purposes of predictive modeling, we can drop the customer name, ID fields, email addresses, phone numbers and physical addresses. These fields are unlikely to be useful for learning purposes and, in fact, can be detrimental due to the large number of distinct values they contain.
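Purely as an illustration, dropping those identifying fields in pandas might look like the sketch below (the column names are assumptions, not the exact PDI field names):

import pandas as pd

# Hypothetical blended customer/order table; the column names are assumptions.
customer_df = pd.DataFrame({
    'customer_name': ['A. Smith'],
    'customer_id': [101],
    'email_address': ['a@example.com'],
    'order_dollar_amount': [120.5],
    'reported_as_fraud_historic': [0],
})

# High-cardinality identifiers carry little predictive signal and can mislead the learner,
# so they are excluded before modelling.
identifier_columns = ['customer_name', 'customer_id', 'email_address']
modelling_df = customer_df.drop(columns=identifier_columns)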
Click on the Output tab.
The output of the model_df dataframe, from the script:
model_df = pd.DataFrame(model_list, columns=['pipe', 'generation', 'mutation', 'crossover', 'predecessor', 'operator', 'cv'])
is converted back to PDI fields.
Preview the data for the tfo_model_catalogue step and sort by Cross Validation Performance.
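As a rough equivalent in pandas, sorting the catalogue by the cv column in the script would surface the best-performing pipelines first:

# Sort the model catalogue by cross-validation score, best pipeline first.
best_pipelines = model_df.sort_values('cv', ascending=False)
print(best_pipelines[['pipe', 'generation', 'cv']].head())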
For the first generation, the best pipeline run is a DecisionTree with a score of 0.8583 and an accuracy of 0.8448 (the figure used to judge the quality of the pipeline; look in the last logging line).
The best pipeline to use (with 86% accuracy) for this dataset is based on Decision Trees with a minimum of 5 trees.
It may also be worth looking at KNeighbors Classifier.
The object of using TPOT is to point you in the right direction for selecting the appropriate algorithm.
💡The results will be different each time you run the TPOTClassifier.