Data Cleansing

Traditional data cleansing techiques ..

This workshop section focuses on demonstrating traditional data cleansing techniques using Pentaho Data Integration.

This dataset contains various issues:

  • Duplicate records (John Doe, Alice Johnson)

  • Inconsistent phone number formats

  • Inconsistent date formats

  • Missing values

  • Inconsistent address formats

  • Removing Duplicates

Your mission, should you wish to accept it .. is to build a workflow / pipeline that resolves the issues .. explain the decisions you have made and any suggest possible enhancements ..

CustomerID	FirstName	LastName	Email	                Phone	         BirthDate	Address
1	        John	        Doe	        [email protected]	555-123-4567	 1985-03-15	123 Main St, City, CA, 12345
2	        Jane	        Smith	        [email protected]	(555) 987-6543	 03/22/1990	456 Elm Avenue, Town, AZ, 67890
3	        John	        Doe	        [email protected]	        5551234567	 1985-03-15	123 Main Street, City, CO, 12345
4	        Alice	        Johnson	        [email protected]	555-555-5555	 1988-12-01	789 Oak Rd, Village, State, 54321
5	        Bob	        Williams	[email protected]		                 1975-07-30	101 Pine Lane, Hamlet, State, 13579
6	        Emma	        Brown	        [email protected]	(555)246-8135	 05-19-1992	202 Cedar Blvd, Borough, NY, 24680
7	        Alice	        Johnson	        [email protected]	555.555.5555	 12/01/1988	789 Oak Road, Village, FL, 54321
8	        Charlie	        Davis	        [email protected]	555-369-2587		        303 Maple Dr, City, State, 97531
9		                Taylor	        [email protected]	555-159-7532	 1982-09-25	404 Birch St, Town, State, 86420
10	        Grace	        Lee	        [email protected]	5557894561	 11-11-1995	505 Walnut Ave, City, State, 
...

All the workshop files: ../Databases/CRUID/Workshop - Data Cleansing

First step is to onboard the data .. currently its a CSV file - which could easily be onboarded - instead let's onboard into a table.

There's a number of databases already installed and configured, running as containers.

  1. Log on to Portainer and check the MariaDB database container is up and running.

Link to Docker Containers in Portainer
  1. Execute the following script to create a sourceDB & targetDB databases.

'grant all' to pentaho_user & pentaho_admin with the password: 'password'.

CREATE DATABASE  IF NOT EXISTS sourceDB;
grant all on sourceDB.* to pentaho_user identified by 'password';
grant all on sourceDB.* to pentaho_admin identified by 'password';

USE sourceDB;

set session sql_mode=replace(@@sql_mode,'NO_ZERO_DATE','');
create sourceDB & targetDB

Transformation

Currently the customer records are in CSV format. It's optional, but, let's onboard into a database table as this is a more likely scenario.

tr_onboard

  1. Drag and drop a CSV File input onto the canvas.

  2. Double-click to configure the following settings:

CSV File input - customer_data.csv
  1. Increase the varchar (length) to prevent truncation.

  2. After clicking on 'Get Fields', 'Preview' the data.

Preview data

Last updated