Lab: SCDs
Slowly Changing Dimensions ..
Last updated
Slowly Changing Dimensions ..
Last updated
The demonstration illustrates the key concepts and workflows for dealing with Slowly Changing Dimensions.
In this Lab, you are going to create a dimension table DIM_SCD, which will have:
• a technical key field
• a version field
• a date from and date to fields
• an Id field
• a name field
The technical key will become the primary key for the dimension table.
The version field sets the record version (increments) on an update or insert.
Create the Table in MySQL sampledata database.
In your DB management tool, execute the following statement.
CREATE TABLE `DIM_SCD` (
`TK` bigint(10) NOT NULL,
`version` int(11) DEFAULT '0',
`id` int(11) DEFAULT '0',
`city` tinytext,
`date_from` datetime DEFAULT NULL,
`date_to` datetime DEFAULT NULL,
`last_update` datetime DEFAULT NULL,
PRIMARY KEY (`TK`),
KEY `idx_DIM_SCD_lookup` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
This step allows you to enter a static list of rows in a grid. This is usually done for testing, reference or demo purposes.
Drag the Data Grid step onto the canvas.
Open the Data Grid properties dialog box.
Ensure the following details are configured, as outlined below:
On the Data tab enter the following values:
The Get System Info step retrieves information from the Kettle environment.
This step generates a single row with the fields containing the requested information. It also accepts input rows.
Drag the Get System Info step onto the canvas.
Open the Get System Info properties dialog box.
Ensure the following details are configured, as outlined below:
The Dimension Lookup/Update step allows you to implement Ralph Kimball's slowly changing dimension for both types: Type I (update) and Type II (insert) together with some additional functions. Not only can you use this step to update a dimension table, it may also be used to look up values in a dimension.
Drag the Dimension lookup/update step onto the canvas.
Open the Dimension lookup/update properties dialog box.
Ensure the following details are configured, as outlined below:
x
This step is rarely used for lookups as specific steps exists; however, it does illustrate the database process of how referential integrity is maintained.
Add ‘London’ to the Data Grid.
To initially configure the fields set the step to Update mode.
Once completed remember to reset to lookup mode.
Ensure that the Dimension Lookup / Update is set to: Lookup Mode.
Sets the step to Update Mode with lookup keys: id
There’s no point adding the LAST_UPDATE field as we're dealing with Type 1 changes.
Click on the Fields tab.
RUN the Transformation
Click the Run button in the Canvas Toolbar.
Click on the Dimension lookup/update step and Preview the data.
Why does London have a TK 0 and a value of NULL?
To maintain referential integrity, the record is assigned a TK 0, and as it doesn’t exist in the database, the value returned is NULL. As this is a lookup, no record is written to the database table.
The Dimension Lookup/Update step allows you to implement Ralph Kimball's slowly changing dimension for both types:
Type 1
Overwriting the old value. In this method, no history of dimension changes is kept in the database. The old dimension value is simply overwritten with the new one. This type is easy to maintain and is often use for data which changes are caused by processing corrections (e.g. removal special characters, correcting spelling errors).
Open the Dimension Lookup / Update properties dialog box.
Ensure the following details are configured, as outlined below:
Click on the Get Fields button.
This will add the key fields used in the Lookup.
The keys that map the records (rows) in the dimension table to the stream field is: id
Click on the Fields tab and map the stream name field to the dimension name field and ensure:
There are several options available to insert / update the dimension record (row).
This option implements a Type I & II slowly changing dimension policy. If the difference is detected for one or more mappings that have the Insert option, then a row is added to the dimension table.
This option simply updates the matched row. It can be used to implement a Type I slowly changing dimension.
The punch through option also performs an update. But instead of only updating the matched dimension row, it will update all versions of the row in a Type II slowly changing dimension.
Use this option to let the step automatically maintain a date field that records the date of the insert or update using the system date field as source.
Use this option to let the step automatically maintain a date field that records the date of the last insert using the system date field as source.
Use this option to let the step automatically maintain a date field that records the date of the last update using the system date field as source.
Use this option to let the step automatically maintain a flag that indicates if the row is the last version.
At the moment its all Type I changes .. Insert / Update record and Update when that record change occurred.
Click the Run button in the Canvas Toolbar.
Click on the Preview tab:
Examine the table in your SQL Query Tool:
Kettle automatically inserts an additional record with a technical key of value 0 (for default or unknown values). This will only happen in the first execution. Below this record, you find the one record (London) from our sample dataset.
In update mode (update option is enabled) the step first performs a lookup of the dimension entry. The result of the lookup is different though. Not only the technical key is retrieved from the query, but also the dimension attribute fields. A field-by-field comparison then follows. The result can be one of the following situations:
• The record was not found, we insert a new record in the table.
• The record was found and any of the following is true:
* One or more attributes were different and had an "Insert" (Kimball Type II) setting: A new dimension record version is inserted.
* One or more attributes were different and had a "Punch through" (Kimball Type I) setting: These attributes in all the dimension record versions are updated.
* One or more attributes were different and had an "Update" setting: These attributes in the last dimension record version are updated.
* All the attributes (fields) were identical: No updates or insertions are performed.
If you mix Insert, Punch Through and Update options in this step, this algorithm acts like a Hybrid Slowly Changing Dimension. (it is no longer just Type I or II, it is a combination)
Double-click on the Data Grid step.
Add the following City: Madrid
Double-click on the Dimension lookup/update step.
Click on th Fields tab and set the following:
Execute the transformation and view the results.
Note that if the record doesnt exisit (Madrid) then it gets inserted. Both records are updated with the same last_update timestamp.
The Dimension Lookup/Update step allows you to implement Ralph Kimball's slowly changing dimension for both types:
Type 2
Creating a new additional record. In this methodology, all history of dimension changes are kept in the database. You capture attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain as attributes the natural key (or another durable identifier). Also 'effective date' and 'current indicator' columns are used in this method. There could be only one record with current indicator set to 'Y'. For 'effective date' columns, i.e. start_date and end_date, the end_date for current record usually is set to value 9999-12-31. Introducing changes to the dimensional model in Type 2 could be very expensive database operation so it is not recommended to use it in dimensions where a new attribute could be added in the future.
Double-click on the Data Grid step.
Add the following City: Pariss (incorrectly spelt 😄)
Keep the Fields strategy the same:
At the moment its all Type I changes .. Insert / Update record and Update when that record change occurred.
Click the Run button in the Canvas Toolbar.
Click on the Preview tab:
Examine the table in your SQL Query Tool:
Execute the transformation and view the results.
As expected the Pariss record is inserted. Notice that the natural primary keys (id) are the same as the TK (Technical Key) and we're on version 1 for each record.
Let's now do a bit of data entry .. Correct the entry: Pariss to Paris
The workflows below will give you an idea of the different strategies that can be implemented.
Any records that are changed are:
Archived with date_from to date_to timestamp.
Record is assigned Version 1. New record Version 2
Note the Keys.
Double-click on the Data Grid step.
Edit the Pariss value to: Paris
Double-click on the Dimension lookup/update step.
Click on th Fields tab and set the following:
The reason for selecting Insert is obvious.. you need to insert a new record that tracks the change. If you select Update, then a Type 1 change occurs; i.e. the original record is updated.
RUN the Transformation
Click the Run button in the Canvas Toolbar.
Examine the table in your SQL Query Tool:.
A new record has been inserted, with the value ‘Paris’, with an updated date_from and last_update timestamp, and version.
Notice that the LAST_UPDATE field has also been updated for the other records.
Lets start with a clean table to clearly illustrate the results. Truncate the table.
truncate tabel DIM_SCD;
Repeat the workflow above and check the results.
Ensure you have set 'Pariss' as the value for city in the data grid step.
Set update as your initial strategy.
Double-click on the Data Grid step.
Edit the Pariss value to: Paris
Double-click on the Dimension lookup/update step.
Click on th Fields tab and set the following:
RUN the Transformation
Click the Run button in the Canvas Toolbar.
Examine the table in your SQL Query Tool:
Edit the Fields tab in the Dimension Lookuo/update step to Punch through:
RUN the Transformation
Click the Run button in the Canvas Toolbar.
Examine the table in your SQL Query Tool:
So whats happened ..
Remember a Punch through updates the fields where the records match ..
All the records have a last_update field .. so this will be updated
As both our Pariss & Paris records have a matching key id=3 then the archived record will be updated with the current city value .. Paris