Pentaho Data Integration
InstallationBusiness AnalyticsCToolsData CatalogData QualityLLMs
  • Overview
    • Pentaho Data Integration ..
  • Data Integration
    • Getting Started
      • Configuring PDI UI
      • KETTLE Variables
    • Concepts & Terminolgy
      • Hello World
      • Logging
      • Error Handling
    • Data Sources
      • Flat Files
        • Text
          • Text File Input
          • Text File Output
        • Excel
          • Excel Writer
        • XML
          • Read XML
        • JSON
          • Read JSON
      • Databases
        • CRUID
          • Database Connections
          • Create DB
          • Read DB
          • Update DB
          • Insert / Update DB
          • Delete DB
        • SCDs
          • SCDs
      • Object Stores
        • MinIO
      • SMB
      • Big Data
        • Hadoop
          • Apache Hadoop
    • Enrich Data
      • Merge
        • Merge Streams
        • Merge Rows (diff)
      • Joins
        • Cross Join
        • Merge Join
        • Database Join
        • XML Join
      • Lookups
        • Database Lookups
      • Scripting
        • Formula
        • Modified JavaScript Value
        • User Defined Java Class
    • Enterprise Solution
      • Jobs
        • Job - Hello World
        • Backward Chaining
        • Parallel
      • Parameters & Variables
        • Parameters
        • Variables
      • Scalability
        • Run Configurations
        • Partition
      • Monitoring & Scheduling
        • Monitoring & Scheduling
      • Logging
        • Logging
      • Dockmaker
        • BA & DI Servers
      • Metadata Injection
        • MDI
    • Plugins
      • Hierarchical Data Type
  • Use Cases
    • Streaming Data
      • MQTT
        • Mosquitto
        • HiveMQ
      • AMQP
        • RabbitMQ
      • Kafka
        • Kafka
    • Machine Learning
      • Prerequiste Tasks
      • AutoML
      • Credit Card
    • RESTful API
    • Jenkins
    • GenAI
  • Reference
    • Page 1
Powered by GitBook
On this page
  1. Data Integration
  2. Data Sources
  3. Databases

SCDs

Slowly Changing Dimensions ..

PreviousDelete DBNextSCDs

Last updated 28 days ago

Introduction

Slowly Changing Dimensions (SCD) - dimensions that change slowly over time, rather than changing on regular schedule, time-base. In a Data Warehouse, there is a need to track changes in dimension attributes in order to report historical data. In other words, implementing one of the SCD types should enable users assigning proper dimension's attribute value for given date. Example of such dimensions could be: customer, geography, employee.

There are many approaches how to deal with SCD. The most popular are:

  • Type 0 - The passive method

  • Type 1 - Overwriting the old value

  • Type 2 - Creating a new additional record

  • Type 3 - Adding a new column

  • Type 4 - Using historical table

  • Type 6 - Combine approaches of types 1,2,3 (1+2+3=6)

Type 0 - value does not change over time

A type 0 slowly changing dimension is a dimension that never changes its attributes over time.

For example, the date of birth of a person is a type 0 attribute, because it does not change after it is recorded. A type 0 dimension can be used to store the original values of some attributes that are not relevant for historical analysis.

Type 1 - Overwrite with new value

Overwriting the old value. In this method, no history of dimension changes is kept in the database. The old dimension value is simply overwritten with the new one. This type is easy to maintain and is often use for data which changes are caused by processing corrections (e.g. removal special characters, correcting spelling errors).

Type 2 - Historically track the value change

Type 2 - Creating a new additional record. In this methodology, all history of dimension changes is kept in the database. You capture attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain as attributes the natural key (or another durable identifier).

Also 'effective date' and 'current indicator' columns are used in this method. There could be only one record with current indicator set to 'Y'. For 'effective date' columns, i.e. start_date and end_date, the end_date for current record usually is set to value 9999-12-31.

Introducing changes to the dimensional model in type 2 could be very expensive database operation so it is not recommended to use it in dimensions where a new attribute could be added in the future.

Type 3 - Add a column to track value changes

Adding a new column. In this type, usually only the current and previous value of dimension is kept in the database. The new value is loaded into 'current/new' column and the old one into 'old/previous' column.

The history is limited to the number of column created for storing historical data. This is the least commonly needed technique.

Using historical table. In this method, a separate historical table is used to track all dimension's attribute historical changes for each of the dimension. The 'main' dimension table keeps only the current data e.g. customer and customer_history tables.

Combine approaches of types 1,2,3 (1+2+3=6). In this type, we have in dimension table such additional columns as: current_type - for keeping current value of the attribute. All history records for given item of attribute have the same current value. historical_type - for keeping historical value of the attribute. All history records for given item of attribute could have different values. start_date - for keeping start date of 'effective date' of attribute's history. end_date - for keeping end date of 'effective date' of attribute's history. current_flag - for keeping information about the most recent record. In this method to capture attribute change we add a new record as in type 2. The current_type information is overwritten with the new one as in type 1. We store the history in a historical_column as in type 3.

Workshops

Slowly changing dimensions (SCDs) are an integral part of data warehousing and business intelligence platforms, capturing how an entity's data changes over time. They facilitate tracking of historical data, providing insights into trends, patterns, and long-term changes in data. Implementing SCDs enables businesses to maintain historical accuracy, understand temporal shifts, and make informed decisions based on a comprehensive view of data evolution.

The workshop covers how to implement Type 1 & 2.

Slowly Changing Dimensions

Slowly Changing Dimensions (SCD) in Pentaho Data Integration handles historical changes to dimension data in your data warehouse. PDI includes a dedicated "Dimension lookup/update" step that supports multiple SCD types:

SCD Types Supported

Type 1 (Overwrite): Simply replaces old values with new ones, maintaining no history.

Type 2 (Add new row): Preserves history by creating new records when changes occur. This typically involves:

  • Adding date/time fields to track when records were active

  • Including a "current" flag to identify the latest version

  • Maintaining version numbers for each record

Type 3 (Add new field): Tracks limited history by keeping both the current and previous values in separate columns.

SCDs
Type 1
Type 2
Type 3
Type 4
Type 6