Pentaho Data Integration

Partition

Partitioning data in Pentaho Data Integration allows you to distribute your data into distinct subsets based on a specific rule, such as a field value or a hash function. This can improve the performance and scalability of your data integration jobs, especially when you have a large amount of data or multiple servers. Partitioning data can also help you avoid data skew and resource underutilization.
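
For intuition, here is a minimal Python sketch of such a rule; the 'state' field and the two subsets are invented for illustration and are not part of the lab:

    # Illustrative only: a partitioning rule maps each row to one subset.
    def rule(row):
        # Route rows by the first letter of the 'state' field, a made-up rule.
        return "A-M" if row["state"][0] <= "M" else "N-Z"

    for row in ({"state": "CA"}, {"state": "NY"}, {"state": "TX"}):
        print(row["state"], "->", rule(row))
    # CA -> A-M
    # NY -> N-Z
    # TX -> N-Z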

By default, each step in a transformation runs in its own thread, so all steps execute in parallel. With a single copy of each step, the data is read by the CSV file input step and then aggregated in the count by state step.
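
As a rough model of that execution style (a hedged sketch, not PDI's actual engine code), the Python snippet below runs two 'steps' as separate threads connected by a row buffer, mirroring the lab's CSV file input and count by state steps:

    import queue
    import threading
    from collections import Counter

    buffer = queue.Queue()            # rows flow between the two step threads

    def csv_input():                  # stands in for the CSV file input step
        for state in ["CA", "NY", "CA"]:
            buffer.put(state)
        buffer.put(None)              # signal end of stream

    def count_by_state():             # stands in for the count by state step
        counts = Counter()
        while (state := buffer.get()) is not None:
            counts[state] += 1
        print(counts)                 # Counter({'CA': 2, 'NY': 1})

    threads = [threading.Thread(target=csv_input),
               threading.Thread(target=count_by_state)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()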

  1. Open the transformation:

~/Workshop--Data-Integration/Labs/Module 5 - Enterprise Solution/Scalability/Demo - Partitioning/tr_parallel_reading_and_aggregation.ktr

To take advantage of the processing resources on your server, you can scale up the transformation with the multi-threading option Change Number of Copies to Start, which produces multiple copies of a step (right-click the step to access the menu). As shown below, the x2 notation indicates that two copies will be started at runtime.

  2. Change the number of copies to start for the CSV file input step to 2.

By default, this data movement from the CSV file input step into the count by state step will be performed in round-robin order. This means that if there are 'N' copies, the first copy gets the first row, the second copy gets the second row, and the Nth copy receives the Nth row. Row N+1 goes to the first copy again, and so on until there are no more rows to distribute.
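
The round-robin movement can be sketched in a few lines of Python; the row values and copy count below are invented for illustration:

    from itertools import cycle

    def distribute_round_robin(rows, n_copies):
        # Copy 1 gets row 1, copy 2 gets row 2, then back to copy 1, and so on.
        copies = [[] for _ in range(n_copies)]
        targets = cycle(range(n_copies))
        for row in rows:
            copies[next(targets)].append(row)
        return copies

    rows = ["row1", "row2", "row3", "row4", "row5"]
    for i, chunk in enumerate(distribute_round_robin(rows, 2), start=1):
        print("copy", i, chunk)
    # copy 1 ['row1', 'row3', 'row5']
    # copy 2 ['row2', 'row4']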

Reading the data from the CSV file is done in parallel.

Attempting to aggregate in parallel, however, produces incorrect results because the rows are split arbitrarily (without a specific rule) over the two copies of the count by state aggregation step, as shown in the preview data.

  3. Preview the data and notice that some of the state counts are duplicated.

Why? And can you suggest how to solve the problem?

This is where partitioning data becomes a useful concept: it applies a specific rule to direct rows for aggregation, so that rows from the same state go to the same step copy instead of being split arbitrarily.

In the example below, a partition schema called 'State' was applied to the 'count by state' step and the 'Remainder of division' partitioning rule was applied to the 'State' field. Now, the count by state aggregation step produces consistent correct results because the rows were split up according to the partition schema and rule.
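
To make the difference concrete, here is a small Python sketch (the state values are made up) contrasting an arbitrary round-robin split with a partitioned split for the same count-by-state aggregation:

    from collections import Counter

    states = ["CA", "CA", "NY", "CA", "NY", "TX"]

    # Round-robin split: both copies receive 'CA' rows, so each emits its own
    # partial count and that state appears twice in the combined output.
    round_robin = [states[0::2], states[1::2]]
    print([Counter(chunk) for chunk in round_robin])

    # Partitioned split: every row for a given state lands in the same copy
    # (equal values hash to the same partition within a run), so each state
    # is counted exactly once.
    partitioned = [[s for s in states if hash(s) % 2 == p] for p in range(2)]
    print([Counter(chunk) for chunk in partitioned])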

So what's happening behind the scenes?

Remainder of division is a partitioning method that assigns a row to a partition based on the remainder of dividing a field value by the number of partitions. For example, if you have four partitions and a field value of 13, the remainder of division is 1 (13 mod 4 = 1), so the row is assigned to partition 1. This method ensures that rows with the same field value end up in the same partition, which is useful for aggregation or grouping operations.
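
A quick worked example of the arithmetic, with invented field values:

    n_partitions = 4
    for value in (13, 14, 8, 13):
        print(value, "mod", n_partitions, "=", value % n_partitions)
    # 13 mod 4 = 1   -> partition 1
    # 14 mod 4 = 2   -> partition 2
    # 8 mod 4 = 0    -> partition 0
    # 13 mod 4 = 1   -> same value, same partition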

The Table output step (double-click the step to open it) supports partitioning rows of data to different tables. When configured to accept the table name from a Partitioning field, the PDI client outputs each row to the appropriate table. You can also Partition data per month or Partition data per day. To ensure that all the necessary tables exist, it is recommended to create them in a separate transformation.
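
As a hedged illustration of the per-month idea outside of PDI, the sketch below derives a target table name from a date field; the sales_YYYYMM naming pattern is an assumption for this example, not the lab's actual table names:

    from datetime import date

    def monthly_table(d: date) -> str:
        # Hypothetical naming pattern: one table per month, e.g. sales_202403.
        return f"sales_{d.year:04d}{d.month:02d}"

    print(monthly_table(date(2024, 3, 15)))  # sales_202403
    print(monthly_table(date(2024, 4, 2)))   # sales_202404
    # All target tables must already exist, which is why the text recommends
    # creating them in a separate transformation.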

You can choose from different partitioning methods, such as remainder of division, binary tree, mirror sequence, or custom.

Partitioning data (Hitachi Vantara Lumada and Pentaho Documentation)
Screenshots referenced above: State Count, x2 Copies, Step Metrics, Duplication due to round-robin aggregation, Partitions, Partition Table.