Pentaho Data Integration

Partition

Partitioning data in Pentaho Data Integration allows you to distribute your data into distinct subsets based on a specific rule, such as a field value or a hash function. This can improve the performance and scalability of your data integration jobs, especially when you have a large amount of data or multiple servers. Partitioning data can also help you avoid data skew and resource underutilization.
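
For intuition, here is a minimal Python sketch of such a rule; the 'state' field and the two subsets are invented for illustration and are not part of the lab:

    # Illustrative only: a partitioning rule maps each row to one subset.
    def rule(row):
        # Route rows by the first letter of the 'state' field, a made-up rule.
        return "A-M" if row["state"][0] <= "M" else "N-Z"

    for row in ({"state": "CA"}, {"state": "NY"}, {"state": "TX"}):
        print(row["state"], "->", rule(row))
    # CA -> A-M
    # NY -> N-Z
    # TX -> N-Z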

By default, each step in a transformation runs in its own thread, so all steps execute in parallel. With a single copy of each step, the data is read by the CSV file input step and then aggregated in the count by state step.
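
As a rough model of that execution style (a hedged sketch, not PDI's actual engine code), the Python snippet below runs two 'steps' as separate threads connected by a row buffer, mirroring the lab's CSV file input and count by state steps:

    import queue
    import threading
    from collections import Counter

    buffer = queue.Queue()            # rows flow between the two step threads

    def csv_input():                  # stands in for the CSV file input step
        for state in ["CA", "NY", "CA"]:
            buffer.put(state)
        buffer.put(None)              # signal end of stream

    def count_by_state():             # stands in for the count by state step
        counts = Counter()
        while (state := buffer.get()) is not None:
            counts[state] += 1
        print(counts)                 # Counter({'CA': 2, 'NY': 1})

    threads = [threading.Thread(target=csv_input),
               threading.Thread(target=count_by_state)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()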

  1. Open the transformation:

~/Workshop--Data-Integration/Labs/Module 5 - Enterprise Solution/Scalability/Demo - Partitioning/tr_parallel_reading_and_aggregation.ktr

To take advantage of the processing resources on your server, you can scale up the transformation with the multi-threading option Change Number of Copies to Start, which produces multiple copies of a step (right-click the step to access the menu). As shown below, the x2 notation indicates that two copies will be started at runtime.

  2. Change the number of copies to start for the CSV file input step to 2.

By default, this data movement from the CSV file input step into the count by state step will be performed in round-robin order. This means that if there are 'N' copies, the first copy gets the first row, the second copy gets the second row, and the Nth copy receives the Nth row. Row N+1 goes to the first copy again, and so on until there are no more rows to distribute.
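
The round-robin movement can be sketched in a few lines of Python; the row values and copy count below are invented for illustration:

    from itertools import cycle

    def distribute_round_robin(rows, n_copies):
        # Copy 1 gets row 1, copy 2 gets row 2, then back to copy 1, and so on.
        copies = [[] for _ in range(n_copies)]
        targets = cycle(range(n_copies))
        for row in rows:
            copies[next(targets)].append(row)
        return copies

    rows = ["row1", "row2", "row3", "row4", "row5"]
    for i, chunk in enumerate(distribute_round_robin(rows, 2), start=1):
        print("copy", i, chunk)
    # copy 1 ['row1', 'row3', 'row5']
    # copy 2 ['row2', 'row4']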

Reading the data from the CSV file is done in parallel.

Attempting to aggregate in parallel, however, produces incorrect results because the rows are split arbitrarily (without a specific rule) over the two copies of the count by state aggregation step, as shown in the preview data.

  3. Preview the data and notice that some of the state counts are duplicated.

Why? And can you suggest how to solve the problem?

This is where partitioning data becomes a useful concept: it applies a specific rule to direct rows for aggregation, so that rows from the same state go to the same step copy instead of being split arbitrarily.

In the example below, a partition schema called 'State' was applied to the 'count by state' step and the 'Remainder of division' partitioning rule was applied to the 'State' field. Now, the count by state aggregation step produces consistent correct results because the rows were split up according to the partition schema and rule.
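
To make the difference concrete, here is a small Python sketch (the state values are made up) contrasting an arbitrary round-robin split with a partitioned split for the same count-by-state aggregation:

    from collections import Counter

    states = ["CA", "CA", "NY", "CA", "NY", "TX"]

    # Round-robin split: both copies receive 'CA' rows, so each emits its own
    # partial count and that state appears twice in the combined output.
    round_robin = [states[0::2], states[1::2]]
    print([Counter(chunk) for chunk in round_robin])

    # Partitioned split: every row for a given state lands in the same copy
    # (equal values hash to the same partition within a run), so each state
    # is counted exactly once.
    partitioned = [[s for s in states if hash(s) % 2 == p] for p in range(2)]
    print([Counter(chunk) for chunk in partitioned])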

So what's happening behind the scenes?

Remainder of division is a partitioning method that assigns a row to a partition based on the remainder of dividing a field value by the number of partitions. For example, if you have four partitions and a field value of 13, the remainder of division is 1 (13 mod 4 = 1), so the row is assigned to partition 1. This method ensures that rows with the same field value end up in the same partition, which is useful for aggregation or grouping operations.
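
A quick worked example of the arithmetic, with invented field values:

    n_partitions = 4
    for value in (13, 14, 8, 13):
        print(value, "mod", n_partitions, "=", value % n_partitions)
    # 13 mod 4 = 1   -> partition 1
    # 14 mod 4 = 2   -> partition 2
    # 8 mod 4 = 0    -> partition 0
    # 13 mod 4 = 1   -> same value, same partition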

The Table output step (double-click the step to open it) supports partitioning rows of data to different tables. When configured to accept the table name from a Partitioning field, the PDI client outputs each row to the appropriate table. You can also Partition data per month or Partition data per day. To ensure that all the necessary tables exist, it is recommended to create them in a separate transformation.
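
As a hedged illustration of the per-month idea outside of PDI, the sketch below derives a target table name from a date field; the sales_YYYYMM naming pattern is an assumption for this example, not the lab's actual table names:

    from datetime import date

    def monthly_table(d: date) -> str:
        # Hypothetical naming pattern: one table per month, e.g. sales_202403.
        return f"sales_{d.year:04d}{d.month:02d}"

    print(monthly_table(date(2024, 3, 15)))  # sales_202403
    print(monthly_table(date(2024, 4, 2)))   # sales_202404
    # All target tables must already exist, which is why the text recommends
    # creating them in a separate transformation.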

You can choose from different partitioning methods, such as remainder of division, binary tree, mirror sequence, or custom.

Partitioning data (Hitachi Vantara Lumada and Pentaho Documentation)
Screenshots referenced above: State Count, x2 Copies, Step Metrics, Duplication due to round-robin aggregation, Partitions, Partition Table.