Partition
Last updated
Partitioning data in Pentaho Data Integration allows you to distribute your data into distinct subsets based on a specific rule, such as a field value or a hash function. This can improve the performance and scalability of your data integration jobs, especially when you have a large amount of data or multiple servers. Partitioning data can also help you avoid data skew and resource underutilization.
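To make the idea concrete, here is a minimal sketch (plain Python, not PDI code; the function and field names are illustrative assumptions) of a hash-based partitioning rule: every row with the same value of the chosen field lands in the same partition.

```python
# Sketch: partition rows by hashing a field value, so all rows
# sharing a key (e.g. the same "state") end up in one partition.

def partition_by_field(rows, key, n_partitions):
    """Assign each row to a partition based on the hash of a field."""
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        idx = hash(row[key]) % n_partitions
        partitions[idx].append(row)
    return partitions

rows = [{"state": s} for s in ["CA", "NY", "CA", "TX", "NY", "CA"]]
parts = partition_by_field(rows, "state", 2)
# Every "CA" row is guaranteed to be in a single partition.
```

Because the rule is deterministic for a given key, work for one key never splits across partitions, which is what later makes parallel aggregation safe.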
By default, each step in a transformation runs in its own thread, so the steps execute in parallel with a single copy each. With one copy of each step, the data is read by the CSV file input step and then aggregated in the count by state step.
Open transformation:
~/Workshop--Data-Integration/Labs/Module 5 - Enterprise Solution/Scalability/Demo - Partitioning/tr_parallel_reading_and_aggregation.ktr
To take advantage of the processing resources in your server, you can scale up the transformation using the multi-threading option Change Number of Copies to Start to produce copies of the steps (right-click the step to access the menu). As shown below, the x2 notation indicates that two copies will be started at runtime.
Change the number of copies for the CSV file input to 2.
By default, this data movement from the CSV file input step into the count by state step will be performed in round-robin order. This means that if there are 'N' copies, the first copy gets the first row, the second copy gets the second row, and the Nth copy receives the Nth row. Row N+1 goes to the first copy again, and so on until there are no more rows to distribute.
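The round-robin distribution described above can be sketched in a few lines of plain Python (not PDI code): row i simply goes to copy i mod N.

```python
# Sketch of round-robin row distribution across N step copies:
# the first copy gets row 1, the second copy gets row 2, and so on,
# wrapping back to the first copy after N rows.

def round_robin(rows, n_copies):
    copies = [[] for _ in range(n_copies)]
    for i, row in enumerate(rows):
        copies[i % n_copies].append(row)
    return copies

copies = round_robin([1, 2, 3, 4, 5, 6, 7], 3)
# copies == [[1, 4, 7], [2, 5], [3, 6]]
```

Note that the assignment depends only on row position, not on any field value, which is why related rows can end up in different copies.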
Reading the data from the CSV file is done in parallel.
Attempting to aggregate in parallel, however, produces incorrect results because the rows are split arbitrarily (without a specific rule) over the two copies of the count by state aggregation step, as shown in the preview data.
Preview the data and notice that some of the 'State counts' are duplicated.
Why? Can you suggest how to solve the problem?
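As a hint, the effect can be reproduced in a few lines of plain Python (not PDI code; the data is a made-up assumption): with round-robin distribution each copy sees only part of the stream, so each copy emits its own partial count for a state, whereas a partitioning rule on the state field keeps every state inside one copy.

```python
# Sketch of why parallel aggregation over round-robin rows produces
# duplicated counts, and how partitioning by the grouping field fixes it.
from collections import Counter

states = ["CA", "NY", "CA", "TX", "NY", "CA"]

# Round-robin split over 2 copies, then aggregate per copy:
rr = [states[0::2], states[1::2]]      # ['CA','CA','NY'] and ['NY','TX','CA']
rr_counts = [Counter(c) for c in rr]
# 'CA' appears in both copies' output -> duplicated 'State count' rows.

# Partition by state (same value -> same copy), then aggregate:
parts = [[], []]
for s in states:
    parts[hash(s) % 2].append(s)
part_counts = [Counter(p) for p in parts]
# Each state is now counted by exactly one copy, so counts are correct.
```

This is exactly the role of a partitioning rule in PDI: it replaces the arbitrary row split with one based on the field being aggregated.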