Joins
Pentaho Joins
Introduction
Pentaho Data Integration (PDI) offers several join components to combine data from different streams based on specified key fields. Here's a summary of the main join types available:
Merge Join: This is the standard join step that combines two sorted input streams based on matching key fields. It supports inner joins, left outer joins, right outer joins, and full outer joins. Both input streams must be sorted on the join keys for this step to work correctly.
Cross Join (Cartesian Product): This join pairs every row of one stream with every row of the other, so the output row count is the product of the two input row counts. Adding a condition filters the pairs, which lets the step emulate other join types. It's memory-intensive but doesn't require pre-sorted input.
Database Join: This specialized join allows you to look up values in a database table for each input row. It performs a database query for each incoming row, using values from the input stream as parameters.
Multiway Merge Join: This join combines more than two streams in a single step, for cases where several datasets must be merged together.
XML Join: A specialized join for combining XML data structures. It allows you to merge XML content from two streams, useful when working with XML-based data sources or targets.
Each join type has specific use cases and performance characteristics, allowing PDI to handle a wide variety of data integration scenarios efficiently.
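The sorting requirement of Merge Join is easier to see in code. The following is a minimal Python sketch of a sort-merge join, not PDI's actual implementation: the function name `merge_join`, the `(key, value)` row layout, and the `join_type` values are assumptions made for illustration. It walks both inputs once, in key order, and supports only inner and left outer joins.

```python
def merge_join(left, right, join_type="inner"):
    """Join two lists of (key, value) tuples that are sorted ascending by key.

    Hypothetical helper for illustration only; returns (key, left_value,
    right_value) tuples. Unmatched left rows get None on the right side
    when join_type is "left_outer".
    """
    if join_type not in ("inner", "left_outer"):
        raise ValueError("join_type must be 'inner' or 'left_outer'")

    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            # Left key has no partner on the right.
            if join_type == "left_outer":
                out.append((lkey, left[i][1], None))
            i += 1
        elif lkey > rkey:
            # Right key has no partner on the left; skip it.
            j += 1
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lkey:
                for _, rval in right[j:j_end]:
                    out.append((lkey, left[i][1], rval))
                i += 1
            j = j_end

    if join_type == "left_outer":
        # Remaining left rows never found a match.
        out.extend((k, v, None) for k, v in left[i:])
    return out


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z")]

inner = merge_join(left, right)
left_outer = merge_join(left, right, join_type="left_outer")

# Unsorted input silently loses matches: key 1 exists on both sides
# but is never paired, because the scan only moves forward.
unsorted_result = merge_join([(2, "b"), (1, "a")], [(1, "x"), (2, "y")])
```

Because the scan never moves backward, a row that arrives out of order is simply skipped rather than reported as an error. This is why, in this sketch at least, both inputs have to be sorted on the join key before they reach the join.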

Workshops
The workshops in this section cover the following joins:
Cross Join
A Pentaho cross join is a way of combining two streams of data in a Cartesian product, meaning that every row from one stream is joined with every row from the other stream. This can be useful for creating combinations of values or performing calculations based on multiple inputs.
However, the output contains one row for every pair of input rows, so it grows quickly: two streams of 1,000 rows each produce 1,000,000 rows. To keep the output manageable, filter the input streams before the join, add a join condition, or use a lookup instead where one would do.
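As a rough illustration of the behavior described above, here is a small Python sketch of a Cartesian product with an optional condition. It is not how PDI implements the step. The function `cross_join`, the sample `orders` and `tiers` rows, and their field names are all hypothetical.

```python
from itertools import product


def cross_join(left, right, condition=None):
    """Pair every left row with every right row (rows are dicts).

    Hypothetical helper for illustration. If `condition` is given, only
    pairs for which condition(left_row, right_row) is true are kept.
    """
    rows = []
    for l_row, r_row in product(left, right):
        if condition is None or condition(l_row, r_row):
            rows.append({**l_row, **r_row})
    return rows


orders = [
    {"order_id": 1, "amount": 50},
    {"order_id": 2, "amount": 150},
]
tiers = [
    {"tier": "low", "min": 0, "max": 100},
    {"tier": "high", "min": 100, "max": 1000},
]

# Unfiltered: every order paired with every tier (2 x 2 = 4 rows).
all_pairs = cross_join(orders, tiers)

# With a condition: keep only the tier whose range contains the amount.
matched = cross_join(
    orders,
    tiers,
    condition=lambda o, t: t["min"] <= o["amount"] < t["max"],
)
```

The range condition is a case that a key-equality join cannot express directly, which is one reason to reach for a Cartesian product and then filter it. The unfiltered call also shows why the row count needs watching, since it always equals the product of the two input sizes.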
