Pentaho Data Integration

Cross Join

Good old Cartesian join.

Last updated 1 month ago

Workshop - Cross Join

The CARTESIAN JOIN or CROSS JOIN returns the Cartesian product of the sets of records from two or more joined tables. It is equivalent to an inner join where the join condition always evaluates to True, or where the join condition is absent from the statement. In other words, it's every possible combination of rows.
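
As a rough illustration outside PDI, the Cartesian product can be sketched in Python with `itertools.product`. The sample values and the `inner_join` helper are hypothetical; the sketch only shows that an inner join whose condition is always true yields the same rows as the cross join.

```python
from itertools import product

# Hypothetical sample sets; any two lists of records would do.
colours = ["red", "green"]
sizes = ["S", "M", "L"]

def inner_join(left, right, condition):
    """Nested-loop inner join: keep each pair for which condition holds."""
    return [(l, r) for l in left for r in right if condition(l, r)]

# Cross join: every possible (colour, size) pair.
cross = list(product(colours, sizes))

# An inner join whose condition is always True returns the same rows.
always_true = inner_join(colours, sizes, lambda l, r: True)

print(len(cross))            # 2 * 3 = 6 rows
print(cross == always_true)  # True
```

The row count is the product of the input sizes, which is why cross joins grow quickly as inputs get larger.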

In this workshop we'll cross join first names with middle names, and then cross join the result with a surname.

You should be familiar with the Data Grid step. It is used here to list the values for first and middle names.

The Join rows step joins all possible first_name, middle_name combinations together.

You can also add a condition to constrain the resulting dataset.
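
To make the effect of a condition concrete, here is a small Python sketch. The name values and the condition itself (dropping rows where both names are identical) are assumptions for illustration, not the values or condition used in the workshop transformation.

```python
from itertools import product

# Hypothetical values standing in for the first and middle name lists.
first_names = ["James", "Oliver", "Henry"]
middle_names = ["James", "Arthur"]

# Unconstrained cross join: 3 * 2 = 6 combinations.
all_pairs = list(product(first_names, middle_names))

# Hypothetical condition: drop rows where both names are identical.
constrained = [(f, m) for f, m in all_pairs if f != m]

for first_name, middle_name in constrained:
    print(first_name, middle_name)
```

With these inputs the condition removes one of the six combinations, leaving five rows.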

Two new fields are added to the data stream:

  • first_middle_name: concatenates the first and middle names.

  • initials: returns the 2-character initials.

The initials are extracted with the Java substring() method. It has two variants and returns a new string that is a substring of the original string. The substring begins with the character at the specified index and extends to the end of the string or, if the second argument is given, up to endIndex - 1.
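
The workshop's expressions run inside PDI; the following is only a Python analogue in which slicing plays the role of the substring call. The `substring` helper, the sample names, and the space separator are assumptions for illustration, not the transformation's actual expression.

```python
def substring(s, begin_index, end_index=None):
    """Mimic the two substring variants using Python slicing.

    With one index, return everything from begin_index to the end;
    with two, return characters begin_index .. end_index - 1.
    """
    if end_index is None:
        return s[begin_index:]
    return s[begin_index:end_index]

# Hypothetical row values.
first_name, middle_name = "Oliver", "James"

# Concatenate the names (a space separator is assumed here).
first_middle_name = first_name + " " + middle_name

# The first character of each name gives the 2-character initials.
initials = substring(first_name, 0, 1) + substring(middle_name, 0, 1)

print(first_middle_name)       # Oliver James
print(initials)                # OJ
print(substring("Oliver", 3))  # ver
```

One difference to keep in mind: Python slicing silently clamps out-of-range indices, whereas Java's String.substring throws an exception.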

Returns the value of the ${surname} variable. This is set in the Parameters tab of the Transformation properties.

The second Join rows step joins all possible first_middle_name, surname combinations together. Its condition also excludes a list of various initials combinations from the output.

The Select values step determines the order and selection of the data stream fields.

Two new fields are added to the data stream:

  • boys_initials: returns the baby's 3-character initials.

  • boys_name: concatenates first_name + middle_name + surname.

  • Returns 5 sampled records.

  • Uses ${seed} as the random seed.

  • Reservoir Sampling lets you select a fixed number of random records from a 'reservoir' of records whose total size is not known beforehand.

  • Use a different seed value for each run so that the sampled sets differ.
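
For readers who want to see the idea in code, below is a minimal Python sketch of the classic reservoir sampling scheme often called Algorithm R. Whether PDI's Reservoir sampling step uses this exact variant is not stated here; the function name, the generated input rows, and the seed value are all hypothetical.

```python
import random

def reservoir_sample(rows, k, seed):
    """Pick k rows uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)       # a fixed seed makes the sample repeatable
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)   # fill the reservoir with the first k rows
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < k:
                reservoir[j] = row  # replace with probability k / (i + 1)
    return reservoir

# Hypothetical stream: a generator, so its length is not known up front.
names = (f"name_{n}" for n in range(1000))
sample = reservoir_sample(names, k=5, seed=42)
print(sample)
```

The input is read once and only k rows are held in memory at any time. Calling the function again with the same seed and the same input reproduces the same sample, while a different seed will generally produce a different one.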

RUN

The workshop illustrates the use of cross joins to create data sets with every possible combination - unless conditions are set. The final dataset is randomly selected using Reservoir Sampling - a common technique used in ML.

  1. Click the Run button in the Canvas Toolbar.

  2. Click on the Preview tab:

[Figures: Cross / Cartesian Join; Cross Joins; First names; Middle names; Join rows; UDJE - concat names, initials; Join rows; Select values; UDJE - name; Reservoir sampling; Preview data]