Pentaho Data Integration
InstallationBusiness AnalyticsCToolsData CatalogData QualityLLMs
  • Overview
    • Pentaho Data Integration ..
  • Data Integration
    • Getting Started
      • Configuring PDI UI
      • KETTLE Variables
    • Concepts & Terminolgy
      • Hello World
      • Logging
      • Error Handling
    • Data Sources
      • Flat Files
        • Text
          • Text File Input
          • Text File Output
        • Excel
          • Excel Writer
        • XML
          • Read XML
        • JSON
          • Read JSON
      • Databases
        • CRUID
          • Database Connections
          • Create DB
          • Read DB
          • Update DB
          • Insert / Update DB
          • Delete DB
        • SCDs
          • SCDs
      • Object Stores
        • MinIO
      • SMB
      • Big Data
        • Hadoop
          • Apache Hadoop
    • Enrich Data
      • Merge
        • Merge Streams
        • Merge Rows (diff)
      • Joins
        • Cross Join
        • Merge Join
        • Database Join
        • XML Join
      • Lookups
        • Database Lookups
      • Scripting
        • Formula
        • Modified JavaScript Value
        • User Defined Java Class
    • Enterprise Solution
      • Jobs
        • Job - Hello World
        • Backward Chaining
        • Parallel
      • Parameters & Variables
        • Parameters
        • Variables
      • Scalability
        • Run Configurations
        • Partition
      • Monitoring & Scheduling
        • Monitoring & Scheduling
      • Logging
        • Logging
      • Dockmaker
        • BA & DI Servers
      • Metadata Injection
        • MDI
    • Plugins
      • Hierarchical Data Type
  • Use Cases
    • Streaming Data
      • MQTT
        • Mosquitto
        • HiveMQ
      • AMQP
        • RabbitMQ
      • Kafka
        • Kafka
    • Machine Learning
      • Prerequiste Tasks
      • AutoML
      • Credit Card
    • RESTful API
    • Jenkins
    • GenAI
  • Reference
    • Page 1
Powered by GitBook
On this page
  1. Data Integration
  2. Enrich Data
  3. Joins

Database Join

A self join or recursive join .. or is it ?

PreviousMerge JoinNextXML Join

Last updated 1 month ago

Workshop - Database Join

Searching for information in databases, text files, web services, and so on, is a very common task. In this workshop we're going to query the Products table for products are listed below a set buy price.

The database join isn't actually a join, but a series of queries against the table based on set conditions. Be aware this results in a performance hit.

Data grid

The Data grid step allows you to enter a static list of rows in a grid. This is usually done for testing, reference or demo purposes.

  1. Drag the Data grid step onto the canvas.

  2. Open the Data grid properties dialog box.

  3. Ensure the following details are configured, as outlined below:

Database Join

The Database Join step allows you to run a query against a database using data obtained from previous steps. The parameters for this query are specified as follows:

  • The data grid in the step properties dialog. This allows you to select the data coming in from the source hop.

  • As question marks (?) in the SQL query. When the step runs, these will be replaced with data coming in from the fields defined from the data grid. The question marks will be replaced in the same order as defined in the data grid.

  1. Drag the Database Join step onto the canvas.

  2. Open the Database Join properties dialog box.

  3. Ensure the following details are configured, as outlined below:

The ‘Parameter fieldname’ is where you specify the parameters, therefore the values, for the conditions. Each row in the grid represents a comparison between a column in the table, and a field in your stream, by using one of the provided comparators.

LIKE matches values. You can't alias a column in the select clause and then use it in the where clause

The question marks you type in the SQL statement represent parameters. The purpose of these parameters is to be replaced with the fields you provide in ‘Parameter fieldname’. For each row in the stream, the Database join step replaces the parameters in the same order as they are in the grid, and executes the SQL statement.

So, let’s look at the WHERE conditions entered:

PRODUCTNAME LIKE like_statement and BUYPRICE < max_price

For the first record this translates as:

WHERE PRODUCTNAME LIKE concat ('%','Aston Martin','%') AND BUYPRICE < 90

As the Outer Join option is checked The FULL OUTER JOIN keyword returns all rows from the left table and from the right table. The FULL OUTER JOIN keyword combines the result of both LEFT and RIGHT joins.

The table dataset A is then compared with the stream dataset B. If there’s a match, then values for PRODUCTNAME and PRODUCTSCALE are returned.

This is not a database join. Instead of joining tables in a database, you are joining the result of a database query with a dataset.

For the second record:

WHERE PRODUCTNAME LIKE concat ('%','Ford Falcon','%') AND BUYPRICE < 70

As there is no record, NULL values are returned for:

PRODUCTNAME and PRODUCTSCALE.

So far, the results could be achieved using a Database Lookup step. However, there is a significant difference, as illustrated with the third row. For Corvette, the Database join found two matching rows in the database, and retrieved them both. Not possible with a Database lookup step.

RUN

A Database Join involves running a bunch of queries with condition against a table. Useful when you're expecting to return a few records.

  1. Click the Run button in the Canvas Toolbar.

  2. Click on the Preview tab:

Note that there is more than one Corvette product. The database join is querying the table to return all the values, even NULL.

Database Join
Data grid - Meta
Data grid - Data
Database join
Results