Getting Started
Components, User Interface, Configuration options ..
In 'Getting Started' you will learn how to configure:
• Pentaho Data Integration & Server components
• Pentaho Data Integration User Interface
• KETTLE Configuration Files
• Adding JDBC Drivers
• Pentaho Repository
Pentaho Data Integration comprises a number of components:
Graphical modelling environment for developing, testing, debugging and monitoring jobs and transformations.
Drag & Drop 'objects' to design your pipelines and workflows.
Connects to Quartz scheduler on server. Jobs and transformations must be uploaded to Repository.
Kettle and Spark engines available to execute jobs and transformations.
Connects to Apache Jackrabbit content Repository, pointing to a supported database:
PostgreSQL
MSSQL Server
Oracle
MySQL
A Database Explorer that enables you to perform basic database operations.
The Pentaho Server hosts Pentaho-created and user-created content. It is a core component for executing data integration transformations and jobs using the Pentaho Data Integration (PDI) Engine. It allows you to manage users and roles (default security) or integrate security to your existing security provider such as LDAP or Active Directory.
The primary functions of the Pentaho Server are:
Execution
Executes ETL jobs and transformations using the Pentaho Data Integration engine
Security
Allows you to manage users and roles (default security) or integrate security to your existing security provider such as LDAP or Active Directory
Content Management
Provides a centralized repository that allows you to manage your ETL jobs and transformations. This includes full revision history on content and features such as sharing and locking for collaborative development environments.
Scheduling & Monitoring
Provides the services allowing you to schedule and monitor activities on the Data Integration Server from within the Spoon design environment (Quartz).
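For reference, a standard Pentaho Server installation is started and stopped with the scripts shipped in the pentaho-server directory; the install path below is illustrative and will differ per environment:
cd ~/Pentaho/server/pentaho-server
./start-pentaho.sh    # starts Tomcat and the Pentaho Server web application
./stop-pentaho.sh     # stops the server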
The Pentaho Documentation Portal has moved!
You will require a registered Hitachi Vantara SSO account.
The Pentaho DI Carte Server is a vital component within the Pentaho data integration suite, designed to facilitate robust data processing operations. It serves as a stand-alone web server and execution environment that allows for the remote execution of ETL (Extract, Transform, Load) tasks, making it a cornerstone for managing data workflows efficiently.
Carte stands out for its straightforward and user-friendly setup, paired with a highly efficient operation that conserves resources. This makes it the ideal choice for organizations seeking to enhance their data integration workflows efficiently and with minimal operational burden.
With Carte, executing ETL tasks remotely becomes effortless, allowing for versatile data integration management from any location. Serving as a powerful remote ETL server, Carte can handle jobs and transformations from afar, broadening the capabilities of data integration strategies.
Featuring an extensive array of built-in connectors, Carte excels in smoothly integrating with a multitude of sources and destinations, including databases and data warehouses. This capability facilitates straightforward data extraction, transformation, and loading processes across various platforms.
Carte is designed to grow with your needs, enabling deployment in multiple configurations such as Kubernetes, Docker, and cloud-based solutions. Its lightweight design ensures consistent performance, even as data demands increase.
The Carte web interface offers a clean and efficient way to oversee jobs and transformations. Users gain access to real-time task updates, status reports, and comprehensive execution logs, all through a user-friendly dashboard.
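As a quick illustration (the host, port, and default credentials are assumptions based on a standard Carte setup), a Carte instance can be started from the data-integration directory and monitored through its web interface:
cd ~/Pentaho/design-tools/data-integration
./carte.sh localhost 8081
# Status page: http://localhost:8081/kettle/status (default login for a fresh instance is cluster / cluster)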
You can use PDI's command line tools to execute PDI content from outside of Spoon. Typically, you would use these tools in the context of creating a script or a Cron job to run the job or transformation based on some condition outside of the realm of Pentaho software.
Pan
A standalone command line process that can be used to execute transformations you created in Spoon. The Pan data transformation engine reads data from and writes data to various data sources, and also allows you to manipulate data.
./pan.sh /file:/home/[pentaho_user]/[path]/[transformation].ktr /level:[Log Level]
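Pan can also run a transformation stored in the Pentaho Repository; the repository name, credentials, and directory below are placeholders:
./pan.sh /rep:[Repository Name] /user:[username] /pass:[password] /dir:[repository directory] /trans:[transformation name] /level:Basic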
Kitchen
A standalone command line process that can be used to execute jobs designed in the Spoon graphical interface and stored either as XML files or in a repository. Jobs are usually scheduled to run in batch mode at regular intervals.
./kitchen.sh /file:/home/[pentaho_user]/[path]/[job].kjb /level:[Log level]
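Kitchen accepts the same repository options for jobs; again, all bracketed values are placeholders:
./kitchen.sh /rep:[Repository Name] /user:[username] /pass:[password] /dir:[repository directory] /job:[job name] /level:Basic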
Within the UI, you can author, edit, run, and debug transformations and jobs. You can also enter license keys, add data connections, and define security (default options - Pentaho or LDAP).
The Welcome page contains useful links to documentation, community links for getting involved in the Pentaho Data Integration project, and links to blogs from some of the top contributors to the Pentaho Data Integration project.
There are a few different ways to start PDI. The method that you should use depends on the way you installed Pentaho Data Integration (PDI).
spoon.bat / spoon.sh
Starts Spoon
kitchen.bat / kitchen.sh
Command Line for Jobs
pan.bat / pan.sh
Command Line for Transformations
Run the following command (Linux):
cd
cd ~/Scripts
sh pentaho--platform.sh
The default Pentaho Data Integration (PDI) HOME directory is the user's home directory. The main PDI configuration files are located here, in the .kettle folder:
Windows: C:\Users\{user}\.kettle
Linux-based operating systems: $HOME/.kettle
The directory may change depending on the user who is logged on. Thus, the configuration files that control the behaviour of PDI jobs and transformations are different from user to user.
This also applies when running PDI from the Pentaho BI Platform. When you set the KETTLE_HOME variable, PDI jobs and transformations can be run without being affected by the user who is logged on. KETTLE_HOME is used to change the location of the files normally stored in [user home]/.kettle.
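For example (the shared directory below is illustrative), pointing KETTLE_HOME at a common location makes interactive and scheduled runs pick up the same .kettle folder:
export KETTLE_HOME=/opt/pentaho/etl_config    # PDI then reads /opt/pentaho/etl_config/.kettle/kettle.properties
./kitchen.sh /file:/home/[pentaho_user]/[path]/[job].kjb /level:Basic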
kettle.properties
main configuration file with global variables
shared.xml
list of shared artefacts
db.cache
database cache for metadata
repositories.xml
list of repositories
.spoonrc
settings for the UI
.languageChoice
language settings
The kettle.properties file is where you will find all the global variables for KETTLE. You can also set global variables that can be used in Transformations and Jobs. For example, you can define database connections, paths to files, or variables that can be used as parameters in your solution.
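For example (variable names and values are illustrative), entries added to kettle.properties can be referenced in steps and job entries as ${VARIABLE_NAME}:
# $HOME/.kettle/kettle.properties
STAGING_DB_HOST=dbserver01
STAGING_DB_PORT=5432
STAGING_DB_NAME=staging
INPUT_FILE_DIR=/data/incoming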
The kettle.properties file can be edited using a text editor or from within Spoon, via the menu: Edit > Edit the kettle.properties file.
A variety of objects can now be placed in a shared objects file on the local machine. The default location for the shared objects file is:
$HOME/.kettle/shared.xml
Objects that can be shared using this method include:
Database connections
Steps
Slave servers
Partition schemas
Cluster schemas
The location of the shared objects file is configurable on the "Miscellaneous" tab of the Transformation > Settings dialog.
To share one of these objects, simply right-click on the object in the tree control on the left and choose Share.
Bold type indicates that the object is shared.
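As an illustration only (the values are placeholders and the exact XML is normally written by Spoon when you share an object), a shared database connection in shared.xml looks roughly like this:
<sharedobjects>
  <connection>
    <name>staging_db</name>
    <server>dbserver01</server>
    <type>POSTGRESQL</type>
    <access>Native</access>
    <database>staging</database>
    <port>5432</port>
    <username>etl_user</username>
    <password>Encrypted [encrypted password]</password>
  </connection>
</sharedobjects>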
The list of repository connections that Spoon knows about is stored in the repositories.xml file on the local machine. The default location for this file is:
$HOME/.kettle/repositories.xml
<repositories>
  <repository>
    <id>PentahoEnterpriseRepository</id>
    <name>Pentaho</name>
    <description/>
    <is_default>false</is_default>
    <repository_location_url>http://localhost:8080/pentaho</repository_location_url>
    <version_comment_mandatory>N</version_comment_mandatory>
  </repository>
</repositories>
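A file-based repository entry can sit alongside the Pentaho Repository entry. The sketch below follows the format Spoon writes for a KettleFileRepository, but the name and path are placeholders:
<repository>
  <id>KettleFileRepository</id>
  <name>Local File Repository</name>
  <description>File-based repository on the local machine</description>
  <base_directory>/home/[pentaho_user]/pdi-repo</base_directory>
  <read_only>N</read_only>
  <hides_hidden_files>N</hides_hidden_files>
</repository>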
Used to store preferences and program state of Spoon. Other Kettle programs do not use this file.
General settings and defaults
User interface settings
The last opened transformation/job
The default location for the .spoonrc file is:
$HOME/.kettle/.spoonrc
#Kettle Properties file
#Sat Dec 16 22:49:28 GMT 2023
AskAboutReplacingDatabases=N
AutoCollapseCoreObjectsTree=Y
AutoSave=N
AutoSplit=N
BackgroundColorB=255
BackgroundColorG=255
BackgroundColorR=255
CustomParameterMergeJoinSortWarning=Y
CustomParameterMergeRowsSortWarning=Y
CustomParameterSetVariableUsageWarning=Y
...
These options are set from the main menu: Tools -> Options
The PDI client and the Pentaho Server need the appropriate JDBC driver to connect to the database that stores your data. Your database administrator, Chief Intelligence Officer, or IT manager should be able to provide the appropriate driver. If not, you can download drivers from your database vendor's website.
The Components Reference contains a list of drivers.
Once you have the correct driver, copy it to the following directories:
Pentaho Server: /pentaho/server/pentaho-server/tomcat/lib/
PDI client: data-integration/lib
You must restart the PDI client and the Pentaho Server for the new driver to take effect.
There should be only one driver for your database in the directory. Ensure that there are no other versions of the same vendor's driver in this directory. If there are, back up the old driver files and remove them to avoid version conflicts.
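For example (Linux, using a PostgreSQL driver; the jar version and install root are illustrative):
cp postgresql-42.7.1.jar ~/Pentaho/server/pentaho-server/tomcat/lib/
cp postgresql-42.7.1.jar ~/Pentaho/design-tools/data-integration/lib/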
The Pentaho+ platform implements its repository using Apache Jackrabbit, a fully conforming implementation of the Content Repository for Java Technology API (JCR, specified in JSR 170 and JSR 283).
Apache Jackrabbit needs two pieces of information to set up a runtime content repository instance:
Repository home directory
The filesystem path of the directory containing the content repository accessed by the runtime instance of Jackrabbit. This directory usually contains all the repository content, search indexes, internal configuration, and other persistent information managed within the content repository. Note that this is not absolutely required and some persistence managers and other Jackrabbit components may well be configured to access files and even other resources (like remote databases) outside the repository home directory. A designated repository home directory is however always needed even if some components choose not to use it. Jackrabbit will automatically fill in the repository home directory with all the required files and subdirectories when the repository is first instantiated.
Repository configuration file
The filesystem path of the repository configuration XML file. This file specifies the class names and properties of the various Jackrabbit components used to manage and access the content repository. Jackrabbit parses this configuration file and instantiates the specified components when the runtime content repository instance is created.
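In a default Pentaho Server installation, both pieces typically live under pentaho-solutions/system/jackrabbit (the install root below is illustrative):
cd ~/Pentaho/server/pentaho-server/pentaho-solutions/system/jackrabbit
ls repository.xml    # the repository configuration file
ls repository        # the repository home directory, populated on first start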
Hibernate is a Java framework used to store Java objects in a relational database system. It is an open-source, lightweight ORM (Object Relational Mapping) tool.
Quartz is an open source job-scheduling framework written entirely in Java and designed for use in both J2SE and J2EE applications.
Pentaho Data Integration (PDI) can track versions and comments for jobs, transformations, and connection information when you save them. You can turn version control and comment tracking on or off by modifying their related statements in the repository.spring.properties text file.
By default, version control and comment tracking are disabled (set to false). Best practice: manage your ETL workflows with a third-party version control tool, e.g. GitHub, and only upload the production version into the Repository.
Exit from the PDI client (also called Spoon).
Stop the Pentaho Server.
Edit repository.spring.properties file.
cd
cd ~/Pentaho/server/pentaho-server/pentaho-solutions/system
nano repository.spring.properties
Edit the versioningEnabled and versionCommentsEnabled statements:
versioningEnabled=true
versionCommentsEnabled=true