Pentaho Data Integration

Prerequisite Tasks

Configure Colab and Data Integration for ML.


Last updated 1 month ago

You will need to complete the following prerequisites:

  • Install Python

  • Create a Google Colab account

  • Install R (optionally RStudio)

  • Configure Pentaho Data Integration with R

Google Colab

Colab is a Python development environment, based on Jupyter Notebooks, that runs in the browser using Google Cloud.

It provides a runtime, fully configured for deep learning libraries, such as Keras, TensorFlow, PyTorch, and OpenCV.

If you haven't already, sign up for a free account.

The following prerequisite steps configure your environment to run ML data pipelines in Pentaho Data Integration.

This section is for Reference only.

The following tasks configure Pentaho Data Integration in a Linux environment.

Python

  1. Make sure all installed packages are up to date.

sudo apt update && sudo apt upgrade -y
  2. Check to see if Python is installed.

python3 --version
Python 3.10.12
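
The same check can be made from inside Python. This is a minimal sketch, assuming 3.10 as the minimum version (the version shown in the output above); `meets_minimum` is a hypothetical helper name, not part of any library.

```python
import sys


def meets_minimum(version_info, minimum=(3, 10)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    # sys.version starts with the dotted version number, e.g. "3.10.12 (...)".
    status = "OK" if meets_minimum(sys.version_info) else "too old"
    print(sys.version.split()[0], status)
```

Pass a different `minimum` tuple, for example `(3, 12)`, if your pipelines need a newer interpreter.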

Install the latest Python version

Only proceed to update your Python to the latest version if required.

  1. Install dependencies.

sudo apt install dirmngr ca-certificates software-properties-common apt-transport-https -y
  2. Import the key for the deadsnakes PPA.

sudo gpg --no-default-keyring --keyring /usr/share/keyrings/deadsnakes.gpg --keyserver keyserver.ubuntu.com --recv-keys F23C5A6CF475977595C89F51BA6932366A755776
  3. Add the repository.

echo "deb [signed-by=/usr/share/keyrings/deadsnakes.gpg] https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/pythonppa-deadsnakes.list
  4. Refresh the package cache, then list the available Python 3.1x versions.

sudo apt-get update && apt-cache search python3.1
  5. Install the latest version.

sudo apt install python3.12-full -y
  6. Create a symlink.

sudo ln -s /usr/bin/python3.12 /usr/bin/python
python --version

Different Python versions

If you need multiple versions of Python on the same system, you can choose which one is the default.

Here the default version of Python has been set to 3.10, as required by Apache Airflow.

  1. List the Python versions:

ls -ls /usr/bin/python*
  2. Register each Python version as an alternative:

sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.12 2
  3. Then select the required version:

sudo update-alternatives --config python
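
To confirm which interpreter each command name resolves to after switching, something like the following may help. It is a sketch only: `find_interpreters` is a hypothetical helper, and the default list of names simply mirrors the versions used on this page.

```python
import os
import shutil


def find_interpreters(names=("python", "python3", "python3.10", "python3.12")):
    """Map each command name to the real file it resolves to, or None if absent."""
    found = {}
    for name in names:
        path = shutil.which(name)  # first match on PATH, like the shell would pick
        # realpath follows symlinks, including the ones update-alternatives manages.
        found[name] = os.path.realpath(path) if path else None
    return found


if __name__ == "__main__":
    for name, target in find_interpreters().items():
        print(f"{name:12} -> {target or 'not found'}")
```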

The following libraries need to be installed:

  • pandas

  • matplotlib

  • py4j

  • numpy

  • wheel

  • scikit-learn

  • TPOT

  1. Ensure pip is installed.

sudo apt install python3-pip
  2. Install ML libraries.

pip install py4j
pip install scikit-learn
pip install tpot
Defaulting to user installation because normal site-packages is not writeable
Collecting py4j
  Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.5/200.5 KB 8.0 MB/s eta 0:00:00
Installing collected packages: py4j
Successfully installed py4j-0.10.9.7
Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
  Downloading scikit_learn-1.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.4/13.4 MB 60.7 MB/s eta 0:00:00
Collecting numpy>=1.19.5
  Downloading numpy-2.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.5/19.5 MB 54.1 MB/s eta 0:00:00
Collecting scipy>=1.6.0
  Downloading scipy-1.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (41.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41.1/41.1 MB 37.2 MB/s eta 0:00:00
Collecting joblib>=1.2.0
  Downloading joblib-1.4.2-py3-none-any.whl (301 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 301.8/301.8 KB 57.4 MB/s eta 0:00:00
...
pip install numpy
pip install matplotlib
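
Once the installs finish, a quick way to see whether every library in the list can be located is sketched below. `missing_modules` is a hypothetical helper; note the assumption that scikit-learn is imported as `sklearn`, so import names do not always match pip names.

```python
from importlib.util import find_spec

# Import names for the libraries listed above (scikit-learn is imported as sklearn).
REQUIRED = ("pandas", "matplotlib", "py4j", "numpy", "wheel", "sklearn", "tpot")


def missing_modules(names=REQUIRED):
    """Return the top-level module names the import system cannot locate."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All libraries found.")
```

Run it with the same interpreter that Pentaho Data Integration will call, since each Python version keeps its own site-packages.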

Install R from Ubuntu Repository

R is a language and environment for statistical computing and graphics.

  1. Update APT packages.

sudo apt update  && sudo apt upgrade
  2. Install the R base package and its dependencies.

sudo apt install r-base r-base-dev -y
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  bzip2-doc gfortran gfortran-11 icu-devtools libblas-dev libbz2-dev
  libgfortran-11-dev libicu-dev libjpeg-dev libjpeg-turbo8-dev libjpeg8-dev
  liblapack-dev liblzma-dev libncurses-dev libncurses5-dev libpcre16-3
  libpcre2-dev libpcre2-posix3 libpcre3-dev libpcre32-3 libpcrecpp0v5
  libpng-dev libpng-tools libreadline-dev libtk8.6 pkg-config r-base-core
  r-base-html r-cran-boot r-cran-class r-cran-cluster r-cran-codetools
  r-cran-foreign r-cran-kernsmooth r-cran-lattice r-cran-mass r-cran-matrix
  r-cran-mgcv r-cran-nlme r-cran-nnet r-cran-rpart r-cran-spatial
  r-cran-survival r-doc-html r-recommended
Suggested packages:
  gfortran-multilib gfortran-doc gfortran-11-multilib gfortran-11-doc
  libcoarrays-dev liblapack-doc icu-doc liblzma-doc ncurses-doc readline-doc
  tk8.6 elpa-ess r-doc-info | r-doc-pdf r-mathlib texlive-base
  texlive-latex-base texlive-plain-generic texlive-fonts-recommended
  texlive-fonts-extra texlive-extra-utils texlive-latex-recommended
  texlive-latex-extra texinfo
...
  3. Check version.

R --version
R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.
  4. Type R and press Enter to verify that R has been installed.

Using the R command without sudo creates a personal library for your user. To install packages available to every user on the system, run the R command as root by typing sudo -i R.

R
  5. Type q() to exit the R console.

q()
  6. Install missing dependency.

wget http://ftp.osuosl.org/pub/ubuntu/pool/main/i/icu/libicu70_70.1-2_amd64.deb
sudo dpkg -i libicu70_70.1-2_amd64.deb
  7. Reboot.


rJava

  1. Check whether Java is installed. If it is, skip to step 4.

java --version
  2. Install the Java Runtime Environment (JRE).

sudo apt-get install -y default-jre
  3. Install the Java Development Kit (JDK).

sudo apt-get install -y default-jdk
  4. Update where R expects to find various Java files.

export JAVA_LIBS="$JAVA_LIBS -ldl"
sudo R CMD javareconf
  5. In an R console, install the rJava package.

R
install.packages('rJava')
  6. Check that rJava has installed successfully.

system.file(package="rJava")
> system.file(package="rJava")
[1] "/home/pentaho/R/x86_64-pc-linux-gnu-library/4.1/rJava"
>

randomForest

The random forest algorithm can be used to solve regression or classification problems. It is made up of a collection of decision trees, and each tree in the ensemble is built from a data sample drawn with replacement from the training set, called the bootstrap sample.
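
To make "drawn with replacement" concrete, here is a small Python sketch of the sampling step only. It does not reproduce how the randomForest package works internally; `bootstrap_sample` and the ten-row stand-in training set are illustrative assumptions.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement, so items can repeat."""
    return [rng.choice(rows) for _ in rows]


if __name__ == "__main__":
    training_set = list(range(10))  # stand-in for 10 training rows
    rng = random.Random(42)         # fixed seed so repeated runs match
    for tree in range(3):
        sample = bootstrap_sample(training_set, rng)
        # Rows never drawn for this tree are often called "out-of-bag".
        unused = sorted(set(training_set) - set(sample))
        print(f"tree {tree}: sample={sample} unused={unused}")
```

Each tree sees a sample the same size as the training set, but some rows appear more than once and others not at all, which is what makes the trees differ from one another.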

  1. Start an R console.

R
  2. Install the randomForest package:

install.packages('randomForest')
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/randomForest_4.7-1.1.tar.gz'
Content type 'application/x-gzip' length 80886 bytes (78 KB)
==================================================
downloaded 78 KB

* installing *source* package ‘randomForest’ ...
** package ‘randomForest’ successfully unpacked and MD5 sums checked
** using staged installation
** libs
...
  3. Type q() to quit the R console.

  4. Click Yes to close the workspace image.

  5. Close R.


RStudio

RStudio is an integrated development environment (IDE) comprising a set of tools built to help you be more productive with R and Python.

  1. Download the RStudio package.

wget https://download1.rstudio.org/electron/
  2. Install the package.

sudo apt install -f ./rstudio-2022.12.0-353-amd64.deb
  3. Once installed, launch RStudio from a terminal.

rstudio

Set Environmental Variables

You can use R to find the path to set for each environment variable.

  • R_HOME: Path to the root directory of your R installation. Enter Sys.getenv("R_HOME") in the R console to get the path.

  • R_LIBS_USER: Path to the directory where R installs your packages. Enter Sys.getenv("R_LIBS_USER") in the R console to get the path.

  • LD_LIBRARY_PATH: Path used to load shared libraries, in this case libjri.so.

  • PATH: Append the directory that contains the R executable to the PATH variable.

  1. In an R console, print the paths.

R
Sys.getenv("R_HOME")
Sys.getenv("R_LIBS_USER")
  2. Edit /etc/environment.

cd
sudo nano /etc/environment
  3. Copy and paste the values, substituting the paths reported on your system.

# R variables
R_HOME=/usr/lib/R
LD_LIBRARY_PATH=/usr/lib/R/site-library/rJava/jri/
R_LIBS_USER=/home/pentaho/R/x86_64-pc-linux-gnu-library/4.1
  4. Ensure the path to R/bin is added to PATH.

PATH="/usr/local/sbin:/usr/local/bin: ....:/usr/lib/R/bin"
  5. Save.

CTRL + O
Enter
CTRL + X
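
After saving, the file can be checked for the four variables described above. The sketch below assumes simple KEY=VALUE lines with optional double quotes, as in the example values; `parse_environment` and `missing_vars` are hypothetical helper names.

```python
REQUIRED_VARS = ("R_HOME", "R_LIBS_USER", "LD_LIBRARY_PATH", "PATH")


def parse_environment(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def missing_vars(text, required=REQUIRED_VARS):
    """Return the required variable names that are absent or empty."""
    values = parse_environment(text)
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    try:
        with open("/etc/environment") as handle:
            missing = missing_vars(handle.read())
        print("Missing:", ", ".join(missing) if missing else "none")
    except OSError as error:
        print("Could not read /etc/environment:", error)
```

Changes to /etc/environment are normally picked up at the next login, so the running shell may still show the old values.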

libjri.so

In the rJava directory, there is a libjri.so file that needs to be copied into the libswt directory of Spoon.

  1. Copy libjri.so to the PDI libswt/linux/x86_64 directory.

cd
cd ~/R/x86_64-pc-linux-gnu-library/4.1/rJava/jri
sudo cp libjri.so ~/Pentaho/design-tools/data-integration/libswt/linux/x86_64
  2. Reboot.
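
To double-check that the copy landed where Spoon will look for it, a check along these lines could be used. `has_libjri` is a hypothetical helper, and the two directories are the ones from the copy step above; adjust them if your installation differs.

```python
from pathlib import Path


def has_libjri(directory):
    """Return True if libjri.so exists as a regular file inside `directory`."""
    return (Path(directory).expanduser() / "libjri.so").is_file()


if __name__ == "__main__":
    # Source and destination directories as used in the copy step.
    for directory in (
        "~/R/x86_64-pc-linux-gnu-library/4.1/rJava/jri",
        "~/Pentaho/design-tools/data-integration/libswt/linux/x86_64",
    ):
        print(directory, "->", "found" if has_libjri(directory) else "missing")
```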


Test - R

  1. Start Pentaho Data Integration.

cd
cd ~/Pentaho/design-tools/data-integration
./spoon.sh
  2. Create the following transformation.

  3. Copy and paste the following R script into the R Executor step.

library(datasets)
iris
  4. Click on the 'Test Script' button.

For RStudio, visit the RStudio downloads page to grab the latest release.