Prerequisite Tasks
Configure Colab & Data Integration for ML ..
Last updated
You will need to complete the following prerequisites:
Install Python
Create a Google Colab account
Install R (and, optionally, RStudio)
Configure Pentaho Data Integration with R
Colab is a Python development environment, based on Jupyter Notebooks, that runs in the browser using Google Cloud.
It provides a runtime, fully configured for deep learning libraries, such as Keras, TensorFlow, PyTorch, and OpenCV.
If you haven't already, sign up for a free account.
The following prerequisite steps configure your environment to run ML data pipelines in Pentaho Data Integration.
This section is for reference only.
The following tasks configure Pentaho Data Integration in a Linux environment.
Python
Make sure all installed packages are up to date.
sudo apt update && sudo apt upgrade -y
Check to see if Python is installed.
python3 --version
Python 3.10.12
Only proceed to update your Python to the latest version if required.
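The same check can be made from inside Python, which helps when a notebook or pipeline step runs a different interpreter than your shell. This is a minimal sketch: the helper name `meets_minimum` is hypothetical, and the 3.10 minimum is only an assumption taken from the version shown above.

```python
import sys


def meets_minimum(version_info=sys.version_info, minimum=(3, 10)):
    """Return True when the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Report the interpreter that is actually executing this script.
print("Running Python", ".".join(str(part) for part in sys.version_info[:3]))
print("Meets assumed minimum (3, 10):", meets_minimum())
```

If the second line prints False, continue with the upgrade steps; otherwise they can be skipped.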
Install dependencies.
sudo apt install dirmngr ca-certificates software-properties-common apt-transport-https -y
Import key for PPA deadsnakes.
sudo gpg --no-default-keyring --keyring /usr/share/keyrings/deadsnakes.gpg --keyserver keyserver.ubuntu.com --recv-keys F23C5A6CF475977595C89F51BA6932366A755776
Add Repository.
echo "deb [signed-by=/usr/share/keyrings/deadsnakes.gpg] https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/pythonppa-deadsnakes.list
Refresh the package cache, then list the available Python 3.1x versions.
sudo apt-get update && apt-cache search python3.1
Install latest version.
sudo apt install python3.12-full -y
Create symlink.
sudo ln -s /usr/bin/python3.12 /usr/bin/python
python --version
If you need multiple versions of Python on your system, you can choose which one is the default.
In this environment the default version of Python has been set to 3.10, as required by Apache Airflow.
List the installed Python versions:
ls -ls /usr/bin/python*
Register each Python version as an alternative (the trailing number is its priority):
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.12 2
Then set the required version:
sudo update-alternatives --config python
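To confirm which binary the `python` command now resolves to, you can follow the symlink chain from Python itself. The sketch below is illustrative; the helper name `resolve_interpreter` is hypothetical, and it only inspects whatever is on your PATH.

```python
import os
import shutil


def resolve_interpreter(name="python"):
    """Return the real path behind `name` on PATH, or None if it is not found."""
    found = shutil.which(name)
    if found is None:
        return None
    # Follow symlinks, for example a link managed by update-alternatives.
    return os.path.realpath(found)


for candidate in ("python", "python3"):
    print(candidate, "->", resolve_interpreter(candidate))
```

A result of None means that command name is not on PATH for the current user.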
The following libraries need to be installed:
pandas
matplotlib
py4j
numpy
wheel
scikit-learn
TPOT
Ensure pip is installed.
sudo apt install python3-pip
Install ML libraries.
pip install py4j
pip install scikit-learn
pip install tpot
Defaulting to user installation because normal site-packages is not writeable
Collecting py4j
Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.5/200.5 KB 8.0 MB/s eta 0:00:00
Installing collected packages: py4j
Successfully installed py4j-0.10.9.7
Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
Downloading scikit_learn-1.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.4/13.4 MB 60.7 MB/s eta 0:00:00
Collecting numpy>=1.19.5
Downloading numpy-2.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.5/19.5 MB 54.1 MB/s eta 0:00:00
Collecting scipy>=1.6.0
Downloading scipy-1.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (41.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41.1/41.1 MB 37.2 MB/s eta 0:00:00
Collecting joblib>=1.2.0
Downloading joblib-1.4.2-py3-none-any.whl (301 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 301.8/301.8 KB 57.4 MB/s eta 0:00:00
...
pip install numpy
pip install matplotlib
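Once the installs finish, a short script can report any library from the list above that the active interpreter still cannot find. This is a sketch rather than an official check; the helper name `find_missing` is hypothetical. Note that the import name can differ from the pip name (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Maps each pip package name to the module name used in `import` statements.
REQUIRED = {
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "py4j": "py4j",
    "numpy": "numpy",
    "wheel": "wheel",
    "scikit-learn": "sklearn",
    "TPOT": "tpot",
}


def find_missing(required=REQUIRED):
    """Return the pip names whose import module cannot be located."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = find_missing()
print("Missing packages:", ", ".join(missing) if missing else "none")
```

Run it with the same interpreter that your pipelines will use, since packages installed for one Python version are not visible to another.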
R is a language and environment for statistical computing and graphics.
Update APT packages.
sudo apt update && sudo apt upgrade
Install the R base package and its dependencies.
sudo apt install r-base r-base-dev -y
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
bzip2-doc gfortran gfortran-11 icu-devtools libblas-dev libbz2-dev
libgfortran-11-dev libicu-dev libjpeg-dev libjpeg-turbo8-dev libjpeg8-dev
liblapack-dev liblzma-dev libncurses-dev libncurses5-dev libpcre16-3
libpcre2-dev libpcre2-posix3 libpcre3-dev libpcre32-3 libpcrecpp0v5
libpng-dev libpng-tools libreadline-dev libtk8.6 pkg-config r-base-core
r-base-html r-cran-boot r-cran-class r-cran-cluster r-cran-codetools
r-cran-foreign r-cran-kernsmooth r-cran-lattice r-cran-mass r-cran-matrix
r-cran-mgcv r-cran-nlme r-cran-nnet r-cran-rpart r-cran-spatial
r-cran-survival r-doc-html r-recommended
Suggested packages:
gfortran-multilib gfortran-doc gfortran-11-multilib gfortran-11-doc
libcoarrays-dev liblapack-doc icu-doc liblzma-doc ncurses-doc readline-doc
tk8.6 elpa-ess r-doc-info | r-doc-pdf r-mathlib texlive-base
texlive-latex-base texlive-plain-generic texlive-fonts-recommended
texlive-fonts-extra texlive-extra-utils texlive-latex-recommended
texlive-latex-extra texinfo
...
Check version.
R --version
R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.
Type R and press Enter to verify that R has been installed.
Using the R command without sudo creates a personal library for your user. To install packages available to every user on the system, run the R command as root by typing sudo -i R.
R
Type q() to exit the R console.
q()
Install missing dependency.
wget http://ftp.osuosl.org/pub/ubuntu/pool/main/i/icu/libicu70_70.1-2_amd64.deb
sudo dpkg -i libicu70_70.1-2_amd64.deb
Reboot.
Check whether Java is installed. If it is, skip to step 4 (updating where R expects to find the Java files).
java --version
Install the Java Runtime Environment (JRE).
sudo apt-get install -y default-jre
Install the Java Development Kit (JDK).
sudo apt-get install -y default-jdk
Update where R expects to find various Java files.
export JAVA_LIBS="$JAVA_LIBS -ldl"
sudo R CMD javareconf
In an R terminal:
R
install.packages('rJava')
Check rJava has successfully installed.
system.file(package="rJava")
> system.file(package="rJava")
[1] "/home/pentaho/R/x86_64-pc-linux-gnu-library/4.1/rJava"
>
Random forest can be used to solve regression or classification problems. The algorithm is made up of a collection of decision trees, and each tree in the ensemble is trained on a data sample drawn from the training set with replacement, called the bootstrap sample.
In an R terminal:
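To make "drawn with replacement" concrete, here is a small sketch of bootstrap sampling using only the Python standard library. It is not how the randomForest package is implemented; the helper name `bootstrap_sample` and the toy training set are assumptions for illustration.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from `rows` with replacement."""
    return rng.choices(rows, k=len(rows))


rng = random.Random(42)          # fixed seed so repeated runs match
training_set = list(range(10))   # stand-in for ten training rows

# Each tree in the ensemble would be trained on its own bootstrap sample.
for tree in range(3):
    sample = bootstrap_sample(training_set, rng)
    print(f"tree {tree}: {sorted(sample)}")
```

Because rows are drawn with replacement, a sample typically repeats some rows and omits others, which is what makes the individual trees differ from one another.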
R
Install randomForest package:
install.packages('randomForest')
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/randomForest_4.7-1.1.tar.gz'
Content type 'application/x-gzip' length 80886 bytes (78 KB)
==================================================
downloaded 78 KB
* installing *source* package ‘randomForest’ ...
** package ‘randomForest’ successfully unpacked and MD5 sums checked
** using staged installation
** libs
...
Type q() to quit the R console.
Click Yes to close the workspace image.
Close R.
RStudio is an integrated development environment (IDE) comprising a set of tools built to help you be more productive with R and Python.
Visit the RStudio downloads page to grab the latest release.
wget https://download1.rstudio.org/electron/
Install the package.
sudo apt install -f ./rstudio-2022.12.0-353-amd64.deb
Once installed, launch RStudio from a terminal.
rstudio
You can use R to find the path to set for each environment variable.
R_HOME
Path to the root directory of your R installation. Enter Sys.getenv("R_HOME") in the R console to get the path.
R_LIBS_USER
Path to the directory where R installs your packages. Enter Sys.getenv("R_LIBS_USER") in the R console to get the path.
LD_LIBRARY_PATH
Used to locate shared libraries at load time, in this case libjri.so.
PATH
Append the PATH variable with the directory that contains the R executable.
In an R terminal:
R
Sys.getenv("R_HOME")
Sys.getenv("R_LIBS_USER")
Edit the /etc/environment file.
cd
sudo nano /etc/environment
Copy and paste the values, replacing each path with the one reported on your system.
# R variables
R_HOME=/usr/lib/R
LD_LIBRARY_PATH=/usr/lib/R/site-library/rJava/jri/
R_LIBS_USER=/home/pentaho/R/x86_64-pc-linux-gnu-library/4.1
Ensure the path to the R/bin is added to PATH.
PATH="/usr/local/sbin:/usr/local/bin: ....:/usr/lib/R/bin"
Save.
CTRL + O
Enter
CTRL + X
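Before rebooting, you may want to confirm that the file contains every variable described above. The sketch below parses simple KEY=VALUE lines; the helper names `parse_environment` and `missing_r_variables` are hypothetical, and the sample text simply mirrors the values shown earlier.

```python
def parse_environment(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def missing_r_variables(values, required=("R_HOME", "R_LIBS_USER", "LD_LIBRARY_PATH", "PATH")):
    """Return the required variable names that are absent or empty."""
    return [name for name in required if not values.get(name)]


sample = """# R variables
R_HOME=/usr/lib/R
LD_LIBRARY_PATH=/usr/lib/R/site-library/rJava/jri/
R_LIBS_USER=/home/pentaho/R/x86_64-pc-linux-gnu-library/4.1
"""
print("Missing:", missing_r_variables(parse_environment(sample)))
```

To check the real file instead of the sample, read its contents (for example with `pathlib.Path("/etc/environment").read_text()`) and pass them to `parse_environment`.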
The rJava directory contains a libjri.so file that must be copied into Spoon's libswt directory.
Copy libjri.so to the PDI ../libswt/linux/x86_64 directory.
cd
cd ~/R/x86_64-pc-linux-gnu-library/4.1/rJava/jri
sudo cp libjri.so ~/Pentaho/design-tools/data-integration/libswt/linux/x86_64
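A quick way to confirm the copy succeeded is to compare the source and destination files. This is a minimal sketch assuming the two directories used in the commands above; the helper name `jri_library_status` is hypothetical, and comparing file sizes is only a rough sanity check, not a checksum.

```python
from pathlib import Path


def jri_library_status(rjava_jri_dir, pdi_libswt_dir, name="libjri.so"):
    """Describe whether `name` exists in both directories with the same size."""
    source = Path(rjava_jri_dir).expanduser() / name
    target = Path(pdi_libswt_dir).expanduser() / name
    if not source.is_file():
        return "source missing"
    if not target.is_file():
        return "not copied"
    if source.stat().st_size != target.stat().st_size:
        return "size differs"
    return "ok"


# Paths taken from the steps above; adjust them to match your installation.
print(jri_library_status(
    "~/R/x86_64-pc-linux-gnu-library/4.1/rJava/jri",
    "~/Pentaho/design-tools/data-integration/libswt/linux/x86_64",
))
```

Anything other than "ok" suggests repeating the copy step before starting Spoon.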
Reboot.
Start Pentaho Data Integration.
cd
cd ~/Pentaho/design-tools/data-integration
./spoon.sh
Create the following transformation.
Copy and paste the following R script into the R Executor step.
library(datasets)
iris
Click on the 'Test Script' button.