Pentaho Data Integration

© Hitachi Vantara LLC 2025. All rights reserved. Hitachi is a trademark or registered trademark of Hitachi, Ltd. VSP is the trademark or registered trademark of Hitachi Vantara Corporation.


Jupyter Notebook

Jupyter notebooks are used for data science tasks such as exploratory data analysis (EDA), data cleaning and transformation, data visualization, statistical modeling, and machine learning.

The following workshop is designed for Pentaho Data Integration running in a Windows environment, as Pentaho Data Services are not available in the Linux version.

Linux users can run only the Transformation (modify it by removing the Data Services and writing the output as a CSV file). It's still worth taking a look at the Jupyter Notebook; look out for an update to the workshop that will load the CSV file.
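For Linux users taking the CSV route, a notebook cell along these lines could load the file. This is only a sketch: the file name `pdi_output.csv` and the helper `load_rows` are hypothetical, since the workshop does not yet define the CSV layout, and it uses Python's standard `csv` module rather than any particular data science library.

```python
import csv
from pathlib import Path


def load_rows(csv_path):
    """Read a CSV file with a header row into a list of dictionaries."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    # Hypothetical file name: use whatever path your Transformation writes to.
    path = Path("pdi_output.csv")
    if path.exists():
        rows = load_rows(path)
        print(f"Loaded {len(rows)} rows")
    else:
        print(f"{path} not found; run the Transformation first")
```

Every value is returned as a string, so numeric columns would still need converting before analysis.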

Data science solution development can be streamlined by leveraging the strengths of different developers in their optimal environments. Using Pentaho Data Integration (PDI) with Jupyter and Python enables efficient collaboration between data engineers and data scientists.

Data engineers use PDI for:

• Data preparation, blending, and cleansing

• Feature engineering and statistical analytics

• Easy scaling and migration to production

Data scientists use Jupyter/Python for:

• Model exploration, tuning, and training

• Focusing on core data science tasks

Benefits:

• Reduced time-to-market

• Improved solution quality

• Enhanced collaboration through easily shared PDI outputs

• Data scientists spend less time on data prep

This approach allows each team to work in their specialized environment while facilitating seamless integration of their efforts.


The following section is for reference only.

The required prerequisite steps have already been completed.


Prerequisites

  1. Update the system packages to the latest versions available.

sudo apt-get update -y && sudo apt-get upgrade -y
  2. Install Python 3 together with pip and the venv module.

sudo apt install python3 python3-pip python3-venv -y
  3. Check the installed version of Python.

python3 -V
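The same check can be made from inside Python, which also confirms that the `pip` and `venv` modules are visible to the interpreter. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is illustrative, and finding a module is only a rough indication that the matching system package is fully installed.

```python
import importlib.util
import sys


def missing_modules(names=("pip", "venv")):
    """Return the names of modules that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    missing = missing_modules()
    print("Missing modules:", ", ".join(missing) if missing else "none")
```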



Jupyter Python Env


  1. Create a Projects/JupyterNotebook directory.

mkdir -p ~/Projects/JupyterNotebook
  2. Create a virtual environment for the Jupyter Notebook application.

cd ~/Projects/JupyterNotebook
python3 -m venv jupyter-venv
  3. Activate the virtual environment.

source jupyter-venv/bin/activate
  4. After activation, the command prompt should look like this:

(jupyter-venv) pentaho@pentaho:~/Projects/JupyterNotebook$
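The prompt prefix is only a visual cue. To confirm from Python that the interpreter really belongs to a virtual environment, a check like the one below can be used. It is a small sketch built on the standard `sys.prefix` and `sys.base_prefix` attributes; the function name `in_virtualenv` is illustrative.

```python
import sys


def in_virtualenv():
    """True when the interpreter is running from a virtual environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Inside a virtual environment:", in_virtualenv())
```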

Install & Configure Jupyter Notebook

Jupyter Notebook can be installed with the pip3 command, which downloads the Jupyter packages and installs their dependencies.

  1. Ensure you're in the virtual environment.

(jupyter-venv) pentaho@pentaho:~/Projects/JupyterNotebook$
  2. Upgrade pip to the latest version.

pip install --upgrade pip
Requirement already satisfied: pip in /usr/lib/python3/dist-packages (22.0.2)
Collecting pip
  Downloading pip-24.2-py3-none-any.whl (1.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 17.0 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.2
    Not uninstalling pip at /usr/lib/python3/dist-packages, outside environment /usr
    Can't uninstall 'pip'. No files were found to uninstall.
Successfully installed pip-24.2
...
  3. Once the upgrade completes, install Jupyter.

pip3 install jupyter
Collecting jupyter
  Downloading jupyter-1.1.1-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting notebook (from jupyter)
  Downloading notebook-7.2.2-py3-none-any.whl.metadata (10 kB)
Collecting jupyter-console (from jupyter)
  Downloading jupyter_console-6.6.3-py3-none-any.whl.metadata (5.8 kB)
Collecting nbconvert (from jupyter)
  Downloading nbconvert-7.16.4-py3-none-any.whl.metadata (8.5 kB)
Collecting ipykernel (from jupyter)
  Downloading ipykernel-6.29.5-py3-none-any.whl.metadata (6.3 kB)
Collecting ipywidgets (from jupyter)
  Downloading ipywidgets-8.1.5-py3-none-any.whl.metadata (2.3 kB)
Collecting jupyterlab (from jupyter)
  Downloading jupyterlab-4.2.5-py3-none-any.whl.metadata (16 kB)
Collecting comm>=0.1.1 (from ipykernel->jupyter)
...
  4. Generate a config file.

jupyter notebook --generate-config
Writing default config to: /home/pentaho/.jupyter/jupyter_notebook_config.py
  5. Edit the file, uncomment the following settings, and set your IP address:

cd
cd ~/.jupyter
nano jupyter_notebook_config.py
...
c.NotebookApp.open_browser = True
...
  6. Save the file and exit nano.

CTRL + O
Enter
CTRL + X
  7. Check the config changes.

jupyter server --show-config
  8. Start Jupyter Notebook to make it accessible in the browser.

jupyter notebook
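To confirm the installation from Python rather than by reading the pip log, a short check like the one below could be run inside the activated environment. It is a sketch using the standard library's `importlib.metadata`; the package names are taken from the pip output shown earlier, and the helper name `installed_versions` is illustrative.

```python
from importlib import metadata


def installed_versions(packages=("jupyter", "notebook", "jupyterlab", "ipykernel")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

A value of `None` for any entry suggests the install step ran against a different interpreter than the one currently active.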

This workshop is Windows only because it relies on Pentaho Data Services.


Last updated 9 months ago
