GenAI

Generative artificial intelligence (GenAI) can create images, text, videos, and other media in response to prompts.


So how does it work?

  • The framework uses a Thread to maintain the context of a conversation.

  • Users send Messages to the Thread, which the Assistant then processes.

  • Each interaction is added to the Thread as a Message.

  • The framework maintains state across interactions, allowing for complex, multi-turn conversations.

  • The Assistant generates responses based on the conversation history and its capabilities.

  • Responses are generated asynchronously, allowing for handling of long-running tasks.

  • Assistants can work with uploaded files, analyzing and referencing them in responses.

  • Assistants can produce various types of output, including text, code, or structured data.

  • Developers can fine-tune the Assistant's behavior through detailed instructions and model selection.

The HTML Parser is a utility plugin for Pentaho Data Integration (PDI) that extracts desired text from HTML or XML files. It is useful for cleaning data for natural language processing tasks such as sentiment analysis and SEO keyword analysis.

  • Accepts input from both data streams and files

  • Supports parsing using XPath expressions or CSS selectors

  • Can process single files or multiple inputs from a stream

  • Compatible with local and virtual file systems

The plugin utilizes jsoup, a Java library that simplifies working with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS, and XPath selectors.

The step is located in the Input folder.

XPath (XML Path Language) is a query language for selecting nodes from an XML or HTML document. While jsoup doesn't natively support XPath, we can use a combination of jsoup and Java's built-in XPath capabilities to achieve this.
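
As a rough sketch of that combination (not the plugin's actual source - it simply assumes jsoup is on the classpath), the following Java snippet parses HTML with jsoup, converts it to a W3C DOM, and evaluates an XPath expression with the JDK's built-in engine:

import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

public class XPathSketch {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><div class='content'><p>Hello</p></div></body></html>";

        // jsoup tolerates messy, real-world HTML; W3CDom converts the result
        // into an org.w3c.dom.Document that the JDK's XPath engine understands.
        Document w3cDoc = new W3CDom().fromJsoup(Jsoup.parse(html));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "//div[@class='content']/p", w3cDoc, XPathConstants.NODESET);

        for (int i = 0; i < nodes.getLength(); i++) {
            System.out.println(nodes.item(i).getTextContent()); // prints: Hello
        }
    }
}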

Here's an overview of some common XPath syntax:

/ - Selects from the root node

// - Selects nodes anywhere in the document

. - Selects the current node

.. - Selects the parent of the current node

@ - Selects attributes

[] - Used for predicates (conditions)

Some examples:

  • //div - Selects all div elements in the document

  • //div[@class='content'] - Selects all div elements with class 'content'

  • //h1/text() - Selects the text content of all h1 elements

  • //div[@class='content']/p - Selects all p elements that are direct children of div elements with class 'content'

HTML Data Source

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>XPath Example Page</title>
</head>
<body>
    <header>
        <h1 id="main-title">Welcome to Our Website</h1>
        <nav>
            <ul>
                <li><a href="#home">Home</a></li>
                <li><a href="#about">About</a></li>
                <li><a href="#contact">Contact</a></li>
            </ul>
        </nav>
    </header>

    <main>
        <section id="featured-articles">
            <h2>Featured Articles</h2>
            <article>
                <h3>Article 1</h3>
                <p class="content">This is the content of article 1.</p>
                <span class="author">By John Doe</span>
            </article>
            <article>
                <h3>Article 2</h3>
                <p class="content">This is the content of article 2.</p>
                <span class="author">By Jane Smith</span>
            </article>
        </section>

        <section id="latest-news">
            <h2>Latest News</h2>
            <ul>
                <li>News item 1</li>
                <li>News item 2</li>
                <li>News item 3</li>
            </ul>
        </section>
    </main>

    <footer>
        <p>&copy; 2024 Our Website. All rights reserved.</p>
    </footer>
</body>
</html>

Transformation

Filepath

The data source is referenced in a path.

  1. Open the following transformation:

Windows

C:/Projects/genai/html/HTML Parser - Xpath.ktr

Linux

~/Projects/genai/html/HTML Parser - Xpath.ktr

  2. Double-click on the hp: html step and configure with the following settings:

Leaving the XPath field blank will result in all tags being removed and all the content returned.

3. RUN and preview the results.

These XPath queries will help you navigate and extract specific content from homepage.html:

Select the main title:

//h1[@id='main-title']

Select all navigation links:

//nav//a

Select all article titles (h3 elements within articles):

//article/h3

Select all paragraph content within articles:

//article/p[@class='content']

Select all author names:

//span[@class='author']

Select the latest news items:

//section[@id='latest-news']//li

Select the footer text:

//footer/p/text()

Select all section titles (h2 elements that are direct children of section elements):

//section/h2

Select the second article:

(//article)[2]

Select all elements with a class attribute:

//*[@class]

Filepath from stream

The data source is referenced as a filepath in a data stream field.

  1. Enable the hop between: dg: filepath from stream -> hp: parse html xpath.

  2. Disable the hops between: Data Grid -> hp: parse html xpath and dg: html from stream -> hp: parse html xpath.

  3. Double-click on the hp: html step and configure with the following settings:

  4. RUN and preview the results.


HTML from stream

The data source is referenced as <html> in a data stream field.

Pentaho's data streams often use binary fields to handle various types of data, including large text objects like HTML. Using a binary field ensures that the entire HTML content is treated as a single, uninterpreted chunk of bytes within the Pentaho pipeline.

Storing the HTML as binary data allows you to pass the raw content through various steps in your Pentaho transformation without Pentaho trying to interpret or modify the HTML prematurely.
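
As a minimal sketch (not the plugin's code) of what eventually happens downstream, the raw bytes in a binary field are only decoded back into an HTML string, using a known character set, at the point where a step actually needs to parse them:

import java.nio.charset.StandardCharsets;

public class BinaryFieldSketch {
    public static void main(String[] args) {
        // In the pipeline, a binary field carries the HTML as raw bytes.
        byte[] htmlBytes = "<html><body><p>Hello</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);

        // Only the consuming step decodes the bytes back into text.
        String html = new String(htmlBytes, StandardCharsets.UTF_8);
        System.out.println(html);
    }
}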

  1. Enable the hop between: dg: html from stream -> hp: parse html xpath.

  2. Disable the hops between: dg: filepath from stream -> hp: parse html xpath and Data Grid -> hp: parse html xpath.

  3. Double-click on the hp: html step and configure with the following settings:

  4. RUN and preview the results.

CSS selectors are powerful tools for targeting specific HTML elements, and they're used not only for styling but also for selecting elements when extracting data from HTML documents.

Below are some examples of the syntax used to extract HTML snippets:

Basic Selectors

a) Element Selector:

p       /* Selects all <p> elements */
div     /* Selects all <div> elements */

b) Class Selector:

.highlight   /* Selects elements with class="highlight" */
p.highlight  /* Selects <p> elements with class="highlight" */

c) ID Selector:

#header   /* Selects the element with id="header" */

d) Universal Selector:

*   /* Selects all elements */

Combinators

a) Descendant Selector (space):

div p   /* Selects all <p> elements inside <div> elements */

b) Child Selector (>):

ul > li   /* Selects all <li> elements that are direct children of <ul> */

c) Adjacent Sibling Selector (+):

h1 + p   /* Selects the first <p> element immediately after an <h1> */

d) General Sibling Selector (~):

h1 ~ p   /* Selects all <p> elements that are siblings of <h1> */

Attribute Selectors

a) [attribute]:

[type]   /* Selects elements with a type attribute */

b) [attribute="value"]:

[type="text"]   /* Selects elements with type="text" */

c) [attribute~="value"]:

[class~="highlight"]   /* Selects elements with class containing "highlight" as a whole word */

d) [attribute^="value"]:

[href^="https"]   /* Selects elements with href starting with "https" */

e) [attribute$="value"]:

[href$=".pdf"]   /* Selects elements with href ending with ".pdf" */

f) [attribute*="value"]:

[href*="example"]   /* Selects elements with href containing "example" */

Pseudo-classes

a:first-child     /* Selects every <a> element that is the first child of its parent */
p:last-child      /* Selects every <p> element that is the last child of its parent */
li:nth-child(2n)  /* Selects every even <li> element */
input:not(:checked)  /* Selects all unchecked input elements */

Combining Selectors

div.highlight, p.important   /* Selects <div> with class "highlight" and <p> with class "important" */
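
To make the CSS route concrete, here is a minimal jsoup sketch (assuming jsoup on the classpath; it mirrors what the plugin does rather than quoting its source) that applies a CSS selector and prints the matched text:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CssSelectSketch {
    public static void main(String[] args) {
        String html = "<div class='product'><h3>Smartphone X</h3>"
                + "<span class='price'>$999</span></div>";

        Document doc = Jsoup.parse(html);

        // select() takes a CSS selector and returns all matching elements.
        for (Element price : doc.select(".product .price")) {
            System.out.println(price.text()); // prints: $999
        }
    }
}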

HTML Data Source

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>TechGadgets - Your Electronics Store</title>
</head>
<body>
    <header id="main-header">
        <h1>TechGadgets</h1>
        <nav>
            <ul>
                <li><a href="#home">Home</a></li>
                <li><a href="#products">Products</a></li>
                <li><a href="#about">About</a></li>
                <li><a href="#contact">Contact</a></li>
            </ul>
        </nav>
    </header>

    <main>
        <section id="featured-products">
            <h2>Featured Products</h2>
            <div class="product">
                <h3>Smartphone X</h3>
                <p class="description">The latest smartphone with advanced features.</p>
                <span class="price">$999</span>
            </div>
            <div class="product">
                <h3>Laptop Pro</h3>
                <p class="description">Powerful laptop for professionals.</p>
                <span class="price">$1499</span>
            </div>
        </section>

        <section id="about">
            <h2>About Us</h2>
            <p>TechGadgets is your one-stop shop for all electronics needs.</p>
        </section>

        <section id="newsletter">
            <h2>Subscribe to Our Newsletter</h2>
            <form>
                <input type="email" name="email" placeholder="Enter your email">
                <button type="submit">Subscribe</button>
            </form>
        </section>
    </main>

    <footer>
        <p>&copy; 2024 TechGadgets. All rights reserved.</p>
    </footer>
</body>
</html>

Transformation

Filepath

The data source is referenced in a path.

  1. Open the following transformation:

Windows

C:/Projects/genai/html/HTML Parser - CSS.ktr

Linux

~/Projects/genai/html/HTML Parser - CSS.ktr

  2. Double-click on the hp: html step and configure with the following settings:

Leaving the CSS field blank will result in all tags being removed and all the content returned.

  3. RUN and preview the results.

These CSS queries will help you navigate and extract specific content from landingpage.html:

Select the main title:

h1

Select all navigation links:

nav a

Select all product titles:

.product h3

Select all product descriptions:

.product .description

Select all product prices:

.product .price

Select the "About Us" section:

#about

Select the newsletter form:

#newsletter form

Select all section titles (h2 elements):

main h2

Select the footer text:

footer p

Select all elements with a class of "product":

.product

Filepath from stream

The data source is referenced as a filepath in a data stream field.

  1. Enable the hop between: dg: filepath from stream -> hp: parse html css.

  2. Disable the hops between: Data grid -> hp: parse html css and dg: html from stream -> hp: parse html css.

  3. Double-click on the hp: html step and configure with the following settings:

  4. RUN and preview the results.


HTML from stream

The data source is referenced as <html> in a data stream field.

Pentaho's data streams often use binary fields to handle various types of data, including large text objects like HTML. Using a binary field ensures that the entire HTML content is treated as a single, uninterpreted chunk of bytes within the Pentaho pipeline.

Storing the HTML as binary data allows you to pass the raw content through various steps in your Pentaho transformation without Pentaho trying to interpret or modify the HTML prematurely.

  1. Enable the hop between: dg: html from stream -> hp: parse html css.

  2. Disable the hops between: dg: filepath from stream -> hp: parse html css and Data grid -> hp: parse html css.

  3. Double-click on the hp: html step and configure with the following settings:

  4. RUN and preview the results.

Apache Tika is a content analysis toolkit that extracts text, metadata, and language from a variety of file formats. It's commonly used in data processing to prepare data for further analysis.

  • Supports a wide range of document formats including PDFs, Word documents, and HTML files.

  • Extracts metadata such as author, title, creation date, and language.

  • Can be integrated into larger data processing pipelines for automated content extraction.

  • Facilitates full-text search indexing and content classification.

The step is located in the Input folder.
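
For intuition, the Tika facade reduces basic extraction to a few lines. A minimal Java sketch (assuming tika-core and tika-parsers are on the classpath; report.pdf is a hypothetical file used only for illustration):

import java.io.File;
import org.apache.tika.Tika;

public class TikaSketch {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        File file = new File("report.pdf"); // any supported format

        // Auto-detects the file type, then extracts its plain text.
        System.out.println(tika.detect(file));        // e.g. application/pdf
        System.out.println(tika.parseToString(file)); // the extracted text
    }
}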

The data source is a Word document.

The document type is referenced in a data stream field.

Word Document

The old oak tree stood sentinel at the edge of the meadow, its gnarled branches reaching skyward like ancient fingers grasping at clouds. Generations had passed beneath its sprawling canopy, each leaving whispered secrets in its bark. A gentle breeze rustled through its leaves, carrying the scent of wildflowers and distant rain. Nearby, a babbling brook wound its way through moss-covered stones, its crystalline waters reflecting the dappled sunlight filtering through the forest canopy. 
A family of deer cautiously approached the water's edge, their ears twitching at every sound. In the distance, a woodpecker's rhythmic tapping echoed through the trees, nature's own percussion. As the sun began its slow descent, the meadow came alive with the soft glow of fireflies, their bioluminescent dance a magical display against the deepening twilight. A lone owl hooted softly, heralding the arrival of night and all its mysterious inhabitants. The air grew cooler, and dew began to form on blades of grass, each droplet a miniature world reflecting the stars above. 
In this timeless moment, the boundary between earth and sky seemed to blur, and one could almost believe in the old tales of fairies and woodland spirits. As darkness settled fully over the land, the oak tree stood as it always had, a silent guardian of the forest's secrets, its roots deep in the earth, its crown brushing the heavens.
  1. Open the following transformation:

Windows

C:/Projects/genai/tika/Read Unstructured Document- Word Doc.ktr

Linux

~/Projects/genai/tika/Read Unstructured Document- Word Doc.ktr

  2. Double-click on the Read Unstructured Document step and configure with the following settings:

  3. RUN and preview the results.

The data source is a password-protected PDF document.

The document type is referenced in a data stream field.
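
Under the hood, Tika resolves document passwords through a PasswordProvider registered on the ParseContext. A hedged sketch of that mechanism (assuming tika-parsers on the classpath; protected.pdf is hypothetical, and qweasd is the workshop password used later in this section):

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.PasswordProvider;
import org.apache.tika.sax.BodyContentHandler;

public class ProtectedPdfSketch {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // no size limit
        Metadata metadata = new Metadata();

        // Register the password Tika should use when it meets an encrypted file.
        ParseContext context = new ParseContext();
        context.set(PasswordProvider.class, md -> "qweasd");

        try (InputStream stream = new FileInputStream("protected.pdf")) {
            parser.parse(stream, handler, metadata, context);
        }
        System.out.println(handler.toString()); // the decrypted document's text
    }
}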

  1. Open the following transformation:

Windows

C:/Projects/genai/tika/Read Unstructured Document- Password PDF.ktr

Linux

~/Projects/genai/tika/Read Unstructured Document- Password PDF.ktr

  2. Double-click on the Read Unstructured Document step and configure with the following settings:

  3. RUN and preview the results.

Multiple documents are referenced as the data source.

The Javascript step is used to add the PDF password.

  1. Open the following transformation:

Windows

C:/Projects/genai/tika/Read Unstructured Document- Stream Multiple Files.ktr

Linux

~/Projects/genai/tika/Read Unstructured Document- Stream Multiple Files.ktr

  2. Double-click on the Javascript: Add password column step.

A data stream field, filepass, is associated with the password qweasd.

  3. Double-click on the Read Unstructured Document step and configure with the following settings:

  4. RUN and preview the results.

Base64 is a binary-to-text encoding scheme that represents binary data in an ASCII string format. It's widely used for transmitting data over media that are designed for textual data.

The BASE64 step is located in the Transform folder.

ASCII uses 8 bits to represent individual characters, but Base64 uses 6 bits. Therefore, the binary needs to be broken up into 6-bit chunks.

Consider the string Hi\n, where the \n represents a newline. The first step in the encoding process is to obtain the binary representation of each ASCII character. This can be done by looking up the values in an ASCII-to-binary conversion table.

Finally, these 6-bit values can be converted into the appropriate printable characters by using a Base64 table.

Since Base64 uses 24-bit sequences, padding is needed when the original binary cannot be divided into a 24-bit sequence. You have probably seen this type of padding before, represented by printed equal signs (=). For example, Hi without a newline is represented by only two 8-bit ASCII characters (for a total of 16 bits). Padding is removed by the Base64 encoding scheme when data is decoded.

Base64 is not necessarily used to protect information. Its advantage is that it can convert almost any sequence of bytes into human-readable ASCII.
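
A quick sketch with the JDK's built-in codec demonstrates the padding behaviour described above:

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Sketch {
    public static void main(String[] args) {
        Base64.Encoder encoder = Base64.getEncoder();

        // "Hi\n" is 24 bits -> encodes cleanly, no padding needed.
        System.out.println(encoder.encodeToString("Hi\n".getBytes(StandardCharsets.US_ASCII))); // SGkK

        // "Hi" is only 16 bits -> one '=' pads it out to a 24-bit block.
        System.out.println(encoder.encodeToString("Hi".getBytes(StandardCharsets.US_ASCII)));   // SGk=

        // Decoding strips the padding and restores the original bytes.
        byte[] decoded = Base64.getDecoder().decode("SGk=");
        System.out.println(new String(decoded, StandardCharsets.US_ASCII)); // Hi
    }
}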

  1. Open the following transformation:

Windows

C:/Projects/genai/base64/Base64 Encode.ktr

Linux

~/Projects/genai/base64/Base64 Encode.ktr

  2. Double-click on the Base64 Encoder step and configure with the following settings:

  3. Check the Select values step.

  4. RUN and preview the results.


RAW Text

The Data grid holds the RAW Text that will be encoded.

  1. Enable the hop between: Data grid - Raw Text Input -> Base64 Encoder.

  2. Disable the hops between: Data grid -> Base64 Encoder and Get file names - Images -> Base64 Encoder.

  3. Double-click on the Data grid - Raw Text Input step and open the Data tab.

  4. Check the Select values step.

  5. RUN and preview the results.


Multiple Files

Encode multiple files.

  1. Enable the hop between: Get file names - Images -> Base64 Encoder

  2. Disable the hops between: Data grid - Raw Text Input -> Base64 Encoder and Data grid -> Base64 Encoder.

  3. Double-click on the Get File names - Images step.

  4. Double-click on the Base64 Encoder step and configure with the following settings:

  5. Check the Select values step.

  6. RUN and preview the results.

Pentaho GenAI is an extension of Pentaho Data Integration that incorporates generative AI capabilities. It aims to enhance data workflows by leveraging large language models and other AI technologies.

OpenAI released an API platform that enables the creation of 'assistants' that can perform a wide range of tasks:

Natural Language Querying: Users can ask questions or provide prompts to large language models (LLMs) like OpenAI and Azure OpenAI, allowing for natural language interaction with data and systems.

Document Analysis: The plugin supports attaching documents for LLMs to process, enabling users to analyze and extract insights from text files and related documents.

Sentiment Analysis: The plugin can, for example, be used to determine the sentiment of text data, such as tweets.

Log Analysis: Process and analyze log files, potentially for troubleshooting or identifying patterns.

Structured Data Generation: The plugin supports generating responses in both text and JSON formats, allowing for the creation of structured data from natural language inputs.

Data Extraction and Transformation: The plugin can be used within Pentaho Data Integration (PDI) workflows, assisting in extracting and transforming data as part of larger ETL processes.

Question Answering: The plugin supports using document embeddings to efficiently answer multiple questions about one or more documents, making it useful for information retrieval and FAQ-style applications.

Prompt Engineering: Users can create structured templates and use PDI environment variables for dynamic prompt generation, allowing for flexible and customizable interactions with LLMs.

Moderation and Content Filtering: The plugin includes options for response moderation, which can be used to filter / flag potentially harmful or inappropriate content.

The moderations endpoint is a tool you can use to check whether text is potentially harmful. Developers can use it to identify content that might be harmful and take action, for instance by filtering it.

Resources

  • Pentaho Data Integration

  • jsoup: Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety

  • Apache Tika

  • OpenAI platform: https://platform.openai.com/docs/overview

  • OpenAI Assistants: https://platform.openai.com/docs/assistants/overview

  • Building OpenAI GPT Assistant Framework with Pentaho - a blog by Rishu Shrivastava (Tech Spaghetti)

Let's start exploring some simple chat scenarios:

  • Enter the prompt directly.

  • Pass the prompt and 'role' in data stream fields.

  • Configure the step to use your own OpenAI account details.

The step is located in the AI folder.

Enter Prompt

  1. Enable the hop between: Data Grid -> AI Chat.

  2. Disable the hop between: User Input -> AI Chat.

  3. Open the following transformation:

Windows

C:/Projects/genai/ai chat/.ktr

Linux

~/Projects/genai/html/HTML Parser - Xpath.ktr

  4. Double-click on the hp: html step and configure with the following settings:

  5. Double-click on the AI Chat step and configure with the following settings:

Run Instruction for LLM

Role-playing with Large Language Models (LLMs), such as ChatGPT, is an emerging field that explores the interaction between AI and creative, narrative-driven experiences. It leverages the advanced capabilities of LLMs to simulate human-like dialogue and human behavior within a role-playing context.

This process enables the AI to engage in dynamic conversations, mimic various characters, and respond to user inputs in a manner that aligns with a character's predefined traits and narrative context, drawing on computation over a large corpus of text data. Used this way, role-playing improves the model's effectiveness in tasks that require specific skills or knowledge, such as acting as a historian or providing historical facts and analyses.

  1. Click on the Model tab.

The temperature value ranges from 0 to 2, with lower values indicating greater determinism and higher values indicating more randomness.

  2. RUN and preview the result.


Prompt from Data Stream fields

You can send multiple questions to ChatGPT.

However, there are limits to the number of requests an LLM can accept. The response will fail if the threshold limit is reached.

  1. Disable the hop between: Data Grid -> AI Chat.

  2. Enable the hop between: User Input -> AI Chat.

  3. Double-click on the User Input step and open the Data tab.

  4. Double-click on the AI Chat step and configure with the following settings:

  5. RUN and preview the result.


Configure Model with your own account details

The 'pipeline' configuration is the same as in the previous scenario.

You will need to enter your own OpenAI API key.

  1. Double-click anywhere on the canvas to configure the Parameters.

Enter your own OpenAI Key.

  2. Double-click on the AI Chat step and configure with the following settings:

  3. RUN and preview the result. It should be the same as in the previous scenario!

RAG (Retrieval-Augmented Generation) is a technique in artificial intelligence that combines information retrieval with text generation. It's particularly useful for keeping AI systems updated without constant retraining and for providing responses grounded in specific, retrievable facts.

Non-persistent RAG

So .. in the data folder you will find a story about Charlie, the happy-go-lucky carrot who lives in Veggeville with his friends.

If you don't attach the document, the response will ask for more information, as the model can't place Charlie in any context.

  1. Enable the hop between: Data Grid -> AI Chat.

  2. Double-click on the AI Chat step and configure with the following settings:

  3. Click on the Embedding tab.

An embedding model converts words, phrases, or other data into numerical vectors. These vectors serve as a bridge between raw data and machine learning algorithms by creating meaningful, computable representations of complex information.
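
For intuition on what those vectors buy us: retrieval in a RAG setup typically ranks stored chunks by cosine similarity between the question's embedding and each chunk's embedding. A toy Java sketch (the three-dimensional vectors are invented for illustration; real embeddings have hundreds or thousands of dimensions):

public class CosineSimilaritySketch {
    // Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] question = {0.9, 0.1, 0.3};
        double[] carrotChunk = {0.8, 0.2, 0.4};  // a chunk about Charlie
        double[] weatherChunk = {0.1, 0.9, 0.2}; // an unrelated chunk

        // The chunk with the higher score is passed to the LLM as context.
        System.out.println(cosine(question, carrotChunk));  // ~0.98, close to 1
        System.out.println(cosine(question, weatherChunk)); // ~0.27, much lower
    }
}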

  4. RUN and preview the result.


Embedding - Write

Let's change the embedding store mode to WRITE to persist the results as a file:

openai-embedding-store.json

This file is particularly useful in RAG (Retrieval-Augmented Generation) systems, where quick access to embeddings is crucial for efficient information retrieval.

  1. Double-click on the AI Chat step and then on the Embedding tab.

  2. Configure with the following settings.

  3. RUN and check that the embedding has been stored.


Embedding - READ

Now that a Veggeville embedding has been created, we can ask questions leveraging the vector store: openai-embedding-store.json.

  1. Double-click on the AI Chat step and configure with the following settings:

Ensure you select the Attach Document(s) option - this enables the embedding options.

  2. Click on the Embedding tab and configure with the following settings:

  3. RUN and preview the result.

A prompt is essentially the input given to an AI model to elicit a desired output or behavior. It can range from simple questions to complex instructions or examples.

Prompt engineering is the art and science of crafting these inputs to optimize the AI's performance for specific tasks. This involves carefully selecting words, providing context, and structuring the prompt to guide the model towards producing the most accurate, relevant, and useful responses.

  1. Double-click anywhere on the canvas to set the parameters.

  2. Double-click on the Chat AI step and configure with the following settings:

  3. RUN and preview the result.


Prompt - Template

With a little prompt engineering, the response can populate a 'template'.

  1. Double-click on the Chat AI step and configure with the following settings:

Create a recipe for a ${DISHTYPE} with the following ingredients: ${INGREDIENTS}.

Structure your answer in the following way:
Recipe name: ... 
Description: ... 
Preparation time: ... 
Required ingredients: 
 - ...
 - ... 
Instructions: 
 - ...
 - ...
 Respond in JSON format. 
  2. RUN and preview the result.

Let's run through some use cases:

Sentiment Analysis - Determine the sentiment of a tweet: Positive, Neutral, or Negative.

Log Analysis - Analyze multiple log files to identify errors, then use AI Chat to suggest resolutions. The generated result is in JSON format, and the processed output is stored as a CSV file.

  1. Open the following transformation:

Windows

C:/Projects/genai/aichat/Usecase - Log Analysis.ktr

Linux

~/Projects/genai/aichat/Usecase - Log Analysis.ktr

  2. Double-click on the hp: html step and configure with the following settings:

The prompt has been engineered to analyze log files and identify errors. The errors are then resolved using AI Chat.

  3. Double-click on the AI Chat step to view the settings.

Message / Prompt

Analyze the log file from the stream and identify the issue. Once the issue is identified, respond with possible resolutions to fix the issue. Include the date (IssueDate) of when the issue occurred.
If no issue is found, then respond with Issue as "No Issues found" and Resolution as "No Resolution suggested. Log looks fine".
Reply with the answer in the JSON template below:
{
	"Issue" : "...",
	"IssueDate": "...",
	"Resolution" : "..."
}

Document

Determines the location of the data source:

File - Browse and enter the path to the data source.

Stream - the data (or reference) is being passed in a data stream field. In this workshop the paths to the log files are being passed from the previous step in the filename data stream field.


Model

  1. Click on the Model tab.

The temperature value ranges from 0 to 2, with lower values indicating greater determinism and higher values indicating more randomness.


Embedding

  1. Click on the Embedding tab.

Enter your embedding model and choose whether to create and persist the embedding store in a file or keep the default In-Memory option.


Response

  1. Click on the Response tab.

The response is held as a JSON object in the result field.

The response JSON object needs to be parsed to create our data stream fields.

  1. Double-click on the Process generated JSON result step.

  2. Click on the Fields tab.

Take a look at the result field (preview data in AI Chat step) to determine the structure of the JSON object / array.

This should reflect the structure of the prompt template.

 {
   "Issue": "Errors initializing Table output step and executing query job due to Simba driver limitations.",
   "IssueDate": "2023-09-07",
   "Resolution": "Use the GBQ Bulk Loader step instead of the regular Table output step to create the table and handle data inserts."
 }
...

JSON Notation

JSON Exp   Description                                                             Sample
$          Root object                                                             $ returns the whole JSON structure
.          Child operator; used to access different levels of the JSON structure   $..Issue returns the Issue
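
Assuming the result object shown above, the JSONPath expressions configured in the Fields tab would plausibly look like this (a sketch - verify against your actual preview):

$.Issue       - returns the Issue text
$.IssueDate   - returns the IssueDate value
$.Resolution  - returns the Resolution text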

  3. RUN and preview the result - Output: log-analysis.csv

The sentiment_analysis function analyzes the overall sentiment of the discussion. It considers the tone, the emotions conveyed by the language used, and the context in which words and phrases are used.

  1. Double-click on the AI Chat step.

Again, a bit of prompt engineering enables the tweets to be passed in from the Tweet data stream field.

  2. RUN and preview the result - Text file output.

One of the most common problems for large, high-growth businesses is dealing with increasing volumes and varieties of financial data - more specifically, extracting the data from PDF documents such as quarterly reports, balance sheets, bank statements and cash flow statements.

Without a solution to handle these data extraction tasks at scale, operations quickly become error-prone and time-consuming. This is why a growing number of organisations are now implementing AI data extraction tools.

In this use case we're going to extract sales data from PDF reports, using Pentaho Data Integration.

Review the main steps of the transformation:

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

  1. The previous step - Get file names - returns the paths to the PDFs.

  2. Double-click on the Read Unstructured Document step to view the settings.

Based on the filenames passed in the filename field, the PDF contents are extracted and associated with the pdf_file_contents data stream field.

  3. Double-click on the AI Chat step to view the settings.

Under the Message tab, a templated prompt gives the model instructions to extract the sales data from the pdf_file_contents data stream field, which holds the extracted PDF text.

  4. Click on the Model tab.

Based on parameters set in the transformation properties, the OpenAI model, API key, and temperature are set.

  5. Click on the Embedding tab.

Based on parameters set in the transformation properties, the embedding model is set.

  6. Click on the Response tab.

The Response is returned as a JSON object associated with the generated_response data stream field.

This is where we have to put our thinking caps on ..

In the generated_response data stream field, the SaleYear & SaleMonth sales data is defined as JSON objects, each with a SalesPerformanceByProduct array of:

  • ProductCategory

  • UnitSold

  • Revenue

This will have to be a two-stage process.

Stage 1 extracts the SaleYear & SaleMonth.

Record   SaleYear   SaleMonth
1        2024       August
2        2024       July

Stage 2 iterates through the SalesPerformanceByProduct array for each Stage 1 record.

So .. on the first iteration: SaleYear = 2024, SaleMonth = August.

ProductCategory      UnitSold   Revenue
Eco-Gear             1,500      $150,000
Smart Home Devices   1,200      $180,000
Fitness Equipment    960        $96,000
Accessories          1,200      $74,000

This is repeated for Record 2 ..

generated_response


Conclusion 

August 2024 was a positive month for Acme Corporation, marked by notable growth in sales and 
customer retention. However, addressing regional disparities and capitalizing on successful 
product lines will be crucial for sustaining this growth momentum in the coming months.	
{
  "SaleYear": "2024",
  "SaleMonth": "August",
  "SalesPerformanceByProduct": [
    {
      "ProductCategory": "Eco-Gear",
      "UnitSold": "1,500",
      "Revenue": "$150,000"
    },
    {
      "ProductCategory": "Smart Home Devices",
      "UnitSold": "1,200",
      "Revenue": "$180,000"
    },
    {
      "ProductCategory": "Fitness Equipment",
      "UnitSold": "960",
      "Revenue": "$96,000"
    },
    {
      "ProductCategory": "Accessories",
      "UnitSold": "1,200",
      "Revenue": "$74,000"
    }
  ]
}


Conclusion 

July 2024 was a solid month for Acme Corporation, characterized by successful product launches 
and moderate sales growth. While the overall performance was positive, addressing regional 
disparities and improving competitive positioning in certain product categories will be essential for 
sustaining growth in the coming months.	
{
  "SaleYear": "2024",
  "SaleMonth": "July",
  "SalesPerformanceByProduct": [
    {
      "ProductCategory": "Eco-Gear",
      "UnitSold": "1,250",
      "Revenue": "$125,000"
    },
    {
      "ProductCategory": "Smart Home Devices",
      "UnitSold": "1,100",
      "Revenue": "$165,000"
    },
    {
      "ProductCategory": "Fitness Equipment",
      "UnitSold": "900",
      "Revenue": "$105,000"
    },
    {
      "ProductCategory": "Accessories",
      "UnitSold": "1,250",
      "Revenue": "$60,000"
    }
  ]
}
...
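
To make the two-stage idea concrete, here is a hedged Java sketch using the Jackson library (an assumption for illustration - the workshop itself does this with PDI's JSON Input steps): stage 1 reads SaleYear and SaleMonth, stage 2 iterates the SalesPerformanceByProduct array, repeating the stage 1 values for each row.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class TwoStageParseSketch {
    public static void main(String[] args) throws Exception {
        String generatedResponse = "{\"SaleYear\":\"2024\",\"SaleMonth\":\"August\","
                + "\"SalesPerformanceByProduct\":[{\"ProductCategory\":\"Eco-Gear\","
                + "\"UnitSold\":\"1,500\",\"Revenue\":\"$150,000\"}]}";

        JsonNode root = new ObjectMapper().readTree(generatedResponse);

        // Stage 1: the top-level fields.
        String saleYear = root.get("SaleYear").asText();
        String saleMonth = root.get("SaleMonth").asText();

        // Stage 2: iterate the nested array, repeating the stage 1 values per row.
        for (JsonNode product : root.get("SalesPerformanceByProduct")) {
            System.out.printf("%s %s | %s | %s | %s%n",
                    saleYear, saleMonth,
                    product.get("ProductCategory").asText(),
                    product.get("UnitSold").asText(),
                    product.get("Revenue").asText());
        }
    }
}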

The Select values step can perform all the following actions on fields in the data stream:

  • Select values

  • Remove values

  • Rename values

  • Change data types

  • Configure length and precision of values

  1. Double-click on the Select values step.

  2. RUN and preview the result.


The Microsoft Excel Writer step writes incoming rows from PDI out to an MS Excel file and supports both the .xls and .xlsx file formats.

The .xls files use a binary format, which is better suited for simple content, while the .xlsx files use the Open XML format, which works well with templates since it can better preserve charts and miscellaneous objects.

  1. Double-click on the Write the Sales Forecast to Excel step.

A 'Sales Report AI Generated' file is created with an .xlsx extension.

If the file exists, it's replaced with a new output file.

The data set is written to the active sheet - Sheet 1.

  2. Click on the Content tab.

The data set is written starting at cell A1.

Any existing cells are overwritten.

Header cells are written.

Get Fields - retrieves the data stream fields.

  3. RUN and open the file:

~/Projects/genai/Use Case - Analyzing Financial Reports/data/Sales Report AI Generated.xlsx


Take a look at the RAG workshop.

It would be interesting to give this a go using the Hierarchical Data Type (HDT) EE plugin.
