GenAI

Generative artificial intelligence (GenAI) can create certain types of images, text, videos, and other media in response to prompts.

So how does it work?

Users send Messages to the Thread, which the Assistant then processes.

The framework uses a Thread to maintain the context of a conversation.

Each interaction is added to the Thread as a Message.

Assistants can work with uploaded files, analyzing and referencing them in responses.

The framework maintains state across interactions, allowing for complex, multi-turn conversations.

The Assistant generates responses based on the conversation history and its capabilities.

Responses are generated asynchronously, so long-running tasks can be handled without blocking the conversation.

Assistants can produce various types of output, including text, code, or structured data.

Developers can fine-tune the Assistant's behavior through detailed instructions and model selection.
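The Thread, Message, and Assistant relationship described above can be pictured with a toy in-memory model. This is only a sketch: the class names, the `run_assistant` function, and the canned reply are hypothetical stand-ins for illustration and do not correspond to any real framework's API.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Message:
    role: str      # "user" or "assistant"
    content: str


@dataclass
class Thread:
    # The thread keeps every message; this is what preserves context.
    messages: list = field(default_factory=list)

    def add(self, role: str, content: str) -> Message:
        message = Message(role, content)
        self.messages.append(message)
        return message


async def run_assistant(thread: Thread, instructions: str) -> Message:
    """Hypothetical stand-in for a model call that sees the whole history."""
    await asyncio.sleep(0)  # placeholder for a long-running, asynchronous task
    user_turns = [m.content for m in thread.messages if m.role == "user"]
    reply = f"[{instructions}] seen {len(user_turns)} user message(s); last: {user_turns[-1]}"
    return thread.add("assistant", reply)


def demo() -> Thread:
    thread = Thread()
    thread.add("user", "Summarize this page.")
    asyncio.run(run_assistant(thread, "Be brief"))
    thread.add("user", "Now list the headings.")
    asyncio.run(run_assistant(thread, "Be brief"))
    return thread
```

Calling `demo()` returns a thread with four messages, alternating between user and assistant, where the second reply reflects both user turns.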

The HTML Parser is a utility plugin for Pentaho Data Integration (PDI) that extracts selected text from HTML or XML files. It is useful for cleaning data for natural language processing tasks such as sentiment analysis and SEO keyword analysis.

• Accepts input from both data streams and files

• Supports parsing using XPath expressions or CSS selectors

• Can process single files or multiple inputs from a stream

• Compatible with local and virtual file systems

The plugin uses jsoup, a Java library that simplifies working with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS selectors, and XPath selectors.

The step is located in the Input folder.

XPath (XML Path Language) is a query language for selecting nodes from an XML or HTML document. jsoup has supported XPath selectors natively since version 1.14.3; with older versions, jsoup can be combined with Java's built-in XPath capabilities to achieve the same result.

Here's an overview of some common XPath syntax:

/ - Selects from the root node

// - Selects nodes anywhere in the document

. - Selects the current node

.. - Selects the parent of the current node

@ - Selects attributes

[] - Used for predicates (conditions)

Some examples:

  • //div - Selects all div elements in the document

  • //div[@class='content'] - Selects all div elements with class 'content'

  • //h1/text() - Selects the text content of all h1 elements

  • //div[@class='content']/p - Selects all p elements that are direct children of div elements with class 'content'
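The syntax above can be tried out with a short Python sketch. Note the assumptions: this uses Python's standard-library `xml.etree.ElementTree`, not the plugin's jsoup engine, and ElementTree implements only a subset of XPath. Paths must be relative (`.//div` rather than `//div`) and `text()` is not available, so the `.text` attribute is read instead. The sample markup is a hypothetical fragment for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed fragment, used only for illustration.
DOC = """
<html>
  <body>
    <h1>Title</h1>
    <div class="content"><p>First</p><p>Second</p></div>
    <div class="sidebar"><p>Aside</p></div>
  </body>
</html>
"""

root = ET.fromstring(DOC.strip())

# "//div": all div elements (written ".//div" in ElementTree).
all_divs = root.findall(".//div")

# "[...]" predicate with "@" attribute test: only divs with class 'content'.
content_divs = root.findall(".//div[@class='content']")

# "/" child step: p elements that are direct children of those divs.
content_paragraphs = root.findall(".//div[@class='content']/p")

# "..": parents of every p element (each parent is reported once).
paragraph_parents = root.findall(".//p/..")

# Stand-in for "//h1/text()": select the elements, then read their text.
h1_texts = [h1.text for h1 in root.findall(".//h1")]
```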

HTML Data Source

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>XPath Example Page</title>
</head>
<body>
    <header>
        <h1 id="main-title">Welcome to Our Website</h1>
        <nav>
            <ul>
                <li><a href="#home">Home</a></li>
                <li><a href="#about">About</a></li>
                <li><a href="#contact">Contact</a></li>
            </ul>
        </nav>
    </header>

    <main>
        <section id="featured-articles">
            <h2>Featured Articles</h2>
            <article>
                <h3>Article 1</h3>
                <p class="content">This is the content of article 1.</p>
                <span class="author">By John Doe</span>
            </article>
            <article>
                <h3>Article 2</h3>
                <p class="content">This is the content of article 2.</p>
                <span class="author">By Jane Smith</span>
            </article>
        </section>

        <section id="latest-news">
            <h2>Latest News</h2>
            <ul>
                <li>News item 1</li>
                <li>News item 2</li>
                <li>News item 3</li>
            </ul>
        </section>
    </main>

    <footer>
        <p>&copy; 2024 Our Website. All rights reserved.</p>
    </footer>
</body>
</html>

Transformation

Filepath

The data source is referenced by a file path.

  1. Open the following transformation:

Windows

C:/Projects/genai/html/HTML Parser - Xpath.ktr

Linux

~/Projects/genai/html/HTML Parser - Xpath.ktr

  2. Double-click the hp: html step and configure it with the following settings:

Leaving the XPath field blank removes all tags and returns all of the text content.

  3. Run the transformation and preview the results.
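The blank-XPath behavior mentioned in the steps above (all tags removed, all content returned) can be approximated with a few lines of Python. This is a sketch built on the standard-library `html.parser` module, not the plugin's own implementation; the `TextExtractor` class and `strip_tags` function are hypothetical names, and the way whitespace is joined is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &copy; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.chunks.append(text)


def strip_tags(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    # Assumption: text chunks are joined with a single space.
    return " ".join(parser.chunks)


footer = "<footer><p>&copy; 2024 Our Website. All rights reserved.</p></footer>"
footer_text = strip_tags(footer)
```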

These XPath queries help you navigate and extract specific content from homepage.html.

Select the main title:

//h1[@id='main-title']

Select all navigation links:

//nav//a

Select all article titles (h3 elements within articles):

//article/h3

Select all paragraph content within articles:

//article/p[@class='content']

Select all author names:

//span[@class='author']

Select the latest news items:

//section[@id='latest-news']//li

Select the footer text:

//footer/p/text()

Select all section titles (h2 elements that are direct children of section elements):

//section/h2

Select the second article:

(//article)[2]

Select all elements with a class attribute:

//*[@class]
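Several of the queries above can be checked outside PDI with a small Python sketch. The assumptions are: it uses the standard-library `xml.etree.ElementTree` rather than the plugin, it runs against only the `<main>` part of the sample page (the full page is not well-formed XML because of the unclosed `meta` tag and the `&copy;` entity), and paths are written in ElementTree's relative form. ElementTree has no parenthesized `(//article)[2]`, so `.//article[2]` is used instead, which means "second article among its siblings" and gives the same element here.

```python
import xml.etree.ElementTree as ET

# The <main> element of the sample page, which is well-formed XML on its own.
MAIN = """
<main>
  <section id="featured-articles">
    <h2>Featured Articles</h2>
    <article>
      <h3>Article 1</h3>
      <p class="content">This is the content of article 1.</p>
      <span class="author">By John Doe</span>
    </article>
    <article>
      <h3>Article 2</h3>
      <p class="content">This is the content of article 2.</p>
      <span class="author">By Jane Smith</span>
    </article>
  </section>
  <section id="latest-news">
    <h2>Latest News</h2>
    <ul>
      <li>News item 1</li>
      <li>News item 2</li>
      <li>News item 3</li>
    </ul>
  </section>
</main>
"""

QUERIES = {
    "article titles": ".//article/h3",
    "authors": ".//span[@class='author']",
    "latest news": ".//section[@id='latest-news']//li",
    "section titles": ".//section/h2",
    "second article title": ".//article[2]/h3",
}


def run_queries(xml_text, queries):
    """Return the text of every element matched by each labeled query."""
    root = ET.fromstring(xml_text.strip())
    return {label: [el.text for el in root.findall(path)]
            for label, path in queries.items()}


results = run_queries(MAIN, QUERIES)
```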

Filepath from stream

The data source is referenced as a file path held in a data stream field.

  1. Enable the hop between: dg: filepath from stream -> hp: parse html xpath.

  2. Disable the hops between: Data Grid -> hp: parse html xpath and dg: html from stream -> hp: parse html xpath.

  3. Double-click the hp: html step and configure it with the following settings:

  4. Run the transformation and preview the results.
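The idea behind this variant, where each incoming row carries the path of the file to parse, can be sketched in Python. This is an illustration under stated assumptions, not the plugin's code: the field names `filepath` and `result` and the `parse_rows` function are hypothetical, the parser is the standard-library ElementTree (so the file must be well-formed), and the demo writes its own temporary file.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_rows(rows, xpath, field="filepath"):
    """For each row, read the file named in `field` and attach the matches."""
    output = []
    for row in rows:
        root = ET.parse(row[field]).getroot()
        matches = [el.text for el in root.findall(xpath)]
        # Keep the incoming fields and add the extracted text as a new field.
        output.append({**row, "result": matches})
    return output


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "page.html"
        path.write_text(
            "<html><body><h1>Welcome to Our Website</h1></body></html>",
            encoding="utf-8",
        )
        rows = [{"filepath": str(path)}]  # one row per file to process
        return parse_rows(rows, ".//h1")
```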


HTML from stream

The data source is HTML content held in a data stream field.

Pentaho data streams often use binary fields to carry large text objects such as HTML. Holding the HTML in a binary field means the entire document is treated as a single, uninterpreted block of raw bytes as it moves through the Pentaho pipeline.

Storing the HTML as binary data allows you to pass the raw content through various steps in your Pentaho transformation without Pentaho trying to interpret or modify the HTML prematurely.
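The pass-through idea can be illustrated with a short Python sketch: the HTML travels as bytes through intermediate steps and is decoded only at the point where it is parsed. The step functions and field names here are hypothetical, UTF-8 is an assumed encoding, and the standard-library ElementTree stands in for the plugin's parser.

```python
import xml.etree.ElementTree as ET

# The HTML is stored as raw bytes; \u00a9 is the copyright sign.
html_bytes = "<html><body><p>\u00a9 2024 Our Website</p></body></html>".encode("utf-8")
row = {"id": 1, "html": html_bytes}


def passthrough_step(row):
    # An intermediate step copies the row without touching the HTML bytes.
    return dict(row)


def parse_step(row, xpath, encoding="utf-8"):
    # Decoding happens only here, where the HTML is actually interpreted.
    root = ET.fromstring(row["html"].decode(encoding))
    return {**row, "result": [el.text for el in root.findall(xpath)]}


carried = passthrough_step(passthrough_step(row))
parsed = parse_step(carried, ".//p")
```

After two pass-through steps the `html` field is byte-for-byte identical to the input, and the non-ASCII character survives intact once decoded.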

  1. Enable the hop between: dg: html from stream -> hp: parse html xpath.

  2. Disable the hops between: dg: filepath from stream -> hp: parse html xpath and Data Grid -> hp: parse html xpath.

  3. Double-click the hp: html step and configure it with the following settings:

  4. Run the transformation and preview the results.
