GenAI
Generative artificial intelligence (GenAI) can create certain types of images, text, videos, and other media in response to prompts ..
So how does it work?
Users send Messages to the Thread, which the Assistant then processes.
The framework uses a Thread to maintain the context of a conversation.
Each interaction is added to the Thread as a Message.
Assistants can work with uploaded files, analyzing and referencing them in responses.
The framework maintains state across interactions, allowing for complex, multi-turn conversations.
The Assistant generates responses based on the conversation history and its capabilities.
Responses are generated asynchronously, allowing for handling of long-running tasks.
Assistants can produce various types of output, including text, code, or structured data.
Developers can fine-tune the Assistant's behavior through detailed instructions and model selection.
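The relationship between Threads, Messages, and Assistants described above can be pictured with a toy in-memory model. This is only a sketch, not the framework's API: every class and method name (`AssistantSketch`, `ConversationThread`, `run`) is hypothetical, and the "reply" is a placeholder string where a real assistant would call a model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Toy in-memory model; every class and method name here is hypothetical.
class AssistantSketch {

    // One entry in the conversation: who said it and what was said.
    static final class Message {
        final String role;
        final String content;

        Message(String role, String content) {
            this.role = role;
            this.content = content;
        }
    }

    // Holds the ordered conversation history (named to avoid java.lang.Thread).
    static final class ConversationThread {
        private final List<Message> messages = new ArrayList<>();

        synchronized void add(Message message) {
            messages.add(message);
        }

        // Returns a copy so callers cannot modify the stored history.
        synchronized List<Message> history() {
            return new ArrayList<>(messages);
        }
    }

    // Stand-in for an assistant: instructions plus a placeholder reply rule.
    static final class Assistant {
        final String instructions;

        Assistant(String instructions) {
            this.instructions = instructions;
        }

        // Builds a reply asynchronously from the history and appends it to the thread.
        // Assumes the thread already holds at least one message.
        CompletableFuture<Message> run(ConversationThread thread) {
            return CompletableFuture.supplyAsync(() -> {
                List<Message> history = thread.history();
                Message last = history.get(history.size() - 1);
                int turn = history.size() / 2 + 1;
                Message reply = new Message("assistant",
                    "[" + instructions + "] turn " + turn + ": " + last.content);
                thread.add(reply);
                return reply;
            });
        }
    }

    public static void main(String[] args) {
        ConversationThread thread = new ConversationThread();
        Assistant assistant = new Assistant("Be brief");

        thread.add(new Message("user", "Hello"));
        System.out.println(assistant.run(thread).join().content);

        thread.add(new Message("user", "Again"));
        System.out.println(assistant.run(thread).join().content);

        System.out.println(thread.history().size() + " messages in thread");
    }
}
```

Because `run` returns a `CompletableFuture`, the caller decides whether to block with `join()` or continue with other work while the reply is produced, which mirrors the asynchronous behaviour described above.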
The HTML Parser is a utility plugin for Pentaho Data Integration (PDI) that extracts selected text from HTML or XML files. It is useful for cleaning data ahead of natural language processing tasks such as sentiment analysis and SEO keyword analysis.
Accepts input from both data streams and files
Supports parsing using XPath expressions or CSS selectors
Can process single files or multiple inputs from a stream
Compatible with local and virtual file systems
The plugin uses jsoup, a Java library that simplifies working with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS, and XPath selectors.
The step is located in the Input folder.
XPath (XML Path Language) is a query language for selecting nodes from an XML or HTML document. Older jsoup releases do not support XPath natively (support arrived in jsoup 1.14.3 through selectXpath), so XPath queries can be evaluated by combining jsoup with Java's built-in XPath capabilities.
Here's an overview of some common XPath syntax:
/ - Selects from the root node
// - Selects nodes anywhere in the document
. - Selects the current node
.. - Selects the parent of the current node
@ - Selects attributes
[] - Used for predicates (conditions)
Some examples:
//div - Selects all div elements in the document
//div[@class='content'] - Selects all div elements with class 'content'
//h1/text() - Selects the text content of all h1 elements
//div[@class='content']/p - Selects all p elements that are direct children of div elements with class 'content'
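To see these expressions in action outside PDI, the following minimal sketch evaluates them with Java's built-in `javax.xml.xpath` API. It is not the plugin's implementation: the sample markup and the `XPathSyntaxDemo` and `select` names are assumptions made for illustration, and the built-in XML parser only accepts well-formed markup, whereas real-world HTML would normally be cleaned by jsoup first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSyntaxDemo {

    // Hypothetical, well-formed sample markup (not taken from the tutorial files).
    static final String SAMPLE =
        "<html><body>"
        + "<h1>Welcome</h1>"
        + "<div class='content'><p>First</p><p>Second</p></div>"
        + "<div class='sidebar'><p>Aside</p></div>"
        + "</body></html>";

    // Evaluates an XPath expression and returns the text of every matching node.
    static List<String> select(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "//div",
            "//div[@class='content']",
            "//h1/text()",
            "//div[@class='content']/p"
        };
        for (String expression : expressions) {
            System.out.println(expression + " -> " + select(SAMPLE, expression));
        }
    }
}
```

Note that selecting an element such as `//div` returns the concatenated text of everything inside it, while `//div[@class='content']/p` returns one entry per matching paragraph.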
The data source is referenced in a path.
Open the following transformation:
C:/Projects/genai/html/HTML Parser - Xpath.ktr
~/Projects/genai/html/HTML Parser - Xpath.ktr
Double-click the hp: html step and configure it with the following settings:
Leaving the XPath field blank removes all tags and returns all of the content.
RUN and preview the results.
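As a rough illustration of what a blank XPath field yields (all tags removed, all content returned), the sketch below strips the markup from a small document using Java's built-in DOM parser. This is an approximation under stated assumptions: the plugin itself relies on jsoup, so its whitespace handling may differ, and the `StripTagsDemo` and `allText` names and the sample markup are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class StripTagsDemo {

    // Returns the text of the whole document with tags removed
    // and runs of whitespace collapsed to single spaces.
    static String allText(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            return doc.getDocumentElement()
                .getTextContent()
                .trim()
                .replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample markup.
        String markup = "<html><body><h1>Title</h1> <p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(allText(markup));
    }
}
```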
These XPath queries will help you navigate and extract specific content from homepage.html.
Select the main title:
Select all navigation links:
Select all article titles (h3 elements within articles):
Select all paragraph content within articles:
Select all author names:
Select the latest news items:
Select the footer text:
Select all section titles (h2 elements that are direct children of section elements):
Select the second article:
Select all elements with a class attribute:
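The structure of homepage.html is not shown here, so the sketch below runs one possible expression for each task against a hypothetical page. The markup, the element and attribute names (`author`, `latest-news`), and the `HomepageQueries` class are all assumptions; the expressions that fit the real file depend on its actual structure.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class HomepageQueries {

    // Hypothetical stand-in for homepage.html; the real file may be structured differently.
    static final String HOMEPAGE =
        "<html><body>"
        + "<header><h1>Daily News</h1>"
        + "<nav><a href='/'>Home</a><a href='/about'>About</a></nav></header>"
        + "<section id='articles'><h2>Articles</h2>"
        + "<article><h3>First Post</h3><p>Intro text.</p><span class='author'>Ann</span></article>"
        + "<article><h3>Second Post</h3><p>More text.</p><span class='author'>Bob</span></article>"
        + "</section>"
        + "<section id='latest-news'><h2>Latest News</h2>"
        + "<ul><li>Item one</li><li>Item two</li></ul></section>"
        + "<footer><p>Copyright Example</p></footer>"
        + "</body></html>";

    // Evaluates an XPath expression and returns the text of every matching node.
    static List<String> select(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // One candidate expression per task, in the same order as the list above.
        Map<String, String> queries = new LinkedHashMap<>();
        queries.put("Main title", "//h1");
        queries.put("Navigation links", "//nav//a");
        queries.put("Article titles", "//article/h3");
        queries.put("Article paragraphs", "//article//p");
        queries.put("Author names", "//*[@class='author']");
        queries.put("Latest news items", "//section[@id='latest-news']//li");
        queries.put("Footer text", "//footer/p");
        queries.put("Section titles", "//section/h2");
        queries.put("Second article", "(//article)[2]");
        queries.put("Elements with a class attribute", "//*[@class]");

        queries.forEach((label, xpath) ->
            System.out.println(label + " [" + xpath + "]: " + select(HOMEPAGE, xpath)));
    }
}
```

The parentheses in `(//article)[2]` apply the position to the whole result set, so it picks the second article in document order rather than the second article within each parent.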
The data source is referenced as a filepath in a datastream field.
Enable the hop between: dg: filepath from stream -> hp: parse html xpath.
Disable the hops between:
Data Grid -> hp: parse html xpath
dg: html from stream -> hp: parse html xpath
Double-click the hp: html step and configure it with the following settings:
RUN and preview the results.
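Conceptually, this variant receives one file path per incoming row and parses the file each path points to. The sketch below imitates that outside PDI with a list of paths and Java's built-in XML and XPath APIs. It is an illustration only: the `FilePathStreamDemo`, `parseEach`, and `writeTemp` names and the temporary files are hypothetical, and the built-in parser requires well-formed markup.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

class FilePathStreamDemo {

    // For each file path (one per "row"), parses the file and returns
    // the text of the first node matching the expression.
    static List<String> parseEach(List<String> filePaths, String expression) {
        List<String> results = new ArrayList<>();
        for (String filePath : filePaths) {
            try (InputStream in = Files.newInputStream(Paths.get(filePath))) {
                Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
                results.add(XPathFactory.newInstance().newXPath().evaluate(expression, doc));
            } catch (Exception e) {
                throw new IllegalArgumentException("Could not parse " + filePath, e);
            }
        }
        return results;
    }

    // Helper for the demo: writes markup to a temporary file and returns its path.
    static String writeTemp(String markup) {
        try {
            Path file = Files.createTempFile("page", ".html");
            file.toFile().deleteOnExit();
            Files.write(file, markup.getBytes(StandardCharsets.UTF_8));
            return file.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two hypothetical input files standing in for the paths held in the stream field.
        List<String> paths = new ArrayList<>();
        paths.add(writeTemp("<html><body><h1>Page A</h1></body></html>"));
        paths.add(writeTemp("<html><body><h1>Page B</h1></body></html>"));
        System.out.println(parseEach(paths, "//h1"));
    }
}
```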
The data source is referenced as <html> in a data stream field.
Pentaho data streams can use binary fields to carry various kinds of data, including large text objects such as HTML. Using a binary field ensures that the entire HTML content is treated as a single, uninterpreted block of raw bytes within the Pentaho pipeline.
Storing the HTML in a binary field lets you pass the raw content through the steps of your transformation without Pentaho interpreting or modifying the HTML prematurely.
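The idea of carrying HTML as an uninterpreted block of bytes can be sketched in plain Java: the markup is encoded once, travels as a `byte[]`, and is only parsed at the point where it is needed. The `BinaryHtmlDemo` class, its method names, and the choice of UTF-8 are assumptions for illustration and say nothing about how PDI stores binary fields internally.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

class BinaryHtmlDemo {

    // Encodes the markup as raw bytes (UTF-8 is an assumption of this sketch).
    static byte[] toBinary(String html) {
        return html.getBytes(StandardCharsets.UTF_8);
    }

    // Parses the bytes only when a value is needed and returns
    // the text of the first node matching the expression.
    static String firstMatch(byte[] htmlBytes, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(htmlBytes));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample markup containing a non-ASCII character.
        String html = "<html><body><p>Caf\u00e9 menu</p></body></html>";
        byte[] payload = toBinary(html);

        // The bytes pass through unchanged until they are parsed.
        System.out.println(payload.length + " bytes");
        System.out.println(firstMatch(payload, "//p"));
    }
}
```

Decoding the bytes with the same charset reproduces the original string exactly, which is the property that makes a binary field a safe carrier for markup.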
Enable the hop between: dg: html from stream -> hp: parse html xpath.
Disable the hops between:
dg: filepath from stream -> hp: parse html xpath
Data Grid -> hp: parse html xpath
Double-click the hp: html step and configure it with the following settings:
RUN and preview the results.