
Text File Input

Onboarding text files ..


Workshop - Text File Input

Some of the Orders data that Steel Wheels processes arrives in text format. In this workshop, you will flatten the list, create capture groups, replace text, and finally format the order_value field.

In this workshop, you will configure the following steps:

  • Text File Input

  • Flattener

  • RegEx Evaluation

  • Replace in String

  • Select values

The lab gives you an idea of some of the steps required to load flat files into a database table.

Let's take a look at the data, which will give us an idea of how to approach a possible solution.

  • each line is a record

  • the 3rd line holds 2 values: 'order status' & 'order date'

  • Order Value is prefixed with a '$'

  • values contain stray white space

So what do we need to do to get this into a database table?

  • Flatten rows

  • Extract values and associate them with new data stream fields

  • Cut strings

  • Format fields - Date / Order Value


To create a new Transformation

Either of the following actions opens a new Transformation tab for you to begin designing your transformation:

  • Click File > New > Transformation

  • Use the CTRL-N hot key

Text File Input

The Text File Input step is used to read data from a variety of text-file types. The most commonly used formats include Comma Separated Values (CSV) files generated by spreadsheets, and fixed-width flat files.

The Text File Input step provides you with the ability to specify a list of files to read, or a list of directories with wildcards in the form of regular expressions. In addition, you can accept filenames from a previous step, making filename handling even more generic.
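
To get a feel for how a regex wildcard selects files, here is a minimal Java sketch; the directory and the pattern orders.*\.txt are hypothetical, but PDI applies the expression to each filename in much the same way:

    import java.io.File;
    import java.util.regex.Pattern;

    public class FileWildcard {
        public static void main(String[] args) {
            // Hypothetical wildcard: any file whose name starts with
            // "orders" and ends in ".txt".
            Pattern wildcard = Pattern.compile("orders.*\\.txt");
            File[] files = new File("/path/to/data").listFiles(); // hypothetical directory
            if (files == null) return;
            for (File f : files) {
                if (wildcard.matcher(f.getName()).matches()) {
                    System.out.println("would read: " + f.getName());
                }
            }
        }
    }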

  1. Drag the ‘Text File Input’ step onto the canvas.

  2. Double-click on the step, and configure the following properties:

Because the sample file is located in the same directory as the transformation, a good way to name the file in a location-independent manner is to use an internal variable to parameterize the directory where the file resides. In our case, the complete filename is:

${Internal.Transformation.Filename.Directory}/orders.txt

  3. Click on the ‘Content’ tab and configure the following properties:

  4. Click on the ‘Get Fields’ button.

Click on the ‘Fields’ tab and notice the following properties:

The entire dataset is read into a single field, ‘Field1’, with a data type of String, in the data stream.

  5. Close the step.

Row Flattener

The Row Flattener step allows you to flatten data sequentially.

  1. Drag the ‘Flattener’ step onto the canvas.

  2. Create a hop from the ‘read order list’ step.

  3. Double-click on the step, and configure the following properties:

  4. Close the step.

The data has now been flattened into records. This step enables you to define new target fields that match the number of repeating records, so Target field 1 maps to repeating record 1, and so on.
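
As an illustration of sequential flattening, the sketch below groups every N input rows into one output row; the four-line group size and the row values are assumptions for the example, not taken from the workshop data:

    import java.util.Arrays;
    import java.util.List;

    public class FlattenSketch {
        public static void main(String[] args) {
            // Hypothetical input stream: each order spans 4 sequential rows.
            List<String> rows = Arrays.asList(
                "rec1", "rec2", "rec3", "rec4",   // order 1
                "rec5", "rec6", "rec7", "rec8");  // order 2
            int targetFields = 4; // number of target fields defined in the step
            for (int i = 0; i + targetFields <= rows.size(); i += targetFields) {
                // Target field n receives repeating record n of each group.
                System.out.println(rows.subList(i, i + targetFields));
            }
        }
    }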

RegEx Evaluation

This step type allows you to match the String value of an input field against a text pattern defined by a regular expression. Optionally, you can use the regular expression step to extract substrings from the input text field matching a portion of the text pattern into new output fields. This is known as "capturing".

In our example, we’re going to extract and create two capture groups, order_status and order_date, based on the regular expression: (Delivered|Returned):(.+)

  1. Drag the ‘RegEx Evaluation’ step onto the canvas.

  2. Create a hop from the ‘flatten rows’ step.

  3. Double-click on the step, and configure the following properties:

You will also need to set Trim: both for each field. This will ensure all white space is removed and the exact length of the field is returned.

  4. Close the step.


Summary

  • This RegEx uses 2 capture groups, denoted by the parentheses and separated by a colon.

  • (Delivered|Returned) – matches either Delivered or Returned.

  • (.+) – matches one or more of any character.

  • You can use ‘Test regEx’ to check that the capture groups are correctly defined.
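
If you want to exercise the pattern outside of the step dialog, Java's own regex engine behaves the same way; the sample input line below is hypothetical:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexCheck {
        public static void main(String[] args) {
            Pattern p = Pattern.compile("(Delivered|Returned):(.+)");
            Matcher m = p.matcher("Delivered: Jan 2004"); // hypothetical sample line
            if (m.matches()) {
                System.out.println("order_status = " + m.group(1)); // "Delivered"
                System.out.println("order_date   = " + m.group(2)); // " Jan 2004"
                // Note the leading space in group 2 -- this is why Trim: both matters.
            }
        }
    }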

A good introduction can be found at regex101, an online engine for building, testing, and debugging regular expressions: https://regex101.com

Replace in string

Replace in String is a simple search-and-replace step. It also supports regular expressions and group references: a group reference can be used in the replace string as $n, where n is the number of the group.
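
Here is a minimal Java sketch of both behaviours; the sample strings are hypothetical:

    public class ReplaceSketch {
        public static void main(String[] args) {
            // Plain search-and-replace: strip the "Order Value: $ " prefix,
            // replacing it with nothing -- as this workshop step does.
            System.out.println("Order Value: $ 3190.00".replace("Order Value: $ ", ""));
            // prints: 3190.00

            // Regex with group references: $1 and $2 re-insert what the
            // numbered groups captured.
            System.out.println("Jan 2004".replaceAll("(\\w+) (\\d+)", "$2-$1"));
            // prints: 2004-Jan
        }
    }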

Time to tidy up the order_value stream field data. In this step, you replace the leading ‘Order Value:’ text with nothing (an empty string).

  1. Drag the ‘Replace in String’ step onto the canvas.

  2. Create a hop from the ‘parse delivered’ step.

  3. Double-click on the step, and configure the following properties:

  4. Close the step.

Ensure you have correctly entered the Search string: ‘Order Value: $’ followed by a trailing white space.

Select values

The Select Values step is useful for selecting, removing, renaming, changing data types and configuring the length and precision of the fields on the stream. These operations are organized into different categories:

  • Select and Alter — Specify the exact order and name in which the fields should be placed in the output rows

  • Remove — Specify the fields that should be removed from the output rows

  • Meta-data — Change the name, type, length, and precision (the metadata) of one or more fields

  1. Drag the Select values step onto the canvas.

  2. Create a hop from the ‘discard texts’ step.

  3. Double-click on the step, and configure the following properties:

Fieldname      Data Type    Format
order_value    Number       #.00
order_date     Date         MMM yyyy
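
These masks follow Java's DecimalFormat and SimpleDateFormat conventions, which PDI's number and date formats are based on, so you can preview their effect directly; the sample values are hypothetical:

    import java.text.DecimalFormat;
    import java.text.SimpleDateFormat;
    import java.util.Locale;

    public class FormatMasks {
        public static void main(String[] args) throws Exception {
            // Number mask #.00 : optional integer digits, exactly two decimals.
            System.out.println(new DecimalFormat("#.00").format(3190d)); // 3190.00

            // Date mask MMM yyyy : abbreviated month name plus 4-digit year.
            SimpleDateFormat df = new SimpleDateFormat("MMM yyyy", Locale.ENGLISH);
            System.out.println(df.format(df.parse("Jan 2004"))); // Jan 2004
        }
    }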

Run Transformation

Finally .. execute the transformation locally.

  1. Click the Run button in the Canvas Toolbar.

  2. Click on the Preview tab.
