
Text File Input

Onboarding text files ..


Workshop - Text File Input

Some of the Orders data that Steel Wheels processes arrives in text format. In this workshop, you will flatten the list, create capture groups, replace text, and finally format the order_value field.

In this workshop, you will configure the following steps:

  • Text File Input

  • Flattener

  • RegEx Evaluation

  • Replace in String

  • Select values

The lab gives you an idea of some of the steps required to load flat files into a database table.

Let's take a look at the data, which will give us an idea of how to approach a possible solution.

  • each line is a record

  • the 3rd line holds 2 values: 'order status' & 'order date'

  • Order Value is prefixed with a '$'

  • values contain stray white space

So what do we need to do to get this into a database table?

  • Flatten rows

  • Extract values and associate them with new data stream fields

  • Cut strings

  • Format fields - Date / Order Value


To create a new Transformation

Either of the following actions opens a new Transformation tab for you to begin designing your transformation:

  • Click File > New > Transformation

  • Use the CTRL-N hot key

Text File Input

The Text File Input step is used to read data from a variety of text-file types. The most commonly used formats include Comma Separated Values (CSV) files generated by spreadsheets, and fixed-width flat files.

The Text File Input step provides you with the ability to specify a list of files to read, or a list of directories with wildcards in the form of regular expressions. In addition, you can accept filenames from a previous step, making filename handling even more generic.
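
To get a feel for how a regex wildcard selects files, here is a minimal Java sketch; the directory and the pattern orders.*\.txt are hypothetical, but PDI applies the expression to each filename in much the same way:

    import java.io.File;
    import java.util.regex.Pattern;

    public class FileWildcard {
        public static void main(String[] args) {
            // Hypothetical wildcard: any file whose name starts with
            // "orders" and ends in ".txt".
            Pattern wildcard = Pattern.compile("orders.*\\.txt");
            File[] files = new File("/path/to/data").listFiles(); // hypothetical directory
            if (files == null) return;
            for (File f : files) {
                if (wildcard.matcher(f.getName()).matches()) {
                    System.out.println("would read: " + f.getName());
                }
            }
        }
    }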

  1. Drag the ‘Text File Input’ step onto the canvas.

  2. Double-click on the step, and configure the following properties:

Because the sample file is located in the same directory as the transformation, a good way to name the file in a location-independent manner is to use an internal variable to parameterize the directory where the file resides. In our case, the complete filename is:

${Internal.Transformation.Filename.Directory}/orders.txt

  3. Click on the ‘Content’ tab and configure the following properties:

  4. Click on the ‘Get Fields’ button.

Click on the ‘Fields’ tab and notice the following properties:

The entire dataset is read into a single field, ‘Field1’, with a data type of String, in the data stream.

  5. Close the step.

Row Flattener

The Row Flattener step allows you to flatten data sequentially.

  1. Drag the ‘Flattener’ step onto the canvas.

  2. Create a hop from the ‘read order list’ step.

  3. Double-click on the step, and configure the following properties:

  4. Close the step.

The data has now been flattened into records. This step enables you to define new target fields that match the number of repeating records, so Target field 1 maps to repeating record 1, and so on.
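
As an illustration of sequential flattening, the sketch below groups every N input rows into one output row; the four-line group size and the row values are assumptions for the example, not taken from the workshop data:

    import java.util.Arrays;
    import java.util.List;

    public class FlattenSketch {
        public static void main(String[] args) {
            // Hypothetical input stream: each order spans 4 sequential rows.
            List<String> rows = Arrays.asList(
                "rec1", "rec2", "rec3", "rec4",   // order 1
                "rec5", "rec6", "rec7", "rec8");  // order 2
            int targetFields = 4; // number of target fields defined in the step
            for (int i = 0; i + targetFields <= rows.size(); i += targetFields) {
                // Target field n receives repeating record n of each group.
                System.out.println(rows.subList(i, i + targetFields));
            }
        }
    }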

RegEx Evaluation

This step type allows you to match the String value of an input field against a text pattern defined by a regular expression. Optionally, you can use the regular expression step to extract substrings from the input text field matching a portion of the text pattern into new output fields. This is known as "capturing".

In our example, we’re going to extract and create two capture groups, order_status and order_date, based on the regular expression: (Delivered|Returned):(.+)

  1. Drag the ‘RegEx Evaluation’ step onto the canvas.

  2. Create a hop from the ‘flatten rows’ step.

  3. Double-click on the step, and configure the following properties:

You will also need to set Trim: both for each field. This will ensure all white space is removed and the exact length of the field is returned.

  4. Close the step.


Summary

  • This RegEx uses 2 capture groups, denoted by the parentheses and separated by a colon.

  • (Delivered|Returned) – matches either Delivered or Returned.

  • (.+) – matches one or more of any character.

  • You can use ‘Test regEx’ to check that the capture groups are correctly defined.
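
If you want to exercise the pattern outside of the step dialog, Java's own regex engine behaves the same way; the sample input line below is hypothetical:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexCheck {
        public static void main(String[] args) {
            Pattern p = Pattern.compile("(Delivered|Returned):(.+)");
            Matcher m = p.matcher("Delivered: Jan 2004"); // hypothetical sample line
            if (m.matches()) {
                System.out.println("order_status = " + m.group(1)); // "Delivered"
                System.out.println("order_date   = " + m.group(2)); // " Jan 2004"
                // Note the leading space in group 2 -- this is why Trim: both matters.
            }
        }
    }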

A good introduction can be found at regex101, an online engine for building, testing, and debugging regular expressions: https://regex101.com

Replace in string

Replace in String is a simple search-and-replace step. It also supports regular expressions and group references: a group reference can be used in the replace string as $n, where n is the number of the group.
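
Here is a minimal Java sketch of both behaviours; the sample strings are hypothetical:

    public class ReplaceSketch {
        public static void main(String[] args) {
            // Plain search-and-replace: strip the "Order Value: $ " prefix,
            // replacing it with nothing -- as this workshop step does.
            System.out.println("Order Value: $ 3190.00".replace("Order Value: $ ", ""));
            // prints: 3190.00

            // Regex with group references: $1 and $2 re-insert what the
            // numbered groups captured.
            System.out.println("Jan 2004".replaceAll("(\\w+) (\\d+)", "$2-$1"));
            // prints: 2004-Jan
        }
    }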

Time to tidy up the order_value stream field data. In this step, you replace the leading ‘Order Value:’ text with nothing (an empty string).

  1. Drag the ‘Replace in String’ step onto the canvas.

  2. Create a hop from the ‘parse delivered’ step.

  3. Double-click on the step, and configure the following properties:

  4. Close the step.

Ensure you have correctly entered the Search string: ‘Order Value: $’ followed by a trailing white space.

Select values

The Select Values step is useful for selecting, removing, renaming, changing data types and configuring the length and precision of the fields on the stream. These operations are organized into different categories:

  • Select and Alter — Specify the exact order and name in which the fields should be placed in the output rows

  • Remove — Specify the fields that should be removed from the output rows

  • Meta-data — Change the name, type, length, and precision (the metadata) of one or more fields

  1. Drag the Select values step onto the canvas.

  2. Create a hop from the ‘discard texts’ step.

  3. Double-click on the step, and configure the following properties:

Fieldname      Data Type    Format
order_value    Number       #.00
order_date     Date         MMM yyyy
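
These masks follow Java's DecimalFormat and SimpleDateFormat conventions, which PDI's number and date formats are based on, so you can preview their effect directly; the sample values are hypothetical:

    import java.text.DecimalFormat;
    import java.text.SimpleDateFormat;
    import java.util.Locale;

    public class FormatMasks {
        public static void main(String[] args) throws Exception {
            // Number mask #.00 : optional integer digits, exactly two decimals.
            System.out.println(new DecimalFormat("#.00").format(3190d)); // 3190.00

            // Date mask MMM yyyy : abbreviated month name plus 4-digit year.
            SimpleDateFormat df = new SimpleDateFormat("MMM yyyy", Locale.ENGLISH);
            System.out.println(df.format(df.parse("Jan 2004"))); // Jan 2004
        }
    }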

Run Transformation

Finally .. execute the transformation locally.

  1. Click the Run button in the Canvas Toolbar.

  2. Click on the Preview tab.
