
Text

Ingesting Text Files ..

CSV & TXT

Onboarding / reading data from CSV and TXT files can be tricky. Typical pitfalls (illustrated by the sketch after this list) include:

File Format Issues:

  • Inconsistent delimiters (e.g., mixing commas and tabs)

  • Incorrect line endings (Windows vs. Unix style)

  • Unexpected character encoding (e.g., UTF-8 vs. ASCII)

Data Quality Problems:

  • Missing values or incomplete records

  • Inconsistent data types within columns

  • Duplicate records

Header Row Handling:

  • Presence or absence of header row

  • Misaligned headers with data columns

Dynamic File Names:

  • Difficulty in handling files with changing names or timestamps

Complex Data Structures:

  • Nested data or hierarchical information in flat files

  • Multi-line records

Data Type Conversion:

  • Improper automatic type inference

  • Date and time format inconsistencies
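
Most of these issues can be neutralised by being explicit rather than relying on defaults, which is exactly what the Text File Input step lets you configure. The plain Java sketch below (not a PDI step; the file name, delimiter, and column layout are invented for illustration) shows the same defensive choices: a declared character set, a declared delimiter, header handling, and explicit type conversion.

```java
import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class CsvSanityCheck {

    public static void main(String[] args) throws Exception {
        // Hypothetical file, delimiter and layout -- assumptions for illustration only.
        Path path = Paths.get("customers.csv");

        // Be explicit about the encoding instead of trusting the platform default.
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            reader.readLine();                                   // consume the header row, if present
            DateTimeFormatter dateFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd");

            String line;
            while ((line = reader.readLine()) != null) {
                // Split on the declared delimiter; limit -1 keeps trailing empty
                // fields, so missing values stay visible instead of vanishing.
                String[] fields = line.split(";", -1);
                if (fields.length < 3) {
                    System.err.println("Incomplete record skipped: " + line);
                    continue;
                }
                // Explicit type conversion instead of automatic type inference.
                long id = Long.parseLong(fields[0].trim());
                String name = fields[1].trim();
                LocalDate signup = LocalDate.parse(fields[2].trim(), dateFmt);
                System.out.printf("%d,%s,%s%n", id, name, signup);
            }
        }
    }
}
```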

Workshops

The Text File Input step in Pentaho Data Integration (PDI) allows you to extract data from various text file formats like CSV, fixed-width, and delimited files. It offers capabilities for handling headers, footers, compression, and complex file patterns, making it ideal for importing raw data from external systems, log files, or legacy exports.

The Text File Output step enables you to export transformation results to text files with configurable formats, delimiters, and compression options. This step is frequently used for creating reports, generating data exchange files for other systems, archiving processed data, and creating backup files. Together, these components form the foundation of many ETL workflows by facilitating seamless data import and export operations.
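For context, transformations built around these steps are normally designed and run in Spoon, or executed with the Pan command-line tool or the server. The sketch below uses the Kettle Java API to run such a transformation programmatically, purely to make the execution flow concrete; the .ktr file name is hypothetical and the kettle-engine libraries are assumed to be on the classpath.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunTextFileTransformation {

    public static void main(String[] args) throws Exception {
        // Initialise the Kettle environment (loads step plugins,
        // including Text File Input / Output).
        KettleEnvironment.init();

        // Hypothetical transformation built in Spoon; the file name is
        // an assumption for illustration.
        TransMeta transMeta = new TransMeta("text_file_input_to_output.ktr");
        Trans trans = new Trans(transMeta);

        trans.execute(null);        // no command-line arguments
        trans.waitUntilFinished();

        if (trans.getErrors() > 0) {
            throw new RuntimeException("Transformation finished with errors");
        }
    }
}
```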

Text File Input

This simple workflow introduces some of the key steps used for ingesting text files. The use case covers text files where you need to (see the sketch after this list):

  • change the layout / structure

  • extract key values - create new data stream fields

  • string cut or replace values

  • change the data stream field type
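
As a rough illustration of those operations outside PDI, the plain Java sketch below performs the same cut, replace, and type-change operations on a single made-up input line; inside PDI they would typically be handled by steps such as Strings cut, Replace in string, and Select values.

```java
public class FieldReshapeSketch {

    public static void main(String[] args) {
        // Hypothetical raw line from the workshop input; the layout is assumed.
        String raw = "ORD-000123|2024-05-01| 1,299.00 ";

        String[] parts = raw.split("\\|", -1);

        // Cut a string: keep only the numeric part of the order reference.
        String orderNumber = parts[0].substring(4);              // "000123"

        // Replace values: strip the thousands separator before conversion.
        String amountText = parts[2].trim().replace(",", "");    // "1299.00"

        // Change the data stream field type: String -> numeric / date.
        double amount = Double.parseDouble(amountText);
        java.time.LocalDate orderDate = java.time.LocalDate.parse(parts[1]);

        // New data stream fields derived from the extracted key values.
        System.out.println(orderNumber + " | " + orderDate + " | " + amount);
    }
}
```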

Text File Output

In this workshop, the focus is on the output - a Steel Wheels customer survey.

The survey components (customer name, survey instructions, questions) are, for demonstration purposes, kept as separate data sources.

The workshop introduces a number of key concepts (sketched in code after the list):

  • how to replace the value in an existing data stream field.

  • how data streams are appended.
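
The plain Java sketch below mimics both ideas with invented data: a placeholder in a template row is replaced with the customer name, and the question rows are appended after it into a single output file. In the workshop itself this is done with PDI steps rather than code.

```java
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class SurveyOutputSketch {

    public static void main(String[] args) throws Exception {
        // Hypothetical survey fragments; in the workshop these come from
        // separate data sources (customer names, instructions, questions).
        List<String> customers = List.of("Alpha Cycles", "Beta Bikes");
        String instructions = "Please rate each question from 1 to 5.";
        List<String> questions = List.of("Q1: Delivery time?", "Q2: Product quality?");

        // The output file name is an assumption for illustration.
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(Paths.get("steel_wheels_survey.txt"),
                                        StandardCharsets.UTF_8))) {
            for (String customer : customers) {
                // Replace the value in an existing field: substitute the
                // customer name into a templated greeting row.
                out.println("Dear <customer>,".replace("<customer>", customer));
                out.println(instructions);
                // Append data streams: the question rows follow the header
                // rows in the same output.
                questions.forEach(out::println);
                out.println();
            }
        }
    }
}
```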
