Text

Ingesting Text Files

Onboarding or reading data from CSV and TXT files can be tricky. Common problems include:

  1. File Format Issues:

    • Inconsistent delimiters (e.g., mixing commas and tabs)

    • Incorrect line endings (Windows vs. Unix style)

    • Unexpected character encoding (e.g., UTF-8 vs. ASCII)

  2. Data Quality Problems:

    • Missing values or incomplete records

    • Inconsistent data types within columns

    • Duplicate records

  3. Header Row Handling:

    • Presence or absence of header row

    • Misaligned headers with data columns

  4. Dynamic File Names:

    • Difficulty in handling files with changing names or timestamps

  5. Complex Data Structures:

    • Nested data or hierarchical information in flat files

    • Multi-line records

  6. Data Type Conversion:

    • Improper automatic type inference

    • Date and time format inconsistencies
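Several of the issues above (encoding, line endings, delimiters, header handling, incomplete records) can be checked before the data goes any further. The following is a minimal Python sketch using only the standard library, not the product's own ingestion logic. The function name `read_delimited`, the candidate encodings, and the candidate delimiters are assumptions chosen for illustration, and the delimiter guess is a deliberately simple heuristic that only looks at the first line.

```python
import csv
import io


def read_delimited(raw, encodings=("utf-8-sig", "cp1252"), has_header=True):
    """Decode raw bytes and return (header, good_rows, bad_rows).

    Hypothetical helper for illustration; the encodings and the
    delimiter candidates below are assumptions, not fixed rules.
    """
    # Try each candidate encoding in order until one decodes cleanly.
    text = None
    for enc in encodings:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    if text is None:
        raise ValueError("could not decode input with any candidate encoding")

    # Normalise Windows (\r\n) and bare \r line endings to Unix (\n).
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Simple heuristic: pick the candidate delimiter that occurs most
    # often in the first line (falls back to "," when none occur).
    first_line = text.split("\n", 1)[0]
    delimiter = max(",;\t|", key=first_line.count)

    # Parse, skipping completely empty lines.
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    if not rows:
        return [], [], []

    if has_header:
        header, rows = rows[0], rows[1:]
    else:
        # No header row: generate placeholder column names.
        width = max(len(r) for r in rows)
        header = [f"col{i + 1}" for i in range(width)]

    # Separate rows whose column count does not match the header.
    good = [r for r in rows if len(r) == len(header)]
    bad = [r for r in rows if len(r) != len(header)]
    return header, good, bad
```

For example, calling `read_delimited(open(path, "rb").read())` on a semicolon-separated Windows export would return the header, the well-formed rows, and any rows with a missing or extra column, so that incomplete records can be reviewed rather than silently loaded.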

Workshops

This simple workflow introduces some of the key steps used for ingesting text files. The use case covers text files where you need to:

• change the layout / structure

• extract key values to create new data stream fields

• cut strings or replace values

• change the data stream field type
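To make the four steps above concrete outside of the workflow tool, here is a rough Python sketch that applies them to a single line of text. The input layout (`ORD-1001|2024-03-05|Total: 1,250.50 EUR`), the function name `transform`, and the output field names are all invented for illustration and do not come from the workshop files.

```python
import re
from datetime import datetime


def transform(line):
    """Turn one raw text line into a typed record (illustrative only)."""
    # 1. Change the layout: split the single line into separate parts.
    order_ref, day, amount_text = line.strip().split("|")

    # 2. Extract key values into new fields (amount and currency).
    match = re.search(r"([\d,]+\.\d+)\s+([A-Z]{3})", amount_text)
    if match is None:
        raise ValueError(f"no amount found in {amount_text!r}")
    amount_raw, currency = match.groups()

    # 3. String cut / replace: drop the "ORD-" prefix and the
    #    thousands separators.
    order_id = order_ref[len("ORD-"):]
    amount_clean = amount_raw.replace(",", "")

    # 4. Change the field types: text to integer, date, and float.
    return {
        "order_id": int(order_id),
        "order_date": datetime.strptime(day, "%Y-%m-%d").date(),
        "amount": float(amount_clean),
        "currency": currency,
    }
```

Applied to every line of a file, a function like this yields records whose fields already have the right types, which is the same end state the workflow aims for.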
