Pentaho Data Integration
InstallationBusiness AnalyticsCToolsData CatalogData QualityLLMs
  • Overview
    • Pentaho Data Integration ..
  • Data Integration
    • Getting Started
      • Configuring PDI UI
      • KETTLE Variables
    • Concepts & Terminolgy
      • Hello World
      • Logging
      • Error Handling
    • Data Sources
      • Flat Files
        • Text
          • Text File Input
          • Text File Output
        • Excel
          • Excel Writer
        • XML
          • Read XML
        • JSON
          • Read JSON
      • Databases
        • CRUID
          • Database Connections
          • Create DB
          • Read DB
          • Update DB
          • Insert / Update DB
          • Delete DB
        • SCDs
          • SCDs
      • Object Stores
        • MinIO
      • SMB
      • Big Data
        • Hadoop
          • Apache Hadoop
    • Enrich Data
      • Merge
        • Merge Streams
        • Merge Rows (diff)
      • Joins
        • Cross Join
        • Merge Join
        • Database Join
        • XML Join
      • Lookups
        • Database Lookups
      • Scripting
        • Formula
        • Modified JavaScript Value
        • User Defined Java Class
    • Enterprise Solution
      • Jobs
        • Job - Hello World
        • Backward Chaining
        • Parallel
      • Parameters & Variables
        • Parameters
        • Variables
      • Scalability
        • Run Configurations
        • Partition
      • Monitoring & Scheduling
        • Monitoring & Scheduling
      • Logging
        • Logging
      • Dockmaker
        • BA & DI Servers
      • Metadata Injection
        • MDI
    • Plugins
      • Hierarchical Data Type
  • Use Cases
    • Streaming Data
      • MQTT
        • Mosquitto
        • HiveMQ
      • AMQP
        • RabbitMQ
      • Kafka
        • Kafka
    • Machine Learning
      • Prerequiste Tasks
      • AutoML
      • Credit Card
    • RESTful API
    • Jenkins
    • GenAI
  • Reference
    • Page 1
Powered by GitBook
On this page
  1. Data Integration
  2. Data Sources
  3. Flat Files
  4. Excel

Excel Writer

Working with Excel ..

PreviousExcelNextXML

Last updated 1 month ago

Workshop - Microsoft Excel Writer

Steel Wheels intends to automate their half-yearly Sales and Expenses Report. The process for extracting, transforming, and loading (ETL) data has been delineated into several workflows. Upon the completion of these workflows, data will be automatically populated in a predefined Excel template.

In this workshop, you will configure the following steps:

  • Excel Writer

  • Block Step


Create a new Transformation

Any one of these actions opens a new Transformation tab for you to begin designing your transformation.

  • By clicking File > New > Transformation

  • By using the CTRL-N hot key

Excel Template

The various stages of the transformation write data to a template.xlsx. The template has 2 worksheets:

  • Sales Chart - this worksheet creates a 3D stacked graph

  • SourceData - worksheet.

SourceData - the datasheet. Transformations write to the required cells that are used to create the graph.

Write Year

The first workflow is to write the current Year to the SourceData worksheet in the template.xlsx

Generate rows - Year

Generate rows outputs a specified number of rows. By default, the rows are empty; however, they can contain several static fields. This step is used primarily for testing purposes. It may be useful for generating a fixed number of rows, for example, you want exactly 12 rows for 12 months. Sometimes you may use Generate Rows to generate one row that is an initiating point for your transformation.

  1. Drag the ‘Generate Rows’ step onto the canvas.

  2. Double-click on the step, and configure the following properties:

  1. Close Step.


Summary

  • Generates a record that holds the Year value – 2023 – in the year stream field.

  • The Excel template will also need to be formatted yyyy to interpret the Date.

Excel Writer - Year

The Microsoft Excel Writer step writes incoming rows into an MS Excel file. It supports both the xls and xlsx file formats. The xlsx format is usually a good choice when working with template files, as it is more likely to preserve charts and other misc objects in the output.

The proprietary (binary) xls format is not as well understood and deciphered, so moving/replicating nontrivial xls content in non-MS software environments is usually problematic.

  1. Drag the ‘Excel writer’ step onto the canvas.

  2. Create a hop from the ‘Year’ step.

  3. Double-click on the step, and configure the following properties:

  • The filename sets the name and output directory for the Excel Workbook:

[current transformation directory]\Sales_and_Expenses_2023.xlsx

  • The option: replace with new output file, results in a new Excel Workbook file overwriting previous version every time the Transformation is Run.

  • Write the output to the SourceData Excel worksheet.

  • Use template.xlsx as the template.

[current transformation directory]\template.xlsx

4. Click on the Content tab, and configure the following properties:

  1. Click on ‘Get Fields’ button.

  2. Click OK.

Write Sales

The second workflow is to repeat the workflow for Sales, with the addition of a ‘Blockstep’, which ensures the ‘Write Year’ workflow has been completed, before writing to the template.xlsx.

Text file input - Read Sales

The Text File Input step is used to read data from a variety of different text-file types. The most commonly used formats include Comma Separated Values (CSV files) generated by spreadsheets and fixed width flat files.

The Text File Input step provides you with the ability to specify a list of files to read, or a list of directories with wild cards in the form of regular expressions. In addition, you can accept filenames from a previous step making filename handling more even more generic.

  1. Drag the ‘Text File Input’ step onto the canvas.

  2. Double-click on the step, and configure the following properties:

  1. Click on the Content tab, and configure the following properties:

  • Ensure the Header is selected.

  • No empty rows

  • Mixed Format

  1. Click on the Fields tab, and click on ‘Get Fields’ button:

Returns the Header values as stream fields.

  1. Click OK.

Block until Step Finish - Wait Year

This step simply waits until all the step copies that are specified in the dialog have finished. You can use it to avoid the natural concurrency (parallelism) that exists between transformation step copies.

  1. Drag the ‘Block this step until steps finish’ step onto the canvas.

  2. Create a hop from the ‘Read Sales’ step.

  3. Double-click on the step, and configure the following properties:

This will result in the workflow being blocked until the Write Year step has been completed.

Excel Writer - Write Sales

The Microsoft Excel Writer step writes incoming rows into an MS Excel file. It supports both the xls and xlsx file formats. The xlsx format is usually a good choice when working with template files, as it is more likely to preserve charts and other misc objects in the output.

The proprietary (binary) xls format is not as well understood and deciphered, so moving/replicating nontrivial xls content in non-MS software environments is usually problematic.

  1. Drag the ‘Excel writer’ step onto the canvas.

  2. Create a hop from the ‘Wait Year’ step.

  3. Double-click on the step, and configure the following properties:

  • The filename sets the name and output directory for the Excel Workbook:

[current transformation directory]\Sales_and_Expenses_2023.xlsx

  • The option: Use existing file for writing, writes the data to the existing Sales_and_Expenses_2023.xlsx file

  • Write the output to the SourceData Excel worksheet.

  • No need for the template as the data is now being written to the file.

  1. Click on the Content tab, and configure the following properties:

  1. Click on the ‘Get Fields’ button.

  2. Delete the productline field, as its not required. The template already has the fieldname and you are just writing the data, starting at cell B5.

  3. Click OK.

Write Expenses

The third workflow is like workflow 2, however, for Expenses.

Read Expenses

Start with loading the Expenses data into the datastream.

Block Step - sales

Block Step until Sales data has finished being written to the template.

Write Expenses to Sourcedata ..

Run Transformation

This workshop illustrates the concept of parallelism. As each Transformation step is initialized in parallel, i.e. their own thread, then blocking steps must be added to the workflow, to prevent data being simultaneously written to the Excel SourceData worksheet.

  1. Click the Run button in the Canvas Toolbar.

  2. Open the Sales_and_Expenses_2023.xlsx file.

➡️ Next:

➡️ Next:

➡️ Next:

➡️ Next:

➡️ Next:

➡️ Next:

➡️ Next:

➡️ Next:

➡️ Next:

Write Year
Excel Writer - Year
Write Sales
Block Step - Wait Year
Excel Writer - Write Sales
Write Expenses
Block Step - Sales
Excel Writer - Expenses
Finally RUN the transformation
Excel transformation
Excel Template
SourceData sheet
Year
Generate rows - Year
Excel Writer - Year
Excel Writer - cell
Write Sales
Text File input - sales
Text File input - Content
Text File input - fields
Block step
Excel Writer - Sales
Excel Writer - Content
Write Expenses
Text File input - Expenses
Text File input - Content
Text File input - Fields
Block 'Write Sales'
Excel Writer - Expenses
Excel Writer - Content
Excel Book