Kafka on docker-compose is an approach that lets you configure and run Apache Kafka and its components, such as Kafka brokers, ZooKeeper, Kafka Connect, and more, in a Docker environment. With docker-compose you define and run multi-container Docker applications, where each service (like a Kafka broker or ZooKeeper) is declared in a docker-compose.yml file.
This approach simplifies the network configuration between these services and gives you a reproducible, isolated environment for development, testing, and potentially production scenarios. It also makes it easy to scale Kafka brokers and other services within your cluster.
Let's start with a simple cluster that consists of 1 broker and 1 controller, running in ZooKeeper mode.
Execute the following command (adds JMX agents for Prometheus & Grafana).
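The exact command and compose file are part of the workshop materials. As a rough reference only, a minimal docker-compose.yml for a single-broker, ZooKeeper-mode cluster might look like the sketch below; the image names, versions, and ports are assumptions, and the JMX exporter wiring for Prometheus & Grafana is omitted:

# Illustrative single-broker cluster in ZooKeeper mode.
# Image tags and ports are assumptions; the workshop's own file may differ.
version: "3.8"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"   # client connections from the host
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Two listeners: one for containers on the Docker network, one for the host.
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,PLAINTEXT_HOST://0.0.0.0:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      # Required for a single-broker cluster.
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

With a file like this in place, `docker-compose up -d` starts both containers, and clients on the host can reach the broker at localhost:9092.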
The Conduktor Console is a powerful user interface (UI) for managing Apache Kafka. It simplifies Kafka-related tasks and provides visibility into your Kafka ecosystem. Here are some key features:
Data Exploration:
- The Console allows you to explore Kafka data easily.
- You can troubleshoot and debug Kafka issues.
- Drill down into topic data and monitor streaming applications.

Single Interface:
- Concentrates all Kafka APIs into a unified interface.
- Provides a streamlined experience for Kafka users.

Collaborative Kafka Platform:
- Offers autonomy, automation, and advanced features for developers.
- Ensures security, standards, and governance for platform teams.
- Complements your Kafka provider with versatile solutions.
The Kafka Producer step allows you to publish messages in near real-time to a Kafka broker. Within a transformation, the Kafka Producer step publishes a stream of records to a Kafka topic.
Open the following transformation:
~/Workshop--Data-Integration/Labs/Module 3 - Data Sources/Streaming Data/04 Kafka/tr_kafka_producer.ktr
Double-click the Kafka Producer step and configure it with the following settings.
Setup

Connection
Select a connection type:
- Direct: Specify the Bootstrap servers from which you want to receive the Kafka streaming data.
- Cluster: Specify the Hadoop cluster configuration from which you want to retrieve the Kafka streaming data. In a Hadoop cluster configuration, you can specify information such as host names and ports for HDFS, Job Tracker, security, and other big data cluster components. Multiple servers can be specified if they are part of the same cluster.

Client ID
A unique client identifier, used to set up a durable connection path to the server when making requests and to distinguish between different clients.

Topic
The category to which records are published.

Key Field
In Kafka, all messages can be keyed, allowing messages to be distributed to partitions based on their keys in the default routing scheme. If no key is present, messages are distributed to partitions randomly. (See the Python sketch after the Options description below.)

Message Field
The individual record contained in a topic.
Options
The Options tab enables you to secure the connection to the broker.
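To relate the Key Field and Message Field settings to the underlying Kafka client behavior, here is a rough stand-alone sketch using the kafka-python library. The broker address, topic name, and sample key/value are assumptions for illustration; inside PDI, the Kafka Producer step handles all of this for you.

# Illustrative only: what the Kafka Producer step does conceptually.
# Broker address, client id, topic, and sample key/message are assumptions.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",   # "Direct" connection: bootstrap servers
    client_id="pdi-workshop-producer",    # Client ID setting
)

# Key Field: records with the same key are routed to the same partition.
# Message Field: the record payload published to the topic.
producer.send(
    "sensor-data",                        # Topic setting
    key=b"sensor-42",
    value=b'{"temperature": 21.5}',
)
producer.flush()   # ensure delivery before the script exits
producer.close()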
The Kafka Consumer step pulls streaming data from Kafka into a transformation. Within the Kafka Consumer step, you enter the path of a child transformation that is executed, in near real-time, for each batch of messages according to the message batch size or duration. The child transformation must start with the Get records from stream step.
Additionally, from the Kafka Consumer step, you can select a step in the child transformation to stream records back to the parent transformation. This allows records processed by a Kafka Consumer step in a parent transformation to be passed downstream to any other steps included within the same parent transformation.
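Conceptually, the step's batching behaves like the following kafka-python sketch, which polls for a batch of records and would hand each batch to the child transformation. The broker address, topic, group id, and batch size/duration values are assumptions for illustration.

# Illustrative only: batch-style consumption similar to what the
# Kafka Consumer step drives. Broker, topic, group id, and the
# batch size/duration values are assumptions.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sensor-data",
    bootstrap_servers="localhost:9092",
    group_id="pdi-workshop-consumer",
    auto_offset_reset="earliest",
)

try:
    while True:
        # Collect up to 100 records or wait at most 1000 ms
        # (roughly the step's batch size / duration settings).
        batch = consumer.poll(timeout_ms=1000, max_records=100)
        for tp, records in batch.items():
            for record in records:
                # In PDI, this is where the child transformation would
                # receive the batch via the Get records from stream step.
                print(tp.topic, record.key, record.value)
finally:
    consumer.close()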
Open the following transformation:
~/Workshop--Data-Integration/Labs/Module 3 - Data Sources/Streaming Data/04 Kafka/tr_kafka_consumer.ktr
The Get records from stream step returns records that were previously generated, in this case, by the Kafka Consumer step.
Open the following transformation:
~/Workshop--Data-Integration/Labs/Module 3 - Data Sources/Streaming Data/04 Kafka/tr_process_sensor_data.ktr