Kafka

Apache Kafka

Apache Kafka is a distributed streaming platform that enables users to publish, subscribe to, store, and process streams of records in real time. It is designed to handle high volumes of data efficiently, making it an excellent choice for large-scale message processing tasks. Kafka is built around the concept of a distributed commit log that is partitioned and replicated across the nodes of a cluster, providing fault tolerance, durability, and high throughput for both publishing and subscribing.

Producers send messages to topics, from which consumers read and process them. This makes Kafka suitable for a variety of applications, including real-time analytics, event sourcing, log aggregation, and more.
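
As a point of reference, the sketch below shows a minimal Java producer built on the standard Apache Kafka client library (org.apache.kafka:kafka-clients). The broker address localhost:9092 and the topic name sensor-readings are assumptions chosen for illustration, not values from this documentation.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address for a local test cluster
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one record to the hypothetical "sensor-readings" topic
            producer.send(new ProducerRecord<>("sensor-readings", "device-1", "{\"temp\": 21.5}"),
                (metadata, exception) -> {
                    if (exception == null) {
                        System.out.printf("Wrote to %s-%d@%d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                    } else {
                        exception.printStackTrace();
                    }
                });
            producer.flush();
        }
    }
}
```

The send() call is asynchronous; the callback reports the partition and offset the record was appended to once the broker has acknowledged it.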

Apache Kafka in KRaft (Kafka Raft metadata) mode simplifies Kafka's operational model by eliminating the need for an external ZooKeeper cluster.

Below are the key components of a Kafka cluster running in KRaft mode:

Controller

Manages the state of the cluster and is responsible for administrative tasks such as topic creation, deletion, and partition reassignment. In KRaft mode, the controller logic is embedded within the Kafka broker itself, leveraging the Raft protocol for consensus.
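
Topic creation is one of those administrative operations. As a rough illustration, the sketch below issues a create-topic request through the Kafka AdminClient API; the broker address, topic name, partition count, and replication factor are assumptions for a single-node test setup.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic: 3 partitions, replication factor 1 (single broker)
            NewTopic topic = new NewTopic("sensor-readings", 3, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
            System.out.println("Topic created: " + topic.name());
        }
    }
}
```

The request is sent to a broker, and the active controller carries out the actual metadata change and partition assignment.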

Broker

A server in the Kafka cluster that stores data and serves client requests. In KRaft mode, brokers can both handle standard client requests and participate in cluster management operations.
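
For the consuming side of those client requests, a minimal consumer sketch (again assuming a broker at localhost:9092 and the hypothetical sensor-readings topic and pdi-demo-group consumer group) might look like this:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "pdi-demo-group");          // hypothetical consumer group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("sensor-readings"));
            while (true) {
                // Each poll is a fetch request served by the brokers that lead the subscribed partitions
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s-%d@%d key=%s value=%s%n",
                        record.topic(), record.partition(), record.offset(),
                        record.key(), record.value());
                }
            }
        }
    }
}
```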

KRaft Mode

KRaft (Kafka Raft) is Apache Kafka's consensus-based controller architecture that eliminates the dependency on Apache ZooKeeper. Introduced as a major architectural enhancement to Kafka, KRaft consolidates metadata management within Kafka itself, replacing the traditional ZooKeeper-based controller.

This simplifies Kafka's deployment and operational model by reducing the number of components to maintain, improves scalability by removing bottlenecks associated with ZooKeeper, and enhances performance through optimized metadata handling.
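
As an illustration, a combined broker-and-controller node in KRaft mode is typically configured along these lines. This is a minimal server.properties sketch for a single-node setup; the node ID, ports, and log directory are assumptions, not values prescribed by this documentation.

```properties
# Roles this node plays: one process can act as both broker and controller
process.roles=broker,controller
node.id=1

# Raft quorum of controller nodes, in the form node.id@host:port
controller.quorum.voters=1@localhost:9093

# Client traffic on 9092, controller (Raft) traffic on 9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
advertised.listeners=PLAINTEXT://localhost:9092
controller.listener.names=CONTROLLER
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT

# Where log segments and the cluster metadata log are stored
log.dirs=/tmp/kraft-combined-logs
```

Before the first start, the storage directory also has to be formatted with a cluster ID; the kafka-storage.sh tool that ships with Kafka handles that step.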

Schema Registry

Schema Registry provides a centralized repository for managing and validating schemas for topic message data, and for serialization and deserialization of the data over the network.

Schema Registry is not part of Apache Kafka itself, but there are several open-source implementations to choose from.

Schema Registry lives outside of and separately from your Kafka brokers. Your producers and consumers still talk to Kafka to publish data (messages) to topics and to read it back. Concurrently, they can also talk to Schema Registry to send and retrieve the schemas that describe the data models for those messages.

Schema Registry is a distributed storage layer for schemas which uses Kafka as its underlying storage mechanism. Some key design decisions:

• Assigns a globally unique ID to each registered schema. Allocated IDs are guaranteed to be monotonically increasing and unique, but not necessarily consecutive.

• Kafka provides the durable backend, and functions as a write-ahead changelog for the state of Schema Registry and the schemas it contains.

• Schema Registry is designed to be distributed, with a single-primary architecture; ZooKeeper or Kafka coordinates the primary election, depending on the configuration.
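
To make the producer-registry interaction concrete, the sketch below serializes Avro records through a Schema Registry, using the Confluent serializer (io.confluent.kafka.serializers.KafkaAvroSerializer) as one of the available open-source options. The registry URL, topic name, and schema are assumptions for illustration, and the example requires the kafka-avro-serializer dependency in addition to kafka-clients.

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AvroProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Confluent's Avro serializer registers and retrieves schemas automatically
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");              // assumed registry address

        // Hypothetical Avro schema describing the message payload
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"SensorReading\",\"fields\":["
          + "{\"name\":\"deviceId\",\"type\":\"string\"},"
          + "{\"name\":\"temp\",\"type\":\"double\"}]}");

        GenericRecord reading = new GenericData.Record(schema);
        reading.put("deviceId", "device-1");
        reading.put("temp", 21.5);

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            // On first send the serializer registers the schema and embeds its ID in the message
            producer.send(new ProducerRecord<>("sensor-readings", "device-1", reading));
            producer.flush();
        }
    }
}
```

Because each message carries the schema ID, consumers can fetch the matching schema from the registry when deserializing, rather than shipping the full schema with every record.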

(Diagrams: Kafka Architecture, Kafka Cluster, Schema Registry)