Choose one of the following options to integrate Kafka with a Data Platform Stack.
Create a new Kafka cluster within a Data Platform Stack.
Integrate an externally created Kafka cluster with a Data Platform Stack.
(Option 1) Create a new Kafka cluster within the Data Platform (Recommended option)
Choose this option to create a new Kafka cluster that is managed by the Data Platform Kubernetes infrastructure. The cluster integrates seamlessly with other Data Platform components and does not affect applications outside the Data Platform.
There are several Kafka sub-components to be configured.
Kafka Broker
A Kafka cluster consists of one or more servers (Kafka brokers) running Kafka. Producers are processes that push records into Kafka topics on the brokers, and consumers pull records off those topics. Running a single Kafka broker is possible, but it does not provide all the benefits of a multi-broker cluster, such as data replication, resiliency, and higher throughput.
Kafka Settings
Use Timeout in Seconds to control how long the Kubernetes controller waits for Kafka and its sub-components to be created successfully. If the timeout expires before creation completes, the Kafka component is treated as failed. If timeouts occur frequently, increase this value.
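The timeout behavior described above can be pictured as a polling loop with a deadline. The following is only a minimal sketch of that general pattern, not the Data Platform's actual controller logic; the function name `wait_until_ready`, the `is_ready` callback, and the `poll_interval` parameter are hypothetical.

```python
import time
from typing import Callable


def wait_until_ready(is_ready: Callable[[], bool],
                     timeout_seconds: float,
                     poll_interval: float = 1.0) -> bool:
    """Poll is_ready() until it returns True or timeout_seconds elapse.

    Returns True if the component became ready in time, False on timeout.
    """
    # Use a monotonic clock so the deadline is unaffected by system clock changes.
    deadline = time.monotonic() + timeout_seconds
    while True:
        if is_ready():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # timed out: the caller treats the component as failed
        # Never sleep past the deadline.
        time.sleep(min(poll_interval, remaining))
```

A larger `timeout_seconds` gives slow-starting components more polling attempts before the wait is abandoned, which is why raising the value helps when timeouts are frequent.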
Kafka Additional Information
Learn more about Kafka configuration here.
See the configuration for running Kafka in production here.
ZooKeeper
Kafka uses ZooKeeper to manage the cluster: it coordinates the brokers and the cluster topology. ZooKeeper acts as a consistent file system for configuration information and is used for leader election of broker topic partition leaders.
See the configuration for running ZooKeeper in production here.
KSQL Server
KSQL is the streaming SQL engine for Apache Kafka®. It provides an easy-to-use yet powerful interactive SQL interface for stream processing on Kafka without the need to write code in a programming language such as Java or Python. KSQL is scalable, elastic, fault-tolerant, and real-time. It supports a wide range of streaming operations, including data filtering, transformations, aggregations, joins, windowing, and sessionization.
Kafka Connect
Kafka Connect connects Kafka with external systems such as databases, key-value stores, search indexes, and file systems. Kafka Connect makes it simple to use existing connector implementations for common data sources and sinks to move data into and out of Kafka.
Schema Registry
The Schema Registry stores a versioned history of all schemas and allows schemas to evolve according to the configured compatibility settings. It also provides expanded Avro support.
See the configuration for running Schema Registry in production here.
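To illustrate what a compatibility setting constrains, the sketch below checks a simplified form of backward compatibility between two Avro-style record schemas: a field added in the new version must carry a default so that data written with the old version can still be read. This is a deliberately reduced illustration and not the Schema Registry's own algorithm. The function name `is_backward_compatible` is hypothetical, and the check ignores Avro type promotion, aliases, and nested records.

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Simplified check: can new_schema read records written with old_schema?

    Assumption: both schemas are flat Avro-style records of the form
    {"fields": [{"name": ..., "type": ..., "default": ...}, ...]}.
    """
    old_fields = {field["name"]: field for field in old_schema["fields"]}
    for field in new_schema["fields"]:
        old_field = old_fields.get(field["name"])
        if old_field is None:
            # A newly added field needs a default to fill in for old records.
            if "default" not in field:
                return False
        elif old_field["type"] != field["type"]:
            # Simplification: any type change is rejected (no type promotion).
            return False
    # Fields removed in the new schema are simply ignored when reading.
    return True


v1 = {"fields": [{"name": "id", "type": "long"}]}
v2 = {"fields": [{"name": "id", "type": "long"},
                 {"name": "email", "type": ["null", "string"], "default": None}]}
v3 = {"fields": [{"name": "id", "type": "long"},
                 {"name": "email", "type": "string"}]}
```

Under these assumptions, `v2` is accepted as a successor of `v1`, while `v3` is rejected because its new `email` field has no default.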
Kafka Tools
Kafka Topic UI is a web tool for browsing Kafka topics and understanding what is happening on your cluster: finding topics, viewing topic metadata, browsing topic data (Kafka messages), viewing topic configuration, and downloading Kafka data. It is a web tool for the confluentinc/kafka-rest proxy.
The original Kafka Topic UI dashboard is accessible to anyone who has the Kafka Topic UI URL. This creates security and maintenance issues because unauthorized personnel can access the dashboard. To secure access, snapblocs adds basic authentication to the Kafka Topic UI dashboard login, using the Admin User Name and Admin Password defined in the stack configuration.
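As a rough illustration of what basic authentication means for a client, the sketch below builds the request header from a user name and password, assuming the dashboard uses standard HTTP Basic authentication. The helper name `basic_auth_header` and the sample credentials are hypothetical.

```python
import base64


def basic_auth_header(user_name: str, password: str) -> dict:
    """Build an HTTP Basic Authorization header from a user name and password."""
    # HTTP Basic joins the credentials with a colon and Base64-encodes them.
    credentials = f"{user_name}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials for illustration only.
headers = basic_auth_header("admin", "secret")
```

Base64 is an encoding, not encryption, so credentials sent this way are only protected when the dashboard is reached over HTTPS.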
Motivation for CPU requests and limits
Configure the CPU requests and limits of the Containers that run in the cluster to make efficient use of the CPU resources available on the cluster nodes. Keeping a Pod's CPU request low gives the Pod a good chance of being scheduled. Having a CPU limit that is greater than the CPU request accomplishes two things:
The Pod can have bursts of activity, making use of CPU resources that happen to be available.
The amount of CPU resources a Pod can use during a burst of activity is limited to a reasonable amount.
If a CPU limit is not specified for a Container, one of these situations applies:
The Container has no upper bound on the CPU resources it can use. The Container could use all of the CPU resources available on the node where it is running.
The Container runs in a namespace with a default CPU limit, and the Container is automatically assigned the default limit. Cluster administrators can use a LimitRange to specify a default value for the CPU limit.
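The relationship between requests, limits, and the namespace default described above can be sketched as a small calculation. This is an illustration only, not Kubernetes code: the function names `parse_cpu` and `cpu_settings` are hypothetical, and `parse_cpu` handles only plain core counts ("2", "0.5") and millicore values ("500m").

```python
from typing import Optional


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as "500m" or "2" into a number of cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000  # millicores to cores
    return float(quantity)


def cpu_settings(request: str,
                 limit: Optional[str] = None,
                 namespace_default_limit: Optional[str] = None) -> dict:
    """Resolve the effective CPU limit and the burst headroom above the request."""
    request_cores = parse_cpu(request)
    # An explicit container limit wins; otherwise fall back to the namespace default.
    chosen = limit if limit is not None else namespace_default_limit
    limit_cores = parse_cpu(chosen) if chosen is not None else None
    if limit_cores is not None and limit_cores < request_cores:
        raise ValueError("CPU limit must not be lower than the CPU request")
    # With no limit at all, the container is bounded only by the node's CPUs.
    burst = None if limit_cores is None else limit_cores - request_cores
    return {"request": request_cores, "limit": limit_cores, "burst_headroom": burst}
```

For example, `cpu_settings("250m", "1")` reports a request of 0.25 cores, a limit of 1.0 core, and 0.75 cores of burst headroom, while `cpu_settings("500m")` reports no limit, which corresponds to the unbounded case.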
(Option 2) Integrate externally created Kafka cluster with a Data Platform Stack
Choose this option to use an existing, externally created Kafka cluster with the Data Platform. Be aware that any changes made to the cluster, and any use of it by the Data Platform, may affect external applications or systems that depend on the same cluster. For example, new Kafka topics that the Data Platform adds may conflict with topics used by those external applications. Kafka messages produced or consumed by the Data Platform may also degrade the performance of those applications.
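One way to reduce the risk of topic conflicts is to compare the topic names already present on the external cluster with the names the Data Platform is expected to create before integrating. The sketch below shows only the comparison. The function name `find_topic_conflicts` and all topic names are hypothetical, and obtaining the two lists is left to whatever tooling is available for the cluster.

```python
from typing import Iterable, List


def find_topic_conflicts(existing_topics: Iterable[str],
                         planned_topics: Iterable[str]) -> List[str]:
    """Return topic names that appear in both lists, sorted alphabetically."""
    # Kafka topic names are case-sensitive, so names are compared exactly.
    return sorted(set(existing_topics) & set(planned_topics))


# Hypothetical topic names for illustration only.
existing = ["orders", "payments", "audit-log"]
planned = ["audit-log", "platform-events", "orders"]
conflicts = find_topic_conflicts(existing, planned)
```

An empty result means no name collisions were found; it says nothing about the performance impact of the additional traffic.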
What's Next?