How to configure a Kafka component

Kafka is a distributed streaming platform used to publish and subscribe to streams of records. It provides fault-tolerant storage by replicating topic log partitions across multiple servers.
Use it to stream data to other data platforms, such as a Data Lake, or to different on-prem or Cloud data centers in real time.
Choose one of the following options to integrate Kafka with a Data Platform Stack:
  1. Create a new Kafka cluster within a Data Platform Stack.
  2. Integrate externally created Kafka cluster with a Data Platform Stack.
(Option 1) Create a new Kafka cluster within the Data Platform (Recommended option)
Choose this option to create a new Kafka cluster that is managed by the Data Platform Kubernetes infrastructure and integrates seamlessly with other Data Platform components, without affecting applications outside the Data Platform.
There are several Kafka sub-components to be configured.
Kafka Broker
A Kafka cluster consists of one or more servers (Kafka brokers) running Kafka. Producers are processes that push records into Kafka topics within the broker; a consumer pulls records off a Kafka topic. Running a single Kafka broker is possible, but it forgoes the benefits a Kafka cluster provides, such as data replication, resiliency, and higher throughput.
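As a quick illustration of the broker's producer/consumer model, here is a minimal sketch using the kafka-python client that pushes one record into a topic and reads it back. The broker address localhost:9092 and the topic name demo-topic are illustrative assumptions, not values from this stack.

    # Minimal produce/consume round trip using the kafka-python client.
    # The broker address and topic name below are illustrative.
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("demo-topic", b"hello from a producer")
    producer.flush()  # block until the record is acknowledged by the broker

    consumer = KafkaConsumer(
        "demo-topic",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",  # start from the beginning of the topic
        consumer_timeout_ms=5000,      # stop iterating if no records arrive
    )
    for record in consumer:
        print(record.topic, record.partition, record.offset, record.value)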
Kafka Settings
  • Use Timeout in Seconds to control how long the Kubernetes controller waits for the successful creation of Kafka and its sub-components. If the timeout is exceeded, the Kafka component is marked as failed. If it times out frequently, increase this value.
  • Kafka Additional Information
    • Learn more about Kafka configuration here
    • See the configuration for running Kafka in production here.

Zookeeper
Kafka uses ZooKeeper to manage the cluster: ZooKeeper coordinates the brokers and the cluster topology. It acts as a consistent file system for configuration information and is used for leader election of broker topic partition leaders.
See the configuration for running ZooKeeper in production here.
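To see the coordination data ZooKeeper holds for Kafka, the sketch below lists the broker registrations that a ZooKeeper-based Kafka cluster keeps under /brokers/ids. It assumes the kazoo client library and a ZooKeeper ensemble reachable at 127.0.0.1:2181; both are illustrative.

    # Peek at the broker registrations Kafka keeps in ZooKeeper.
    # The ensemble address is illustrative; /brokers/ids is the znode
    # layout used by ZooKeeper-based Kafka clusters.
    import json
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()
    try:
        for broker_id in zk.get_children("/brokers/ids"):
            data, _stat = zk.get(f"/brokers/ids/{broker_id}")
            info = json.loads(data)
            print(broker_id, info.get("host"), info.get("port"))
    finally:
        zk.stop()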

KSQL Server
KSQL is the streaming SQL engine for Apache Kafka®. It provides an easy-to-use yet powerful interactive SQL interface for stream processing on Kafka without the need to write code in a programming language such as Java or Python. KSQL is scalable, elastic, fault-tolerant, and real-time. It supports a wide range of streaming operations, including data filtering, transformations, aggregations, joins, windowing, and sessionization.
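As an illustration of the interactive SQL interface, the sketch below submits a KSQL statement to the KSQL server's REST endpoint. The server address localhost:8088, the topic name pageviews, and the stream definition are illustrative assumptions.

    # Submit a KSQL statement to the KSQL server's REST endpoint.
    # Assumes an existing Kafka topic "pageviews" with JSON-encoded values.
    import requests

    statement = """
    CREATE STREAM pageviews_stream (user_id VARCHAR, page VARCHAR)
      WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');
    """

    resp = requests.post(
        "http://localhost:8088/ksql",
        headers={"Content-Type": "application/vnd.ksql.v1+json"},
        json={"ksql": statement, "streamsProperties": {}},
    )
    resp.raise_for_status()
    print(resp.json())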
Kafka Connect
Kafka Connect connects Kafka with external systems such as databases, key-value stores, search indexes, and file systems. Kafka Connect makes it simple to use existing connector implementations for common data sources and sinks to move data into and out of Kafka.
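As a sketch of how an existing connector moves data into Kafka, the example below registers the bundled FileStreamSource connector with a Connect worker's REST API. The worker address localhost:8083, the file path, and the topic name are illustrative assumptions.

    # Register a connector with a Kafka Connect worker via its REST API.
    # Assumes the bundled FileStreamSource connector is available.
    import requests

    connector = {
        "name": "demo-file-source",
        "config": {
            "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
            "tasks.max": "1",
            "file": "/tmp/source.txt",  # lines of this file become records
            "topic": "connect-demo",    # destination Kafka topic
        },
    }

    resp = requests.post("http://localhost:8083/connectors", json=connector)
    resp.raise_for_status()
    print(resp.json()["name"], "created")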
Schema Registry
The Schema Registry stores a versioned history of all schemas and allows schemas to evolve according to the configured compatibility settings, with expanded Avro support.
See the configuration for running Schema Registry in production here.
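A minimal sketch of the versioning workflow: register an Avro schema under a subject, then fetch the latest version back. The registry address localhost:8081, the subject name demo-value, and the schema itself are illustrative assumptions.

    # Register an Avro schema and fetch it back from the Schema Registry.
    import json
    import requests

    avro_schema = {
        "type": "record",
        "name": "Demo",
        "fields": [{"name": "id", "type": "long"}],
    }

    resp = requests.post(
        "http://localhost:8081/subjects/demo-value/versions",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        json={"schema": json.dumps(avro_schema)},  # schema is sent as a JSON-escaped string
    )
    resp.raise_for_status()
    print("registered with global id:", resp.json()["id"])

    latest = requests.get("http://localhost:8081/subjects/demo-value/versions/latest")
    print(latest.json()["version"], latest.json()["schema"])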

Kafka Tools
  1. Kafka Topic UI is a web tool to browse Kafka topics and understand what's happening on your cluster: finding topics, viewing topic metadata, browsing topic data (Kafka messages), viewing topic configuration, and downloading Kafka data. It is a web front end for the confluentinc/kafka-rest proxy.
  2. To provide secure access to the Kafka Topic UI dashboard, supply an Admin User Name and Admin Password. Out of the box, the Kafka Topic UI dashboard is accessible to anyone who has its URL, which creates security and maintenance issues because unauthorized personnel can reach it. snapblocs adds basic authentication to the Kafka Topic UI login using the User Name and Password defined in the stack configuration (see the sketch after this list).
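As a sketch of what the authenticated setup looks like from a client's perspective, the example below lists topics through the kafka-rest proxy that backs the Kafka Topic UI, passing basic-auth credentials. The proxy URL and credentials are illustrative assumptions; how snapblocs exposes the endpoint may differ.

    # List topics through the kafka-rest proxy behind the Kafka Topic UI,
    # using the basic-auth credentials defined in the stack configuration.
    # URL and credentials below are illustrative.
    import requests

    resp = requests.get(
        "http://kafka-rest.example.com/topics",
        auth=("admin", "s3cret"),  # Admin User Name / Admin Password
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json())  # e.g. ["demo-topic", "connect-demo", ...]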
Motivation for CPU requests and limits
Configure CPU requests and limits for the Containers that run in the cluster so that the CPU resources available on the cluster nodes are used efficiently. Keeping a Pod's CPU request low gives the Pod a good chance of being scheduled. Setting a CPU limit greater than the CPU request accomplishes two things (see the sketch following these lists):
  • The Pod can have bursts of activity, making use of CPU resources that happen to be available.
  • The amount of CPU resources a Pod can use during a burst of activity is limited to a reasonable amount.
If a CPU limit is not specified for a Container, one of these situations can result:
  • The Container has no upper bound on the CPU resources it can use. The Container could use all of the CPU resources available on the node where it is running.
  • The Container runs in a namespace with a default CPU limit, and the Container is automatically assigned the default limit. Cluster administrators can use a LimitRange to specify a default value for the CPU limit.
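A minimal sketch of the request-low / limit-higher pattern using the Kubernetes Python client; the container name, image, and resource values are illustrative assumptions.

    # Express the request-low / limit-higher pattern with the Kubernetes
    # Python client: a 250m request keeps the Pod easy to schedule, while
    # a 1-CPU limit caps bursts. Name, image, and values are illustrative.
    from kubernetes import client

    container = client.V1Container(
        name="kafka-broker",
        image="confluentinc/cp-kafka:latest",
        resources=client.V1ResourceRequirements(
            requests={"cpu": "250m", "memory": "1Gi"},  # used by the scheduler
            limits={"cpu": "1", "memory": "2Gi"},       # hard cap enforced at runtime
        ),
    )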
Replication is the process of keeping multiple copies of the data so that it remains available if one of the brokers goes down and cannot serve requests. In Kafka, replication happens at the partition granularity, i.e., copies of a partition are maintained at multiple broker instances using the partition's write-ahead log.
Leader for a partition: every partition has exactly one partition leader, which handles all read/write requests for that partition. If the replication factor is greater than 1, the additional partition replicas act as partition followers. Every partition follower reads messages from the partition leader (acting like a kind of consumer) and does not serve consumers of that partition; only the partition leader serves reads and writes.
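To make the replication-factor mechanics concrete, the sketch below creates a topic whose partitions are each replicated to three brokers, so every partition has one leader and two followers. It uses kafka-python's admin client; the broker address and topic name are illustrative assumptions, and the cluster must have at least three brokers.

    # Create a topic whose partitions are replicated across three brokers.
    # Each partition gets one leader plus two followers.
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    admin.create_topics([
        NewTopic(name="replicated-demo", num_partitions=3, replication_factor=3)
    ])
    admin.close()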
Kafka Broker and ZooKeeper scaling: the number of nodes should always be odd, and scaling operations are performed offline, with no producer or consumer connections.

(Option 2) Integrate externally created Kafka cluster with a Data Platform Stack
Choose this option to integrate an externally created Kafka cluster with the Data Platform. Be aware that any changes and usage while running the Data Platform may impact external applications (systems) that depend on the Kafka cluster. For example, new Kafka topics that the Data Platform adds may conflict with those external applications, and Kafka messages produced or consumed by the Data Platform may affect their performance.
What's Next?

Related Articles
  • How to configure Kafka+ Platform
  • How to customize Kafka+ Platform
  • How to configure Istio component
  • How to configure a Grafana component
  • How to configure an Elastic Stack component