Kafka Series (Part 1: Fundamentals)

This is the first tutorial in the Kafka Series. It mostly covers the theoretical fundamentals of Kafka before diving into the actual implementation. 🤓
❔ What is Kafka?
[Reference: https://www.cloudkarafka.com/blog/part1-kafka-for-beginners-what-is-apache-kafka.html]
Kafka is an open-source distributed system consisting of servers and clients that communicate via a binary protocol over TCP. Some of its primary use cases are real-time data processing, message queuing, and event sourcing.
Kafka is based on an event-streaming architecture: events/data are received from various sources and stored durably on the Kafka broker servers, to be consumed in real time or later, as required. Here, events simply refer to the data/messages sent by clients to the Kafka server.
By default, Kafka retains messages for 7 days; a size-based retention limit can also be configured (it is disabled by default). Both settings can be tuned per topic or cluster-wide. Performance effectively remains constant irrespective of the retention period.
⚙️ Different ways to use Kafka
- To publish (write) and subscribe to (read) streams of events (see the minimal sketch after this list).
- To store streams of events for later use.
- To process streams of events as they occur, in real time.
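To make the publish/subscribe idea concrete, here is a minimal, hypothetical producer sketch using the third-party kafka-python client. The broker address (localhost:9092) and topic name (demo-events) are assumptions for illustration only.

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Connect to a (hypothetical) local broker and JSON-encode event values.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish (write) one event to the "demo-events" topic.
producer.send("demo-events", {"message": "hello kafka"})
producer.flush()  # block until the event has actually been delivered
```

A matching consumer sketch appears below, in the section on Kafka's clients.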
🆚 KRaft vs Zookeeper: Key Differences
KRaft and Zookeeper are two different modes of managing cluster metadata in Apache Kafka, ensuring fault tolerance and high availability.
Kafka traditionally relied on Apache Zookeeper, but with the introduction of KRaft (Kafka Raft Metadata) mode, Kafka has eliminated this dependency, simplifying the architecture and improving scalability.
Note: Quorum refers to the minimum number of controller nodes that must agree on a transaction (e.g., leader election, metadata updates) before it can be executed. For example, with 3 controller nodes, at least 2 must agree. In KRaft, the quorum ensures fault tolerance and consistency by requiring a majority of nodes to agree on decisions.
Below are the key differences between KRaft and Zookeeper-based Kafka:
| Feature | Zookeeper-Based Kafka | KRaft Mode |
|---|---|---|
| Metadata Management | Zookeeper stores and manages cluster metadata. | Metadata is managed internally by a Raft-based quorum of Kafka controller nodes. |
| Leader Election | Zookeeper handles leader election for partitions. | Leader election is managed by the Quorum Controller using the Raft protocol. |
| Architecture | Requires separate Zookeeper deployments. | No need for Zookeeper; Kafka is self-contained. |
| Operational Complexity | Higher due to managing both Kafka and Zookeeper. | Lower, as Zookeeper is no longer required. |
| Scalability | Limited by Zookeeper's performance and scalability. | Improved scalability due to a streamlined architecture. |
| Fault Tolerance | Zookeeper ensures fault tolerance for metadata. | Fault tolerance is handled by the Raft quorum within Kafka. |
| Deployment | Requires separate setup and maintenance of Zookeeper. | Simplified deployment with no external dependencies. |
| Maturity | Well-established and widely used in production. | Still evolving, but increasingly stable and recommended for new deployments. |
🛠️ Why We Use KRaft in This Implementation
In this tutorial, we are using KRaft mode for the following reasons (an illustrative broker configuration follows this list):
- Simplified Architecture: By eliminating Zookeeper, KRaft reduces the operational complexity of running a Kafka cluster.
- Self-Contained System: Kafka manages its own metadata and leader election, making it easier to deploy and maintain.
- Improved Scalability: KRaft is designed to handle larger clusters and higher throughput more efficiently.
- Future-Proofing: KRaft is the future of Kafka, and adopting it now ensures compatibility with upcoming features and improvements.
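For reference, a single-node KRaft setup boils down to a handful of broker settings. The snippet below is a minimal, illustrative server.properties sketch; the node id, ports, and log directory are placeholders, not a production configuration.

```properties
# One node acting as both broker and controller (combined mode).
process.roles=broker,controller
node.id=1
# The controller quorum: node id @ host:controller-port
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
controller.listener.names=CONTROLLER
advertised.listeners=PLAINTEXT://localhost:9092
log.dirs=/tmp/kraft-combined-logs
```

Before the first start, the storage directory also has to be formatted with a cluster ID (Kafka ships a kafka-storage tool for this).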
Visual Representation of Consensus Protocol in KRaft Mode:
- Basic: https://raft.github.io/
- Advanced: https://thesecretlivesofdata.com/raft/
🏭 Practical use cases of Kafka in top companies
[Reference: https://axual.com/apache-kafka-use-cases-in-real-life/]
Check the official list of companies using Kafka (Kafka's practical use cases 😲): https://kafka.apache.org/powered-by
LinkedIn's Use Case: "Apache Kafka is used at LinkedIn for activity stream data and operational metrics. This powers various products like LinkedIn Newsfeed, LinkedIn Today in addition to our offline analytics systems like Hadoop."
🧰 Key Components in Kafka

1. Servers:
Since Kafka is a distributed system, it can be deployed on one or more servers (even across regions) as a group of brokers together with "Kafka Connect".
1.1. Broker: responsible for storing events and serving them to clients.
1.2. Kafka Connect: for integrating Kafka with existing applications, databases, and/or other Kafka servers/clusters.
1.3. Kafka Controller: part of the quorum controller, which manages metadata and ensures consistency across the cluster using the Raft protocol. It handles tasks like leader election, topic creation, and partition management.
2. Clients:
There are 2 types of clients involved in Kafka:
2.1. Producer: This client is the publisher who sends the events.
2.2. Consumer: This client is the subscriber who receives/listens to the events.
There can be any number of producers and consumers in a given architecture.
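To complement the producer sketch above, here is a minimal, hypothetical consumer using the kafka-python client; the broker address, topic name, and group id are assumptions for illustration.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Subscribe to the (hypothetical) "demo-events" topic.
consumer = KafkaConsumer(
    "demo-events",
    bootstrap_servers="localhost:9092",
    group_id="demo-group",            # consumers in the same group share the work
    auto_offset_reset="earliest",     # start from the oldest retained event
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Listen for events and print where each one came from.
for record in consumer:
    print(record.topic, record.partition, record.offset, record.value)
```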
🔍 How do different consumers filter different types of events in an architecture?
Events are linked to topics. In simple words, a topic refers to a group (e.g., Car) and events refer to the actual data (e.g., car speed, fuel level, etc.).
Let's take another example to understand it better. A hospital wants to keep track of its patients' health-device metrics, so it creates the following topics on the Kafka broker: HeartRate-Monitor, BloodPressure-Sensor, and Temperature-Sensor. Each sensor sends events to its corresponding topic, and each event contains the sensor ID, the sensor value, and a timestamp.
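In code, one of those sensors could publish its readings like this. The sketch below is hypothetical and uses the kafka-python client; the broker address, device ID, and reading are made up for illustration.

```python
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# One heart-rate reading sent to its corresponding topic.
event = {
    "sensor_id": "hr-monitor-42",   # which device produced the reading
    "value": 72,                    # beats per minute
    "timestamp": int(time.time()),  # when the reading was taken
}
producer.send("HeartRate-Monitor", event)
producer.flush()
```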
I hope you have a better understanding of Topics and Events now :)
🚀 Increasing Kafka Topics' Performance
To enable parallel processing within a topic, we use partitions (similar to the concept of sharding in databases). With multiple brokers, these partitions are spread across different broker servers to allow scalability. Partitioning means splitting the topic's stream of events into n separate groups, typically based on the event key or some application-level logic.
It is important to pick the number of partitions carefully, as a very large number can degrade overall performance.
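Here is a sketch, again using kafka-python, of creating a partitioned topic and sending keyed events; the topic name, partition count, and keys are assumptions for illustration.

```python
from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic  # pip install kafka-python

# Create a topic split into 6 partitions (single-broker setup, so no replication).
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([NewTopic(name="patient-vitals", num_partitions=6, replication_factor=1)])

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Events with the same key always land in the same partition, so per-patient
# ordering is preserved while different patients are processed in parallel.
producer.send("patient-vitals", key=b"patient-123", value=b'{"heart_rate": 72}')
producer.send("patient-vitals", key=b"patient-456", value=b'{"heart_rate": 88}')
producer.flush()
```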
📊 Benchmarks
Detailed Benchmarks comparing with other alternatives. Check here
🚨 Fault Tolerance and High Availability
Every topic (more precisely, each of its partitions) can be replicated across multiple brokers/regions so that, in case of a failure, at least one broker still holds a copy of the original data. (A popular replication factor is 3, meaning there are 3 copies of the data.)
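As a sketch, assuming a cluster with at least 3 brokers, a topic with replication factor 3 could be created and written to as below (kafka-python again; the topic name and settings are illustrative, not a recommendation).

```python
from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic  # pip install kafka-python

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(
        name="critical-events",
        num_partitions=3,
        replication_factor=3,                       # 3 copies of every partition
        topic_configs={"min.insync.replicas": "2"},  # 2 replicas must confirm each write
    )
])

# acks="all" makes the producer wait until the in-sync replicas have the event.
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
producer.send("critical-events", b"must not be lost")
producer.flush()
```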
🔑 Data Security
[Reference: https://developer.ibm.com/learningpaths/develop-kafka-apps/more-resources/auth-kafka]
To prevent man-in-the-middle (MITM) attacks, Kafka offers options such as TLS/SSL encryption, SASL authentication (including Kerberos), and keystores/truststores.
Also, we don't want producers/consumers to interact with data or topics they are not allowed to access. For this reason, Kafka supports Access Control Lists (ACLs), which can be managed via the Admin API (conceptually similar to ACLs in AWS).
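On the client side, enabling encryption and authentication mostly comes down to a few connection settings. The sketch below uses kafka-python; the broker address, certificate path, SASL mechanism, and credentials are placeholders for your own setup.

```python
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="broker.example.com:9093",
    security_protocol="SASL_SSL",        # encrypt traffic and authenticate the client
    ssl_cafile="/path/to/ca.pem",        # CA certificate used to verify the broker
    sasl_mechanism="SCRAM-SHA-512",
    sasl_plain_username="alice",
    sasl_plain_password="secret",
)
```

Topic-level permissions (ACLs) are then configured on the broker side, e.g., with the kafka-acls tool shipped with Kafka or via the Admin API, so a client authenticated as one principal can only read/write the topics it has been granted.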
🤔 FAQ
1. Why does Kafka use TCP rather than UDP or other protocols?
[Source: kafka.apache.org/protocol.html#protocol_philosophy]
- Client implementers can make use of some of TCP's more advanced features, such as multiplexing requests and polling many connections simultaneously.
- It makes it very hard to add and test new features if they have to be ported across many protocol implementations.
- The mapping between binary log format and wire protocol is managed carefully in Kafka and this would not be possible with systems like Protocol Buffers or Thrift.
- Flow control in TCP ensures that Kafka's producers and consumers operate at compatible speeds, aligning with Kafka's requirements for durability and guaranteed ordering.
📌 Conclusion
I understand this was all theory and might not be very exciting, but understanding these concepts is important before diving into the actual Kafka implementation.