Due to some recent changes, I have been forced to learn about Apache Kafka. It has been suddenly and, despite the fact I love to learn new things, a bit in a hurry for my taste. For this reason, I have planned to write a few articles trying to sort my learnings and my ideas and put some order into the chaos. During this extremely quick learning period, I have learned more complex concepts before than basics ones, and tricks and exceptions before than rules. I hope these articles help me to get a better understanding of the technology, and to be able to help anyone else in the same position.
What is Apache Kafka?
On the project’s webpage, we can see the following definition:
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Maybe, a more elaborated definition is that Apache Kafka is an event streaming platform, a highly scalable and durable system capable of continuously ingesting gigabytes of events per second from various sources and making them available in milliseconds, used to collect, store and process real-time data streams at scale. It has numerous use cases, including distributed logging, stream processing and Pub-Sub messaging.
At this point, probably still confused, I know how you feel, it happens to me the first time. Let’s see if we can dig a bit more into all of this. Splitting the definition into smaller parts, we have:
- An event is something that happens in a system. It can be anything from a particular user logging in, to a sensor registering a temperature.
- A data pipeline reliably processes and moves data from one system to another, in this case, events.
- A streaming application is an application that consumes streams of data.
How does Apache Kafka work?
Apache Kafka combines two messaging models:
- Queueing: Queuing allows for data processing to be distributed across many consumer instances, making it highly scalable. However, traditional queues are not multi-subscriber.
- Publish-subscribe: The publish-subscribe approach is multi-subscriber, but because every message goes to every subscriber it cannot be used to distribute work across multiple worker processes.
To be able to combine both approaches, Apache Kafka uses a partitioned log model, considering a log an ordered sequence of events distributed into different partitions. This allows the existence of multiple subscribers (one per partition) and, while it guarantees the delivery order per partition, this is not guaranteed per log.
This model provides replayability, which allows multiple independent applications to process the events in a log (data stream) at their own pace.
RabbitMQ vs Apache Kafka
One of the first questions that pop up into my mind when reviewing Apache Kafka for the first time was “why do not use RabbitMQ?” (I have. more extensive experience with it). Over time, I discovered that this was not the right question, a more accurate one should be “what each service excels at?“. The second question is closer to the mentality we should have as software engineers.
The one sentence to remember when deciding is that RabbitMQ’s message broker design excelled in use cases that had specific routing needs and per-message guarantees, whereas Kafka’s append-only log allowed developers access to the stream history and more direct stream processing.
RabbitMQ seems better suited for long-running tasks when it is needed to run reliable background jobs. And for communication and integration within, and between applications, i.e as a middleman between microservices. While Apache Kafka seems ideal if a framework for storing, reading (re-reading), and analyzing streaming data is needed.
Some more concrete scenarios are:
- Apache Kafka:
- High-throughput activity tracking: Kafka can be used for a variety of high-volume, high-throughput activity-tracking applications. For example, you can use Kafka to track website activity (its original use case), ingest data from IoT sensors, monitor patients in hospital settings, or keep tabs on shipments.
- Stream processing: Kafka enables you to implement application logic based on streams of events. You might keep a running count of types of events or calculate an average value over the course of an event that lasts several minutes. For example, if you have an IoT application that incorporates automated thermometers, you could keep track of the average temperature over time and trigger alerts if readings deviate from a target temperature range.
- Event sourcing: Kafka can be used to support event sourcing, in which changes to an app state are stored as a sequence of events. So, for example, you might use Kafka with a banking app. If the account balance is somehow corrupted, you can recalculate the balance based on the stored history of transactions.
- Log aggregation: Similar to event sourcing, you can use Kafka to collect log files and store them in a centralized place. These stored log files can then provide a single source of truth for your app.
- Complex routing: RabbitMQ can be the best fit when you need to route messages among multiple consuming apps, such as in a microservices architecture. RabbitMQ consistent hash exchange can be used to balance load processing across a distributed monitoring service, for example. Alternative exchanges can also be used to route a portion of events to specific services for A/B testing.
- Legacy applications: Using available plug-ins (or developing your own), you can deploy RabbitMQ as a way to connect consumer apps with legacy apps. For example, you can use a Java Message Service (JMS) plug-in and JMS client library to communicate with JMS apps.
Apache Kafka installation (on macOS)
The articles are going to be focused on a macOS system, but steps to install Apache Kafka in other systems should be easy enough, and the rest of the concepts should be indistinguishable from one system to another.
We are going to install Apache Kafka using Homebrew, a package manager for macOS. The installation is as simple as running:
brew install kafka
This command will install two services on our system:
- Apache Kafka
We can run them as services using the Homebrew commands ‘brew services’ (similar to the ‘systemctl’ if you are a Linux user), or execute them manually.
To run them as services:
brew services start zookeeper brew services start kafka
To run them manually:
zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties kafka-server-start /usr/local/etc/kafka/server.properties
Everything should work out of the box. With this, we will have a simple Apache Kafka server running locally.
I have seen some people having connection problems when starting Apache Kafka, if this happens, we can edit the file ‘
/usr/local/etc/kafka/server.properties‘, find the line ‘
#listeners=PLAINTEXT://:9092‘, uncommented and run it again.
With these steps, we should be ready to start making some progress.