Apache Kafka (II): Interesting learning resources

As explained in my last article, I needed to learn Apache Kafka at a fast pace. My plan after that was to write a series of articles ordering my learnings, thoughts and ideas, but something has changed. Although I am confident in my communication skills, I found a series of videos by Confluent (I have no relation with them) that I would have loved to discover when I was starting. They are four mini-courses that explain Apache Kafka, Kafka Streams and ksqlDB. I would have loved to have them before I started because they are clear, concise and platform agnostic (not tied to their commercial platform).

For all these reasons, I have decided it is better to point people to these resources instead of just writing a few more articles. Everyone who follows the blog knows that I do not share other people’s material too often, but they do a good job explaining all the concepts with examples related to Apache Kafka, and there is no point rewriting what they have already done. There is, maybe, room for one last article in this series with some configurations, tips and pitfalls (I have a long list of those).

The video courses are:

  • Apache Kafka 101: A brief introduction, an explanation of the basic concepts and the basic usage.
  • Kafka Streams 101: A nice course, full of examples, about Kafka Streams and related concepts.
  • ksqlDB 101: Same format as the previous one, focused on ksqlDB.
  • Inside ksqlDB: Goes one step further into ksqlDB, showing some of its operations.

Apache Kafka (I): Intro + Installation

Due to some recent changes, I have been forced to learn about Apache Kafka. It has been sudden and, despite the fact that I love learning new things, a bit more rushed than I would like. For this reason, I have planned to write a few articles trying to sort my learnings and ideas and put some order into the chaos. During this extremely quick learning period, I have learned complex concepts before basic ones, and tricks and exceptions before rules. I hope these articles help me get a better understanding of the technology, and help anyone else in the same position.

What is Apache Kafka?

On the project’s webpage, we can see the following definition:

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Maybe a more elaborate definition is that Apache Kafka is an event streaming platform: a highly scalable and durable system capable of continuously ingesting gigabytes of events per second from various sources and making them available in milliseconds, used to collect, store and process real-time data streams at scale. It has numerous use cases, including distributed logging, stream processing and Pub-Sub messaging.

At this point, you are probably still confused. I know how you feel; it happened to me the first time too. Let’s see if we can dig a bit deeper into all of this. Splitting the definition into smaller parts, we have:

  • An event is something that happens in a system. It can be anything from a particular user logging in, to a sensor registering a temperature.
  • A data pipeline reliably processes and moves data from one system to another, in this case, events.
  • A streaming application is an application that consumes streams of data.

How does Apache Kafka work?

Apache Kafka combines two messaging models:

  • Queuing: Queuing allows for data processing to be distributed across many consumer instances, making it highly scalable. However, traditional queues are not multi-subscriber.
  • Publish-subscribe: The publish-subscribe approach is multi-subscriber, but because every message goes to every subscriber it cannot be used to distribute work across multiple worker processes.

To combine both approaches, Apache Kafka uses a partitioned log model, where a log is an ordered sequence of events split into different partitions. This allows the existence of multiple subscribers (each one assigned one or more partitions) and, while the delivery order is guaranteed within a partition, it is not guaranteed across the whole log.

This model provides replayability, which allows multiple independent applications to process the events in a log (data stream) at their own pace.
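
As a quick illustration of these ideas, here is a minimal sketch using the plain Java client (this assumes the “kafka-clients” dependency and a broker on localhost:9092; the class, topic, key and group names are mine): events sent with the same key always land in the same partition, so their relative order is preserved, while consumers sharing a “group.id” split the partitions between them and independent groups can replay the whole log at their own pace.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class PartitionedLogExample {

    public static void main(String[] args) {
        // Producer: events with the same key ("sensor-1") always go to the
        // same partition, so their order is preserved for that key.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("example_topic", "sensor-1", "21.5"));
            producer.send(new ProducerRecord<>("example_topic", "sensor-1", "21.7"));
        }

        // Consumer: members of the same group split the partitions (queueing),
        // while different groups each receive all the events (publish-subscribe).
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "example_group");
        consumerProps.put("auto.offset.reset", "earliest"); // replay the log from the beginning
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("example_topic"));
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}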

RabbitMQ vs Apache Kafka

One of the first questions that popped into my mind when reviewing Apache Kafka for the first time was “why not use RabbitMQ?” (I have more extensive experience with it). Over time, I discovered that this was not the right question; a more accurate one is “what does each service excel at?”. The second question is closer to the mentality we should have as software engineers.

The one sentence to remember when deciding is that RabbitMQ’s message broker design excels in use cases that have specific routing needs and per-message guarantees, whereas Kafka’s append-only log gives developers access to the stream history and more direct stream processing.

RabbitMQ seems better suited for long-running tasks, when reliable background jobs need to be run, and for communication and integration within and between applications, i.e. as a middleman between microservices. Apache Kafka, on the other hand, seems ideal when a framework for storing, reading (and re-reading), and analyzing streaming data is needed.

Some more concrete scenarios are:

  • Apache Kafka:
    • High-throughput activity tracking: Kafka can be used for a variety of high-volume, high-throughput activity-tracking applications. For example, you can use Kafka to track website activity (its original use case), ingest data from IoT sensors, monitor patients in hospital settings, or keep tabs on shipments.
    • Stream processing: Kafka enables you to implement application logic based on streams of events. You might keep a running count of types of events or calculate an average value over the course of an event that lasts several minutes. For example, if you have an IoT application that incorporates automated thermometers, you could keep track of the average temperature over time and trigger alerts if readings deviate from a target temperature range.
    • Event sourcing: Kafka can be used to support event sourcing, in which changes to an app state are stored as a sequence of events. So, for example, you might use Kafka with a banking app. If the account balance is somehow corrupted, you can recalculate the balance based on the stored history of transactions.
    • Log aggregation: Similar to event sourcing, you can use Kafka to collect log files and store them in a centralized place. These stored log files can then provide a single source of truth for your app.
  • RabbitMQ:
    • Complex routing: RabbitMQ can be the best fit when you need to route messages among multiple consuming apps, such as in a microservices architecture. RabbitMQ consistent hash exchange can be used to balance load processing across a distributed monitoring service, for example. Alternative exchanges can also be used to route a portion of events to specific services for A/B testing.
    • Legacy applications: Using available plug-ins (or developing your own), you can deploy RabbitMQ as a way to connect consumer apps with legacy apps. For example, you can use a Java Message Service (JMS) plug-in and JMS client library to communicate with JMS apps.

Apache Kafka installation (on macOS)

The articles are going to be focused on a macOS system, but the steps to install Apache Kafka on other systems should be easy enough to find, and the rest of the concepts are the same from one system to another.

We are going to install Apache Kafka using Homebrew, a package manager for macOS. The installation is as simple as running:

brew install kafka

This command will install two services on our system:

  • Apache Kafka
  • Zookeeper

We can run them as services using the Homebrew command ‘brew services’ (similar to ‘systemctl’ if you are a Linux user), or execute them manually.

To run them as services:

brew services start zookeeper
brew services start kafka

To run them manually:

zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties
kafka-server-start /usr/local/etc/kafka/server.properties

Everything should work out of the box. With this, we will have a simple Apache Kafka server running locally.

I have seen some people having connection problems when starting Apache Kafka. If this happens, we can edit the file ‘/usr/local/etc/kafka/server.properties’, find the line ‘#listeners=PLAINTEXT://:9092’, uncomment it and start the server again.
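
If we want to double-check from code that the broker is reachable, we can use a minimal sketch based on the Kafka AdminClient (this assumes the “kafka-clients” library is on the classpath; the class name is mine):

import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class KafkaConnectionCheck {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Lists the topics of the local broker; an exception here usually means
        // the server is not running or the listener is not properly configured.
        try (AdminClient admin = AdminClient.create(props)) {
            Set<String> topics = admin.listTopics().names().get();
            System.out.println("Connected. Topics: " + topics);
        }
    }
}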

With these steps, we should be ready to start making some progress.


Spring Boot with Kafka

Today we are going to build a very simple demo code using Spring Boot and Kafka.

The application is going to contain a simple producer and consumer. In addition, we will add a simple endpoint to test our development and configuration.

Let’s start.

The project is going to be using:

  • Java 14
  • Spring Boot 2.3.4

A good place to start generating our project is Spring Initializr. There we can easily create the skeleton of our project, adding some basic information about it. We will be adding the following dependencies:

  • Spring Web.
  • Spring for Apache Kafka.
  • Spring Configuration Processor (Optional).

Once we are done filling in the form, we only need to generate the code and open it in our favourite code editor.

As an optional dependency, I have added the “Spring Boot Configuration Processor” to be able to define some extra properties that we will be using in the “application.properties” file. As I have said, it is optional: we are going to be able to define and use the properties without it, but it gets rid of the warning about them not being defined. Up to you.

With the three dependencies, our “pom.xml” should look something like this:

<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.kafka</groupId>
  <artifactId>spring-kafka</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-configuration-processor</artifactId>
  <optional>true</optional>
</dependency>

The next step is going to be creating our Kafka producer and consumer to be able to send messages using the distributed event streaming platform.

For the producer code we are just going to create a basic method to send a message making use of the “KafkaTemplate” offered by Spring.

@Service
public class KafkaProducer {
    public static final String TOPIC_NAME = "example_topic";

    private final KafkaTemplate<String, String> kafkaTemplate;

    public KafkaProducer(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void send(final String message) {
        kafkaTemplate.send(TOPIC_NAME, message);
    }
}
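
The topic used by the producer needs to exist in the broker (later on, we will create it through the Docker image configuration). As an alternative, and only as a hedged sketch of how it could be done with Spring Kafka (the “KafkaTopicConfig” class name is mine), a “NewTopic” bean can be declared so that the auto-configured “KafkaAdmin” creates the topic on startup:

import org.apache.kafka.clients.admin.NewTopic;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.TopicBuilder;

@Configuration
public class KafkaTopicConfig {

    // Creates the topic at application startup if it does not already exist.
    @Bean
    public NewTopic exampleTopic() {
        return TopicBuilder.name(KafkaProducer.TOPIC_NAME)
                .partitions(1)
                .replicas(1)
                .build();
    }
}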

The consumer code is going to be even simpler thanks to the “@KafkaListener” annotation provided by Spring.

@Service
public class KafkaConsumer {

    @KafkaListener(topics = {KafkaProducer.TOPIC_NAME}, groupId = "example_group_id")
    public void read(final String message) {
        System.out.println(message);
    }
}

And finally, to be able to test it, we are going to define a Controller to invoke the Kafka producer.

@RestController
@RequestMapping("/kafka")
public class KafkaController {

    private final KafkaProducer kafkaProducer;

    public KafkaController(KafkaProducer kafkaProducer) {
        this.kafkaProducer = kafkaProducer;
    }

    @PostMapping("/publish")
    public void publish(@RequestBody String message) {
        Objects.requireNonNull(message);

        kafkaProducer.send(message);
    }
}

With this, all the necessary code is done. Let’s now go for the configuration properties and the necessary Docker images to run all of this.

First, the “application.properties” file. It is going to contain some basic configuration properties for the producer and consumer.

server.port=8081

spring-boot-kafka.config.kafka.server=localhost
spring-boot-kafka.config.kafka.port=9092

# Kafka consumer properties
spring.kafka.consumer.bootstrap-servers=${spring-boot-kafka.config.kafka.server}:${spring-boot-kafka.config.kafka.port}
spring.kafka.consumer.group-id=example_group_id
spring.kafka.consumer.auto-offset-reset=earliest
spring.kafka.consumer.key-deserializer=org.apache.kafka.common.serialization.StringDeserializer
spring.kafka.consumer.value-deserializer=org.apache.kafka.common.serialization.StringDeserializer

# kafka producer properties
spring.kafka.producer.bootstrap-servers=${spring-boot-kafka.config.kafka.server}:${spring-boot-kafka.config.kafka.port}
spring.kafka.producer.key-serializer=org.apache.kafka.common.serialization.StringSerializer
spring.kafka.producer.value-serializer=org.apache.kafka.common.serialization.StringSerializer

The property “spring.kafka.consumer.group-id” deserves a closer look: it matches the “groupId” we have previously defined on the consumer.

The “key-deserializer”/“value-deserializer” and “key-serializer”/“value-serializer” properties define the deserialization and serialization classes.

Finally, the “spring-boot-kafka.config.kafka.server” and “spring-boot-kafka.config.kafka.port” properties have been defined to avoid repetition. These are the properties that show a warning about not being defined.

To fix it, if we have previously added the “Spring Configuration Processor” dependency, we can now add the file:

spring-boot-kafka/src/main/resources/META-INF/additional-spring-configuration-metadata.json

With the definition of these properties:

{
  "properties": [
    {
      "name": "spring-boot-kafka.config.kafka.server",
      "type": "java.lang.String",
      "description": "Location of the Kafka server."
    },
    {
      "name": "spring-boot-kafka.config.kafka.port",
      "type": "java.lang.String",
      "description": "Port of the Kafka server."
    }
  ]
}

We are almost there. The only thing remaining is Apache Kafka itself. Because we do not want to deal with the complexity of setting up an Apache Kafka server, we are going to leverage the power of Docker and create a “docker-compose” file to run it for us:

version: '3'

services:
  zookeeper:
    image: wurstmeister/zookeeper
    ports:
      - 2181:2181
    container_name: zookeeper

  kafka:
    image: wurstmeister/kafka
    ports:
      - 9092:9092
    environment:
      KAFKA_ADVERTISED_HOST_NAME: localhost
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_CREATE_TOPICS: "example_topic:1:1"

As we can see, it is simple stuff, nothing out of the ordinary: two images, one for Zookeeper and one for Apache Kafka, the definition of some ports (remember to match them with the ones in the “application.properties” file) and a few environment variables needed by the Apache Kafka image, including the creation of our topic.

With this, we can run the docker-compose file (‘docker-compose up’) and we will obtain two running containers: one for Zookeeper and one for Apache Kafka.

Now, we can test the endpoint we have built previously. In this case, to make it simple, we are going to use curl:

`curl -d '{"message":"Hello from Kafka!"}' -H "Content-Type: application/json" -X POST http://localhost:8081/kafka/publish`

The result should be the message printed by the consumer in the application console.

And this is all. You can find the full source code here.

Enjoy it!
