Klife - A Kafka client with performance gains over 10x

Announcing Klife: A High-Performance Kafka Client for Elixir

I’m thrilled to share the next chapter in a journey that began a couple of years ago—introducing Klife, a Kafka client for Elixir, designed with a strong focus on performance and efficiency.

Currently, Klife supports producer functionalities, and I’m excited to start designing the consumer system later this year, though it may take some time before it’s ready for release.

The journey began with Klife Protocol, where I explored reimplementing Kafka’s protocol entirely in Elixir. This process opened up new possibilities to optimize the protocol itself, and I found ways to achieve more with less overhead.

Key to Klife’s Performance Gains: Batching

One of Klife’s primary performance improvements is achieved through batching. By bundling data destined for the same broker into a single TCP request, Klife significantly enhances performance, especially in high-throughput scenarios.

In benchmarking against two fantastic, well-established libraries, brod and kafka_ex, Klife achieved up to 15x higher throughput for producing messages. You can find details about the benchmarking setup and results in the project’s documentation.

Key features

Klife includes several features designed to improve both performance and ease of use:

  • Efficient Batching: Batches data to the same broker in a single TCP request per producer.
  • Minimal Resource Usage: Only one connection per broker for each client, optimizing resource usage.
  • Synchronous and Asynchronous Produce Options: Synchronous produces return the offset, while asynchronous produces support callbacks (see the sketch after this list).
  • Batch Produce API: Allows batching for multiple topics and partitions.
  • Automatic Cluster and Metadata Management: Automatically adapts to changes in cluster topology and metadata.
  • Testing Utilities: Includes helper functions for testing against a real broker without complex mocking.
  • Simple Configuration: Streamlined setup for straightforward use.
  • Comprehensive Documentation: Includes examples and explanations of trade-offs.
  • Custom Partitioner per Topic: Configurable partitioning for each topic.
  • Transactional Support: Supports transactions in an Ecto-like style.
  • SASL Authentication: Currently supports plain authentication.
  • Protocol Compatibility: Supports recent protocol versions, with forward compatibility in mind.
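
To make the produce options above concrete, here is a rough sketch of what using the client can look like. Treat the module, struct, and function names as approximations of the patterns in the hexdocs (a client defined via use Klife.Client, a Klife.Record struct, and produce/produce_async/produce_batch functions); the README is the source of truth, and the connection configuration and supervision-tree setup are omitted.

    # Rough sketch only; check the hexdocs README for the exact API.
    defmodule MyApp.MyKafkaClient do
      use Klife.Client, otp_app: :my_app
    end

    # Synchronous produce: blocks until the broker acks and returns the record
    # with its offset filled in.
    rec = %Klife.Record{value: "some value", topic: "my_topic"}
    {:ok, %Klife.Record{offset: _offset}} = MyApp.MyKafkaClient.produce(rec)

    # Asynchronous produce: returns immediately; callbacks are supported for
    # handling the result.
    :ok = MyApp.MyKafkaClient.produce_async(rec)

    # Batch produce: records for several topics/partitions in one call, bundled
    # per broker into single TCP requests under the hood.
    {:ok, _recs} =
      MyApp.MyKafkaClient.produce_batch([
        %Klife.Record{value: "v1", topic: "topic_a"},
        %Klife.Record{value: "v2", topic: "topic_b", partition: 0}
      ])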

Looking ahead

I hope Klife proves useful for those in the community, especially for use cases where producer performance is crucial. While it’s currently producer-only, the efficiency and resource gains might be valuable for some applications.

I’d love to hear any feedback, ideas, or issues you may encounter while trying it out. Thank you to everyone in the Elixir and Kafka communities who has inspired and supported this journey—Klife wouldn’t exist without you! 💜

hexdocs: README — Klife v0.1.0

42 Likes

This sounds awesome.

My biggest problem with the current Kafka consumer libraries is flow control/feedback.

I work with an application that handles billions of messages a day. I am always working at the limits of the systems that I am receiving data into, e.g., Elasticsearch or Postgres. I can receive data from Kafka faster than the target systems can handle it. If there is a timeout or other error, it blows up the connection to Kafka, which then needs to be rebuilt.

I end up monitoring the load/insert time on the target system and sleeping before acking the Kafka batch. It would be great to have direct support for this kind of thing in the Kafka client.
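
For readers unfamiliar with that workaround, here is a minimal sketch of the idea (every name is a placeholder, not any particular client’s API): time the downstream insert and back off before acking when it is slow.

    defmodule MyApp.SlowSink do
      # Placeholder sketch of the "sleep before acking" workaround described above.
      # `MyApp.Sink.bulk_insert/1` stands in for the target system, and `ack_fun`
      # for however your Kafka client acknowledges a batch.
      def handle_batch(messages, ack_fun) do
        {insert_us, _result} = :timer.tc(fn -> MyApp.Sink.bulk_insert(messages) end)

        # If the target system is slow, sleep before acking so it can recover.
        # The threshold and backoff factor are arbitrary illustrations.
        if insert_us > 500_000, do: Process.sleep(div(insert_us, 1_000))

        ack_fun.(messages)
      end
    end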

This is a problem with Brod, which I mostly use now. Broadway/Flow has similar issues in general. The “demand” mechanism doesn’t deal well with messages that can’t be processed immediately.

2 Likes

This is news to me: the GenStage-based solutions don’t deal well with the fact that messages can’t be consumed as quickly as they are received?

Thanks for sharing! I feel your pain! This isn’t exactly an issue with Broadway itself, but rather a conceptual mismatch between how Kafka operates and certain use cases.

These libraries are designed to work with message queue systems where messages can be individually acknowledged, sometimes even out of order.

Kafka, on the other hand, is a streaming system where typical consumers stay at a particular offset and won’t advance unless messages are acknowledged in sequence.

This characteristic of Kafka forces a choice:

  • Don’t acknowledge, and the flow stops.
  • Acknowledge and move on, even if the message wasn’t properly handled.

If you acknowledge a message that wasn’t fully processed, Kafka doesn’t have an out-of-the-box solution for recovering it, unlike some queue systems.

To handle this purely with Kafka, here are a couple of approaches:

  • Reproduce the acknowledged-but-unprocessed message to the topic: This isn’t always ideal, but if your system is the only one consuming from the topic and can handle duplicates, it can work.

  • Use a retry topic to store the “acknowledged but not processed” messages, then consume them in a separate flow: If consumption fails again here, you can follow the first approach on this topic, because you will be the only one consuming it.

My goal with Klife’s consumer design is to eventually include these features out of the box. Ideally, something like:

use Klife.Consumer, retry_topic: {true, retry_config}

The retries would then be managed automatically, though it may be tricky to implement. Let’s see how it goes!

1 Like

As I’ve explained here Klife - A Kafka client with performance gains over 10x - #4 by oliveiragahenrique, it is not an issue of GenStage or Broadway itself, but rather the lack of an out-of-the-box solution for dealing with these cases in Kafka.

As the broadway_kafka docs say:

Handling failed messages
BroadwayKafka never stops the flow of the stream, i.e. it will always ack the messages even when they fail. Unlike queue-based connectors, where you can mark a single message as failed. In Kafka that’s not possible due to its single offset per topic/partition ack strategy. If you want to reprocess failed messages, you need to roll your own strategy. A possible way to do that is to implement Broadway.handle_failed/2 and send failed messages to a separated stream or queue for later processing.
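
To make that suggestion concrete, a minimal Broadway sketch (MyApp.process/1 and MyApp.RetryProducer.produce/2 are placeholders, and the start_link/1 producer configuration is omitted):

    defmodule MyApp.Pipeline do
      use Broadway

      alias Broadway.Message

      # `MyApp.process/1` is a placeholder for the real work (DB insert, etc.);
      # messages that fail are marked as failed instead of raising.
      @impl true
      def handle_message(_processor, %Message{} = message, _context) do
        case MyApp.process(message.data) do
          :ok -> message
          {:error, reason} -> Message.failed(message, reason)
        end
      end

      # BroadwayKafka acks these messages regardless, so push them to a retry
      # topic for later reprocessing. `MyApp.RetryProducer.produce/2` stands in
      # for whichever producer you use (brod, Klife, ...).
      @impl true
      def handle_failed(messages, _context) do
        Enum.each(messages, &MyApp.RetryProducer.produce("my_topic_retry", &1.data))
        messages
      end
    end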

1 Like

Exciting news and thanks for your work!

Kafka clients keep coming up as the biggest pain point in the BEAM ecosystem in a lot of my conversations. The issues usually aren’t performance but feature support and the like, so much so that @sasajuric even gave an entire talk about the topic: https://www.youtube.com/watch?v=iVHpFoDXim4 (you’re probably aware of this).

I’d love for us to have a great and robust kafka client, but at the same time I also believe that it’s too big for any one person to tackle.

But Kafka is, in many scenarios, a default solution, and our lacking support for it locks us out of quite a few use cases.

3 Likes

The issues usually aren’t performance but feature support

You are absolutely right about performance not being the main issue for now!

As for features, most of the key gaps are on the consumer side, and I’m still evaluating the best approach there. I believe we can leverage BEAM’s distribution model to address certain issues uniquely—particularly rebalancing.

I have a few ideas to explore further before committing to a specific path, which may take some time. I’m estimating around six months to solidify the approach and potentially another year and a half to reach a releasable state, depending on my time availability.

There are some big changes coming for Kafka, such as:

I plan to incorporate support for these from the start, leveraging the clean-slate approach I’m taking with Klife.

In short, I agree that feature support is the main pain point. For the consumer side, I can only speak to plans at this stage since it’s still in development.

For the current producer, though, I believe it’s in strong shape, with support for batch production, synchronous and asynchronous modes, idempotency with EOS, transactions, custom partitioning, and extensive performance optimizations. If there’s any producer feature you find missing, please let me know so we can look into it!

I’d love for us to have a great and robust kafka client, but at the same time I also believe that it’s too big for any one person to tackle.

I think it may be true, because Kafka is indeed a constantly evolving platform that relies on clients to take on a lot of responsibilities.

This was one of the main reasons I decided to build this project entirely from scratch, starting with a full protocol rewrite. To keep up with Kafka’s rapid development pace sustainably, especially with limited resources (time and money), it’s essential to have a complete understanding of the stack, end to end.

One project that’s been a big source of inspiration is franz-go, from the Go ecosystem. It’s an impressive Kafka client that quickly integrates a wide range of KIPs, sometimes even outpacing the official Java client—all while being 99.9% maintained by a single developer.

I’m not sure if I’ll fully achieve this goal, but it’s definitely worth a shot—and the knowledge gained along the way has already been rewarding!

5 Likes

Kafka interactions were one of the worst parts about my last job. It is good to see someone take up the mantle on this.

Brod is functional but not very good. Not their fault, Kafka is a mess and pushes a huge amount of complexity to the client, and so Kafka bindings are poor everywhere except Java.

Unfortunately messaging is difficult and I had a hard time explaining to other devs that wanted to use Broadway why that auto acking behavior is so problematic.

I’ve considered pushing toward NATS at my next opportunity due to its much better semantics and client bindings, because even if you solve the client issue you still have the back pressure problem and the maintenance nightmare that is Kafka itself (as in the server).

3 Likes

Thanks for sharing your experience!

The semantic differences between Kafka and simpler message brokers often create a tricky middle ground—it’s not exactly what’s needed, but close enough to make it work. I think this leads to Kafka being overused in cases where it isn’t an ideal fit.

Kafka is, of course, an extremely valuable and powerful tool, especially for handling very high volumes. But more often than not, I see people reaching for it prematurely, causing the sort of problems you have described.

I’m not entirely sure I understand your point here.

If you’re referring to consumers not keeping up with producers, Kafka’s pull-based approach gives consumers full control over when to retrieve new messages. So, backpressure doesn’t typically become an issue—consumers simply consume at their own pace, and any lag that builds up is usually manageable.

If you’re talking about producers overwhelming the broker, I’ve actually implemented mechanisms to address this. For instance, dynamic batching allows us to adjust batch sizes based on server response times, maximizing efficiency and reducing server load when needed. Although it’s quite rare to hit that limit, these mechanisms are in place to keep server load manageable.
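
To illustrate the idea only (a toy example, not Klife’s actual implementation): shrink the next batch size when broker responses get slow, and grow it back when they recover.

    defmodule BatchSizer do
      # Toy illustration of dynamic batching: halve the batch size after a slow
      # response, double it (up to a cap) after a fast one.
      @min_batch 100
      @max_batch 10_000
      @slow_threshold_ms 50

      def adjust(current_size, last_response_ms) when last_response_ms > @slow_threshold_ms,
        do: max(div(current_size, 2), @min_batch)

      def adjust(current_size, _last_response_ms),
        do: min(current_size * 2, @max_batch)
    end

    # A slow (120 ms) response halves the in-flight batch size from 4000 to 2000.
    BatchSizer.adjust(4_000, 120)
    #=> 2000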

If you’re open to it, I’d love to hear more about what you mean by “backpressure problems.” Any further elaboration would be most welcome! Thanks!

If you’re open to it, I’d love to hear more about what you mean by “backpressure problems.”

I don’t know if “backpressure problem” is the right term. In our application, processing one event at a time was inefficient, not to mention there were often duplicates. Each event caused a DB query, so we wanted to process many at a time. Brod returns message sets, great, and as a bonus we can deduplicate messages and save more work. The problem is that the number of messages in a message set is barely configurable, so occasionally we’d receive thousands of small messages at once on the same node, and Kafka leaves you no choice but to process them all in a timely manner and then ack them all at once, or it would rebalance you, leading to a cycle of failing to process the messages that had piled up.

There is no way to say: hey Kafka, give me a batch of messages but no more than 100 at a time, or if there was, I couldn’t find it. There are various things you can tweak, like max message size, to get fewer messages, but you have to calculate it based on your expected message size. Or you could increase the amount of time you can take to process messages, but then it’s on a case-by-case basis and the brod consumer needs special care. I also looked into just acking 100 at a time, but brod isn’t properly set up to let you ack the consumer supervisor directly, and I don’t even know if that would work.

NATS, on the other hand, has many ways to deal with this. You can tell it you are still processing a message (not an ack, more of an “I’m still working on this, don’t panic”). You can also implement request-response, so you send a message to a local processor that grabs the next 100 messages through a batch version of the topic and sends them to you.

Another problem we had was the developer friction of trying to use Kafka locally versus many devs sharing an instance, which made all interactions with it super difficult to reproduce. A single Kafka node will try to take every spare bit of memory on a laptop, not to mention randomly spin out of control and write tons of logs to your disk. We actually worked around it by reimplementing a basic Kafka in our Docker Compose setup, in pure Elixir, that basically did the same thing, and that ended up being the best we could do. In short, testing these sorts of scenarios was difficult and most issues were found in production.

2 Likes

Oh, I see now! Thanks for the explanation; this kind of insight is really valuable for the upcoming consumer features.

In my opinion, some of these issues might be addressed by tweaking the consumer API to abstract certain Kafka concepts, creating a more streamlined experience.

As I mentioned in another post, regarding consumers I can only share plans at this stage, but my current approach is not simply to serve as a “proxy” for Kafka communication. Instead, I aim to offer a “batteries-included” experience by building on the core Kafka API and leveraging the strengths of the BEAM.

Feel free to share any more Kafka pain points; I’d love to take them into account in the client design! (:

This new library has a future ahead of it, and I’m all for variety in this Kafka world.

But the second-best one, brod, is good enough for our use case. And the main issue I have is not the producer side; it’s the consumption side.

Broadway is a single-stage GenStage, right? At work we use BroadwayKafka, and I mistakenly thought that if I used Broadway I could connect another GenStage-esque consumer that would do the DB work and offer back pressure. Sadly, my understanding was wrong.

Another issue is at the DB layer. Essentially, I want to know if the DB connection pool is full and have that trigger back pressure on Kafka consumption. That way, at the risk of increasing consumer lag, at least I protect the DB, and therefore data integrity.

But no matter how I search, the only answer is to do something with Ecto’s telemetry events. Sure, the telemetry events are awesome, and I have a gut feeling I can combine them with Erlang’s counters, but I don’t know where to start.
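
One possible starting point, as a rough sketch (the telemetry event name assumes a repo named MyApp.Repo, and the threshold is arbitrary): watch Ecto’s query telemetry for long pool queue times and expose a cheap flag the Kafka consumer can poll before fetching more.

    defmodule MyApp.DBPressure do
      # Rough sketch: track whether recent Ecto queries waited a long time for a
      # pooled connection, using :telemetry plus an Erlang :counters ref.
      @slow_queue_ms 100

      def install do
        ref = :counters.new(1, [:write_concurrency])
        :persistent_term.put({__MODULE__, :counter}, ref)

        # Default Ecto event name for a repo called MyApp.Repo.
        :telemetry.attach(
          "db-pressure",
          [:my_app, :repo, :query],
          &__MODULE__.handle_event/4,
          nil
        )
      end

      def handle_event(_event, measurements, _metadata, _config) do
        queue_ms =
          System.convert_time_unit(measurements[:queue_time] || 0, :native, :millisecond)

        ref = :persistent_term.get({__MODULE__, :counter})
        # 1 means "the pool looks under pressure", 0 means it looks healthy.
        :counters.put(ref, 1, if(queue_ms > @slow_queue_ms, do: 1, else: 0))
      end

      # A consumer could check this before fetching/acking and sleep when true.
      def under_pressure? do
        :counters.get(:persistent_term.get({__MODULE__, :counter}), 1) == 1
      end
    end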

1 Like

Yes, I’m totally aware of this. As I mentioned here, I can only speak to plans for the consumer side right now, and I’m very interested in learning about the current issues the community faces with Kafka to try to create a better experience. Thanks for sharing it!

The issue of “back pressure” keeps coming up, and I’m still working to fully understand it, especially since Kafka consumers inherently support back pressure by processing messages in pull mode. This means consumers should only request new records once they’ve fully handled the previous set, so if upstream services become slow, consumption naturally slows down as well.

However, given that back pressure still is a big pain, I believe that Kafka consumers shouldn’t merely slow down in proportion to upstream services; ideally, they would slow down even further to allow upstream services some breathing room to recover.

Several people have noted that bottlenecks often appear in upstream services like databases before Kafka consumption itself becomes an issue. With that in mind, here are a few approaches I’m considering:

  • The auto-acking behavior in some Kafka libraries might not be ideal for these scenarios. If a database is slow or times out, the Kafka consumer should slow down rather than auto-acknowledging and risking overload.

  • Providing a “rate limiting” option for consumers could allow manual control over processing rates, helping to prevent Kafka consumers from overwhelming upstream services during production peaks.

  • Tracking consumption times for each consumer could enable an adaptive rate-limiting feature. For instance, if a consumer usually processes a batch of 10 records within 1 second but suddenly takes 5 seconds, we could reduce the consumption rate to give upstream services a chance to recover (a rough sketch of this idea follows below). Exposing these metrics to users could also be valuable.
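
A rough sketch of that last idea (purely illustrative, not a committed design): compare the latest batch processing time against the usual baseline and delay the next fetch when it degrades.

    defmodule AdaptiveRate do
      # Purely illustrative: if the last batch took much longer than the usual
      # baseline, wait before fetching again so upstream services can recover.
      @degradation_factor 3

      def delay_before_next_fetch(last_batch_ms, baseline_ms)
          when last_batch_ms > baseline_ms * @degradation_factor do
        # Back off roughly in proportion to how much slower processing became.
        last_batch_ms - baseline_ms
      end

      def delay_before_next_fetch(_last_batch_ms, _baseline_ms), do: 0
    end

    # A batch that usually takes ~1000 ms suddenly takes 5000 ms, so wait ~4000 ms.
    AdaptiveRate.delay_before_next_fetch(5_000, 1_000)
    #=> 4000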

Let me know if I’m properly understanding your back pressure issues, and please feel free to share any others. I’d love to take them into account as I design the consumer features for Klife.

1 Like

It’d be great to see erlkaf (GitHub - silviucpp/erlkaf: Erlang kafka driver based on librdkafka) compared against as well.

2 Likes

Yes, I’d definitely like to include that! I didn’t add it initially because, as far as I know, erlkaf doesn’t offer an official synchronous produce mode, and comparing async implementations can be a bit trickier.

I plan to run an async benchmark, though, and will definitely include erlkaf in the list!

1 Like