Klife - A Kafka client with performance gains over 10x

Announcing Klife: A High-Performance Kafka Client for Elixir

I’m thrilled to share the next chapter in a journey that began a couple of years ago—introducing Klife, a Kafka client for Elixir, designed with a strong focus on performance and efficiency.

Currently, Klife supports producer functionalities, and I’m excited to start designing the consumer system later this year, though it may take some time before it’s ready for release.

The journey began with Klife Protocol, where I explored reimplementing Kafka’s protocol entirely in Elixir. This process opened up new possibilities to optimize the protocol itself, and I found ways to achieve more with less overhead.

Key to Klife’s Performance Gains: Batching

One of Klife’s primary performance improvements is achieved through batching. By bundling data destined for the same broker into a single TCP request, Klife significantly enhances performance, especially in high-throughput scenarios.

In benchmarking against two fantastic, well-established libraries, brod and kafka_ex, Klife achieved up to 15x higher throughput for producing messages. You can find details about the benchmarking setup and results in the project’s documentation.

Key features

Klife includes several features designed to improve both performance and ease of use:

  • Efficient Batching: Batches data to the same broker in a single TCP request per producer.
  • Minimal Resource Usage: Only one connection per broker for each client, optimizing resource usage.
  • Synchronous and Asynchronous Produce Options: Synchronous produces return the offset, while asynchronous produces support callbacks (see the sketch after this list).
  • Batch Produce API: Allows batching for multiple topics and partitions.
  • Automatic Cluster and Metadata Management: Automatically adapts to changes in cluster topology and metadata.
  • Testing Utilities: Includes helper functions for testing against a real broker without complex mocking.
  • Simple Configuration: Streamlined setup for straightforward use.
  • Comprehensive Documentation: Includes examples and explanations of trade-offs.
  • Custom Partitioner per Topic: Configurable partitioning for each topic.
  • Transactional Support: Supports transactions in an Ecto-like style.
  • SASL Authentication: Currently supports plain authentication.
  • Protocol Compatibility: Supports recent protocol versions, with forward compatibility in mind.
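
To make the produce options above concrete, here is a rough sketch of what using the client can look like. Treat the module, struct, and function names as approximations of the patterns in the hexdocs (a client defined via use Klife.Client, a Klife.Record struct, and produce/produce_async/produce_batch functions); the README is the source of truth, and the connection configuration and supervision-tree setup are omitted.

    # Rough sketch only; check the hexdocs README for the exact API.
    defmodule MyApp.MyKafkaClient do
      use Klife.Client, otp_app: :my_app
    end

    # Synchronous produce: blocks until the broker acks and returns the record
    # with its offset filled in.
    rec = %Klife.Record{value: "some value", topic: "my_topic"}
    {:ok, %Klife.Record{offset: _offset}} = MyApp.MyKafkaClient.produce(rec)

    # Asynchronous produce: returns immediately; callbacks are supported for
    # handling the result.
    :ok = MyApp.MyKafkaClient.produce_async(rec)

    # Batch produce: records for several topics/partitions in one call, bundled
    # per broker into single TCP requests under the hood.
    {:ok, _recs} =
      MyApp.MyKafkaClient.produce_batch([
        %Klife.Record{value: "v1", topic: "topic_a"},
        %Klife.Record{value: "v2", topic: "topic_b", partition: 0}
      ])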

Looking ahead

I hope Klife proves useful for those in the community, especially for use cases where producer performance is crucial. While it’s currently producer-only, the efficiency and resource gains might be valuable for some applications.

I’d love to hear any feedback, ideas, or issues you may encounter while trying it out. Thank you to everyone in the Elixir and Kafka communities who has inspired and supported this journey—Klife wouldn’t exist without you! 💜

hexdocs: README — Klife v0.1.0

42 Likes

This sounds awesome.

My biggest problem with the current Kafka consumer libraries is flow control/feedback.

I work with an application that handles billions of messages a day. I am always working at the limits of the systems that I am receiving data into, e.g., Elasticsearch or Postgres. I can receive data from Kafka faster than the target systems can handle it. If there is a timeout or other error, it blows up the connection to Kafka, which then needs to be rebuilt.

I end up monitoring the load/insert time on the target system and sleeping before acking the Kafka batch. It would be great to have direct support for this kind of thing in the Kafka client.
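
For readers unfamiliar with that workaround, here is a minimal sketch of the idea (every name is a placeholder, not any particular client’s API): time the downstream insert and back off before acking when it is slow.

    defmodule MyApp.SlowSink do
      # Placeholder sketch of the "sleep before acking" workaround described above.
      # `MyApp.Sink.bulk_insert/1` stands in for the target system, and `ack_fun`
      # for however your Kafka client acknowledges a batch.
      def handle_batch(messages, ack_fun) do
        {insert_us, _result} = :timer.tc(fn -> MyApp.Sink.bulk_insert(messages) end)

        # If the target system is slow, sleep before acking so it can recover.
        # The threshold and backoff factor are arbitrary illustrations.
        if insert_us > 500_000, do: Process.sleep(div(insert_us, 1_000))

        ack_fun.(messages)
      end
    end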

This is a problem with Brod, which I mostly use now. Broadway/Flow has similar issues in general. The “demand” mechanism doesn’t deal well with messages that can’t be processed immediately.

2 Likes

This is news to me: the GenStage-based solutions don’t deal well with the fact that messages can’t be consumed as quickly as they are received?

Thanks for sharing! I feel your pain! This isn’t exactly an issue with Broadway itself, but rather a conceptual mismatch between how Kafka operates and certain use cases.

These libraries are designed to work with message queue systems where messages can be individually acknowledged, sometimes even out of order.

Kafka, on the other hand, is a streaming system where typical consumers stay at a particular offset and won’t advance unless messages are acknowledged in sequence.

This characteristic of Kafka forces a choice:

  • Don’t acknowledge, and the flow stops.
  • Acknowledge and move on, even if the message wasn’t properly handled.

If you acknowledge a message that wasn’t fully processed, Kafka doesn’t have an out-of-the-box solution for recovering it, unlike some queue systems.

To handle this purely with Kafka, here are a couple of approaches:

  • Reproduce the acknowledged-but-unprocessed message to the topic: This isn’t always ideal, but if your system is the only one consuming from the topic and can handle duplicates, it can work.

  • Use a retry topic to store the “acknowledged but not processed” messages, then consume them in a separate flow: If consumption fails again here, you can follow the first approach on this topic, because you will be the only one consuming it.

My goal with Klife’s consumer design is to eventually include these features out of the box. Ideally, something like:

use Klife.Consumer, retry_topic: {true, retry_config}

The retries would then be managed automatically, though it may be tricky to implement. Let’s see how it goes!

1 Like

As I’ve explained here Klife - A Kafka client with performance gains over 10x - #4 by oliveiragahenrique, it is not an issue of GenStage or Broadway itself, but rather the lack of an out-of-the-box solution for dealing with these cases in Kafka.

As the broadway_kafka docs say:

Handling failed messages
BroadwayKafka never stops the flow of the stream, i.e. it will always ack the messages even when they fail. Unlike queue-based connectors, where you can mark a single message as failed. In Kafka that’s not possible due to its single offset per topic/partition ack strategy. If you want to reprocess failed messages, you need to roll your own strategy. A possible way to do that is to implement Broadway.handle_failed/2 and send failed messages to a separated stream or queue for later processing.
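
To make that suggestion concrete, a minimal Broadway sketch (MyApp.process/1 and MyApp.RetryProducer.produce/2 are placeholders, and the start_link/1 producer configuration is omitted):

    defmodule MyApp.Pipeline do
      use Broadway

      alias Broadway.Message

      # `MyApp.process/1` is a placeholder for the real work (DB insert, etc.);
      # messages that fail are marked as failed instead of raising.
      @impl true
      def handle_message(_processor, %Message{} = message, _context) do
        case MyApp.process(message.data) do
          :ok -> message
          {:error, reason} -> Message.failed(message, reason)
        end
      end

      # BroadwayKafka acks these messages regardless, so push them to a retry
      # topic for later reprocessing. `MyApp.RetryProducer.produce/2` stands in
      # for whichever producer you use (brod, Klife, ...).
      @impl true
      def handle_failed(messages, _context) do
        Enum.each(messages, &MyApp.RetryProducer.produce("my_topic_retry", &1.data))
        messages
      end
    end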

1 Like

Exciting news and thanks for your work!

Kafka clients keep coming up as the biggest pain point in the BEAM ecosystem in a lot of my conversations. The issues usually aren’t performance but feature support and the like, so much so that @sasajuric even gave an entire talk about the topic: https://www.youtube.com/watch?v=iVHpFoDXim4 (you’re probably aware of this).

I’d love for us to have a great and robust kafka client, but at the same time I also believe that it’s too big for any one person to tackle.

But Kafka is, in many scenarios, a default solution, and our lacking support for it locks us out of quite a few use cases.

3 Likes

The issues usually aren’t performance but feature support

You are absolutely right about performance not being the main issue for now!

As for features, most of the key gaps are on the consumer side, and I’m still evaluating the best approach there. I believe we can leverage BEAM’s distribution model to address certain issues uniquely—particularly rebalancing.

I have a few ideas to explore further before committing to a specific path, which may take some time. I’m estimating around six months to solidify the approach and potentially another year and a half to reach a releasable state, depending on my time availability.

There are some big changes coming for Kafka, such as:

I plan to incorporate support for these from the start, leveraging the clean-slate approach I’m taking with Klife.

In short, I agree that feature support is the main pain point. For the consumer side, I can only speak to plans at this stage since it’s still in development.

For the current producer, though, I believe it’s in strong shape, with support for batch production, synchronous and asynchronous modes, idempotency with EOS, transactions, custom partitioning, and extensive performance optimizations. If there’s any producer feature you find missing, please let me know so we can look into it!

I’d love for us to have a great and robust kafka client, but at the same time I also believe that it’s too big for any one person to tackle.

I think it may be true, because Kafka is indeed a constantly evolving platform that relies on clients to take on a lot of responsibilities.

This was one of the main reasons I decided to build this project entirely from scratch, starting with a full protocol rewrite. To keep up with Kafka’s rapid development pace sustainably, especially with limited resources (time and money), it’s essential to have a complete understanding of the stack, end to end.

One project that’s been a big source of inspiration is franz-go, from the Go ecosystem. It’s an impressive Kafka client that quickly integrates a wide range of KIPs, sometimes even outpacing the official Java client—all while being 99.9% maintained by a single developer.

I’m not sure if I’ll fully achieve this goal, but it’s definitely worth a shot—and the knowledge gained along the way has already been rewarding!

5 Likes

Kafka interactions were one of the worst parts about my last job. It is good to see someone take up the mantle on this.

Brod is functional but not very good. Not their fault, Kafka is a mess and pushes a huge amount of complexity to the client, and so Kafka bindings are poor everywhere except Java.

Unfortunately messaging is difficult and I had a hard time explaining to other devs that wanted to use Broadway why that auto acking behavior is so problematic.

I’ve considered pushing toward NATS at my next opportunity due to its much better semantics and client bindings, because even if you solve the client issue you still have the back pressure problem and the maintenance nightmare that is Kafka itself (as in the server).

3 Likes

Thanks for sharing your experience!

The semantic differences between Kafka and simpler message brokers often create a tricky middle ground—it’s not exactly what’s needed, but close enough to make it work. I think this leads to Kafka being overused in cases where it isn’t an ideal fit.

Kafka is, of course, an extremely valuable and powerful tool, especially for handling very high volumes. But more often than not, I see people reaching for it prematurely, causing the sort of problems you have described.

I’m not entirely sure I understand your point here.

If you’re referring to consumers not keeping up with producers, Kafka’s pull-based approach gives consumers full control over when to retrieve new messages. So, backpressure doesn’t typically become an issue—consumers simply consume at their own pace, and any lag that builds up is usually manageable.

If you’re talking about producers overwhelming the broker, I’ve actually implemented mechanisms to address this. For instance, dynamic batching allows us to adjust batch sizes based on server response times, maximizing efficiency and reducing server load when needed. Although it’s quite rare to hit that limit, these mechanisms are in place to keep server load manageable.
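
To illustrate the idea only (a toy example, not Klife’s actual implementation): shrink the next batch size when broker responses get slow, and grow it back when they recover.

    defmodule BatchSizer do
      # Toy illustration of dynamic batching: halve the batch size after a slow
      # response, double it (up to a cap) after a fast one.
      @min_batch 100
      @max_batch 10_000
      @slow_threshold_ms 50

      def adjust(current_size, last_response_ms) when last_response_ms > @slow_threshold_ms,
        do: max(div(current_size, 2), @min_batch)

      def adjust(current_size, _last_response_ms),
        do: min(current_size * 2, @max_batch)
    end

    # A slow (120 ms) response halves the in-flight batch size from 4000 to 2000.
    BatchSizer.adjust(4_000, 120)
    #=> 2000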

If you’re open to it, I’d love to hear more about what you mean by “backpressure problems.” Any further elaboration would be most welcome! Thanks!

If you’re open to it, I’d love to hear more about what you mean by “backpressure problems.”

I don’t know if “backpressure problem” is the right term. In our application, processing one event at a time was inefficient, not to mention there were often duplicates. Each event caused a DB query, so we wanted to process many at a time. Brod returns message sets, great, and as a bonus we can deduplicate messages and save more work. The problem is that the number of messages in a message set is barely configurable, so occasionally we’d receive thousands of small messages at once on the same node, and Kafka leaves you no choice but to process them all in a timely manner and then ack them all at once, or it would rebalance you, leading to a cycle of failing to process the messages that had piled up.

There is no way to say: hey Kafka, give me a batch of messages but no more than 100 at a time, or if there was, I couldn’t find it. There are various things you can tweak, like max message size, to get fewer messages, but you have to calculate it based on your expected message size. Or you could increase the amount of time you can take to process messages, but then it’s on a case-by-case basis and the brod consumer needs special care. I also looked into just acking 100 at a time, but brod isn’t properly set up to let you ack the consumer supervisor directly, and I don’t even know if that would work.

NATS, on the other hand, has many ways to deal with this. You can tell it you are still processing a message (not an ack, more of an “I’m still working on this, don’t panic”). You can also implement request-response, so you send a message to a local processor that grabs the next 100 messages through a batch version of the topic and sends them to you.

Another problem we had was the developer friction of trying to use Kafka locally versus many devs sharing an instance, which made all interactions with it super difficult to reproduce. A single Kafka node will try to take every spare bit of memory on a laptop, not to mention randomly spin out of control and write tons of logs to your disk. We actually worked around it by reimplementing a basic Kafka in our Docker Compose setup, in pure Elixir, that basically did the same thing, and that ended up being the best we could do. In short, testing these sorts of scenarios was difficult and most issues were found in production.

2 Likes

Oh, I see now! Thanks for the explanation; this kind of insight is really valuable for the upcoming consumer features.

In my opinion, some of these issues might be addressed by tweaking the consumer API to abstract certain Kafka concepts, creating a more streamlined experience.

As I mentioned in another post, regarding consumers I can only share plans at this stage, but my current approach is not simply to serve as a “proxy” for Kafka communication. Instead, I aim to offer a “batteries-included” experience by building on the core Kafka API and leveraging the strengths of the BEAM.

Feel free to share any more Kafka pain points; I’d love to take them into account in the client design! (:

This new library has a future ahead of it, and I’m all for variety in this Kafka world.

But the second-best one, brod, is good enough for our use case. And the main issue I have is not the producer side; it’s the consumption side.

Broadway is a single-stage GenStage, right? At work we use BroadwayKafka, and I mistakenly thought that if I used Broadway I could connect another GenStage-esque consumer that would do the DB work and offer back pressure. Sadly, my understanding was wrong.

Another issue is at the DB layer. Essentially, I want to know if the DB connection pool is full and have that trigger back pressure on Kafka consumption. That way, at the risk of increasing consumer lag, at least I protect the DB, and therefore data integrity.

But no matter how I search, the only answer is to do something with Ecto’s telemetry events. Sure, the telemetry events are awesome, and I have a gut feeling I can combine them with Erlang’s counters, but I don’t know where to start.
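
One possible starting point, as a rough sketch (the telemetry event name assumes a repo named MyApp.Repo, and the threshold is arbitrary): watch Ecto’s query telemetry for long pool queue times and expose a cheap flag the Kafka consumer can poll before fetching more.

    defmodule MyApp.DBPressure do
      # Rough sketch: track whether recent Ecto queries waited a long time for a
      # pooled connection, using :telemetry plus an Erlang :counters ref.
      @slow_queue_ms 100

      def install do
        ref = :counters.new(1, [:write_concurrency])
        :persistent_term.put({__MODULE__, :counter}, ref)

        # Default Ecto event name for a repo called MyApp.Repo.
        :telemetry.attach(
          "db-pressure",
          [:my_app, :repo, :query],
          &__MODULE__.handle_event/4,
          nil
        )
      end

      def handle_event(_event, measurements, _metadata, _config) do
        queue_ms =
          System.convert_time_unit(measurements[:queue_time] || 0, :native, :millisecond)

        ref = :persistent_term.get({__MODULE__, :counter})
        # 1 means "the pool looks under pressure", 0 means it looks healthy.
        :counters.put(ref, 1, if(queue_ms > @slow_queue_ms, do: 1, else: 0))
      end

      # A consumer could check this before fetching/acking and sleep when true.
      def under_pressure? do
        :counters.get(:persistent_term.get({__MODULE__, :counter}), 1) == 1
      end
    end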

1 Like

Yes, I’m totally aware of this. As I mentioned here, I can only speak to plans for the consumer side right now, and I’m very interested in learning about the current issues the community faces with Kafka to try to create a better experience. Thanks for sharing it!

The issue of “back pressure” keeps coming up, and I’m still working to fully understand it, especially since Kafka consumers inherently support back pressure by processing messages in pull mode. This means consumers should only request new records once they’ve fully handled the previous set, so if upstream services become slow, consumption naturally slows down as well.

However, given that back pressure still is a big pain, I believe that Kafka consumers shouldn’t merely slow down in proportion to upstream services; ideally, they would slow down even further to allow upstream services some breathing room to recover.

Several people have noted that bottlenecks often appear in upstream services like databases before Kafka consumption itself becomes an issue. With that in mind, here are a few approaches I’m considering:

  • The auto-acking behavior in some Kafka libraries might not be ideal for these scenarios. If a database is slow or times out, the Kafka consumer should slow down rather than auto-acknowledging and risking overload.

  • Providing a “rate limiting” option for consumers could allow manual control over processing rates, helping to prevent Kafka consumers from overwhelming upstream services during production peaks.

  • Tracking consumption times for each consumer could enable an adaptive rate-limiting feature. For instance, if a consumer usually processes a batch of 10 records within 1 second but suddenly takes 5 seconds, we could reduce the consumption rate to give upstream services a chance to recover (a rough sketch of this idea follows below). Exposing these metrics to users could also be valuable.
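
A rough sketch of that last idea (purely illustrative, not a committed design): compare the latest batch processing time against the usual baseline and delay the next fetch when it degrades.

    defmodule AdaptiveRate do
      # Purely illustrative: if the last batch took much longer than the usual
      # baseline, wait before fetching again so upstream services can recover.
      @degradation_factor 3

      def delay_before_next_fetch(last_batch_ms, baseline_ms)
          when last_batch_ms > baseline_ms * @degradation_factor do
        # Back off roughly in proportion to how much slower processing became.
        last_batch_ms - baseline_ms
      end

      def delay_before_next_fetch(_last_batch_ms, _baseline_ms), do: 0
    end

    # A batch that usually takes ~1000 ms suddenly takes 5000 ms, so wait ~4000 ms.
    AdaptiveRate.delay_before_next_fetch(5_000, 1_000)
    #=> 4000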

Let me know if I’m properly understanding your back pressure issues, and please feel free to share any others. I’d love to take them into account as I design the consumer features for Klife.

1 Like

It’d be great to see erlkaf (GitHub - silviucpp/erlkaf: Erlang kafka driver based on librdkafka) compared against as well.

2 Likes

Yes, I’d definitely like to include that! I didn’t add it initially because, as far as I know, erlkaf doesn’t offer an official synchronous produce mode, and comparing async implementations can be a bit trickier.

I plan to run an async benchmark, though, and will definitely include erlkaf in the list!

1 Like