I’m trying to understand the concurrency settings in Broadway and what rule of thumb I might use to set them.
We are using BroadwayKafka. I’ve read the docs about concurrency and partitioning.
There are 3 concurrency settings, for the producer, processor and batcher.
I can see how these are arranged in the Broadway dashboard. It looks like the producer processes (labeled as prod_{0..2}) hand off to the processors (labeled as proc_{0..9}), which connect to the batcher processes (labeled as cdc_documents), which in turn hand off to the processes that handle the messages (labeled as proc_{0..3}). That’s my reading of it anyway; there is some more information here.
Our topic, “cdc_documents”, has 4x partitions. The messages on the topic have a key which maps them deterministically to a topic partition. We want to process messages with the same key in strict sequential order.
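To make the key-to-partition assumption concrete, here is a minimal sketch of what I mean by "deterministically". The module name `PartitionSketch` is made up, and `:erlang.phash2/2` is only a stand-in for whatever hash our Kafka producer actually uses, so the partition numbers it prints will not match the real topic.

```elixir
defmodule PartitionSketch do
  # Hypothetical helper: maps a message key to one of `partition_count`
  # partitions. :erlang.phash2/2 stands in for the producer's real hash.
  def partition_for(key, partition_count) when partition_count > 0 do
    :erlang.phash2(key, partition_count)
  end
end

# The same key always lands on the same partition, so per-key order only
# depends on each partition being consumed sequentially.
keys = ["doc-1", "doc-2", "doc-1", "doc-3", "doc-1"]

keys
|> Enum.map(&{&1, PartitionSketch.partition_for(&1, 4)})
|> Enum.each(fn {key, partition} -> IO.puts("#{key} -> partition #{partition}") end)
```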
We have 2x nodes in our Elixir cluster and Broadway is configured to use the same group_id, so they are part of the same consumer group.
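My mental model of how the consumer group splits the partitions is roughly the sketch below. It assumes a simple round-robin assignment and treats each node as a single group member. Both are my assumptions: the real assignment strategy depends on the client configuration, and each producer process may well join the group as its own member. `AssignmentSketch` is a made-up name.

```elixir
defmodule AssignmentSketch do
  # Hypothetical round-robin assignment of partitions to group members.
  # Members that receive no partition do not appear in the result.
  def assign(partitions, members) do
    member_count = length(members)

    partitions
    |> Enum.with_index()
    |> Enum.group_by(
      fn {_partition, index} -> Enum.at(members, rem(index, member_count)) end,
      fn {partition, _index} -> partition end
    )
  end
end

# 4 partitions shared between 2 members (one per node, by assumption).
AssignmentSketch.assign([0, 1, 2, 3], [:node_a, :node_b])
# => %{node_a: [0, 2], node_b: [1, 3]}
```

Under that assumption each node would own 2 partitions, which is fewer than the producer concurrency of 3 we configure per node.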
def start_link(_opts) do
  Broadway.start_link(__MODULE__,
    name: __MODULE__,
    producer: [
      module: {BroadwayKafka.Producer, producer_opts()},
      concurrency: 3
    ],
    processors: [
      default: [
        concurrency: 10
      ]
    ],
    batchers: [
      default: [
        batch_size: 100,
        batch_timeout: 200,
        concurrency: 10
      ],
      cdc_documents: [
        batch_size: 100,
        batch_timeout: 200,
        concurrency: 4
      ]
    ]
  )
end
We have more topics; I’ve just removed them to shorten the code example.
Given a set of topics, each with a certain number of partitions, how should we configure the concurrency settings?
Our prepare_messages function maps over the messages and decodes the message data (Avro).
Our handle_message just sets the batcher based on the topic name, so one batcher per topic.
Our batch_handler hands off each message to be processed.
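For reference, the topic-to-batcher routing looks roughly like the sketch below. `BatcherRouting` and `batcher_for/1` are illustrative names rather than our real code, and the fallback to the `:default` batcher is an assumption on my part.

```elixir
defmodule BatcherRouting do
  # Topic name => batcher name. Only the topic from the config above is
  # listed; the other topics were removed from the example.
  @batchers %{"cdc_documents" => :cdc_documents}

  # Falls back to the :default batcher for unmapped topics (assumption).
  def batcher_for(topic) when is_binary(topic) do
    Map.get(@batchers, topic, :default)
  end
end

# Inside handle_message/3 this would be used roughly as:
#   Broadway.Message.put_batcher(
#     message,
#     BatcherRouting.batcher_for(message.metadata.topic)
#   )
IO.inspect(BatcherRouting.batcher_for("cdc_documents"))
```

Keeping the mapping explicit avoids turning arbitrary topic strings into atoms at runtime.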
Many thanks!