Multiple batching processes with Broadway

Hello,

I’d like to ask for some advice about how to organize my data pipeline with Broadway.

Background:

There is a stream of user-generated events (user actions in an application). I need to send notification emails about those actions to subscribed users.

Requirements:

First, I need to group events into batches according to some event data: merge similar events, cancel out opposite events, etc. This process should reduce the number of events and make them more meaningful. Then I need to group the merged events again to send them in batches (one email may contain several notifications).
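
For illustration, I imagine the first “merge” stage working roughly like the sketch below; the event shape, field names, and the followed/unfollowed example are all made up:

```elixir
defmodule MyApp.EventReducer do
  # Hypothetical event shape:
  # %{actor_id: _, target_id: _, type: "followed" | "unfollowed" | ...}
  @opposites %{"followed" => "unfollowed", "unfollowed" => "followed"}

  def reduce(events) do
    events
    |> Enum.group_by(&{&1.actor_id, &1.target_id})
    |> Enum.flat_map(fn {_key, group} ->
      group
      |> Enum.reduce([], &add_or_cancel/2)
      |> Enum.reverse()
      # Merge duplicates of the same action into one event.
      |> Enum.uniq_by(& &1.type)
    end)
  end

  # If the opposite action is already pending for the same actor/target,
  # the two cancel out; otherwise keep the event.
  defp add_or_cancel(event, acc) do
    opposite = @opposites[event.type]

    case Enum.split_while(acc, &(&1.type != opposite)) do
      {_kept, []} -> [event | acc]
      {kept, [_cancelled | rest]} -> kept ++ rest
    end
  end
end
```

The first batching stage would then just call something like this on each batch before handing the reduced events to the second stage.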

Problem:

This leads to the idea of using Broadway, as it seems like a very good fit at first glance. The catch is that I would need to do batching twice: once to merge similar events, and again to group notifications before sending. The batching criteria (read: “batch key”) would be different for each.

What would you suggest? As far as I understand, a single Broadway pipeline can only batch each message once.
Is it feasible to set up a second Broadway pipeline to do the second batching? There would be an issue with “acking” though: the first pipeline would ack messages in the initial producer without waiting for the second pipeline.
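
To make the constraint concrete, here is roughly how I picture a single pipeline handling the first batching; the producer, queue name, and batch key are placeholders, and I’m assuming JSON payloads. As far as I can tell, once handle_batch/4 returns, the messages are acked, and there is no second batching stage for them:

```elixir
defmodule MyApp.EventPipeline do
  use Broadway

  alias Broadway.Message

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        # Placeholder producer; any Broadway producer would work here.
        module: {BroadwayRabbitMQ.Producer, queue: "user_events"},
        concurrency: 1
      ],
      processors: [default: [concurrency: 10]],
      batchers: [
        # The single batching stage: groups messages by the batch key below.
        merge: [batch_size: 100, batch_timeout: 5_000]
      ]
    )
  end

  @impl true
  def handle_message(_processor, message, _context) do
    event = Jason.decode!(message.data)

    message
    |> Message.update_data(fn _ -> event end)
    |> Message.put_batcher(:merge)
    # Placeholder batch key: group by user and event type so similar
    # events land in the same batch and can be merged/cancelled out.
    |> Message.put_batch_key({event["user_id"], event["type"]})
  end

  @impl true
  def handle_batch(:merge, messages, _batch_info, _context) do
    # Merge similar events / cancel opposite ones here.
    # Once this returns, the messages are acked on the producer;
    # there is no second batching stage for the same messages.
    messages
  end
end
```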

I’d appreciate any thoughts on the subject.

Thank you.


You don’t need Broadway unless you’re doing like 1000s of events a second. And even then, Broadway’s batching is really in the context of what you can keep in memory. If you have that kind of volume, you could use Broadway to batch inserts into your db table, for example.

Then query that periodically to generate your notifications and maybe queue those in Oban to send the email.

I guess if you’re using Broadway at that point, just use another pipeline to queue the emails.
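
Very roughly, I mean something like this; the pending_events table, its columns, and the Oban queue name are all made up:

```elixir
defmodule MyApp.NotificationDigest do
  # Assumes the ingest pipeline batch-inserts raw events into a
  # "pending_events" table (e.g. Repo.insert_all/2 in handle_batch/4).
  import Ecto.Query
  alias MyApp.Repo

  # Run this periodically (e.g. via Oban.Plugins.Cron): group unsent
  # events per recipient and enqueue one email job per recipient.
  def enqueue_digests do
    from(e in "pending_events",
      where: is_nil(e.notified_at),
      select: %{id: e.id, recipient_id: e.recipient_id}
    )
    |> Repo.all()
    |> Enum.group_by(& &1.recipient_id, & &1.id)
    |> Enum.each(fn {recipient_id, event_ids} ->
      %{recipient_id: recipient_id, event_ids: event_ids}
      |> MyApp.NotificationMailer.new()
      |> Oban.insert()
    end)
  end
end

defmodule MyApp.NotificationMailer do
  use Oban.Worker, queue: :mailers

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"recipient_id" => recipient_id, "event_ids" => event_ids}}) do
    # Load the grouped events, render one digest email, deliver it,
    # then mark the events as notified.
    send_digest(recipient_id, event_ids)
  end

  defp send_digest(_recipient_id, _event_ids), do: :ok
end
```

The table acts as the queue in the middle, so how you group events into an email is decoupled from how they were batched on ingest.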

You don’t need Broadway unless you’re doing like 1000s of events a second.

I think Broadway is good as a way to model data pipelines and structure your code in general. Using it doesn’t necessarily require 1000s of events a second. I like the batching mechanism and the fact that you can plug in any producer. Otherwise, I would need to reimplement the same logic myself.

Thanks for your suggestions! Another queue in the middle really makes sense here.
