Building a task queue with serialization required on subqueues

I need to build a task queue with an interesting specialization: The queue is partitioned by an ID which I’ll call a “pool ID.”

Any number of jobs may run in parallel and in any sequence, but the jobs for a given pool ID must run in the order that they were queued and must not overlap. Both the number of queued jobs and the number of pool IDs are unconstrained and potentially very large.

I’m looking at GenStage.PartitionDispatcher as an implementation approach. Does this make sense, or is there some other more appropriate approach?

This depends a bit on how your underlying queue mechanism works, but in theory it sounds like you could just create more queues. Items in each queue stay ordered and sequential, queues are unordered and parallel with respect to one another.

There’s a few things here that are tricky though. The following supposition isn’t feasible:

Unless you’ve got unlimited memory the maximal number of parallel jobs that can run is fixed, and the significance of this is that you really do need some notion of sequence between these queues, because otherwise if items keep getting added to pool ID 1, jobs from pool ID 2 may never be run at all.

Some immediate questions which come to mind:

  • What does a single task do? Is it CPU or I/O bound? How long is it running?
  • How many simultaneous tasks do you expect at the same time?
  • Can you somehow determine that tasks for a particular pool id are exhausted (i.e. that no more tasks with that pool id will be produced)?

A simple solution could be to have a GenServer per each “pool id”. You could use Registry to register and discover pools for the given id. With this approach you get serialization per each pool id, and concurrency for different IDs. If you don’t expect too many simultaneous tasks pending, and they are not very long, this solution might be good enough on its own, without needing to resort to any 3rd party libraries.

If you expect many simultaneous tasks, and want to limit parallelism, you could use poolboy or jobs. This would allow you to keep a common queue (shared by all the pools), and ensure that at any point in time you’re executing at most N tasks.

The important thing here is to somehow stop a GenServer for a specific pool (hence the third question).

GenStage might also work, though I don’t have practical experience with it, so can’t help you here.

Also, somewhat related to your question, I recommend reading this article by Fred Hebert, on dealing with overloads.

1 Like