Hello, we are using Oban Pro and are looking at strategies to deal with a throughput problem. Our platform sends text messages, and a single text job can cover millions of contacts. The process we inherited uses a workflow and creates multiple Oban jobs per message. Of course that is inefficient and causes heavy load on our database, so we have been working on optimizing it.
At a base level, we are using Oban Pro to rate limit the queues and to limit global and local concurrency. We also implemented Broadway to batch the Oban jobs via AWS SQS. This was fine when we were small, but at scale it leads to heavy load on the database. Most importantly, when multiple clients are processing their text jobs at once, each inserting or processing potentially millions of rows, we need to ensure they are not starving each other of resources.
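For reference, our queue setup is roughly shaped like the sketch below; the queue name and limits are illustrative rather than our real values, and the Smart engine module name may differ depending on the Pro version.

```elixir
import Config

# Illustrative only: queue name and limits are placeholders.
config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  repo: MyApp.Repo,
  queues: [
    messages: [
      # per-node concurrency
      local_limit: 10,
      # cluster-wide concurrency
      global_limit: 50,
      # throttle to 1_000 jobs per 60 seconds
      rate_limit: [allowed: 1_000, period: 60]
    ]
  ]
```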
In addition to optimizing our queries and inserts, we are specifically looking at worker chunking to reduce load on the database, and at queue/chunk partitioning so multiple campaigns can run concurrently. I think we can use worker chunking to grab a number of jobs from the SQS queue to work on. Does this make sense?
Are there any other strategies we can use to increase our concurrency and prevent resource starvation and database load? Also the docs on chunking and chunk/queue partitioning are informative but sparse. Is there another resource I can look at to get a better idea how to properly implement these strategies?
Yes, it makes complete sense. The best way to scale ingestion/event handling is with chunking in a worker.
Using bulk inserts, preferably without uniqueness, and then using a chunk worker is a great place to start.
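For the insert side, a rough sketch of that idea (the worker name, args shape, and batch size are placeholders): build plain changesets without unique options and hand them to Oban.insert_all in batches.

```elixir
defmodule MyApp.Messaging do
  @insert_batch 5_000

  # Bulk-enqueue one delivery job per contact. Skipping unique options keeps
  # the inserts on the fast path, and batching avoids one enormous query.
  def enqueue_contacts(campaign_id, contacts) do
    contacts
    |> Stream.map(fn contact ->
      MyApp.DeliveryWorker.new(%{campaign_id: campaign_id, contact_id: contact.id})
    end)
    |> Stream.chunk_every(@insert_batch)
    |> Enum.each(&Oban.insert_all/1)
  end
end
```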
There’s a more concrete example of using a chunk worker in the old composing jobs with pro article. We’re in the process of updating it and converting it into a guide, but that’s not ready yet.
We’d be happy to help guide you toward a proper implementation and hopefully use that experience to inform the composition guide. For now, a few points to consider that may not be spelled out in the chunk docs:
Every job that the queue starts, up to the local_limit or global_limit, represents a single chunk. The first job is the “leader”, and it will fetch jobs up to the specified size. For example, if you have a queue limit of 4 and a chunk size of 100, you would potentially run 400 jobs at once.
Chunk leaders compete with each other for the jobs that make up their chunks. Using the example before, if there are fewer than 100 jobs you'll still have 4 separate chunks; some may get a single job while another grabs the remaining 96. It all comes down to timing, so you'll need to tweak both the queue limit and the chunk size to get optimal throughput.
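To make that concrete, here's a minimal Chunk worker sketch (the worker name, queue, and deliver_batch/1 function are placeholders):

```elixir
defmodule MyApp.DeliveryWorker do
  use Oban.Pro.Workers.Chunk,
    queue: :messages,
    # up to 100 jobs per chunk...
    size: 100,
    # ...or whatever has accumulated after one second
    timeout: :timer.seconds(1)

  @impl true
  def process([_ | _] = jobs) do
    # The leader handles the whole chunk in one go, e.g. a single batched
    # provider request or one multi-row update instead of 100 separate ones.
    jobs
    |> Enum.map(& &1.args)
    |> MyApp.Messaging.deliver_batch()

    :ok
  end
end
```

With a local_limit of 4 on that queue, up to 4 of these chunks, i.e. 400 jobs, can be in flight on a node at once, which is where the tuning between queue limit and chunk size comes in.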
We were looking at using chunk and queue partitioning to limit the number of chunk leaders that can run simultaneously for each text job, as described in the docs. That should help prevent resource starvation between text jobs, right? A single text job is what some would call the overall campaign initiated by a user, and we want as much concurrency between these text jobs as possible so client B doesn't have to wait for client A's large job to finish.
Also, is there any way that Dynamic Prioritization could potentially help with this issue?
That’s a great plan. It will help prevent resource starvation because each chunk is guaranteed either to be “full” or to time out and run; you won’t have many competing chunks.
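As a hedged sketch of the queue side, you could cap how many chunk leaders run per campaign. The partition: syntax under global_limit is from my reading of the Smart engine docs, and the campaign_id args key is assumed, so double-check both against your Pro version:

```elixir
import Config

config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  queues: [
    messages: [
      local_limit: 4,
      # Assumed partitioning syntax: allow at most 2 running jobs (chunk
      # leaders) per campaign_id, so one huge campaign can't monopolize
      # the queue while others wait.
      global_limit: [allowed: 2, partition: [fields: [:args], keys: [:campaign_id]]]
    ]
  ]
```

On the chunk side, something like by: [args: :campaign_id] in the Chunk options should keep each leader's chunk scoped to a single campaign, though that option name is also worth confirming against the chunk partitioning docs.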
Possibly, but it’s hard to say without more information. If prioritization would help, and you want to be sure that de-prioritized jobs eventually run, then the DynamicPrioritizer could help.
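Wiring it in is just another plugin entry. A minimal sketch, assuming the after: option controls how long a job waits before its priority is bumped; check the DynamicPrioritizer docs for the exact option names:

```elixir
import Config

config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  plugins: [
    # Assumed option: raise the priority of jobs still waiting after five
    # minutes so de-prioritized campaigns are never starved indefinitely.
    {Oban.Pro.Plugins.DynamicPrioritizer, after: :timer.minutes(5)}
  ]
```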