I’m using Oban PRO and I’m facing a job distribution issue.
When running some queries on the database, I noticed that some nodes were executing many more jobs than others. For example, over the course of one day, one node executed 1M jobs while another executed only 100k. This caused a CPU and memory increase on the nodes that were running more jobs.
My configs:
Oban 2.17.12
Oban Pro 1.4.10
Queue :foo with limit: 20, the Smart engine, partitioned, and the Reindexer, DynamicLifeline and DynamicPartitioner plugins
10 nodes (pods) in Kubernetes
Some jobs run at priority 0 and others at priority 1 or 2, depending on the job's args
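For reference, the config looks roughly like the sketch below (the app and repo names are placeholders, and I've left out the queue's partitioning options):

```elixir
# Rough sketch of the setup described above. `MyApp` and `MyApp.Repo` are
# placeholders; partitioning options on the queue are omitted for brevity.
config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  repo: MyApp.Repo,
  queues: [foo: [limit: 20]],
  plugins: [
    Oban.Plugins.Reindexer,
    Oban.Pro.Plugins.DynamicLifeline,
    Oban.Pro.Plugins.DynamicPartitioner
  ]
```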
This could be because of a lock race between the two nodes. The node that’s processing more jobs is acquiring the lock first, so it takes the jobs first, and because of the limit the second node isn’t getting as many jobs. It’s most likely that the leader node is the one running more jobs, as it receives the notification about available jobs first (the message is local, there’s nothing over the wire for the local node with the PG notifier).
Fairness between nodes isn’t enforced, e.g. there isn’t a round robin that ensures each node takes some number of jobs before allowing another node. If fairness is essential, then you can explore using separate queues for each node (foo_jobs_1 and foo_jobs_2), then consistently hashing new jobs into one queue or the other.
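As a rough sketch of that routing (the worker module and the args key used for hashing are placeholders, not something from your setup):

```elixir
defmodule MyApp.QueueRouter do
  @moduledoc "Deterministically routes jobs into one of several per-node queues."

  @queues ~w(foo_jobs_1 foo_jobs_2)a

  # The same key always hashes to the same queue, so jobs for a given key
  # consistently land on the same node's queue.
  def queue_for(key) do
    Enum.at(@queues, :erlang.phash2(key, length(@queues)))
  end
end

# At insert time, pick the queue from a stable value in the args:
args
|> MyApp.FooWorker.new(queue: MyApp.QueueRouter.queue_for(args["account_id"]))
|> Oban.insert()
```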
A couple of side notes about DynamicPartitioner:
Unless you’re processing a tremendous number of jobs each day (tens of millions), it may cause more trouble than benefit. It has performance tradeoffs for unique jobs in particular because of how Postgres does partitioning.
Be careful with retaining 30 days of jobs because it requires a separate table for each day. Queries for workflows, batches, chains, etc. that need to check each table have to query 30+ tables as a result.
Thanks for the answer and the tips.
It’s not just one node carrying the highest load. I have 10 nodes, at least 3 of which have more load than the others. I noticed that none of these 3 is the leader. In fact, the leader is one of the nodes with the lowest load.
But isn’t the configured queue limit local? How can the fact that node1 takes more jobs interfere with node2’s limit, since the limit is local per node? Assuming that node1 reaches the limit of 20 jobs, node2 still has a remaining limit of 20 jobs.
The local node/leader bit was a hypothesis, but I believe the root cause is the same. There’s also a possibility of lingering data in the producer metadata on the Pro version you’re using. There have been a tremendous number of bug fixes and a lot of performance tuning between your versions and Oban v2.19 / Pro v1.6.
I thought you were asking about rate_limited_foo_jobs; there the limit of 20 is local, but only 10 jobs are allowed per minute, well below that local limit.
Rate limits apply globally, otherwise you’d have to dynamically recalculate the limit based on the number of nodes (especially difficult with rolling deploys).
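For reference, a Smart engine queue combining a local limit with a rate limit looks something like this sketch (numbers taken from the discussion above; check the Pro docs for your version for the exact option names):

```elixir
# Sketch of a rate-limited queue under the Smart engine: local_limit caps
# concurrency on each node, while the rate limit is shared across all nodes.
config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  queues: [
    rate_limited_foo_jobs: [
      local_limit: 20,
      rate_limit: [allowed: 10, period: {1, :minute}]
    ]
  ]
```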