Oban queue is not picking up jobs

Hi there,

I have a queue with the following opts:

{"paused": false, "ack_async": null, "rate_limit": null, "local_limit": 100, "global_limit": {"allowed": 1, "partition": {"keys": ["our_key"], "fields": ["args"]}}, "retry_backoff": 1000, "retry_attempts": 5, "refresh_interval": null}

I have thousands of jobs in this queue, spread across 7 different our_key values, so I’m expecting to see 7 jobs running concurrently in my cluster. However, only 5 are running.

I have two producers, with the following meta:

{"paused": false, "rate_limit": null, "local_limit": 100, "global_limit": {"allowed": 1, "tracked": {"27783521": {"args": {"our_key": "SW5zdGFuY2U6YjU5MWI4ZmItYjk1OS00ODkyLTgzZTEtMmQ3NmVmMjhjMDc2"}, "count": 1, "worker": null}, "126742740": {"args": {"our_key": "SW5zdGFuY2U6MjI1NzI2MzgtZDg0ZC00MDYwLWIwZDItMDQ0MDNiZTc5ODFm"}, "count": 1, "worker": null}}, "partition": {"keys": ["our_key"], "fields": ["args"]}}, "retry_backoff": 1000, "retry_attempts": 5}
{"paused": false, "rate_limit": null, "local_limit": 100, "global_limit": {"allowed": 1, "tracked": {"73912501": {"args": {"our_key": "SW5zdGFuY2U6YmYyY2IzZDYtMjhmMS00ZGRjLWE5NDEtOTU4OGM3Y2Y1YTFl"}, "count": 1, "worker": null}, "133456980": {"args": {"our_key": "SW5zdGFuY2U6NDg3MzFhN2ItOGRjNC00ZTM2LWFlMzgtOTM2N2Q1M2E4MDkx"}, "count": 1, "worker": null}, "133940658": {"args": {"our_key": "SW5zdGFuY2U6NzA5NzE1ZDEtYWM4OS00ZTgyLTg3N2UtZDhhZGU5NGE5NWZk"}, "count": 1, "worker": null}}, "partition": {"keys": ["our_key"], "fields": ["args"]}}, "retry_backoff": 1000, "retry_attempts": 5}

How can I find out what’s preventing the other 2 jobs from starting?

Aggregating the oban_jobs table by our_key gives:

select args->>'our_key', count(*) from oban_jobs where queue = 'our_queue' group by 1;
SW5zdGFuY2U6YjU5MWI4ZmItYjk1OS00ODkyLTgzZTEtMmQ3NmVmMjhjMDc2	8316
SW5zdGFuY2U6ODQ4ZDUyODktYjcxMy00Mjg1LTg5YWItMzJiNjgwYjI4ODZl	322
SW5zdGFuY2U6YmYyY2IzZDYtMjhmMS00ZGRjLWE5NDEtOTU4OGM3Y2Y1YTFl	1172
SW5zdGFuY2U6NDg3MzFhN2ItOGRjNC00ZTM2LWFlMzgtOTM2N2Q1M2E4MDkx	713
SW5zdGFuY2U6ZWM1NzA0Y2YtNDk3MC00YzI3LWE5ZjgtYmY0ZTY1YjJlYzJi	401
SW5zdGFuY2U6NzA5NzE1ZDEtYWM4OS00ZTgyLTg3N2UtZDhhZGU5NGE5NWZk	1951
SW5zdGFuY2U6MjI1NzI2MzgtZDg0ZC00MDYwLWIwZDItMDQ0MDNiZTc5ODFm	255

The queue is running only those 5 jobs, so it’s not the local limit.

Oban versions:

"oban": {:hex, :oban, "2.18.3"
"oban_met": {:hex, :oban_met, "0.1.7"
"oban_pro": {:hex, :oban_pro, "1.4.13"
"oban_web": {:hex, :oban_web, "2.10.5"

Oban config:

engine: Oban.Pro.Engines.Smart,
notifier: Oban.Notifiers.PG,

This comes from an optimization in the partitioning query that has an unfortunate side effect: it bottlenecks when most of the jobs belong to a subset of the partitions. A limit is applied to the jobs it checks before partitioning, and when that limit is only large enough to cover some of the partitions, the ones that aren’t covered can’t run. For example, if all of the oldest available jobs within the limit belong to 5 of your 7 partitions, the other 2 partitions never make it into the window.

The default limit is 5k to conserve memory, but you can increase it with a compile-time option. For example, to bump it to 10k:

config :oban_pro, Oban.Pro.Engines.Smart, partition_limit: 10_000

It’s a sticking point that we’re aware of and plan on fixing eventually. For now, you can work around it with a large enough partition_limit.


Got it. I couldn’t find that in the documentation. Thanks for clarifying.

So, since I have ~13k jobs in this queue across the 7 partitions, I need a partition limit higher than that, right? And if I accumulate even more jobs, I’ll need to increase it again?

Technically you only need a limit higher than the largest partition. Something else to consider, if the exact order they’re processed in doesn’t matter, is to randomize the priority:

priority: Enum.random(0..9)

That effectively mixes the jobs together enough that the lower limit doesn’t matter.
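For example, setting it when the job is inserted (the worker module name here is just a placeholder):

# a random priority interleaves the partitions within the fetch window
%{our_key: key}
|> MyApp.OurWorker.new(priority: Enum.random(0..9))
|> Oban.insert()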

I’ve increased it to 10k, but it still didn’t start the 6th and 7th partitions.

Application.get_env(:oban_pro, Oban.Pro.Engines.Smart)
[partition_limit: 10000]

The largest partition has 8246 jobs.

I think it expects the opts to be a map; I’ll try that.

I’ve changed it to be a map, but it didn’t help.

Application.get_env(:oban_pro, Oban.Pro.Engines.Smart, %{}) |> Map.get(:partition_limit, 5_000)
10000

The jobs are still in available state.

Even after increasing the limit to 15k, it doesn’t work.

Are you able to “shuffle” the jobs by changing the priority?

update oban_jobs set priority = mod(id, 9) where state = 'available' and queue = 'name-of-queue';

I can’t do that, as I need to respect the order they were scheduled in. I have the feeling it’s not reading my config at all, otherwise it would have raised an error while the opts were a keyword list, since Map.get(keyword, atom) should fail. I’ve set it in my runtime.exs file:

config :oban_pro, Oban.Pro.Engines.Smart, %{
  partition_limit: String.to_integer(System.get_env("OBAN_PARTITION_LIMIT", "5000"))
}

I was checking some spans, and it is indeed still using 5k as the limit:

WITH "subset" AS (SELECT ss0."id" AS "id" FROM (SELECT sso0."id" AS "id", sso0."priority" AS "priority", sso0."scheduled_at" AS "scheduled_at", sso0."worker" AS "worker", sso0."args" AS "args", dense_rank() OVER "partition" AS "rank" FROM "public"."oban_jobs" AS sso0 INNER JOIN (SELECT ssso0."id" AS "id" FROM "public"."oban_jobs" AS ssso0 WHERE ((ssso0."state" = 'available') AND (ssso0."queue" = $1)) ORDER BY ssso0."priority", ssso0."scheduled_at", ssso0."id" LIMIT 5000) AS sss1 ON sso0."id" = sss1."id" WINDOW "partition" AS (PARTITION BY sso0."args"->>$2 ORDER BY sso0."priority", sso0."scheduled_at", sso0."id")) AS ss0 WHERE (NOT (ss0."args" @> $3) AND (NOT (ss0."args" @> $4) AND (NOT (ss0."args" @> $5) AND (NOT (ss0."args" @> $6) AND (NOT (ss0."args" @> $7) AND (ss0."rank" <= $8)))))) ORDER BY ss0."priority", ss0."scheduled_at", ss0."id" LIMIT $9) UPDATE "public"."oban_jobs" AS o0 SET "state" = $10, "attempted_at" = $11, "attempted_by" = $12, "attempt" = o0."attempt" + $13 FROM "subset" AS f1 WHERE (((o0."id" = f1."id") AND (o0."state" = 'available')) AND (o0."attempt" < o0."max_attempts")) RETURNING o0."id", o0."state", o0."queue", o0."worker", o0."args", o0."meta", o0."tags", o0."errors", o0."attempt", o0."attempted_by", o0."max_attempts", o0."priority", o0."attempted_at", o0."cancelled_at", o0."completed_at", o0."discarded_at", o0."inserted_at", o0."scheduled_at"

I’ve noticed that it uses compile_env to read it, so I’m moving it to my prod.exs.

Ah, yes, it’s using compile_env and isn’t changeable at runtime.
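Something along these lines in config/prod.exs will bake the value in at build time. Note that with compile_env the env var is read on the machine doing the compile, not at boot:

# config/prod.exs, evaluated at compile time
config :oban_pro, Oban.Pro.Engines.Smart,
  partition_limit: String.to_integer(System.get_env("OBAN_PARTITION_LIMIT", "15000"))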


That worked now :tada: thanks

It’s actually the number of available jobs in the queue that matters, not the size of the largest partition.

The query is:

SELECT ssso0."id" AS "id" FROM "public"."oban_jobs" AS ssso0 WHERE ((ssso0."state" = 'available') AND (ssso0."queue" = $1)) ORDER BY ssso0."priority", ssso0."scheduled_at", ssso0."id" LIMIT 10000

So, the more available jobs there are, the bigger the partition_limit needs to be.
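A quick way to check the number to size against (a sketch, assuming an Ecto repo named MyApp.Repo):

import Ecto.Query

# partition_limit must exceed the count of available jobs in the queue
MyApp.Repo.aggregate(
  from(j in Oban.Job, where: j.queue == "our_queue" and j.state == "available"),
  :count
)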