Oban jobs stuck as available if using global_limit

I have a queue that is configured like this:

  queues: [
    default: [
      local_limit: 100,
      global_limit: [
        allowed: 4,
        partition: [args: :tenant_id]
      ]
    ]
  ],

When I insert a job into it, the job never runs; it stays stuck in the available state forever. But if I remove the global_limit config and leave just the local_limit line, jobs run immediately.

I also noticed that, with the global_limit config enabled, Oban Web doesn't show any jobs at all, even though they exist in the DB.

Not sure if it is related to the issue or not, but when I start my backend, I see this log message:

17:30:28.294 [info] {"message":"job staging switched to local mode. local mode polls for jobs for every queue; restore global mode with a functional notifier","mode":"local","source":"oban","event":"stager:switch"}

Here is my worker implementation:

defmodule Core.Workers.ApiCaller do
  @moduledoc false

  use Oban.Pro.Worker,
    queue: :default,
    recorded: [limit: 128_000_000]

  require Logger

  args_schema do
    field :url, :string, required: true
    field :args, :map, required: true
    field :tenant_id, :integer, required: true
  end

  @impl Oban.Pro.Worker
  def process(%Job{args: args} = _job) do
    %{url: _url, args: _args, tenant_id: tenant_id} = args

    dbg("start: #{inspect(self())} #{tenant_id}")

    Process.sleep(5_000)

    dbg("end: #{inspect(self())} #{tenant_id}")

    {:ok, "output"}
  end
end

Which version of Pro are you using? This could be related to poor handling of missing partition keys in slightly older versions of Pro v1.6.

That’s not related, but it does indicate that you don’t have functional pubsub set up for that environment. It could be from a pooler like pgbouncer if you’re using the default Postgres notifier, or because you don’t have a functional cluster if you’re using the PG notifier.
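If you're running a single node in dev, the usual fix is a direct (non-pooled) connection for the default Postgres notifier. A sketch of the relevant config (the `:my_app` / `MyApp.Repo` names are placeholders):

```elixir
# config/config.exs — sketch only; app and repo names are placeholders.
# The default Postgres notifier relies on LISTEN/NOTIFY, which doesn't
# work behind PgBouncer in transaction-pooling mode.
config :my_app, Oban,
  repo: MyApp.Repo,
  notifier: Oban.Notifiers.Postgres,
  queues: [default: 10]

# Alternatively, if your nodes form a distributed Erlang cluster:
#   config :my_app, Oban, notifier: Oban.Notifiers.PG, ...
```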


I’m using Oban Pro 1.6.11 and Oban 2.20.3.

Regarding the Postgres instance: it's my local instance. I don't need or have clustered Postgres; I'm only using global_limit because of the partition feature.

What I want is to partition the queue by tenant so that one tenant can't monopolize the whole queue and negatively affect the other tenants.

Of course, that’s exactly what it’s for!

Considering you’re on a current version of Pro, can you confirm that:

  1. You’re running the DynamicLifeline plugin
  2. None of the partition_key values for those rows are null

You’re running the DynamicLifeline plugin

Yes, here is my config for it: {Oban.Pro.Plugins.DynamicLifeline, retry_exhausted: true}

None of the partition_key values for those rows are null

Yep, they all have the tenant_id arg filled, here is how I’m adding it to the DB to test:

  def create! do
    %{
      url: "http://localhost:4000/test_api",
      args: %{a: 1},
      tenant_id: 1
    }
    |> new()
    |> Core.Repo.insert!()
  end

So, I cleaned my DB and now it is running, but it takes a long time to start: around 1 minute from the time the job is inserted in the DB until it starts running.

If I remove the global_limit lines, then it starts instantly.

Edit: I also noticed that if I remove DynamicLifeline, the jobs don’t run at all anymore.

I also changed the DynamicLifeline rescue_interval to 10 seconds, and every time the rescue runs, my job runs right after it.

So it seems that, for whatever reason, the jobs are “lost” and DynamicLifeline is what rescues them and gets them running.
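For reference, the plugin entry I'm testing with looks roughly like this (the interval is shortened only to observe the behavior in dev):

```elixir
plugins: [
  # rescue_interval shortened for debugging; the default is much
  # longer and should be kept in production.
  {Oban.Pro.Plugins.DynamicLifeline,
   retry_exhausted: true,
   rescue_interval: :timer.seconds(10)}
]
```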

I can guarantee the jobs are not orphaned jobs stuck in the executing state. Here is a job I just added, pulled from the DB as an example:

  id           | 9
  state        | available
  queue        | default
  worker       | Core.Workers.ApiCaller
  args         | {"url": "http://localhost:4000/test_api", "args": {"a": 1}, "tenant_id": 1}
  errors       | {}
  attempt      | 0
  max_attempts | 20
  inserted_at  | 2026-01-31 03:48:00.456332
  scheduled_at | 2026-01-31 03:48:00.456332
  priority     | 0
  tags         | {}
  meta         | {"recorded": true, "structured": true}

As we can see, the job state is available.

What’s happening is that you’re inserting jobs without the partition key calculated, and then DynamicLifeline backfills them so they can run. For partitioned queues, jobs without a partition key won’t be run. The trick here is figuring out why those jobs aren’t inserted with a partition_key.

  1. Is this in development only? Are you restarting the app quickly and changing the config between restarts?
  2. Do you have multiple configurations for the same queue? Meaning, is there a worker node that has the default queue set as global, and a web queue that has it set with a simple local limit?

Indeed, I can see here that the jobs do not have partition_key set.

Here is how I’m adding them; I have this function in the worker module:

  def create! do
    %{
      url: "http://localhost:4000/test_api",
      args: %{a: 1},
      tenant_id: 1
    }
    |> new()
    |> Core.Repo.insert!()
  end

  1. I didn’t test this in prod yet since I’m still developing the system. I’m not changing the config.
  2. No, I have 3 workers for that queue, but they do not customize it.

Oh! That explains it. You’re using Core.Repo.insert!() rather than Oban.insert!(), so there’s no opportunity for Oban to calculate the partitioning key. You’d have the same problem with uniqueness, chains, etc. It should look like this:

  def create! do
    %{url: "http://localhost:4000/test_api", args: %{a: 1}, tenant_id: 1}
    |> new()
    |> Oban.insert!()
  end
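The same goes for bulk inserts: use Oban.insert_all rather than Repo.insert_all so every job still goes through the engine. A sketch (the tenant ids are made up):

```elixir
# Each changeset goes through Oban's engine on bulk insert, so
# partition keys, uniqueness, and so on are still applied.
jobs =
  for tenant_id <- [1, 2, 3] do
    Core.Workers.ApiCaller.new(%{
      url: "http://localhost:4000/test_api",
      args: %{a: 1},
      tenant_id: tenant_id
    })
  end

Oban.insert_all(jobs)
```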

Ah, damn, that’s correct! I forgot that Oban.insert isn’t just a delegate to Repo.insert but actually does work behind the scenes. Now it is working great!

Thanks for the help!