Oban queues disappearing randomly in production

This has been going on for a while now. I think it’s best described with an example:

We have an Oban queue “postgresql” with 100K jobs available.
We deploy production, the nodes start and the queues start processing the jobs.
Then the queues start to disappear (no errors in the logs) and processing slows down. It keeps going, but on the nodes the queues appear and disappear more or less at random, and at some point usually only about 1/3 of them are running.
This only happens with a few queues, usually the ones with a larger number of jobs waiting.

Anybody ever had this problem?

Which Oban and Pro versions? Also, which hosting provider? The only way I’m aware of that the producers would slow down and potentially crash is from query timeouts, but those would log errors.

Were these inserted all at once, or did they accumulate over time? It shouldn’t change whether the queue stays active, but Pro v1.7 shipped with a feature called automatic spacing, which prevents a big dump of jobs from overwhelming the queue.

Latest hex version, but this has been going on for a while. I can’t remember when it started, but it was the only issue we couldn’t figure out in private messages back then. Now it has gotten really annoying, so I’m looking for somebody who has experienced the same thing and maybe solved it.

Our DB is a heavily optimized Postgres on beefy hardware (256 GB RAM), and this happens even after a full vacuum, so I don’t believe speed is the problem. The jobs accumulated over time.

It looks like this: all of those queues should have a limit of 52 (13 nodes * 4) and be running 52 jobs, but the limit jumps up and down as queues disappear and restart on the nodes. This slows down processing, which is why there are postgres2 and postgres3 queues as a mitigation (so processing goes faster).

Latest Oban version and latest Pro version? The latest Pro release has fine-grained telemetry around the fetch transaction (the main thing a producer does), so if you’re on v1.7+ we can get some better instrumentation in place to see what’s happening.

* oban 2.22.1 (Hex package) (mix)
  locked at 2.22.1 (oban) af2508c1
* oban_met 1.2.0 (Hex package) (mix)
  locked at 1.2.0 (oban_met) 5c81fd33
* oban_pro 1.7.3 (Hex package) (mix)
  locked at 1.7.3 (oban/oban_pro) 76c8b2ed
* oban_web 2.12.4 (Hex package) (mix)
  locked at 2.12.4 (oban_web) f6262dc5

Okay, how should I provide the data then?

Attach a telemetry handler like this, which you can optionally scope to a single queue/producer, and gather metrics from a healthy instance as it starts to degrade. This example uses Logger, but you can write the output to a CSV or whatever you like, then email it to us at support.

  defmodule MyApp.ObanFetchTelemetry do
    require Logger

    @events [
      [:oban, :engine, :fetch_jobs, :demand, :stop],
      [:oban, :engine, :fetch_jobs, :fetch, :stop],
      [:oban, :engine, :fetch_jobs, :flush, :stop],
      [:oban, :engine, :fetch_jobs, :ack, :stop]
    ]

    def attach do
      :telemetry.attach_many("oban-fetch-timing", @events, &__MODULE__.handle/4, nil)
    end

    def handle([:oban, :engine, :fetch_jobs, stage, :stop], %{duration: duration}, meta, _) do
      Logger.info(
        message: "oban fetch_jobs stop",
        stage: stage,
        duration_us: System.convert_time_unit(duration, :native, :microsecond),
        queue: meta[:queue],
        count: meta[:count]
      )
    end
  end
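
If a CSV is easier to send than log lines, something along these lines might work. This is only a rough sketch, not an official recipe: the module name, the handler id, the file path, and the queue filter are all made up for illustration, and it assumes `meta[:queue]` holds the queue name as an atom or a string. The events and the `:queue` / `:count` metadata keys are the same ones used in the Logger handler.

```elixir
defmodule MyApp.ObanFetchCsv do
  # Hypothetical module: same events as the Logger handler, written to a CSV file.
  @events [
    [:oban, :engine, :fetch_jobs, :demand, :stop],
    [:oban, :engine, :fetch_jobs, :fetch, :stop],
    [:oban, :engine, :fetch_jobs, :flush, :stop],
    [:oban, :engine, :fetch_jobs, :ack, :stop]
  ]

  # Assumed output location; change it to wherever the node can write.
  @path "oban_fetch_timing.csv"

  # Pass a queue name (e.g. "postgresql") to record only that queue, or nil for all queues.
  def attach(queue \\ nil) do
    :telemetry.attach_many("oban-fetch-csv", @events, &__MODULE__.handle/4, %{queue: queue})
  end

  def handle([:oban, :engine, :fetch_jobs, stage, :stop], %{duration: duration}, meta, %{queue: queue}) do
    # Compare as strings so the filter works whether the metadata holds an atom or a string.
    if is_nil(queue) or to_string(meta[:queue]) == to_string(queue) do
      File.write!(@path, row(DateTime.utc_now(), stage, duration, meta), [:append])
    end

    :ok
  end

  # Builds one line: timestamp,stage,duration_us,queue,count
  def row(time, stage, duration, meta) do
    duration_us = System.convert_time_unit(duration, :native, :microsecond)

    Enum.join([DateTime.to_iso8601(time), stage, duration_us, meta[:queue], meta[:count]], ",") <> "\n"
  end
end
```

Call `MyApp.ObanFetchCsv.attach("postgresql")` once at startup, for example from your application's `start/2`. Keep in mind that `:telemetry` detaches a handler that raises, so if the file isn't writable the rows will simply stop appearing.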