Oban jobs stuck in available state

karlosmid · April 17, 2025, 1:57pm

Hi! We have following oban versions:
Oban v2.17.12
Oban.Web v2.9.7
Oban.Pro v1.4.13
Elixir:
erlang 27.0.1
elixir 1.17.2-otp-27

We have 20.8 million of messages in available state, all from same queue.
And those are stuck in that state.
Application restart does not help.
In oban web, we can not load available jobs, because we got connection error:

13:39:59.043 [error] Ch.Connection (#PID<0.12344.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.94175.0> timed out because it queued and checked out the connection for longer than 15000ms

We have a lot of DBConnection errors.

How to unstuck those available jobs?

Thank you!

hubertlepicki · April 17, 2025, 2:35pm

What’s your database set up? You may need to bump up the database instance so it can handle that amount of messages and query it effectively

karlosmid · April 17, 2025, 5:40pm

Hi, thank you for reply.
Database instance is on rds, setup up for production. Yes, traffic is too much due to database connection errors.
Actually my question is what does job status available means. Is it for jobs that failed to successfully finish?

benwilson512 · April 17, 2025, 5:45pm

No, that is for jobs waiting in the queue to run.

madclaws · April 17, 2025, 5:50pm

Similar issue as here.

sorentwo · April 17, 2025, 6:03pm

There are several possible reasons why jobs won’t run, but it seems specifically due to those timeout errors. Some immediate changes you can make to try and get the system unstuck:

Rebuild the indexes (reindex table oban_jobs), force analyze (vacuum analyze oban_jobs)
Increase the size of your RDS instance (it may be fine for normal usage, but not for a backlog of 20m jobs)

Manually move most of those jobs to a scheduled state to space them out. Something like this could work to reduce the available count down to 100k jobs:

update oban_jobs set state = 'scheduled', scheduled_at = now() + '1 hour'::interval
where id < (select min(id) from oban_jobs where state = 'available') - 100000

Beyond that, here are some other suggestions once the system is moving again:

Apply some of the options from the official scaling guide.
Upgrade to Oban v2.19 and Pro v1.5+, primarily for the new check_available/1 implementation that optimizes the query to check for available jobs (benchmark info in the linked changelog)
Upgrade to Oban Web v2.10+ (v2.11 if possible), because it uses Oban.Met for much less database impact (especially count estimates)

sorentwo · April 17, 2025, 6:09pm

This doesn’t appear to be the same issue. This issue is from database queries timing out, most likely the staging query that checks for available jobs. Restarting the app can’t fix the performance of that query; it either needs a more powerful DB or improved queries.

Since restarting the app fixed your scenario, it looks more like a pubsub failure. If you’re using the Postgres notifier and postgrex prior to v0.20, there is a known issue with silent disconnects that broke pubusb.

madclaws · April 17, 2025, 6:18pm

In our case, we are not clustered. But we are indeed using Postgres notifier and postgrex 0.19. Will upgrading postgrex fix the issue?

sorentwo · April 17, 2025, 6:20pm

It most likely will. I’ve replied in the other thread with more details, including some suggestions on how to monitor for that situation.

karlosmid · April 21, 2025, 4:57pm

@sorentwo, thank you very much for this advice. We first did 1 and 3, and that helped.
We changed 3. to:

WITH jobs_to_update AS (
  SELECT id
  FROM oban_jobs
  WHERE state = 'available'
  LIMIT 100000
)
UPDATE oban_jobs
SET state = 'scheduled',
    scheduled_at = now() + INTERVAL '1 hour'
WHERE id IN (SELECT id FROM jobs_to_update);

Because for first query we got UPDATED 0 records.

I have a question for status discarded. When I click on a discarded job, I can see 20/20 retries with status completed. What does it mean completed? That we depleted max number of retirees and that job was not actually done?
In that case, could we also apply advice from point 3?

Also, I checked Oban docs, but could not found what is job lifecycle between states.

We also followed instructions for official scaling guide:

notifier: Oban.Notifiers.PG,
          insert_trigger: false

as we are in node cluster on aws using libcluster

Thank you!

sorentwo · April 21, 2025, 8:31pm

The discarded status means it exhausted all of its available attempts. It’s not completed, which means it ran successfully.

There’s nothing comprehensive in the oban docs, but we’ll add a guide. You can learn more in the oban_training introduction though.

sorentwo · April 22, 2025, 4:16pm

There’s a Job Lifecycle guide now

dimitarvp · April 22, 2025, 4:23pm

Now that is an extremely valuable write-up. Respects for providing it.