This weekend I experienced jobs accumulating in the available state across all queues. We run a two-node setup. One of the nodes was the leader while jobs were accumulating (checked with Oban.Peer.leader?/1), and none of the queues were paused (checked with Oban.check_queue/2). All Oban config is fairly default (the :peer option is not set, and there are two plugins: Cron and Pruner).
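For reference, here's how I ran those checks from a remote IEx session on each node (the :default queue is just an example; the return values shown are illustrative):

```elixir
# Is this node the Oban leader?
Oban.Peer.leader?(Oban)
#=> true

# Inspect a queue's runtime state; paused: false means it should be executing jobs
Oban.check_queue(Oban, queue: :default)
#=> %{paused: false, queue: "default", running: [], ...}
```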
Unfortunately I had no time to investigate. Respawning two new nodes (we’re on k8s) solved the issue and jobs were being processed again.
I want to prevent this in the future, but I'm not sure what I can do next time to get to the bottom of it. I checked the troubleshooting guide, but I don't think there is anything in there for this situation (I'm not using PgBouncer, for example). I read something about the Stager modes (local vs. global), but I'm not sure how to query for that (or whether it's useful information at all).
Without a running instance it’s difficult to say what the problem may have been. The fact that restarting got things running again points to an ephemeral issue, rather than something with the jobs table itself.
That would be useful information because it would indicate whether the nodes were able to communicate with each other, and whether the stager “thought” it was notifying queues about available jobs. Staging events are included in the default logger (and attaching it is recommended as part of the Ready for Production guide).
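If you aren't attaching it already, it's a one-liner in your application's start/2 callback:

```elixir
# Attaches a handler that logs job events along with the notifier and
# stager events mentioned above
Oban.Telemetry.attach_default_logger()
```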
We only log job-related events right now, so we’re missing out on other events. That’s something actionable I can fix. Thanks for the suggestion!
My first idea was to attach handlers for all events that end with :exception, but that might be a bit much. I'll start with the ones that are listed in the default logger. Are there any other events you can think of that might be of interest for this exact use case?
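For context, this is roughly what I'm attaching (a minimal sketch; MyApp.ObanLogger is my own module name, and the event names and metadata keys are the ones documented for Oban.Telemetry):

```elixir
defmodule MyApp.ObanLogger do
  require Logger

  @events [
    [:oban, :notifier, :switch],
    [:oban, :stager, :switch]
  ]

  def attach do
    :telemetry.attach_many("oban-connectivity-logger", @events, &__MODULE__.handle_event/4, nil)
  end

  # Fired when the notifier's connectivity status changes
  # (:isolated, :solitary, or :connected)
  def handle_event([:oban, :notifier, :switch], _measure, %{status: status}, _config) do
    Logger.warning("Oban notifier status changed: #{status}")
  end

  # Fired when the stager flips between :global and :local (polling) mode
  def handle_event([:oban, :stager, :switch], _measure, %{mode: mode}, _config) do
    Logger.warning("Oban stager switched to #{mode} mode")
  end
end
```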
After adding handlers for more Oban events, I got to see this sequence of events (I took the log messages from the default implementation in Oban.Telemetry):
```
Oban notifier only receiving messages from its own node, functionality may be degraded
Oban staging switched to local mode. Local mode polls for jobs for every queue.
Oban notifier is receiving messages from other nodes
Oban staging switched back to global mode
```
(this sequence happened twice with only a few seconds in between)
Nothing bad happened, and it only happened once today. So I’m not worried. But it raises some questions about how to interpret this.
Is this expected? Or could this be a hint that the network is flaky, or that the underlying Erlang Distribution is not stable? Or is the database network connection going bad when this happens?
That is normal if the network or database connection is flaky. Really, that’s why the events exist and why the stager switches modes. If you’re using the Postgres notifier, then we recommend switching to the PG notifier if possible (Scaling Applications — Oban v2.18.3).
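Switching is a small config change, assuming your nodes are connected via Distributed Erlang (the app name and plugin list here are placeholders based on the setup you described):

```elixir
# config/runtime.exs
import Config

config :my_app, Oban,
  repo: MyApp.Repo,
  # Use process groups over Distributed Erlang instead of Postgres LISTEN/NOTIFY
  notifier: Oban.Notifiers.PG,
  plugins: [Oban.Plugins.Cron, Oban.Plugins.Pruner]
```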
@sorentwo We also faced this exact issue some time ago, and we too had to restart the server for jobs to move into the executing state. But we have only a single node. Any ideas on how we can monitor/debug an issue like this?
This seems like an issue with Postgres pubsub. A disconnect error in versions prior to v0.20 would leave the Oban.Notifier in an undetectable, disconnected state. The node would keep emitting notifications about available jobs, but they wouldn't broadcast.
There are several ways you could monitor this:
Use check-ins in a tracker like Sentry or Honeybadger. Send a check-in from a cron job every minute and get alerted if they stop running.
Use cron monitors in Sentry, which have native support for Oban in recent versions.
Poll job counts and send them to an external tool like Grafana, Datadog, etc., and set up alerts on those. If you're using the PG notifier, then Oban.Met.latest/2 is a good option. Otherwise, query the actual database every minute or so (see the sketch after this list).
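As a sketch of that last option, here's a query against the Oban.Job schema that counts jobs sitting in the available state past their scheduled time (MyApp.Repo and the one-minute threshold are placeholders):

```elixir
import Ecto.Query

# Jobs that became runnable over a minute ago but still haven't been picked up.
# A count that stays above zero across several checks suggests staging stalled.
stuck_count =
  MyApp.Repo.one(
    from j in Oban.Job,
      where: j.state == "available",
      where: j.scheduled_at < ago(1, "minute"),
      select: count(j.id)
  )
```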
Oban.Notifiers.PG works like a charm on a single machine, unlike the default config. So I think the guide doesn't explain it clearly. A nice improvement would be: "if you're getting these errors, the likely cause is the notifier, and you could try an alternative implementation".