linusdm
How to investigate Oban jobs stuck in 'available' state, across all queues?
Hi!
This weekend I experienced jobs accumulating in the 'available' state across all queues. We run a two-node setup. One of the nodes was the leader while jobs were accumulating (checked with Oban.Peer.leader?/1), and none of the queues were paused (checked with Oban.check_queue/2). All Oban config is fairly default (the :peer option is not set, and there are two plugins: Cron and Pruner).
Unfortunately I had no time to investigate. Respawning two new nodes (we’re on k8s) solved the issue and jobs were being processed again.
I want to prevent this in the future, but I’m not sure what I can do next time to get to the bottom of it. I checked the troubleshooting guide, but I don’t think there is anything in there for this situation (I’m not using PgBouncer, for example). I read something about the Stager modes (local vs global), but I’m not sure how to query for that (or whether it’s useful information at all).
Most Liked
sorentwo
The peer, notifier, and stager events are all helpful for diagnosing whatever happened in your use-case.
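To capture those events next time, a small telemetry handler can log them. This is only a sketch: the module name ObanDiagnostics is hypothetical, and the event names and metadata keys (:leader, :status, :mode) are what I recall from the Oban.Telemetry docs for recent releases, so check them against the Oban version you run.

```elixir
defmodule ObanDiagnostics do
  @moduledoc """
  Hypothetical helper that logs Oban peer, notifier, and stager events.
  """

  require Logger

  # Assumed event names; confirm them in the Oban.Telemetry docs for your version.
  @events [
    [:oban, :peer, :election, :stop],
    [:oban, :notifier, :switch],
    [:oban, :stager, :switch]
  ]

  @doc "Attach the handler once, for example at application start."
  def attach do
    :telemetry.attach_many("oban-diagnostics", @events, &__MODULE__.handle_event/4, nil)
  end

  @doc false
  def handle_event(event, _measurements, metadata, _config) do
    Logger.info(describe(event, metadata))
  end

  @doc "Turn an event and its metadata into a log line."
  def describe([:oban, :peer, :election, :stop], meta),
    do: "oban peer election: leader=#{inspect(meta[:leader])}"

  def describe([:oban, :notifier, :switch], meta),
    do: "oban notifier status: #{inspect(meta[:status])}"

  def describe([:oban, :stager, :switch], meta),
    do: "oban stager mode: #{inspect(meta[:mode])}"
end
```

Calling ObanDiagnostics.attach() before Oban starts should give you a log trail showing which node was leader, whether the notifier considered itself connected, and whether the stager fell back to local mode.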
sorentwo
This seems like an issue with Postgres pubsub. A disconnect error in versions prior to v0.20 would leave the Oban.Notifier in an undetectable disconnected state. The node would keep emitting notifications about available jobs, but they would never actually be broadcast.
There are several ways you could monitor this:
- Use check-ins in a tracker like Sentry or Honeybadger. Send a check-in from a cron job every minute and get alerted if they stop arriving.
- Use cron monitors in Sentry, which have native support for Oban in recent versions.
- Poll job counts, send them to an external tool like Grafana, Datadog, etc., and set up alerts on those. If you’re using the PG notifier, then Oban.Met.latest/2 is a good option. Otherwise, query the actual database every minute or so.
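For the "query the actual database" option, something along these lines could work. It is a rough sketch, not an official recipe: AvailableJobsCheck is a hypothetical module, it assumes the default oban_jobs table in the default "public" prefix, and it expects you to pass in your own Ecto repo (for example MyApp.Repo).

```elixir
defmodule AvailableJobsCheck do
  @moduledoc """
  Hypothetical poller: counts 'available' jobs per queue and flags backlogs.
  """

  # Assumes Oban's default table name and the default "public" prefix.
  @sql """
  SELECT queue, count(*)
  FROM oban_jobs
  WHERE state = 'available'
  GROUP BY queue
  """

  @doc "Query the database through an Ecto repo and return %{queue => count}."
  def counts(repo) do
    repo
    |> Ecto.Adapters.SQL.query!(@sql, [])
    |> Map.fetch!(:rows)
    |> to_counts()
  end

  @doc "Convert [[queue, count], ...] rows into a map of queue => count."
  def to_counts(rows), do: Map.new(rows, fn [queue, count] -> {queue, count} end)

  @doc "Return the queues whose backlog exceeds the threshold, sorted by name."
  def backlogged(counts, threshold) do
    counts
    |> Enum.filter(fn {_queue, count} -> count > threshold end)
    |> Enum.map(fn {queue, _count} -> queue end)
    |> Enum.sort()
  end
end
```

You could run AvailableJobsCheck.counts(MyApp.Repo) every minute, push the numbers to your metrics tool, and alert on whatever threshold fits your workload. A busy but healthy queue can also carry a backlog, so alerting on a count that keeps growing is probably more reliable than a fixed number.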
sorentwo
Without a running instance it’s difficult to say what the problem may have been. The fact that restarting got things running again points to an ephemeral issue, rather than something with the jobs table itself.
Knowing the stager mode would be useful information because it would indicate whether the nodes were able to communicate with each other, and whether the stager “thought” it was notifying queues about available jobs. Staging events are included in the default logger (which is recommended as part of the Ready for Production guide).
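Attaching the default logger is a one-liner in your application's start callback. A minimal sketch, assuming an app named :my_app with a MyApp.Repo and Oban config stored under the Oban key (all placeholder names):

```elixir
defmodule MyApp.Application do
  use Application

  @impl Application
  def start(_type, _args) do
    # Attach before the supervision tree starts so early events are logged too.
    Oban.Telemetry.attach_default_logger(level: :info)

    children = [
      MyApp.Repo,
      {Oban, Application.fetch_env!(:my_app, Oban)}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```

With that in place, the stager mode switches should already be in your logs the next time jobs pile up.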







