How to investigate Oban jobs stuck in 'available' state, across all queues?

Hi! :wave:

This weekend I experienced jobs accumulating in the available state across all queues. We run a two-node setup. One of the nodes was the leader while jobs were accumulating (checked with Oban.Peer.leader?/1), and none of the queues were paused (checked with Oban.check_queue/2). The Oban config is fairly default (the :peer option is not set, and there are two plugins: Cron and Pruner).
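For reference, these were the checks I ran from a remote shell on each node (assuming the default Oban instance name and a queue called :default):

```elixir
# Is this node the cluster leader? (default instance name Oban)
Oban.Peer.leader?(Oban)

# Inspect a queue's producer state; the returned map includes
# :paused, :running, :limit, etc.
Oban.check_queue(queue: :default)
```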

Unfortunately I had no time to investigate. Respawning two new nodes (we’re on k8s) solved the issue and jobs were being processed again.

I want to prevent this in the future, but I’m not sure what I can do next time to get to the bottom of this. I checked the troubleshooting guide, but I don’t think there is something in there for this situation (I’m not using PgBouncer for example). I read something about the Stager modes (local vs global), but I’m not sure how to query for that (and if it’s useful information at all).

Without a running instance it’s difficult to say what the problem may have been. The fact that restarting got things running again points to an ephemeral issue, rather than something with the jobs table itself.

That would be useful information, because it would indicate whether the nodes were able to communicate with each other and whether the stager "thought" it was notifying queues about available jobs. Staging events are included in the default logger (and attaching it is recommended as part of the Ready for Production guide).
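Attaching the default logger is a one-liner, e.g. in your application start callback (the :info level shown here is just a suggestion):

```elixir
# Logs job events plus stager/notifier/peer diagnostics on recent
# Oban versions.
Oban.Telemetry.attach_default_logger(level: :info)
```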


We only log job-related events right now, so we’re missing out on other events. That’s something actionable I can fix. Thanks for the suggestion!

My first idea was to attach handlers for all events that end with :exception, but that might be a bit too much. I’ll start with the ones that are listed in the default logger. Are there any other events you can think of that might be of interest for this exact use-case?

The peer, notifier, and stager events are all helpful for diagnosing whatever happened in your use-case.
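Those can all be wired up with a single :telemetry.attach_many/4 call. A sketch, where the handler id and log format are illustrative (not anything Oban prescribes):

```elixir
require Logger

:telemetry.attach_many(
  "oban-diagnostics",
  [
    # Stager flipping between :global and :local mode
    [:oban, :stager, :switch],
    # Notifier connectivity status changes
    [:oban, :notifier, :switch],
    # Leadership election results
    [:oban, :peer, :election, :stop]
  ],
  fn event, _measurements, meta, _config ->
    Logger.warning("oban diagnostic event=#{inspect(event)} meta=#{inspect(meta)}")
  end,
  nil
)
```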


After adding handlers for more Oban events, I got to see this sequence of events (I took the log messages from the default implementation in Oban.Telemetry):

  • Oban notifier only receiving messages from its own node, functionality may be degraded
  • Oban staging switched to local mode. Local mode polls for jobs for every queue.
  • Oban notifier is receiving messages from other nodes
  • Oban staging switched back to global mode

(this sequence happened twice with only a few seconds in between)

Nothing bad happened, and it only happened once today. So I’m not worried. But it raises some questions about how to interpret this.

Is this expected? Or could it be a hint that the network is flaky, or that the underlying Erlang distribution is unstable? Or does the database connection go bad when this happens?

That is normal if the network or database connection is flaky; really, that’s why the events exist and why the stager switches modes. If you’re using the Postgres notifier, we recommend switching to the PG notifier if possible (Scaling Applications — Oban v2.18.3).
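For anyone else reading: the switch is a one-line config change. A sketch, assuming an app named :my_app with clustered nodes (the PG notifier rides on Distributed Erlang rather than Postgres LISTEN/NOTIFY, so it keeps working when the database connection hiccups):

```elixir
# config/config.exs
config :my_app, Oban,
  notifier: Oban.Notifiers.PG,
  repo: MyApp.Repo,
  queues: [default: 10]
```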
