This weekend I experienced jobs accumulating in the available state across all queues. We run a two-node setup. One of the nodes was the leader while jobs were accumulating (checked with Oban.Peer.leader?/1), and none of the queues were paused (checked with Oban.check_queue/2). All Oban config is fairly default (the :peer option is not set, and there are two plugins: Cron and Pruner).
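For reference, here's how I ran those checks from a remote IEx session on each node (the :default queue is just an example; the return values shown are illustrative):

```elixir
# Is this node the Oban leader?
Oban.Peer.leader?(Oban)
#=> true

# Inspect a queue's runtime state; paused: false means it should be executing jobs
Oban.check_queue(Oban, queue: :default)
#=> %{paused: false, queue: "default", running: [], ...}
```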
Unfortunately I had no time to investigate. Respawning two new nodes (we’re on k8s) solved the issue and jobs were being processed again.
I want to prevent this in the future, but I'm not sure what I can do next time to get to the bottom of it. I checked the troubleshooting guide, but I don't think there is anything in there for this situation (I'm not using PgBouncer, for example). I read something about the Stager modes (local vs. global), but I'm not sure how to query for that (or whether it's useful information at all).
Without a running instance it’s difficult to say what the problem may have been. The fact that restarting got things running again points to an ephemeral issue, rather than something with the jobs table itself.
That would be useful information because it would indicate whether the nodes were able to communicate with each other, and whether the stager “thought” it was notifying queues about available jobs. Staging events are included in the default logger (and attaching it is recommended as part of the Ready for Production guide).
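If you aren't attaching it already, it's a one-liner in your application's start/2 callback:

```elixir
# Attaches a handler that logs job events along with the notifier and
# stager events mentioned above
Oban.Telemetry.attach_default_logger()
```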
We only log job-related events right now, so we’re missing out on other events. That’s something actionable I can fix. Thanks for the suggestion!
My first idea was to attach handlers for all events that end with :exception, but that might be a bit much. I'll start with the ones that are listed in the default logger. Are there any other events you can think of that might be of interest for this exact use case?
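For context, this is roughly what I'm attaching (a minimal sketch; MyApp.ObanLogger is my own module name, and the event names and metadata keys are the ones documented for Oban.Telemetry):

```elixir
defmodule MyApp.ObanLogger do
  require Logger

  @events [
    [:oban, :notifier, :switch],
    [:oban, :stager, :switch]
  ]

  def attach do
    :telemetry.attach_many("oban-connectivity-logger", @events, &__MODULE__.handle_event/4, nil)
  end

  # Fired when the notifier's connectivity status changes
  # (:isolated, :solitary, or :connected)
  def handle_event([:oban, :notifier, :switch], _measure, %{status: status}, _config) do
    Logger.warning("Oban notifier status changed: #{status}")
  end

  # Fired when the stager flips between :global and :local (polling) mode
  def handle_event([:oban, :stager, :switch], _measure, %{mode: mode}, _config) do
    Logger.warning("Oban stager switched to #{mode} mode")
  end
end
```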
After adding handlers for more Oban events, I got to see this sequence of events (I took the log messages from the default implementation in Oban.Telemetry):
```
Oban notifier only receiving messages from its own node, functionality may be degraded
Oban staging switched to local mode. Local mode polls for jobs for every queue.
Oban notifier is receiving messages from other nodes
Oban staging switched back to global mode
```
(this sequence happened twice with only a few seconds in between)
Nothing bad happened, and it only happened once today. So I’m not worried. But it raises some questions about how to interpret this.
Is this expected? Or could this be a hint that the network is flaky, or that the underlying Erlang Distribution is not stable? Or is the database network connection going bad when this happens?
That is normal if the network or database connection is flaky. Really, that’s why the events exist and why the stager switches modes. If you’re using the Postgres notifier, then we recommend switching to the PG notifier if possible (Scaling Applications — Oban v2.18.3).
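Switching is a small config change, assuming your nodes are connected via Distributed Erlang (the app name and plugin list here are placeholders based on the setup you described):

```elixir
# config/runtime.exs
import Config

config :my_app, Oban,
  repo: MyApp.Repo,
  # Use process groups over Distributed Erlang instead of Postgres LISTEN/NOTIFY
  notifier: Oban.Notifiers.PG,
  plugins: [Oban.Plugins.Cron, Oban.Plugins.Pruner]
```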
@sorentwo We also faced this exact issue some time ago, and we too had to restart the server for jobs to move into the executing state. But we have only a single node. Any ideas on how we can monitor/debug an issue like this?
This seems like an issue with Postgres pubsub. A disconnect error in versions prior to v0.20 would leave the Oban.Notifier in an undetectable, disconnected state. The node would keep emitting notifications about available jobs, but they wouldn't broadcast.
There are several ways you could monitor this:
Use check-ins in a tracker like Sentry or Honeybadger. Send a check-in from a cron job every minute and get alerted if they stop running.
Use cron monitors in Sentry, which have native support for Oban in recent versions.
Poll job counts and send them to an external tool like Grafana, Datadog, etc., and set up alerts on those. If you're using the PG notifier, then Oban.Met.latest/2 is a good option. Otherwise, query the actual database every minute or so (see the sketch after this list).
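As a sketch of that last option, here's a query against the Oban.Job schema that counts jobs sitting in the available state past their scheduled time (MyApp.Repo and the one-minute threshold are placeholders):

```elixir
import Ecto.Query

# Jobs that became runnable over a minute ago but still haven't been picked up.
# A count that stays above zero across several checks suggests staging stalled.
stuck_count =
  MyApp.Repo.one(
    from j in Oban.Job,
      where: j.state == "available",
      where: j.scheduled_at < ago(1, "minute"),
      select: count(j.id)
  )
```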
Oban.Notifiers.PG works like a charm on a single machine, unlike the default config. So I think the guide doesn't explain it clearly. A nice improvement would be: "if you're getting these errors, the likely cause is the notifier, and you could try an alternative implementation".