We experienced two complete Oban job dispatch stalls in a 12-hour window — something that had never happened before in our prod environment. Jobs accumulated in the `available` state for 70-85 minutes each time before self-resolving. We recently upgraded Oban from 2.17.12 to 2.20.3 and Oban Pro from 1.4.10 to 1.6.12 (running all migrations).
Environment
- Oban 2.20.3 / Oban Pro 1.6.12 (latest)
- Elixir 1.17.1 / OTP 27.0.1
- PostgreSQL 18.1 (Cloud SQL)
- Infrastructure: GKE Autopilot, 2 app pods, Cloud SQL Proxy sidecar
- Clustering: libcluster with Cluster.Strategy.Kubernetes.DNS (verified healthy — Horde distributed registry works fine)
Oban Config
```elixir
[
  repo: DB.Repo,
  engine: Oban.Pro.Engines.Smart,
  notifier: Oban.Notifiers.Postgres,
  insert_trigger: false,
  plugins: [
    {Oban.Pro.Plugins.DynamicPruner, state_overrides: [completed: {:max_age, 3600}]},
    {Oban.Plugins.Cron, crontab: [...]},
    Oban.Pro.Plugins.DynamicLifeline,
    {Oban.Pro.Plugins.DynamicCron, crontab: []},
    Oban.Plugins.Reindexer
  ],
  queues: [...]
]
```
What happened
Two stalls on the same day:
- 06:51–08:16 UTC (~85 min, 761+ jobs accumulated)
- 10:27–11:39 UTC (~72 min, similar accumulation)
Both times, jobs kept inserting into `oban_jobs` with `state = 'available'`, but no node picked them up, and both times the system self-resolved.
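For reference, this is roughly how we measured the backlog (an ad-hoc IEx snippet; `DB.Repo` is our repo from the config above, the query itself is a hedged sketch):

```elixir
import Ecto.Query

# Count jobs stuck in the `available` state, per queue.
"oban_jobs"
|> where([j], j.state == "available")
|> group_by([j], j.queue)
|> select([j], {j.queue, count(j.id)})
|> DB.Repo.all()
```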
What we found
- Postgrex connection churn. We found 115+ `Postgrex.Protocol` disconnected entries on the day of the stalls; searching 14 days of logs, there are zero such entries before that day. This coincided with a deployment that rolled pods onto new GKE Autopilot nodes.
- Chronic `payload string too long` errors on `pg_notify`. These existed before the stalls (~2-5/day for weeks) but roughly tripled to ~10-12/day after a release that changed our Oban job retry behavior (more jobs retrying instead of being silently swallowed). These errors alone didn't cause stalls; they'd been happening for weeks without incident.
- Duplicate key violations throughout the stall. In the Postgres logs, `oban_jobs_unique_index` violations occurred continuously during both stall windows. Perhaps this is evidence that each Oban node was in a `:solitary` state?
Our theory of the failure mode
We believe the Cloud SQL Proxy connection instability silently broke the PostgreSQL LISTEN channel that Oban.Notifiers.Postgres depends on. Specifically:
- Deployment rolls pods to new GKE nodes → Cloud SQL Proxy connections are flaky on the new nodes
- The LISTEN/NOTIFY channel dies silently (TCP connection may still appear open)
- Sonar pings are sent successfully (pg_notify returns no error) but never received
- Each node sees only itself → `:solitary` status
- The leader in `:solitary` stays in global stager mode and dispatches via `Notifier.notify()`, which still uses `pg_notify`
- The dispatch notification vanishes into the broken LISTEN channel
- No producers on any node wake up — not even the leader's own, since it receives dispatch signals through the same broken LISTEN channel
- Eventually (~70-85 min) something reconnects and the backlog flushes
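If this theory is right, polling the notifier's own connectivity view should make it visible. A minimal sketch, assuming `Oban.Notifier.status/1` reports the Sonar-derived status as the docs describe:

```elixir
require Logger

# :clustered -> pings from other nodes are arriving (healthy)
# :solitary  -> only our own pings are seen
# :isolated  -> no pings at all, even our own (what a dead LISTEN
#               channel should look like, if the theory holds)
case Oban.Notifier.status(Oban) do
  :clustered -> :ok
  status -> Logger.warning("Oban notifier degraded: #{inspect(status)}")
end
```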
Does this seem right, @sorentwo? If so, the fix would be to switch the notifier to `Oban.Notifiers.PG` (we already cluster via libcluster). We'll also add telemetry handlers for `[:oban, :notifier, :switch]`, `[:oban, :stager, :switch]`, and `[:oban, :peer, :election, :stop]` so we can observe these events going forward. Any other advice or ideas? The broken LISTEN channel is just a theory; we can't see it directly in the logs.
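For the record, this is roughly the handler we plan to attach (the module name and log messages are ours; metadata keys like `:status`, `:mode`, and `:leader` are assumptions based on the telemetry docs):

```elixir
defmodule DB.ObanAlerts do
  require Logger

  @events [
    [:oban, :notifier, :switch],
    [:oban, :stager, :switch],
    [:oban, :peer, :election, :stop]
  ]

  # Attach once at application start.
  def attach do
    :telemetry.attach_many("oban-health-alerts", @events, &__MODULE__.handle_event/4, nil)
  end

  def handle_event([:oban, :notifier, :switch], _measure, meta, _cfg) do
    Logger.warning("Oban notifier status switched: #{inspect(meta[:status])}")
  end

  def handle_event([:oban, :stager, :switch], _measure, meta, _cfg) do
    Logger.warning("Oban stager switched mode: #{inspect(meta[:mode])}")
  end

  def handle_event([:oban, :peer, :election, :stop], _measure, meta, _cfg) do
    Logger.debug("Oban peer election finished: leader?=#{inspect(meta[:leader])}")
  end
end
```

We'd call `DB.ObanAlerts.attach/0` from `Application.start/2`.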
Thanks!