Oban Smart Engine stalls - Broken PostgreSQL LISTEN channel?

We experienced two complete Oban job dispatch stalls in a 12-hour window — something that had never happened before in our prod environment. Jobs accumulated in an available state for 70-85 minutes each time before self-resolving. We recently upgraded Oban from 2.17.12 to 2.20.3 and Oban Pro from 1.4.10 to 1.6.12 (running all migrations).

Environment

  • Oban 2.20.3 / Oban Pro 1.6.12 (latest)
  • Elixir 1.17.1 / OTP 27.0.1
  • PostgreSQL 18.1 (Cloud SQL)
  • Infrastructure: GKE Autopilot, 2 app pods, Cloud SQL Proxy sidecar
  • Clustering: libcluster with Cluster.Strategy.Kubernetes.DNS (verified healthy — Horde distributed registry works fine)

Oban Config

  [
    repo: DB.Repo,
    engine: Oban.Pro.Engines.Smart,
    notifier: Oban.Notifiers.Postgres,
    insert_trigger: false,
    plugins: [
      {Oban.Pro.Plugins.DynamicPruner, state_overrides: [completed: {:max_age, 3600}]},
      {Oban.Plugins.Cron, crontab: [...]},
      Oban.Pro.Plugins.DynamicLifeline,
      {Oban.Pro.Plugins.DynamicCron, crontab: []},
      Oban.Plugins.Reindexer
    ],
    queues: [...]
  ]

What happened

Two stalls on the same day:

  1. 06:51–08:16 UTC (~85 min, 761+ jobs accumulated)
  2. 10:27–11:39 UTC (~72 min, similar accumulation)

Both times, jobs kept inserting into oban_jobs with state = 'available', but no node picked them up, and both times the system self-resolved without intervention.
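For anyone hitting something similar, a quick way to quantify a stall like this from an attached console (a sketch; DB.Repo is our repo module, adjust for your own):

```elixir
# Count jobs per state to spot an accumulating :available backlog.
DB.Repo.query!(
  "SELECT state, count(*) FROM oban_jobs GROUP BY state ORDER BY count(*) DESC"
)
```

During the stalls, the available count climbed steadily while executing stayed near zero.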

What we found

  • Postgrex connection churn. We found 115+ Postgrex.Protocol disconnected entries on the day of the stalls. Searching 14 days of logs, there are zero such entries before that day. This coincided with a deployment that rolled pods onto new GKE Autopilot nodes.
  • Chronic "payload string too long" errors on pg_notify. These existed before the stalls (~2-5/day for weeks) but roughly tripled to ~10-12/day after a release that changed our job retry behavior (more jobs retrying instead of errors being silently swallowed). These errors alone didn't cause stalls - they'd been happening for weeks without incident.
  • Duplicate key violations throughout the stalls. The Postgres logs show oban_jobs_unique_index violations occurring continuously during both stall windows. Possibly evidence that each Oban node considered itself :solitary?

Our theory of the failure mode

We believe the Cloud SQL Proxy connection instability silently broke the PostgreSQL LISTEN channel that Oban.Notifiers.Postgres depends on. Specifically:

  1. Deployment rolls pods to new GKE nodes → Cloud SQL Proxy connections are flaky on the new nodes
  2. The LISTEN/NOTIFY channel dies silently (TCP connection may still appear open)
  3. Sonar pings are sent successfully (pg_notify returns no error) but never received
  4. Each node sees only itself → :solitary status
  5. The leader in :solitary stays in global stager mode and dispatches via Notifier.notify() - which still uses pg_notify
  6. The dispatch notification vanishes into the broken LISTEN channel
  7. No producers on any node wake up - not even the leader’s own, since it receives dispatch signals through the same broken LISTEN channel
  8. Eventually (~70-85 min) something reconnects and the backlog flushes
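If this theory holds, the notifier's own view should confirm it. Assuming Oban.Notifier.status/1 (which I believe shipped alongside the Sonar connectivity checks, please correct me if the name is off), a remote-console check during a stall might look like:

```elixir
# Run on a stalled node; Sonar-derived connectivity for the Oban instance.
# Expected values: :unknown | :isolated | :solitary | :clustered.
# Under our theory, every node would report :solitary here.
Oban.Notifier.status(Oban)
```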

Does this seem right, @sorentwo? If so, the fix would be to switch the notifier to Oban.Notifiers.PG. We'll also add telemetry handlers for [:oban, :notifier, :switch], [:oban, :stager, :switch], and [:oban, :peer, :election, :stop] so we can observe these events going forward. Any other advice or ideas? The broken LISTEN channel is just a theory, as we can't see it directly in the logs.
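For the telemetry handlers, this is the minimal sketch we have in mind (module and handler id are our own names, invented for illustration):

```elixir
defmodule MyApp.ObanHealthLogger do
  require Logger

  @events [
    [:oban, :notifier, :switch],
    [:oban, :stager, :switch],
    [:oban, :peer, :election, :stop]
  ]

  # Call once from Application.start/2, before Oban boots.
  def attach do
    :telemetry.attach_many("oban-health-logger", @events, &__MODULE__.handle/4, nil)
  end

  def handle(event, _measurements, metadata, _config) do
    # metadata carries the new status/mode, e.g. :solitary or :local.
    Logger.warning("oban event #{inspect(event)}: #{inspect(metadata)}")
  end
end
```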

Thanks!


Following up here: we applied the new notifier in prod but had another stall, so that wasn't it. Any other ideas, @sorentwo?

For future reference:

Turned out our sync workers had unique: [states: [:available, :scheduled]]. We recently changed them to return {:error, …} on transient failures instead of swallowing errors, which meant jobs entered :retryable for the first time.

Since :retryable wasn’t in the states list, the generated uniq_key column went NULL, the unique index stopped preventing duplicates, and a second job with the same identity got inserted. When Oban moved the original job back to :available, the key reappeared and boom - unique violation.

Fix was just adding the missing states: states: [:available, :scheduled, :executing, :retryable]
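Concretely, on the worker (module name and queue are hypothetical stand-ins for our sync workers):

```elixir
defmodule MyApp.SyncWorker do
  use Oban.Worker,
    queue: :sync,
    # Include :executing and :retryable so a job that fails transiently
    # still holds its uniq_key and keeps blocking duplicate inserts.
    unique: [states: [:available, :scheduled, :executing, :retryable]]

  @impl Oban.Worker
  def perform(%Oban.Job{args: args}) do
    # ... sync logic; returns {:error, reason} on transient failures,
    # which sends the job to :retryable rather than swallowing the error.
    :ok
  end
end
```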

Very subtle gotcha, might be worth calling out in the docs?


Yeah, unique state gotchas are tough. I think that’s partly why there are now new unique options that cover common groups of related states, to help avoid missing one.