Oban Smart Engine stalls - Broken PostgreSQL LISTEN channel?

We experienced two complete Oban job dispatch stalls in a 12-hour window — something that had never happened before in our prod environment. Jobs accumulated in an available state for 70-85 minutes each time before self-resolving. We recently upgraded Oban from 2.17.12 to 2.20.3 and Oban Pro from 1.4.10 to 1.6.12 (running all migrations).

Environment

  • Oban 2.20.3 / Oban Pro 1.6.12 (latest)
  • Elixir 1.17.1 / OTP 27.0.1
  • PostgreSQL 18.1 (Cloud SQL)
  • Infrastructure: GKE Autopilot, 2 app pods, Cloud SQL Proxy sidecar
  • Clustering: libcluster with Cluster.Strategy.Kubernetes.DNS (verified healthy — Horde distributed registry works fine)

Oban Config

  [
    repo: DB.Repo,
    engine: Oban.Pro.Engines.Smart,
    notifier: Oban.Notifiers.Postgres,
    insert_trigger: false,
    plugins: [
      {Oban.Pro.Plugins.DynamicPruner, state_overrides: [completed: {:max_age, 3600}]},
      {Oban.Plugins.Cron, crontab: [...]},
      Oban.Pro.Plugins.DynamicLifeline,
      {Oban.Pro.Plugins.DynamicCron, crontab: []},
      Oban.Plugins.Reindexer
    ],
    queues: [...]
  ]

What happened

Two stalls on the same day:

  1. 06:51–08:16 UTC (~85 min, 761+ jobs accumulated)
  2. 10:27–11:39 UTC (~72 min, similar accumulation)

Both times, jobs kept inserting into oban_jobs with state = 'available', but no node picked them up, and both times the system self-resolved without intervention.
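For anyone hitting something similar, a quick way to quantify a stall like this from an attached console (a sketch; DB.Repo is our repo module, adjust for your own):

```elixir
# Count jobs per state to spot an accumulating :available backlog.
DB.Repo.query!(
  "SELECT state, count(*) FROM oban_jobs GROUP BY state ORDER BY count(*) DESC"
)
```

During the stalls, the available count climbed steadily while executing stayed near zero.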

What we found

  • Postgrex connection churn. We found 115+ Postgrex.Protocol disconnected entries on the day of the stalls. Searching 14 days of logs, there are zero such entries before that day. This coincided with a deployment that rolled pods onto new GKE Autopilot nodes.
  • Chronic "payload string too long" errors on pg_notify. These existed before the stalls (~2-5/day for weeks) but roughly tripled to ~10-12/day after a release that changed our job retry behavior (more jobs retrying instead of errors being silently swallowed). These errors alone didn't cause stalls - they'd been happening for weeks without incident.
  • Duplicate key violations throughout the stalls. The Postgres logs show oban_jobs_unique_index violations occurring continuously during both stall windows. Possibly evidence that each Oban node considered itself :solitary?

Our theory of the failure mode

We believe the Cloud SQL Proxy connection instability silently broke the PostgreSQL LISTEN channel that Oban.Notifiers.Postgres depends on. Specifically:

  1. Deployment rolls pods to new GKE nodes → Cloud SQL Proxy connections are flaky on the new nodes
  2. The LISTEN/NOTIFY channel dies silently (TCP connection may still appear open)
  3. Sonar pings are sent successfully (pg_notify returns no error) but never received
  4. Each node sees only itself → :solitary status
  5. The leader in :solitary stays in global stager mode and dispatches via Notifier.notify() - which still uses pg_notify
  6. The dispatch notification vanishes into the broken LISTEN channel
  7. No producers on any node wake up - not even the leader’s own, since it receives dispatch signals through the same broken LISTEN channel
  8. Eventually (~70-85 min) something reconnects and the backlog flushes
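If this theory holds, the notifier's own view should confirm it. Assuming Oban.Notifier.status/1 (which I believe shipped alongside the Sonar connectivity checks, please correct me if the name is off), a remote-console check during a stall might look like:

```elixir
# Run on a stalled node; Sonar-derived connectivity for the Oban instance.
# Expected values: :unknown | :isolated | :solitary | :clustered.
# Under our theory, every node would report :solitary here.
Oban.Notifier.status(Oban)
```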

Does this seem right, @sorentwo? If so, the fix would be to switch the notifier to Oban.Notifiers.PG. We'll also add telemetry handlers for [:oban, :notifier, :switch], [:oban, :stager, :switch], and [:oban, :peer, :election, :stop] so we can observe these events going forward. Any other advice or ideas? The broken LISTEN channel is just a theory, as we can't see it directly in the logs.
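For the telemetry handlers, this is the minimal sketch we have in mind (module and handler id are our own names, invented for illustration):

```elixir
defmodule MyApp.ObanHealthLogger do
  require Logger

  @events [
    [:oban, :notifier, :switch],
    [:oban, :stager, :switch],
    [:oban, :peer, :election, :stop]
  ]

  # Call once from Application.start/2, before Oban boots.
  def attach do
    :telemetry.attach_many("oban-health-logger", @events, &__MODULE__.handle/4, nil)
  end

  def handle(event, _measurements, metadata, _config) do
    # metadata carries the new status/mode, e.g. :solitary or :local.
    Logger.warning("oban event #{inspect(event)}: #{inspect(metadata)}")
  end
end
```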

Thanks!


Following up here: we applied the new notifier in prod but had another stall, so that wasn't it. Any other ideas, @sorentwo?

For future reference:

Turned out our sync workers had unique: [states: [:available, :scheduled]]. We recently changed them to return {:error, …} on transient failures instead of swallowing errors, which meant jobs entered :retryable for the first time.

Since :retryable wasn’t in the states list, the generated uniq_key column went NULL, the unique index stopped preventing duplicates, and a second job with the same identity got inserted. When Oban moved the original job back to :available, the key reappeared and boom - unique violation.

Fix was just adding the missing states: states: [:available, :scheduled, :executing, :retryable]
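Concretely, on the worker (module name and queue are hypothetical stand-ins for our sync workers):

```elixir
defmodule MyApp.SyncWorker do
  use Oban.Worker,
    queue: :sync,
    # Include :executing and :retryable so a job that fails transiently
    # still holds its uniq_key and keeps blocking duplicate inserts.
    unique: [states: [:available, :scheduled, :executing, :retryable]]

  @impl Oban.Worker
  def perform(%Oban.Job{args: args}) do
    # ... sync logic; returns {:error, reason} on transient failures,
    # which sends the job to :retryable rather than swallowing the error.
    :ok
  end
end
```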

Very subtle gotcha, might be worth calling out in the docs?


Yeah, unique state gotchas are tough. I think that’s partly why there are now new unique options that cover common groups of related states, to help avoid missing one.