rathoud96

All Oban queue producers crash simultaneously due to ObanRepo pool exhaustion on Cloud Run (VPC idle timeout)

Environment

Elixir 1.18.3-otp-27 / OTP 27.3.4
Oban 2.20.2
Phoenix 1.7.x
db_connection 2.8.1 / Postgrex 0.21.1
Infrastructure: Google Cloud Run (serverless, min 1 instance, max 10) behind a private VPC (vpc-egress=private-ranges-only)

Setup

We run Oban on a dedicated ObanRepo (separate from our main Repo) with the following config:

runtime.exs

    config :myapp, ObanRepo,
      url: database_url,
      pool_size: 5,
      prepare: :unnamed,
      idle_interval: 15_000,
      connect_timeout: 10_000,
      socket_options: [keepalive: true]

    config :myapp, Oban,
      repo: ObanRepo,
      peer: Oban.Peers.Postgres,
      notifier: Oban.Notifiers.PG,
      queues: [
        metadata_discovery_high: 3,
        metadata_download_high: 1,
        metadata_discovery: 2,
        metadata_download: 1,
        metadata_enrichment: 1,
        metadata_search: 1,
        token_refresh: 2,
        default: 5
      ]

Problem

Roughly once a day, seemingly unprovoked (no active retrieval jobs are running), all Oban queue producers crash simultaneously, and Oban becomes non-functional until the instance restarts.

The failure always follows the same cascade:

Step 1: SSL connections drop silently

    [error] Postgrex.Protocol (#PID<0.3205.0>) disconnected:
    ** (DBConnection.ConnectionError) ssl recv (idle): closed

    [error] Postgrex.Protocol (#PID<0.3206.0>) failed to connect:
    ** (DBConnection.ConnectionError) ssl send: closed

Step 2: Postgrex reconnection attempts time out

    [error] Postgrex.Protocol (#PID<0.3209.0>) timed out because it was handshaking for longer than 15000ms

Step 3: Every queue producer terminates

    [error] GenServer {Oban.Registry, {Oban, {:producer, "metadata_discovery_high"}}} terminating
    ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 700ms.

    [error] GenServer {Oban.Registry, {Oban, {:producer, "metadata_download"}}} terminating
    ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 5201ms.

    …same for all 8 queues

Step 4: Peer loses leader election

    [warning] Oban.Peer.leader?/2 check failed due to {:timeout, {GenServer, :call, [#PID<0.3276.0>, :leader?, 5000]}}

Questions

1. Is poll_interval the right lever here? With Oban.Notifiers.PG handling real-time wakeups, is there any meaningful downside to a 30-second poll interval beyond a max 30-second delay on missed notifications?
2. What is the recommended minimum ObanRepo pool size for a setup with Oban.Peers.Postgres + Oban.Notifiers.PG + 8 queues? We’re trying to right-size rather than just throw connections at it.
3. Is there an Oban-level setting for environments with network-enforced idle timeouts (serverless/VPC) that we’re missing, beyond idle_interval (which only fires every 15s, potentially too slow) and socket-level keepalive: true?

Any guidance appreciated, especially from folks running Oban on GCP Cloud Run or similar ephemeral/serverless infrastructure.

/phoenix /oban #ecto #deployment #troubleshooting

2026-05-21 12:53:26 UTC

sorentwo

Oban Core Team

There are numerous improvements to safe querying in Oban v2.21 and v2.22 that should help with frequent queries causing a failure cascade.

On poll_interval: there isn’t a poll_interval option anymore; it was renamed to stage_interval long ago. You could raise that interval so the intermittent staging queries run less often. However, that would coarsen the granularity of scheduled jobs (they would become available at most every 30 seconds rather than down to the second).
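
For readers who want to try the longer interval, a minimal sketch follows. It assumes the option is set at the top level of the Oban config, alongside the keys already shown in the question, and it uses the 30-second value floated there; check the configuration docs for the Oban version in use before relying on it.

```elixir
# config/runtime.exs (omit `import Config` if the file already has it)
import Config

config :myapp, Oban,
  # Assumption: 30 seconds, the interval proposed in the question.
  # Scheduled and retryable jobs are only staged this often, so a job
  # scheduled "now" may wait up to 30 seconds before a queue sees it.
  stage_interval: :timer.seconds(30)
```

Because config entries for the same application and key are merged, this can sit next to the existing `config :myapp, Oban` block without repeating the queue list.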

On pool size: that depends entirely on throughput for those queues. You could easily run that with 10 connections. The issue you’re encountering is about a missing (unreachable) database, not pool exhaustion.
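
As a sketch of that suggestion, the only change to the repo config from the question would be the pool size. The value 10 comes from the reply above; it is a starting point to measure against, not a tuned recommendation, and it does not address the dropped connections themselves.

```elixir
# config/runtime.exs (omit `import Config` if the file already has it)
import Config

config :myapp, ObanRepo,
  # Raised from 5 to 10 per the reply; adjust after observing throughput.
  pool_size: 10
```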

On a setting for idle timeouts: nothing in particular that I’m aware of. Oban is designed for long-running processes with a consistent database connection. Using it in an ephemeral environment is bound to cause some issues.