All Oban queue producers crash simultaneously due to ObanRepo pool exhaustion on Cloud Run (VPC idle timeout)

Environment

  • Elixir 1.18.3-otp-27 / OTP 27.3.4
  • Oban 2.20.2
  • Phoenix 1.7.x
  • db_connection 2.8.1 / Postgrex 0.21.1
  • Infrastructure: Google Cloud Run (serverless, min 1 instance, max 10) behind a private VPC (vpc-egress=private-ranges-only)

Setup

We run Oban on a dedicated ObanRepo (separate from our main Repo) with the following config:

runtime.exs

    config :myapp, ObanRepo,
      url: database_url,
      pool_size: 5,
      prepare: :unnamed,
      idle_interval: 15_000,
      connect_timeout: 10_000,
      socket_options: [keepalive: true]

    config :myapp, Oban,
      repo: ObanRepo,
      peer: Oban.Peers.Postgres,
      notifier: Oban.Notifiers.PG,
      queues: [
        metadata_discovery_high: 3,
        metadata_download_high: 1,
        metadata_discovery: 2,
        metadata_download: 1,
        metadata_enrichment: 1,
        metadata_search: 1,
        token_refresh: 2,
        default: 5
      ]

Problem

Roughly once a day (seemingly unprovoked — no active retrieval jobs running), all Oban queue producers crash simultaneously and Oban becomes non-functional until the instance restarts.

The failure always follows the same cascade:

Step 1 - SSL connections drop silently:

    [error] Postgrex.Protocol (#PID<0.3205.0>) disconnected:
    ** (DBConnection.ConnectionError) ssl recv (idle): closed

    [error] Postgrex.Protocol (#PID<0.3206.0>) failed to connect:
    ** (DBConnection.ConnectionError) ssl send: closed

Step 2 - Postgrex reconnection attempts time out:

    [error] Postgrex.Protocol (#PID<0.3209.0>) timed out because it was handshaking for longer than 15000ms

Step 3 - Every queue producer terminates:

    [error] GenServer {Oban.Registry, {Oban, {:producer, "metadata_discovery_high"}}} terminating
    ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 700ms.

    [error] GenServer {Oban.Registry, {Oban, {:producer, "metadata_download"}}} terminating
    ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 5201ms.

…same for all 8 queues

Step 4 - Peer leadership check times out:

    [warning] Oban.Peer.leader?/2 check failed due to {:timeout, {GenServer, :call, [#PID<0.3276.0>, :leader?, 5000]}}

Questions

  1. Is poll_interval the right lever here? With Oban.Notifiers.PG handling real-time wakeups, is there any meaningful downside to a 30-second poll interval beyond a max 30-second delay on missed notifications?
  2. What is the recommended minimum ObanRepo pool size for a setup with Oban.Peers.Postgres + Oban.Notifiers.PG + 8 queues? We’re trying to right-size rather than just throw connections at it.
  3. Is there an Oban-level setting for environments with network-enforced idle timeouts (serverless/VPC) that we’re missing, beyond idle_interval (which only fires every 15s, potentially too slow) and socket-level keepalive: true?

Any guidance appreciated — especially from folks running Oban on GCP Cloud Run or similar ephemeral/serverless infrastructure.

There are numerous improvements to safe querying in Oban v2.21 and v2.22 that should help with frequent queries causing a failure cascade.

There isn’t a poll_interval option anymore; it was renamed to stage_interval long ago. You could raise that interval so that Oban issues fewer periodic queries. However, that would change the granularity of scheduled jobs: they would be staged at most every 30 seconds rather than down to the second.
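A minimal sketch of that stage_interval change, assuming the :myapp app name and the Oban config from the original post. The 30-second value is the one from the question, not a recommendation, and this fragment is meant to be merged into the existing runtime.exs rather than used on its own:

```elixir
# config/runtime.exs (fragment; the real file already has `import Config`)
import Config

config :myapp, Oban,
  # Stage scheduled jobs every 30 seconds instead of every second.
  # Trade-off: a scheduled job may start up to roughly 30 seconds late.
  stage_interval: :timer.seconds(30)
```

Because `config/3` merges keyword lists for the same key, the repo, peer, notifier, and queues options from the original config stay in effect.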

The right pool size depends entirely on throughput for those queues. You could easily run that setup with 10 connections. The issue you’re encountering is about a missing database, not pool exhaustion.
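For illustration only, this is roughly what the 10-connection pool mentioned above would look like, assuming the ObanRepo config from the original post. The other options there (url, prepare, idle_interval, and so on) are left untouched, and the number should be tuned against actual queue throughput:

```elixir
# config/runtime.exs (fragment; the real file already has `import Config`)
import Config

config :myapp, ObanRepo,
  # Raised from 5. The queue limits in the original config sum to 16,
  # but the right pool size depends on throughput, not on that total.
  pool_size: 10
```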

I’m not aware of any Oban-level setting in particular for that. Oban is designed for long-running processes with a consistent database connection. Using it in an ephemeral environment is bound to cause some issues.


Thanks for the correction.
For now, increasing the pool size for ObanRepo has solved the issue.