Environment
- Elixir 1.18.3-otp-27 / OTP 27.3.4
- Oban 2.20.2
- Phoenix 1.7.x
- db_connection 2.8.1 / Postgrex 0.21.1
- Infrastructure: Google Cloud Run (serverless, min 1 instance, max 10) behind a private VPC (vpc-egress=private-ranges-only)
Setup
We run Oban on a dedicated ObanRepo (separate from our main Repo) with the following config:
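In runtime.exs:

```elixir
# database_url is defined earlier in the file (not shown)
config :myapp, ObanRepo,
  url: database_url,
  pool_size: 5,
  prepare: :unnamed,
  # ping connections that have been idle for 15 s
  idle_interval: 15_000,
  connect_timeout: 10_000,
  socket_options: [keepalive: true]

config :myapp, Oban,
  repo: ObanRepo,
  # leader election through the database (oban_peers table)
  peer: Oban.Peers.Postgres,
  # notifications through Erlang :pg, no database connection needed
  notifier: Oban.Notifiers.PG,
  queues: [
    metadata_discovery_high: 3,
    metadata_download_high: 1,
    metadata_discovery: 2,
    metadata_download: 1,
    metadata_enrichment: 1,
    metadata_search: 1,
    token_refresh: 2,
    default: 5
  ]
```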
Problem
Roughly once a day, seemingly unprovoked (no retrieval jobs are running at the time), all Oban queue producers crash simultaneously and Oban stays non-functional until the instance restarts.
The failure always follows the same cascade:
Step 1: SSL connections drop silently

```
[error] Postgrex.Protocol (#PID<0.3205.0>) disconnected:
** (DBConnection.ConnectionError) ssl recv (idle): closed
[error] Postgrex.Protocol (#PID<0.3206.0>) failed to connect:
** (DBConnection.ConnectionError) ssl send: closed
```
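For reference, our understanding is that `keepalive: true` only enables `SO_KEEPALIVE`; when the first probe goes out is left to the kernel, and on Linux the default is after 7200 s of idleness (`net.ipv4.tcp_keepalive_time`), which is probably well past whatever idle timeout the VPC path enforces. One variant we're considering shortens that per socket using Erlang's `{:raw, Protocol, OptionNum, ValueBin}` inet option. Assumptions, all unverified: the numeric constants are the Linux values (`IPPROTO_TCP = 6`, `TCP_KEEPIDLE = 4`, `TCP_KEEPINTVL = 5`, `TCP_KEEPCNT = 6`), Postgrex applies `socket_options` to the TCP socket before the TLS upgrade, Cloud Run honors these options, and the 60/10/3 timings are arbitrary placeholders.

```elixir
# runtime.exs (sketch): tighter TCP keepalive for ObanRepo connections.
# Assumes a Linux kernel; the option numbers below are Linux-specific.
import Config

ipproto_tcp = 6
tcp_keepidle = 4
tcp_keepintvl = 5
tcp_keepcnt = 6

# Placed after the existing ObanRepo config. Config deep-merges keyword
# lists, but this socket_options value is not a keyword list, so it
# replaces the old [keepalive: true] wholesale.
config :myapp, ObanRepo,
  socket_options: [
    # Turn on SO_KEEPALIVE, as in our current config.
    {:keepalive, true},
    # First probe after 60 s idle (hypothetical value).
    {:raw, ipproto_tcp, tcp_keepidle, <<60::native-integer-size(32)>>},
    # Then one probe every 10 s (hypothetical value).
    {:raw, ipproto_tcp, tcp_keepintvl, <<10::native-integer-size(32)>>},
    # Drop the connection after 3 unanswered probes (hypothetical value).
    {:raw, ipproto_tcp, tcp_keepcnt, <<3::native-integer-size(32)>>}
  ]
```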
Step 2: Postgrex reconnection attempts time out

```
[error] Postgrex.Protocol (#PID<0.3209.0>) timed out because it was handshaking for longer than 15000ms
```
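One detail we noticed: the 15000ms here does not match our `connect_timeout: 10_000`. As we read Postgrex, `:connect_timeout` bounds only the TCP connect, while the TLS plus Postgres startup exchange is bounded by a separate `:handshake_timeout` that appears to fall back to Postgrex's 15 s `:timeout`. Reconnect pacing after a failure is controlled by DBConnection's `:backoff_*` options. This sketch only shows where those knobs live; the values are untested placeholders, not recommendations.

```elixir
# runtime.exs (sketch): connection-establishment knobs on ObanRepo.
# Values are untested placeholders; placed after the existing ObanRepo config.
import Config

config :myapp, ObanRepo,
  # TCP connect only (same as our current setting).
  connect_timeout: 10_000,
  # TLS handshake plus Postgres startup/auth; the "handshaking for longer
  # than 15000ms" message appears to come from this limit.
  handshake_timeout: 10_000,
  # DBConnection reconnect backoff: randomized exponential between
  # backoff_min and backoff_max milliseconds.
  backoff_type: :rand_exp,
  backoff_min: 500,
  backoff_max: 10_000
```

The idea would be to fail a stuck handshake sooner so the backoff retries sooner; we haven't measured whether that helps.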
Step 3: Every queue producer terminates

```
[error] GenServer {Oban.Registry, {Oban, {:producer, "metadata_discovery_high"}}} terminating
** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 700ms.
[error] GenServer {Oban.Registry, {Oban, {:producer, "metadata_download"}}} terminating
** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 5201ms.
```

…and the same for all 8 queues.
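Some arithmetic for context: the queue limits in our config add up to 3 + 1 + 2 + 1 + 1 + 1 + 2 + 5 = 16 concurrent jobs, while ObanRepo has `pool_size: 5` shared by the 8 producers, the Postgres peer, and any plugins. The "dropped from queue" wording comes from DBConnection's checkout queue, which is tuned with `:queue_target` and `:queue_interval` (defaults 50 ms and 1_000 ms). A sketch of a larger, more patient pool; `pool_size: 10` and the queue values are hypothetical numbers for illustration, and we don't know whether this would fix anything or only delay the same failure.

```elixir
# runtime.exs (sketch): a larger, more patient ObanRepo pool.
# All numbers are hypothetical; placed after the existing ObanRepo config.
import Config

config :myapp, ObanRepo,
  # Hypothetical size; producers, the peer and plugins share this pool.
  pool_size: 10,
  # Target wait for a checkout (default 50 ms). If waits stay above it for
  # a whole queue_interval, DBConnection doubles the target and starts
  # dropping requests that exceed it.
  queue_target: 500,
  # Window over which checkout waits are measured (default 1_000 ms).
  queue_interval: 5_000
```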
Step 4: Peer loses leader election

```
[warning] Oban.Peer.leader?/2 check failed due to {:timeout, {GenServer, :call, [#PID<0.3276.0>, :leader?, 5000]}}
```
Questions
- Is poll_interval the right lever here? With Oban.Notifiers.PG handling real-time wakeups, is there any meaningful downside to a 30-second poll interval beyond a max 30-second delay on missed notifications?
- What is the recommended minimum ObanRepo pool size for a setup with Oban.Peers.Postgres + Oban.Notifiers.PG + 8 queues? We’re trying to right-size rather than just throw connections at it.
- Is there an Oban-level setting for environments with network-enforced idle timeouts (serverless/VPC) that we’re missing, beyond idle_interval (which at our setting only fires every 15 s, potentially too slow) and socket-level keepalive: true?
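For that last question, the simplest variant we'd try first: `idle_interval` is a DBConnection pool option rather than an Oban one, and DBConnection's default is 1_000 ms, so our 15_000 pings idle connections less often than stock. This sketch just returns to the default; whether a 1 s ping is frequent enough depends on the idle timeout on the VPC path, which we don't know.

```elixir
# runtime.exs (sketch): DBConnection's default idle ping interval.
# Placed after the existing ObanRepo config so it overrides idle_interval: 15_000.
import Config

config :myapp, ObanRepo,
  # Ping connections that have been idle for 1 s (DBConnection's default).
  idle_interval: 1_000
```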
Any guidance is appreciated, especially from folks running Oban on GCP Cloud Run or similar ephemeral/serverless infrastructure.