Oban logs extensively when the connection to Postgres is lost

Recently, because of a hosting provider’s hardware issue, the production DB was shut down and then the app was moved to a new DB server. The log filled with lines like the following:

State: %Oban.Peers.Postgres.State{conf: %Oban.Config{dispatch_cooldown: 5, engine: Oban.Engines.Basic, get_dynamic_repo: nil, log: false, name: Oban, node: "codex", notifier: Oban.Notifiers.Postgres, peer: Oban.Peers.Postgres, plugins: [{Oban.Plugins.Reindexer, []}, {Oban.Plugins.Pruner, [max_age: 300]}], prefix: "public", queues: [default: [limit: 10], imports: [limit: 10], mailer: [limit: 10]], repo: SupplierLink.Repo, shutdown_grace_period: 15000, stage_interval: 1000, testing: :disabled}, name: {:via, Registry, {Oban.Registry, {Oban, Oban.Peer}}}, timer: nil, interval: 30000, leader?: false, leader_boost: 2}
[error] GenServer {Oban.Registry, {Oban, Oban.Stager}} terminating
** (stop) exited in: GenServer.call(#PID<0.824.0>, :leader?, 5000)
    ** (EXIT) an exception was raised:
        ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 1997ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:

  1. Ensuring your database is available and that you can connect to it
  2. Tracking down slow queries and making sure they are running fast enough
  3. Increasing the pool_size (although this increases resource consumption)
  4. Allowing requests to wait longer by increasing :queue_target and :queue_interval

See DBConnection.start_link/2 for more information

            (db_connection 2.5.0) lib/db_connection.ex:972: DBConnection.transaction/3
            (oban 2.16.3) lib/oban/peers/postgres.ex:94: anonymous fn/2 in Oban.Peers.Postgres.handle_info/2
            (telemetry 1.2.1) /Users/cjw/dev/supplier_link/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3
            (oban 2.16.3) lib/oban/peers/postgres.ex:92: Oban.Peers.Postgres.handle_info/2
            (stdlib 5.0.2) gen_server.erl:1067: :gen_server.try_handle_continue/3
            (stdlib 5.0.2) gen_server.erl:977: :gen_server.loop/7
            (stdlib 5.0.2) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
    (elixir 1.15.7) lib/gen_server.ex:1074: GenServer.call/3
    (oban 2.16.3) lib/oban/peer.ex:99: Oban.Peer.leader?/2
    (oban 2.16.3) lib/oban/stager.ex:101: Oban.Stager.check_leadership_and_stage/1
    (oban 2.16.3) lib/oban/stager.ex:75: anonymous fn/2 in Oban.Stager.handle_info/2
    (telemetry 1.2.1) /Users/cjw/dev/supplier_link/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3
    (oban 2.16.3) lib/oban/stager.ex:74: Oban.Stager.handle_info/2
    (stdlib 5.0.2) gen_server.erl:1077: :gen_server.try_handle_info/3
    (stdlib 5.0.2) gen_server.erl:1165: :gen_server.handle_msg/6
    (stdlib 5.0.2) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
Last message: :stage

And these log messages repeat every few seconds, seemingly without limit. Is there a way to configure Phoenix/Oban/Postgrex to apply some sort of backoff strategy when connecting to the database? And would it be possible to detect that the connection has returned and reconnect gracefully? Thanks!
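For context, the closest knobs I've found so far are the DBConnection options that Ecto passes straight through to the pool. The pool itself already reconnects with backoff (randomized exponential by default), but queries issued while it's down still fail like the ones above. A rough sketch of what I mean — the values are placeholders, not recommendations:

# config/runtime.exs — illustrative values only
config :supplier_link, SupplierLink.Repo,
  pool_size: 10,
  # Checkout queueing: the "dropped from queue after 1997ms" above
  # is governed by these two (defaults: 50ms / 1000ms).
  queue_target: 500,
  queue_interval: 2_000,
  # Reconnect backoff for the connection processes themselves.
  backoff_type: :rand_exp,
  backoff_min: 1_000,
  backoff_max: 30_000

But none of that stops the Oban peer/stager processes from logging a crash on every attempt.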


Without a database connection, those messages will continue as the supervised processes crash and restart. There's no way to configure Oban with a backoff or circuit-breaker policy to avoid it, but that is something we'd like to consider.
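In the meantime, if the sheer volume is the problem, one stopgap — not an Oban feature, just a sketch using Erlang's :logger primary filters, with a made-up filter id — is to drop those repeated crash reports while the database is unreachable:

# e.g. in Application.start/2
:logger.add_primary_filter(
  :drop_db_connection_noise,
  {fn %{msg: msg}, _extra ->
     # Crude string match on the event: anything mentioning
     # DBConnection.ConnectionError is dropped, everything else
     # is passed on to the next filter/handler.
     if msg |> inspect() |> String.contains?("DBConnection.ConnectionError"),
       do: :stop,
       else: :ignore
   end, nil}
)

Bear in mind this hides real connection errors too, so you'd want to pair it with external database monitoring rather than relying on the logs.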
