Oban supervisor not restarting Oban.Notifiers.Postgres?

I’ve recently had a production issue and I’m puzzled about what happened.

I was going to submit a GitHub issue, but I’m not sure which repo it belongs to yet.


I received an alert that basically said “hey, your Oban jobs are not being processed!”.

I went to look at the logs and found this:

05:39:33.249 [error] :gen_statem {Oban.Registry, {Oban, Oban.Notifier}} terminating
** (FunctionClauseError) no function clause matching in Postgrex.Protocol.msg_decode/1
    (postgrex 0.19.1) lib/postgrex/protocol.ex:3320: Postgrex.Protocol.msg_decode(nil)
    (postgrex 0.19.1) lib/postgrex/protocol.ex:3291: Postgrex.Protocol.msg_recv/3
    (postgrex 0.19.1) lib/postgrex/protocol.ex:1214: Postgrex.Protocol.recv_simple/6
    (postgrex 0.19.1) lib/postgrex/simple_connection.ex:435: Postgrex.SimpleConnection.handle/5
    (stdlib 5.2.3.1) gen_statem.erl:1397: :gen_statem.loop_state_callback/11
    (stdlib 5.2.3.1) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
Queue: [{{:call, {#PID<0.1231406.0>, [:alias | #Reference<0.0.19208019.3129316008.509673475.233699>]}}, {{:listen, #PID<0.1231406.0>, [:oban_job_complete]}, #PID<0.1231406.0>}}]
Postponed: []

Clearly something went wrong on the Postgrex side here, since msg_decode(nil) is not a call that should ever happen.

I’m curious why that happened, but I’m kinda fine with that. What puzzled me is that the Notifier wasn’t restarted and it was stuck in a broken state with no jobs being processed further.

My quick fix was simply restarting the entire app (it was a production app, so I didn’t have time to poke around while customers were struggling), but now I’m scared that it might happen again and I won’t be there to handle it.

I can see that roughly the chain of calls is this: Oban.init -> Oban.Notifiers.Postgres.start_link -> Postgrex.SimpleConnection.start_link -> :gen_statem.start_link

Judging from the process info, Oban is trapping exits for the Notifier, but I failed to find where.

iex(app@sunrise-5242144a70d7.weasel-interval.ts.net)1> Oban.Registry.whereis(Oban, Oban.Notifier) |> Process.info
[
  current_function: {:gen_statem, :loop_receive, 3},
  initial_call: {:proc_lib, :init_p, 5},
  status: :waiting,
  message_queue_len: 0,
  links: [#PID<0.3694.0>, #Port<0.66>, #PID<0.3838.0>, #PID<0.3531.0>],
  dictionary: [
    {Oban.Notifiers.Postgres,
     [
       types: Postgrex.DefaultTypes,
       sync_connect: false,
       telemetry_prefix: [:app, :repo],
       otp_app: :app,
       timeout: 15000,
       ssl: false,
       pool_size: 60,
       socket_options: [],
       prefix: "sunrise",
       after_connect: {Postgrex, :query!,
        ["SET search_path TO sunrise, public", []]},
       hostname: "...",
       scheme: "...",
       username: "...",
       password: "...",
       database: "...",
       port: 10999
     ]},
    {:"$initial_call", {Postgrex.SimpleConnection, :init, 1}},
    {:"$ancestors", [#PID<0.3694.0>, App.Supervisor, #PID<0.3591.0>]}
  ],
  trap_exit: true,
  error_handler: :error_handler,
  priority: :normal,
  group_leader: #PID<0.3590.0>,
  total_heap_size: 8370,
  heap_size: 4185,
  stack_size: 10,
  reductions: 844600,
  garbage_collection: [
    max_heap_size: %{
      error_logger: true,
      include_shared_binaries: false,
      kill: true,
      size: 0
    },
    min_bin_vheap_size: 46422,
    min_heap_size: 233,
    fullsweep_after: 65535,
    minor_gcs: 17
  ],
  suspending: []
]

Initially I thought the issue was related to SimpleConnection not handling gen_statem termination properly, but now I suspect it’s somewhere in Oban, since SimpleConnection itself doesn’t seem to trap anything.

Unfortunately, I couldn’t find a way to manually reproduce this issue. Also, killing the Notifier process explicitly (Process.exit(pid, :kill)) does lead to an immediate restart.
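
For reference, the manual kill that does trigger a restart looked roughly like this (simplified; run from a remote iex shell):

  pid = Oban.Registry.whereis(Oban, Oban.Notifier)
  Process.exit(pid, :kill)

  # a moment later the supervisor has registered a fresh Notifier under the same name
  Oban.Registry.whereis(Oban, Oban.Notifier) != pid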

I would appreciate any help!

Again, a gen_statem termination is okay with me, but I want to understand how to make sure the Notifier would be restarted.


The silent shutdown here seems to have affected a few applications. As you noted, somehow SimpleConnection isn’t handling the termination and reconnecting. Thus far I can’t recreate the scenario, even with invasive methods like shutting down the :gen_tcp socket.

Surely something will work and we’ll figure out the issue between Oban and Postgrex.


Thanks for your reply!

I might have an idea about what’s going on:

  1. SimpleConnection calls Protocol.connect here: postgrex/lib/postgrex/simple_connection.ex at master · elixir-ecto/postgrex · GitHub
  2. Protocol.connect sets the trap_exit flag here (see the sketch after this list for what that flag does): postgrex/lib/postgrex/protocol.ex at cebb02f923a7e3c443aa7501bc89a8b2beda626f · elixir-ecto/postgrex · GitHub
  3. SimpleConnection does not define the terminate callback
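
To illustrate point 2, here’s a minimal standalone sketch (nothing Oban- or Postgrex-specific) of what trap_exit changes: once the flag is set, an exit from a linked process arrives as an ordinary {:EXIT, pid, reason} message instead of terminating the process, so something has to handle it explicitly.

  # Illustration only, not Oban/Postgrex code.
  parent = self()

  spawn(fn ->
    # from now on, exits from linked processes arrive as messages
    Process.flag(:trap_exit, true)
    spawn_link(fn -> exit(:boom) end)

    receive do
      {:EXIT, _pid, reason} -> send(parent, {:trapped, reason})
    end
  end)

  receive do
    {:trapped, reason} -> IO.puts("trapping process survived, saw: #{inspect(reason)}")
  end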

Does it sound plausible? Am I missing something?

Happy to file an issue with Postgrex if this makes sense.


That seems plausible. I’m still working on a recreation and I’ll submit a fix upstream in Postgrex if needed.


I’m unable to recreate a situation where the Notifier (and, subsequently, the other processes that rely on it) doesn’t crash on an exit. Here’s a minimal test recreation:

  test "restarting after a simple connection exit" do
    Sandbox.unboxed_run(Repo, fn ->
      name = start_supervised_oban!(notifier: Postgres)
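      # the listen call below hits the forced decode error, the connection crashes, and the caller exits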

      assert catch_exit(Oban.Notifier.listen(name, :oban_job_complete))
    end)
  end

With a minor modification to Postgrex.Protocol.recv_simple/6 that forces a function clause error after the first query response:

  defp recv_simple(s, status, results, columns, rows, buffer) do
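    # inject the failure: discard any buffered binary so msg_recv/3 ends up calling msg_decode(nil)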
    buffer = if is_binary(buffer), do: nil, else: buffer

    case msg_recv(s, :infinity, buffer) do

The test passes because the processes all exit:

15:01:31.156 [error] :gen_statem {Oban.Registry, {#Reference<0.4006229912.3558866945.105622>, Oban.Notifier}} terminating
** (FunctionClauseError) no function clause matching in Postgrex.Protocol.msg_decode/1
    (postgrex 0.19.1) lib/postgrex/protocol.ex:3322: Postgrex.Protocol.msg_decode(nil)
    (postgrex 0.19.1) lib/postgrex/protocol.ex:3293: Postgrex.Protocol.msg_recv/3
    (postgrex 0.19.1) lib/postgrex/protocol.ex:1216: Postgrex.Protocol.recv_simple/6
    (postgrex 0.19.1) lib/postgrex/simple_connection.ex:435: Postgrex.SimpleConnection.handle/5
    (stdlib 6.0) gen_statem.erl:3115: :gen_statem.loop_state_callback/11
    (stdlib 6.0) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Queue: [{{:call, {#PID<0.334.0>, [:alias | #Reference<0.0.42755.4006229912.3558932487.105329>]}}, {{:listen, #PID<0.334.0>, [:signal]}, #PID<0.334.0>}}]
Postponed: []

15:01:31.160 [error] GenServer {Oban.Registry, {#Reference<0.4006229912.3558866945.105622>, Oban.Midwife}} terminating
** (stop) exited in: :gen_statem.call(#PID<0.331.0>, {{:listen, #PID<0.334.0>, [:signal]}, #PID<0.334.0>}, 5000)
    ** (EXIT) an exception was raised:
        ** (FunctionClauseError) no function clause matching in Postgrex.Protocol.msg_decode/1
            (postgrex 0.19.1) lib/postgrex/protocol.ex:3322: Postgrex.Protocol.msg_decode(nil)
            (postgrex 0.19.1) lib/postgrex/protocol.ex:3293: Postgrex.Protocol.msg_recv/3
            (postgrex 0.19.1) lib/postgrex/protocol.ex:1216: Postgrex.Protocol.recv_simple/6
            (postgrex 0.19.1) lib/postgrex/simple_connection.ex:435: Postgrex.SimpleConnection.handle/5
            (stdlib 6.0) gen_statem.erl:3115: :gen_statem.loop_state_callback/11
            (stdlib 6.0) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
    (stdlib 6.0) gen.erl:270: :gen.do_call/4
    (stdlib 6.0) gen_statem.erl:2633: :gen_statem.call/3
    (postgrex 0.19.1) lib/postgrex/simple_connection.ex:217: Postgrex.SimpleConnection.call/3
    (oban 2.18.2) lib/oban/midwife.ex:67: Oban.Midwife.handle_continue/2
    (stdlib 6.0) gen_server.erl:2163: :gen_server.try_handle_continue/3
    (stdlib 6.0) gen_server.erl:2072: :gen_server.loop/7
    (stdlib 6.0) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Last message: {:continue, :start}

15:01:31.160 [error] GenServer {Oban.Registry, {#Reference<0.4006229912.3558866945.105622>, Oban.Sonar}} terminating
** (stop) exited in: :gen_statem.call(#PID<0.331.0>, {{:listen, #PID<0.337.0>, [:sonar]}, #PID<0.337.0>}, 5000)
    ** (EXIT) an exception was raised:
        ** (FunctionClauseError) no function clause matching in Postgrex.Protocol.msg_decode/1
            (postgrex 0.19.1) lib/postgrex/protocol.ex:3322: Postgrex.Protocol.msg_decode(nil)
            (postgrex 0.19.1) lib/postgrex/protocol.ex:3293: Postgrex.Protocol.msg_recv/3
            (postgrex 0.19.1) lib/postgrex/protocol.ex:1216: Postgrex.Protocol.recv_simple/6
            (postgrex 0.19.1) lib/postgrex/simple_connection.ex:435: Postgrex.SimpleConnection.handle/5
            (stdlib 6.0) gen_statem.erl:3115: :gen_statem.loop_state_callback/11
            (stdlib 6.0) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
    (stdlib 6.0) gen.erl:270: :gen.do_call/4
    (stdlib 6.0) gen_statem.erl:2633: :gen_statem.call/3
    (postgrex 0.19.1) lib/postgrex/simple_connection.ex:217: Postgrex.SimpleConnection.call/3
    (oban 2.18.2) lib/oban/sonar.ex:47: Oban.Sonar.handle_continue/2
    (stdlib 6.0) gen_server.erl:2163: :gen_server.try_handle_continue/3
    (stdlib 6.0) gen_server.erl:2072: :gen_server.loop/7
    (stdlib 6.0) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Last message: {:continue, :start}

15:01:31.165 [error] :gen_statem {Oban.Registry, {#Reference<0.4006229912.3558866945.105622>, Oban.Notifier}} terminating
** (FunctionClauseError) no function clause matching in Postgrex.Protocol.msg_decode/1
    (postgrex 0.19.1) lib/postgrex/protocol.ex:3322: Postgrex.Protocol.msg_decode(nil)
    (postgrex 0.19.1) lib/postgrex/protocol.ex:3293: Postgrex.Protocol.msg_recv/3
    (postgrex 0.19.1) lib/postgrex/protocol.ex:1216: Postgrex.Protocol.recv_simple/6
    (postgrex 0.19.1) lib/postgrex/simple_connection.ex:435: Postgrex.SimpleConnection.handle/5
    (stdlib 6.0) gen_statem.erl:3115: :gen_statem.loop_state_callback/11
    (stdlib 6.0) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Queue: [{{:call, {#PID<0.340.0>, [:alias | #Reference<0.0.43523.4006229912.3558932483.111091>]}}, {{:listen, #PID<0.340.0>, [:signal]}, #PID<0.340.0>}}]
Postponed: []
.
Finished in 0.03 seconds (0.00s async, 0.03s sync)
1 test, 0 failures

Are you using any dynamically started queues? Not Pro’s DynamicQueue plugin, but queues started via Oban.start_queue/2?

No, just a static queue. However, I do adjust its concurrency dynamically via set_concurrency, and I also pause/resume it at runtime, if that makes any difference (race conditions, potentially?).
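
Concretely, those tweaks boil down to calls along these lines (a simplified sketch; I’m assuming this maps onto the standard Oban.scale_queue / Oban.pause_queue / Oban.resume_queue calls, and the limit value here is just an example):

  # change the queue’s concurrency at runtime
  Oban.scale_queue(queue: :gpu_requests, limit: 4)

  # pause the queue, then resume it later
  Oban.pause_queue(queue: :gpu_requests)
  Oban.resume_queue(queue: :gpu_requests)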

Other than that, nothing special.

That could make a difference, but it’s hard to say without the state of the system after that error. What I’m hoping to identify is whether the queues were restarted after the crash.

Which version of Oban are you running? I don’t see any indicators in the stacktrace.

Below is the line from my mix.lock:

  "oban": {:hex, :oban, "2.18.3", "1608c04f8856c108555c379f2f56bc0759149d35fa9d3b825cb8a6769f8ae926", [:mix], [{:ecto_sql, "~> 3.10", [hex: :ecto_sql, repo: "hexpm", optional: false]}, {:ecto_sqlite3, "~> 0.9", [hex: :ecto_sqlite3, repo: "hexpm", optional: true]}, {:jason, "~> 1.1", [hex: :jason, repo: "hexpm", optional: false]}, {:postgrex, "~> 0.16", [hex: :postgrex, repo: "hexpm", optional: true]}, {:telemetry, "~> 0.4 or ~> 1.0", [hex: :telemetry, repo: "hexpm", optional: false]}], "hexpm", "36ca6ca84ef6518f9c2c759ea88efd438a3c81d667ba23b02b062a0aa785475e"},

Also on oban 2.18.3. I don’t know if it’s related, but I’m seeing this in my Gigalixir logs after upgrading to Elixir 1.18.1 (from 1.17.3). I don’t see it locally and haven’t tried to reproduce it.

It seems to get logged a handful of times when the application boots up, and then I don’t see it again.

[warning] Event dropped due to being a duplicate of a previously-captured event.  
[error] :gen_statem {Oban.Registry, {Oban, Oban.Notifier}} terminating 
** (FunctionClauseError) no function clause matching in Postgrex.Protocol.cancel_request/1 
    (postgrex 0.19.3) lib/postgrex/protocol.ex:3563: Postgrex.Protocol.cancel_request(nil) 
    (postgrex 0.19.3) lib/postgrex/protocol.ex:268: Postgrex.Protocol.disconnect/2 
    (postgrex 0.19.3) lib/postgrex/simple_connection.ex:341: Postgrex.SimpleConnection.handle_event/4 
    (stdlib 6.2) gen_statem.erl:3737: :gen_statem.loop_state_callback/11 
    (stdlib 6.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3

I didn’t get this one, but a few months ago I did get this:

Failure while translating Erlang's logger event
** (MatchError) no match of right hand side value: %{label: {:gen_statem, :terminate}, name: {Oban.Registry, {Oban, Oban.Notifier}}, reason: {:error, :function_clause, [{Postgrex.Protocol, :msg_decode, [nil], [file: ~c"lib/postgrex/protocol.ex", line: 3320]}, {Postgrex.Protocol, :msg_recv, 3, [file: ~c"lib/postgrex/protocol.ex", line: 3291]}, {Postgrex.Protocol, :recv_simple, 6, [file: ~c"lib/postgrex/protocol.ex", line: 1214]}, {Postgrex.SimpleConnection, :handle, 5, [file: ~c"lib/postgrex/simple_connection.ex", line: 435]}, {:gen_statem, :loop_state_callback, 11, [file: ~c"gen_statem.erl", line: 1397]}, {:proc_lib, :init_p_do_apply, 3, [file: ~c"proc_lib.erl", line: 241]}]}, log: [], state: %Postgrex.SimpleConnection{idle_interval: 5000, protocol: %Postgrex.Protocol{sock: {:gen_tcp, #Port<0.70>}, connection_id: 1422550, connection_key: -2128768823, peer: {{100, 70, 42, 126}, 10999}, types: {Postgrex.DefaultTypes, #Reference<0.421905508.620363779.60353>}, null: nil, timeout: 15000, ping_timeout: 15000, parameters: #Reference<0.421905508.620232706.59994>, queries: #Reference<0.421905508.620363778.59977>, postgres: :idle, transactions: :naive, buffer: nil, disconnect_on_error_codes: [], scram: %{iterations: 4096, auth_message: "n=,r=hU4Qf4deLswWBuFvTXxMXvkZ,r=hU4Qf4deLswWBuFvTXxMXvkZvNFZzWAsS7GDH5zpQRI9odg8,s=k/zBN95ZGDCrZf7OXgDOhw==,i=4096,c=biws,r=hU4Qf4deLswWBuFvTXxMXvkZvNFZzWAsS7GDH5zpQRI9odg8", salt: <<147, 252, 193, 55, 222, 89, 24, 48, 171, 101, 254, 206, 94, 0, 206, 135>>}, disable_composite_types: false, messages: []}, auto_reconnect: true, reconnect_backoff: 500, state: {Oban.Notifiers.Postgres, %Oban.Notifiers.Postgres{conf: %Oban.Config{dispatch_cooldown: 5, engine: Oban.Engines.Basic, get_dynamic_repo: nil, insert_trigger: true, log: false, name: Oban, node: "app@sunrise-0a090c861036", notifier: {Oban.Notifiers.Postgres, []}, peer: {Oban.Peers.Postgres, []}, plugins: [{Oban.Plugins.Cron, [crontab: [{"* * * * *", App.Billing.UserCardTrial.NotifyUsersAboutExpirationJob}]]}, {Oban.Plugins.Lifeline, [rescue_after: 600000]}, {Oban.Plugins.Pruner, [max_age: 86400]}], prefix: "sunrise", queues: [gpu_requests: [limit: 1, paused: true], gpu_requests_incubator: [limit: 1, paused: true], general: [limit: 1000000]], repo: App.Repo, shutdown_grace_period: 50000, stage_interval: 1000, testing: :disabled}, from: nil, key: nil, channels: %{"sunrise.oban_insert" => [#PID<0.3506.0>, #PID<0.3502.0>, #PID<0.3498.0>], "sunrise.oban_leader" => [#PID<0.3508.0>], "sunrise.oban_signal" => [#PID<0.3495.0>, #PID<0.3506.0>, #PID<0.3502.0>, #PID<0.3498.0>], "sunrise.oban_sonar" => [#PID<0.3509.0>]}, connected?: true, listeners: %{#PID<0.3495.0> => {#Reference<0.421905508.620232705.61033>, MapSet.new(["sunrise.oban_signal"])}, #PID<0.3498.0> => {#Reference<0.421905508.620232706.59997>, MapSet.new(["sunrise.oban_insert", "sunrise.oban_signal"])}, #PID<0.3502.0> => {#Reference<0.421905508.620232705.61029>, MapSet.new(["sunrise.oban_insert", "sunrise.oban_signal"])}, #PID<0.3506.0> => {#Reference<0.421905508.620232705.61031>, MapSet.new(["sunrise.oban_insert", "sunrise.oban_signal"])}, #PID<0.3508.0> => {#Reference<0.421905508.620232705.61035>, MapSet.new(["sunrise.oban_leader"])}, #PID<0.3509.0> => {#Reference<0.421905508.620232706.60072>, MapSet.new(["sunrise.oban_sonar"])}}}}}, queue: [{{:call, {#PID<0.320516.0>, [:alias | #Reference<0.0.7471635.421905508.699203586.235576>]}}, {{:listen, #PID<0.320516.0>, [:oban_job_complete]}, #PID<0.320516.0>}}], modules: [Postgrex.SimpleConnection], client_info: 
{#PID<0.320516.0>, {#PID<0.320516.0>, [{:gen, :do_call, 4, [file: ~c"gen.erl", line: 259]}, {:gen_statem, :call, 3, [file: ~c"gen_statem.erl", line: 923]}, {Postgrex.SimpleConnection, :call, 3, [file: ~c"lib/postgrex/simple_connection.ex", line: 217]}, {App.ObanJobReporter, :wait_for_job, 2, [file: ~c"lib/app/oban_job_reporter.ex", line: 26]}, {App.ObanJobReporter, :wait_for_job_success!, 1, [file: ~c"lib/app/oban_job_reporter.ex", line: 59]}, {App.ImageRequests.Img2Img.Controller, :generate_new, 2, [file: ~c"lib/app/image_requests/img2img/controller.ex", line: 35]}, {App.ImageRequests.Img2Img.Controller, :action, 2, [file: ~c"lib/app/image_requests/img2img/controller.ex", line: 3]}, {App.ImageRequests.Img2Img.Controller, :phoenix_controller_pipeline, 2, [file: ~c"lib/app/image_requests/img2img/controller.ex", line: 3]}]}}, callback_mode: :handle_event_function, postponed: [], timeouts: {1, [timeout: nil]}, state_enter: false}
    (logger 1.17.0) lib/logger/translator.ex:306: Logger.Translator.report_gen_statem_terminate/2
    (logger 1.17.0) lib/logger/utils.ex:47: Logger.Utils.translate/5
    (logger 1.17.0) lib/logger/utils.ex:25: Logger.Utils.translator/2
    (kernel 9.2.4) logger_backend.erl:101: :logger_backend.do_apply_filters/4
    (kernel 9.2.4) logger_backend.erl:86: :logger_backend.apply_filters/4
    (kernel 9.2.4) logger_backend.erl:32: :logger_backend.log_allowed/3
    (stdlib 5.2.3) gen_statem.erl:2591: :gen_statem.terminate/7
    (stdlib 5.2.3) proc_lib.erl:241: :proc_lib.init_p_do_apply/3

which also led to the same broken state, which I only fixed by restarting the entire app.

Then I found that there was a bug in the logger which got fixed in Elixir 1.17.2. I thought the broken state was caused by it, so I upgraded Elixir and moved on.

Note that the root issue the logger choked on was the same (msg_decode called with nil). This time the logger didn’t crash, thanks to that fix, but the end state is the same.