I’ve recently had a production issue and I’m puzzled about what happened.
Was going to submit a Github issue but not sure to which repo yet.
I received an alert that basically said “hey, your Oban jobs are not being processed!”.
I went to look at the logs and found this
05:39:33.249 [error] :gen_statem {Oban.Registry, {Oban, Oban.Notifier}} terminating
** (FunctionClauseError) no function clause matching in Postgrex.Protocol.msg_decode/1
(postgrex 0.19.1) lib/postgrex/protocol.ex:3320: Postgrex.Protocol.msg_decode(nil)
(postgrex 0.19.1) lib/postgrex/protocol.ex:3291: Postgrex.Protocol.msg_recv/3
(postgrex 0.19.1) lib/postgrex/protocol.ex:1214: Postgrex.Protocol.recv_simple/6
(postgrex 0.19.1) lib/postgrex/simple_connection.ex:435: Postgrex.SimpleConnection.handle/5
(stdlib 5.2.3.1) gen_statem.erl:1397: :gen_statem.loop_state_callback/11
(stdlib 5.2.3.1) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
Queue: [{{:call, {#PID<0.1231406.0>, [:alias | #Reference<0.0.19208019.3129316008.509673475.233699>]}}, {{:listen, #PID<0.1231406.0>, [:oban_job_complete]}, #PID<0.1231406.0>}}]
Postponed: []
One can point out that there is some issue related to Postgrex
here since msg_decode(nil)
is not expected.
I’m curious why that happened, but I’m kinda fine with that. What puzzled me is that the Notifier wasn’t restarted and it was stuck in a broken state with no jobs being processed further.
My quick fix was simply restarting the entire app (it was a production app so didn’t have time to poke around while customers are struggling), but now I’m scared that it might happen again and I won’t be there to handle it.
I can see that roughly the chain of calls is this: Oban.init -> Oban.Notifiers.Postgres.start_link -> Postgrex.SimpleConnection.start_link -> :gen_statem.start_link
Judging from the process info, Oban is trapping exists for the Notifier but I failed to find where.
iex(app@sunrise-5242144a70d7.weasel-interval.ts.net)1> Oban.Registry.whereis(Oban, Oban.Notifier) |> Process.info
[
current_function: {:gen_statem, :loop_receive, 3},
initial_call: {:proc_lib, :init_p, 5},
status: :waiting,
message_queue_len: 0,
links: [#PID<0.3694.0>, #Port<0.66>, #PID<0.3838.0>, #PID<0.3531.0>],
dictionary: [
{Oban.Notifiers.Postgres,
[
types: Postgrex.DefaultTypes,
sync_connect: false,
telemetry_prefix: [:app, :repo],
otp_app: :app,
timeout: 15000,
ssl: false,
pool_size: 60,
socket_options: [],
prefix: "sunrise",
after_connect: {Postgrex, :query!,
["SET search_path TO sunrise, public", []]},
hostname: "...",
scheme: "...",
username: "...",
password: "...",
database: "...",
port: 10999
]},
{:"$initial_call", {Postgrex.SimpleConnection, :init, 1}},
{:"$ancestors", [#PID<0.3694.0>, App.Supervisor, #PID<0.3591.0>]}
],
trap_exit: true,
error_handler: :error_handler,
priority: :normal,
group_leader: #PID<0.3590.0>,
total_heap_size: 8370,
heap_size: 4185,
stack_size: 10,
reductions: 844600,
garbage_collection: [
max_heap_size: %{
error_logger: true,
include_shared_binaries: false,
kill: true,
size: 0
},
min_bin_vheap_size: 46422,
min_heap_size: 233,
fullsweep_after: 65535,
minor_gcs: 17
],
suspending: []
]
Initially I thought that the issue is related to SimpleConnection
not handling gen_statem termination properly, but now I feel like it’s somewhere in Oban since SimpleConnection itself doesn’t seem to trap anything.
Unfortunately, I couldn’t find a way to manually reproduce this issue. Also, killing the Notifier process explicitly (Process.exit(pid, :kill)
) does lead to an immediate restart.
I would appreciate any help!
Again, a gen_statem termination is okay with me, but I want to understand how to make sure the Notifier would be restarted.