Debugging obscure process crashes?

Hi all, big Elixir fan (and newbie!) dropping by to ask something that has been puzzling me for a bit.

I’m building an app that relies on Oban for job processing. However, some jobs are getting killed or crashing with obscure errors from the BEAM.

These are the strange behaviors:

  1. Oban job gets killed. The job is simple: it only does an HTTPoison request.

  2. Requests/responses seem to stall. HTTP requests take too long (or forever), although the actual elapsed time isn’t reflected in the response-time metric in the terminal.

The request stalls early in the plug pipeline, then resumes after a few seconds.

When it happens (all local, not production):

  1. High request rate – sending tons of requests (Oban job inserts) from Postman to the API endpoint
  2. Suspected high memory pressure – although I doubt it’s OOM, because I’ve watched Activity Monitor and it sometimes happens even when memory is in the green
  3. Randomly – sometimes I’m only sending a one-off request

Here are some suspicions:

  1. Too many queries being sent, Postgres stalling.
  2. Out of memory/low resource behavior, but I thought BEAM handled this better.
  3. Oban job taking too long, although it’d be a TimeoutError, not a Killed.
  4. Infinite recursion somewhere, although I feel that’d also be a TimeoutError from Oban.
  5. An HTTPoison bug that gets the process killed.
  6. Memory leak.

I have no leads other than these error messages and strange behavior.

Where would I start to debug this problem? How would I prove any of these theories? I’m new to the BEAM and it’s very different from a traditional language.

1 Like

Your process receives an :EXIT message from another process that is linked to it.

Do you start a process from your job process with a SomeModule.start_link or spawn_link call? The process with pid <0.13246.0> was killed and as it was linked to your own process, your process exited with that same reason.
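For illustration, here is a minimal iex sketch of that propagation (purely illustrative, not your actual job code):

# Trap exits so we can observe the :EXIT message instead of dying ourselves.
Process.flag(:trap_exit, true)

pid = spawn_link(fn -> Process.sleep(:infinity) end)
Process.exit(pid, :kill)

receive do
  {:EXIT, ^pid, reason} -> IO.inspect(reason) # => :killed
end

# Without trap_exit, the linked parent would itself exit with reason :killed,
# which is exactly what Oban reports for the job process.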

You may create a test and directly call your Job.perform function repeatedly, without starting Oban. That will make it easier to get a proper stacktrace.

3 Likes

I’m throwing a dart in the dark here, but I’ve been bitten by HTTPoison/hackney pools before: if the request crashes it doesn’t release the connection, starving the pool in short order.

Try setting pool: false and see if that helps maybe?
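For reference, a minimal sketch of what that can look like with HTTPoison (url, body, and headers are placeholders for whatever the job already sends):

# Pass the hackney option through HTTPoison to bypass connection pooling.
HTTPoison.post(url, body, headers, hackney: [pool: false])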

2 Likes

Hey, thanks for the idea. I never thought of this – this is much easier to test in isolation.

I’ve performed a simple load test, and everything I can throw at it seems to work. The job executes well. I’ll continue testing the worker in isolation and report back – I’m hoping to get a better stack trace than what Oban provides.

Enum.each(1..100, fn _ -> Parcel.Jobs.DeliverWebhook.perform(job) end)
2 Likes

I figured it could be caused by HTTPoison pools – will absolutely try this, as I have no clue where the process is getting sent a :killed.

Just found that pools limit the number of concurrent connections you can have open, so I’ll disable them. Could this be the problem: too many Oban jobs executing in parallel and exhausting the pool?
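If that turns out to be the case, another angle (a sketch only; the app/repo names are assumptions based on the Parcel.Jobs.* workers above) is to cap queue concurrency so in-flight requests can never outnumber the pool:

# config/config.exs: limit the default queue to 10 concurrent jobs.
config :parcel, Oban,
  repo: Parcel.Repo,
  queues: [default: 10]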

2 Likes

Normally you would wait for a connection to become available, but when connections aren’t released back into the pool you silently lose capacity until things blow up.

The worst part is that since it’s very easy to just stick with the :default pool, totally unrelated parts of your app (that also use the default pool) can end up causing this issue.

“Your process receives an :EXIT message from another process that is linked to it.”

This kind of thing could indeed be the root cause of pool starvation, to the point where your jobs wait forever for a connection that is never going to be put back into the pool.
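One way to limit the blast radius (a sketch under the assumption you keep pooling; the pool name and sizes here are made up) is to give webhook delivery its own hackney pool instead of sharing :default:

# Start a dedicated pool once at application startup.
:hackney_pool.start_pool(:webhook_pool, timeout: 15_000, max_connections: 50)

# Then route the job's requests through it.
HTTPoison.post(url, body, headers, hackney: [pool: :webhook_pool])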

3 Likes

I’ve disabled the connection pool and written a small Mix task to load test the behavior. I’m seeing no :killed yet, but I’ll continue testing over the following days under many conditions (including starving the system of memory).

Reporting back in a bit.

defmodule Mix.Tasks.LoadTest do
  # Module name is illustrative; the original snippet omitted the surrounding module.
  use Mix.Task

  require Logger

  @impl Mix.Task
  def run(_) do
    # Start HTTPoison (and hackney) since the task runs outside the app's supervision tree.
    HTTPoison.start()

    {:ok, pid} = Task.Supervisor.start_link()

    # Fire 101 concurrent POSTs at the local endpoint, then wait for them all
    # (Task.await_many/1 uses a 5-second timeout by default).
    Enum.map(0..100, fn _ ->
      Task.Supervisor.async(pid, fn ->
        req =
          HTTPoison.post!(
            "http://localhost:4000/v1/events/",
            Jason.encode!(%{
              subscriber_id: "sub_0000",
              name: "test.event",
              payload: %{
                hello: "world"
              }
            }),
            [{"Authorization", "Bearer key_0000"}, {"Content-Type", "application/json"}]
          )

        Logger.debug("Sent request, received #{req.status_code}")
      end)
    end)
    |> Task.await_many()
  end
end
1 Like

I’ll continue testing over the next couple of days. I got some errors, but only because the system ran out of file descriptors (I bombed it with requests). I also saw a Killed error somewhere, so I suspect the problem persists.

I will continue testing in different system conditions, such as high memory usage and long uptime (lots of hot reloads from Phoenix, for example) and report back. I’m still wondering what caused the problem, or if I can prove the connection pooling caused it definitively.

I think I’ve seen the problem more with long-running instances of mix phx.server, which involve closing the laptop a lot and leaving the process running for days on end.

1 Like
config :logger,
  handle_otp_reports: true,
  handle_sasl_reports: true

You’ll probably see the other error with those on.

2 Likes

This is not concurrent. Your other snippet with Task.Supervisor.async is the way to go. Now, that can actually finish pretty fast, so what I would do is simulate load by adding a one-second sleep in the server that responds at http://localhost:4000/v1/events/.
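Something like this in the controller action behind that route (the module and action names are illustrative, not from your code):

# Hypothetical controller action behind POST /v1/events/.
def create(conn, _params) do
  # Simulate a slow downstream dependency while load testing.
  Process.sleep(1_000)

  # ... the existing event insert / Oban.insert logic ...
  json(conn, %{ok: true})
end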

You said that you do not see the problem in tests. Does the problem still exist when using Oban?

2 Likes

The problem is primarily when using Oban, yes.

I will add a Process.sleep in the controller and the Oban job to simulate load :+1:

EDIT: Okay, so what ends up happening under the load script is that the server locks up and requests don’t go through. The error persisted.


I’m so confused. I think I found how to reproduce it. This only happens when the laptop goes to sleep and comes back. I restarted the process and it’s processing jobs and handling requests like normal, without Phoenix locking up.

Odd state bug? I’m worried this might end up happening in production with long-running instances.

Well, this is beyond my knowledge, but could it be that your HTTP library maintains long-lived connections it believes are still open but that were actually closed when the laptop went to sleep, so when it tries to use them again it fails because it couldn’t handle the close event in a “normal” way?

Now, thanks to the way OTP works, I guess your app still works, doesn’t it? Oban will retry the jobs, the HTTP library will recover, and in the end all your jobs should be processed. If that is the case then I would consider the app to be fine: unexpected errors are handled by restarting processes and returning to a working state.

Hopefully production servers do not go to sleep :smiley:

1 Like

I suspect I found the problem. I strongly believe it’s related to hot reloads with Phoenix.

I came back from sleep (long-running session) and triggered two consecutive hot reloads in the dev server. Jobs started getting killed, and thanks to @cmo’s logging snippet I started seeing this error pop up:

[error] Child :undefined of Supervisor {Oban.Registry, {Oban, {:foreman, "default"}}} terminated
** (exit) killed
Pid: #PID<0.19581.7>
Start Call: Task.Supervised.start_link/?
Restart: :temporary
Shutdown: 5000
Type: :worker
[error] Child :undefined of Supervisor {Oban.Registry, {Oban, {:foreman, "default"}}} terminated
** (exit) killed
Pid: #PID<0.19582.7>
Start Call: Task.Supervised.start_link/?
Restart: :temporary
Shutdown: 5000
Type: :worker
(the same report repeats for another fourteen task PIDs)
[info] {"args":{"endpoint_id":2,"event_id":25439},"attempt":5,"duration":243090,"error":"** (exit) killed","event":"job:exception","id":127098,"max_attempts":10,"meta":{},"queue":"default","queue_time":2513829,"source":"oban","state":"failure","tags":[],"worker":"Parcel.Jobs.Deliver"}

I’m still working to nail down a reliable way to reproduce it, but this is a good step – there’s a trace.

I’ll file an issue in @sorentwo’s Oban repo once I can reproduce it consistently.

2 Likes

Hey @mattei, a couple of notes: please always copy and paste text instead of posting screenshots. Those screenshots are barely readable on my screen due to resolution/saturation.

Secondly, when you say “hot reloads” are you referring to development code reloading, or are you using OTP hot reloading in production?

3 Likes

Fixed up! I meant development hot reloads, yes.

Gotcha – yeah, with development reloads, crashes of background processes are pretty normal: code is getting loaded and unloaded without regard for whether there are live processes using that code. If that’s the only time this is happening, I wouldn’t worry about it.

3 Likes

In that case, is there any way to exclude Oban from hot reloads?

No, the whole VM is unloading the code that was there before, and then loading the new compiled code. The best thing to do is just let Oban retry the job after the crash.
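For what it’s worth, the retry behaviour doesn’t need anything special; the worker just declares how many attempts it gets. A sketch only – the module name mirrors the snippet above and max_attempts mirrors the 10 shown in your log:

defmodule Parcel.Jobs.DeliverWebhook do
  use Oban.Worker, queue: :default, max_attempts: 10

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"endpoint_id" => _endpoint_id, "event_id" => _event_id}}) do
    # The existing HTTPoison delivery goes here. A crash or raise marks the
    # attempt as failed, and Oban retries the job with backoff until
    # max_attempts is exhausted.
    :ok
  end
end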

2 Likes

Okay, thanks! The next big question is debugging why requests start stalling under load, but I figure that might be related to Postgres.
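(If it is Postgres, one cheap way to check, sketched here with assumed defaults and a repo named Parcel.Repo, is to watch how long requests wait for a connection from the Ecto pool via the repo’s telemetry events:)

# Attach once (e.g. in application start, dev only) and log slow pool checkouts.
:telemetry.attach(
  "log-slow-db-queue",
  [:parcel, :repo, :query],
  fn _event, measurements, metadata, _config ->
    queue_ms =
      System.convert_time_unit(measurements[:queue_time] || 0, :native, :millisecond)

    if queue_ms > 100 do
      IO.puts("Waited #{queue_ms}ms for a DB connection: #{metadata.query}")
    end
  end,
  nil
)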

They are called live reloads.

1 Like