mattei
Debugging obscure process crashes?
Hi all, big Elixir fan (and newbie!) dropping by to ask something that has been puzzling me for a bit.
I’m building an app that relies on Oban for job processing. However, some jobs are being killed or are crashing with obscure errors from the BEAM.
These are the strange behaviors:
- Oban job gets killed. The Oban job is simple: it only makes an HTTPoison request.
- Requests/responses seem to get stalled. HTTP requests take too long or hang forever, although the actual elapsed time isn’t reflected in the response time metric in the terminal. The request gets stalled early in the plug pipeline, then resumes after a few seconds.
When it happens (all local, not production):
- High request rate – sending tons of requests (Oban job inserts) from Postman to the API endpoint
- Suspected high memory pressure – although I doubt it’s OOM, because I’ve looked at Activity Monitor and it sometimes happens even when memory pressure is in the green.
- Randomly – sometimes I’m only sending a one-off request
Here are some suspicions:
- Too many queries being sent, Postgres stalling.
- Out of memory/low resource behavior, but I thought BEAM handled this better.
- Oban job taking too long, although it’d be a TimeoutError, not a Killed.
- Infinite recursion somewhere, although I feel that’d also be a TimeoutError from Oban.
- HTTPoison bug, the process gets killed.
- Memory leak.
I have no leads other than these error messages and strange behavior.
Where would I start to debug this problem? How would I prove any of these theories? I’m new to the BEAM and it’s very different from a traditional language.
Marked As Solved
benwilson512
No, the whole VM is unloading the code that was there before, and then loading the new compiled code. The best thing to do is just let Oban retry the job after the crash.
Also Liked
lud
Your process receives an :EXIT message from another process that it is linked to.
Do you start a process from your job process with a SomeModule.start_link or spawn_link call? The process with pid <0.13246.0> was killed, and because it was linked to your own process, your process exited with that same reason.
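To see the mechanism in isolation, here is a minimal sketch that uses only the standard library. LinkDemo is a made-up module name, not something from your app. It traps exits so that the linked process's death shows up as a message instead of taking the caller down with it.

```elixir
defmodule LinkDemo do
  # Hypothetical demo module: shows what a linked process's death looks like
  # from the side of the process that started it.
  def run do
    # Trap exits so the exit signal is delivered as an {:EXIT, pid, reason}
    # message instead of terminating this process as well.
    previous = Process.flag(:trap_exit, true)

    # Stand-in for whatever your job starts via start_link/spawn_link.
    child = spawn_link(fn -> Process.sleep(:infinity) end)

    # Simulate the child being brutally killed.
    Process.exit(child, :kill)

    reason =
      receive do
        {:EXIT, ^child, reason} -> reason
      after
        1_000 -> :no_exit_received
      end

    # Restore the caller's original trap_exit setting.
    Process.flag(:trap_exit, previous)
    reason
  end
end

IO.inspect(LinkDemo.run())
```

Without the trap_exit flag the caller would simply exit with the same reason, which would match a job that reports being killed without ever raising an error of its own.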
You could write a test that calls your Job.perform function directly and repeatedly, without starting Oban. That will make it easier to get a proper stacktrace.
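Something along these lines, as a rough sketch only. MyApp.Workers.FetchWorker, the "url" argument, and the localhost URL are all placeholders I made up, and I'm assuming your worker returns :ok on success. Adjust to whatever your worker really takes and returns.

```elixir
defmodule MyApp.Workers.FetchWorkerTest do
  # Hypothetical names throughout: MyApp.Workers.FetchWorker and the "url" arg
  # stand in for your real worker and its arguments.
  use ExUnit.Case, async: false

  alias MyApp.Workers.FetchWorker

  test "perform/1 can be called directly many times" do
    for _ <- 1..200 do
      # Build the job struct by hand; no queue and no Oban supervisor involved.
      job = %Oban.Job{args: %{"url" => "http://localhost:4000/health"}}

      # Assumes the worker returns :ok on success. A raise or an exit from a
      # linked process surfaces here, in the test process.
      assert :ok = FetchWorker.perform(job)
    end
  end
end
```

Because the call runs inside the test process, any failure is reported by ExUnit directly rather than through Oban's retry handling.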
emoragaf
Normally you would wait for a connection to become available, but when connections are not released back into the pool, you silently lose capacity until things blow up.
The worst thing is that since it’s very easy to just stay on the :default pool, totally unrelated parts of your app (that also use the default pool) can cause this issue.
Your process receives an :EXIT message from another process that it is linked to.
This kind of thing could indeed be the root cause of pool starvation, to the point where your jobs wait forever for a connection that is never going to be put back into the pool.
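One way to keep the jobs off the shared :default pool is to give them a pool of their own. A rough sketch, assuming a standard application module; the pool name :oban_http_pool and the sizes are made up for illustration, not tuned values.

```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Dedicated hackney pool for the job's HTTP calls (hypothetical name).
      :hackney_pool.child_spec(:oban_http_pool, timeout: 15_000, max_connections: 50)
      # The rest of your existing children (Repo, Endpoint, Oban) go here.
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end

defmodule MyApp.Fetcher do
  # Hypothetical helper: routes the request through the dedicated pool
  # instead of :default.
  def get(url) do
    HTTPoison.get(url, [], hackney: [pool: :oban_http_pool])
  end
end
```

That way, if something else in the app leaks connections from :default, the jobs are not the ones left waiting.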
benwilson512
Hey @mattei, a couple of notes. First, please always copy and paste text instead of using screenshots. Those screenshots are barely visible on my screen due to resolution / saturation.
Secondly, when you say “hot reloads” are you referring to development code reloading, or are you using OTP hot reloading in production?
benwilson512
Gotcha. Yeah, with development reloads, crashes of background processes are pretty normal: code is getting loaded and unloaded without regard for whether there are live processes using that code. If that’s the only time this is happening, I wouldn’t worry about it.
emoragaf
I’m throwing a dart in the dark here, but I’ve been bitten by HTTPoison/hackney pools before, where if the request crashes it doesn’t release the connection, starving the pool in a short time.
Try setting pool: false and see if that helps, maybe?
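In case it helps, this is roughly what I mean. MyApp.Fetch is just a placeholder module, and the timeout values are arbitrary; the hackney: keyword is how HTTPoison passes options through to hackney.

```elixir
defmodule MyApp.Fetch do
  # Hypothetical helper; `url` is whatever your job currently requests.
  def get_without_pool(url) do
    HTTPoison.get(url, [],
      # Passed through to hackney: open a fresh connection, skip the pool.
      hackney: [pool: false],
      # Explicit timeouts in milliseconds, so a hung request fails
      # instead of waiting indefinitely.
      timeout: 5_000,
      recv_timeout: 10_000
    )
  end
end
```

If the stalls go away with pooling disabled, that points pretty strongly at pool starvation rather than Postgres or memory.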










