Hello, dumbfounded person here trying to get some insight into Phoenix.
What could cause an endpoint to hang for 13+ seconds before the controller functions are run?
Context is this happens under heavy load.
It's on Kubernetes, distributed across six replica pods, with plenty of CPU/RAM, but when the hanging happens we see a lot of memory spikes.
We also noticed possible runaway creation of children by a DynamicSupervisor, so our thought is that high memory, or near-OOM status, makes something in Phoenix wait before the controller functions are run.
On high memory: the service usually runs at around 1 GB, but when it starts to peak we've seen it spike upwards of 6 GB and keep climbing until it goes OOM and dies.
Can anyone shed some light on this? I think there's some backoff logic somewhere, but I can't figure out what triggers it.
We put traces at each step, so we know for sure that none of the controller functions are running while it hangs for those 13+ seconds.
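A minimal sketch of the kind of tracing we added (the handler id and log format are illustrative). Phoenix emits these telemetry events out of the box, so the gap between the endpoint starting and the router dispatching shows up directly in the timestamps:

```elixir
:telemetry.attach_many(
  "hang-debugging",
  [
    # fired when the request enters the endpoint
    [:phoenix, :endpoint, :start],
    # fired when the router is about to dispatch to a controller
    [:phoenix, :router_dispatch, :start]
  ],
  fn event, _measurements, _metadata, _config ->
    IO.puts("#{inspect(event)} at #{System.monotonic_time(:millisecond)}ms")
  end,
  nil
)
```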
Trace that, and everything else you can find, really. Have a root OpenTelemetry span that starts in the router, or in whichever first-line-of-defense Plug is more relevant to you, and emit sub-spans and/or events downstream. Then inspect your APM system and you should find who is waiting, and where.
BTW, if memory serves, Ecto et al. are integrated with OpenTelemetry, so check that all those spans are actually being emitted and shown in your APM UI.
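Roughly the wiring I mean, assuming the `opentelemetry_phoenix` and `opentelemetry_ecto` packages are in your deps (app/module names are illustrative, and the exact setup options vary a bit between package versions):

```elixir
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    # Root span per request, picked up at the endpoint/router level.
    OpentelemetryPhoenix.setup()

    # Child spans for every Ecto query; the argument is the repo's
    # telemetry event prefix.
    OpentelemetryEcto.setup([:my_app, :repo])

    children = [MyApp.Repo, MyAppWeb.Endpoint]
    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```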
No, more like increase queue_target and queue_interval as a start. Under higher load, whoever needs a DB connection may simply have to wait longer for a checkout. These two options have helped me, though only in very limited cases.
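For reference, something like this (values are illustrative, not recommendations; these are DBConnection options that Ecto passes straight through):

```elixir
# config/runtime.exs
import Config

config :my_app, MyApp.Repo,
  pool_size: 10,
  # target checkout wait time (ms) before the pool is considered slow
  queue_target: 200,
  # if checkouts exceed queue_target for this whole window (ms),
  # the pool starts shedding load with DBConnection errors
  queue_interval: 2_000
```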
Inspecting the VM's allocators during the peak revealed a lot of eheap_alloc usage, to the tune of 40 GB and increasing fast. On re-runs I was able to get it near 100 GB (on an M1).
binary_alloc stayed constant at ~125 MB
Would this shed some light on what’s happening?
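For context, one way to get this kind of per-allocator reading from a live node is the `recon` library (the numbers above may have come from observer or elsewhere):

```elixir
# Breakdown of allocated memory by allocator type, in bytes,
# e.g. [binary_alloc: ..., eheap_alloc: ..., ...]
:recon_alloc.memory(:allocated_types)
```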
Even during the peaks, the highest-memory pid holds only 253 MB at most, but that eheap_alloc… my gosh, it was growing incessantly.
From what I've researched, this seems to be caused by long-running processes whose garbage collection rarely gets triggered, but I know very little about this, or about how to break up "monolithic" processes.
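Here's a quick way to hunt for the heavy heaps, pure stdlib, safe to run from a remote IEx shell while the node is spiking (this is a diagnostic sketch, not a fix):

```elixir
top =
  Process.list()
  |> Enum.map(fn pid ->
    {pid, Process.info(pid, [:memory, :registered_name, :message_queue_len])}
  end)
  # Process.info/2 returns nil for processes that died meanwhile
  |> Enum.reject(fn {_pid, info} -> is_nil(info) end)
  |> Enum.sort_by(fn {_pid, info} -> info[:memory] end, :desc)
  |> Enum.take(10)

IO.inspect(top, label: "top 10 pids by memory")

# Forcing a full sweep on the biggest one tells you whether its heap
# was mostly garbage that GC simply hadn't gotten around to yet:
{suspect_pid, _info} = hd(top)
:erlang.garbage_collect(suspect_pid)
```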
Thanks for jumping in to help out. That's what's weird: there are only 1 or 2 at most.
This is even during the memory spikes.
I read that eheap is memory pre-allocated for process heaps, so a process doesn't have to keep asking the VM for more. I wonder if that eheap memory doesn't get counted when doing Process.info(pid, :memory)?
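A sketch to test exactly that, comparing the sum of per-process accounting against the VM-level totals:

```elixir
per_process_sum =
  Process.list()
  |> Enum.map(fn pid ->
    case Process.info(pid, :memory) do
      {:memory, bytes} -> bytes
      nil -> 0
    end
  end)
  |> Enum.sum()

IO.inspect(per_process_sum, label: "sum of Process.info :memory")
IO.inspect(:erlang.memory(:processes), label: "erlang.memory(:processes)")
IO.inspect(:erlang.memory(:total), label: "erlang.memory(:total)")
```

If the per-process sum tracks `:erlang.memory(:processes)` but both lag far behind `:total`, the growth is happening outside process heaps.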
BTW, it might be worth raising your problem on the Erlang forums. Here's one topic there that kinda-sorta sounds similar, if for no other reason than eheap being mentioned.
I have to tell you that we periodically get forum posts like yours, and I feel really bad about them, because they very rarely get resolved, and when they are, the original posters rarely follow up.