What makes an HTTP API request to Phoenix hang before it reaches the controller function code?

Hello, dumbfounded person here trying to get some insight into Phoenix.

What could cause an endpoint to hang for 13+ seconds before the controller functions are run?

Context is this happens under heavy load.

It’s running on Kubernetes, distributed across six replica pods, with plenty of CPU/RAM. But…
when the hanging happens, we see a lot of memory spikes.

We also noticed possible runaway creation of children by a DynamicSupervisor, so our thought is that high memory, or near-OOM status, makes something in Phoenix wait before the controller functions are run.
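If the runaway children turn out to be real, one thing we may try is capping that supervisor. A rough sketch only (the supervisor name and the limit are placeholders, not something we’ve validated):

children = [
  {DynamicSupervisor,
   name: MyApp.JobSupervisor,
   strategy: :one_for_one,
   # with :max_children set, extra start_child calls return {:error, :max_children}
   # instead of piling up more processes
   max_children: 1_000}
]

Supervisor.start_link(children, strategy: :one_for_one)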

On high memory: the service usually sits at around 1 GB, but when it starts to peak we’ve seen it spike upwards of 6 GB and keep climbing until it hits OOM and dies.

Can anyone shed some light on this? I think there’s some backoff logic involved but I can’t really figure out what triggers it.

We put traces at each step, so we know for sure that none of the controller functions run while it hangs for those 13+ seconds.

Thanks in advance!


You haven’t given many details, so it’s hard to say, but maybe big JSON payloads are being sent to your endpoints?

Thank you so much for replying back! Really appreciate your question.

Well, we do have some endpoint requests with 500+ UUIDs (in string form) in the JSON body, as a list of strings.

We don’t really have a lot of those calls though.

Is there anything else in particular you’d like to know more about?

At any given time during peak hours we get around 2.3K API calls, and the rate is around 12 requests/sec.

Sadly it’s hard to guess just like that. I’d suspect something deserializes JSON into Elixir structs and isn’t properly releasing memory.

I can only shoot in the dark, but I’d recommend getting rid of that and working with the JSON as raw string-keyed maps as a start.

But it’s not even certain that’s the issue. Your first step is to add some metrics / telemetry to your Plugs.

We do use raw string-keyed maps, and that’s what we pass down to the context functions, like the “vanilla” context implementations generated by phx.gen.*.

When you say adding metrics to Plugs, can you elaborate a bit further? Do you mean the steps in the MyApp.Endpoint module, before the request hits the router?

That and everything else you can find, really. Have a root OpenTelemetry span that starts somewhere in the router, or in whatever first-line-of-defense Plug is more relevant to you, and emit sub-spans and/or events downstream. Inspect your APM system and you should find out who is waiting, and where, for so long.
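For example, the bare-bones version of a timing Plug could look roughly like this (module and event names are made up; a real OpenTelemetry span would wrap the same boundary):

defmodule MyAppWeb.RequestTimer do
  @behaviour Plug
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    start = System.monotonic_time()

    register_before_send(conn, fn conn ->
      duration = System.monotonic_time() - start

      # emit a telemetry event your metrics/APM pipeline can pick up
      :telemetry.execute(
        [:my_app, :request, :stop],
        %{duration: duration},
        %{path: conn.request_path, status: conn.status}
      )

      conn
    end)
  end
end

Plug it early in MyAppWeb.Endpoint, before the router, so the measured time covers Plug.Parsers, the router and the controller.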

BTW, if memory serves, Ecto et al. are integrated with OpenTelemetry, so you should check that out and make sure all those spans are being emitted and shown in your APM UI.
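Concretely, that Ecto integration is roughly a one-liner at application start (double-check the opentelemetry_ecto README for the exact telemetry prefix; this assumes a repo named MyApp.Repo):

# in MyApp.Application.start/2, assuming the opentelemetry_ecto dependency is installed
OpentelemetryEcto.setup([:my_app, :repo])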

Yup I’ll do that. Thanks!

Sorry I can’t be more helpful, but nothing else jumps to mind. I’d just start timing stuff wherever I can and isolate the culprit.

I am mostly sure this is something related to DB pools but have no basis to claim it.

Thanks for thinking about it with me.

For checking the DB pools: what we could do is maybe lower the pool size per node, right?

No, more like increase queue_target and queue_interval as a start. Maybe under higher load whoever needs DB connections has to wait longer. These two options have helped me in very limited cases.

Link to docs.
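For reference, those knobs live in the Repo config, roughly like this (values are placeholders, not recommendations):

import Config

config :my_app, MyApp.Repo,
  pool_size: 10,
  # target checkout time (ms); checkouts slower than this count as pool pressure
  queue_target: 500,
  # if checkouts keep exceeding queue_target over this window (ms), the pool starts shedding requests
  queue_interval: 5_000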

Today our team was able to reproduce this locally.

Seems running

:recon_alloc.set_unit(:megabyte)
:recon_alloc.memory(:allocated_types)

during the peak revealed a lot of eheap_alloc, to the tune of 40 GB and increasing fast. On re-runs I was able to get near 100 GB (on an M1).

binary_alloc stayed constant at ~125 MB

Would this shed some light on what’s happening?

Even during the peaks the highest-memory pid has only 253 MB at most, but that eheap_alloc… my gosh, it was growing incessantly.

I did some research and it seems this is due to long-running processes and how GC gets triggered, but I know very little about this, or about how to break up “monolithic” processes.
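In case it helps, this is roughly the kind of thing we plan to run in a remote IEx shell next time it spikes (assuming recon is available):

:recon.proc_count(:memory, 10)            # top 10 processes by memory
:recon.proc_count(:total_heap_size, 10)   # top 10 by heap size (in words)

# does forcing a GC on the biggest one actually shrink its heap?
[{top_pid, _bytes, _info} | _] = :recon.proc_count(:memory, 1)
:erlang.garbage_collect(top_pid)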

As a general observation, it seems that some binaries are never released by the GC. You’ll have to start using :binary.copy at process boundaries.

This is extremely likely related to sending parts of the received parsed JSON objects from one process to another.
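Something along these lines, purely illustrative (the module, function and the "ids" key are made up):

defmodule MyApp.Handoff do
  # Copy only the slice you need before sending it to another process, so the
  # receiver doesn't keep a reference that pins the whole parsed request
  # binary (and its sub-binaries) in memory.
  def hand_off(big_params, worker_pid) do
    ids =
      big_params
      |> Map.fetch!("ids")
      |> Enum.map(&:binary.copy/1)

    send(worker_pid, {:process_ids, ids})
  end
end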

Would that be true even if the binary_alloc is staying fairly constant at 125 MB, while eheap was at 77 GB and climbing?

How many of those 200 MB+ processes did you have, though? Those 90 GB of total heap need to go somewhere.

Sorry, I don’t remember. I’ve fought with such problems very rarely and the details didn’t stick.


Thanks for jumping in to help out. That’s what’s weird: there are only 1 or 2 at most,
even during the memory spikes.

I read that eheap is memory the VM pre-allocates for process heaps so processes don’t have to keep asking for more. I wonder if that eheap memory doesn’t get counted when doing Process.info(pid, :memory)?
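I might try something like this on one of the big pids to compare the numbers directly (illustrative only):

pid = self()   # stand-in; would be one of the suspect pids
Process.info(pid, [:memory, :heap_size, :total_heap_size, :binary])
# :memory is in bytes while the heap sizes are in words,
# so multiply the latter by :erlang.system_info(:wordsize) to compare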

BTW it might be worth raising your problem on the Erlang forums. Here’s one topic there that kinda sorta sounds similar, if for no other reason than eheap being mentioned.

A small and IMO incomplete explanation on eheap is contained here: memory - eheap allocated in erlang - Stack Overflow

Beyond that though, no clue.

I have to tell you that periodically we get forum posts like yours and I feel really bad about them because they very rarely get resolved, or if they are the original posters never follow through. :confused:


I think we’ve found the issue - just wanted to report back.

Sample testing a change that uses PartitionSupervisor with its default setup of one partition per core showed better results.

I’d think that, together with poolboy, will probably resolve this issue.
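Roughly what we’re trialling (simplified; the supervisor and worker names are ours):

children = [
  # one DynamicSupervisor per partition; defaults to one partition per scheduler/core
  {PartitionSupervisor,
   child_spec: DynamicSupervisor,
   name: MyApp.DynamicSupervisors}
]

# starting a child now routes through the partition chosen by the key
DynamicSupervisor.start_child(
  {:via, PartitionSupervisor, {MyApp.DynamicSupervisors, self()}},
  {MyApp.Worker, []}
)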

Thanks everyone for offering help!