Websocket binary memory leak

Hello. I am seeing high memory usage in cowboy_protocol.init/4.

A brief overview of the application:
Data is aggregated in a GenServer, and updates are pushed out via websockets when the data changes or a new user connects. The data changes a lot. Pretty simple.

The dataset is fairly large.

To reduce the binary memory leak issue in the GenServers that aggregate the data, I had to manually garbage collect between each iteration.
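For context, a minimal sketch of that manual collection in the aggregating GenServer (module, interval, and helper names are illustrative only, not the actual code):

```elixir
defmodule MyApp.Aggregator do
  use GenServer

  # Illustrative 20-second aggregation loop with a manual GC at the end
  # of each iteration to release refc binaries built up along the way.
  @interval 20_000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def init(_opts) do
    schedule_tick()
    {:ok, %{data: %{}}}
  end

  def handle_info(:tick, state) do
    new_state = %{state | data: aggregate()}
    schedule_tick()
    :erlang.garbage_collect()
    {:noreply, new_state}
  end

  defp schedule_tick, do: Process.send_after(self(), :tick, @interval)

  # Placeholder for the real aggregation work.
  defp aggregate, do: %{}
end
```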

I believe I now have the same problem, but with each user connected via the socket.

My question is: how do you handle websocket memory in Phoenix? From what I am seeing, most people do not worry about it. My 100 users are eating about 500 MB, and when I trigger a GC from the Observer tool, I reclaim about 400 MB.

Is this something I should be using? https://hexdocs.pm/phoenix/Phoenix.Transports.WebSocket.html#module-garbage-collection

I guess what I am looking for is overall strategies. This application is in beta right now in our organization and will bump up to about 1000 users once I get this figured out.

1 Like

If the BEAM is not reclaiming it on its own, there is probably no need to reclaim it, because memory is still available. The BEAM prefers to reclaim upon process destruction (it is effectively free then); it only collects inside a process when that single process grows too large for the memory available to it. Are you actually running out of memory? If so, that would be a bug, but otherwise this seems normal.
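To check that concretely, a couple of standard calls from IEx are enough (pid here stands in for one of the socket or aggregator processes, so this is just a sketch):

```elixir
# Node-wide breakdown: total, processes, binary, ets, ...
:erlang.memory()

# Per-process view: heap size, refc binaries referenced, mailbox length.
Process.info(pid, [:memory, :binary, :message_queue_len])
```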

1 Like

I once had a similar problem when ingesting large data into a GenStage pipeline. After reading Chapter 7 in Erlang in Anger [4], I figured out that the issue was the GenStage payloads being large and thus living in the global heap. Given that the processes were long running, these globally allocated binaries were not being cleaned up. I was able to pinpoint the issue by using observer_cli [1] and :recon.bin_leak/1 [2].
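For reference, the recon check is a one-liner from IEx:

```elixir
# Force a GC on every process and report the 10 that released the most
# refc binaries, i.e. the likeliest binary hoarders.
:recon.bin_leak(10)
```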

Solving it was a matter of adding :hibernate to the return tuple from handle_call [3]. In your case with Phoenix, it seems like that :garbage_collect flag should do the trick as it is calling :erlang.garbage_collect/1 under the hood. If it doesn’t seem to solve the problem, use the previously mentioned tools to see where the leak is coming from.
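Concretely, the hibernate fix is just the extra element in the return tuple; a hypothetical handler might look like:

```elixir
@impl true
def handle_call({:get_chunk, id}, _from, state) do
  # Hibernating after the reply forces a full GC and compacts the heap,
  # dropping references to large binaries from this request.
  {:reply, Map.get(state.chunks, id), state, :hibernate}
end
```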

[1] https://github.com/zhongwencool/observer_cli
[2] https://github.com/ferd/recon
[3] https://hexdocs.pm/elixir/GenServer.html#c:handle_call/3
[4] https://s3.us-east-2.amazonaws.com/ferd.erlang-in-anger/text.v1.1.0.pdf

Unless you’re running into the ref-counted binary “memory leak” issue. Running low on system memory does not trigger garbage collections in Erlang processes. If the process itself runs low on memory, or hits some other GC trigger, then it runs GC. In the binary memory leak scenario you can get pathological cases where a process holds on to large ref-counted binaries for a very long time because its internal conditions for garbage collection are never met.

@cburman01 in our code we tend to just throw GC timers on anything that is gonna touch large binaries. For things that do intermittent work, the new hibernate_after option is also useful.
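The hibernate_after option goes in the start options; for example (the 15-second value is arbitrary, and MyApp.Aggregator is the same illustrative module name as above):

```elixir
# If the process sits idle for 15 seconds, it hibernates: a full garbage
# collection plus a compacted heap until the next message arrives.
GenServer.start_link(MyApp.Aggregator, [], name: MyApp.Aggregator, hibernate_after: 15_000)
```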

2 Likes

I had exactly the same question (the one at the end) a few years ago. After that, with Jose’s great help, I made a PR that makes it possible to collect garbage after each iteration: just send socket.transport_pid the :garbage_collect message and memory usage goes down.
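From a channel, that looks something like this (channel, event, and payload names are made up for the sketch):

```elixir
defmodule MyAppWeb.DataChannel do
  use Phoenix.Channel

  def join("data:lobby", _params, socket), do: {:ok, socket}

  # After pushing a large payload to the client, ask the transport
  # process to collect garbage so refc binaries get released.
  def handle_info({:push_update, payload}, socket) do
    push(socket, "update", payload)
    send(socket.transport_pid, :garbage_collect)
    {:noreply, socket}
  end
end
```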

1 Like

True this! I try to keep my binaries short just for this reason unless it’s something that truly needs to persist.

Thanks Ben. I haven’t tested the hibernate option yet, but the main GenServer is on a 20-second loop, so I didn’t know if that was too short to consider using hibernate.

How do you guys do your GC timers? A handle_info that just garbage collects, triggered by a send_after?
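Something like this, perhaps (interval and names picked arbitrarily)?

```elixir
defmodule MyApp.SomeWorker do
  use GenServer

  @gc_interval :timer.minutes(1)

  def init(state) do
    schedule_gc()
    {:ok, state}
  end

  # Periodic GC trigger: collect, then re-arm the timer.
  def handle_info(:collect_garbage, state) do
    :erlang.garbage_collect()
    schedule_gc()
    {:noreply, state}
  end

  defp schedule_gc, do: Process.send_after(self(), :collect_garbage, @gc_interval)
end
```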

1 Like