Great insight, guys — that second video Koko linked was very interesting. The next time this happens, I'll dive into the process while it's running to see what's going on. When I find the concrete culprit, I'll update this thread with what was causing it.
I had a similar problem a while back. The cause ended up being a combination of factors. A GenServer that was supposed to be transient was effectively permanent, thanks to a bad return tuple from a handler that should have shut it down — instead it crashed, which caused the supervisor to restart it. That, combined with very aggressive caching of data in its state, would end up creating hundreds of copies of a process that was meant to live for about 5 minutes, and the memory footprint would climb north of 30GB.
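A minimal sketch of how that failure mode can arise (module and message names are hypothetical, not from the original thread): under a `:transient` restart strategy, a crash is an abnormal exit and triggers a restart, so a handler that returns an invalid tuple instead of stopping cleanly keeps the process alive indefinitely.

```elixir
defmodule CacheWorker do
  use GenServer

  # Assumed to be started under a supervisor with restart: :transient,
  # so abnormal exits are restarted while :normal exits are not.
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    # Schedule the intended ~5-minute shutdown.
    Process.send_after(self(), :shut_down, :timer.minutes(5))
    {:ok, %{cache: %{}, opts: opts}}
  end

  @impl true
  def handle_info(:shut_down, state) do
    # BUG: {:stop, state} is not a valid handle_info return value,
    # so the GenServer exits with {:bad_return_value, ...} — an
    # abnormal exit — and the :transient supervisor restarts what
    # was meant to be a short-lived process.
    {:stop, state}
  end
end
```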
After fixing the termination tuple and drastically trimming back the data that ended up in state, the server hasn't gone over 400MB more than once or twice, even under heavy load.
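The corrected handler, sketched with the same hypothetical names as above: stopping with reason `:normal` means a `:transient` supervisor will not restart the process, so it really goes away at the end of its intended lifetime.

```elixir
@impl true
def handle_info(:shut_down, state) do
  # Valid stop tuple: {:stop, reason, state}. A :normal exit is
  # not restarted under restart: :transient, so the short-lived
  # process terminates for good after ~5 minutes.
  {:stop, :normal, state}
end
```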
I had a rough idea of where to look and how to trigger the bug, which helped, but selective logging of memory stats, plus :observer and wobserver (the web version) along with recon, helped enormously in confirming both the problem and the fix.
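For anyone diagnosing something similar, recon makes it quick to confirm which processes dominate memory from a remote shell. The calls below are real recon/OTP APIs; the specific pid is illustrative only:

```elixir
# In an IEx shell attached to the running node:

# Top 10 processes by memory footprint.
:recon.proc_count(:memory, 10)

# Worst offenders over a time window: top 5 by memory
# growth, sampled over 1000 ms.
:recon.proc_window(:memory, 5, 1000)

# Drill into one suspect (pid values are illustrative).
pid = pid(0, 250, 0)
Process.info(pid, [:memory, :message_queue_len, :current_function])

# Processes holding the most refc binaries — a common hidden leak.
:recon.bin_leak(5)
```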