DoS mitigation and GenServer

konstantine · February 24, 2021, 5:28pm

There is a lot of online advice about not hitting the atom limit (default of 1048576), including in relation to DoS mitigation. By contrast, I have not seen any mention of the limit on simultaneous BEAM processes (default of 262144). Whenever a web page starts a new GenServer, doesn’t this create an opportunity for a DoS attack? If so, is there a recommended way of mitigating it?
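For reference, both limits and the current usage can be read from a running VM. This is only a small sketch using `:erlang.system_info/1`; the map and its label are just for illustration, and the values you see depend on your OTP version and on any `+P` (process limit) or `+t` (atom limit) emulator flags in effect.

```elixir
# Read the VM's configured limits and its current usage at runtime.
limits = %{
  atom_limit: :erlang.system_info(:atom_limit),
  atom_count: :erlang.system_info(:atom_count),
  process_limit: :erlang.system_info(:process_limit),
  process_count: :erlang.system_info(:process_count)
}

IO.inspect(limits, label: "BEAM limits")
```

Comparing `process_count` against `process_limit` over time gives a rough idea of how much headroom a node really has.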

LostKobrakai · February 24, 2021, 5:55pm

I guess the difference is how expected each resource exhaustion is.

Leaking atoms is a problem specific to the implementation of atoms on the BEAM, which newcomers to the platform are usually completely unaware of (no matter their overall experience). Every datatype but atoms is eventually cleaned up, with larger binaries in second place for easily becoming sticky when referenced by long-running processes.
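To make the atom side concrete, here is a minimal sketch of two common ways to avoid creating atoms from untrusted input. The `SafeAtoms` module and its allow-list are hypothetical names for illustration; `String.to_existing_atom/1` and `Map.fetch/2` are standard library calls.

```elixir
defmodule SafeAtoms do
  # Hypothetical allow-list: the only values we accept from user input.
  @allowed %{"asc" => :asc, "desc" => :desc}

  # Look the input up in an explicit allow-list; this never creates atoms.
  def sort_order(input) when is_binary(input) do
    Map.fetch(@allowed, input)
  end

  # Alternative: succeed only if the atom already exists in the VM.
  # String.to_existing_atom/1 raises ArgumentError for unknown atoms.
  def existing_atom(input) when is_binary(input) do
    {:ok, String.to_existing_atom(input)}
  rescue
    ArgumentError -> :error
  end
end
```

Either approach avoids calling `String.to_atom/1` on request data, which is the usual way an atom leak gets introduced.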

Leaking resources by starting anything in response to an HTTP request is, at least in my book, a much more expected DoS vector. I’d also suggest the usual suspects for dealing with it: something like Cloudflare and/or rate limiting for incoming requests, plus queues and a limited number of workers for dealing with what was requested.
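One way to get the "limited number of workers" part is to start per-request processes under a `DynamicSupervisor` with `:max_children` set. The following is a rough sketch, not a complete design: `SessionWorker`, `SessionPool`, the cap of 2, and the `:overloaded` error are all made up for illustration, and a real cap would be sized to the node.

```elixir
defmodule SessionWorker do
  use GenServer

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

  @impl true
  def init(arg), do: {:ok, arg}
end

defmodule SessionPool do
  # Hypothetical cap, kept tiny here so the behaviour is easy to observe.
  @max_workers 2

  def start_link do
    DynamicSupervisor.start_link(
      strategy: :one_for_one,
      max_children: @max_workers,
      name: __MODULE__
    )
  end

  # Start a worker, or refuse once the cap is reached so the caller
  # can answer with something like HTTP 503 instead of spawning more.
  def start_worker(arg) do
    case DynamicSupervisor.start_child(__MODULE__, {SessionWorker, arg}) do
      {:ok, pid} -> {:ok, pid}
      {:error, :max_children} -> {:error, :overloaded}
      {:error, reason} -> {:error, reason}
    end
  end
end
```

With this in place the number of worker processes is bounded no matter how many requests arrive; rate limiting in front of it still matters, since rejected requests cost some work too.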

Exadra37 · February 24, 2021, 8:05pm

And add some firewall rules on the server running the application, even if you have Cloudflare in front of it.

al2o3cr · February 24, 2021, 10:03pm

The other difference I see is that processes can shut down and release resources, but atoms are never reclaimed. A DoS trying to overflow the process limit would need substantial instantaneous bandwidth, versus an atom leak which only needs enough total requests over the lifetime of the VM.

mpope · February 25, 2021, 1:16am

I’ve seen ‘accidental’ DoS due to processes. However, it wasn’t caused by the number of existing processes as such, but by a routing algorithm for directing messages to the proper processes. As the number of processes grew, the responsiveness of the system slowed, and around 50K processes made the system completely unresponsive. We needed a two-fold mitigation strategy: first, kill processes more frequently (an easy short-term fix), and second, change the routing algorithm to something sane (longer term, and could be pushed off until it was actually needed). Killing the processes when they no longer needed to be active dropped the total to 100-300 processes at a time, but it had other effects, like increased database pressure, because we now had to hit the DB to load the right info into a newly spawned process when that info would previously have just been kept in the process dictionary.
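For anyone wondering what "kill processes when they no longer need to be active" can look like, one common pattern is a GenServer idle timeout. This is only a sketch and not the code we ran: `IdleWorker`, the `:value` and `:idle_ms` options, and the five-minute default are assumptions for illustration.

```elixir
defmodule IdleWorker do
  # :temporary so a supervisor would not restart it after a normal stop.
  use GenServer, restart: :temporary

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def get(pid), do: GenServer.call(pid, :get)

  @impl true
  def init(opts) do
    state = %{
      value: Keyword.fetch!(opts, :value),
      # Hypothetical default idle window; tune to your workload.
      idle_ms: Keyword.get(opts, :idle_ms, :timer.minutes(5))
    }

    # The third element arms the idle timeout.
    {:ok, state, state.idle_ms}
  end

  @impl true
  def handle_call(:get, _from, state) do
    # Returning the timeout again re-arms it after every message.
    {:reply, state.value, state, state.idle_ms}
  end

  @impl true
  def handle_info(:timeout, state) do
    # No message arrived within idle_ms: shut down and free resources.
    {:stop, :normal, state}
  end
end
```

The trade-off is the one described above: a shorter idle window means fewer live processes but more reloads from the database when a process has to be started again.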

This isn’t a great example for isolated HTTP sessions, granted. It is good to keep in mind for OTP designs that rely on a process per ‘object’ where those processes hang around.

konstantine · February 25, 2021, 8:04pm

Thank you all for your replies. It always feels good when a better understanding is attained!