Issues with using GenServer.cast to log metrics?

So I was catching up on my keynote backlog when I got to Chris McCord’s ElixirConf 2017 keynote. Around the 27-minute mark, he talked about a performance issue with a plug that calls GenServer.cast to log metrics.

Paraphrased, he said something along the lines of:

the problem we’re seeing is that these services were doing a GenServer.cast on every request to a reporter backend. The reporter backend was batching the requests, but it was sending a message to a single genserver process on every request, so this crashed the VM under load and reduced the application to single-threaded performance

Referring to the slide

So, my questions are:

  • What’s the issue with the code above? I get the part about crashing the VM, as it could flood the process’s mailbox and potentially cause it to run out of memory, but how does this slow the application down to single-threaded performance? Shouldn’t GenServer.cast be asynchronous and thus have minimal impact on the requests?

  • How should it be done instead? I’ve seen some suggestions about maintaining a pool of workers, but I don’t see how that would help, since there’s still a “reporting backend” receiving the messages, and that would still be the bottleneck. (Or really, I see how it would become the bottleneck for metric collection, but I don’t understand how it would become a bottleneck that affects the web requests, since it runs in a separate process, ignoring resource consumption, process scheduling, etc.)

but it was sending a message to a single genserver process on every request

A single GenServer leads to a bottleneck.

How should it be done instead?

Prometheus, for example, pulls the metrics instead of having clients push them. You can collect metrics either in ETS tables (one per scheduler to avoid lock contention) or with NIFs (like oneup).
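To make the ETS idea concrete, here is a rough sketch of per-scheduler counters that a scraper could read on demand. The module name MetricsSketch, the table names, and the functions init/0, incr/2 and read/1 are all made up for illustration; this is not how Prometheus or any particular library does it, just one way to get the "one table per scheduler, pull instead of push" shape using plain :ets calls.

```elixir
defmodule MetricsSketch do
  # Hypothetical module: one public ETS table per scheduler, so request
  # processes running on different schedulers write to different tables.

  # Creates the tables. The calling process owns them, so in a real app this
  # would be called from a long-lived process (assumption: e.g. a supervisor child).
  def init do
    for id <- 1..:erlang.system_info(:schedulers) do
      :ets.new(table(id), [:named_table, :public, :set, write_concurrency: true])
    end

    :ok
  end

  # Called from the request process itself. No message is sent to any
  # other process, so no mailbox can pile up.
  def incr(metric, amount \\ 1) do
    id = :erlang.system_info(:scheduler_id)
    # The 4th argument is the default row inserted when the key is missing.
    :ets.update_counter(table(id), metric, amount, {metric, 0})
  end

  # Called by whoever pulls the metrics: sum the counter across all tables.
  def read(metric) do
    Enum.reduce(1..:erlang.system_info(:schedulers), 0, fn id, acc ->
      case :ets.lookup(table(id), metric) do
        [{^metric, n}] -> acc + n
        [] -> acc
      end
    end)
  end

  defp table(id), do: :"metrics_sketch_#{id}"
end
```

After MetricsSketch.init(), a plug would call something like MetricsSketch.incr(:http_requests) on each request, and the scrape endpoint would call MetricsSketch.read(:http_requests). The write happens in the caller's own process, which is the point: there is no single process that every request has to message.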

Can you elaborate?

I get that there’s an obvious bottleneck in the reporting, but how does it affect the performance of the web requests?

It’s probably implied that the effect on the web application was indirect: it crashed because the single GenServer couldn’t process all the incoming requests.
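A small experiment can show the backlog building up. This is only a toy, not the code from the slide: SlowReporter is a made-up module, and the 10 ms sleep is an arbitrary stand-in for whatever work the real reporter did per message.

```elixir
defmodule SlowReporter do
  use GenServer

  def init(state), do: {:ok, state}

  # Pretend each metric takes 10 ms to handle (arbitrary, for illustration).
  def handle_cast({:report, _metric}, state) do
    Process.sleep(10)
    {:noreply, state}
  end
end

{:ok, pid} = GenServer.start_link(SlowReporter, nil)

# Simulate 1,000 requests, each firing one cast. Every cast returns
# immediately, so the senders never notice that the reporter is behind.
for i <- 1..1_000, do: GenServer.cast(pid, {:report, i})

# The unprocessed messages sit in the reporter's mailbox, in memory.
{:message_queue_len, backlog} = Process.info(pid, :message_queue_len)
IO.puts("messages still waiting in the mailbox: #{backlog}")
```

Since cast gives no backpressure, the mailbox grows for as long as messages arrive faster than the single process can handle them, and that memory is what eventually takes the VM down.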

That’s what I thought, but I couldn’t be sure as English isn’t my first language, so I thought maybe I had missed something. I suppose I was just overthinking it. LOL