Background
We have a system in Elixir that handles millions of requests per second. To stay responsive, we run thousands of workers in parallel to answer those requests.
Problem
The problem here is that we don’t know if a worker is being overwhelmed. This is dangerous because the mailboxes of the workers grow until the system crashes.
:erlang.process_info(pid, :message_queue_len)
One of the solutions this community recommended was to use :erlang.process_info/2 (or its Elixir equivalent, Process.info/2) to check the mailbox of a worker. Then, if the mailbox has too many messages (let’s say, 100), we drop the request.
http://erlang.org/doc/man/erlang.html#process_info-2
This is nothing new: the Elixir Logger itself uses a similar approach.
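To make the idea concrete, here is a minimal sketch of the check I have in mind. The module name MailboxGuard, the dispatch/3 function, and the {:request, request} message shape are all made up for illustration, and it assumes the workers are local processes (Process.info/2 only works on local pids).

```elixir
defmodule MailboxGuard do
  # Hypothetical helper for illustration only: forwards a request to a
  # worker unless its mailbox already holds too many messages.
  @max_queue_len 100

  def dispatch(worker_pid, request, max_len \\ @max_queue_len) do
    case Process.info(worker_pid, :message_queue_len) do
      {:message_queue_len, len} when len < max_len ->
        # Mailbox is below the limit, so hand the request over.
        send(worker_pid, {:request, request})
        :ok

      {:message_queue_len, _len} ->
        # Mailbox is at or above the limit, so drop the request.
        {:error, :overloaded}

      nil ->
        # Process.info/2 returns nil when the process is not alive.
        {:error, :noproc}
    end
  end
end
```

A caller would pattern match on the result, for example treating {:error, :overloaded} as "request dropped" and counting it in a metric.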
The issue here is that for each request we get, we would need to call :erlang.process_info(pid, :message_queue_len) roughly 60 times (because each request can go to up to 65 workers for different types of processing).
So we would be invoking this function hundreds of millions of times per second. For this to work, this function needs to be lightweight, which raises some questions:
- Is process_info safe to use in production code?
- Is process_info a heavy operation when compared to checking an ETS table for a value?
I have read the official docs and didn’t find anything alarming. What are your experiences with the usage of this function?
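For anyone who wants to compare the two locally, a rough micro-benchmark could look like the sketch below. The module name QueueLenBench, the table name, and the iteration count are arbitrary choices of mine. It runs from a single process against an idle worker, so it says nothing about behavior under contention or about the cost of keeping the ETS value up to date; treat the numbers as a first impression only.

```elixir
defmodule QueueLenBench do
  # Hypothetical micro-benchmark for illustration; not a rigorous measurement.
  def run(iterations \\ 1_000_000) do
    # An idle worker whose mailbox we inspect.
    worker = spawn(fn -> Process.sleep(:infinity) end)

    # An ETS table holding a precomputed queue length for the same worker.
    table = :ets.new(:queue_lens, [:set, :public, read_concurrency: true])
    :ets.insert(table, {worker, 0})

    # :timer.tc/1 returns {elapsed_microseconds, result}.
    {info_us, _} =
      :timer.tc(fn ->
        Enum.each(1..iterations, fn _ ->
          Process.info(worker, :message_queue_len)
        end)
      end)

    {ets_us, _} =
      :timer.tc(fn ->
        Enum.each(1..iterations, fn _ ->
          :ets.lookup(table, worker)
        end)
      end)

    Process.exit(worker, :kill)
    :ets.delete(table)

    # Total elapsed microseconds for each approach.
    %{process_info_us: info_us, ets_lookup_us: ets_us}
  end
end
```

Calling QueueLenBench.run() in IEx returns a map with the total time spent in each loop, which can be divided by the iteration count to get a per-call estimate.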