Is `:erlang.process_info(pid, :message_queue_len)` a heavy operation?


We have a system in Elixir that handles millions of requests per second. In order to be responsive, we run thousands of workers in parallel to answer those requests.


The problem here is that we don’t know if a worker is being overwhelmed. This is dangerous because the mailboxes of the workers grow until the system crashes.

:erlang.process_info(pid, :message_queue_len)

One of the solutions this community recommended was to use :erlang.process_info/2 (or its Elixir equivalent) to check the mailbox of a worker. Then, if the mailbox has too many messages (let’s say, 100), we drop the request.

This is nothing new; the Elixir Logger itself uses a similar approach.
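The check described above could be sketched like this. This is a minimal illustration, not code from the thread: `Dispatcher`, `@max_mailbox`, and the message shapes are all made-up names, and `Process.info/2` is just the Elixir wrapper around `:erlang.process_info/2`.

```elixir
# Hypothetical load-shedding check: inspect the worker's mailbox before
# dispatching, and drop the request if the worker is already overwhelmed.
defmodule Dispatcher do
  @max_mailbox 100

  def dispatch(worker_pid, request) do
    case Process.info(worker_pid, :message_queue_len) do
      {:message_queue_len, n} when n >= @max_mailbox ->
        # Worker is overwhelmed: shed load instead of growing the mailbox.
        {:error, :overloaded}

      {:message_queue_len, _n} ->
        send(worker_pid, {:request, request})
        :ok

      nil ->
        # Process is dead; Process.info/2 returns nil.
        {:error, :dead}
    end
  end
end
```

Note that this is exactly the pattern whose cost is being questioned: the check runs once per dispatch, from outside the worker.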

The issue here is that for each request we get, we would need to call :erlang.process_info(pid, :message_queue_len) roughly 60 times (because each request can go through up to 65 workers for different types of processing).

So we would be invoking this function hundreds of millions of times per second. For this to work, this function needs to be lightweight, which raises some questions:

  1. Is process_info safe to use in production code?
  2. Is process_info a heavy operation when compared to checking an ETS table for a value?

I have read the official docs and didn’t find anything alarming. What are your experiences with the usage of this function?

There’s a @josevalim video on his Twitch channel that talks about it.

The title is something like ‘improving logger’; I’m going from memory here, since Twitch is blocked where I work.

Basically, it is blocking and should never be used in a tight loop.

The best way to know if it’s an expensive operation is to measure it:

iex(1)> pid = self()
iex(2)> :timer.tc(fn -> :erlang.process_info(pid, :message_queue_len) end)
{7, {:message_queue_len, 0}}
iex(3)> :timer.tc(fn -> :erlang.process_info(pid, :message_queue_len) end)
{6, {:message_queue_len, 0}}
iex(4)> :timer.tc(fn -> :erlang.process_info(pid, :message_queue_len) end)
{8, {:message_queue_len, 0}}
iex(5)> :timer.tc(fn -> :erlang.process_info(pid, :message_queue_len) end)
{8, {:message_queue_len, 0}}
iex(6)> :timer.tc(fn -> :erlang.process_info(pid, :message_queue_len) end)
{5, {:message_queue_len, 0}}
iex(7)> :timer.tc(fn -> :erlang.process_info(pid, :message_queue_len) end)
{8, {:message_queue_len, 0}}
iex(8)> :timer.tc(fn -> :erlang.process_info(pid, :message_queue_len) end)
{7, {:message_queue_len, 0}}
iex(9)> :timer.tc(fn -> Kernel.+(1, 1) end)
{4, 2}
iex(10)> :timer.tc(fn -> Kernel.+(1, 1) end)
{5, 2}
iex(11)> :timer.tc(fn -> Kernel.+(1, 1) end)
{4, 2}
iex(12)> :timer.tc(fn -> Kernel.+(1, 1) end)
{3, 2}

It seems to be slightly more expensive than a remote function call.


Can you elaborate on what you mean by “blocking”? All function calls in Elixir are blocking in some sense, so what does it block?


But is it constant time, or does the runtime of this function perhaps depend on the length of the queue?


I don’t get this. Do you mean that Kernel.+, which is the basic + operator in Elixir (meaning how we add numbers), is considered a remote operation?

I apologize if this seems nonsensical to you, but when I hear about remote functions I immediately jump to things like RPC, CORBA and RMI, which are, iirc, quite expensive (because they go across the network).

:erlang.process_info/2, when called from an external process, puts a lock on the process being “infoed”. Therefore, it is super safe when calling with self() but not from an external one.


My guess would be that it’s constant time, but it’s probably best to measure :).

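One way to measure it, sketched below. This is not from the thread; it just times the call once against an empty mailbox and once against a flooded one, using the same `:timer.tc/1` approach as earlier. The specific message count is arbitrary.

```elixir
# Does the cost of :erlang.process_info/2 grow with the queue length?
# Time the lookup on an empty mailbox, then flood the target and time it again.
pid = spawn(fn -> Process.sleep(:infinity) end)

{empty_us, {:message_queue_len, 0}} =
  :timer.tc(fn -> :erlang.process_info(pid, :message_queue_len) end)

# Flood the (sleeping, never-receiving) process with 100k messages.
for i <- 1..100_000, do: send(pid, i)

{full_us, {:message_queue_len, n}} =
  :timer.tc(fn -> :erlang.process_info(pid, :message_queue_len) end)

IO.inspect({empty_us, full_us, n}, label: "us empty / us full / queue len")
```

If the two timings stay in the same ballpark on your machine, the lookup is effectively constant time for this key.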

It was just to show an example of what it returns for a “cheap” operation. Also note that I ran this as evaluated code, which is not a great measurement; it is better to measure like this:

iex(1)> defmodule Foo do
...(1)> def x(pid), do: fn -> :erlang.process_info(pid, :message_queue_len) end
...(1)> def y, do: fn -> Kernel.+(1, 1) end
...(1)> end
{:module, Foo,
<<70, 79, 82, 49, 0, 0, 5, 64, 66, 69, 65, 77, 65, 116, 85, 56, 0, 0, 0, 176,
  0, 0, 0, 18, 10, 69, 108, 105, 120, 105, 114, 46, 70, 111, 111, 8, 95, 95,
  105, 110, 102, 111, 95, 95, 7, 99, 111, ...>>, {:y, 0}}
iex(2)> x = Foo.x(self())
#Function<0.71361502/0 in Foo.x/1>
iex(3)> y = Foo.y()
#Function<1.71361502/0 in Foo.y/0>
iex(4)> :timer.tc(x)
{2, {:message_queue_len, 0}}
iex(5)> :timer.tc(y)
{1, 2}

It is almost unmeasurable.


The way I interpret this is that if I have a given worker whose mailbox is currently growing because it is overwhelmed, having an external process issue a process_info call on that dying worker is a terrible idea, because I will block the worker, keeping it from doing its job.

Is this interpretation correct ?

You shouldn’t call

:erlang.process_info(pid, :message_queue_len)

hundreds of millions of times per second.

But each process could safely ask for its own counter and send it back when requested.

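The self-reporting idea could look something like this sketch. It is one possible reading of the suggestion, not code from the thread: each worker inspects its own mailbox (cheap, since self-inspection takes no external lock) and publishes the length to an ETS table, which dispatchers read without ever touching the worker. The table and module names are made up.

```elixir
# Each worker periodically publishes its own mailbox length to a shared ETS
# table; dispatchers read the table instead of calling process_info on workers.
defmodule SelfReporter do
  def init_table do
    :ets.new(:worker_load, [:named_table, :public, read_concurrency: true])
  end

  # Called by the worker itself, e.g. between handling messages.
  def report do
    {:message_queue_len, n} = :erlang.process_info(self(), :message_queue_len)
    :ets.insert(:worker_load, {self(), n})
  end

  # Called by dispatchers; a plain ETS read, never blocks the worker.
  def load(pid) do
    case :ets.lookup(:worker_load, pid) do
      [{^pid, n}] -> n
      [] -> 0
    end
  end
end
```

The trade-off is that the published value is slightly stale, but for load shedding a slightly stale counter is usually good enough, and it neatly answers question 2 above: the hot path becomes an ETS lookup.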

This is terminology used within the BEAM. It basically means a fully-qualified call to a function in a different module, not “remote” in the generic sense of across a network.