Erlang DNS resolver performing badly when two nodes are split

2Sjch8AT · November 19, 2020, 3:31pm

We have a production system in which one Erlang node, connected to a second node, performs badly when that second node is down. In that case the message queues on the first node grow and the node cannot process its messages quickly enough. The nodes communicate via GenServer.cast. To send a message to the second node we use a call such as GenServer.cast({:registered_name_on_the_remote, "server2.example.com"}, msg).
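
To make the sending pattern concrete, here is a minimal sketch of a wrapper that only casts when the target node is currently connected. It rests on an assumption that has not been verified here: that the repeated lookups are triggered by connection attempts made when sending to a disconnected node. The module name `RemoteCast` and the function `cast_if_connected/2` are hypothetical and not part of our code base.

```elixir
defmodule RemoteCast do
  # Hypothetical helper, not part of the original system.
  # Casts only if the target node is the local node or is already
  # in the list of connected nodes, so that sending does not have to
  # set up a new connection to an unreachable node.
  def cast_if_connected({name, remote} = dest, msg)
      when is_atom(name) and is_atom(remote) do
    if remote == Node.self() or remote in Node.list() do
      GenServer.cast(dest, msg)
    else
      # The message is dropped; the caller decides whether to retry.
      {:error, :nodedown}
    end
  end
end
```

The trade-off is that messages are dropped while the node is away, and something else (for example a periodic Node.connect/1) has to re-establish the connection once the remote node is back.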

Using tcpdump, tracing, trial and error, and reading some of the Erlang code, I have found that :inet_gethost_native.gethostbyname is invoked a huge number of times for as long as the second node is stopped. This increased load sometimes causes the function to take more than 3 seconds to return a value! With tcpdump I measured how long the DNS queries take, and I can confirm that the extra time is spent in the Erlang code and not in the network or the DNS server (the longest DNS query took 15 ms).
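
For reference, the duration of a single lookup can be sampled with :timer.tc. This is only a sketch: the module `LookupTimer` is a hypothetical name, and :inet_gethost_native is an internal OTP module, so calling it directly is meant for diagnosis only.

```elixir
defmodule LookupTimer do
  # Hypothetical diagnostic helper.
  # Returns {elapsed_milliseconds, result}, where result is whatever
  # :inet_gethost_native.gethostbyname/1 returned.
  def time_lookup(host) when is_binary(host) do
    # The Erlang function expects a charlist, not an Elixir binary.
    args = [String.to_charlist(host)]
    {micros, result} = :timer.tc(:inet_gethost_native, :gethostbyname, args)
    {div(micros, 1000), result}
  end
end
```

Calling LookupTimer.time_lookup("server2.example.com") on the first node while the second node is stopped, and again while it is running, gives a rough comparison of the two cases.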

My team and I were surprised that just disconnecting or stopping the node (with Ctrl-C or systemctl stop <service>) causes problems.

I have created a proof of concept with a detailed README showing how to reproduce the problem:

We are also surprised that such a situation puts Erlang under stress. After all, a node being down should not be a big deal, and DNS queries should be cached.

Would you consider this behavior a bug?
Do you have experience with such a situation?

P.S: the system where the code is running:

Erlang/OTP 21 [erts-10.3.5.7] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
Elixir 1.10.3 (compiled with Erlang/OTP 21)
Debian GNU/Linux 9 (stretch)