Difficult debugging problem

TL;DR: this is a game of whack-a-mole, moving the bottleneck around. A lot of this advice is covered above, but I’ve added some additional tips and hopefully some useful commentary.

If you’re still struggling with this, I would bypass all of docker/kubes and spin up a bare metal instance at https://packet.net/ (they provide excellent short-term hardware and networking, just like cloud VMs but on real hardware), and do a quick comparison to see what your app manages without the layers of virtualisation and load balancer goop in the way. Then, in order of least effort / max benefit:

  1. read the error messages; they tell you what’s wrong. All the logs, all the time. Measure the impact of your changes.

  2. ensure you are running with the maximum ulimit (file descriptor limit) for the BEAM process.

  3. check your Erlang VM ports with recon http://ferd.github.io/recon/ and monitor them to ensure you’re not running out. It’s an easy fix, and the extra headroom costs little in memory terms (a sketch follows the example output below).

  4. ensure your socket listen queues are not backlogged in Erlang. See http://erlang.org/doc/man/gen_tcp.html#type-option for some possible options to pass through to your HTTP/TCP libraries, but basically: use large buffers, no Nagle delay, and TCP keepalive (see the sketch after the example output below). This varies depending on the particular library you’re using, but it should get you started at least. You can see from outside with netstat -Lan | grep <port> whether you’re running out. A healthy acceptor should show 0/0/... at all times; anything else means you have problems.

Here are my haproxy and couchdb listeners, for example:

```
fffff80e098a17a0 tcp46 0/0/2000                         *.443 (haproxy)
fffff8071a26e000 tcp4  0/0/2000                         127.0.0.1.5984 (beam.smp from couchdb)
```
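To make items 3 and 4 concrete, here’s a minimal Erlang sketch (untested; the port number, backlog and buffer sizes are arbitrary examples, not recommendations). Your HTTP library will usually expose these as its own config keys rather than raw gen_tcp options:

```erlang
%% Sketch only: these are standard inet/gen_tcp options, but most HTTP/TCP
%% libraries wrap them in their own configuration.
ListenOpts = [binary,
              {backlog, 2000},      % listen queue depth; pair with the OS somaxconn limit
              {nodelay, true},      % disable Nagle's algorithm
              {keepalive, true},    % enable TCP keepalive
              {recbuf, 256 * 1024}, % larger socket buffers (example sizes)
              {sndbuf, 256 * 1024},
              {reuseaddr, true}],
{ok, LSock} = gen_tcp:listen(8080, ListenOpts),

%% How close is the VM to running out of ports? recon:port_types/0 gives a
%% breakdown by type if you have recon loaded.
io:format("ports in use: ~p of ~p~n",
          [erlang:system_info(port_count), erlang:system_info(port_limit)]).
```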
  5. tune the TCP ephemeral port range, and tweak TIME_WAIT reuse if logs show you’re running out. I see sonewconn errors in kernel logs on FreeBSD, but I’m not sure what Linux spits out. The rabbitmq notes on this are excellent: https://www.rabbitmq.com/networking.html (more notes below). In particular I’d expect to see {error, eaddrnotavail} being returned from attempts to open new connections in the VM; these should bubble up to your HTTP layer (see the sketch after this list).

  6. get real data from the network side with tcpdump & wireshark to see how your external connections are behaving. You may need help interpreting these; if so, just skip this until last.

  7. ensure that you’re using persistent outbound TLS + HTTP/1.1 (or better) connections to your upstream APIs. The first step is to use multiple IPs for outbound connections in your BEAM pool (see the sketch after this list), and potentially the second is to move some work out of the VM entirely: I use haproxy and do TLS termination there rather than inside the BEAM. This minimises the work the VM has to do, gives me a very nice stats page showing how connections are being handled and retried, and adds load balancing capabilities across instances and servers.

  8. go back and read your logfiles again, hand in hand with the proxy and debugging output, looking for correlations.
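As a sketch of points 5 and 7 (connection errors bubbling up, and binding outbound connections to an explicit source IP), something along these lines works for plain TCP; for TLS you’d use ssl:connect/4 with the same {ip, ...} option. The function name, options, and timeout are made up for illustration, and this would live in whatever module manages your outbound pool:

```erlang
%% Hypothetical helper: open an upstream connection from a specific local
%% (source) IP and surface ephemeral port exhaustion instead of hiding it.
connect_upstream(Host, Port, SourceIp) ->
    Opts = [binary,
            {ip, SourceIp},   % bind the local side of the connection to this address
            {nodelay, true},
            {keepalive, true}],
    case gen_tcp:connect(Host, Port, Opts, 5000) of
        {ok, Sock} ->
            {ok, Sock};
        {error, eaddrnotavail} = Err ->
            %% no ephemeral ports left for this (SourceIp, Host, Port) combination
            Err;
        {error, _Reason} = Err ->
            Err
    end.
```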

Further notes below.

Personally I am a huge fan of grabbing a tcpdump on the server and throwing it into wireshark to see what’s actually happening on the wire. Every time I’ve needed to do this, I’ve wished I had done it earlier. I appreciate this is not necessarily everybody’s cup of tea, but if you’re dealing with network issues, you really do need to see the actual traffic.

It’s really common (as others have pointed out) to run out of ports, both in the Erlang VM and in the OS (the latter is called ephemeral TCP port exhaustion). The former is raised with +Q as mentioned and just requires a bit more RAM; the latter is slightly more complicated.
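For reference, the VM-side limit is the +Q emulator flag, set in vm.args or on the erl command line; the value below is only an example, and you can confirm the running limit with erlang:system_info(port_limit):

```
# vm.args (or directly on the command line: erl +Q 1048576).
# 1048576 is an example value, not a recommendation.
+Q 1048576
```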

If this is the case your OS should be logging it somewhere; you’ll need to research that to confirm whether it’s happening. On my OS (FreeBSD) I see sonewconn errors in the kernel logs. A symptom of this is an increasing number of TCP connections in the TIME_WAIT state. You can see this in dmesg and in the output of ss -tan state time-wait as mentioned above. The next step is to make more ports available for use as ephemeral ports; http://www.ncftp.com/ncftpd/doc/misc/ephemeral_ports.html and https://www.nginx.com/blog/overcoming-ephemeral-port-exhaustion-nginx-plus/ have reasonable explanations.
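As a rough idea of what widening the range looks like (example values only; the exact sysctl names here are from memory, so double-check against your platform’s documentation before changing anything):

```
# Linux: widen the ephemeral port range
sysctl -w net.ipv4.ip_local_port_range="10000 65535"

# FreeBSD: the equivalent knobs
sysctl net.inet.ip.portrange.first=10000
sysctl net.inet.ip.portrange.last=65535
```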

For network stack tuning in general, https://fasterdata.es.net/host-tuning/, http://proj.sunet.se/E2E/tcptune.html and https://www.psc.edu/index.php/networking/641-tcp-tune are excellent references, as is https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/performance_tuning_guide/#chap-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-Networking. But remember that you can’t simply tweak all of the above settings without consequences; slow, gradual change and testing are required :frowning:

If you make your outbound pools too big and you’re using persistent connections, the upstream servers can also start disconnecting idle connections, increasing your connection churn even further. If you can, contact those providers and ask whether they do any idle-connection handling like this.

If you’re connecting over and over again to the same upstream IPs, or all your connections are coming from a load balancer (e.g. kubernetes, AWS LB), you may be hitting TIME_WAIT exhaustion much earlier than expected, as the BSD socket API uses the quad tuple (source IP, source port, dest IP, dest port) as a unique key. The solution is not to enable socket reuse in the kernel, but to inject further source & destination IPs into the mix. You can configure additional (non-public) IPs on the server, and have the load balancer bound to more IPs on both the inbound and outbound legs of the proxy (if both are needed), so that the quad tuples are spread out across more IPs. Again, a simple wc of ss or netstat -Lan output is not helpful here; you need to see whether a given IP is hitting your ephemeral port limits.
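Building on the connect_upstream/3 sketch above, spreading outbound connections over a small pool of extra local IPs could look like this (the IP list is a made-up example, and the addresses must actually be configured on the host):

```erlang
%% Hypothetical source-IP pool: each outbound connection picks one of the
%% locally configured addresses, so the (src IP, src port, dst IP, dst port)
%% tuple space becomes several times larger.
-define(SOURCE_IPS, [{10,0,0,10}, {10,0,0,11}, {10,0,0,12}]).

pick_source_ip() ->
    lists:nth(rand:uniform(length(?SOURCE_IPS)), ?SOURCE_IPS).

connect_spread(Host, Port) ->
    connect_upstream(Host, Port, pick_source_ip()).
```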

Finally, this (at least for me) is where haproxy shines: I can see what’s happening in the logs and change many of these network stack related settings without interrupting my BEAM applications. Separation of concerns is a nice thing, but you do introduce a further component.
