Difficult debugging problem

BTW it was 256 connections for a single 8 CPU server?

When the CPU is overloaded the message queues back up, but if CPU usage is okay the message queues stay mostly empty.
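
For anyone checking this on their own node: a minimal sketch, assuming the recon library is installed, for spotting which processes are accumulating messages:

# top five processes by message queue length (requires :recon as a dependency)
:recon.proc_count(:message_queue_len, 5)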

256 connections per service per CPU. So if you are talking to two services, you’d have two pools, each with 256 max_connections, and calls to each service use their own pool. We were only calling two services heavily, so if you are calling more you may need to adjust a bit.
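
For illustration, a minimal sketch of one-pool-per-service using hackney pools. The pool names, sizes, and URL are made up, and hackney itself is an assumption here; adapt to whatever HTTP client/pooling library you actually use:

# one pool per upstream service, sized independently
:hackney_pool.start_pool(:service_a, timeout: 15_000, max_connections: 256)
:hackney_pool.start_pool(:service_b, timeout: 15_000, max_connections: 256)

# each request names the pool for the service it is calling
:hackney.request(:get, "https://service-a.example.com/api", [], "", pool: :service_a)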

2 Likes

Very interesting thread. Based on the description of incoming / outgoing request rate, you may want to read the guide my team has put together:

https://www.rabbitmq.com/networking.html#dealing-with-high-connection-churn

This came about due to RabbitMQ MQTT users who ran into issues in scenarios like you describe. You may get a lot of mileage out of adjusting system-level TCP settings.

3 Likes

I run a similar application that handles over 1B requests a day. It definitely sounds like you are hitting limits on TCP ports, either open ports in general or TCP ephemeral ports.

We use dedicated servers for this. It’s nice to get 24 CPU cores and 32 GB of RAM for about $100/month… Here are some things to look at. After that, it’s probably K8s related.

  1. Basic open file limits for the account running the app: /etc/security/limits.d/foo-limits
foo soft    nofile          1000000
foo hard    nofile          1000000
  2. Open file limits for systemd (which doesn’t respect the limits file), e.g.
    LimitNOFILE=65536

  3. Open file limits in the VM, in vm.args for your release, e.g.
    -env ERL_MAX_PORTS 65536

You may also want:

## Enable kernel poll and a few async threads
+K true
+A 128
  4. If you are running Nginx in front (not needed or helpful in this kind of application, but here for reference), then you need to tune that as well, e.g.
    worker_rlimit_nofile 65536;

See https://www.cogini.com/blog/serving-your-phoenix-app-with-nginx/

  5. After that, you will run into a lack of ephemeral TCP ports.
    In TCP/IP, a connection is defined by the combination of source IP + source port + destination IP + destination port. In this situation, all but the source port is fixed: 1.2.3.4 + random + 4.5.6.7 + 80. There are only 64K ports, and the TCP/IP stack won’t reuse a port for 2 x the maximum segment lifetime, which by default is 2 minutes.

Doing the math:

  • 60000 ports / 120 sec = 500 requests per sec

It hits you very hard running behind Nginx as a proxy, but can also hit you on the outbound side when you are talking to a small number of back end servers.

Tune the kernel settings to reduce the maximum segment lifetime, e.g.:

# Decrease the default tcp_fin_timeout value
net.ipv4.tcp_fin_timeout = 15
# Allow TIME_WAIT sockets to be reused for new outbound connections
net.ipv4.tcp_tw_reuse = 1

There are other kernel TCP settings you should tune as well, e.g.

sysctl -w fs.file-max=12000500
sysctl -w fs.nr_open=20000500
ulimit -n 20000000
sysctl -w net.ipv4.tcp_mem='10000000 10000000 10000000'
sysctl -w net.ipv4.tcp_rmem='1024 4096 16384'
sysctl -w net.ipv4.tcp_wmem='1024 4096 16384'
sysctl -w net.core.rmem_max=16384
sysctl -w net.core.wmem_max=16384

See https://phoenixframework.org/blog/the-road-to-2-million-websocket-connections

If you are getting limited talking to back-end servers, then it’s useful to give your server multiple IP addresses and tell your HTTP client library to use an IP from a pool as its source when talking to the back ends. The equation then becomes “source IP from pool” + random port + target IP + 80.
You may be able to reuse outbound connections, with HTTP pipelining, if the back ends support it.
At a certain point, the back-end servers may be the limit. They may benefit from having more IPs as well.
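
To make that concrete, here is a hedged sketch of pinning the source address on outbound connections. Most Erlang/Elixir HTTP clients let you pass raw socket connect options through, and gen_tcp/ssl accept an {:ip, address} option; the address and the use of hackney here are illustrative assumptions only:

# choose a source address from your own pool of local IPs (made-up address)
source_ip = {10, 0, 0, 5}

:hackney.request(:get, "https://api.example.com/", [], "",
  connect_options: [ip: source_ip]
)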

DNS lookups for the back ends can become an issue. We have had hosting providers block us because they thought we were running a DoS attack on their DNS. Run a local caching DNS on your server.

See https://www.cogini.com/blog/best-practices-for-deploying-elixir-apps/

Glad to give you more specific help if you need it.

19 Likes

Mother of god :clap:t3::clap:t3::clap:t3::clap:t3::clap:t3::clap:t3::clap:t3::clap:t3::clap:t3:

Also thank you @lukebakken

these replies are both 1000% phenomenal and I’m super grateful. I will have to go through each of these very slowly in detail, THANK YOU!!!

2 Likes

A few questions

  1. Is -env ERL_MAX_PORTS 65536 the same as +Q 134217727?
  2. I don’t see any ipv4.mem options… are those not available on some servers? Maybe because I’m inside the container… if so, I’m not sure how to set these other sysctl values outside of Kubernetes… hmmmm (I just checked, it’s not in the instance itself; I’m using Google Container-Optimized OS). I might just have to do this on vanilla Ubuntu servers or something.
  3. What’s the easiest way to run a ‘local caching DNS’ on the server?

@9mm KubeDNS / CoreDNS is already one of those. You’ll just want to check the tuning / configuration link for CoreDNS I mentioned earlier to make sure it has enough resources.

1 Like
  1. Yes, ERL_MAX_PORTS is the same as +Q, use that
  2. The kernel params are probably different running in a container
  3. Here is an example of running a local caching DNS on a server: https://www.cogini.com/blog/running-a-local-caching-dns-for-your-app/
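
For reference, a small sketch showing both forms in vm.args plus how to check the limit at runtime (the number is just an example):

## in vm.args, either form raises the BEAM port limit
+Q 65536
## (equivalently: -env ERL_MAX_PORTS 65536)

# then check at runtime from a remote console:
:erlang.system_info(:port_limit)   # configured maximum
:erlang.system_info(:port_count)   # ports currently in use
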
2 Likes

Ok thanks. I actually don’t think GKE uses CoreDNS though; they chose to keep using KubeDNS, as far as I can tell.

1 Like

TL;DR: this is a game of whack-a-mole / move-the-bottleneck. A lot of this advice is covered above, but I’ve added some additional tips and hopefully some useful commentary.

If you’re still struggling with this, I would bypass all of Docker/Kubernetes and spin up a bare-metal instance at https://packet.net/; they provide excellent short-term hardware and networking, just like cloud VMs but on real hardware. Do a quick comparison to see what your app ends up with, without the layers of virtualisation and load-balancer goop in the way, and then, in order of least effort / max benefit:

  1. read the error messages; they tell you what’s wrong. All the logs, all the time. Measure the impact of your changes.

  2. ensure you are running with max ulimit

  3. check your Erlang VM ports with recon http://ferd.github.io/recon/ and monitor them to ensure you’re not running out. It’s an easy fix, and the added overhead is not expensive in memory terms (see the first sketch after this list).

  4. ensure your socket listen queues are not backlogged in Erlang. See http://erlang.org/doc/man/gen_tcp.html#type-option for some possible options to pass through to your HTTP/TCP libraries: basically use large buffers, disable the Nagle delay, and use TCP keepalive (see the second sketch after this list). This varies depending on the particular library you’re using, but it should get you started at least. You will see from the outside with netstat -Lan | grep <port> whether you’re running out. A healthy acceptor should show 0/0/... at all times; anything else and you have problems.

Here’s my haproxy and couchdb for example:

fffff80e098a17a0 tcp46 0/0/2000                         *.443 (haproxy)
fffff8071a26e000 tcp4  0/0/2000                         127.0.0.1.5984 (beam.smp from couchdb)
  5. tune the TCP ephemeral port range, and tweak TIME_WAIT reuse if the logs show you’re running out. I see sonewconn errors in the kernel logs on FreeBSD, but I’m not sure what Linux spits out. The RabbitMQ examples are excellent, https://www.rabbitmq.com/networking.html; more notes below. In particular, I’d expect to see {error, eaddrnotavail} being returned from attempts to open new connections in the VM. These should bubble up to your HTTP layer.

  6. get real data from the network side with tcpdump & wireshark to see how your external connections are behaving. You may need help interpreting these; if so, just skip this until last.

  7. ensure that you’re using persistent outbound TLS + HTTP/1.1 (or better) connections to your upstream APIs. The first step is to use multiple IPs for outbound connections in your BEAM pool; potentially, as a second step, move some work out of the VM entirely. I use haproxy and do TLS termination there rather than inside the BEAM. This minimises the work the VM has to do, gives me a very nice stats page showing how connections are being handled and retried, and adds load-balancing capabilities across instances and servers.

  8. go back and read your log files again, hand in hand with proxy and debugging output, looking for correlations.
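
Re point 3: a minimal sketch, assuming the recon library, of the kind of port checks meant here:

# how close are we to the VM's port limit?
:erlang.system_info(:port_count)   # ports currently in use
:erlang.system_info(:port_limit)   # configured maximum (+Q)
# which kinds of ports dominate (tcp_inet, udp_inet, files, ...)?
:recon.port_types()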
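
Re point 4: the exact knobs depend on your server/client library, but the underlying :gen_tcp options look roughly like this (port number and values are illustrative only, not taken from the thread):

# typical listen options to pass through to your acceptor / transport
{:ok, listen_socket} =
  :gen_tcp.listen(4003, [
    :binary,
    backlog: 2048,      # larger accept queue
    nodelay: true,      # disable Nagle's algorithm
    keepalive: true,    # enable TCP keepalive
    buffer: 65_536      # larger user-level buffer
  ])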

Further notes below.

Personally I am a huge fan of grabbing a tcpdump on the server, and throwing that into wireshark to see what’s actually happening on the wire. Every time I’ve needed to do this, I’ve wished that I had done it earlier. I appreciate this is not necessarily everybody’s cup of tea but if you’re dealing with network issues, then you need to see what’s actually happening on the wire.

It’s really common (as others have pointed out) to run out of ports, both in the Erlang VM and in the OS (called ephemeral TCP port exhaustion). The former is set with +Q as mentioned and just requires more RAM; the latter is slightly more complicated.

If this is the case, your OS should be logging it somewhere; you’ll need to research that to confirm whether it’s happening. On my OS (FreeBSD) I see sonewconn errors in the kernel logs. A symptom of this is an increasing number of TCP connections in the TIME_WAIT state. You can see this in dmesg and in the output of ss -tan state time-wait as mentioned above. The next step is to make more ports available for use as ephemeral ports; http://www.ncftp.com/ncftpd/doc/misc/ephemeral_ports.html and https://www.nginx.com/blog/overcoming-ephemeral-port-exhaustion-nginx-plus/ have reasonable explanations.

For network stack tuning in general, https://fasterdata.es.net/host-tuning/, http://proj.sunet.se/E2E/tcptune.html, and https://www.psc.edu/index.php/networking/641-tcp-tune are excellent references, as is https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/performance_tuning_guide/#chap-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-Networking. But remember that you can’t simply tweak all of the above settings without some consequences; slow, gradual change and testing are required :frowning:

If you make your outbound pools too big, and you’re using persistent connections, the upstream servers can also start disconnecting idle connections, thus increasing your connection churn even further. If you can, contact those providers and ask them if they do any connection handling like this or not.

If you’re connecting over and over again to the same upstream IPs, or all your connections are coming from a load balancer (e.g. Kubernetes, an AWS LB), you may be hitting the TIME_WAIT state much earlier than expected, as the BSD socket API uses the quad tuple (source IP, source port, dest IP, dest port) as a unique key. The solution is not to enable socket reuse in the kernel, but to inject further source & destination IPs into the mix. You can configure additional (non-public) IPs on the server, and have the load balancer bound to more IPs on both the inbound and outbound legs of the proxy (if both are needed), so that the quad tuples are spread out across more IPs. Again, the output of ss or netstat -Lan as a simple wc is not helpful; you need to see whether a given IP is hitting your ephemeral port limits.

Finally, this (at least for me) is where haproxy shines: I can see what’s happening in the logs and change many of these network-stack-related settings without interrupting my BEAM applications. Separation of concerns is a nice thing, but you do introduce a further component.

15 Likes

Wow this is 1000% incredible, thank you. I’m literally bookmarking this reply for years to come.
Can you tell me how to run tcpdump in the way you do to feed it into wireshark? That sounds very helpful. I’ll need to read this a few more times to absorb it all. THANK YOU

4 Likes

Sorry for the belated reply, but I’m sure this will be useful at some point! Here are a couple of examples. This first one is tcpdump slurping packets on the lo1 interface (the actual traffic sits behind haproxy, which handles the TLS termination), filtering only on traffic going to, or coming from, port 4003:

# tcpdump -i lo1 -vvv  'port 4003' -ttt -w /tmp/plug.pcap
tcpdump: listening on lo1, link-type EN10MB (Ethernet), capture size 262144 bytes
^C12 packets captured
582 packets received by filter
0 packets dropped by kernel

root@i09 /u/h/dch# l /tmp/plug.pcap 
-rw-r--r--  1 root  wheel   1.3K Mar  1 13:21 /tmp/plug.pcap

You can then copy that file locally and open it in wireshark. wireshark and tcpdump have powerful filtering capabilities (https://wiki.wireshark.org/CaptureFilters), so you can do a lot more than just port 4003: you could filter only on traffic coming from a particular IP, or on a datagram containing a specific word (like a Host: HTTP header).

This example uses ngrep, capturing only traffic that matches a particular regex of text, with the same tcpdump filter as above. ngrep also displays things nicely as text which, if you’re stripping TLS upstream, makes reading HTTP for debugging purposes almost pleasant. The packets are also written to a file in the same format that tcpdump produces and wireshark reads.

# ngrep -W byline -O /tmp/plug.pcap  -qid lo1 '(health|http)'  port 4003
interface: lo1
filter: ( port 4003 ) and ((ip || ip6) || (vlan && (ip || ip6)))
match: (health|http)
output: /tmp/plug.pcap

T fc36:c375:a2e3:b0e3:e4ee::1:61304 -> fc36:c375:a250:715c:f0b4::8:4003 [AP] #22
GET /healthz HTTP/1.0.
.


T fc36:c375:a250:715c:f0b4::8:4003 -> fc36:c375:a2e3:b0e3:e4ee::1:61304 [AP] #23
HTTP/1.1 200 OK.
cache-control: max-age=0, private, must-revalidate.
connection: close.
content-length: 0.
date: Fri, 01 Mar 2019 13:35:48 GMT.
server: Cowboy.
.

It’s worth restating that if you are using OTP to handle TLS connections then none of the above is at all useful, and that HTTP/2 doesn’t use the same readable text. I’m a huge fan of haproxy for doing TLS termination, switching external HTTP/2 to HTTP/1.1, and load balancing in general; handling TLS outside the BEAM lets it do what it does best as well.

2 Likes