9mm

Difficult debugging problem

I’m having a very frustrating issue debugging something, and I’m now resorting to the forum for help.

I’m sure people will recognize this question as I’ve exhausted a lot of options trying to solve it.

I have an API that currently gets ~100,000 incoming requests/minute (roughly 1,700 RPS). For each incoming request, I create 2 to 4 outbound requests, each with a 500ms timeout. If any requests don’t complete within 500ms, I take the ones which completed successfully, analyze the results, and then return the response to the original incoming API request. So basically 100K incoming equates to 200K-400K outgoing.
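The fan-out described above can be sketched roughly as follows. This is only an illustration in Python with asyncio, not the actual Elixir code; `fan_out`, `fake_upstream` and `demo` are hypothetical names, and the fake upstream just sleeps instead of making a real HTTP call.

```python
import asyncio


async def fan_out(calls, timeout=0.5):
    """Run all calls concurrently and return the results that finished in time."""
    tasks = [asyncio.ensure_future(c) for c in calls]
    if not tasks:
        return []
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()  # give up on upstreams that missed the deadline
    # Let the cancelled tasks finish unwinding before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    # Keep only calls that completed without raising, in the original order.
    return [t.result() for t in tasks if t in done and t.exception() is None]


async def fake_upstream(name, delay):
    """Hypothetical stand-in for one outbound HTTP request."""
    await asyncio.sleep(delay)
    return name


def demo():
    calls = [
        fake_upstream("a", 0.01),
        fake_upstream("b", 5.0),  # far slower than the 500ms budget
        fake_upstream("c", 0.02),
    ]
    return asyncio.run(fan_out(calls, timeout=0.5))


if __name__ == "__main__":
    print(demo())
```

The slow call is dropped and the response is built from whatever arrived inside the budget, which is the behaviour the post describes.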

I need to scale incoming requests to 25K RPS (to start) and beyond… upwards of 200K incoming RPS (400K-800K outgoing)

I log all outgoing request error codes, timeouts, etc.

I’m encountering an error where outgoing requests will start to report timeouts at ~20K incoming req/min (a very low number… that’s only about 1000 outgoing RPS). If it goes much higher than that, I will see outgoing request timeouts as high as 80% to 95%.
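For reference, the conversion from incoming requests per minute to outgoing requests per second can be checked with a few lines. This is a small sketch; `outgoing_rps` is a hypothetical helper, and the 2 to 4 fan-out comes from the description above.

```python
def outgoing_rps(incoming_per_min, fanout_min=2, fanout_max=4):
    """Return the (low, high) outgoing requests/sec for a given incoming rate."""
    incoming_rps = incoming_per_min / 60
    return incoming_rps * fanout_min, incoming_rps * fanout_max


low, high = outgoing_rps(20_000)
print(f"{low:.0f}-{high:.0f} outgoing RPS")  # 667-1333
```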

I’ve hired a few people to help me figure this out, and everyone is stumped. Here are my current findings:

  • Increased ulimit
  • Added +K and +Q 134217727 (total ports) in vm.args
  • Migrating from AWS to Google Cloud helped by about a factor of 2-3x but didn’t fix the issue. Currently I’m on Google Cloud and the issue still happens at ~20K/min (incoming)
  • I know for a fact it’s not the APIs themselves (I’ll spare you all those details)
  • I’ve tried multiple HTTP libraries (Tesla, Buoy, Machine Gun, HTTPoison), all have basically the same results with different levels of timeouts and errors
  • I’ve tried not using a pool, and adjusting pool settings. I don’t remember the exact specifics of how it affected it, I just know it didn’t do much

Most importantly… observer seems to report no issues. Every single person that looks is like “hmm… you’re right, observer shows no MsgQ backups, and Reds is a normal number. There’s no IO load.”

So far, everyone has thought this is some issue lying outside of Erlang (all things seem to point to it, especially considering migrating from AWS to GKE helped).

However, last night I did a test which 100% confirms to me the issue is not ‘outside’ but within Erlang/Elixir/the HTTP library.

First, I created a node pool in Kubernetes with a single node: a highcpu-64 (64 CPUs and 240GB memory). There are zero other nodes running.

For those not familiar with kubernetes, a “node” is a server instance, and a “pod” is an instance of the app running.

I then made a single pod running on that node. I observed the app would start having timeouts at ~20K/min. Obviously the CPU and memory on the machine were at like 1% each because it’s a huge server. Again, when timeouts occur, observer has no backed up MsgQ and Reds is fairly normal.

I then increased pods to 12, all running on the same node (no changes were made to the node, networking settings, etc. All variables are the same).

Each instance of the app logs stats, so now instead of seeing 1 set of stats @ 20K/min I see 12 sets of stats @ 1.6K/min each. Lo and behold, the problem magically goes away. In fact, I can now increase the traffic to 10K per app instance (10000 * 12 = 120K/min), with < 1% timeouts.

To phrase it another way: 1 app on 1 64-CPU server = 20K/min max throughput before errors. 12 apps on the SAME 64-CPU server with no other changes = 10K/min throughput per app without errors. That’s 120K/min total, six times the 20K/min ceiling, so the max throughput without errors absolutely increased by changing only Erlang/Elixir and no outside server settings, firewall settings, etc. I’m sure each app could even go up to ~20K/min again before it starts getting errors, but it’s difficult to control the traffic level.

This to me proves the problem lies in the Elixir/Erlang app space and not some external issue (UNLESS each instance of Erlang is somehow allocating outside resources based on a limit, and creating new app instances is “fixing it” because it’s getting more of those outside resources).

I guess I’m out of ideas of what to think about / try. If anyone has any complicated Linux commands I could run, like using strace or something to help debug, I’d love to hear them. I could sit here all day playing with pool settings and HTTP libraries but my gut says it’s not that, because in every library I try there’s always some low limit, which makes me think the problem is somewhere else.

Sorry this isn’t more specific. There are so many things I’ve tried and SO many knobs and dials to turn that it’s extremely difficult to get a 100% cause / effect, especially considering the traffic level always fluctuates. So one minute I’ll be testing something at 8K/min, but then I’ll change something and the traffic level goes to 13K/min, changing variables.

Most Liked Responses

jakemorrison

I run a similar application that handles over 1B requests a day. It definitely sounds like you are hitting limits on TCP ports, either open ports in general or TCP ephemeral ports.

We use dedicated servers for this. It’s nice to get 24 CPU cores and 32 GB of RAM for about $100/month… Here are some things to look at. After that, it’s probably K8s related.

  1. Basic open file limits for the account running the app: /etc/security/limits.d/foo-limits
    foo soft    nofile          1000000
    foo hard    nofile          1000000

  2. Open file limits for systemd (which doesn’t respect the limits file), e.g.
    LimitNOFILE=65536

  3. Open file limits in the VM, in vm.args for your release, e.g.
    -env ERL_MAX_PORTS 65536

You may also want:

    ## Enable kernel poll and a few async threads
    +K true
    +A 128

  4. If you are running Nginx in front (not needed or helpful in this kind of application but here for reference), then you need to tune that as well, e.g.
    worker_rlimit_nofile 65536;

See Serving your Phoenix app with Nginx

  5. After that, you will run into lack of ephemeral TCP ports.
    In TCP/IP, a connection is defined by the combination of source IP + source port + destination IP + destination port. In this situation, all but the source port is fixed: 1.2.3.4 + random + 4.5.6.7 + 80. There are only 64K ports. The TCP/IP stack won’t reuse a port for 2 x maximum segment lifetime, which by default is 2 minutes.

Doing the math:

  • 60000 ports / 120 sec = 500 requests per sec

It hits you very hard running behind Nginx as a proxy, but can also hit you on the outbound side when you are talking to a small number of back end servers.
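The arithmetic above can be sketched as a tiny calculator. Both helper names are hypothetical. The 120-second hold time is the figure used in this post, and the 32768-60999 range is assumed to be a typical Linux default (check `sysctl net.ipv4.ip_local_port_range` on your own machine). Treat the result as a rough ceiling per source IP + destination IP + destination port, not an exact limit.

```python
def ephemeral_port_count(port_range):
    """Number of usable source ports in an inclusive (low, high) range."""
    low, high = port_range
    return high - low + 1


def max_new_connections_per_sec(ports, hold_seconds):
    """Sustained new connections/sec if each closed connection holds its port
    for hold_seconds before the port can be reused."""
    return ports / hold_seconds


# Figures from the post: ~60000 ports, 120 s before a port is reused.
print(max_new_connections_per_sec(60_000, 120))  # 500.0

# Assumed typical Linux default range; verify on your own host.
ports = ephemeral_port_count((32768, 60999))
print(ports, f"{max_new_connections_per_sec(ports, 120):.0f}")  # 28232 235
```

With a narrower default port range the ceiling is even lower than the 500/sec estimate, which is why widening the range or adding source IPs matters.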

Tune the kernel settings to reduce the maximum segment lifetime, e.g.:

# Decrease the default value of tcp_fin_timeout
net.ipv4.tcp_fin_timeout = 15
# Recycle and Reuse TIME_WAIT sockets faster
net.ipv4.tcp_tw_reuse = 1

There are other kernel TCP settings you should tune as well, e.g.

sysctl -w fs.file-max=12000500
sysctl -w fs.nr_open=20000500
ulimit -n 20000000
sysctl -w net.ipv4.tcp_mem='10000000 10000000 10000000'
sysctl -w net.ipv4.tcp_rmem='1024 4096 16384'
sysctl -w net.ipv4.tcp_wmem='1024 4096 16384'
sysctl -w net.core.rmem_max=16384
sysctl -w net.core.wmem_max=16384

See The Road to 2 Million Websocket Connections in Phoenix - Phoenix Blog

If you are getting limited talking to back end servers, then it’s useful to give your server multiple IP addresses. Then tell your HTTP client library to use an IP from a pool as its source when talking to the back ends. So the equation turns into “source IP from pool” + random port + target IP + 80.
You may be able to reuse outbound connections, with HTTP pipelining, if the back ends support it.
At a certain point, the back end servers may be the limit. They may benefit from having more IPs as well.
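As an illustration of the source-IP pool idea, here is a minimal sketch in Python rather than Elixir. `SourceIPPool` is a hypothetical class and the addresses are placeholders; the only library call it relies on is `socket.create_connection` with its `source_address` argument. In an Elixir client the equivalent would be passing a local address option through to the socket, which depends on the HTTP library.

```python
import itertools
import socket


class SourceIPPool:
    """Round-robin over local IPs so each one gets its own ephemeral port space."""

    def __init__(self, source_ips):
        if not source_ips:
            raise ValueError("need at least one source IP")
        self._cycle = itertools.cycle(source_ips)

    def next_source(self):
        # Port 0 lets the kernel pick an ephemeral port for the chosen IP.
        return (next(self._cycle), 0)

    def connect(self, host, port, timeout=0.5):
        # The source IPs must actually be configured on this machine.
        return socket.create_connection(
            (host, port), timeout=timeout, source_address=self.next_source()
        )


pool = SourceIPPool(["10.0.0.1", "10.0.0.2"])  # placeholder addresses
print([pool.next_source() for _ in range(3)])
```

Each extra source IP adds another full range of ephemeral ports towards the same back end.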

DNS lookups on the backends can become an issue. We have had hosting providers block us because they thought we were doing a DoS attack on their DNS. Run a local caching DNS on your server.

See Best practices for deploying Elixir apps

Glad to give you more specific help if you need it.

dch

TLDR this is a game of whack-a-mole move-the-bottleneck. A lot of this advice is covered above but I added some additional tips and hopefully some useful commentary.

If you’re still struggling with this, I would bypass all of docker/kubes and spin up a bare metal instance at https://packet.net/. They provide excellent short-term h/w and networking just like cloud VMs, just on real hardware. Do a quick comparison to see what your app ends up with without the layers of virtualisation and load balancer goop in the way, and then, in order of least effort/max benefit:

  1. read the error messages, they tell you what’s wrong. all the logs, all the time. measure the impact of your changes.

  2. ensure you are running with max ulimit

  3. check your erlang VM ports with recon http://ferd.github.io/recon/ and monitor it to ensure you’re not running out. It’s an easy fix, and adding overhead is not expensive in memory terms.

  4. ensure your socket listen queues are not backlogged in erlang. See http://erlang.org/doc/man/gen_tcp.html#type-option for some possible options to pass through to your HTTP/TCP libraries, but basically use large buffers, no Nagle delay, and tcp keepalive. This varies depending on the particular library you’re using, but this should get you started at least. You will see from outside with netstat -Lan |grep <port> (FreeBSD syntax) if you’re running out. A healthy acceptor should be 0/0/... at all times; anything else means you have problems.

Here’s my haproxy and couchdb for example:

fffff80e098a17a0 tcp46 0/0/2000                         *.443 (haproxy)
fffff8071a26e000 tcp4  0/0/2000                         127.0.0.1.5984 (beam.smp from couchdb)
  5. tune tcp ephemeral port range, and tweak TIME_WAIT reuse if logs show you’re running out. I see sonewconn errors in kernel logs on FreeBSD but I’m not sure what Linux spits out. The rabbitmq examples are excellent, https://www.rabbitmq.com/networking.html more notes below. In particular I’d expect to see {error, eaddrnotavail} being returned from attempts to open new connections in the VM. These should bubble up to your HTTP layer.

  6. get real data from the network side with tcpdump & wireshark to see how your external connections are behaving. You may need help interpreting these, if so just skip this til last.

  7. ensure that you’re using persistent outbound TLS + HTTP1.1 (or better) connections to your upstream APIs. The first step is to use multiple IPs for outbound connections in your BEAM pool, and potentially secondly to move some work out of the VM entirely - I use haproxy and also do TLS termination there rather than inside the BEAM. This minimises the work the VM has to do, and also gives me a very nice stats page on how the connections are being handled, or retried, and additional load balancing capabilities across instances and servers.

  8. go back and read your logfiles again hand in hand with proxy and debugging output looking for correlations.

Further notes below.

Personally I am a huge fan of grabbing a tcpdump on the server, and throwing that into wireshark to see what’s actually happening on the wire. Every time I’ve needed to do this, I’ve wished that I had done it earlier. I appreciate this is not necessarily everybody’s cup of tea but if you’re dealing with network issues, then you need to see what’s actually happening on the wire.

It’s really common (as others have pointed out) to run out of ports, both in the Erlang VM, and in the OS (called ephemeral TCP port exhaustion). The former is set with +Q as mentioned and just requires more ram, the latter is slightly more complicated.

If this is the case, your OS should be logging it somewhere; you’ll need to research that to confirm it’s happening. On my OS (FreeBSD) I see sonewconn errors in kernel logs. A symptom of this is increasing numbers of TCP connections in TIME_WAIT state. You can see this in dmesg and in output of ss -tan state time-wait as mentioned above. The next step is to make more ports available for use as ephemeral ports. http://www.ncftp.com/ncftpd/doc/misc/ephemeral_ports.html and https://www.nginx.com/blog/overcoming-ephemeral-port-exhaustion-nginx-plus/ have reasonable explanations.

For network stack tuning in general, https://fasterdata.es.net/host-tuning/, http://proj.sunet.se/E2E/tcptune.html and https://www.psc.edu/index.php/networking/641-tcp-tune are excellent references, as well as https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/performance_tuning_guide/#chap-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-Networking but remember that you can’t simply tweak all the above settings without some consequences. Slow, gradual change and testing are required.

If you make your outbound pools too big, and you’re using persistent connections, the upstream servers can also start disconnecting idle connections, thus increasing your connection churn even further. If you can, contact those providers and ask them if they do any connection handling like this or not.

If you’re connecting over and over again to the same upstream IPs, or all your connections are coming from a load balancer (e.g. kubernetes, AWS LB), you may be hitting the TIME_WAIT limit much earlier than expected, as the BSD socket API uses the quad tuple (source IP, source port, dest IP, dest port) as a unique key. The solution is not to enable socket reuse in the kernel, but to inject further source & destination IPs into the mix. You can configure additional (non-public) IPs on the server, and have a load balancer bound to more IPs on both inbound and outbound legs of the proxy (if both are needed) so that the quad tuples are spread out across more IPs. Again, piping the output of ss or netstat -Lan into a simple wc is not helpful; you need to see if a given IP is hitting your ephemeral port limits.
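To look at this per IP rather than as a single count, something along these lines could group TIME_WAIT sockets by local IP and peer. It is a rough sketch: `time_wait_by_peer` is a hypothetical helper, the sample text is made up, and it assumes the usual column layout of `ss -tan state time-wait` on Linux (local and peer address as the last two address:port fields).

```python
from collections import Counter


def time_wait_by_peer(ss_output):
    """Count sockets per (local IP, peer address:port) from `ss -tan` style text."""
    counts = Counter()
    for line in ss_output.splitlines():
        # Keep only fields that look like address:port; this skips the header.
        addrs = [
            f for f in line.split()
            if ":" in f and f.rsplit(":", 1)[1].isdigit()
        ]
        if len(addrs) < 2:
            continue
        local, peer = addrs[-2], addrs[-1]
        local_ip = local.rsplit(":", 1)[0]
        counts[(local_ip, peer)] += 1
    return counts


# Made-up sample in the assumed output format.
SAMPLE = """\
Recv-Q Send-Q Local Address:Port Peer Address:Port
0      0      10.0.0.5:40001     203.0.113.7:443
0      0      10.0.0.5:40002     203.0.113.7:443
0      0      10.0.0.5:40003     198.51.100.9:443
"""

for key, n in time_wait_by_peer(SAMPLE).most_common():
    print(key, n)
```

On a live Linux host the input could come from `subprocess.run(["ss", "-tan", "state", "time-wait"], capture_output=True, text=True).stdout`. A single (local IP, peer) pair approaching the size of the ephemeral port range is the thing to watch for.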

Finally, this (at least for me) is where haproxy shines, I can see what’s happening in the logs and change many of these network stack related settings without interrupting my BEAM applications. Separation of concerns is a nice thing, but you do introduce a further component.

benwilson512

Author of Craft GraphQL APIs in Elixir with Absinthe

Definitely do a test where you explicitly set 12 CPUs and 24 schedulers in your vm.args.

In your pod spec do:

  limits:
    cpu: 12000m
  requests:
    cpu: 12000m

and in your vm.args +S 24.

One aside to check here if you’re on K8s is DNS lookups. Each outbound request from within K8s will do like a half dozen DNS attempts to see if there are any local services that use that domain name before actually heading to the outside world. Google may optimize better for this. You can get around this by ending the domains you hit with a trailing . (e.g. api.example.com.) to indicate that they are fully qualified.
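To see where the extra lookups come from, here is a small model of how a resolver expands a name using the search list. It assumes glibc-style behaviour and a pod resolv.conf with `ndots:5` and the usual cluster search domains; `candidate_names` and the domain values are illustrative, not read from a real cluster.

```python
def candidate_names(name, search_domains, ndots=5):
    """Names a glibc-style resolver may try, in order, for a given lookup."""
    if name.endswith("."):
        return [name]  # already fully qualified: no search list expansion
    searched = [f"{name}.{domain}." for domain in search_domains]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + searched
    # Fewer dots than ndots: the search list is tried first.
    return searched + [absolute]


# Assumed search list for a pod in the "default" namespace.
K8S_SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for candidate in candidate_names("api.example.com", K8S_SEARCH):
    print(candidate)

print(candidate_names("api.example.com.", K8S_SEARCH))
```

The resolver stops at the first name that resolves, so an external host only succeeds on the last candidate. Each candidate is typically queried for both A and AAAA records, which is how a single request turns into half a dozen or more DNS queries.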

ALSO be sure to either configure the KubeDNS pods with some CPU requests / limits or ensure they’re on a different node, otherwise you can use so much CPU that KubeDNS can get throttled and stop responding properly.

In fact I’d consider trying to run a test outside of K8s just to eliminate this specific issue.
