I’m having a very frustrating issue debugging something, and I’m now resorting to the forum for help.
I’m sure people will recognize this question as I’ve exhausted a lot of options trying to solve it.
I have an API that currently gets ~100,000 incoming requests/minute (~1,700 RPS). For each incoming request, I create 2 to 4 outbound requests, each with a 500ms timeout. If any requests don’t complete within 500ms, I take the ones that completed successfully, analyze the results, and then return the response to the original incoming API request. So basically 100K incoming/min equates to 200K-400K outgoing/min.
I need to scale incoming requests to 25K RPS (to start) and beyond… upwards of 200K incoming RPS (400K-800K outgoing)
I log all outgoing request error codes, timeouts, etc.
I’m encountering an error where outgoing requests start to report timeouts at ~20K incoming req/min (a very low number… that’s only about 1,000 outgoing RPS). If traffic goes much higher than that, I see outgoing request timeout rates as high as 80% to 95%.
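For anyone who wants to check my math on the traffic numbers, here is a rough back-of-the-envelope sketch in Python. The 2-4 fan-out and the request rates are the ones from this post; the helper names are just made up for illustration.

```python
# Back-of-the-envelope conversion of the traffic numbers in this post.
# Helper names are hypothetical; only the rates and the 2-4 fan-out come from the post.

FANOUT_MIN, FANOUT_MAX = 2, 4  # outbound requests created per incoming request


def per_minute_to_rps(req_per_min):
    """Convert a requests/minute figure to requests/second."""
    return req_per_min / 60


def outbound_rps_range(incoming_rps):
    """Return (low, high) outbound RPS for a given incoming RPS."""
    return incoming_rps * FANOUT_MIN, incoming_rps * FANOUT_MAX


current_rps = per_minute_to_rps(100_000)  # today's incoming traffic
failing_rps = per_minute_to_rps(20_000)   # where timeouts start on a single instance

print(round(current_rps))                                    # 1667
print([round(x) for x in outbound_rps_range(failing_rps)])   # [667, 1333]
print(outbound_rps_range(25_000))                            # (50000, 100000)
```

So the point where timeouts start is somewhere between roughly 670 and 1,330 outbound RPS, while the first scaling target of 25K incoming RPS would mean 50K-100K outbound RPS.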
I’ve hired a few people to help me figure this out, and everyone is stumped. Here are my current findings:
- Increased ulimit
- Set +Q 134217727 (total ports) in vm.args
- Migrating from AWS to Google Cloud helped by about a factor of 2-3x but didn’t fix the issue. Currently I’m on Google Cloud and the issue still happens at ~20K/min (incoming)
- I know for a fact it’s not the APIs themselves (I’ll spare you all those details)
- I’ve tried multiple HTTP libraries (Tesla, Buoy, Machine Gun, HTTPoison), and all give basically the same results, just with different levels of timeouts and errors
- I’ve tried not using a pool, and adjusting pool settings. I don’t remember exactly how each change affected things, I just know it didn’t do much
Most importantly… observer seems to report no issues. Every single person who looks says something like “hmm… you’re right, observer shows no MsgQ backups, and Reds is a normal number. There’s no IO load.”
So far, everyone has thought this is some issue lying outside of Erlang (all things seem to point to it, especially considering migrating from AWS to GKE helped).
However, last night I did a test which 100% confirms to me the issue is not ‘outside’ but within Erlang/Elixir/the HTTP library.
First, I created a node pool in Kubernetes with a single node: a highcpu-64 (64 CPUs and 240GB memory). There are zero other nodes running.
For those not familiar with Kubernetes, a “node” is a server instance, and a “pod” is an instance of the app running.
I then made a single pod running on that node. I observed the app would start having timeouts at ~20K/min. Obviously the CPU and memory on the machine were at about 1% each because it’s a huge server. Again, when timeouts occur, observer shows no backed-up MsgQ and Reds is fairly normal.
I then increased pods to 12, all running on the same node (no changes were made to the node, networking settings, etc. All variables are the same).
Each instance of the app logs stats, so now instead of seeing 1 set of stats @ 20K/min I see 12 sets of stats @ ~1.7K/min each. Lo and behold, the problem magically goes away. In fact, I can now increase the traffic to 10K/min per app instance (10,000 * 12 = 120K/min), with < 1% timeouts.
To phrase it another way: 1 app on 1 64-CPU server = 20K/min max throughput before errors. 12 apps on the SAME 64-CPU server with no other changes = 10K/min throughput per app without errors. That’s 120K/min total, which is 6x the 20K/min limit, so the max throughput without errors absolutely increased by changing only Erlang/Elixir and no outside server settings, firewall settings, etc. I’m sure it could even go up to ~20K/min (each) again before it starts getting errors, but it’s difficult to control the traffic level.
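Here is the same comparison spelled out as a quick Python sketch, using only the numbers from my test (the variable names are just for illustration):

```python
# Throughput comparison for the single-node test described above.
# All figures are requests/minute taken from the post; names are illustrative.

single_pod_limit = 20_000   # 1 pod: timeouts start around here
pods = 12                   # pods running on the same 64-CPU node
per_pod_ok = 10_000         # per-pod traffic observed with < 1% timeouts

total_ok = pods * per_pod_ok              # combined error-free throughput
improvement = total_ok / single_pod_limit
per_pod_share = single_pod_limit / pods   # per-pod load when 20K/min is split 12 ways

print(total_ok)              # 120000
print(improvement)           # 6.0
print(round(per_pod_share))  # 1667
```

So each pod was handling about 6x the per-pod share of the old failing load (10K vs ~1.7K per minute), and the node as a whole handled 6x the single-pod limit, without hitting the timeouts.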
This to me proves the problem lies in the Elixir/Erlang app space and not some external issue (UNLESS each instance of Erlang is somehow allocating outside resources based on a limit, and creating new app instances is “fixing it” because it’s getting more of those outside resources).
I guess I’m out of ideas for what to think about or try. If anyone has any complicated Linux commands I could run, like using strace or something similar to help debug, I’d love to hear them. I could sit here all day playing with pool settings and HTTP libraries, but my gut says it’s not that, because in every library I try there’s always some low limit, which makes me think the problem is somewhere else.
Sorry this isn’t more specific. There are so many things I’ve tried and SO many knobs and dials to turn that it’s extremely difficult to establish a clear cause and effect, especially considering the traffic level always fluctuates. So one minute I’ll be testing something at 8K/min, but then I’ll change something and the traffic level goes to 13K/min, which changes two variables at once.