I have been having gallbladder issues and haven't eaten for 10 days while I wait for my ultrasound, so if this is a little poorly explained I apologize in advance. Fasting is doodoo for my brain power.
I have a Phoenix JSON API endpoint that's currently doing ~100,000 requests/minute across 10 nodes, so roughly 150 RPS per node.
This is the lifecycle of a request:
- A request comes in from vendor A. In order to serve it I need to, in real time, hit 3 other endpoints, gather the results, find the data I need, and then I can fulfill the request.
- So, in the controller for this request, I use Task.async/await to create 3 processes, each of which hits a separate API endpoint with machine_gun. I have the machine_gun timeout set to 1000ms.
- In theory, a request from vendor A should never take longer than 1000-1200ms: the 1000ms timeout cutoff plus several ms of overhead for the rest of the code to run. Let's say I hit 3 endpoints, 2 finish with 200 OK in 50ms, and the 3rd one times out at 1000ms. I take the 2 successful responses and ignore the third, do what I need to with the data I have, and then return it to vendor A.
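For reference, the "take what finished, ignore the rest" step described above can be sketched with only the standard library. This is a minimal sketch, not my actual controller code: the `FanOut` module, the `gather/2` name, and the plain functions standing in for the three HTTP calls are all hypothetical. It uses `Task.yield_many/2` so the whole gather step has its own deadline, and it shuts down whatever is still running at that point.

```elixir
defmodule FanOut do
  # Hypothetical helper: run several calls concurrently and keep
  # only the results that arrive before the deadline.
  #
  # `funs` is a list of zero-arity functions standing in for the HTTP calls.
  # `deadline_ms` bounds the whole gather step, independently of any
  # timeout the HTTP client applies on its own.
  def gather(funs, deadline_ms) do
    funs
    |> Enum.map(&Task.async/1)
    |> Task.yield_many(deadline_ms)
    |> Enum.flat_map(fn
      {_task, {:ok, value}} ->
        [value]

      {_task, {:exit, _reason}} ->
        # Only seen if the caller traps exits; otherwise a crashing
        # linked task takes the caller down with it.
        []

      {task, nil} ->
        # Still running at the deadline: kill it so it does not linger.
        Task.shutdown(task, :brutal_kill)
        []
    end)
  end
end

# Two fast calls and one that is too slow for a 100 ms deadline.
results =
  FanOut.gather(
    [
      fn -> :a end,
      fn -> :b end,
      fn ->
        Process.sleep(500)
        :too_slow
      end
    ],
    100
  )

# results == [:a, :b]
```

The point of the explicit `Task.shutdown/2` is that a task which misses the deadline is reclaimed right away instead of staying alive until its own call gives up.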
In my mind, this seems like it has a predictable upper limit as far as resources are concerned. I mean, realistically, 150 RPS, each creating 3 outbound requests with a 1000ms timeout, is only 450 outgoing RPS, at most 'locking' 450 processes, which will be released immediately.
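To make that estimate explicit, here is the arithmetic as a small sketch. It applies Little's law (average number in flight = arrival rate × time each item stays in the system). The `InFlight` module and its argument names are made up for illustration, and the result is only a steady-state estimate that assumes nothing queues up in front of the outbound calls.

```elixir
defmodule InFlight do
  # Hypothetical helper: estimated number of concurrent outbound calls.
  #   inbound_rps - inbound requests per second on one node
  #   fanout      - outbound calls made per inbound request
  #   hold_ms     - how long each outbound call stays alive, worst case
  def estimate(inbound_rps, fanout, hold_ms) do
    inbound_rps * fanout * hold_ms / 1000
  end
end

one_second_hold = InFlight.estimate(150, 3, 1_000)
# one_second_hold == 450.0

five_second_hold = InFlight.estimate(150, 3, 5_000)
# five_second_hold == 2250.0
```

The bound scales linearly with how long each process is actually held, so the 450 figure only holds if every outbound call really is released within 1000ms.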
However, this is not the case. I noticed that several of my 10 instances are rapidly running OOM and wreaking havoc on other requests. I was getting massive timeouts on all 3 API endpoints, in huge chunks (thousands of requests timing out in batches, as if something else had failed rather than the API itself). Someone on our team ended up pausing 2 of the 3 endpoints, and the 3rd API endpoint, which had been constantly timing out, magically started working at an 80ms response time, 100% of the time, without error. This tells me something outside the actual API endpoint is causing the issue.
I have thousands of errors a minute in Rollbar saying Async/Await is timing out and taking longer than 5000ms to reply, which doesn't even make sense because the timeout is set to 1000.
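One thing worth noting here: 5000ms is the default timeout of `Task.await/2` when no second argument is given, and that timer is separate from whatever timeout the HTTP client enforces. Whether that is what is happening in my case is only a guess. The sketch below uses a sleeping function as a stand-in for the outbound call (an assumption, not the real code) and a short explicit timeout so it finishes quickly. It shows the exit that `Task.await/2` raises when its own timer runs out.

```elixir
# Stand-in for an outbound call that ends up taking longer than expected.
slow_call = fn ->
  Process.sleep(300)
  :late_reply
end

task = Task.async(slow_call)

outcome =
  try do
    # Task.await(task) with no second argument would wait up to 5000 ms.
    # A short explicit timeout is used here only to keep the sketch fast.
    Task.await(task, 100)
  catch
    :exit, {:timeout, {Task, :await, _args}} ->
      # The await gave up, but the task itself is still running.
      # Shut it down so it does not stay behind holding memory.
      Task.shutdown(task, :brutal_kill)
      :await_timed_out
  end

# outcome == :await_timed_out
```

If the await timer and the HTTP timeout differ, the larger of the two decides how long a stuck process can stay alive.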
Also, I connected to observer one time and saw this… I am getting help from someone at Gigalixir, and they said this is used for DNS lookups. Perhaps I don't know DNS as well as I thought, but I thought all of that happened behind the scenes, as in, the language itself wasn't responsible for resolving hostnames to IPs… isn't that the job of a DNS server?
Can anyone think of anything else that may be causing so many failures and wreaking havoc with memory, in such a ‘predictable’ setup?
Here’s an example of the massive groups of blackout timeouts that would happen every minute: