Our Phoenix production API is experiencing intermittent, very high latency on some requests (1s+, sometimes over 30s!)
This seems to happen a few times per day with varying latency amounts.
It’s not correlated with extra traffic, and the traffic on this API is low in general (20 requests/min).
In a non-prod environment with similar infrastructure, response times are ~3x faster with much more volume, and we haven’t seen this latency issue.
The latency seems to occur between when the router receives the request and when a controller receives it.
Once it hits the controller, the request runs in a reasonable amount of time, averaging ~200ms.
Here’s an example of a 27s latency:
09:39:50 app/web.1 09:39:49.837 [info] POST /pos/v1/foo
09:40:17 app/web.1 09:40:16.992 [info] Processing with FooController
09:40:17 app/web.1 09:40:17.318 [info] Foo created successfully
09:40:17 app/web.1 09:40:17.318 [info] Sent 201 in 27481ms
Using APM, we can see that only ~600ms was spent in the controller flow, which is an acceptable amount of time for this request.
Some additional details:
Hosted on Heroku w/ 2 standard 1x dynos
The issue occurs after the request has been sent to our application
We do a few async things, like uploading to S3.
We have one known slow request that consistently waits ~8s on an external network call. We do this synchronously since the long response time is not a problem for this call. This happens ~1x/hour.
Memory stays consistent at all times
Some possibly relevant dependencies / endpoint specifics:
Makes sense. We do run some custom auth (internal, not external) on most routes, which we’ll be inspecting.
No persistent stores a la redis or memcached, just a couple Postgres dbs.
As a sanity check: it’s reasonable to expect to handle that kind of traffic (and much more) on 2 servers, correct? I.e., we’re not exceeding a Phoenix connection/request pool and thus queueing? I have similar questions about exceeding the DB connection pool, but it really doesn’t seem like that’s happening.
And also, the long-running (8s) request tying up a “main” phoenix request process shouldn’t be an issue, right?
20 rpm is nothing at all from this perspective. Ultimately “it depends” on what the server is doing, but that shouldn’t be an issue.
Phoenix has a process per request, so it won’t tie up any main process.
It could be tied up by your database, depending on your pool size, queue targets / timeouts, and the length of your queries. But you should see corresponding slow queries if so. Do you have any customized Ecto config around timeouts or queue targets?
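For reference, these are the knobs in question; a minimal sketch with illustrative values only (the app/repo names and numbers are placeholders, not recommendations):

```elixir
# config/prod.exs -- illustrative values, not recommendations
config :my_app, MyApp.Repo,
  pool_size: 10,          # DB connections held per node
  queue_target: 50,       # ms a checkout may wait before the pool counts as under pressure
  queue_interval: 1_000,  # ms window over which queue_target is evaluated
  timeout: 15_000         # ms before an individual query is aborted
```

If checkouts were queueing past these limits, you’d normally see DBConnection errors in the logs rather than silent multi-second stalls.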
We’ve removed one custom plug that’s become unnecessary, and added logs for timing on the other.
No major offenders so far, but still seeing some unexpected latency.
Here’s the slowest example so far today after this update:
Jan 17 11:26:32 app/web.1 11:26:32.093 [info] request_id=xyz POST /foo
Jan 17 11:26:32 app/web.1 11:26:32.464 [info] request_id=xyz Starting Custom Authentication callback
Jan 17 11:26:32 app/web.1 11:26:32.479 [info] request_id=xyz Assigning Custom Authentication results to conn
Jan 17 11:26:32 app/web.1 11:26:32.483 [info] request_id=xyz Processing with FooController.create/2
Notice the 371ms delay between the POST log and the custom auth callback log, again during very low traffic for this service. This obviously isn’t as bad as earlier examples (>30s), but it suggests the issue is still there and may recur.
Is there a simple way to log or view the other non-custom plug calls to see where the time is being spent?
Adding logs before/after plug calls in Endpoint doesn’t show them per request. We’d rather avoid forking the libraries & adding logs that way.
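One option we’re considering is a tiny checkpoint plug dropped between the Endpoint’s existing plugs; a minimal sketch, assuming the module name and labels are ours:

```elixir
defmodule MyAppWeb.Checkpoint do
  @moduledoc "Logs time elapsed since the first checkpoint in the plug pipeline."
  require Logger
  import Plug.Conn

  def init(label), do: label

  def call(conn, label) do
    now = System.monotonic_time(:millisecond)
    # First checkpoint records the start time; later ones log the delta.
    start = conn.private[:checkpoint_start] || now
    Logger.info("checkpoint #{label}: +#{now - start}ms")
    put_private(conn, :checkpoint_start, start)
  end
end
```

It would be wired into endpoint.ex as, e.g., `plug MyAppWeb.Checkpoint, :before_parsers` and `plug MyAppWeb.Checkpoint, :after_parsers` around the suspect plugs.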
Instrumenting the plugs with the expansion was very helpful, thanks again.
We can now see that the very high latency appears to occur in Plug.Parsers. We saw a few parses that were extremely slow (8s, 30s) vs. the usual ~200µs. Plugs before and after it are very fast, as are other POST requests of the same type before and after.
I added more detail on another post where someone else may be having the same problem:
Attempts to reproduce haven’t been successful yet. Any additional thoughts?
Have you tried to reproduce the problem by directly using the Plug.Parsers with the input that caused the slow parsing?
If you have some input that causes serious slow downs then maybe it’s something that can be improved.
@tcoopman This is unfortunately not straightforward to gather since the logger truncates the base64 image strings. We did base64 decode the image that we eventually upload along with the rest of the payload, and response times were normal (fast). Still trying to gather the initial base64 string from the source in case it’s changed during decode/upload.
@wanton7 @tcoopman Thanks - we may have more questions on doing that, though this is starting to feel more & more like it’s not about the payload.
One new data point - a few minutes after the 30s parse, we saw an 8s parse. These both came from the same client. Many more of the same request type from other clients before/during/after, and those were all fast. Can’t quite call it data yet, but the slow client theory is looking pretty strong.
(Note that the client is a server, not a browser, so we get more consistency).
So the questions there are:
Is there anything that refutes this theory?
How can we measure it?
How can we handle it? We haven’t customized the read_length / read_timeout values, but maybe we need to. Basically we need to either complete the parse or abort it before the request takes a total of ~20s.
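For context, those knobs live on Plug.Parsers in the Endpoint; a hedged sketch with illustrative values (the parser list mirrors the Phoenix defaults; the numbers are not recommendations):

```elixir
# lib/my_app_web/endpoint.ex -- illustrative values, not recommendations
plug Plug.Parsers,
  parsers: [:urlencoded, :multipart, :json],
  json_decoder: Jason,
  length: 20_000_000,     # max total body size in bytes
  read_length: 1_000_000, # bytes requested per socket read
  read_timeout: 5_000     # ms allowed per socket read (default 15_000)
```

Note that read_timeout applies to each individual read, not the whole body, so a client that trickles bytes just fast enough can still accumulate a long total parse time across many reads.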