Phoenix API latency > 30s

Our Phoenix production API is experiencing intermittent, very high latency on some requests (1s+, sometimes over 30s!).
This seems to happen a few times per day with varying latency amounts.
It’s not correlated with extra traffic, and the traffic on this API is low in general (20 requests/min).
In a non-prod environment with similar infrastructure, response times are ~3x faster with much more volume, and we haven’t seen this latency issue.

The latency seems to occur between when the router receives the request, and when a controller receives it.
Once it hits the controller, the request runs in a reasonable amount of time, averaging ~200ms.

Here’s an example of a 27s latency:
09:39:50 app/web.1 09:39:49.837 [info] POST /pos/v1/foo
09:40:17 app/web.1 09:40:16.992 [info] Processing with FooController
09:40:17 app/web.1 09:40:17.318 [info] Foo created successfully
09:40:17 app/web.1 09:40:17.318 [info] Sent 201 in 27481ms

Using APM, we can see that only ~600ms was spent in the controller flow, which is an acceptable amount of time for this request.

Some additional details:

  • Hosted on Heroku w/ 2 standard 1x dynos
  • The issue occurs after the request has been sent to our application
  • We do a few async things, like uploading to S3.
  • We have one known slow request that consistently waits ~8s on an external network call. We do this synchronously since the long response time is not a problem for this call. This happens ~1x/hour.
  • Memory stays consistent at all times

Some possibly relevant dependencies / endpoint specifics:

  • Sentry
  • Plug.Static, Plug.Session
  • CORSplug
  • phoenix_swagger

Some possibly relevant config details:

  • MIX_ENV=prod
  • logger level: info
  • url: [scheme: "https", port: 443, host: "https://ourdomain.com"], force_ssl: [rewrite_on: [:x_forwarded_proto]]
  • 3 Repo connections w/ POOL_SIZE=10

We’re overdue for some upgrades, but I feel like there’s another issue at hand (misconfiguration, etc.).
I believe our two Phoenix servers should handle this traffic with ease.

We haven’t been able to reproduce it outside of production yet. Any ideas or additional details needed?

Is it possible to log the entry/exit times of the plugs in your pipeline? Especially any custom plugs. It sounds like one of those is experiencing some sort of issue.
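
For example, a minimal sketch of wrapping a custom plug with entry/exit logging (MyAppWeb.TimedAuth and MyApp.Auth are hypothetical placeholders for your own plug):

  defmodule MyAppWeb.TimedAuth do
    # Wraps a hypothetical MyApp.Auth plug and logs how long it takes per request
    require Logger

    def init(opts), do: MyApp.Auth.init(opts)

    def call(conn, opts) do
      start = System.monotonic_time(:millisecond)
      Logger.info("entering auth plug")
      conn = MyApp.Auth.call(conn, opts)
      Logger.info("leaving auth plug after #{System.monotonic_time(:millisecond) - start}ms")
      conn
    end
  end

Swapping the wrapper in for the original plug in the pipeline keeps the behaviour identical while exposing per-request timing.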

Are you using cookie based sessions or some persistent store (redis, memcached, etc)?

Thanks for the quick reply!
Our next test is indeed to log times before & after plug calls, hopefully we can isolate the issue.

As for cookie based sessions, yes we are per the config below. They may have become unnecessary for us though. Any reason you call that out specifically?

  plug(
    Plug.Session,
    store: :cookie,
    key: "_our_key",
    signing_salt: "xyzxyzxy"
  )

Mainly wondering because if you were using an external session store, the session plug would be making an external call that could experience latency. Cookie-based sessions wouldn’t be an issue.

Something like observer_cli while manually inspecting the box could reveal the source of slowness, especially if you can recreate it either locally or in production.
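
For instance, a sketch assuming observer_cli is added as a dependency (the version constraint is only an example):

  # mix.exs deps
  {:observer_cli, "~> 1.7"}

  # then, from an IEx/remote console session attached to the running node:
  :observer_cli.start()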

Makes sense. We do run some custom auth (internal, not external) on most routes which we’ll be inspecting.
No persistent stores a la redis or memcached, just a couple Postgres dbs.

As a sanity check, it’s reasonable to expect to handle that kind of traffic (and much more) on 2 servers, correct? i.e. We’re not exceeding a phoenix connection request pool or something like that & thus queueing? I have similar questions about exceeding DB connection pool, but it really doesn’t seem like that’s happening.

And also, the long-running (8s) request tying up a “main” phoenix request process shouldn’t be an issue, right?

20 rpm is nothing at all from this perspective. Ultimately “it depends” on what the server is doing, but that shouldn’t be an issue.

Phoenix has a process per request, so it won’t tie up any main process.

It could be tied up by your database depending on your pool size, queue targets / timeouts, and the length of queries. But you should see other slow queries that correspond. Do you have any customized Ecto config around timeouts or queue targets?
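
For reference, these are the knobs in question; a sketch where MyApp.Repo is a placeholder and the values shown are the library defaults:

  config :my_app, MyApp.Repo,
    pool_size: 10,
    timeout: 15_000,       # per-query timeout in ms
    queue_target: 50,      # target checkout wait in ms before the pool is considered overloaded
    queue_interval: 1_000  # window in ms over which queue_target is evaluated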

The BEAM VM (and thus Elixir) has one green thread (a “process” in Erlang lingo) per request so you can easily have 200k+ of them at one time as long as the OS kernel can handle it.

I’d check your connection pool size and whether you are overloading it. But it could be a plethora of other things in your code.

All in all, Phoenix itself is extremely light.

Thanks for the feedback guys.

DB connection pool_size is set to 10 per dyno/server.
The DB in question ends up at 48 connections most of the time, occasionally jumping up to 68. The DB can handle up to 120.

We haven’t customized the timeout or queue targets.

The mean time on our slowest DB query over a 7-day period is 5ms. Some requests run up to 30 queries, but that doesn’t seem too heavy.

Anything sound unreasonable about these numbers?

@dimitarvp when you say check connection pool size, is that what you’re referring to?

I meant exactly what you elaborated on.

I don’t know then, it seems installing telemetry and adding a bit more granular reporting is in order.

We’ve removed one custom plug that’s become unnecessary, and added logs for timing on the other.
No major offenders so far, but still seeing some unexpected latency.
Here’s the slowest example so far today after this update:

Jan 17 11:26:32 app/web.1 11:26:32.093 [info] request_id=xyz POST /foo
Jan 17 11:26:32 app/web.1 11:26:32.464 [info] request_id=xyz Starting Custom Authentication callback
Jan 17 11:26:32 app/web.1 11:26:32.479 [info] request_id=xyz Assigning Custom Authentication results to conn
Jan 17 11:26:32 app/web.1 11:26:32.483 [info] request_id=xyz Processing with FooController.create/2

Notice the 371ms delay between the POST log and the custom auth callback log. This is again during very low traffic for this service. It obviously isn’t as bad as past examples (>30s), but it feels like the underlying issue is still there and could happen again.

Is there a simple way to log or view the other non-custom plug calls to see where the time is being spent?
Adding logs before/after plug calls in Endpoint doesn’t show them per request. We’d rather avoid forking the libraries & adding logs that way.

  plug Foo, opts

Is equivalent to:

  @opts Foo.init(opts)
  defp foo(conn, _), do: Foo.call(conn, @opts)
  plug :foo

So you can replace all module plugs by the code above and instrument it accordingly.
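
For example, applying that expansion to Plug.Parsers in the Endpoint could look like the sketch below (the parser options and Jason decoder are assumptions; mirror whatever your endpoint currently passes):

  # In the Endpoint module, replacing the original `plug Plug.Parsers, ...` line
  require Logger

  @parsers_opts Plug.Parsers.init(
                  parsers: [:urlencoded, :multipart, :json],
                  pass: ["*/*"],
                  json_decoder: Jason
                )

  defp timed_parsers(conn, _opts) do
    start = System.monotonic_time(:millisecond)
    conn = Plug.Parsers.call(conn, @parsers_opts)
    Logger.info("Plug.Parsers took #{System.monotonic_time(:millisecond) - start}ms")
    conn
  end

  plug :timed_parsers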

Thanks José! That will do the trick.

Instrumenting the plugs with the expansion was very helpful, thanks again.
We can now see that the very high latency appears to be occurring in Plug.Parsers. We saw a few parses that were extremely slow (8s, 30s) vs. the usual ~200µs. The plugs before and after it are very fast, as are other POST requests of the same type before and after.

I added more detail on another post where someone else may be having the same problem.

Attempts to reproduce haven’t been successful yet. Any additional thoughts?

Have you tried to reproduce the problem by running Plug.Parsers directly with the input that caused the slow parsing?
If you have some input that causes serious slowdowns, then maybe it’s something that can be improved.
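
For example, a sketch using Plug.Test to push a captured payload straight through Plug.Parsers and time it (the file path, parser options, and Jason decoder are assumptions):

  import Plug.Test
  import Plug.Conn

  opts = Plug.Parsers.init(parsers: [:json], pass: ["*/*"], json_decoder: Jason)
  body = File.read!("slow_payload.json")  # hypothetical captured payload

  {time_us, parsed_conn} =
    :timer.tc(fn ->
      conn(:post, "/pos/v1/foo", body)
      |> put_req_header("content-type", "application/json")
      |> Plug.Parsers.call(opts)
    end)

  IO.puts("parsed #{map_size(parsed_conn.body_params)} top-level keys in #{div(time_us, 1000)}ms")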

Also consider a large payload or a slow client internet connection - how large are the payloads (megabytes?), and is somebody uploading from a slow connection?

@tcoopman This is unfortunately not straightforward to gather since the logger truncates the base64 image strings. We did base64 decode the image that we eventually upload along with the rest of the payload, and response times were normal (fast). Still trying to gather the initial base64 string from the source in case it’s changed during decode/upload.

  @outlog The payload is moderately large (~38 KB), but it’s worth noting we get much larger payloads that the parser rips through.

Slow client internet connection - yes, this is entirely possible. It’s not a multipart form/file upload, just a single JSON payload. Could this still be a factor?

Could you time the parsers, and if parsing is slow, store the actual JSON somewhere?

Like @wanton7 is saying, just write the full request down on slow requests
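
Building on the timed_parsers wrapper sketched earlier, one option (the 1s threshold is arbitrary; note the raw body has already been consumed at this point, so capturing the exact bytes would need Plug.Parsers' :body_reader option):

  defp timed_parsers(conn, _opts) do
    start = System.monotonic_time(:millisecond)
    conn = Plug.Parsers.call(conn, @parsers_opts)
    elapsed = System.monotonic_time(:millisecond) - start

    if elapsed > 1_000 do
      # Logger's :truncate option may need raising to keep the whole payload
      Logger.warn("slow parse (#{elapsed}ms): #{inspect(conn.body_params, limit: :infinity)}")
    end

    conn
  end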

@wanton7 @tcoopman Thanks - we may have more questions on doing that, though this is starting to feel more & more that it’s not about the payload.

One new data point - a few minutes after the 30s parse, we saw an 8s parse. Both came from the same client. Many more requests of the same type came from other clients before/during/after, and those were all fast. Can’t quite call it conclusive yet, but the slow-client theory is looking pretty strong.
(Note that the client is a server, not a browser, so we get more consistency).

So the questions there are:

  • Is there anything that refutes this theory?
  • How can we measure it?
  • How can we handle it? We haven’t customized the read_length / read_timeout values, but maybe we need to (see the sketch below). Basically we need to either complete the parse or time it out before the request takes a total of ~20s.
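
For context, those options are forwarded by Plug.Parsers to Plug.Conn.read_body/2. A sketch of where they would go in the Endpoint (parser options and decoder are assumptions; the values shown are the library defaults, not recommendations):

  plug Plug.Parsers,
    parsers: [:urlencoded, :multipart, :json],
    pass: ["*/*"],
    json_decoder: Jason,
    read_length: 1_000_000,  # bytes pulled from the socket per read
    read_timeout: 15_000     # ms allowed for each socket read from the client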