Phoenix API latency > 30s

Thanks for the feedback, guys.

DB connection pool_size is set to 10 per dyno/server.
The DB in question ends up at 48 connections most of the time, occasionally jumping up to 68. The DB can handle up to 120.

We haven’t customized the timeout or queue targets.
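For reference, a sketch of roughly where those knobs live in the repo config (assuming Ecto 3; MyApp.Repo is a placeholder, and the values shown are the documented defaults):

config :my_app, MyApp.Repo,
  pool_size: 10,          # connections per dyno/server
  timeout: 15_000,        # per-query timeout in ms
  queue_target: 50,       # target wait (ms) for a connection checkout
  queue_interval: 1_000   # window (ms) over which queue_target is evaluated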

The mean time of our slowest DB query over a 7-day period is 5ms. Some requests run up to 30 queries each, but that doesn’t seem too heavy.

Anything sound unreasonable about these numbers?

@dimitarvp when you say check connection pool size, is that what you’re referring to?

1 Like

I meant exactly what you elaborated on.

I don’t know then; it seems installing telemetry and adding a bit more granular reporting is in order.
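As a rough sketch of what that could look like for per-query timings (assuming Ecto 3.1+/telemetry 0.4+; the app name :my_app, the repo event prefix, and the 100ms threshold are placeholders):

defmodule MyApp.QueryTimeLogger do
  require Logger

  def attach do
    :telemetry.attach(
      "log-slow-queries",
      [:my_app, :repo, :query],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, measurements, metadata, _config) do
    total_ms = System.convert_time_unit(measurements.total_time, :native, :millisecond)

    # Only report queries that look suspiciously slow.
    if total_ms > 100 do
      Logger.warn("slow query (#{total_ms}ms): #{inspect(metadata.query)}")
    end
  end
end

Call MyApp.QueryTimeLogger.attach() once at application start.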

1 Like

We’ve removed one custom plug that had become unnecessary, and added timing logs to the other.
No major offenders so far, but we’re still seeing some unexpected latency.
Here’s the slowest example so far today after this update:

Jan 17 11:26:32 app/web.1 11:26:32.093 [info] request_id=xyz POST /foo
Jan 17 11:26:32 app/web.1 11:26:32.464 [info] request_id=xyz Starting Custom Authentication callback
Jan 17 11:26:32 app/web.1 11:26:32.479 [info] request_id=xyz Assigning Custom Authentication results to conn
Jan 17 11:26:32 app/web.1 11:26:32.483 [info] request_id=xyz Processing with FooController.create/2

Notice the 371ms delay between the POST log and the custom auth callback log. This is again during very low traffic for this service. It obviously isn’t as bad as other examples in the past (>30s), but it still feels like the underlying issue is there and could happen again in the future.

Is there a simple way to log or view the other non-custom plug calls to see where the time is being spent?
Adding logs before/after the plug calls in the Endpoint doesn’t show them per request. We’d rather avoid forking the libraries and adding logs that way.

plug Foo, opts

Is equivalent to:

@opts Foo.init(opts)
defp foo(conn, _), do: Foo.call(conn, @opts)
plug :foo

So you can replace any module plug with the code above and instrument it accordingly.
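For example, something along these lines (a sketch; Foo stands in for whichever module plug you want to time, and its options are whatever you were already passing to it):

require Logger

@foo_opts Foo.init([])  # the same opts you previously passed to plug Foo, opts

defp timed_foo(conn, _) do
  start = System.monotonic_time()
  conn = Foo.call(conn, @foo_opts)
  elapsed_ms = System.convert_time_unit(System.monotonic_time() - start, :native, :millisecond)
  Logger.info("#{inspect(Foo)} took #{elapsed_ms}ms")
  conn
end

plug :timed_foo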

7 Likes

Thanks José! That will do the trick.

Instrumenting the plugs with the expansion was very helpful, thanks again.
We can now see that the very high latency appears to be occurring in Plug.Parsers. We saw a few calls that were extremely slow (8s, 30s) vs the usual ~200µs. The plugs before and after it are very fast, as are other POST requests of the same type before and after.

I added more detail on another post where someone else may be having the same problem:

Attempts to reproduce haven’t been successful yet. Any additional thoughts?

Have you tried to reproduce the problem by calling Plug.Parsers directly with the input that caused the slow parsing?
If you have some input that causes serious slowdowns, then maybe it’s something that can be improved.
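For example (a sketch; Jason as the json_decoder and the dumped-payload filename are assumptions, so use whatever your endpoint is actually configured with):

payload = File.read!("slow_payload.json")  # hypothetical dump of the captured request body

opts = Plug.Parsers.init(parsers: [:json], pass: ["*/*"], json_decoder: Jason)

conn =
  Plug.Test.conn(:post, "/foo", payload)
  |> Plug.Conn.put_req_header("content-type", "application/json")

{micros, parsed} = :timer.tc(fn -> Plug.Parsers.call(conn, opts) end)
IO.puts("parsing took #{div(micros, 1000)}ms")
parsed.body_params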

2 Likes

Also consider a large payload or a slow client internet connection. How large are the payloads (megabytes?), and is somebody uploading from a slow connection?

3 Likes

@tcoopman This is unfortunately not straightforward to gather, since the logger truncates the base64 image strings. We did base64-decode the image that we eventually upload, along with the rest of the payload, and response times were normal (fast). Still trying to get the original base64 string from the source in case it’s changed during decode/upload.

@outlog The payload is moderately large (38k), but it’s worth noting we get much larger payloads that the parser rips through.

Slow client internet connection: yes, this is entirely possible. It’s not a multipart form/file upload, just a single JSON payload. Could this still be a factor?

Could you time the parsers and, if parsing is slow, store the actual JSON somewhere?

Like @wanton7 is saying, just write the full request down for the slow requests.
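One way to do that without forking anything is Plug.Parsers’ :body_reader option (a sketch; the module name, the 1s threshold, the /tmp path, and Jason are placeholders/assumptions):

defmodule MyApp.TimedBodyReader do
  require Logger

  # Wraps Plug.Conn.read_body/2: times each read and dumps whatever was read when it is slow.
  def read_body(conn, opts) do
    start = System.monotonic_time()
    result = Plug.Conn.read_body(conn, opts)
    elapsed_ms = System.convert_time_unit(System.monotonic_time() - start, :native, :millisecond)

    with true <- elapsed_ms > 1_000,
         {_, body, _} <- result do
      Logger.warn("slow body read: #{elapsed_ms}ms for #{byte_size(body)} bytes")
      File.write("/tmp/slow_body_#{System.os_time(:millisecond)}.json", body)
    end

    result
  end
end

# In the endpoint, in place of the existing Plug.Parsers line:
plug Plug.Parsers,
  parsers: [:json],
  json_decoder: Jason,
  body_reader: {MyApp.TimedBodyReader, :read_body, []}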

@wanton7 @tcoopman Thanks - we may have more questions on doing that, though this is starting to feel more and more like it’s not about the payload.

One new data point: a few minutes after the 30s parse, we saw an 8s parse, and both came from the same client. Many more requests of the same type came in from other clients before/during/after, and those were all fast. Can’t call it conclusive yet, but the slow-client theory is looking pretty strong.
(Note that the client is a server, not a browser, so we get more consistency.)

So the questions there are:

  • Is there anything that refutes this theory?
  • How can we measure it?
  • How can we handle it? We haven’t customized the read_length / read_timeout values, but maybe we need to. Basically, we need to either complete the parse or cut the request off before it takes a total of ~20s.

You can try to reproduce by faking a very slow client.
An extremely slow client usually means some networking problem, like a large number of dropped packets. It may have something to do with your hosting company.
On the other hand, Elixir processes have a smaller footprint compared to other stacks, so it is probably fortunate that you are using Elixir; it could have been worse.
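For instance, something like this can stand in for a slow client (a rough sketch; it assumes the app listens on localhost:4000 with a POST /foo route, and the body contents are made up):

body = ~s({"image": "#{String.duplicate("A", 38_000)}"})

request =
  "POST /foo HTTP/1.1\r\n" <>
    "Host: localhost\r\n" <>
    "Content-Type: application/json\r\n" <>
    "Content-Length: #{byte_size(body)}\r\n\r\n"

{:ok, socket} = :gen_tcp.connect(~c"localhost", 4000, [:binary, active: false])
:ok = :gen_tcp.send(socket, request)

# Send half the body, stall to mimic a very slow link, then finish the request.
half = div(byte_size(body), 2)
:ok = :gen_tcp.send(socket, binary_part(body, 0, half))
Process.sleep(25_000)
:ok = :gen_tcp.send(socket, binary_part(body, half, byte_size(body) - half))

{:ok, response} = :gen_tcp.recv(socket, 0, 60_000)
IO.puts(response)
:gen_tcp.close(socket)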

3 Likes

I am indeed able to reproduce it with curl by simulating a very slow client: sending the request with Transfer-Encoding: chunked and --limit-rate 1000 to limit the transfer rate to ~1 KB/s.

It hangs in Plug.Parsers for ~30s, exactly as expected, likely because it can’t pull the remaining chunks from the slow client quickly enough. It seems pretty likely this is the issue; happy to see it reproduced.

I do think this is likely due to the client being slow rather than to our own host, as other requests continued to arrive and process normally at the same time. This will likely be an ongoing issue, so we’re now looking for solutions.

Next questions:

  • How can we look at the raw request headers to see whether these requests are coming in chunked? Logging req_headers from the conn in a plug doesn’t show Transfer-Encoding: chunked, even when I send it myself.
  • Is there some way to async this so it’s out of the request flow?
  • If not, should we be looking at tuning read_length / read_timeout so we can gracefully kill or retry the request? Or is there some other way to do this?

3 Likes

Some people are less privileged and have to live with a shitty network. If they haven’t given up on your service, please don’t give up on them.

6 Likes

Maybe you could check that at the proxy level, but note that the client can be slow even without chunking. For example, even if it is a 38k JSON payload that is not chunked, the client can still just write slowly to the socket.

You could make it async, but I am not sure it would help: the part of the request that doesn’t need the body would execute very quickly and then you would be back to waiting again. Note that the fact that this request is slow does not affect any other request at all.

Any retry mechanism belongs in the client. The read_length and read_timeout limits are supposed to kick in if data is not sent fast enough. However, be careful about making those values more permissive. There is a denial-of-service attack called Slowloris which is exactly this: opening many connections to the server and then writing to the socket slowly, slowly enough not to be disconnected, keeping everything busy. Being more permissive makes those attacks easier to pull off.
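If you do end up tuning them, this is roughly where they go (a sketch; the option names come from Plug.Parsers / Plug.Conn.read_body/2, and the values below are just the documented defaults, not a recommendation):

plug Plug.Parsers,
  parsers: [:json],
  json_decoder: Jason,       # or whichever decoder you already use
  length: 8_000_000,         # maximum total body size in bytes
  read_length: 1_000_000,    # bytes to ask the socket for per read
  read_timeout: 15_000       # ms to wait for each read before giving up

Tightening these cuts very slow clients off sooner; loosening them keeps such clients alive longer, with the Slowloris trade-off described above.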

3 Likes

For anyone wanting a little more detail on it:

2 Likes

Something else to be aware of: Standard-1X Heroku dynos run on shared hosts, so other workloads on the same host could be impacting your service.

You could try redeploying onto a Performance-M (dedicated) dyno for a short period (2 days?) to see whether you’re still experiencing the issues. If you are, it’s code-related and you can investigate through profiling. If everything runs smoothly, then it was that particular host. Scaling back down to your previous setup may land you on different servers with different workloads, and you could see everything run smoothly again.

1 Like

Something else I’ve found useful when running Phoenix on Heroku: setting ERL_FULLSWEEP_AFTER to 0 in the Config Vars helps keep the app’s memory usage within Heroku’s limits, at the cost of more GC. I’ve played around with other low values, but nothing works as well as 0. This may or may not have an impact on your response times.

1 Like