Phoenix API latency > 30s

You can try to reproduce it by faking a very slow client.
An extremely slow client usually means some networking problem, like a large number of dropped packets. It may have something to do with your hosting company.
On the other hand, Elixir processes have a smaller footprint compared to other stacks, so it is probably fortunate that you are using Elixir; it could have been worse.

3 Likes

I am indeed able to reproduce it by simulating a very slow client using curl with Transfer-Encoding: chunked and --limit-rate 1000 to simulate a ~1 KB/sec transfer rate.

It hangs in Plug.Parsers for ~30s, exactly as expected, likely because it can't pull the remaining chunks from the slow client quickly enough. This seems very likely to be the issue; happy to see it reproduced.

I do think this is due to the client being slow rather than our own host, as other requests continued to arrive and process normally at the same time. This will likely be an ongoing issue, so I'm now looking for solutions.

Next questions:

  • How can we look at the raw request headers to see whether these requests are coming in chunked? Logging req_headers from the conn in a plug (rough sketch after this list) does not show Transfer-Encoding: chunked, even when I send it myself.
  • Is there some way to async this so it’s out of the request flow?
  • If not, should we be looking at tuning read_length / read_timeout so we can gracefully kill or retry the request? Or some other method of doing this?
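For context, the header logging I mention above is roughly the following, placed ahead of Plug.Parsers in our endpoint. The module name and placement are just illustrative:

```elixir
defmodule MyAppWeb.Plugs.LogRawHeaders do
  @moduledoc "Illustrative plug that dumps whatever headers Plug received."
  require Logger

  def init(opts), do: opts

  def call(conn, _opts) do
    # conn.req_headers is a list of {name, value} tuples as seen by Plug;
    # in our tests it does not include Transfer-Encoding: chunked.
    Logger.debug("req_headers: #{inspect(conn.req_headers)}")
    conn
  end
end
```

In endpoint.ex this sits above the plug Plug.Parsers line, so it runs before the body is read.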
3 Likes

Some people are less privileged and have to live with a shitty network. If they haven't given up on your service, please don't give up on them.

7 Likes

Maybe you could handle it at the proxy level, but note that the client can be slow even when it is not chunking. For example, even if it is a 38k JSON payload that is not chunked, the client can still just write slowly to the socket.

You could make it async, but I am not sure it will help: the part of the request that doesn't need the body would likely execute very fast and then you would be back to waiting again. Note that the fact this request is slow does not affect any other request at all.

Any retry mechanism belongs in the client. The read_length and read_timeout options are supposed to kick in if data is not sent fast enough. However, be careful with making those values more permissive. There is a denial-of-service attack called Slowloris which is exactly about opening many connections to the server and then writing to the socket slowly, just slowly enough not to be disconnected, keeping everything busy. Being more permissive can make those attacks easier to pull off.
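For reference, those options are forwarded by Plug.Parsers to Plug.Conn.read_body/2, so a tuning sketch in endpoint.ex looks roughly like this (the numbers are placeholders, not recommendations):

```elixir
plug Plug.Parsers,
  parsers: [:urlencoded, :multipart, :json],
  pass: ["*/*"],
  json_decoder: Jason,
  # The options below are forwarded to Plug.Conn.read_body/2:
  length: 8_000_000,      # maximum total body size in bytes
  read_length: 1_000_000, # bytes to pull from the socket per read
  read_timeout: 15_000    # ms allowed for each individual socket read
```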

4 Likes

For anyone wanting a little more detail on it:

2 Likes

Something else to be aware of: Standard-1X Heroku dynos run on a shared platform. Other workloads on that platform could be impacting your service.

You could try redeploying onto a Performance-M (dedicated) dyno for a short period (2 days?) to see whether you're still experiencing the issues. If you are, it's code-related and you can investigate through profiling. If everything runs smoothly, then it was that particular host. Down-scaling back to your previous setup may deploy onto different servers with different workloads, and you could see everything start to run smoothly again.

1 Like

Something else I’ve found useful when running Phoenix on Heroku: setting ERL_FULLSWEEP_AFTER to 0 in the Config Vars helps keep the memory usage of apps within Heroku’s limits, at the cost of more GC. I’ve played around with setting it to other low values, but nothing works as well as setting it to 0. This may or may not have an impact on your response times.
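If you want to sanity-check that the value actually took effect, it can be read back from a remote console on the dyno:

```elixir
# e.g. in a remote IEx session; reads the emulator-wide fullsweep setting
:erlang.system_info(:fullsweep_after)
#=> {:fullsweep_after, 0}
```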

1 Like

But why use a hosting provider that is not suitable for serious production workloads (dynos also restart every 24 hours)? Wouldn't it be better to switch to a better hosting provider?

6 posts were split to a new topic: Split from “Phoenix API latency > 30s” thread

Basically what @dimitarvp said.

Heroku provides a nice abstraction when the team is small, and the automation around the workflow is very useful and difficult/time-consuming to replicate in other scenarios. If there were a simple DigitalOcean droplet image that would auto-deploy from a git push and some config files, I'd switch in a heartbeat. (Dokku comes close but still requires management time.)

Also, you can get pretty far with Heroku's free and hobby tiers. Saving time/money early in a project's life can be a big win. Having to move away from Heroku because its lower tiers can't service your project any longer is a "Nice Problem To Have".

Although, after a quick look at Render, it does seem like a good alternative for the price.

1 Like

Thanks again @josevalim and others.

Maybe you could handle it at the proxy level, but note that the client can be slow even when it is not chunking. For example, even if it is a 38k JSON payload that is not chunked, the client can still just write slowly to the socket.

Fair point. In our tests, removing the chunking header solved the issue 100% of the time, but it's also only a simulation. And we can't control how external sources send their requests anyway.

You could make it async, but I am not sure it will help: the part of the request that doesn't need the body would likely execute very fast and then you would be back to waiting again. Note that the fact this request is slow does not affect any other request at all.

Right, we would need to async the parse itself. Another option is to send some sort of ACK when the initial request is received; that would prevent the timeout detection, since it would show our server is responsive. Is there any mechanism for doing this?

I’m also considering simply ignoring the timeout errors. If our server is still processing other requests, and the router is still sending traffic to the server, everything should be fine. Sound sane?

The read_length and read_timeout options are supposed to kick in if data is not sent fast enough. However, be careful with making those values more permissive. … Being more permissive can make [Slowloris] attacks easier to pull off.

Good point. If this becomes necessary, I was thinking of making it less permissive rather than more, i.e. giving up on the request before 30s can accrue. It'd be tricky to target a total time mark, though; it seems like we'd just have to play with decreasing read_timeout until it's close (rough sketch below).
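Sketch of the "less permissive" direction, assuming the standard Plug.Parsers setup in endpoint.ex (the value is a placeholder, not a recommendation):

```elixir
plug Plug.Parsers,
  parsers: [:urlencoded, :multipart, :json],
  json_decoder: Jason,
  # read_timeout applies to each individual socket read, not the request as a
  # whole, so hitting an exact total-time target is awkward; it mostly bounds
  # how long a single stalled read can hang.
  read_timeout: 5_000
```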