Hi all, I’m encountering an issue with a production app I’ve inherited and I’m having trouble getting to the bottom of it.
The app is an API that communicates with a JS client over both HTTP and WS. It’s a speed test app: the usual user flow involves a few HTTP requests (log and result storing) and a few WS connections to three different servers running the same Phoenix app. The WS connections measure speed by exchanging a high number of small packets, sent to or from the server (depending on whether download or upload speed is being measured) as messages on a WS channel over a 10-second period.
Everything worked fine until we started looking at what happens with 10–15 concurrent users. At that point, all HTTP responses and WS connections take around 15 seconds to resolve. Even the simplest HTTP endpoints, which don’t hit the DB and are usually lightning fast, are delayed.
I’m trying to figure out if there’s a connection pool limit or something similar we’re hitting, or if there’s anything that might cause these requests to end up queued in a process’s message queue. Is the number of WS messages being transferred somehow interfering with HTTP request processing?
I’ve attached a console to the running deployed app using --remsh and tried to inspect the running processes during these peak times when everything slows down. Most of the processes with high message queue counts seem to be handling websockets, but I’m having trouble using Process.info to figure out why everything is responding so slowly.
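For reference, this is roughly the kind of query I’ve been running in the remote shell to find the busiest processes (top ten by message-queue length):

```elixir
# List the ten processes with the longest message queues, along with
# the function each one last executed and its registered name (if any).
top =
  Process.list()
  |> Enum.map(&{&1, Process.info(&1, [:message_queue_len, :current_function, :registered_name])})
  |> Enum.reject(fn {_pid, info} -> is_nil(info) end)
  |> Enum.sort_by(fn {_pid, info} -> -info[:message_queue_len] end)
  |> Enum.take(10)

IO.inspect(top)
```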
Has anyone encountered anything similar or has any suggestions on how to proceed with investigating this?
It’s hard to say what’s happening without some more information. How long are the requests delayed for after you get to 10-15 concurrent users? Do you have any single global processes? How high are your message queue numbers when the system is overloaded?
Hi @axelson, the delay is around 15–18 sec. There’s a single SGP that’s basically a logger, used only in the WS channels. Message queues are around 150 to 180 and grow very slowly over longer observation. I’ll take a look at the SGP and see if not using it makes a difference, thanks for mentioning that! How would that affect processes handling simple endpoint controller requests/responses, though?
A tool like observer_cli might help you see what the channel/WebSocket/SGP processes are waiting on. One of its columns shows the function a process last executed, which can help in this situation.
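(In case it helps: observer_cli is just a Hex dependency plus one call from the remote shell; the version below is illustrative.)

```elixir
# mix.exs — add the dependency, then `mix deps.get` and redeploy
defp deps do
  [
    {:observer_cli, "~> 1.7"}
  ]
end

# then, in the remote IEx session:
# iex> :observer_cli.start()
```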
If you’re not doing a GenServer.call into the SGP from the endpoint controllers, then there’s probably a different problem affecting them. What calls are made in some of the simple controllers that are slowing down?
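To illustrate what I mean: if every request did a synchronous call into a single process, the requests would all serialize behind it. A minimal sketch of that failure mode (module name and timings invented for the example):

```elixir
defmodule SlowLogger do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)
  def log(msg), do: GenServer.call(__MODULE__, {:log, msg})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:log, _msg}, _from, state) do
    Process.sleep(100)              # simulate slow I/O inside the single process
    {:reply, :ok, state}
  end
end

{:ok, _pid} = SlowLogger.start_link([])

# Ten "concurrent requests" all calling into the one process:
# they serialize, so the batch takes ~1000 ms instead of ~100 ms.
{micros, :ok} =
  :timer.tc(fn ->
    1..10
    |> Enum.map(fn i -> Task.async(fn -> SlowLogger.log("req #{i}") end) end)
    |> Enum.each(&Task.await/1)
  end)

IO.puts("10 concurrent calls took #{div(micros, 1000)} ms")
```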
@sb8244 The app is currently deployed in a container using Dokku, I’ll have to check if I can use that tool.
@axelson I’ve disabled the SGP that might’ve been causing the issue, and the waiting times are now much smaller, but still present: no longer 10–15 sec but around 5 sec. When listing processes I see Logger has a slowly growing message queue, and the queued messages look like standard Phoenix info logs. Is it possible for Logger to be a bottleneck?
The controllers handling HTTP are really simple; the most complex one is a classic POST that stores model data in the DB. No GenServer calls are made from any of the controllers. The disabled SGP was used exclusively in the channel modules.
Glad to hear that you’re making progress! What versions of Elixir and Erlang are you using? Several Elixir releases have shipped Logger improvements that might help in your situation.
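In the meantime, it may be worth checking Logger’s overload settings. When its message queue crosses `sync_threshold`, Logger switches to synchronous mode and every log call blocks the calling process, which would affect your controllers and channels alike. A sketch of the relevant config (values shown are roughly the defaults of that era):

```elixir
# config/prod.exs
config :logger,
  # above this queue length, Logger calls become synchronous
  # (every caller blocks until its message is handled)
  sync_threshold: 20,
  # above this queue length, messages are dropped outright
  discard_threshold: 500,
  # compile out calls below :info (i.e. debug) at build time
  compile_time_purge_level: :info
```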
Looks like the buildpack was configured to use Erlang 19.1 and Elixir 1.5.2. I’ll look into upgrading those as well; that’s a good point, thanks!