Phoenix LiveView - App responds slowly in production environment every ~20 seconds

marschro · December 10, 2023, 2:28pm

Hello all,
I am currently experiencing some strange behavior on my production environment.

I run a Phoenix LiveView app.
The server is FreeBSD with HAProxy in front.

The behavior

If you navigate the LiveView app via navigation items that trigger link navigate, this is fine most of the time. But roughly every 22 seconds the socket response is extremely slow: even with nearly zero change in the socket, the response takes about 4 to 6 seconds.
You can give it a try here: https://portal.devpunx.com (of course you have to click on the navigation for at least 22 seconds… no need to click different nav items… just keep clicking “home” or something.)
I do not see this locally, so it’s probably an environment effect.

What I tried to debug:

I checked and changed timeouts on the server and the OS, but they do not seem to cause the issue.
I also checked whether sockets break and reconnect, but they stay stable.
So it is basically slowed down for some reason.
What puzzles me is the fact that it occurs roughly every 22 seconds, regardless of whether you click a lot or only a little.

Does anyone have an idea how to approach and debug this?

Kind regards!

D4no0 · December 10, 2023, 2:52pm

I would suggest checking your proxy configuration first, as that is the culprit in most cases.

You can also check the LiveView handle_params event telemetry. I'm not sure it would yield useful information, but it is a place to start.

I would also check RAM usage, as an issue related to some specific OSes surfaced recently; I can’t remember what exactly.

marschro · December 10, 2023, 8:26pm

Thanks for your thoughts, @D4no0,

So yes, at first I thought it was something like a proxy timeout.

I am using HAProxy, and my timeout setting for connect is 5s; client, server and tunnel are all set to 50s.
So this is enough for the heartbeat to keep the LiveView from reloading, and I have not had an issue with these settings in the past.

Server Provider?

Another thing that changed is that I moved from DigitalOcean to Vultr. But there do not seem to be any security measures by Vultr that trigger this behavior. At least I found nothing in the docs.

Socket Heartbeat?

I played around a bit with the heartbeat interval, but that does not have any effect on the behavior.

Server RAM/CPU?

I monitored the host and the BSD jail that runs the Phoenix application.
Nothing observable happened: CPU usage stayed at 0.xx percent while reproducing the issue, so there is no load on the CPU.
Memory does not change much either; triggering the behavior kept memory constant.

Anything special with sockets?

When reproducing the behavior and watching the socket messages, I can see that the upstream socket messages (phx-leave and phx-join) are pushed to the server with no delay, but at some point the result takes up to 4 seconds. This also happens with an identical payload, i.e. when triggering the same socket exchange over and over by always clicking the same navigation button. Most of the time it is fast as expected, but as soon as about 20 to 22 seconds have passed, one single response is really slow. After that, everything is fast again. During that time, no proxy timeout is triggered, and neither RAM nor CPU usage changes in any way.

Phoenix telemetry?

I do not yet have an idea how I can get data from LiveView in production via telemetry. Might it make sense to deploy the Phoenix LiveDashboard and have a look?

To me this is really mysterious…

marschro · December 10, 2023, 8:48pm

Also what I figured out:

It only happens on live navigation.
I.e. when just updating or validating forms, or on any other event, all is fine…

marschro · December 10, 2023, 9:08pm

Okay, and it's probably not even an app-level issue, as I can also reproduce this by just hitting reload over and over… it's always fast, except for one request every 20 seconds that is extremely slow…

What also probably is not the cause:

To make sure it's not caused by slow server performance, I upgraded to a big machine and re-deployed… The issue still persists. So it's not the machine, but very likely something at the network level…
I deactivated all firewalls (network and OS), and the behavior still exists.

… so what else could cause latency at a time-based interval when it's not the app, the OS, the machine, or the proxy? Is there a scheduled traffic inspection at some level?
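
For anyone who wants to measure this instead of clicking reload by hand, here is a minimal sketch in Python. It is not part of the original investigation: the function names, the pause between requests, and the "5x the median" spike threshold are all assumptions chosen for illustration. It requests a page repeatedly, records the latency of each response, and reports how far apart the slow responses are.

```python
import statistics
import time
import urllib.request

URL = "https://portal.devpunx.com"  # host from the post; replace with your own


def find_spikes(samples, factor=5.0):
    """Return (timestamp, latency) pairs whose latency exceeds factor x the median.

    samples is a list of (timestamp_seconds, latency_seconds) tuples.
    The factor of 5 is an arbitrary assumption, not a tuned value.
    """
    if not samples:
        return []
    median = statistics.median(latency for _, latency in samples)
    return [(t, lat) for t, lat in samples if lat > factor * median]


def spike_intervals(spikes):
    """Return the gaps in seconds between consecutive spikes."""
    times = [t for t, _ in spikes]
    return [later - earlier for earlier, later in zip(times, times[1:])]


def probe(url, duration=120.0, pause=0.5):
    """Request url repeatedly for duration seconds and time each full response."""
    samples = []
    start = time.monotonic()
    while time.monotonic() - start < duration:
        t0 = time.monotonic()
        # Each call opens a fresh connection, so connection setup is timed too.
        with urllib.request.urlopen(url, timeout=30) as response:
            response.read()
        samples.append((t0 - start, time.monotonic() - t0))
        time.sleep(pause)
    return samples
```

Calling `spike_intervals(find_spikes(probe(URL)))` returns the gaps between slow responses. If those gaps cluster around 20 to 22 seconds no matter which page is requested, that would point away from the app and towards something between client and server.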

benwilson512 · December 11, 2023, 1:58am

Are you running a release or via mix? Did you accidentally leave the code reloader running maybe?

Do the server logs show a high latency too?

marschro · December 11, 2023, 11:26pm

Hi @benwilson512 and thanks for the reply.

I run a prod release. No mix on prod environment.
Server logs don’t show anything special.

I am scratching my head and have no idea.

So what I will do next is build a small app with no complexity and deploy it to prod. I want to know if it’s on the app layer or the network layer.

I also have the feeling that HAProxy might not handle the socket connections very well. I have to dig deeper here, too.

I will update if I figure something out. Meanwhile I am happy to get any ideas and input from all.

marschro · December 12, 2023, 11:53pm

Further investigation:

I deployed the whole app with fly.io.
I was not able to reproduce the issue there.
So it's a proxy and infrastructure problem and not an issue on the app level or app-framework level…
… will investigate further

derek-zhou · December 13, 2023, 2:41am

portal.devpunx.com does not respond to ping, and I cannot load the website through several popular website performance tools, such as pagespeed.web.dev. Feels like a misconfigured network. Have you complained to your hosting company?

marschro · December 13, 2023, 6:53pm

Oh, that's interesting, because that worked well in the past. Thank you @derek-zhou for pointing that out!
Well, obviously there is something messed up with the hosting and network routing…

marschro · December 15, 2023, 9:50pm

Okay, final update, as it's now fixed…

I was debugging the app for days:
checking for memory leaks
optimizing sockets
debugging and logging HAProxy
checking server performance and issues
searching for any possible side effect…

Now I opened a ticket with the hosting company and sent them logs from mtr, which showed significant packet loss.

They immediately knew what to do and fixed it. So in the end it was no Phoenix, no Elixir, no Erlang, no BSD jails, no FreeBSD, no HAProxy, no firewall issue…
but something in the “network backend” near DE-CIX… I don't know any details.
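
As a rough illustration of how such mtr logs can be checked for loss, here is a small Python sketch. It assumes a report produced with something like `mtr --report --report-cycles 100 --json <host>` and assumes the JSON layout of recent mtr versions (a `report.hubs` list with `count`, `host` and `Loss%` fields), so verify it against the output of your own mtr build. The function name, the threshold, and the sample report are made up for illustration and are not the actual logs from this case.

```python
import json

# Hypothetical sample report; hosts are documentation-only addresses.
SAMPLE_REPORT = """
{"report": {"hubs": [
  {"count": 1, "host": "192.0.2.1",    "Loss%": 0.0,  "Snt": 100},
  {"count": 2, "host": "198.51.100.7", "Loss%": 12.0, "Snt": 100},
  {"count": 3, "host": "203.0.113.9",  "Loss%": 10.0, "Snt": 100}
]}}
"""


def lossy_hops(report_json, threshold=1.0):
    """Return (hop, host, loss_percent) for hops at or above threshold percent loss."""
    hubs = json.loads(report_json)["report"]["hubs"]
    return [
        (int(hub["count"]), hub["host"], float(hub["Loss%"]))
        for hub in hubs
        if float(hub["Loss%"]) >= threshold
    ]


for hop, host, loss in lossy_hops(SAMPLE_REPORT):
    print(f"hop {hop}: {host} loses {loss:.1f}% of probes")
```

Loss that first appears at one hop and continues through to the final hop is the pattern worth sending to the hosting provider. Loss that shows up only at a single intermediate hop is often just that router rate-limiting its replies.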

But it shows that really small problems can take days of debugging…
Heads up! Never give up debugging.

Thanks to everyone who supported me by giving their input to this thread. It helped me to keep debugging.

D4no0 · December 15, 2023, 10:01pm

Each time I hear about these kinds of problems, it reminds me of this talk:

marschro · December 15, 2023, 11:04pm

Yeah, I know that talk and really love it. One of the best!