Hello all,
I am currently experiencing some strange behavior on my production environment.
I run a Phoenix LiveView app.
The server runs FreeBSD with HAProxy in front.
The behavior
Navigating the LiveView app via navigation items that trigger a link navigate is fine most of the time. But roughly every 22 seconds the socket response becomes extremely slow: even with nearly zero change in the socket, the response takes about 4 to 6 seconds.
You can give it a try here: https://portal.devpunx.com (of course you have to keep clicking the navigation for at least 22 seconds; there is no need to click different nav items, just keep clicking “home” or something).
I do not see this locally, so it is probably an effect of the environment.
What I tried to debug:
I checked and changed timeouts on the server and the OS, but they do not seem to cause the issue.
I also checked whether sockets break and reconnect, but they stay stable.
So it is basically slowed down for some reason.
What puzzles me is that it occurs roughly every 22 seconds, regardless of whether you click a lot or only a little.
Does anyone have an idea how to approach and debug this?
I am using HAProxy; my timeout setting for connect is 5s, and client, server and tunnel are all at 50s.
That is long enough for the heartbeat to keep the LiveView from reloading, and I had no issues with these settings in the past.
Server Provider?
Another thing that changed is that I moved from DigitalOcean to Vultr. But there do not seem to be any security measures on Vultr's side that would trigger this behavior. At least I found nothing in the docs.
Socket Heartbeat?
I played around a bit with the heartbeat interval, but that has no effect on the behavior.
Server RAM/CPU?
I monitored the host and the BSD jail that runs the Phoenix application.
Nothing observable happened: CPU usage stayed in the 0.xx percent range while reproducing the issue, so there is no load on the CPU.
Memory does not change much either; it stayed constant while I triggered the behavior.
Anything special with sockets?
When reproducing the behavior and watching the socket messages, I can see that at some point the upstream messages (phx-leave and phx-join) are pushed to the server with no delay, but the result takes up to 4 seconds. This also occurs with an identical socket payload, that is, when triggering the same socket exchange over and over by always clicking the same navigation button. Most of the time it is as fast as expected, but as soon as about 20 to 22 seconds have passed, one single response is really slow. After that, everything is fast again. During that time no proxy timeout is triggered, and neither RAM nor CPU usage changes in any way.
Phoenix telemetry?
I do not yet have an idea how to get data from live_view in production via telemetry. Would it make sense to deploy the Phoenix dashboard and have a look?
Okay, and it is probably not even an app-level issue, as I can also reproduce this by just hitting reload over and over. It is always fast, except for one request roughly every 20 seconds that is extremely slow.
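The "keep hitting reload" test can also be scripted, so that the slow requests and the time between them are recorded instead of estimated by eye. The following is only a minimal Python sketch, not something used in this thread: the function names, the one-second pause and the "5 times the median" threshold are all assumptions chosen for illustration.

```python
import statistics
import time
import urllib.request


def time_request(url, timeout=15.0):
    """Return the wall-clock seconds one GET request takes, including reading the body."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.monotonic() - start


def probe(url, count=60, pause=1.0):
    """Collect (seconds_since_start, latency) pairs for `count` sequential requests."""
    samples = []
    began = time.monotonic()
    for _ in range(count):
        offset = time.monotonic() - began
        samples.append((offset, time_request(url)))
        time.sleep(pause)
    return samples


def slow_samples(samples, factor=5.0):
    """Return the samples whose latency exceeds `factor` times the median latency."""
    median = statistics.median(latency for _, latency in samples)
    return [(offset, latency) for offset, latency in samples if latency > factor * median]


def gaps_between(slow):
    """Return the seconds between consecutive slow samples, to expose a fixed period."""
    offsets = [offset for offset, _ in slow]
    return [later - earlier for earlier, later in zip(offsets, offsets[1:])]
```

Calling `gaps_between(slow_samples(probe("https://portal.devpunx.com")))` would then list the spacing between slow responses; values clustering around 20 to 22 seconds would match what is described above.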
What also probably is not the cause:
To make sure it is not caused by slow server performance, I upgraded to a big machine and re-deployed. The issue still persists. So it is not the machine, but very likely something on the network level.
I deactivated all firewalls (network and OS); the behavior still exists.
So what else could cause latency at a time-based interval when it is not the app, the OS, the machine or the proxy? Is there scheduled traffic inspection on some level?
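One way to separate the network from everything above it is to time bare TCP handshakes, which complete before any HTTP request or Phoenix code is involved. The sketch below is an illustration in Python under assumptions of its own: the function names are hypothetical, and port 443 is assumed because the site is served over HTTPS.

```python
import socket
import time


def connect_time(host, port=443, timeout=10.0):
    """Return the seconds needed to complete a TCP handshake, or None if it fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return None


def summarize(times):
    """Summarize a list of handshake times (None entries count as failures)."""
    ok = [t for t in times if t is not None]
    return {
        "attempts": len(times),
        "failed": len(times) - len(ok),
        "worst": max(ok) if ok else None,
    }
```

Running `summarize([connect_time("portal.devpunx.com") for _ in range(100)])` gives a rough picture. If some handshakes take a second or more while most finish in milliseconds, that would be consistent with lost packets being retransmitted, though it does not prove it.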
portal.devpunx.com does not respond to ping, and I cannot load the website through several popular website performance tools, such as pagespeed.web.dev. It feels like a misconfigured network. Have you complained to your hosting company?
Oh, that's interesting, because that worked well in the past. Thank you @derek-zhou for pointing it out!
Well, obviously there is something messed up with the hosting and network routing…
Now I opened a ticket with the hosting company and sent them mtr logs, which showed significant packet loss.
They immediately knew what to do and fixed it. So in the end it was no Phoenix, no Elixir, no Erlang, no BSD jails, no FreeBSD, no HAProxy, no firewall issue…
but something in the “network backend” near DE-CIX. I don't know any of the details.
But it shows that really small problems can take days to debug…
Heads up! Never give up debugging.
Thanks to everyone who supported me by giving their input to this thread. It helped me to keep debugging.