I am currently experiencing some strange behavior on my production environment.
- I run a Phoenix LiveView App
- Server is FreeBSD with HAProxy in front
- If you navigate the LiveView app via navigation items that trigger link navigate, this is fine most of the time. But every ~22 seconds the socket response is extremely slow: even with nearly zero change in the socket, a response takes about 4 to 6 seconds.
- You can give it a try here: https://portal.devpunx.com (of course you have to click on the navigation for at least 22 seconds… no need to click different nav items… just keep clicking “home” or something.)
- I do not have this locally, so it’s probably an environment effect.
What I tried to debug:
- I checked and changed timeouts on the server and the OS, but they do not seem to cause the issue
- I also checked whether sockets break and reconnect, but they stay stable
- So it basically is slowed down for some reason.
- What puzzles me is that it occurs roughly every 22 seconds, regardless of whether you click a lot or only a little.
Does anyone have an idea how to approach and debug this?
I would suggest checking your proxy configuration first, as that is the culprit in most cases.
You can also check the LiveView handle_params event telemetry. I am not sure it would yield useful information, but it is a place to start.
I would also check RAM usage, as an issue related to some specific OSes surfaced recently; I can’t remember what exactly.
Thanks for your thoughts, @D4no0,
So yes, at first I thought it was something like a proxy timeout.
- I am using HAProxy; my timeout setting for connect is 5s, and client, server and tunnel are all at 50s.
- So this is enough for the heartbeat to not reload the LiveView, and I did not have an issue with these settings in the past.
- Another thing that changed is that I moved from DigitalOcean to Vultr. But there do not seem to be any security measures by Vultr that trigger this behavior. At least I found nothing in the docs.
- I played around a bit with the heartbeat interval, but that does not have any effect on the behavior
- I monitored the host and the BSD jail that runs the Phoenix application
- Nothing observable happened; CPU usage stayed at 0.xx percent when reproducing the issue, so there is no load on the CPU.
- Memory does not change much either; triggering the behavior kept memory constant.
Anything special with sockets?
- When reproducing the behavior and watching the socket messages, I can see that the upstream messages (phx-leave and phx-join) are pushed to the server with no delay, but at some point the reply takes up to 4 seconds. This also happens with an identical payload, i.e. when triggering the same socket exchange over and over by always clicking the same navigation button. Most of the time it is as fast as expected, but once about 20 to 22 seconds have passed, one single response is really slow. After that, everything is fast again. During that time, no proxy timeout is triggered, and neither RAM nor CPU usage changes in any way.
- I do not yet have an idea how I can get data from live_view in production via telemetry. Might it make sense to deploy the Phoenix dashboard and have a look?
To me this is really mysterious…
Okay, and it’s probably not even an app-level issue, as I can also reproduce this by just hitting reload over and over… it’s always fast, except for one request roughly every 20 seconds that is extremely slow…
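The "hit reload over and over" experiment can be automated so the slow requests and the spacing between them are recorded rather than eyeballed. The following is only a minimal sketch using the Python standard library: the function names, the 0.5 s pause and the "5 times the median" threshold are arbitrary assumptions for illustration, not anything taken from the setup described here.

```python
import statistics
import time
import urllib.request


def time_request(url, timeout=15.0):
    """Fetch `url` once and return (wall-clock start time, seconds taken)."""
    started = time.time()
    t0 = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()  # read the body so the full response is timed
    return started, time.perf_counter() - t0


def probe(url, duration=120.0, pause=0.5):
    """Request `url` repeatedly for `duration` seconds and collect the samples."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append(time_request(url))
        time.sleep(pause)
    return samples


def slow_samples(samples, factor=5.0):
    """Keep samples whose latency exceeds `factor` times the median latency."""
    if not samples:
        return []
    median = statistics.median(latency for _, latency in samples)
    return [(ts, latency) for ts, latency in samples if latency > factor * median]


def gaps_between(slow):
    """Seconds between consecutive slow samples; a steady gap hints at a periodic cause."""
    times = [ts for ts, _ in slow]
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

Running `gaps_between(slow_samples(probe("https://portal.devpunx.com")))` from a machine outside the hosting network, and again from the server itself against the local port, would show whether the roughly 20-second rhythm is only visible across the network path.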
What also probably is not the cause:
- In order to make sure it’s not because of slow server performance, I upgraded to a bigger machine and re-deployed… The issue still persists. So it’s not because of the machine, but very likely something on the network level…
- I deactivated all firewalls (network and OS); the behavior still exists.
… so what else could cause latency at a time-based interval when it’s not the app, the OS, the machine or the proxy? Is there a scheduled traffic inspection on any level?
Are you running a release or via mix? Did you accidentally leave the code reloader running maybe?
Do the server logs show a high latency too?
Hi @benwilson512 and thanks for the reply.
I run a prod release. No mix on prod environment.
Server logs don’t show anything special.
I scratched my head and have no idea.
So what I will do next is build a small app with no complexity and deploy it to prod. I want to know if it’s on the app layer or the network layer.
I also have the feeling that HAProxy might not handle the socket connections very well. I have to dig deeper here, too.
I will post an update once I have figured something out. Meanwhile I am happy to get any ideas and input from all.
portal.devpunx.com does not respond to ping, and I cannot load the website through several popular website performance tools, such as pagespeed.web.dev. Feels like a misconfigured network. Have you complained to your hosting company?
Oh, that’s interesting, because that worked well in the past. Thank you @derek-zhou for pointing it out!
Well, obviously there is something messed up with the hosting and network routing…
Okay, final update as it’s now fixed…
- I was debugging the app for days
- checking for memory leaks
- optimizing sockets
- debugging and logging haproxy
- checked server performance and issues
- searched for any possible side effect…
Now I opened a ticket with the hosting company and sent them logs from mtr, which showed significant packet loss.
They immediately knew what to do and fixed it. So in the end it was no Phoenix, no Elixir, no Erlang, no BSD jails, no FreeBSD, no HAProxy and no firewall issue…
but something in the “network backend” near DE-CIX… I don’t know any details.
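For anyone who wants to collect the same kind of evidence before opening a ticket, the sketch below runs mtr in report mode and lists the hops that report packet loss. It assumes mtr is installed (it may need root on some systems) and that the report lines look like `  2.|-- host  12.0%  50 ...`, which is what common mtr versions print; the function names and the 1% threshold are made up for illustration.

```python
import re
import subprocess

# Hop lines of `mtr --report` start with "<n>.|-- <host> <loss>%".
# Unresponsive hops ("???") are usually printed without the percent sign.
HOP_LINE = re.compile(r"^\s*(\d+)\.\|--\s+(\S+)\s+([\d.]+)%?")


def run_mtr(host, cycles=50):
    """Run mtr in report mode and return its text output."""
    cmd = ["mtr", "--report", "--report-cycles", str(cycles), "--no-dns", host]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def lossy_hops(report, threshold=1.0):
    """Return (hop number, host, loss percent) for hops at or above `threshold`."""
    hops = []
    for line in report.splitlines():
        match = HOP_LINE.match(line)
        if match and float(match.group(3)) >= threshold:
            hops.append((int(match.group(1)), match.group(2), float(match.group(3))))
    return hops
```

Calling `lossy_hops(run_mtr("portal.devpunx.com"))` gives a short list to paste into a ticket next to the full report. Loss on an intermediate hop alone is often just ICMP rate limiting; it is loss that carries through to the final hop that points to a real problem.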
But it shows that really small problems can take days to debug…
Heads up! Never give up debugging.
Thanks to everyone who supported me by giving their input in this thread; it helped me to keep debugging.
Each time I hear about this kind of problem, it reminds me of this talk:
Yeah, I know that talk and really love it. One of the best!