I’m finding that socket reconnects (following deployments) are very disruptive to users, especially on form pages. At the moment it seems like an unavoidable trade-off of the LiveView model, but its impact on the experience is enough of a problem that I’m considering a switch to a client-side frontend (or maybe something in between, like LiveVue) to improve the experience. I really love the LiveView model generally, and I’m hoping that I’m missing some other opportunities to improve the experience and avoid going this route.
I’ve read through and implemented a number of best practices to ensure that deployments have minimal impact on the end user. I’ve implemented form recovery, state is stored in URLs, and I save form data through to the database every few seconds as well.
When a deployment goes out, the user experience is along the lines of:
page re-renders (mount → loading → socket connection → update render). This takes up to a few seconds, and is enough to disrupt the flow of someone actively putting thoughts down into the form. If in the flow of typing, there is often data loss as well.
the form focus is lost, so users need to find where they were
we get a spike of traffic as everyone reconnects, which isn’t a problem right now, but will become more problematic over time. I’ve seen there are ways to combat this with draining, but haven’t looked into that yet.
We have experienced the same issues when a lot of users reconnect at the same time after a deployment. In my experience this is a nice improvement to ensure that reconnections spread out instead of happening at the same time.
If the usage model requires the user to spend a significant amount of time dwelling in a form, then it is tough. The only thing I can think of is to break up the form into smaller pieces, wizard style, persist the work-in-progress in the database, and let the user move back and forth.
This is certainly the case. It is a form where users can spend up to an hour filling it out, not due to the size of the form but more its nature (think a bit, type, think a bit, type, etc.).
Thanks, I’ve come across some of this, but the load on the database is kind of secondary at the moment (feels solvable with techniques like this). The user experience feels like there’s not really a good option though…
If you can clearly identify which fields the user tends to spend time on, one approach you could take is to mark the surrounding area as phx-update="ignore" (and deal with its consequences if you’d otherwise need updates like for form validation).
For example, I have phx-update="ignore" in a video player container, and deployments do not affect the ongoing playback of the video, even through WebSocket reconnection.
(I am aware there are cases where a full page reload could be triggered, e.g. if the socket fails to reconnect, but that hasn’t been an issue.)
Thanks for the thoughts, but I feel like that might open other issues down the line. I think I’m going to experiment with LiveVue and see how that feels.
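To make the phx-update="ignore" suggestion concrete, here is a minimal sketch of a function component that wraps a single long-dwell textarea. The module name, component name, and wrapper id are hypothetical and not from the thread. LiveView needs a stable DOM id on the ignored container, and once the container is ignored, server-side changes to the value or to validation errors inside it will no longer be patched in.

```elixir
defmodule MyAppWeb.NotesField do
  # Hypothetical component module, for illustration only.
  use Phoenix.Component

  # Expects a form field such as @form[:notes].
  attr :field, Phoenix.HTML.FormField, required: true

  def notes_field(assigns) do
    ~H"""
    <%!-- After the first render, LiveView leaves everything inside this
         container alone, so the textarea keeps its content and focus
         when the socket reconnects. --%>
    <div id={"#{@field.id}-wrapper"} phx-update="ignore">
      <textarea id={@field.id} name={@field.name}><%= @field.value %></textarea>
    </div>
    """
  end
end
```

It would be used inside the existing form as `<.notes_field field={@form[:notes]} />`. Anything that must keep updating from the server, such as error messages, has to live outside the ignored container.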
Concerning your last point: LiveView sockets are drained by default, but you’ll probably need to adjust the drainer configuration depending on how many clients you expect. The defaults are pretty high and probably too high for most. See the drainer spec for sockets: Phoenix.Endpoint — Phoenix v1.7.20
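As a rough illustration of adjusting the drainer, the `:drainer` option goes on the `socket` declaration in the endpoint module. The option names below are the ones listed in the Phoenix.Endpoint docs; the numbers are illustrative assumptions, not recommendations, and should be sized against the number of connected clients and the shutdown grace period of the deployment platform.

```elixir
# Fragment from inside the endpoint module (e.g. MyAppWeb.Endpoint).
# @session_options is the module attribute a generated Phoenix app defines.
socket "/live", Phoenix.LiveView.Socket,
  websocket: [connect_info: [session: @session_options]],
  drainer: [
    # How many clients are told to reconnect per batch (illustrative value).
    batch_size: 500,
    # Milliseconds given to each batch before the next one starts.
    batch_interval: 1_000,
    # Upper bound in milliseconds for draining all batches.
    shutdown: 30_000
  ]
```

With smaller batches the reconnects are spread over a longer window, so the old node needs to be allowed to stay up at least as long as `shutdown` before it is killed.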
Yea, seconding these thoughts. The described behavior sounds like either the new nodes aren’t being given enough time to connect to the load balancer before the old nodes are drained, OR the old nodes aren’t being given enough time to drain. Properly orchestrated, there should be no point where hitting the load balancer doesn’t result in a connection.
Thanks for the input so far. It’s sounding a bit like I’m doing something odd (or it’s quite specific to my infra setup), which is good news. I’m going to try to reproduce it from a fresh Phoenix app and see what happens. I’ll report back here when I know more.
I have been mulling over, and have some thoughts on why my particular setup might be making things worse at the moment, but I still need to tinker:
the main pain point happens on a form where I use phx-change to save to the db, and I suspect this might be messing with form recovery.
I’m trying to reduce load by only fetching data when the socket is connected, and this might be disrupting the render flow / UX on first render, because it clears the screen and shows ‘loading’ until the socket connects and the info is fetched.
the app is hosted in the EU, and most of our current customers are 150+ ms away.
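For readers unfamiliar with the second point, the "only fetch when connected" pattern usually looks something like the sketch below. `MyAppWeb.EntryLive`, `MyApp.Entries.get_entry!/1`, and the `title` field are hypothetical names, not taken from my app.

```elixir
defmodule MyAppWeb.EntryLive do
  # Hypothetical LiveView, for illustration only.
  use Phoenix.LiveView

  def mount(%{"id" => id}, _session, socket) do
    if connected?(socket) do
      # Connected mount: the WebSocket is up, so do the expensive fetch.
      entry = MyApp.Entries.get_entry!(id)
      {:ok, assign(socket, entry: entry, loading: false)}
    else
      # Disconnected (static HTTP) render: skip the fetch, show a placeholder.
      {:ok, assign(socket, entry: nil, loading: true)}
    end
  end

  def render(assigns) do
    ~H"""
    <p :if={@loading}>Loading...</p>
    <p :if={!@loading}><%= @entry.title %></p>
    """
  end
end
```

On any full page load, the disconnected render takes the `else` branch, so the user sees the placeholder until the socket connects and the data arrives. With 150+ ms of latency, that gap is noticeable.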
I’m still trying to figure things out a bit, but I have made some progress, and things are not as I thought.
Firstly, I tested a couple of different ways of disrupting the connection (both locally and in production), and in most cases the experience is much better than what I initially reported.
I can confirm that the experience users are getting in production seems to be sporadic, and so far I could only reproduce it when deploying.
Some things I tried, which resulted in what I now know is the expected experience, i.e. the socket disconnected, then reconnected, with no page reloads or data loss on the forms:
locally, killing the server and restarting it
prod, killing pods, scaling pods to 0 and back up
When deploying, I sometimes hit the issue, but not always. When it happens, it seems to me that the page is doing a full page refresh.
I’m still looking into this, and will try to reproduce again.
(edit, removed some info on a possible lead - looks to be unrelated)
I found something which seems to be a strong contender, but I’m not entirely sure yet.
In the logs, I see a socket-reconnect log, followed by a notice about some pages trying to navigate across live_sessions. The navigation across live_sessions seems odd to me, because these notices show up in a batch around the socket reconnect and are otherwise not present during typical usage.
e.g.
2025-03-06T19:15:17.765085437Z stdout F 2025-03-06 19:15:17.764 [warning]
mfa=Phoenix.LiveView.Channel.authorize_session/3
navigate event to "https://(redacted)" failed because you are
redirecting across live_sessions. A full page reload will be
performed instead
I have seen these messages too, and wondered why they appeared, since I don’t navigate when I see them but merely restart the dev server.
Just thinking out loud: maybe when a session can’t be found for a given session key, LiveView assumes the user is navigating into a route with a different session and auto-redirects to force a full reload, triggering the unconnected render to embed the new session key in the output.
Technically this might be necessary, and only the wording is off?
If you’re saving to the db on phx-change and allowing form recovery on reconnect, you might run into data consistency issues if the user has more than one tab open during a deploy.
Well, maybe not. I thought I knew the problem, but a while ago we actually changed some code that was checking the live_session version (which changes whenever the router is changed) to only check the live_session name instead. @duncanphillips @BartOtten which LiveView version were you running when you saw this message? It should only happen when clicking on a <.link navigate={...}> or when a push_navigate is performed after a deployment that also changed the router. So just a reconnect should actually not cause this message.
I’ll try putting production into debug mode to get more logs and see if I can reproduce to get more info. Is there anything else I could do to help narrow this down?