How to reduce impact of deployments on a liveview?

duncanphillips · February 27, 2025, 9:01am

I’m finding that socket reconnects (following deployments) are very disruptive to users, especially on form pages. At the moment it seems like an unavoidable trade-off of the LiveView model, but its impact on the experience is enough of a problem that I’m considering a switch to a client-side frontend (or maybe something in between, like LiveVue) to improve it. I really love the LiveView model generally, and I’m hoping that I’m missing some other opportunities to improve the experience and avoid going this route.

I’ve read through and implemented a number of best practices to ensure that deployments have minimal impact on the end user. I’ve implemented form recovery, state is stored in URLs, and I save form data through to the database every few seconds as well.

When a deployment goes out, the user experience is along the lines of:

page re-renders (mount → loading → socket connection → update render). This takes up to a few seconds, and is enough to disrupt the flow of someone actively putting thoughts down into the form. If in the flow of typing, there is often data loss as well.
the form focus is lost, so users need to find where they were
we get a spike of traffic as everyone reconnects, which isn’t a problem right now, but will become more problematic over time. I’ve seen there are ways to combat this with draining, but haven’t looked into that yet.

looking forward to hearing thoughts

belaustegui · February 27, 2025, 9:07am

We have experienced the same issues when a lot of users reconnect at the same time after a deployment. In my experience this is a nice improvement to ensure that reconnections spread out instead of happening at the same time.

derek-zhou · February 27, 2025, 3:19pm

If the usage model requires the user to spend a significant amount of time dwelling in a form, then it is tough. The only thing I can think of is to break the form up into smaller pieces, à la wizard style, persist work-in-progress in the database, and let the user move back and forth.

duncanphillips · February 27, 2025, 8:01pm

This is certainly the case: it is a form where users can spend up to an hour filling it out, not due to the size of the form but more its nature (think a bit, type, think a bit, type, etc.).

duncanphillips · February 27, 2025, 8:02pm

Thanks, I’ve come across some of this, but the load on the database is kind of secondary at the moment (feels solvable with techniques like this). The user experience feels like there’s not really a good option though…

rhcarvalho · February 27, 2025, 10:42pm

If you can clearly identify which fields the user tends to spend time on, one approach you could take is to mark the surrounding area as phx-update="ignore" (and deal with its consequences if you’d otherwise need updates like for form validation).

For example, I have phx-update="ignore" in a video player container, and deployments do not affect the ongoing playback of the video, even through WebSocket reconnection.

(I am aware there are cases where a full page reload could be triggered, e.g. if the socket fails to reconnect, but that hasn’t been an issue.)
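A minimal sketch of what that could look like for a long free-text field, assuming a function component; the module name `MyAppWeb.NotesComponents` and the component `long_text/1` are hypothetical, not something from this thread. The one hard requirement I’m sure of is that a `phx-update="ignore"` container needs a unique DOM id.

```elixir
defmodule MyAppWeb.NotesComponents do
  # Hypothetical module, for illustration only.
  use Phoenix.Component

  attr :field, Phoenix.HTML.FormField, required: true

  def long_text(assigns) do
    ~H"""
    <%!-- The wrapper must have a unique id for phx-update="ignore" to work. --%>
    <div id={"#{@field.id}-wrapper"} phx-update="ignore">
      <%!-- Once rendered, the server no longer patches anything inside the wrapper, --%>
      <%!-- so server-driven changes (e.g. validation markup) must live outside it. --%>
      <textarea id={@field.id} name={@field.name}><%= @field.value %></textarea>
    </div>
    """
  end
end
```

It would be used inside a form as `<.long_text field={@form[:notes]} />`. The trade-off is the one mentioned above: anything the server would normally update inside the wrapper stays frozen at its first render.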

duncanphillips · March 3, 2025, 2:23pm

Thanks for the thoughts, but I feel like that might open other issues down the line. I think I’m going to experiment with LiveVue and see how that feels.

steffend · March 3, 2025, 5:30pm

Hey @duncanphillips,

Form recovery should take care of the first part. I’d be interested in a reproduction that shows data loss. For the focus issue, I opened a pull request: "keep active focus when reconnecting" (#3699 in phoenixframework/phoenix_live_view). When LiveView was initially released, the blur seems to have been added deliberately, but I don’t know why it’s there. Maybe Chris knows, but probably we can just remove it.

Concerning your last point: LiveView sockets are drained by default, but you’ll probably need to adjust the drainer configuration depending on how many clients you expect. The defaults are pretty high and probably too high for most. See the drainer spec for sockets: Phoenix.Endpoint — Phoenix v1.7.20
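For reference, a sketch of where the drainer options go, assuming a standard endpoint module. `MyAppWeb.Endpoint`, `:my_app` and the session options are placeholders, and the numbers are illustrative rather than recommendations; the option names and defaults are the ones listed in the Phoenix.Endpoint docs for `socket/3`.

```elixir
defmodule MyAppWeb.Endpoint do
  use Phoenix.Endpoint, otp_app: :my_app

  # Placeholder session options; a generated app defines its own.
  @session_options [
    store: :cookie,
    key: "_my_app_key",
    signing_salt: "changeme",
    same_site: "Lax"
  ]

  socket "/live", Phoenix.LiveView.Socket,
    websocket: [connect_info: [session: @session_options]],
    # On shutdown, connected clients are told to disconnect in batches.
    drainer: [
      # Clients notified per batch (documented default: 10_000).
      batch_size: 500,
      # Milliseconds given to each batch (documented default: 2_000).
      batch_interval: 2_000,
      # Maximum milliseconds to drain all batches (documented default: 30_000).
      shutdown: 30_000
    ]
end
```

A smaller `batch_size` spreads the reconnects over more batches, so the three values need to be sized together for the number of clients you expect. Whatever stops the old node (for example the pod’s termination grace period on Kubernetes) also has to allow at least the `shutdown` time, otherwise draining gets cut short.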

Happy coding!

benwilson512 · March 4, 2025, 12:56pm

Yea seconding these thoughts. The described behavior sounds like either the new nodes aren’t being given enough time to connect to the load balancer before the old nodes are drained OR the old nodes aren’t being given enough time to drain. Properly orchestrated, there should be no point where hitting the load balancer doesn’t result in a connection.

duncanphillips · March 4, 2025, 1:53pm

Thanks for the input so far - it’s sounding a bit like I’m doing something odd (or it’s quite specific to my infra setup) - which is good news. I’m going to try to reproduce it from a fresh Phoenix app and see what happens. I’ll report back here when I know more.

I have been mulling over, and have some thoughts on why my particular setup might be making things worse at the moment, but I still need to tinker:

the main pain point happens on a form where I used phx-validate to save to the db, and I suspect this might be messing with form recovery.
I’m trying to reduce load by only fetching data when the socket is connected, and this might be disrupting the render flow / UX on first render, because it clears the screen and shows ‘loading’ until the socket connects and the info is fetched.
app is hosted in the EU; most of the current customers are 150+ ms away.
app is on Kubernetes
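To make the second point concrete, this is roughly the pattern being described, as a sketch only: `MyAppWeb.EntryLive` and `MyApp.Entries.get_entry!/1` are hypothetical names, not the actual app code.

```elixir
defmodule MyAppWeb.EntryLive do
  # Hypothetical LiveView, for illustration only.
  use Phoenix.LiveView

  def mount(%{"id" => id}, _session, socket) do
    # mount/3 runs twice on a full page load: once for the static HTTP
    # render (connected?/1 is false) and again once the socket connects.
    entry =
      if connected?(socket) do
        # Hypothetical context function; only hit the database when connected.
        MyApp.Entries.get_entry!(id)
      else
        nil
      end

    {:ok, assign(socket, :entry, entry)}
  end

  def render(assigns) do
    ~H"""
    <p :if={is_nil(@entry)}>Loading...</p>
    <div :if={@entry} id="entry"><%= @entry.body %></div>
    """
  end
end
```

With this pattern the static render contains only the placeholder, so whenever the browser does a full page load (rather than a plain socket reconnect) the user sees the ‘loading’ state until the socket connects and the data arrives, which at 150+ ms of latency is noticeable.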

benwilson512 · March 4, 2025, 2:59pm

What sort of ingress controller are you using?

duncanphillips · March 5, 2025, 6:17am

I’m using the nginx controller (it’s a bit old, I haven’t updated it in quite some time)

kubernetes/ingress-nginx (Ingress NGINX Controller for Kubernetes): I’m using v1.3, and the latest is 1.12.

I’m going to try to spend some time on reproducing this and provide more info tomorrow.

duncanphillips · March 6, 2025, 8:02pm

I’m still trying to figure things out a bit, but I have made some progress, and things are not as I thought.

Firstly, I tested a couple of different ways of disrupting the connection (both locally and in production), and in most cases the experience is much better than what I initially reported.

I can confirm that the experience users are getting in production seems to be sporadic, and I could only reproduce so far when deploying.

Some things I tried, which resulted in the expected experience (as I now know), i.e. the socket disconnected, then reconnected, with no page reloads or data loss on the forms:

locally, killing the server and restarting it
prod, killing pods, scaling pods to 0 and back up

When deploying, I sometimes hit the issue, but not always. When it happens, it seems to me that the page is doing a full page refresh.

I’m still looking into this, and will try to reproduce again.

(edit, removed some info on a possible lead - looks to be unrelated)

Going to keep digging into this to find the cause

duncanphillips · March 6, 2025, 8:32pm

I found something which seems to be a strong contender, but I’m not entirely sure yet.

In the logs, I see a socket-reconnect log, followed by a notice about some pages trying to navigate across liveviews. The navigating across liveviews seems odd to me, because they show up in batch around the socket reconnect, and are otherwise not present during typical usage.

e.g

2025-03-06T19:15:17.765085437Z stdout F 2025-03-06 19:15:17.764 [warning]
  mfa=Phoenix.LiveView.Channel.authorize_session/3
  navigate event to "https://(redacted)" failed because you are
  redirecting across live_sessions. A full page reload will be
  performed instead

BartOtten · March 7, 2025, 10:40am

Have seen the messages too, wondering why they appeared as I don’t navigate while seeing them but merely restart the dev-server.

Just thinking out loud: maybe when a session can’t be found for given session key, LiveView (assumes a user navigates into a route with another new session and) auto-redirects to force a full reload, triggering the unconnected render to embed the new session key in the output.

Technically this might be necessary but only the wording is off?

sa-mm · March 7, 2025, 3:20pm

If you’re saving to the db on phx-change and allowing form recovery on reconnect, you might run into data consistency issues if the user has more than one tab open during a deploy.

steffend · March 7, 2025, 11:31pm

This actually helps a lot. I think I know what’s going on. I’ll see if I can come up with a good solution and let you know!

duncanphillips · March 8, 2025, 8:12am

Thanks, really appreciate the help. Please let me know if there’s anything else I can provide to help troubleshoot.

steffend · March 10, 2025, 3:47pm

Well, maybe not. I thought I knew the problem, but a while ago we changed the code that was checking the live_session version (which changes whenever the router is changed) to only check the live_session name instead. @duncanphillips @BartOtten which LiveView version were you running when you saw this message? It should only happen when clicking on a <.link navigate={...}> or when a push_navigate is performed after a deployment that also changed the router. So just a reconnect should actually not cause this message.

duncanphillips · March 11, 2025, 8:35am

ok, thanks for looking.

Here are my phoenix lib versions

* phoenix 1.7.18 (Hex package) (mix)
* phoenix_ecto 4.6.3 (Hex package) (mix)
* phoenix_html 4.2.0 (Hex package) (mix)
* phoenix_html_helpers 1.0.1 (Hex package) (mix)
* phoenix_live_dashboard 0.8.6 (Hex package) (mix)
* phoenix_live_reload 1.5.3 (Hex package) (mix)
* phoenix_live_view 1.0.2 (Hex package) (mix)
* phoenix_pubsub 2.1.3 (Hex package) (mix)
* phoenix_swoosh 1.2.1 (Hex package) (mix)
* phoenix_template 1.0.4 (Hex package) (mix)
* phoenix_view 2.0.4 (Hex package) (mix)

I’ll try putting production into debug mode to get more logs and see if I can reproduce to get more info. Is there anything else I could do to help narrow this down?