Dealing with server load from LiveView reconnects on rolling restarts

We’re building an app that has been growing quickly, and we generally enjoy the benefits of LiveView.

One area that has started to become a struggle as we grow is deploying new servers. We have a Kubernetes cluster that performs rolling restarts for us, but a restart can result in the equivalent of every active user reloading their page all at the same time, which produces heavy bursts of traffic and puts stress on our DB.

Wondering what others have done to mitigate this, specifically as it relates to LiveView? Any good approaches to staggering client reconnects a bit, for example?

5 Likes

There was a similar thread started a few days ago: How to limit the amount of active Websockets per node on Phoenix

LiveView builds on top of Phoenix Channels, so if you want to work out a solution from the client side, have a look at the Phoenix JS client docs (phoenix 1.7.14 | Documentation). You can configure several aspects of how the Socket client works; for example, the opts.reconnectAfterMs function could be used to add jitter on top of the default exponential backoff.

I’d note however that, like on the thread I linked above, we should probably be thinking at the overall system level: if you’re doing rolling deployments with Kubernetes, only a fraction of your currently connected users are being disconnected and reconnected at any one time, not all of them.

If the problem is concentrated around the database hits, another idea that comes to mind is using the helpers documented in Phoenix.LiveView — Phoenix LiveView v0.20.17 (such as connected?/1) to limit most of your queries to when the socket is connected, which could reduce the number of queries by nearly half.
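As a rough illustration of that idea (the module and function names here are made up, not from anyone’s actual app), the disconnected mount can render a placeholder and leave the real query to the connected mount:

defmodule MyAppWeb.DashboardLive do
  use MyAppWeb, :live_view

  @impl true
  def mount(_params, _session, socket) do
    socket =
      if connected?(socket) do
        # Second mount, over the WebSocket: load the real data.
        assign(socket, :orders, MyApp.Orders.list_recent())
      else
        # First mount, during the static HTTP render: render a
        # placeholder and skip the DB entirely.
        assign(socket, :orders, [])
      end

    {:ok, socket}
  end
end

Since mount/3 runs once for the static render and once more when the socket connects, skipping the query on the first pass is where the “nearly half” comes from.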

re: stress on the DB

I solved a problem very similar to this at my previous company.
In that case, the DB stress was happening due to “live updates”.
We were live-updating a very complicated and dense thing, where a single action could result in many changes. At first, we put the whole thing in the PubSub payload for the LiveViews to update their own assigns. This had problems: we learned that large payloads sent over long distances take a long time to arrive. So then we simply had all the LiveViews hit the DB to get fresh data, and that led to the “stress on the DB”.

To solve the problem, I cached the result of the function that the LiveView used to fetch data. I used the :nebulex library to do that.

The only problem is that it takes great care and attention to remember to “clear” (invalidate) the cache every time the data changes.
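As a minimal sketch of that setup, assuming Nebulex v2 with its local adapter and caching decorators (the cache module, context, key, and TTL below are illustrative names, not our actual code):

defmodule MyApp.Cache do
  # In-memory cache; other Nebulex adapters plug in here if needed.
  use Nebulex.Cache,
    otp_app: :my_app,
    adapter: Nebulex.Adapters.Local
end

defmodule MyApp.Dashboards do
  use Nebulex.Caching

  alias MyApp.Cache

  # LiveViews call this on every PubSub notification; after the first
  # call the result is served from the cache instead of the DB.
  @decorate cacheable(cache: Cache, key: {:dashboard, org_id}, opts: [ttl: :timer.seconds(30)])
  def fetch_dashboard(org_id) do
    # Placeholder for the expensive Ecto queries the LiveViews used to
    # run on every update.
    MyApp.Repo.get_by!(MyApp.Dashboard, org_id: org_id)
  end

  # Call this from every code path that writes the underlying data,
  # otherwise readers keep seeing stale results until the TTL expires.
  def invalidate_dashboard(org_id) do
    Cache.delete({:dashboard, org_id})
  end
end

Keeping the caching in the context module rather than in the LiveView means the same cached function serves both the initial mount and the PubSub-triggered refreshes.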

When we got it set up properly, it worked well and we did not see any more live-updating issues.

I believe something like this could help you, at least with the “DB stress” side of things.

1 Like

We faced a somewhat similar problem trying to perform a rolling upgrade (we are not using LiveView). Our application is distributed in nature (i.e. it relies on Erlang distribution), which makes such an upgrade hard, although our k8s / AWS load-balancer configuration supports it. Currently we simply wait for the new cluster to start in its entirety before switching traffic (Kubernetes maxSurge set to 100%), so effectively we have an instantaneous switch of traffic rather than a rolling upgrade.

What we are planning to do is set the cookie field in the mix.exs releases section to the current release version, so that nodes from the old and new deployments cannot join the same Erlang cluster. This will allow us to set maxSurge to a lower value, say 25%, and traffic sent to nodes in the new cluster and existing traffic on the old cluster will not interfere with each other.
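A sketch of what we have in mind (the app name and version are placeholders); mix release accepts a :cookie option per release:

# mix.exs
defmodule MyApp.MixProject do
  use Mix.Project

  @version "1.4.2"

  def project do
    [
      app: :my_app,
      version: @version,
      releases: [
        my_app: [
          # Tie the distribution cookie to the release version so nodes
          # from the old and the new deployment refuse to connect to
          # each other during the rollout.
          cookie: "my_app-#{@version}"
        ]
      ]
    ]
  end
end

mix release writes the cookie into releases/COOKIE inside the release, so each deployment ships with its own value.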

I find the default reconnectAfterMs to be far too aggressive; it currently tries to reconnect first after 10 ms and then after 50 ms.

I usually use something more like this in production apps:

function defaultReconnectAfterMs(tries: number): number {
  // Back off from 250 ms up to 10 s, then retry every 15 s.
  const nominalMs =
    [250, 500, 1_000, 2_500, 5_000, 10_000][tries - 1] || 15_000;

  // Apply roughly ±25% jitter so clients don't all reconnect in lockstep.
  const jitterRatio = getRandomInt(75, 125) / 100;
  return nominalMs * jitterRatio;
}

// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Math/random#getting_a_random_integer_between_two_values
function getRandomInt(min: number, max: number): number {
  const minCeiled = Math.ceil(min);
  const maxFloored = Math.floor(max);
  return Math.floor(Math.random() * (maxFloored - minCeiled) + minCeiled); // The maximum is exclusive and the minimum is inclusive
}

This helps to spread out the load of all of the reconnects.

7 Likes