Dealing with server load from LiveView reconnects on rolling restarts

We’re building an app that has been growing quickly, and we generally enjoy the benefits of LiveView.

One area that has started to become a struggle as we grow is deploying new servers. We have a Kubernetes cluster that performs rolling restarts for us, but a restart can result in the equivalent of every active user reloading their page all at the same time, which produces heavy bursts of traffic and puts stress on our DB.

Wondering what others have done to mitigate this, specifically as it relates to LiveView? Any good approaches to staggering client reconnects a bit, for example?

5 Likes

There was a similar thread started a few days ago: How to limit the amount of active Websockets per node on Phoenix

LiveView builds on top of Phoenix Channels, so if you want to work out a solution from the client side, have a look at the Phoenix JS client docs (phoenix 1.7.14 | Documentation). You can configure several aspects of how the Socket client works; for example, the opts.reconnectAfterMs function could be used to add jitter on top of the default exponential backoff.

I’d note however that, like on the thread I linked above, we should probably be thinking at the overall system level: if you’re doing rolling deployments with Kubernetes, only a fraction of your currently connected users are being disconnected and reconnected at any one time, not all of them.

If the problem is concentrated around the database hits, another idea that comes to mind is using the helpers documented in Phoenix.LiveView — Phoenix LiveView v0.20.17 (such as connected?/1) to limit most of your queries to when the socket is connected, which could reduce the number of queries by nearly half.
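As a rough illustration of that idea (the module and function names here are made up, not from anyone’s actual app), the disconnected mount can render a placeholder and leave the real query to the connected mount:

defmodule MyAppWeb.DashboardLive do
  use MyAppWeb, :live_view

  @impl true
  def mount(_params, _session, socket) do
    socket =
      if connected?(socket) do
        # Second mount, over the WebSocket: load the real data.
        assign(socket, :orders, MyApp.Orders.list_recent())
      else
        # First mount, during the static HTTP render: render a
        # placeholder and skip the DB entirely.
        assign(socket, :orders, [])
      end

    {:ok, socket}
  end
end

Since mount/3 runs once for the static render and once more when the socket connects, skipping the query on the first pass is where the “nearly half” comes from.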

re: stress on the DB

I solved a problem very similar to this at my previous company.
In that case, the DB stress was happening due to “live updates”.
We were live-updating a very complicated and dense thing, where a single action could result in many changes. At first, we put the whole thing in the PubSub payload for the LiveViews to update their own assigns. This had problems: we learned that large payloads sent over long distances take a long time to arrive. So then we simply had all the LiveViews hit the DB to get fresh data, and that led to the “stress on the DB”.

To solve the problem, I cached the result of the function that the LiveView used to fetch data. I used the :nebulex library to do that.

The only problem is that it takes great care and attention to remember to “clear” (invalidate) the cache every time the data changes.
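As a minimal sketch of that setup, assuming Nebulex v2 with its local adapter and caching decorators (the cache module, context, key, and TTL below are illustrative names, not our actual code):

defmodule MyApp.Cache do
  # In-memory cache; other Nebulex adapters plug in here if needed.
  use Nebulex.Cache,
    otp_app: :my_app,
    adapter: Nebulex.Adapters.Local
end

defmodule MyApp.Dashboards do
  use Nebulex.Caching

  alias MyApp.Cache

  # LiveViews call this on every PubSub notification; after the first
  # call the result is served from the cache instead of the DB.
  @decorate cacheable(cache: Cache, key: {:dashboard, org_id}, opts: [ttl: :timer.seconds(30)])
  def fetch_dashboard(org_id) do
    # Placeholder for the expensive Ecto queries the LiveViews used to
    # run on every update.
    MyApp.Repo.get_by!(MyApp.Dashboard, org_id: org_id)
  end

  # Call this from every code path that writes the underlying data,
  # otherwise readers keep seeing stale results until the TTL expires.
  def invalidate_dashboard(org_id) do
    Cache.delete({:dashboard, org_id})
  end
end

Keeping the caching in the context module rather than in the LiveView means the same cached function serves both the initial mount and the PubSub-triggered refreshes.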

When we got it set up properly, it worked well and we did not see any more live-updating issues.

I believe something like this could help you, at least with the “DB stress” side of things.

1 Like

We faced a somewhat similar problem trying to perform a rolling upgrade (we are not using LiveView). Our application is distributed in nature (i.e. it relies on Erlang distribution), which makes such an upgrade hard, although our k8s / AWS load-balancer configuration supports it. Currently we simply wait for the new cluster to start in its entirety before switching traffic (Kubernetes maxSurge set to 100%), so effectively we have an instantaneous switch of traffic rather than a rolling upgrade.

What we are planning to do is set the cookie field in the mix.exs releases section to the current release version, so that nodes from the old and new deployments cannot join the same Erlang cluster. This will allow us to set maxSurge to a lower value, say 25%, and traffic sent to nodes in the new cluster and existing traffic on the old cluster will not interfere with each other.
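A sketch of what we have in mind (the app name and version are placeholders); mix release accepts a :cookie option per release:

# mix.exs
defmodule MyApp.MixProject do
  use Mix.Project

  @version "1.4.2"

  def project do
    [
      app: :my_app,
      version: @version,
      releases: [
        my_app: [
          # Tie the distribution cookie to the release version so nodes
          # from the old and the new deployment refuse to connect to
          # each other during the rollout.
          cookie: "my_app-#{@version}"
        ]
      ]
    ]
  end
end

mix release writes the cookie into releases/COOKIE inside the release, so each deployment ships with its own value.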

I find the default reconnectAfterMs to be far too aggressive; it currently tries to reconnect first after 10 ms and then after 50 ms.

I usually use something more like this in production apps:

function defaultReconnectAfterMs(tries: number): number {
  // Back off from 250 ms up to 10 s, then retry every 15 s.
  const nominalMs =
    [250, 500, 1_000, 2_500, 5_000, 10_000][tries - 1] || 15_000;

  // Apply roughly ±25% jitter so clients don't all reconnect in lockstep.
  const jitterRatio = getRandomInt(75, 125) / 100;
  return nominalMs * jitterRatio;
}

// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Math/random#getting_a_random_integer_between_two_values
function getRandomInt(min: number, max: number): number {
  const minCeiled = Math.ceil(min);
  const maxFloored = Math.floor(max);
  return Math.floor(Math.random() * (maxFloored - minCeiled) + minCeiled); // The maximum is exclusive and the minimum is inclusive
}

This helps to spread out the load of all of the reconnects.

7 Likes