Restarting an application using websockets

Hey,

When restarting or upgrading an application that uses WebSockets, how do you deal with the fact that after the restart all WebSocket clients will attempt to reconnect simultaneously and overload your system?

1 Like

The traditional answer here is to use rolling deploys over a set of M servers, and only have N of those changed out at any given moment. You pick N such that N/M doesn’t overload the system. As a concrete example, if you’re running 10 servers, you might do a rolling deploy that replaces 2 of them at any given moment; your cluster would then need to handle 2/10 of your traffic reconnecting at once.

The lower your overall traffic, the lower those numbers can be; it just depends on your load profile.

There are fancier answers these days as well, where instead of hard starting/stopping any particular server, you spawn an entirely new set of 10 and your load balancer gradually shifts traffic from the old set to the new set.

NOTE: I’m using the term “servers” here somewhat loosely; I really mean “running instances of your application”. In the context of containers, these may or may not map to individual whole machines somewhere.
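If it helps to see the batching spelled out, here’s a rough sketch. It only illustrates the idea: deployTo, waitForHealthy, and the settle delay are hypothetical stand-ins for whatever your orchestration tooling actually provides.

// Hedged sketch of a batched rolling deploy. deployTo and waitForHealthy
// are hypothetical helpers representing your own deploy tooling.
async function rollingDeploy(servers, batchSize, settleMs) {
  for (let i = 0; i < servers.length; i += batchSize) {
    const batch = servers.slice(i, i + batchSize);
    await Promise.all(batch.map((server) => deployTo(server)));       // restart N instances
    await Promise.all(batch.map((server) => waitForHealthy(server))); // wait for them to come back
    // Give the displaced websocket clients time to reconnect and settle
    // before the next batch is taken down.
    await new Promise((resolve) => setTimeout(resolve, settleMs));
  }
}

// e.g. 10 servers, 2 at a time, 60 seconds between batches:
// rollingDeploy(allServers, 2, 60 * 1000);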

4 Likes

I see what you mean. Thanks!

There’s another aspect of this that I feel strongly about because it’s brought my production down before.

If you use tokens to connect to the WebSocket, and the tokens are fetched from a service via an async request (either the WebSocket service itself or another service in the cluster), then you must be prepared for your real-time application crashing and causing every user to try to refresh their token at once. This can cause a massive spike in requests over a short period.

I must use a short-lived token in my app, so I solved this by fetching a token every 5-9 minutes. I trade off fetching a token I may not need for the stability of the token service.
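On the client that refresh loop can look roughly like this. A minimal sketch: fetchToken() is a hypothetical call to whatever your token endpoint is.

// Hedged sketch: fetch a fresh token every 5-9 minutes with jitter so
// clients don't all hit the token service at the same moment.
// fetchToken() is hypothetical and stands in for your token endpoint.
let currentToken = null;

function scheduleTokenRefresh() {
  const delayMs = (5 + Math.random() * 4) * 60 * 1000; // random 5-9 minutes
  setTimeout(async () => {
    try {
      currentToken = await fetchToken();
    } catch (err) {
      // Token service is struggling: keep the old token and try again later.
    }
    scheduleTokenRefresh();
  }, delayMs);
}

scheduleTokenRefresh();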

I never had an issue with rolling deployments because the number of users affected at once was smaller, but I did experience issues when the service unexpectedly started having operational problems.

Real-Time Phoenix’s chapter on deployment should be going into beta in the next 2 weeks and includes a bit on this topic.

5 Likes

Hi @sb8244,

Your book “Real-Time Phoenix” has a section called “send a recurring message”. In that section you give a code example of sending a new token to the client every 5 seconds.

Would it be an oversimplification to use your code from that section and change the interval from 5 seconds to a random amount of time between 5 and 9 minutes, to achieve what you describe above?

Thank you :slight_smile:

Thank you for your response.

This is an issue that I absolutely looked into, because it goes hand in hand with the fact that if your websocket service crashes then you’ll have everyone trying to reconnect simultaneously.

I’m fetching tokens in much the same way, which mostly solves this issue. If the service is down for more than 10 minutes (my access token lifespan), I’ll definitely still feel a big spike, though.

I included that section precisely for this reason, and it’s definitely a viable technique to serve the token in a message from the Channel itself. This doesn’t work for all cases, though. My token service is separate from the Channel service, so I have the client request the token from the token service every 5-9 minutes (10 minute lifetime).
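One detail that makes the client-side fetch play nicely with reconnects: phoenix.js accepts a closure for the socket params, so every reconnect attempt picks up the latest token. A minimal sketch, assuming some refresh loop keeps latestToken up to date:

import { Socket } from "phoenix";

let latestToken = null; // kept fresh by a 5-9 minute refresh loop

const socket = new Socket("/socket", {
  // Passing a closure means the params are re-evaluated on every
  // (re)connect attempt, so reconnects use the newest token.
  params: () => ({ token: latestToken }),
});
socket.connect();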

The objective here is to decrease exposure, not remove it completely. There’s a big difference between someone finding a token in a log from 30 days ago and using it to access the system, versus finding one in a log from 5 minutes ago.

Great, thanks!

I like this token refresh approach a lot. I’m still trying to figure out how to address one corner case, though: the session (stored in the cookie) has just expired, but the token refresh code for the WebSocket keeps refreshing the token.

Maybe a check in the refresh code is needed to make sure the session has not yet expired (or been deleted on the server, if persisted in the db)?
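Something along these lines, perhaps. A hedged sketch only: sessionExpiresAt, fetchToken(), and the "/login" path are hypothetical and depend on how your server exposes session state to the client.

// Hedged sketch: only refresh while the session is still alive.
// sessionExpiresAt (ms since epoch) and fetchToken() are hypothetical.
async function refreshTokenIfSessionValid() {
  if (Date.now() >= sessionExpiresAt) {
    // Session is gone: stop refreshing and send the user back to log in.
    window.location.assign("/login"); // placeholder path
    return;
  }
  currentToken = await fetchToken();
}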

Thank you.

If using Phoenix channels, you can also write your own backoff timer (this is what I did to fix the overload issue):

/**
 * Return a randomized (jittered) backoff that grows linearly with the
 * number of reconnect attempts.
 * @param {number} i Reconnect attempt number
 * @returns {number} Milliseconds to wait before reconnecting
 */
function reconnect_backoff(i) {
  const rand = Math.random() * 90 * 1000;  // random 0-90 second step per attempt
  return rand * i;
}

It can be used with the Phoenix socket implementation by giving it the option { reconnectAfterMs: reconnect_backoff }.
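For example, wiring it up looks roughly like this (the "/socket" path is just a placeholder for your own endpoint):

import { Socket } from "phoenix";

// phoenix.js calls reconnectAfterMs with the current attempt number on
// each retry, so the custom backoff slots straight in.
const socket = new Socket("/socket", {
  reconnectAfterMs: reconnect_backoff,
});
socket.connect();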

One challenge with a large backoff is that it can make small internet flakes much more impactful for users. That’s not necessarily a problem for some applications; everything is a trade-off. Uptime of a real-time app is one of the most important stats, so I try to keep it extremely high by having short reconnect times.

This is where things start to get a bit tricky, and it’s one reason I use the “client-side fetch” approach. Off the top of my head, you may be able to get away with setting a max age on the channel refresh process, and then having the client fetch a new token via Ajax when that age is hit. You could manage the timing of that process completely in the channel, with a small handler on the client side and an endpoint to get the token.
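Roughly, the client-side piece could look like this. Heavily hedged: the "token_expiring" event name and "/api/socket_token" endpoint are made up for illustration, and channel is assumed to be an already-joined Phoenix channel.

// Hedged sketch: the channel warns the client its token is about to age
// out, and the client fetches a fresh one over HTTP. Event name and
// endpoint are hypothetical.
channel.on("token_expiring", async () => {
  const resp = await fetch("/api/socket_token", { credentials: "include" });
  const { token } = await resp.json();
  latestToken = token; // used by the socket params closure on the next reconnect
});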

Got it, thank you :slight_smile:

Btw, this afternoon it occurred to me that a super simple solution would be to issue the token any time the cookie gets issued (i.e. on login), independently of any WebSocket connect, and have the cookie’s max-age match the token lifetime. That way one would never run into the situation where the token outlives the cookie. I wouldn’t use it with non-expiring cookies, but it seems reasonable for something where the session cookie has a max age measured in hours (e.g. 8 hours).

Any thoughts on what I’m missing? Is this creating some significant security issue?

Thank you :slight_smile:

P.S.: As I think some more about it, the problem with this approach is probably that tokens are not available across pages, but cookies are?

This seems like it would work. If you have a longer session life or a logout feature, then there would be a few extra hours of overlap. Some security audits may not like that, but it seems like a pretty low-risk issue imo.

It also might depend on how long users stay on a page. Our app is a SPA and many users leave it open all day and even multiple days/weeks.