Websockets & Loadbalancing

why · October 25, 2020, 6:55am

Hello,

I really like the deployment model of having a load balancer with two (or more) application servers behind it. The reasons I like it are:

SSL termination on the load balancer
Ability to update / restart one server at a time without affecting the user experience (in a stateless world), assuming the remaining server(s) have enough spare capacity.
Some degree of high availability

However, this seems to get more complex when entering the websocket world, with its more persistent connections/state. This article ("The unsolved load balancing problem of websockets") makes it sound a bit like territory that is not all that well charted. At the same time, it sounds if one followed the practice of only restarting one server at a time, it would work out ok?

Any thoughts how unsolved / tricky this really is?

Thank you!

mindok · October 26, 2020, 1:01am

You may find some pointers in the Discord engineering blog - they are using websockets at scale. There are links etc here: https://elixir-lang.org/blog/2020/10/08/real-time-communication-at-scale-with-elixir-at-discord/

why · October 26, 2020, 9:31am

Thanks @mindok, will look into this

I also just realized that my copy of “Real-Time Phoenix” from @sb8244 has a section outlining several strategies for dealing with loadbalancing websockets. Will study these in detail now also.

Matsa59 · October 26, 2020, 2:17pm

If I remember you can configure your loadbalancing to always send a client to the same instance of your backend. If this instance crash then the LB will redirect to another one and follow a simple (re)connection flow.

However, keep in mind that you can connect your elixir instances to each other. So you can share the state of a websocket and then create a special flow for reconnection.

Here a video that could help you : https://www.youtube.com/watch?v=nLApFANtkHs

In the video, speaker use Horde, you can also configure this behaviour by yourself, naming process with {:global, "name"}.

fxn · October 26, 2020, 4:02pm

In AWS, the load balancer is able to keep sockets evenly distributed with this algorithm.

why · October 27, 2020, 8:50am

Thanks @Matsa59 and thanks @fxn!

@fxn: The AWS load balancing algorithm looks great, and like a really simple solution! If I’m not mistaken this algorithm is also called “least connections” with other load balancer offerings (haproxy being one them).

Not sure about AWS, but with some of the load balancer solutions out there one might have to additionally turn on “sticky sessions”, I think. EDIT: Hmm, it looks like “sticky sessions” are an option available only for http (but not tcp), I need to investigate some more.

Very exciting thought that there might be a relatively simple path here after all

why · October 28, 2020, 10:51am

Just wanted to update my last post (can’t edit it any longer), and delete the section about “sticky sessions”, which is not relevant / correct.

From my latest understanding the only thing one really needs to do is to use the “least connections” balancing algorithm (or equivalent) to balance websockets in a reasonable way across multiple servers.

wanton7 · October 28, 2020, 11:38am

WebSocket connection still starts as http connection and is upgraded to WebSocket connection so I don’t see why you couldn’t use sticky sessions (implemented with cookies) with WebSockets.

ityonemo · October 28, 2020, 11:12pm

just keep in mind {:global, “name”} takes out a transaction lock on the entire VM, so if a lot of things are in motion it can block, and performance degrades dramatically if you have a cluster (or so I hear, I haven’t used it yet).

Matsa59 · October 29, 2020, 12:13am

I didn’t read anything about that. On https://erlang.org/doc/man/global.html it explains the global system works « locally » on each node.

So, if I’m not wrong, it works exactly the same way that non global processes (except they start on every node)

If you have any source I’m really interesting to know more about that.

ityonemo · October 29, 2020, 12:16am

it’s right there:

For any action resulting in a change to the global name table, all tables on other nodes are automatically updated.

Global locks have lock identities and are set on a specific resource. For example, the specified resource can be a pid. When a global lock is set, access to the locked resource is denied for all resources other than the lock requester.

other bits and pieces:

When new nodes are added to the network, they are informed of the globally registered names that already exist…
This function is completely synchronous, that is, when this function returns, the name is either registered on all nodes or none.

I guess they don’t come out and directly say that there are transactional locks, but I’m pretty sure they aren’t implementing a faster/more available consensus protocol like raft (which wouldn’t have those properties anyways).

please someone correct me if I’m wrong =D

Matsa59 · October 29, 2020, 12:44am

Oh I get it, yeah you’re right. That should be the reason why libs like Horde exists

IMO global is nice for a state that don’t change too many times (edit: like @ityonemo said previously)

Thanks a lot @ityonemo

sb8244 · October 29, 2020, 11:16pm

I like the ALB strategy and would definitely look at that if it’s available. We don’t really have that capability, so had to adapt a bit.

We check the whole cluster every 15s or so and query the number of connections. If one node is outside of a statistical range, we terminate a small number of connections. This causes it to load balance within a few minutes (Using round robin). Only one node will terminate connections, which keeps the balancing steady and as slow as we want.