I have a use case here at work that seems like a great fit for a Single Global Process, but (as Chris Keathley points out) actually putting together an SGP system with data consistency and resiliency is very tricky. I know this topic gets brought up regularly, but I figure I’ll see what wisdom Elixir Forum has for this use case.
We track individualized video streams for large streaming events, like people watching some sportsball game on their phones or Roku, that kind of thing. Generally speaking, for each individual viewer, we need to track where they are in the live stream, which ads they’ve viewed, and which ads will be coming up next. Concurrency for big events is already hitting over 1 million viewers pretty regularly, and we expect that number to keep going up.
Today we use lots of stateless web servers that load up a viewer's session data, do a bunch of processing, send the video manifest to the video player, then store that session data back to a zone-wide redis cluster. But this model isn't scaling well, especially as stream latency (time-behind-live) goes down and the video players have to hit our servers more frequently.
We initially explored sticky sessions, which allows for more local caching per node, but implementing sticky sessions comes with its own headaches and doesn’t solve some of our consistency problems anyway. A more consistent solution would be a Single Global Process, where each viewer gets a GenServer that updates its metadata independently of the web requests coming in.
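For context, here's roughly what the per-viewer process looks like in my proof of concept. The module name, registry, and session fields are all illustrative, not our real code:

```elixir
defmodule ViewerSession do
  use GenServer

  # One process per viewer, registered by viewer id in a local Registry
  # (assumes `Registry.start_link(keys: :unique, name: ViewerRegistry)`
  # somewhere in the supervision tree)
  def start_link(viewer_id) do
    GenServer.start_link(__MODULE__, viewer_id, name: via(viewer_id))
  end

  defp via(viewer_id), do: {:via, Registry, {ViewerRegistry, viewer_id}}

  # The web request handler just asks the viewer's process for the next
  # manifest instead of loading/storing session state itself
  def next_manifest(viewer_id) do
    GenServer.call(via(viewer_id), :next_manifest)
  end

  @impl true
  def init(viewer_id) do
    {:ok, %{viewer_id: viewer_id, position: 0, ads_seen: [], ads_upcoming: []}}
  end

  @impl true
  def handle_call(:next_manifest, _from, state) do
    # Advance playback position and rotate ads; the real logic is much hairier
    state = %{state | position: state.position + 1}
    {:reply, build_manifest(state), state}
  end

  defp build_manifest(state), do: {:manifest, state.viewer_id, state.position}
end
```

On a single node this is simple and fast, since the session state never leaves the process. The whole question is what happens when you need a million of these spread across a cluster.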
A quick proof-of-concept showed this to work GREAT on a single node. My first stab at horizontal scaling for this was to throw it on top of Swarm and hope it solved all my problems. Unfortunately a quick load test of that solution implied that it just wasn’t going to scale to the degree I needed (also its process recovery is still a little problematic).
So I searched and read and researched. Discord has a couple of amazing blog posts that make it seem like I should be able to do something like this and make it work. But their blog posts and GitHub READMEs are light on implementation details. Seems like they track processes using their hash ring lib, but I'm not sure how they provide any resiliency with it (e.g. recovering from an EC2 node just disappearing into the ether).
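My reading of the hash ring approach looks something like the sketch below. I'm using bitwalker's libring API here since Discord's own lib doesn't document this flow; the routing and the `ViewerSession.next_manifest/1` call are my speculation about how you'd wire it up, not anything from their posts:

```elixir
# Build a ring over the cluster's nodes; every node builds the same ring,
# so any node can deterministically compute which node owns a viewer.
ring =
  HashRing.new()
  |> HashRing.add_node(:"app@host-a")
  |> HashRing.add_node(:"app@host-b")

viewer_id = "viewer-123"
owner = HashRing.key_to_node(ring, viewer_id)

# Route the request to the owning node (ViewerSession.next_manifest/1 is
# the hypothetical per-viewer API from my proof of concept)
:rpc.call(owner, ViewerSession, :next_manifest, [viewer_id])
```

The part I can't figure out is the failure mode: when `host-a` vanishes, the ring rebalances and viewers hash to a new owner, but their in-memory state is gone unless it was checkpointed somewhere external. That's the resiliency piece the blog posts gloss over.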
I’d love to hear anyone’s experiences or anecdotes with implementing something on the scale of millions of user sessions, and whether you’d recommend a single global process (backed by a redis cache or something for recovery). And if so, what’s the best way to go about it?