I have a use case here at work that seems like a great fit for a Single Global Process, but (as Chris Keathley points out) actually putting together an SGP system with data consistency and resiliency is very tricky. I know this topic gets brought up regularly, but I figure I’ll see what wisdom Elixir Forum has for this use case.
We track individualized video streams for large streaming events, like people watching some sportsball game on their phones or Roku, that kind of thing. Generally speaking, for each individual viewer, we need to track where they are in the live stream, which ads they’ve viewed, and which ads will be coming up next. Concurrency for big events is already hitting over 1 million viewers pretty regularly, and we expect that number to keep going up.
Today we use lots of stateless web servers that load up a viewer's session data, do a bunch of processing, send the video manifest to the video player, then store that session data back to a zone-wide redis cluster. But this model isn't scaling well, especially as stream latency (time-behind-live) goes down and the video players have to hit our servers more frequently.
We initially explored sticky sessions, which allows for more local caching per node, but implementing sticky sessions comes with its own headaches and doesn’t solve some of our consistency problems anyway. A more consistent solution would be a Single Global Process, where each viewer gets a GenServer that updates its metadata independently of the web requests coming in.
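For context, here's roughly what the per-viewer process looks like in my proof of concept. The module name, registry, and session fields are all illustrative, not our real code:

```elixir
defmodule ViewerSession do
  use GenServer

  # One process per viewer, registered by viewer id in a local Registry
  # (assumes `Registry.start_link(keys: :unique, name: ViewerRegistry)`
  # somewhere in the supervision tree)
  def start_link(viewer_id) do
    GenServer.start_link(__MODULE__, viewer_id, name: via(viewer_id))
  end

  defp via(viewer_id), do: {:via, Registry, {ViewerRegistry, viewer_id}}

  # The web request handler just asks the viewer's process for the next
  # manifest instead of loading/storing session state itself
  def next_manifest(viewer_id) do
    GenServer.call(via(viewer_id), :next_manifest)
  end

  @impl true
  def init(viewer_id) do
    {:ok, %{viewer_id: viewer_id, position: 0, ads_seen: [], ads_upcoming: []}}
  end

  @impl true
  def handle_call(:next_manifest, _from, state) do
    # Advance playback position and rotate ads; the real logic is much hairier
    state = %{state | position: state.position + 1}
    {:reply, build_manifest(state), state}
  end

  defp build_manifest(state), do: {:manifest, state.viewer_id, state.position}
end
```

On a single node this is simple and fast, since the session state never leaves the process. The whole question is what happens when you need a million of these spread across a cluster.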
A quick proof-of-concept showed this to work GREAT on a single node. My first stab at horizontal scaling for this was to throw it on top of Swarm and hope it solved all my problems. Unfortunately a quick load test of that solution implied that it just wasn’t going to scale to the degree I needed (also its process recovery is still a little problematic).
So I searched and read and researched. Discord has a couple of amazing blog posts that make it seem like I should be able to do something like this and make it work. But their blog posts and GitHub READMEs are light on implementation details. Seems like they track processes using their hash ring lib, but I'm not sure how they provide any resiliency with it (e.g. recovering from an EC2 node just disappearing into the ether).
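My reading of the hash ring approach looks something like the sketch below. I'm using bitwalker's libring API here since Discord's own lib doesn't document this flow; the routing and the `ViewerSession.next_manifest/1` call are my speculation about how you'd wire it up, not anything from their posts:

```elixir
# Build a ring over the cluster's nodes; every node builds the same ring,
# so any node can deterministically compute which node owns a viewer.
ring =
  HashRing.new()
  |> HashRing.add_node(:"app@host-a")
  |> HashRing.add_node(:"app@host-b")

viewer_id = "viewer-123"
owner = HashRing.key_to_node(ring, viewer_id)

# Route the request to the owning node (ViewerSession.next_manifest/1 is
# the hypothetical per-viewer API from my proof of concept)
:rpc.call(owner, ViewerSession, :next_manifest, [viewer_id])
```

The part I can't figure out is the failure mode: when `host-a` vanishes, the ring rebalances and viewers hash to a new owner, but their in-memory state is gone unless it was checkpointed somewhere external. That's the resiliency piece the blog posts gloss over.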
I’d love to hear anyone’s experiences or anecdotes with implementing something on the scale of millions of user sessions, and whether you’d recommend a single global process (backed by a redis cache or something for recovery). And if so, what’s the best way to go about it?