To me one of the most important point in the blog post was right at the start and felt like a sidenode:
When adding external persistance to a stateful process it becomes a cache, and with it come all the problems of caches. My goto solution is the “Just don’t use the SGP” option. Just remove the cache and only rely on the persistance layer to hold data for as long as this is a viable solution. Distributed state is hard and that already starts at two nodes. Sadly firenest never got of the runway, because I feel like the realm of 2 or 3 nodes’ setups is one many people could benefit from.
Agreed. Firenest would have been a nice project; using a small cluster really helps with rolling upgrades.
I really like Horde and use it in production but even with Horde you can run into these problems if you rely on node up and down events to update your cluster. With Horde you can use the majority node strategy for the registry but that means you need to manage a canonical set of nodes in some way. Otherwise you run into the problems I described in the “Consistent Hashing” section. If you do allow nodes to come and go based on node events than you’ll have to reconcile state somehow through CRDTs or some other mechanism.
Thanks for sharing.
I thought for some time about this (not particularly your blog post, that I read with your non-projecting voice, but the distributed issues) and it seems like an impossible to solve problem? I mean, at a physics level. The only way is to build in redundancy and external “supervision” (which in turn needs to be redundant as well), but the problem just never really goes 100% away, just the probability of it happening is diminished with every additional redundancy layer?
And some things the persistence layer does not seem like a good “fit” for solving it? Say in a game, you have a global queue for players to join. When N players are in, it starts something for those players and empties the queue so a new “batch” can join, and keeps repeating this procedure. How would you tackle this?
Your thought is technically correct. You can’t eliminate all problems because you’re always dealing with a physical network. But we can significantly reduce the probability of catastrophic failure by adding redundancy and by using proven algorithms for managing state. I recommend checking out this paper for some more insights into this: https://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf.
I think you’re right and for this kind of problem, you don’t need to persist the queue. You probably want to store metrics, but that’s different. The queue state probably falls in the “who cares” category. If the process crashes or the node goes offline, you try again with a new queue on a new node. This happens all the time in games. You still need to solve the problem of discovery. But this could be handled by having the client connect directly to a single game server or by using static clusters and consistent hashing.
Hm, but then I might run into the issue of two parts of the split having a queue and requests being routed to either (that you mentioned in the post), like, the queue is actually being backed up in mnesia so that if the node running the queuer dies another would restart it with those same members (which is wanted if the node/process actually dies), but in case of a split the two sides would start with the same queue that was pre-split, and if both fill up, the players that were in both would be sent to two different events. Perhaps it would actually be better to not try to share the queue between nodes ? So if it splits at least the side that is going to start the queuer will start an empty one, that way not messing it up… I need to read some more on distribute algorithms/prots right now it just goes way above my head.
This is exactly right. I don’t think I fully explained what was in my head. But I had envisioned a system where players would be routed to exactly one queue on one node. If you can’t reach that queue you can’t play the game. If you’d rather not have these periods of unavailability than you can start a new queue on another node. I had also assumed - maybe incorrectly - that the queue state just wasn’t that important. So if you did crash and had to start over you wouldn’t make any attempt to recover the previous state. In that scenario the queue state is ephemeral which means an available solution works exceptionally well.