My 14-yo son and I are playing around with a multiplayer online game idea. Phoenix & Elixir are a good fit for the task, so here we go.
The app would typically be deployed on fly.io or gigalixir.
An important part of the game is that a round can take a bit of time (e.g. 5 minutes), and we do not want players to see their game interrupted because we deployed a change.
Initially the game won’t have a “lobby” (place to gather before launching a game), but we will likely add that later.
I wonder what the available options are in 2022.
I have thought about these ones:
storing the state in Ecto/Postgres is perfectly valid for the type of game
hot upgrades could be doable in that case (the upgrade path is quite clear, the app quite simple), maybe coupled with some form of state snapshot (Mnesia, etc.)
I’d be happy to get some insight from people already doing this today.
Another option is to separate the app with state from the one with logic, making them talk to each other using messages. Then only the logic app is restarted and no state is lost.
However, since Postgres is an option I would recommend that.
ps. Don’t forget that Mix releases do not support hot upgrades. Distillery does support hot upgrade releases.
The idea of splitting logic and state is a good one, yet too much work at this point (but I’ll make sure to separate the code properly should we get there).
All in all I think I agree with you, just going DB given the scale is probably perfectly fine. Thanks!
Fly does zero-downtime blue/green deployment by default. If retaining state isn’t absolutely critical you could attempt a handoff from the old app to the new app before shutdown.
Another option is to use DETS and write data to disk. Fly has persistent volumes for apps that you can provision.
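For illustration, here is a minimal sketch of the DETS idea. The module name `GameSnapshot`, the table name, and the file path are all made up for this example; on Fly the path would presumably point somewhere inside the mounted volume. It only uses `:dets.open_file/2`, `:dets.insert/2`, `:dets.lookup/2`, `:dets.sync/1` and `:dets.close/1` from OTP.

```elixir
defmodule GameSnapshot do
  # Hypothetical helper: keeps one snapshot per game in a DETS file on disk.
  @table :game_snapshots

  # `path` is an assumption, e.g. a file inside the mounted persistent volume.
  def open(path) do
    :dets.open_file(@table, file: String.to_charlist(path), type: :set)
  end

  # Overwrites any previous snapshot for this game and flushes it to disk.
  def save(game_id, state) do
    :ok = :dets.insert(@table, {game_id, state})
    :dets.sync(@table)
  end

  def load(game_id) do
    case :dets.lookup(@table, game_id) do
      [{^game_id, state}] -> {:ok, state}
      [] -> :error
    end
  end

  def close, do: :dets.close(@table)
end
```

The table has to be opened once on boot (for example from the application start callback) before any game process calls `save/2` or `load/1`.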
However, I see no reason not to use Postgres for your son’s use case.
I’m currently doing exactly this with a game I’m developing and deploying to fly.io. Here are the design considerations I’m working with:
Assign a UUID to each game and store it in the process state
For extra guarantees in a cluster you might try to globally register the process using the UUID so it can’t accidentally be started twice on different nodes (I’ve been using Horde.Registry but currently exploring alternatives)
Trap exits and save the state in the Postgres DB on any exit that isn’t :normal (or whatever other reason not to save, for example I use an :expire reason that is considered the end of the game so there’s no handoff)
Monitor the game from the client and keep a reference to the game_id so the client can restart the game if it goes down for certain reasons
Always check the DB in init/1 to see if a state handoff exists and load it when possible
Use the bluegreen deployment strategy so you know there is zero downtime
EDIT: If you need your new VMs to connect to your old VMs during the deployment, then you’ll need to set a static RELEASE_COOKIE
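To make the trap-exit and handoff steps above concrete, here is a rough sketch, not my actual code. `HandoffStore` is a stand-in for the Postgres table (an Agent, so the example runs without Ecto), and `GameServer`, `add_point/1` and `score/1` are made-up names for illustration. Global registration by UUID is left out.

```elixir
defmodule HandoffStore do
  # Stand-in for the Postgres table; in a real app this would go through Ecto.
  use Agent

  def start_link(_opts \\ []), do: Agent.start_link(fn -> %{} end, name: __MODULE__)
  def put(game_id, state), do: Agent.update(__MODULE__, &Map.put(&1, game_id, state))
  # Returns the stored state (or nil) and removes it from the store.
  def pop(game_id), do: Agent.get_and_update(__MODULE__, &Map.pop(&1, game_id))
end

defmodule GameServer do
  use GenServer

  def start_link(game_id), do: GenServer.start_link(__MODULE__, game_id)

  # Hypothetical client API, only here so the state has something to change.
  def add_point(pid), do: GenServer.call(pid, :add_point)
  def score(pid), do: GenServer.call(pid, :score)

  @impl true
  def init(game_id) do
    # Without trapping exits, terminate/2 is not called on a supervisor shutdown.
    Process.flag(:trap_exit, true)

    # Always check for a handoff first and resume from it when one exists.
    state =
      case HandoffStore.pop(game_id) do
        nil -> %{id: game_id, score: 0}
        handed_off -> handed_off
      end

    {:ok, state}
  end

  @impl true
  def handle_call(:add_point, _from, state) do
    state = %{state | score: state.score + 1}
    {:reply, state.score, state}
  end

  def handle_call(:score, _from, state), do: {:reply, state.score, state}

  @impl true
  # :normal and :expire mean the game is over, so nothing is handed off.
  def terminate(reason, _state) when reason in [:normal, :expire], do: :ok
  def terminate(_reason, state), do: HandoffStore.put(state.id, state)
end
```

Keep in mind that `terminate/2` is not guaranteed to run (a brutal kill skips it), which is one reason to also consider saving on every change.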
It gets a little trickier if you are clustering the application and distributing it across geographically distant regions. There might be race conditions where the game restarts on another node before the data becomes available in the local read replica. I have a working solution for that, which I’d be happy to share if it’s relevant to you.
Edit: Another option would be to just persist the game state on every state change so you don’t have to worry about trapping exits and the terminate callback. This might be especially useful if there are a limited number of changes per game. In any case, I have a GenServer doing periodic cleanup of stale game states in the DB.
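A sketch of what the periodic cleanup could look like. To keep it runnable on its own, the saved states live in an ETS table instead of Postgres, which is purely an assumption for the example; with Ecto the sweep would be a delete query on a timestamp column. `StaleGameSweeper`, the table name and the option names are invented here.

```elixir
defmodule StaleGameSweeper do
  use GenServer

  # Stand-in for the DB table: rows are {game_id, state, saved_at_ms}.
  @table :game_states

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Called by a game on every state change; overwrites the previous row.
  def save(game_id, state, now_ms \\ System.monotonic_time(:millisecond)) do
    :ets.insert(@table, {game_id, state, now_ms})
  end

  # Deletes rows saved more than ttl_ms ago and returns how many were removed.
  def sweep(ttl_ms, now_ms \\ System.monotonic_time(:millisecond)) do
    cutoff = now_ms - ttl_ms
    :ets.select_delete(@table, [{{:_, :_, :"$1"}, [{:<, :"$1", cutoff}], [true]}])
  end

  @impl true
  def init(opts) do
    :ets.new(@table, [:named_table, :public, :set])

    state = %{
      interval: Keyword.get(opts, :interval, 60_000),
      ttl: Keyword.get(opts, :ttl, 600_000)
    }

    schedule(state.interval)
    {:ok, state}
  end

  @impl true
  def handle_info(:sweep, state) do
    sweep(state.ttl)
    schedule(state.interval)
    {:noreply, state}
  end

  defp schedule(interval), do: Process.send_after(self(), :sweep, interval)
end
```

The TTL should be comfortably longer than a round, otherwise a live game could lose its saved state between two changes.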
If you are able to segregate state based on how much staleness can be tolerated:
Tier 1: No staleness allowed, data drives active game session, must be hot and accessible all the time to ensure proper gameplay
Tier 2: Staleness is ok, metadata around lobbies and games, not updated every second
Tier 3: Historical game data
Etc.
Then you could combine them and implement some kind of write-through cache with checkpoints.
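One possible reading of that, as a small sketch: Tier 2 metadata is written through to the store on every change, while Tier 1 hot state stays in memory and is only checkpointed every N changes. `TieredState` and its fields are invented names, and `persist` is a hypothetical two-argument function standing in for whatever does the actual DB write.

```elixir
defmodule TieredState do
  # `persist` is a hypothetical fn (key, value) that writes to the backing store.
  defstruct hot: %{}, meta: %{}, dirty: 0, checkpoint_every: 10, persist: nil

  def new(persist, opts \\ []) when is_function(persist, 2) do
    %__MODULE__{
      persist: persist,
      checkpoint_every: Keyword.get(opts, :checkpoint_every, 10)
    }
  end

  # Tier 2: write-through, the store and the in-memory copy change together.
  def put_meta(%__MODULE__{} = t, key, value) do
    t.persist.({:meta, key}, value)
    %{t | meta: Map.put(t.meta, key, value)}
  end

  # Tier 1: memory is updated on every change, the store only every N changes.
  def put_hot(%__MODULE__{} = t, key, value) do
    t = %{t | hot: Map.put(t.hot, key, value), dirty: t.dirty + 1}
    if t.dirty >= t.checkpoint_every, do: checkpoint(t), else: t
  end

  # Writes the whole hot state as one checkpoint and resets the change counter.
  def checkpoint(%__MODULE__{} = t) do
    t.persist.(:checkpoint, t.hot)
    %{t | dirty: 0}
  end
end
```

The trade-off is that a crash loses at most `checkpoint_every - 1` hot changes, so that number encodes how much staleness Tier 1 can really tolerate after a restart.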
Also worth considering: if your deployments are infrequent and game lobbies do not last long, you can simply drain each old instance of the app of its active connections until it can be retired. This design means you do not have to hand state over so frequently.
e.g.
Deploy Version 1 to entire cluster of 2 Nodes
Lobbies started based on code in Version 1 on both Nodes
Deploy Version 2 to 2 new Nodes, set existing 2 Nodes to drain-only, no new lobbies can start
Wait until the games complete on the old Nodes then recycle them
vs.
Deploy Version 1 to entire cluster of 2 Nodes
Apply hot upgrade for Version 2
vs.
Deploy Version 1 to entire cluster of 2 Nodes
Replace each node, games have to migrate to valid hosts (this can take a minute or so)
I think it boils down to whether the game clients can tolerate a handover at all (think shooters vs. puzzles). It is also better to have a design where a borked release does not crash everything.
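For the drain-only variant, the node-side part can be quite small. A sketch under assumptions: `Drain` is a made-up module, the flag is node-local in `:persistent_term`, and `start_fun` stands for whatever really starts a lobby. How new players get routed to the new nodes is up to the platform and not shown.

```elixir
defmodule Drain do
  # Hypothetical node-local switch; :persistent_term is cheap to read on every request.
  @key {__MODULE__, :draining}

  # Flip this on the old nodes once the new version is up.
  def enable, do: :persistent_term.put(@key, true)
  def draining?, do: :persistent_term.get(@key, false)

  # Refuses to start new lobbies while draining; running games are untouched.
  def start_lobby(start_fun) when is_function(start_fun, 0) do
    if draining?(), do: {:error, :draining}, else: start_fun.()
  end

  # A node can be recycled once it is draining and has no games left.
  def retirable?(active_games), do: draining?() and active_games == 0
end
```

With 5-minute rounds the old nodes should become retirable shortly after the switch, which is what makes this cheaper than a handover for short games.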