Available choices to keep state across deployments?

Hi!

My 14-yo son and I are playing around with a multiplayer online game idea. Phoenix & Elixir are a good fit for the task, so here we go.

The app could be deployed on fly.io or gigalixir, typically.

An important part of the game is that a round can take a bit of time (e.g. 5 minutes), and we do not want players to see their game interrupted because we deploy a change.

Initially the game won’t have a “lobby” (place to gather before launching a game), but we will likely add that later.

I wonder what are the available options in 2022.

I have thought about these ones:

  • storing the state in Ecto/Postgres is perfectly valid for the type of game (see the sketch below)
  • hot upgrades could be doable in that case (the upgrade path is quite clear, the app quite simple), maybe coupled with some form of state snapshot (Mnesia, etc.)
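To make the Postgres option concrete, here is roughly what I have in mind; the module and table names are just placeholders:

```elixir
# Hypothetical Ecto schema for snapshotting a game's state between deploys.
defmodule MyGame.GameSnapshot do
  use Ecto.Schema
  import Ecto.Changeset

  schema "game_snapshots" do
    field :game_id, :string
    # The whole game state serialized as a map (stored as jsonb in Postgres).
    field :state, :map
    timestamps()
  end

  def changeset(snapshot, attrs) do
    snapshot
    |> cast(attrs, [:game_id, :state])
    |> validate_required([:game_id, :state])
  end
end
```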

I’d be happy to get some insight from people already doing this today.

Thanks!

3 Likes

Another option is to separate the app with state from the one with logic, making them talk to each other using messages. Then only the logic app is restarted and no state is lost.
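A rough sketch of that split, assuming the state app exposes a plain named GenServer (StateStore is a made-up name here) and the two nodes are connected:

```elixir
# Runs in the "state" app, which is never restarted on a normal deploy.
defmodule StateStore do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  @impl true
  def init(games), do: {:ok, games}

  @impl true
  def handle_call({:get, game_id}, _from, games), do: {:reply, Map.get(games, game_id), games}
  def handle_call({:put, game_id, game}, _from, games), do: {:reply, :ok, Map.put(games, game_id, game)}
end

# From the "logic" app (the one you redeploy), reach the state node by name:
# GenServer.call({StateStore, :"state@state-host"}, {:put, game_id, game})
# GenServer.call({StateStore, :"state@state-host"}, {:get, game_id})
```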

However, since Postgres is an option I would recommend that.

P.S. Don’t forget that Mix releases do not support hot upgrades; Distillery does support hot-upgrade releases.

2 Likes

I had forgotten about that, so thanks for the reminder.

The relevant doc is here.

The idea of splitting logic and state is a good one, yet too much work at this point (but I’ll make sure to separate the code properly should we get there).

All in all I think I agree with you: just going with the DB is probably perfectly fine given the scale. Thanks!

Fly does zero-downtime blue/green deployment by default. If retaining state isn’t absolutely critical you could attempt a handoff from the old app to the new app before shutdown.

Another option is to use DETS and write data to disk. Fly has persistent volumes for apps that you can provision.
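A minimal DETS sketch of that, assuming the volume is mounted at /data (the mount point and module name are assumptions):

```elixir
defmodule GameDisk do
  @table :game_state
  # Path on the Fly persistent volume; adjust to wherever the volume is mounted.
  @path ~c"/data/game_state.dets"

  def open do
    {:ok, @table} = :dets.open_file(@table, file: @path, type: :set)
    :ok
  end

  def save(game_id, state), do: :dets.insert(@table, {game_id, state})

  def load(game_id) do
    case :dets.lookup(@table, game_id) do
      [{^game_id, state}] -> state
      [] -> nil
    end
  end

  # Flush to disk and close before the old VM is shut down.
  def close do
    :ok = :dets.sync(@table)
    :dets.close(@table)
  end
end
```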

However, I see no reason not to use Postgres for your son’s use case :slightly_smiling_face:

6 Likes

I’m currently doing exactly this with a game I’m developing and deploying to fly.io. Here are the design considerations I’m working with:

  • Assign a UUID to each game and store it in the process state
  • For extra guarantees in a cluster you might try to globally register the process using the UUID so it can’t accidentally be started twice on different nodes (I’ve been using Horde.Registry but currently exploring alternatives)
  • Trap exits and save the state to the Postgres DB on any exit that isn’t :normal (or whatever other reasons you don’t want to save on; for example I use an :expire reason that marks the end of the game, so there’s no handoff). See the sketch after this list.
  • Monitor the game from the client and keep a reference to the game_id so the client can restart the game if it goes down for certain reasons
  • Always check the DB in init/1 to see if a state handoff exists and load it when possible
  • Use the blue/green deployment strategy so you know there is zero downtime
  • EDIT: If you need your new VMs to connect to your old VMs during the deployment, then you’ll need to set a static RELEASE_COOKIE
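To tie a few of those bullets together, here is a stripped-down sketch of the game process. The MyGame.Handoff persistence module and the MyGame.GameRegistry name are placeholders, and the Horde.Registry must already be running in your supervision tree:

```elixir
defmodule GameServer do
  use GenServer

  # Register under the game's UUID so the same game can't be started twice in the cluster.
  defp via(game_id), do: {:via, Horde.Registry, {MyGame.GameRegistry, game_id}}

  def start_link(game_id), do: GenServer.start_link(__MODULE__, game_id, name: via(game_id))

  @impl true
  def init(game_id) do
    # Trapping exits guarantees terminate/2 runs when the process is shut down.
    Process.flag(:trap_exit, true)

    # Resume from a handoff snapshot in Postgres if one exists, otherwise start fresh.
    state =
      case MyGame.Handoff.fetch(game_id) do
        nil -> %{id: game_id, players: %{}, round: 1}
        saved -> saved
      end

    {:ok, state}
  end

  @impl true
  def terminate(reason, state) when reason not in [:normal, :expire] do
    # Any abnormal or deploy-related shutdown leaves a snapshot for the next VM to pick up.
    MyGame.Handoff.save(state.id, state)
  end

  def terminate(_reason, _state), do: :ok
end
```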

It gets a little trickier if you are clustering the application and distributing it in geographically distant regions. There might be race conditions where the game could restart on another node before the data becomes available in the local read replica. I have a working solution for that, which I’d be happy to share if it’s relevant to you.

Edit: Another option would be to just persist the game state on every state change so you don’t have to worry about trapping exits and the terminate callback. This might be especially useful if there are a limited number of changes per game. In any case, I have a GenServer doing periodic cleanup of stale game states in the DB.
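For the cleanup part, the periodic sweeper can be as small as this (the hourly interval, the 24-hour cutoff, and MyGame.Handoff.delete_older_than/1 are placeholders):

```elixir
defmodule StaleGameSweeper do
  use GenServer

  @sweep_every :timer.hours(1)

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok) do
    schedule()
    {:ok, nil}
  end

  @impl true
  def handle_info(:sweep, state) do
    # Delete snapshots older than 24 hours; delete_older_than/1 is a hypothetical
    # context function that issues the corresponding Repo.delete_all/1.
    cutoff = DateTime.add(DateTime.utc_now(), -24 * 60 * 60, :second)
    MyGame.Handoff.delete_older_than(cutoff)
    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :sweep, @sweep_every)
end
```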

6 Likes

Thanks @BartOtten @sorentwo and @msimonborg for the input, all valid and interesting!


Hi

If you are able to segregate state based on how much staleness can be tolerated:

  • Tier 1: No staleness allowed, data drives active game session, must be hot and accessible all the time to ensure proper gameplay

  • Tier 2: Staleness is ok, metadata around lobbies and games, not updated every second

  • Tier 3: Historical game data

Etc.

Then you could combine them and implement some kind of write-through cache with checkpoints.
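For the write-through part, a sketch could look like this: the Tier 1 hot copy lives in ETS and every write also checkpoints to Postgres (the Repo and snapshot schema names are assumptions):

```elixir
defmodule GameCache do
  @table :hot_game_state

  # Create the ETS table once at application start.
  def init, do: :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])

  # Tier 1: write the hot copy first so gameplay never waits on the database,
  # then write through to Postgres, which acts as the checkpointed source of truth.
  def put(game_id, state) do
    true = :ets.insert(@table, {game_id, state})
    checkpoint(game_id, state)
  end

  def get(game_id) do
    case :ets.lookup(@table, game_id) do
      [{^game_id, state}] -> state
      [] -> nil
    end
  end

  defp checkpoint(game_id, state) do
    # Hypothetical upsert keyed by game_id (requires a unique index on that column).
    MyGame.Repo.insert!(
      %MyGame.GameSnapshot{game_id: game_id, state: state},
      on_conflict: {:replace, [:state, :updated_at]},
      conflict_target: :game_id
    )
  end
end
```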

Also worth considering: if your deployments are infrequent and game lobbies do not last long, you can simply drain each old instance of the app of all active connections and eventually retire those instances. This design means you do not have to hand state over so often.

e.g.

  • Deploy Version 1 to entire cluster of 2 Nodes
  • Lobbies started based on code in Version 1 on both Nodes
  • Deploy Version 2 to 2 new Nodes, set existing 2 Nodes to drain-only, no new lobbies can start
  • Wait until the games complete on the old Nodes then recycle them

vs.

  • Deploy Version 1 to entire cluster of 2 Nodes
  • Apply hot upgrade for Version 2

vs.

  • Deploy Version 1 to entire cluster of 2 Nodes
  • Replace each node, games have to migrate to valid hosts (this can take a minute or so)

I think it does boil down to whether the game clients can tolerate a handover at all (think shooters vs. puzzles). It is also better to have a design where a borked release does not crash everything.
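Coming back to the drain-only idea above, it can be as simple as a flag the old nodes check before starting a new lobby (the :draining key and the Lobby module are made up for the example):

```elixir
defmodule LobbyGate do
  # Old nodes keep finishing the games they host but refuse to start new lobbies.
  def start_lobby(params) do
    if Application.get_env(:my_game, :draining, false) do
      {:error, :draining}
    else
      MyGame.Lobby.start(params)
    end
  end
end

# Once the new version is serving traffic, flip the flag on each old node,
# e.g. from a remote IEx session:
#   Application.put_env(:my_game, :draining, true)
```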

2 Likes

Just for reference: share the state in a cluster.

Make sure there is always one instance with the data :slight_smile:

1 Like