How to hold a reference to a process even after it is restarted?

Hi, I’m fairly new to Elixir. I wrote a game lobby server with the good old create / join room functionality. Here’s what my process tree looks like:

  • App
    • PlayerSupervisor (DynamicSupervisor)
      • Player (GenServer)
    • RoomSupervisor (DynamicSupervisor)
      • Room (GenServer)

In the Room process, I store a list of pids of Player processes, in order to broadcast messages to the players in the same room. Now, if a Player process crashes, the pid that the Room holds is no longer valid, even though PlayerSupervisor has restarted that Player process. So, how can I hold a reference to a process even after it is restarted by a DynamicSupervisor? Or did I model the application the wrong way?

1 Like

Take a look at Registry. Rather than tracking the pid of the process, you just need to know the key it is registered under. You may even be able to use the broadcast functionality that comes with Registry.dispatch/3.
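For example, a minimal sketch, assuming a Registry named MyApp.PlayerRegistry started with keys: :duplicate, players registering under their room id, and a {:broadcast, msg} message shape (all of these names are assumptions, not from the original post):

```elixir
# Started somewhere in the supervision tree:
#   {Registry, keys: :duplicate, name: MyApp.PlayerRegistry}

defmodule Player do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    room_id = Keyword.fetch!(opts, :room_id)
    # Registration is tied to this process: it is removed on crash and
    # re-created when the supervisor restarts the player.
    {:ok, _} = Registry.register(MyApp.PlayerRegistry, room_id, nil)
    {:ok, %{room_id: room_id}}
  end

  @impl true
  def handle_info({:broadcast, msg}, state) do
    IO.inspect(msg, label: "player got")
    {:noreply, state}
  end
end

defmodule Room do
  # Broadcast to whoever is currently registered under this room id;
  # the Room never stores a single pid.
  def broadcast(room_id, msg) do
    Registry.dispatch(MyApp.PlayerRegistry, room_id, fn entries ->
      for {pid, _value} <- entries, do: send(pid, {:broadcast, msg})
    end)
  end
end
```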

2 Likes

For your use case, I think the best approach would be for the Room to monitor the Player processes and remove a pid from its list when that process dies. The Player process should then re-join its room(s) when it starts up.
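A minimal sketch of that approach; the Room.join/2 API and the state shape are assumptions for illustration:

```elixir
defmodule Room do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  # Called by a Player from its own init/1, including after a restart.
  def join(room, player_pid), do: GenServer.call(room, {:join, player_pid})

  @impl true
  def init(_opts), do: {:ok, %{players: MapSet.new()}}

  @impl true
  def handle_call({:join, pid}, _from, state) do
    # Monitoring (not linking) means a player crash sends us a :DOWN
    # message instead of taking the room down with it.
    Process.monitor(pid)
    {:reply, :ok, %{state | players: MapSet.put(state.players, pid)}}
  end

  @impl true
  def handle_info({:DOWN, _ref, :process, pid, _reason}, state) do
    # Drop the dead pid; the restarted player is expected to call
    # Room.join/2 again from its init/1.
    {:noreply, %{state | players: MapSet.delete(state.players, pid)}}
  end
end
```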

2 Likes

Thanks for the reply. But what happens if the Player process crashes when using Registry? The documentation says:

Each entry in the registry is associated to the process that has registered the key. If the process crashes, the keys associated to that process are automatically removed. All key comparisons in the registry are done using the match operation (===/2).

So when the Player process crashes, its key will just be removed from the Registry. What I want is to retain a reference to the Player process, so that the Room process does not even need to know the Player process has been restarted by its supervisor.

You do not specify whether you use Phoenix or just plain Elixir.

If you use Phoenix and websockets, then the whole Player system could be replaced by Presence.
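For reference, a minimal Presence sketch, assuming a Phoenix channel per room and a user_id already assigned on the socket (module names and topic format here are hypothetical):

```elixir
defmodule MyAppWeb.Presence do
  use Phoenix.Presence,
    otp_app: :my_app,
    pubsub_server: MyApp.PubSub
end

defmodule MyAppWeb.RoomChannel do
  use Phoenix.Channel
  alias MyAppWeb.Presence

  def join("room:" <> _room_id, _params, socket) do
    send(self(), :after_join)
    {:ok, socket}
  end

  def handle_info(:after_join, socket) do
    # Presence tracks this channel process; when the socket dies,
    # a leave is broadcast automatically, so no manual cleanup.
    {:ok, _} =
      Presence.track(socket, socket.assigns.user_id, %{
        joined_at: System.system_time(:second)
      })

    push(socket, "presence_state", Presence.list(socket))
    {:noreply, socket}
  end
end
```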

Is there really a need to store player state, apart from tracking when players join or leave? They don't really have any state worth preserving.

Tracking those events (join/leave) allows you to store players in a room property, like… who's in. It also allows the room to check when there are no more players, or to put the game in an idle state when a user leaves.

E.g., any board or card game must be paused when a user leaves (by accident or not). It's even more important if you have a time-based game.

The room state is really important; I would use something like this:

  • Registry
  • Room Supervisor
    • Room Manager (GenServer)
    • Room Worker Supervisor (Dynamic)
      • Room (GenServer)
    • ETS_Cache (GenServer)

The Room Supervisor is in charge of restarting the Room Manager (and maybe the Room Worker Supervisor and ETS_Cache).
The Room Manager is in charge of most room restart/cleanup logic.
The Room Worker Supervisor is as dumb as possible.
The ETS_Cache server keeps a copy of room state, to allow a restart in case of a room crash.
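A minimal sketch of that tree; all module names here are assumptions:

```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      {Registry, keys: :unique, name: MyApp.RoomRegistry},
      MyApp.RoomSupervisor
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end

defmodule MyApp.RoomSupervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    children = [
      # Holds the restart/cleanup logic for rooms.
      MyApp.RoomManager,
      # As dumb as possible: only starts/stops Room workers on demand.
      {DynamicSupervisor, name: MyApp.RoomWorkerSupervisor, strategy: :one_for_one},
      # Keeps a copy of each room's state so a restarted room can recover it.
      MyApp.EtsCache
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```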

It is a bit more complex, but it should cover any crash in any part of the system.

A good test: run your system, open Observer, kill any process, and see what happens :slight_smile:

1 Like

Instead of a pid, you can communicate with Player processes using via-tuples. This is a layer of indirection that allows the Room not to care whether Player processes are restarted. docs
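A minimal sketch, assuming a Registry named MyApp.PlayerRegistry with keys: :unique and a stable player_id per player (names are illustrative, not from the original post):

```elixir
defmodule Player do
  use GenServer

  def start_link(player_id) do
    GenServer.start_link(__MODULE__, player_id, name: via(player_id))
  end

  # The via-tuple resolves to whatever pid currently owns the key,
  # so callers never hold a raw pid.
  def via(player_id), do: {:via, Registry, {MyApp.PlayerRegistry, player_id}}

  @impl true
  def init(player_id), do: {:ok, %{id: player_id}}

  @impl true
  def handle_call(:ping, _from, state), do: {:reply, :pong, state}
end

# The room only stores player ids; this call works across restarts:
#   GenServer.call(Player.via("player-42"), :ping)
```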

I kinda understand what you mean. However, the Room Manager part is not clear to me. If the Room Worker Supervisor is responsible for restarting Room processes, how does the Room Manager handle restart/cleanup logic? Or, when the supervisor restarts a process (spawns a new process B when A crashes), how does the Room Manager know that process B is the restarted A?

I did not say the Room Worker Supervisor is responsible for restarting Room processes.
I said the Room Worker Supervisor is as dumb as possible.

You should be able to differentiate between a crash and a normal exit. For example, after someone wins the game, the room should not be restarted.

You can achieve this by

  • Setting workers to temporary
  • Monitoring workers from the manager
  • Trapping exits in the manager
  • Checking whether the exit is normal or not
    • normal -> clean up state, maybe persist the game in the db
    • abnormal -> restart the worker from the manager, using the dynamic supervisor

The manager can also link to consumer processes… thus it can detect when a consumer goes down.
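A minimal sketch of the manager side of this (module names are assumptions, and the Room worker itself is omitted):

```elixir
defmodule RoomManager do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def start_room(room_id), do: GenServer.call(__MODULE__, {:start_room, room_id})

  @impl true
  def init(_opts) do
    # Trapping exits lets the manager also survive and observe exits
    # from any consumer processes it links to.
    Process.flag(:trap_exit, true)
    {:ok, %{rooms: %{}}}
  end

  @impl true
  def handle_call({:start_room, room_id}, _from, state) do
    {:reply, :ok, do_start(room_id, state)}
  end

  @impl true
  def handle_info({:DOWN, _ref, :process, pid, reason}, state) do
    {room_id, rooms} = Map.pop(state.rooms, pid)
    state = %{state | rooms: rooms}

    case reason do
      # Normal exit (e.g. someone won): clean up, maybe persist to the db.
      r when r in [:normal, :shutdown] -> {:noreply, state}
      # Crash: start a fresh worker for the same room.
      _abnormal -> {:noreply, do_start(room_id, state)}
    end
  end

  defp do_start(room_id, state) do
    # :temporary means the DynamicSupervisor never restarts the worker
    # itself; the manager alone decides, based on the exit reason.
    spec = %{id: Room, start: {Room, :start_link, [room_id]}, restart: :temporary}
    {:ok, pid} = DynamicSupervisor.start_child(RoomWorkerSupervisor, spec)
    Process.monitor(pid)
    %{state | rooms: Map.put(state.rooms, pid, room_id)}
  end
end
```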

There is a similar example in this book describing the poolboy library. Although it is not up to date (the example uses simple_one_for_one instead of a dynamic supervisor), it is very instructive.

Often you will see this combination of top supervisor, manager, worker supervisor, and workers, like in this picture…

And it is kind of fractal, because each worker could potentially be another combination of supervisor, manager, worker supervisor, and workers…

I recommend this video if you are interested in game servers.

1 Like