Spreading out genservers on hosts of different flavour

markusgod · October 27, 2024, 11:30pm

I was reading Discord’s engineering write-ups, and while the fanout of Sessions and Guilds seems straightforward at a high level, I am now curious about some details. Since they are running BEAM VM instances and using a Guild process as a stateful container for server information, what happens if a Guild process crashes? What if a node holding multiple Guild processes crashes? Did they also write a Kubernetes-like process manager for BEAM that will spread out Guilds across the cluster? Are there any Elixir/Erlang/OTP built-in constructs for such tasks? I’m sorry if this sounds too focused on Discord; it is just the closest point of reference for me right now. What I want to understand is how orchestration should be built when an app fully follows this kind of actor model.

RudManusachi · October 28, 2024, 5:19am

Hi, Mark!

I think there are a couple of orthogonal questions.

I don’t know exactly how it’s set up at Discord, so the following are just assumptions.

what happens if a Guild process crashes?

I would expect that the Guild process is supervised, so it will get restarted according to the supervisor’s restart strategy.
User sessions connected to a Guild might monitor the Guild process, so they are notified when it crashes or restarts and can reestablish the connection afterwards. Or something like :pg (process groups) is used to determine members, etc.
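
To make that concrete, here is a minimal sketch of the supervise-and-monitor idea. The Guild and Session modules, the :guild_1 name and the 50 ms retry delay are all made up for illustration; this is not Discord’s code, just one way the pattern could look.

```elixir
defmodule Guild do
  # Hypothetical stand-in for a guild process, registered under a local name.
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, name, name: name)

  @impl true
  def init(name), do: {:ok, %{name: name}}
end

defmodule Session do
  # Hypothetical session that monitors a guild by name and re-monitors
  # it after the supervisor has restarted it.
  use GenServer

  def start_link(guild_name), do: GenServer.start_link(__MODULE__, guild_name)
  def state(pid), do: GenServer.call(pid, :state)

  @impl true
  def init(guild_name), do: {:ok, connect(%{guild: guild_name, downs_seen: 0})}

  @impl true
  def handle_call(:state, _from, state), do: {:reply, state, state}

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, %{ref: ref} = state) do
    # The guild went down (or was not registered yet): retry shortly.
    Process.send_after(self(), :reconnect, 50)
    {:noreply, %{state | downs_seen: state.downs_seen + 1}}
  end

  def handle_info(:reconnect, state), do: {:noreply, connect(state)}

  # Monitoring a name that is not registered delivers :DOWN immediately,
  # so the retry loop keeps going until the guild is back.
  defp connect(state), do: Map.put(state, :ref, Process.monitor(state.guild))
end

# The supervisor restarts the guild whenever it crashes.
{:ok, _sup} = Supervisor.start_link([{Guild, :guild_1}], strategy: :one_for_one)
{:ok, _session} = Session.start_link(:guild_1)
```

The session never holds on to the guild’s pid, only its name, so it does not care that the restarted guild is a different process.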

What if a node holding multiple Guild processes crashes? Did they also write a Kubernetes-like process manager for BEAM that will spread out Guilds across the cluster?

Well… if an Erlang node crashes, then yeah, as far as I know it’s on K8s to restart it. How they distribute those nodes across the cluster is more a matter of their DevOps setup.

Are there any Elixir/Erlang/OTP built-in constructs for such tasks?

AFAIK, no, that’s a different layer of infrastructure.

Maybe this article could shed some more light

markusgod · October 28, 2024, 6:55am

Thank you for the explanations! This is indeed a bunch of semi-related questions that are probably more about system design with the actor model.
The main bottleneck in their design, as I understand it, is the “guild”, a GenServer that can have anywhere from a couple of users up to 5 million, so I wonder how they distribute them across physical servers. There are several dimensions to the hash-ring spread, but they are probably doing some smart placement based on size to maximise utilisation. So you can have hundreds of small GenServers on one node and just a couple of big ones on another. That is what I meant by k8s-like, but it is a very bad analogy.
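
To illustrate the difference between the two ideas, here is a rough sketch. The Placement module and its function names are hypothetical, and I have no idea whether Discord does anything like this. by_hash uses rendezvous hashing (a simpler cousin of a hash ring) and ignores size, while by_size is a naive greedy packing that puts the biggest guilds first onto the least loaded node.

```elixir
defmodule Placement do
  # Rendezvous (highest-random-weight) hashing: a guild goes to the node
  # with the highest hash of {node, guild_id}. Guild size is ignored.
  def by_hash(guild_id, nodes) do
    Enum.max_by(nodes, fn node -> :erlang.phash2({node, guild_id}) end)
  end

  # Greedy size-aware placement: take guilds from largest to smallest and
  # put each one on the node that currently carries the fewest members.
  def by_size(guilds, nodes) do
    loads = Map.new(nodes, &{&1, 0})

    {assignment, _loads} =
      guilds
      |> Enum.sort_by(fn {_id, size} -> size end, :desc)
      |> Enum.reduce({%{}, loads}, fn {id, size}, {acc, loads} ->
        {node, _load} = Enum.min_by(loads, fn {_node, load} -> load end)
        {Map.put(acc, id, node), Map.update!(loads, node, &(&1 + size))}
      end)

    assignment
  end
end

guilds = [{:big, 5_000_000}, {:a, 10}, {:b, 20}, {:c, 30}]
assignment = Placement.by_size(guilds, [:n1, :n2])
# The big guild gets a node to itself; the three small ones share the other.
```

The hash version needs no shared state, since every node can compute the same answer on its own. The size-aware version needs someone to know the guild sizes and keep the assignment somewhere, which is roughly the scheduler-like part of the question.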

cmo · October 28, 2024, 10:50am

You can have a backup running in another location that monitors a process and takes over when it crashes.

This is a good watch: https://m.youtube.com/watch?v=pQ0CvjAJXz4
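
A bare-bones sketch of the backup idea, assuming a hypothetical Standby module (the name and the :standby/:active roles are mine, not from the talk). Process.monitor/1 also accepts pids on other nodes, so the backup can live elsewhere in the cluster.

```elixir
defmodule Standby do
  # Hypothetical backup process: sits idle, monitors the primary,
  # and flips to :active when the primary goes down.
  use GenServer

  def start_link(primary), do: GenServer.start_link(__MODULE__, primary)
  def role(pid), do: GenServer.call(pid, :role)

  @impl true
  def init(primary) do
    ref = Process.monitor(primary)
    {:ok, %{role: :standby, ref: ref}}
  end

  @impl true
  def handle_call(:role, _from, state), do: {:reply, state.role, state}

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, %{ref: ref} = state) do
    # Primary is gone: take over. Real code would also need to recover
    # or rebuild the primary's state at this point.
    {:noreply, %{state | role: :active}}
  end
end

# Stand-in primary for the example: a process that just sleeps.
primary = spawn(fn -> Process.sleep(:infinity) end)
{:ok, standby} = Standby.start_link(primary)
```

Keep in mind that a lost connection to the primary’s node also triggers :DOWN (with reason :noconnection), so on its own this cannot tell a crash from a network partition.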

markusgod · October 28, 2024, 3:43pm

Is there a way to monitor resource usage by a GenServer?
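
One possible starting point, as a sketch: Process.info/2 returns per-process counters such as memory (in bytes), mailbox length and reductions. The Usage module below is a hypothetical wrapper, and it only works for pids on the local node.

```elixir
defmodule Usage do
  # Snapshot of a few per-process counters as a map.
  # Process.info/2 returns nil if the process is no longer alive.
  def snapshot(pid) do
    case Process.info(pid, [:memory, :message_queue_len, :reductions]) do
      nil -> :dead
      info -> Map.new(info)
    end
  end
end

# Any local process works here; an Agent is used as a stand-in GenServer.
{:ok, pid} = Agent.start_link(fn -> 0 end)
Usage.snapshot(pid)
```

Reductions are a rough proxy for CPU work done by the process, so sampling the snapshot periodically and looking at the difference gives a sense of how busy a given GenServer is.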