I run an agency and we have multiple projects. In most of them we don’t actually need any clustering, since we often run on a pretty standard single-database-multiple-web-nodes or even single-database-single-web-node setup (yeah, Elixir is fast!). But there are projects where we do use clustering too.
Let’s start with the use cases I see:
The most common use case is that we want PubSub (as in Phoenix PubSub) functionality shared across the cluster, without the need for an additional dependency (like Redis). Phoenix Channels use this by default, but they are not the only consumer. We also use the Absinthe GraphQL server, and when you do real-time push updates this way (with GraphQL subscriptions), you often have to trigger them with a PubSub message as well, and you want all clients, connected to all nodes, to receive that update. The third, also PubSub-related, use case is LiveView. It’s similar to GraphQL subscriptions in that whenever some event happens, say a certain record gets updated, all currently connected LiveViews should receive a message and do something like render the new version of the updated record. We do that with PubSub too, and when our nodes are clustered this is a no-brainer in both usage and configuration.
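To illustrate the idea of "one broadcast reaches subscribers on every node", here is a minimal sketch. Note that it is not Phoenix PubSub itself: I'm using OTP's built-in `:pg` process groups (OTP 23+) so the example needs no dependencies. The module name `ClusterBroadcast`, the scope name and the topic are all made up for the example. In a Phoenix app you would use `Phoenix.PubSub.subscribe/2` and `Phoenix.PubSub.broadcast/3` instead.

```elixir
defmodule ClusterBroadcast do
  # Hypothetical scope name; every node must start the same scope.
  @scope :cluster_broadcast_demo

  # Start the :pg scope (normally done from your supervision tree).
  def start_link, do: :pg.start_link(@scope)

  # Subscribe the calling process to a topic.
  def subscribe(topic), do: :pg.join(@scope, topic, self())

  # Send the message to every subscriber the scope knows about,
  # including processes living on other connected nodes.
  def broadcast(topic, message) do
    for pid <- :pg.get_members(@scope, topic), do: send(pid, message)
    :ok
  end
end
```

A LiveView-like process would call `ClusterBroadcast.subscribe("record:42")` and then handle `{:record_updated, 42}` as a regular message, no matter which node called `broadcast/2`.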
After PubSub, the second use case for clustering, I think, is the need for a cluster-wide lock, i.e. a critical section. Or, more generally speaking: limiting the concurrency of something cluster-wide to one or N concurrent actions of a given kind. For example, if you track usage of your system and have pay-as-you-go or tiered billing plans, you want to warn users when they are approaching their limit, and then maybe suspend them, switch their plan, or apply extra charges once they have exceeded it. You may want these events to happen precisely once. Or you may want to throttle a user when they exceed some limit of API calls. You can usually do these things without clustering and rely on something like locking records in the database, but that has its own disadvantages, for example that your database connections stay locked for a long time.
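A cluster-wide critical section could be sketched with `:global.trans/4` from OTP, roughly like below. The `ClusterLock` module, the `with_lock/2` name and the retry count of 5 are my own choices for the example, not something from a library.

```elixir
defmodule ClusterLock do
  # Runs `fun` while holding a lock on `resource` across all
  # currently connected nodes. Gives up after a few retries.
  def with_lock(resource, fun) when is_function(fun, 0) do
    # The lock id pairs the resource with whoever requests the lock.
    lock_id = {resource, self()}
    nodes = [node() | Node.list()]

    case :global.trans(lock_id, fun, nodes, 5) do
      # Someone else held the lock for all 5 attempts.
      :aborted -> {:error, :locked}
      result -> {:ok, result}
    end
  end
end
```

Usage would be something like `ClusterLock.with_lock({:usage_warning, user_id}, fn -> ... end)`. Keep in mind the lock only serializes the work; to get "precisely once" you would still check inside the function whether the warning was already sent.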
The third group is things that can go wrong across multiple nodes. For example, you may need a circuit breaker that shuts off the part of the system that talks to an API once that API starts timing out, or you may need to rate limit usage of an external API across the cluster. Again, it’s probably possible to do this with something like Redis or database locking, but it’s just so much easier and more natural to do it with Elixir processes in a cluster.
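As a rough sketch of the rate limiting case: a single GenServer registered under a `:global` name, so every node talks to the same counter. Everything here (the `ClusterRateLimiter` name, the `:limit` and `:window_ms` options, the simple fixed-window algorithm) is an assumption made for illustration; a real setup would also need to think about what happens when the node hosting the process goes away.

```elixir
defmodule ClusterRateLimiter do
  use GenServer

  # Registered via :global, so there is one instance per cluster.
  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: {:global, __MODULE__})
  end

  # Returns true if another call for `key` fits in the current window.
  def allow?(key), do: GenServer.call({:global, __MODULE__}, {:allow?, key})

  @impl true
  def init(opts) do
    {:ok,
     %{
       limit: Keyword.fetch!(opts, :limit),
       window_ms: Keyword.fetch!(opts, :window_ms),
       window_start: now(),
       counts: %{}
     }}
  end

  @impl true
  def handle_call({:allow?, key}, _from, state) do
    state = maybe_reset(state)
    count = Map.get(state.counts, key, 0)

    if count < state.limit do
      {:reply, true, %{state | counts: Map.put(state.counts, key, count + 1)}}
    else
      {:reply, false, state}
    end
  end

  # Start a fresh window (and forget all counts) once the old one expires.
  defp maybe_reset(state) do
    if now() - state.window_start >= state.window_ms do
      %{state | window_start: now(), counts: %{}}
    else
      state
    end
  end

  defp now, do: System.monotonic_time(:millisecond)
end
```

Any node can then call `ClusterRateLimiter.allow?(:external_api)` before hitting the external API. A circuit breaker could follow the same shape, with the state tracking recent failures instead of call counts.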
And finally, caching and keeping the system “hot”, as in warmed up after deployments. If you need to keep something in memory (versus in a database or an external service), you can duplicate it on all of the nodes. You can do that without clustering. But you can also form a cluster and have a cluster-wide cache, provided your node-to-node connections are fast (which they should be). Or you can mix both approaches, keeping some things on hand, in memory, on all nodes in the cluster, and other things local to a particular node. What then becomes interesting is the ability to pass that cache on to newly started instances. If you release to prod often, you may find your system needs some “warmup” time, when it doesn’t yet know what’s going on, has no caches, and is building them up as clients make requests. Starting from a blank slate like this may not be desirable, and it can even lead to serious performance issues if you deploy during high-traffic hours. So, by briefly forming a cluster between the nodes that are shutting down and the nodes that are starting up, you can pass the relevant state, as in caches/counters/processes, from the old version of the application to the new one.
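The state handoff could look roughly like this: a new cache process asks an already running one for a snapshot during `init/1`. This is only a sketch under my own assumptions; `WarmCache`, the `:name` and `:warm_from` options and the 5 second timeout are invented for the example, and real handoff during deploys needs more care (ordering, versioned data, large state).

```elixir
defmodule WarmCache do
  use GenServer

  # Options (both hypothetical): :name to register under, and
  # :warm_from, a server reference such as {WarmCache, :"old@host"}.
  def start_link(opts \\ []) do
    name = Keyword.get(opts, :name, __MODULE__)
    GenServer.start_link(__MODULE__, opts, name: name)
  end

  def put(server, key, value), do: GenServer.call(server, {:put, key, value})
  def get(server, key), do: GenServer.call(server, {:get, key})

  @impl true
  def init(opts) do
    # Start with a copy of the peer's state if one was given.
    {:ok, fetch_snapshot(Keyword.get(opts, :warm_from))}
  end

  @impl true
  def handle_call({:put, key, value}, _from, state) do
    {:reply, :ok, Map.put(state, key, value)}
  end

  def handle_call({:get, key}, _from, state) do
    {:reply, Map.get(state, key), state}
  end

  def handle_call(:dump, _from, state), do: {:reply, state, state}

  defp fetch_snapshot(nil), do: %{}

  defp fetch_snapshot(peer) do
    GenServer.call(peer, :dump, 5_000)
  catch
    # Peer is gone or too slow: fall back to a cold start.
    :exit, _reason -> %{}
  end
end
```

On a freshly deployed node you would start it with something like `WarmCache.start_link(warm_from: {WarmCache, old_node})`, where `old_node` is a node that is about to shut down.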
When it comes to what platform we use:
We have been doing that on dedicated hardware, on EC2 instances (with ECS), and recently also on Gigalixir. The last option is definitely the easiest to set up, as they have figured most of these things out, including passing state from shutting-down nodes to starting nodes (which I learned is possible only recently, silly me!). Unfortunately I have no experience with your stack :/.
Hope that’s helpful!