Does anyone have experience, advice, or thoughts on running an Elixir/Erlang cluster distributed across multiple AWS availability zones (AZs), but within a single region?
I want to replicate the deployment architecture detailed in the AWS start kit templates, as shown below, for an Elixir application. This consists of an application load balancer distributing incoming traffic between two auto scaling groups residing within two separate availability zones, both within one geographic AWS region.
One approach to forming the cluster on node start is to use the AWS command line tools to fetch the IPs of the running instances in the Auto Scaling group (or of instances carrying a custom tag you set) and use them to populate a .hosts.erlang file, as described in "Clustering your Elixir application on AWS inside an Auto Scaling Group".
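To make the .hosts.erlang step concrete, here is a minimal sketch in Python of the formatting part only. It assumes you have already collected the instances' private IPv4 addresses, for example with something like `aws ec2 describe-instances --query 'Reservations[].Instances[].PrivateIpAddress' --output text`. The function name hosts_erlang is made up for illustration and is not from the linked article. The file format itself is simple: one quoted host per line, each terminated by a period.

```python
import ipaddress


def hosts_erlang(ips):
    """Render .hosts.erlang contents for the given private IPv4 addresses.

    Hypothetical helper: `ips` is any iterable of strings, e.g. the
    whitespace-split output of an AWS CLI query.
    """
    # Parse every entry so that a malformed address fails loudly
    # (ValueError) instead of ending up in the file.
    parsed = {ipaddress.IPv4Address(ip.strip()) for ip in ips if ip.strip()}

    # Each line is an Erlang term: a quoted atom followed by a period.
    # Sorting numerically keeps the output stable between runs.
    return "".join(f"'{ip}'.\n" for ip in sorted(parsed))


if __name__ == "__main__":
    # Example addresses only; real ones would come from the AWS CLI.
    print(hosts_erlang(["10.0.2.7", "10.0.1.5", "10.0.1.5"]), end="")
```

Write the result to .hosts.erlang in the node's working directory or in the home directory of the user running it, and net_adm:world() (:net_adm.world() from Elixir) will try to connect to the nodes it finds on those hosts. The nodes still need to share the same cookie, and the security groups must allow epmd and the distribution ports between the subnets.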
I cannot find much advice about whether multi-AZ is a good idea for a distributed Erlang cluster. Would standard Erlang message passing be acceptable to use, or would it be necessary to use an alternative way of messaging between the AZs?
Thank you for bringing this up! We are doing more or less the same via AWS Elastic Beanstalk, but we use the https://github.com/kyleaa/libcluster_ec2 strategy to detect the nodes. We have not used the setup in production yet, so we are also curious to hear whether it is a good idea or not.
I think that for a single-region app it is acceptable to have the nodes form one cluster per AZ via Auto Scaling groups. This is better because it gives you more room to breathe for updates and debugging in case of an error. This method will limit message passing to nodes in the same AZ (assuming message passing works as it does in Elixir). When it comes to scaling, you need service discovery, which you are already doing with .hosts.erlang.
The database will need some form of replication; it's ideal to try a hosted service, or alternatively just set up replication between the databases. Anything important will be written to the DB and made available to the other cluster once replicated. When you want to increase HA by adding another region, I would then recommend bonding the AZs by joining two nodes together and correcting the firewall config to allow the subnets to communicate, maybe by using tags. Then create the same setup in the other region. This provides multiple levels of reliability.
AZ1 --joined-- AZ2 ======= AZ1 --joined-- AZ2