Deployment process with load balancer, mix release, multiple nodes, and distributed Elixir

We have an Elixir Phoenix application currently deployed in a highly available configuration on AWS. We use an Application Load Balancer (layer 7) backed by 2+ individual EC2 instances that are all members of an auto-scaling group. Each instance runs a version of our code base that was built via mix release. All EC2 instances talk to the same Postgres database.

We use AWS CodeDeploy to update the instances one at a time (not blue/green) with the new release. CodeDeploy handles the logistics: pausing new traffic to an instance, taking it out of service, cleaning up and deploying the new release, testing the app after startup, and putting the EC2 instance back into service on the ALB.

We do NOT use sticky sessions, and we allow enough time between the ALB blocking traffic to a node and the actual app shutdown on that node for any in-flight API requests to complete. This way, nodes/EC2 instances can come and go from the target group with minimal interruption to the application.

Right now, each EC2 instance/node does NOT know about the others. There is some concern about Cachex, as well as things like LiveDashboard not recognizing all nodes.

We are looking into using libcluster with the EC2 tagging cluster strategy (kyleaa/libcluster_ec2 on GitHub) to resolve those issues and to form a true Elixir cluster.
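For reference, here is roughly what I understand the configuration for that strategy to look like. This is only a sketch based on my reading of the libcluster_ec2 README: the topology name `my_app_ec2` and the tag name/value are made up for illustration, and the option names should be checked against the version of the library actually installed.

```elixir
# config/runtime.exs
import Config

config :libcluster,
  topologies: [
    # `my_app_ec2` is an arbitrary topology name chosen for this example.
    my_app_ec2: [
      strategy: ClusterEC2.Strategy.Tags,
      config: [
        # Hypothetical tag: every instance in the auto-scaling group is
        # assumed to carry the tag cluster=my_app_prod.
        ec2_tagname: "cluster",
        ec2_tagvalue: "my_app_prod"
      ]
    ]
  ]
```

libcluster itself is then started from the application's supervision tree with a child like `{Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]}`, where `topologies` is read from the config above. My assumption is that the instances also need IAM permission to describe EC2 instances, and that each release has to start with a node name matching what the strategy builds from the instance IP.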

My question is: are there any additional deployment considerations with libcluster for this setup? For example, does a node somehow need to signal to the rest of the cluster that it's going away so all its tasks are dealt with gracefully? Or is the act of stopping the app itself (we use unit files to call the release stop command, so .../bin/MyApp stop) enough?
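To make the "signal" part of the question concrete: as far as I understand, the Erlang distribution layer already tells the remaining nodes when a connected node disappears, and any process can subscribe to those events with `:net_kernel.monitor_nodes/1`. Below is a minimal sketch of such a subscriber. `MyApp.NodeWatcher` and `known_nodes/0` are hypothetical names, and the module only logs and tracks membership; it does not do any handoff.

```elixir
defmodule MyApp.NodeWatcher do
  @moduledoc """
  Hypothetical example: tracks which nodes are connected by subscribing
  to the :nodeup / :nodedown messages sent by the distribution layer.
  """
  use GenServer
  require Logger

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  # Returns the nodes this process currently believes are connected.
  def known_nodes, do: GenServer.call(__MODULE__, :known_nodes)

  @impl true
  def init(_opts) do
    # Ask net_kernel to send {:nodeup, node} / {:nodedown, node} messages
    # to this process. The return value is ignored on purpose, because the
    # call may not succeed on a node started without distribution.
    :net_kernel.monitor_nodes(true)
    {:ok, MapSet.new(Node.list())}
  end

  @impl true
  def handle_call(:known_nodes, _from, nodes) do
    {:reply, MapSet.to_list(nodes), nodes}
  end

  @impl true
  def handle_info({:nodeup, node}, nodes) do
    Logger.info("node joined: #{inspect(node)}")
    {:noreply, MapSet.put(nodes, node)}
  end

  def handle_info({:nodedown, node}, nodes) do
    Logger.info("node left: #{inspect(node)}")
    {:noreply, MapSet.delete(nodes, node)}
  end
end
```

With something like this in the supervision tree, a rolling deploy should show up on the surviving nodes as a `:nodedown` followed by a `:nodeup` for each replaced instance, without the departing node having to announce anything itself.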


In my experience, libcluster itself won’t bring you any extra requirements. Your usage of the cluster, though, may require some handoff or graceful shutdown.
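As one illustration of a graceful-shutdown hook, here is a minimal sketch of a process that hands its in-memory work to someone else before it exits. The module name `MyApp.DrainableWorker` and the `:handoff` option are invented for this example. It relies on standard OTP behaviour: stopping the release shuts the applications down, supervisors ask their children to exit, and a GenServer that traps exits gets `terminate/2` called within its shutdown timeout.

```elixir
defmodule MyApp.DrainableWorker do
  @moduledoc """
  Hypothetical example: buffers items in memory and passes them to a
  caller-supplied handoff function when the process is shut down.
  """
  use GenServer

  # opts must contain :handoff, a 1-arity function that receives the
  # buffered items (for example, one that writes them to the database).
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def add(pid, item), do: GenServer.cast(pid, {:add, item})

  @impl true
  def init(opts) do
    # Trap exits so terminate/2 runs when the supervisor shuts us down.
    Process.flag(:trap_exit, true)
    {:ok, %{items: [], handoff: Keyword.fetch!(opts, :handoff)}}
  end

  @impl true
  def handle_cast({:add, item}, state) do
    {:noreply, %{state | items: [item | state.items]}}
  end

  @impl true
  def terminate(_reason, state) do
    # Hand the buffered items over in the order they were added.
    state.handoff.(Enum.reverse(state.items))
    :ok
  end
end
```

If every process is stateless and nothing is buffered in memory like this, there is nothing to hand off and the plain release stop command should be sufficient.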

Thanks, @v0idpwn!

May I ask for some examples of what you mean by “usage of the cluster”? Do you mean things like additional error handling within the application logic for cases that may arise from nodes coming and going?

I should have added that in our case, we are not concerned with any type of quorum within our business logic. Any persistent data needs are met in the database or with outside object storage like S3. In other words, our nodes are stateless.

Or are you referring to state hand-off, like what's mentioned here: Are you using a clustered Elixir deployment? - #9 by hubertlepicki