Graceful Restart of an Elixir v1.9 Release

Hello!

I’m working on a Phoenix app, but my question is more related to Elixir v1.9 releases. I’ve set up an Ansible playbook which builds a release with `mix release` and uploads it to a production server. So at the time of a deployment, there are actually two releases on the server:

  1. The current one, which is up and running.
  2. A new one, which has been deployed with Ansible but is not running yet.

What I do now is find the currently running release and use the stop command on it. But I feel like there should be a better way to gracefully “replace” an old release with a new one, so that there is as little downtime as possible during the restart. For example, in my case I need to run migrations for the new release, which may take a while.
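For reference, here is roughly how I run the migrations from the release itself, since Mix isn’t available in production. This is a sketch of the usual eval-task pattern, with `MyApp`/`my_app` standing in for the real names:

```elixir
defmodule MyApp.Release do
  @moduledoc "Tasks that have to run inside the release, where Mix is not available."
  @app :my_app

  def migrate do
    # Load the app so its config (including :ecto_repos) is available.
    Application.load(@app)

    for repo <- Application.fetch_env!(@app, :ecto_repos) do
      # Start just the repo and run all pending migrations against it.
      {:ok, _, _} = Ecto.Migrator.with_repo(repo, &Ecto.Migrator.run(&1, :up, all: true))
    end
  end
end
```

and then `bin/my_app eval "MyApp.Release.migrate()"` runs them as part of the deploy.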

Do you have any ideas on how to handle it? Thanks!

1 Like

You’ve got two options: hot code loading or a rolling deploy. With hot code loading you use one of Erlang’s unique features whereby you simply replace the running code with new code, and there are hooks for migrating data. The challenge with this approach is that you not only need to think about migrating database data, you also need to migrate any data you have in memory, because it’s straight up replacing old code with new code without stopping. Notably, you need to use Distillery releases for this; mix releases don’t support hot code loading.
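To give a feel for the in-memory part: `code_change/3` is the GenServer hook that receives the old state during a hot upgrade so you can convert it to whatever shape the new code expects. A made-up sketch (module and state shape are invented purely for illustration):

```elixir
defmodule MyApp.Counter do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok), do: {:ok, %{count: 0}}

  # Suppose the old version of this server kept a bare integer as its state,
  # and the new version wraps it in a map: the upgrade has to convert it.
  @impl true
  def code_change(_old_vsn, count, _extra) when is_integer(count), do: {:ok, %{count: count}}
  def code_change(_old_vsn, state, _extra), do: {:ok, state}
end
```

Every stateful process whose data layout changes needs this kind of treatment, which is a big part of why hot upgrades are more work than they first appear.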

A rolling deploy is more traditional, and it’s how you’d solve this in any other language too. In this method you need more than one server running, and a load balancer that directs traffic to your nodes. You tell the load balancer to stop sending traffic to node A, then you stop the old release on node A and start the new one. Then you direct traffic back to A and move on to the next node. Any database migrations need to be compatible with both versions of your code, since there will be a period of time during which both versions are running.
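To make the “compatible with both versions” point concrete, here’s what that looks like at the migration level. A hypothetical example (table and column are invented), where the new column is added in a way the old release can simply ignore:

```elixir
defmodule MyApp.Repo.Migrations.AddStatusToOrders do
  use Ecto.Migration

  # Adding the column as nullable with a default means the old release,
  # which never reads or writes it, keeps working while both versions
  # are serving traffic during the rollout.
  def change do
    alter table(:orders) do
      add :status, :string, default: "pending"
    end
  end
end
```

Renames or drops, by contrast, have to be split across two deploys so that neither version ever sees a schema it can’t handle.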

The actual mechanics of how this is accomplished depend on how you’ve got your servers and load balancer(s) set up. There are also variations where you start brand new servers (or Docker containers) that run the new version, migrate traffic over, and then delete the old servers (or containers) running the old version.

6 Likes

Thanks Ben for your detailed answer!

The hot reload way sounds complicated for the early-stage app I’m working on. It definitely requires certain architectural decisions before it can even be implemented.

I can see the potential in setting up a couple of production servers with a load balancer to implement the second strategy; it does make sense. So the traditional way it is.

P.S. Coincidence or not, I’m currently learning Absinthe and was looking for a good learning resource. Now that I know about your book, I’m going to check it out!

You can’t do out-of-the-box hot code upgrades with 1.9 releases, so your best option is to handle it with a load balancer. One way to do this is a blue-green deployment setup. It’s similar to what @benwilson512 describes: you start up node B, and once it’s running you switch your load balancer (e.g. nginx) to direct traffic to node B instead of node A, then shut down node A once all traffic has been redirected. This all happens on the same server.

You will have to set up your Ansible playbook so it’s aware of what port/socket each node runs on and can switch between them during a deploy. I’ve had plans for a while to expand my Ansible guide with blue-green deployment, but have had no time to dig into it.

2 Likes

I didn’t know that this is called a blue-green deployment. So do you mean that I can handle this on a single server? For example, app A is on port 4000 and app B is on port 4001, and I can switch between them so there is zero downtime. I’m not very proficient with complex setups and load balancers. Does it make sense to set up a load balancer in front of these apps A and B?

Yeah, it can be handled on the same server, and nginx will be your load balancer. It’ll be pretty close to zero downtime, though a cluster of individual servers would be more resilient, and probably simpler to work with (but then you’ll have three servers instead of one).
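For the two-ports part, the trick is making the port configurable at boot so the same release can run twice side by side. With 1.9 releases that can go in config/releases.exs, which is evaluated when the release starts. A sketch, with `my_app`/`MyAppWeb.Endpoint` as placeholders for your real app and endpoint:

```elixir
# config/releases.exs — read at boot, so each copy of the release
# can bind to its own port.
import Config

config :my_app, MyAppWeb.Endpoint,
  server: true,
  http: [port: String.to_integer(System.get_env("PORT") || "4000")]
```

Then the old copy runs with PORT=4000, the new one with PORT=4001, and nginx just needs its upstream pointed at whichever one should be live.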

A very simple version can be found here, which should be easy to understand: https://medium.com/@miket969/blue-green-deployments-with-nginx-cbaa9938bcf8

The above example isn’t really zero downtime, since it just cuts off all requests to the first server completely. If a request hasn’t been completed, it’ll fail, but the switch will be instantaneous. A better version is one where nginx does a graceful cut-over and waits for all traffic to have been redirected to the new server.

This is where it can get complex. HAProxy/Nomad or other tools are probably a better choice here.

If you are going down this path, these docs on Nomad and HAProxy may help you:
https://danielparker.me/haproxy/blue-green/deployments/canary/nomad/simple-blue-green-haproxy/
https://www.nomadproject.io/guides/operating-a-job/update-strategies/blue-green-and-canary-deployments.html

4 Likes

Thanks! This is very helpful, Dan.