I know zero-downtime deployment can mean different things depending on your application and what your users can tolerate, so expectations can vary widely between projects.
I’m building an app that will execute long-running tasks for users, and I want to be able to deploy new revisions of the code without fear of interrupting customers’ work. Basically, it’s not a nice user experience to crash hundreds of tasks that may have run for more than an hour and ask users to start them over. I’m looking for books or other resources that offer patterns or solutions for how this could be achieved in Elixir.
Any suggestions for books and articles are much appreciated.
You should always have a strategy for restarting a process. Think about what happens if it crashes on the same node and gets restarted instantly by its parent supervisor. For example, store in the DB some key part of the state that would let the process reinitialize from the step where it was terminated, but don’t blindly try to store the whole state.
Now that we know we can start our process, stop it, and continue with the same initial arguments, there are 2 general ways to achieve zero-downtime deployments (at least that I’m familiar with):
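To make the "store a key part of the state" idea concrete, here is a minimal sketch. `CheckpointStore` and `LongTask` are hypothetical names, and the Agent is only a stand-in for a real database table; the point is that `init/1` reads the last completed step instead of starting from zero.

```elixir
defmodule CheckpointStore do
  # Stand-in for a DB table: maps a task id to the last completed step.
  use Agent

  def start_link(_opts \\ []) do
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  def save(task_id, step), do: Agent.update(__MODULE__, &Map.put(&1, task_id, step))
  def load(task_id), do: Agent.get(__MODULE__, &Map.get(&1, task_id, 0))
end

defmodule LongTask do
  use GenServer

  def start_link(task_id, total_steps) do
    GenServer.start_link(__MODULE__, {task_id, total_steps})
  end

  def run_next_step(pid), do: GenServer.call(pid, :run_next_step)

  @impl true
  def init({task_id, total_steps}) do
    # Resume from the persisted checkpoint rather than from step 0.
    {:ok, %{task_id: task_id, total: total_steps, done: CheckpointStore.load(task_id)}}
  end

  @impl true
  def handle_call(:run_next_step, _from, %{done: done, total: total} = state)
      when done >= total do
    {:reply, :finished, state}
  end

  def handle_call(:run_next_step, _from, state) do
    step = state.done + 1
    # ... the real work for `step` would happen here ...
    # Only the step counter is persisted, not the whole process state.
    CheckpointStore.save(state.task_id, step)
    {:reply, {:ok, step}, %{state | done: step}}
  end
end
```

If a `LongTask` for `"report-1"` completes two steps and is then stopped, a new `LongTask` started with the same arguments continues at step 3. Because the checkpoint is written after a step completes, a crash in the middle of a step means that step runs again, so the steps themselves need to be safe to repeat.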
There are other options like swarm, Horde, and syn that you could take a look at. They all have their pros and cons.
But one thing is common: they are eventually consistent!
That might be fine for some use cases (when it’s OK to have a duplicate process for a short period of time when a network split happens), but might not be acceptable for other scenarios. (Some companies even have an “internal ban” on those libraries because new devs would often think of them as an easy solution for a “single global process” and get burned by their eventually consistent nature.)
There are implementations of strongly consistent consensus available for Erlang and Elixir (e.g. ra and waraft), but I’m not aware of any distributed process registry or supervisor implementation based on those.
One of the workers picks up the job and starts the long-running process.
Only once the process is done does the worker ack the job.
If the connection gets closed because of node termination, the job gets requeued and another active worker picks it up and starts the process. (Here you want to make sure that when the process is restarted, it “continues” the work as mentioned earlier!)
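The ack-after-completion flow can be imitated in plain Elixir to see how it behaves. This is only a sketch: `AckQueue` is a hypothetical in-memory stand-in for a real broker, and a process monitor plays the role of the broker noticing a closed connection.

```elixir
defmodule AckQueue do
  # In-memory stand-in for a broker with manual acknowledgements.
  use GenServer

  def start_link(jobs), do: GenServer.start_link(__MODULE__, jobs)
  def checkout(queue), do: GenServer.call(queue, :checkout)
  def ack(queue, job), do: GenServer.call(queue, {:ack, job})
  def pending(queue), do: GenServer.call(queue, :pending)

  @impl true
  def init(jobs), do: {:ok, %{ready: jobs, in_flight: %{}}}

  @impl true
  def handle_call(:checkout, _from, %{ready: []} = state), do: {:reply, :empty, state}

  def handle_call(:checkout, {worker, _tag}, %{ready: [job | rest]} = state) do
    # Monitor the worker so the queue learns about its death, much like
    # a broker noticing that a consumer's connection has closed.
    ref = Process.monitor(worker)
    in_flight = Map.put(state.in_flight, ref, job)
    {:reply, {:ok, job}, %{state | ready: rest, in_flight: in_flight}}
  end

  def handle_call({:ack, job}, _from, state) do
    case Enum.find(state.in_flight, fn {_ref, j} -> j == job end) do
      {ref, _job} ->
        # Job finished: stop watching the worker and forget the job.
        Process.demonitor(ref, [:flush])
        {:reply, :ok, %{state | in_flight: Map.delete(state.in_flight, ref)}}

      nil ->
        {:reply, {:error, :unknown_job}, state}
    end
  end

  def handle_call(:pending, _from, state), do: {:reply, state.ready, state}

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, state) do
    # Worker died before acking: put the job back for another worker.
    case Map.pop(state.in_flight, ref) do
      {nil, _in_flight} ->
        {:noreply, state}

      {job, in_flight} ->
        {:noreply, %{state | ready: state.ready ++ [job], in_flight: in_flight}}
    end
  end
end
```

A worker that calls `AckQueue.checkout/1` and exits without calling `AckQueue.ack/2` causes its job to reappear in `AckQueue.pending/1`, where the next worker can pick it up. A real broker gives you the same behaviour across nodes, which this single-process sketch does not.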
Another option could be “holding” a DB transaction lock (I think that’s what Oban uses for its queues), but that might be more limiting than RabbitMQ.
I would say relying on a 3rd-party queue-like mechanism (a single source of truth) is the more battle-tested and trusted approach.
Agreed, this is the most important part. It is very important to break a big task down into smaller tasks that can be retried once the node has been restarted, without losing much time in the overall execution of the task.
As for the persistence of jobs, nowadays I always reach for Oban: not only will you most probably already have a DB lying around, but it also solves 95% of the potential problems you might encounter down the road, out of the box.
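As a rough illustration of splitting a big task, the sketch below cuts a list of items into batches and skips the batches already recorded as done. `BatchedJob` is a hypothetical module, and the `MapSet` of finished batch indexes stands in for whatever durable record you would keep.

```elixir
defmodule BatchedJob do
  # Splits a large list of items into fixed-size batches, each tagged
  # with its index.
  def batches(items, size) do
    items |> Enum.chunk_every(size) |> Enum.with_index()
  end

  # Runs only the batches whose index is not in `done` and returns the
  # updated set of completed batch indexes.
  def run(items, size, done, fun) do
    items
    |> batches(size)
    |> Enum.reject(fn {_batch, idx} -> MapSet.member?(done, idx) end)
    |> Enum.reduce(done, fn {batch, idx}, acc ->
      fun.(batch)
      # In a real system, persist `idx` durably here before moving on.
      MapSet.put(acc, idx)
    end)
  end
end
```

With ten items and a batch size of four, a retry that already knows batch 0 is done only runs the remaining two batches, so a restart costs at most one batch of repeated work.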
You could look at Erlang resources about hot code loading, but that comes with plenty of caveats and provisos that have driven most of the Elixir community away. That said, zero-downtime deployment was one of the required features that drove early Erlang/BEAM design decisions.
Hot code reloading was a must-have feature for their use case, telephone switches: you don’t want to end active calls during migrations.
I think that option is overkill for 99% of use cases, especially since I imagine your deploy setup would have to be different from what we use currently, not to mention that this feature makes little sense if you are running on a single node and can afford some downtime.
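For a feel of what hot code upgrades ask of your code, here is a minimal sketch of the `code_change/3` callback that a GenServer uses to migrate its state between module versions. `CallTracker`, its state shape, and the idea that version "1" held a bare integer are all made up for illustration; actually triggering the upgrade also needs release tooling (appup/relup files), which is not shown.

```elixir
defmodule CallTracker do
  use GenServer

  # Module version that release tooling uses to tell versions apart.
  @vsn "2"

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)
  def count(pid), do: GenServer.call(pid, :count)

  @impl true
  def init(:ok), do: {:ok, %{calls: 0, started_at: System.system_time(:second)}}

  @impl true
  def handle_call(:count, _from, state), do: {:reply, state.calls, state}

  @impl true
  def code_change("1", old_count, _extra) when is_integer(old_count) do
    # Hypothetical version "1" kept a bare integer as its state; convert
    # it to the map that version "2" expects, keeping the call count.
    {:ok, %{calls: old_count, started_at: System.system_time(:second)}}
  end

  def code_change(_old_vsn, state, _extra), do: {:ok, state}
end
```

Every state shape that ever ran in production needs a clause like this, and each one needs testing, which is a good part of why this approach is considered fragile.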
I would agree that for most apps hot code upgrades are overkill. As much as I think they are cool, they are fragile and need testing in non-production environments before you can really be confident that you don’t trip yourself up.
Some considerations:
You always need to have a strategy for runtime upgrades and OS level patching and upgrades.
Solving (1) can be done with a rolling release where you add new upgraded nodes to the cluster and cycle out the old ones. Phoenix 1.7.2 even added features to drain connections on large scale deployments for this purpose.
Solving (2) means you need to keep your database in a state that supports old and new schemas. This is typically done using a number of phases where you only add new columns and tables using concurrent approaches that don’t lock the production workloads.
Another option is Reshape, a very clever Postgres tool for zero-downtime schema migrations that don’t stop production workloads and let you have both old and new apps running seamlessly (which may or may not be Elixir apps; they could be multiple other apps using the same database schema on different release cycles), so your app code can be simplified. It’s an application-agnostic schema management tool, so you use its semantic migrations to ensure changes never lock up the DB, and believe me, it’s trivial to lock a table just by creating a unique index. It’s probably the most well-thought-out approach I am aware of, but it will still require plenty of care and testing.
So once you solve those problems and still think it’s necessary to do hot code upgrades, you can delve into the diminishing returns vs rolling releases. You likely won’t have a use case for hot code upgrades, but if you do, there are some resources that explain what you need to do:
Be aware that exrm was basically replaced by distillery, and distillery was replaced by mix releases, sans hot code reloading. I am not aware of a viable tool for doing it with Elixir releases, but I haven’t looked too hard.