Docker in combination with Hot-Code-Swapping?

Qqwy · April 6, 2019, 1:08pm

The current cloud infrastructure seems very much based around the concept of containerization: running each service in its own virtual environment. The main advantages seem to be that you can normalize configuration and storage options for these services, regardless of how they have been built themselves, and that you are able to automate the spinning up of new containers when necessary to handle increased load (as well as stopping them when they are no longer necessary).
Containers are usually stateless, with a database as the source of truth, meaning that your clusters are usually updated by (one by one) starting the new version of the software in a new container in the background, and then replacing the old container with this one.

Unfortunately, this is sort of at odds with Erlang’s own way of doing Hot-Code Upgrades (‘code replacement’), because these are all about treating your cluster as a stateful system (which is obviously necessary to keep existing sockets etc. alive). Or at least, that is what I thought at first.

(See also: this question on StackOverflow)

I think that there are two ways to upgrade a system that have the same result:

A)

Move the new release inside the current docker container running Erlang.
Do a hot-code Erlang upgrade to the new version.
Your docker container now runs the new software update, or at least until restarted.

B)

In the background (e.g. on your build machine) start a new docker container with your current software.
Do a ‘hot’ Erlang upgrade of this system.
Commit this docker container as a new version of the docker image.
You now have a new docker container that might replace the old production nodes, when they have to be restarted.

It seems to me that, if these two techniques are used together, we can have hot-code upgrades of a running Docker cluster, and if nodes need to be added/removed at a later time (or when you later have to restart larger parts of your cluster), the cluster will still be running the latest version of your software.
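
As a rough sketch of how A and B might be scripted, the Python below only builds the `docker` command lines instead of running them. The application name, container and image names, the release directory layout, and the `bin/myapp upgrade <version>` call (modeled on a Distillery-style release script) are all assumptions for illustration, not something taken from a real setup.

```python
from typing import List

APP = "myapp"  # hypothetical application name


def approach_a(container: str, version: str, tarball: str) -> List[List[str]]:
    """Commands to hot-upgrade a running container in place (approach A)."""
    release_dir = f"/opt/{APP}/releases/{version}"  # assumed release layout
    return [
        # Move the new release inside the running container.
        ["docker", "exec", container, "mkdir", "-p", release_dir],
        ["docker", "cp", tarball, f"{container}:{release_dir}/{APP}.tar.gz"],
        # Ask the release script to perform the hot upgrade (assumed command).
        ["docker", "exec", container, f"/opt/{APP}/bin/{APP}", "upgrade", version],
    ]


def approach_b(image: str, old_tag: str, version: str, tarball: str) -> List[List[str]]:
    """Commands to upgrade a scratch container and commit it as a new image (approach B)."""
    scratch = f"{APP}-build"  # hypothetical name for the background container
    return [
        ["docker", "run", "-d", "--name", scratch, f"{image}:{old_tag}"],
        *approach_a(scratch, version, tarball),
        ["docker", "commit", scratch, f"{image}:{version}"],
        ["docker", "rm", "-f", scratch],
    ]


if __name__ == "__main__":
    for cmd in approach_b("registry.example/myapp", "0.1.0", "0.2.0", "myapp.tar.gz"):
        print(" ".join(cmd))
```

Each inner list could be handed to `subprocess.run(cmd, check=True)`. Whether a committed container boots the upgraded release after a restart depends on how the release was made permanent, so that part would need testing.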

The reason this is under ‘Discussions’ is because I do not have enough experience with Docker and Kubernetes (and my knowledge of Distillery is not that extensive either), so I would like to know if this idea is sound. Also, I’d like to know if there are other people that think that this is an exciting idea as well, and if there are other alternatives that might make more sense.

lpil · April 6, 2019, 1:38pm

Both seem viable though I would ask why use hot code upgrades when you’re in an environment where you can cycle nodes and perform immutable deploys?

What business problem does hot code loading solve for you that makes it worth the additional complexity?

amarraja · April 6, 2019, 2:27pm

I’ve never tried it, but gigalixir has support for hot upgrades in a container, and seems closely aligned to your first suggestion. On the topic of restarts, I guess a new container starting will take the latest image anyway so there will never be a “rolled back” state.

https://gigalixir.readthedocs.io/en/latest/main.html#life-of-a-hot-upgrade

sribe · April 6, 2019, 2:37pm

If you have anything where clients need stateful persistent connections. Now, web clients are not written that way, but not everything is an HTTP service. Of course, much other software also uses stateless connections, but…

Consider the classic use of Erlang: phone service

keathley · April 6, 2019, 2:46pm

You can have persistent connections and immutable deploys. These are not mutually exclusive. It does mean that you need to add orchestration at the systems level. But this is how every other runtime that utilizes RPC over persistent connections works.

peerreynders · April 6, 2019, 4:20pm

Currently my position is that code replacement in a containerized environment is a “square peg in a round hole” kind of situation. Code replacement has its use cases where a typical cut-over is just not feasible, be it on telephone switches before containers, upgrading a drone in mid-flight or upgrading high availability IoT devices.

In any case certain modules/processes have to be written with an awareness that code replacement is a possibility in order for the upgrade to be successful which tends to increase the complexity at the lowest level. So while the VM has the basic capability for code replacement, it isn’t a capability that can be taken advantage of “for free” on the system level - it adds yet another concern that winds through the entire codebase.
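
To make that "awareness" a bit more concrete: in OTP a `gen_server` can define a `code_change/3` callback that converts the old process state into the shape the new code expects. The Python below is only an analogue of that idea, not OTP itself; the state layout, version numbers, and function names are made up for the example.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]


def v1_to_v2(state: State) -> State:
    """Hypothetical migration: v2 keeps connections in a dict keyed by id, not a list."""
    return {"vsn": 2, "conns": {i: c for i, c in enumerate(state["conns"])}}


# One migration step per old version; every state change needs an entry here.
MIGRATIONS: Dict[int, Callable[[State], State]] = {1: v1_to_v2}


def code_change(state: State, target_vsn: int) -> State:
    """Apply migrations step by step until the state matches target_vsn."""
    while state["vsn"] < target_vsn:
        step = MIGRATIONS.get(state["vsn"])
        if step is None:
            raise RuntimeError(f"no upgrade path from version {state['vsn']}")
        state = step(state)
    return state


if __name__ == "__main__":
    print(code_change({"vsn": 1, "conns": ["a", "b"]}, 2))
```

The point of the sketch is the bookkeeping: every process whose state shape changes needs such a step, and a missing one only shows up at upgrade time.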

With containers you have the opportunity to address that complexity explicitly through architecture. Daniel Azuma describes one approach taken in tanx that uses Horde supervisors.

TANSTAAFL

tristan · April 6, 2019, 4:21pm

Yea, it would be neat if the new Erlang socket module was able to create a socket from a file descriptor like what is done in other languages to use systemd’s socket activation. This would allow for full node restart without losing the listen socket.

AlchemistCamp · April 6, 2019, 6:39pm

I use docker for CI/CD on Gitlab and then deploy from that container to a (non-dockerized) DO droplet via hot upgrades.

I don’t actually need the hot upgrades, though. The driving motivation of the setup was simplicity. I don’t need to run docker (thus it’s not a problem working from my windows laptop), deployments are automatic whenever I merge to master and tests pass, and all I had to set up was Edeliver and a .gitlab-ci.yml file.

YMMV if you’re in a more enterprisey setting.

sribe · April 6, 2019, 9:35pm

Well, yeah–not trivial, but completely possible.

lpil · April 7, 2019, 12:36pm

Hot upgrades are also not trivial, especially if your application is stateful. According to the Learn You Some Erlang book, at Ericsson they spend as much time testing their upgrades as they do their actual application code, effectively doubling the amount of work.

garazdawi · April 7, 2019, 3:42pm

Can’t you just use the :gen_tcp.connect/listen fd option? Or am I missing something?

tristan · April 7, 2019, 4:01pm

Oh, wow, I somehow missed that option! I guess I was expecting gen_tcp:listen(Fd) and not gen_tcp:listen(Port, [{fd, Fd}]). But yes, this looks like it should work. I’ll have to give it a try with systemd soon.
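
For comparison, this is roughly what adopting an inherited listen socket looks like in another language. It is a Python sketch, not Erlang, and it assumes the usual systemd socket activation convention of passing descriptors from fd 3 onward and announcing them through the `LISTEN_PID` and `LISTEN_FDS` environment variables; the helper names are made up.

```python
import os
import socket
from typing import List

SD_LISTEN_FDS_START = 3  # systemd passes inherited sockets starting at fd 3


def adopt_listener(fd: int) -> socket.socket:
    """Wrap an already-bound, already-listening file descriptor in a socket object."""
    # Python 3.7+ detects family, type and protocol from the descriptor itself.
    return socket.socket(fileno=fd)


def systemd_listeners() -> List[socket.socket]:
    """Return sockets handed over via LISTEN_PID / LISTEN_FDS, or [] if there are none."""
    if os.environ.get("LISTEN_PID") != str(os.getpid()):
        return []
    count = int(os.environ.get("LISTEN_FDS", "0"))
    return [adopt_listener(SD_LISTEN_FDS_START + i) for i in range(count)]


if __name__ == "__main__":
    for listener in systemd_listeners():
        print("inherited listener on", listener.getsockname())
```

Because the listening socket is owned by the service manager and not by the process, the process can be restarted completely while new connections queue up on the socket.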

garazdawi · April 7, 2019, 4:02pm

Please do. Let me know if you have any issues.

brentjanderson · April 12, 2019, 3:37pm

Approach A would fit well if you use Kubernetes to manage Elixir/Erlang containers AND you absolutely need hot-code upgrades. I would suggest that for serious production cases the operator pattern could be very useful.

From the linked post:

An Operator is an application-specific controller that extends the Kubernetes API to create, configure, and manage instances of complex stateful applications on behalf of a Kubernetes user. It builds upon the basic Kubernetes resource and controller concepts but includes domain or application-specific knowledge to automate common tasks.

So, an Erlang/Elixir operator would:

- Accept API calls for rolling forward and backwards between versions
- Translate those calls into commands sent directly to your running Erlang containers for:
  - Fetching the next Erlang code release from some artifact storage (could still be using Docker, or you could fetch it from S3)
  - Performing the hot-code upgrade to the new version
  - Reporting on the progress of the upgrade
  - Handling an automatic or manually triggered rollback if the upgrade is failing for some reason

This would give you k8s-level APIs for managing the process without having to drop into containers by hand. I suspect this approach would be best-fit for applications running in k8s, but other orchestration systems could benefit from the general idea as well.
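
The responsibilities listed above could be sketched as a small controller loop. This is a toy Python model under stated assumptions, not a Kubernetes operator: `upgrade_node` is a hypothetical hook standing in for "fetch the release and perform the hot-code upgrade on one node", and there is no real API server involved.

```python
from typing import Callable, List, Tuple


class Node:
    def __init__(self, name: str, version: str) -> None:
        self.name = name
        self.version = version


class UpgradeOperator:
    """Toy controller: hot-upgrade nodes one by one, roll back on the first failure."""

    def __init__(self, nodes: List[Node], upgrade_node: Callable[[Node, str], bool]) -> None:
        self.nodes = nodes
        self.upgrade_node = upgrade_node  # hypothetical hook, returns True on success
        self.log: List[str] = []          # progress report

    def roll_to(self, target: str) -> bool:
        done: List[Tuple[Node, str]] = []
        for node in self.nodes:
            previous = node.version
            self.log.append(f"{node.name}: {previous} -> {target}")
            if not self.upgrade_node(node, target):
                self.log.append(f"{node.name}: upgrade failed, rolling back")
                # Undo the nodes that were already upgraded, newest first.
                for upgraded, old in reversed(done):
                    self.upgrade_node(upgraded, old)
                    upgraded.version = old
                return False
            node.version = target
            done.append((node, previous))
        return True


if __name__ == "__main__":
    cluster = [Node("a", "1.0"), Node("b", "1.0")]
    operator = UpgradeOperator(cluster, lambda node, version: True)
    print(operator.roll_to("1.1"), operator.log)
```

A real operator would watch a custom resource and reconcile towards the requested version; the sketch only shows the roll-forward, progress log, and rollback steps.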