Update distributed stateful elixir apps in kubernetes

genserver
deployment
kubernetes

#1

I’ve read different threads in the forum about Elixir and Docker, like Elixir applications in Docker containers?, and I’m opening this one to discuss a specific case that still seems a bit unsolved to me: updating a (distributed?) stateful Elixir app running on Kubernetes. At least I haven’t seen a well-known good practice for handling this.

I think the issue of connecting Elixir nodes running in different containers in a Kubernetes cluster is pretty much solved. @bitwalker’s libcluster library, with the Cluster.Strategy.Kubernetes.DNS strategy, makes it really easy to cluster different Elixir containers.
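For reference, a minimal libcluster configuration for that strategy might look something like this (a sketch only: `myapp` and `myapp-headless` are placeholder names, and it assumes a headless Kubernetes Service pointing at the pods):

```elixir
# config/runtime.exs — sketch; "myapp" and "myapp-headless" are
# placeholders for your OTP app and its headless k8s Service.
config :libcluster,
  topologies: [
    k8s_dns: [
      strategy: Cluster.Strategy.Kubernetes.DNS,
      config: [
        service: "myapp-headless",
        application_name: "myapp"
      ]
    ]
  ]
```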

But what about preserving the processes’ state? We can’t do hot code swapping with containers/Kubernetes… so we need to do a sort of blue/green deployment, killing the old containers (which hold the state) and spawning new containers with the new image.

@dazuma, in his recent talk Docker and OTP Friends or Foes, suggests using Horde (https://github.com/derekkraan/horde) and CRDTs to push the state to another live container while the old containers terminate. I’ve tried this approach and it seems a bit too fragile to me, but maybe I’m doing something wrong. The graceful termination period in Kubernetes is fixed, and I don’t have any guarantee that the state is replicated correctly to another healthy Elixir node/container. @dazuma, do you have something public I can look at to easily replicate what you did in the video?

@dazuma also mentions another way to tackle this: doing hot code swapping within the container, without updating the container image itself. Does Gigalixir really do this? This approach goes against container best practices… BUT honestly, it could be much less complicated (and maybe more solid) than replicating state during termination. Still, when we need to upgrade the containers to new images (like a new Elixir version), we have to kill the containers, losing the state.

Any other pattern we could use? What about stashing the state in something like Redis and having the new containers recover it?


#2

The first thing for anyone thinking about this is, of course, always: do you really need/want this? Just because it is easy to cluster nodes doesn’t change the fact that there are many issues with distributed Erlang, particularly in a cloud environment and particularly when dealing with state as opposed to simple control messages.

Assuming this has all been considered and state is required, then I’d suggest simply using a database or a k8s StatefulSet (https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/).

Erleans (https://github.com/erleans/erleans), which takes inspiration from Orleans (https://dotnet.github.io/orleans/), uses a database to keep a grain’s state – only Postgres in the case of Erleans right now.

Both the statefulset and shared database method allow for rolling deploys.
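To make the StatefulSet option concrete: the piece that gives each pod durable state is `volumeClaimTemplates` – each replica gets its own PersistentVolumeClaim that survives pod restarts and rolling updates. A sketch, where all names (`counter`, `counter-headless`, the image) are purely illustrative:

```yaml
# Sketch of a StatefulSet with per-pod persistent storage.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: counter
spec:
  serviceName: counter-headless   # headless Service for stable pod DNS
  replicas: 2
  selector:
    matchLabels:
      app: counter
  template:
    metadata:
      labels:
        app: counter
    spec:
      containers:
        - name: counter
          image: myrepo/counter:latest
          volumeMounts:
            - name: state
              mountPath: /data    # snapshot process state here
  volumeClaimTemplates:
    - metadata:
        name: state
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```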


#3

Hi @tristan, thanks for your answer. A StatefulSet alone is not enough, since when k8s does a rolling update it (gradually) kills the pods to spawn new ones. These pods have to store the state somewhere, like a volume or a database as you said.

Just a simple example: an sts of two Elixir pods (counter-0 and counter-1) running a distributed counter app, where each counter process has its own state starting from 0, incremented on each :incr message. Now, when we want to update the pods, Kubernetes terminates them starting from counter-1. All the counter processes, and their states, in counter-1 will be lost. Sure, we could use a database – for this example Redis would be more than enough – but that is not my point. I’d like to see if there are feasible options to do a blue/green deploy preserving the state of the processes, without the need for a db or an external component.


#4

The stateful set lets you store the state to disk and know that when counter-1 comes back up it gets the same state that it last wrote.


#5

So, you suggest letting counter-1 write a snapshot of its state to the volume (serialising it with something like :erlang.term_to_binary?), and then having the new counter-1 load the state, right? What if you need to scale down from two containers to just one, counter-0? What do you think about temporarily pushing the state to a db or Redis, and then letting a distributed supervisor (like Horde) spawn the counter process in counter-0?
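A minimal sketch of that snapshot idea, assuming the StatefulSet’s volume is mounted somewhere and its path is passed in as an option (the module and option names here are made up). The key points are trapping exits so `terminate/2` runs on a graceful k8s shutdown (SIGTERM), and restoring the snapshot in `init/1`:

```elixir
# Sketch: a counter GenServer that snapshots its state to a
# StatefulSet volume on shutdown and restores it on boot.
defmodule Counter do
  use GenServer

  def start_link(opts) do
    path = Keyword.fetch!(opts, :snapshot_path)
    GenServer.start_link(__MODULE__, path, name: __MODULE__)
  end

  def incr, do: GenServer.cast(__MODULE__, :incr)
  def value, do: GenServer.call(__MODULE__, :value)

  @impl true
  def init(path) do
    # Restore the last snapshot if one exists, otherwise start from 0.
    count =
      case File.read(path) do
        {:ok, bin} -> :erlang.binary_to_term(bin)
        {:error, _} -> 0
      end

    # Trap exits so terminate/2 is called on a graceful shutdown.
    Process.flag(:trap_exit, true)
    {:ok, %{count: count, path: path}}
  end

  @impl true
  def handle_cast(:incr, state), do: {:noreply, %{state | count: state.count + 1}}

  @impl true
  def handle_call(:value, _from, state), do: {:reply, state.count, state}

  @impl true
  def terminate(_reason, %{count: count, path: path}) do
    # Serialise the state to the volume before the pod goes away.
    File.write!(path, :erlang.term_to_binary(count))
  end
end
```

Note this only covers the rolling-update case where counter-1 comes back under the same name; the scale-down case still needs a handoff mechanism like the ones discussed above.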


#6

“suggest” :). Just throwing out the possibilities with k8s. Any solution is going to be more specific to the actual application’s functionality and constraints. But I do think pushing state to a db is fine; there are a number of considerations to take into account when doing so, and Orleans covers many of them in their paper.


#7

Hi! Just a few thoughts on this.

Re: seems a bit too fragile
Process state is fragile, period, even without the distribution. It’s just a property of OTP that processes (along with their state) may go away because we “let it crash”. Therefore, it’s important to design our applications to be resilient to that. So no, there’s no “guarantee” that the technique I described can preserve state, and I don’t know that there is (or even should be) a way around it. If you have critical state and need guarantees, use a database.

Re: other patterns
The CRDT method of preserving state that I described in the talk is just one option. It works well for simple cases and is wicked-fast (and is easy to deploy, and makes a great demo), but it may have trouble scaling to large clusters. As you mentioned, you could also stash the state in an external key-value store like Redis, or maybe even a queueing or pubsub system. (I’m actually currently investigating the latter myself.)
Overall, we are still pretty early in the development of best practices around this problem. The point I was trying to make in my talk was that it’s a hard problem, but there are solutions and ideas out there to explore, and it’s important for us in the community to be trying things out and sharing our findings.

Re: public code
The code I used for the demo in my talk is at https://github.com/elixirseattle/tanx but I haven’t updated it since ElixirConf and it’s a bit out of date. (Horde in particular has evolved since then.) It’s also not really well documented, sorry. However, Chirag Singh Toor did a detailed writeup of how he produced a similar setup. You might try looking through that.

Re: Gigalixir and containers best-practice
My understanding is that Gigalixir effectively mounts your app (release) as a volume in a container. So the container is used more like a VM, for the OS image and process isolation but not for the application image. So yes, you can do hot code swapping there. Additionally, Elixir releases can include ERTS, OTP, and the Elixir runtime in the release itself, so you shouldn’t have to kill the container for a new Elixir version. Only an OS update should require new containers. (@jesse may have more to say, or may need to correct some of my understanding here.)
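For the “releases can include ERTS” point: with Mix releases this is the `include_erts` option (I believe it defaults to true), so the release ships its own runtime independently of what’s in the container image. A sketch, with `myapp` as a placeholder name:

```elixir
# mix.exs — sketch; ":myapp" is a placeholder app name
def project do
  [
    app: :myapp,
    version: "0.1.0",
    releases: [
      myapp: [
        # Bundle ERTS with the release, so upgrading Elixir/OTP
        # doesn't require rebuilding the base container image.
        include_erts: true
      ]
    ]
  ]
end
```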

That said, yes, techniques like this kinda-sorta go against current “container best practices”. However, I’d suggest not taking that too rigidly. We are, again, still very early in the development of the container ecosystem, and “best practices” are very much in flux. For example, the whole issue we’re dealing with here is that current container “practices” are tied to a “traditional” stateless web application model, whereas OTP explicitly breaks out of that mold. That’s our opportunity as Elixir and OTP developers: we have a disruptive technology that enables us to do new things, and we have the opportunity to develop new patterns to make containers work for us, even if that ends up looking different from what you see with traditional web apps.


#8

Thanks a lot for your super detailed answer, really helpful! :smiley: I’ll do some experiments and keep you updated if you want.

My gut feeling is that hot code swapping within containers would end up being much simpler… it would be really interesting to hear from @jesse how Gigalixir does it. Does it have a second container in the pod that runs the release update?


#9

@alvises, @dazuma pretty much described it perfectly. No second container in the pod.


#10

I always feel the need to push back on this :). First, I wouldn’t say container practices are tied to the traditional web application model. Containers (in an early form) started the move to that model.

I don’t see Elixir/Erlang/OTP as a disruptive technology, particularly in this area. Erlang distribution is instead pushing old ways against “better” (obviously depends on your environment, but certainly better in the case of running in the cloud) models.

I cringe when Erlang and mnesia are hyped this way. It leads to people being disappointed when they discover this stuff was cutting edge distribution in the 90s under a very different type of setup.

Obviously I think there are advantages in specific cases to stateful applications, and that Erlang’s concurrency, fault-tolerant design and easy control-plane distribution are beneficial to building such solutions – otherwise I wouldn’t work on Erleans – but I’ve also seen the hype and the eventual crash before.


#11

I agree. Outside of use cases such as caching or presence, where by definition the state is ephemeral and will automatically reconstitute itself, I think it’s better to use a database server or message queue to manage state, and keep your application stateless. Of course, if you are writing a database server or message queue the answer is completely different, but most people aren’t doing that and shouldn’t be. It is not only easier for deploying in Kubernetes, it’s actually easier everywhere. Designing your servers for hot updating can get pretty tricky; I’m skeptical it’s often worth it.


#12

I’d also strongly recommend against kubernetes if performance matters, even if it does make networking easier. Containerizing BEAM results in a non-trivial performance hit.


#13

Can you point me to some analysis on the performance issues of containerizing BEAM? (Benchmarks or reasons for performance reduction?) I’ve occasionally heard this claim and I’d like to understand where it’s coming from. Thanks!


#14

Was about to ask the same… I don’t know how it could have any BEAM-specific performance hit, assuming it’s properly configured.

If not properly configured, and you end up with N active schedulers but limited CPU slices, you are going to take a BEAM-specific hit.
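One way to avoid that mismatch is to pin the scheduler count to the container’s CPU limit, e.g. via the `+S Schedulers:SchedulersOnline` emulator flag. A sketch, assuming a pod limited to 2 CPUs (`bin/myapp` is a placeholder for your release start script):

```shell
# Sketch: start the release with schedulers pinned to the CPU limit
# instead of the host's full core count.
ERL_FLAGS="+S 2:2" bin/myapp start
```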


#15

I would suggest making your application resilient against node downtimes.

Have you thought about integrating protocols that implement consensus to hold state in a distributed environment?

Then upgrading the app would just move the “leading” node once it is shut down.
Writes and consistent reads might be slightly degraded during rollout, but would eventually recover if implemented properly.

I would advise against breaking container best practices. That said, why must it be containers in the first place?

After all, putting state somewhere it can be centrally managed is most probably the best option. Pick a database and roll with it; save yourself the headache.

In-memory state is volatile, even in Erlang. Less so when taken care of, more so when run in dynamic hosting environments.

Just my 2c


#16

Because containers are great for isolation and environment consistency. In my not-so-distant past I used to deploy to AWS using OpsWorks and Chef scripts… a nightmare compared to Kubernetes. But obviously I’m not saying “always containers”. My team and I, for example, tend to keep databases outside Kubernetes. We had big issues with MongoDB replication inside k8s.


#17

I understand. Chef, and AWS OpsWorks in particular, is really bad. I have been there too.

So I am also running some Elixir applications on Kubernetes, but I found that Kubernetes practically replicates much of what supervisors have to offer.

On the other hand, for any other language or service it is pretty darn powerful, and a bliss for maintenance and operations.


#18

I’ve never understood this comparison.

If that were the case, then I don’t know why Erlang would come with heart when supervisors are already there.