Running into challenges with nodes discovering each other and maintaining cluster stability using Libcluster on Kubernetes

Hi everyone,

I'm trying to set up node clustering for my Elixir app using Libcluster on Kubernetes (hosted on DigitalOcean). I'm running into some challenges with nodes discovering each other and maintaining cluster stability. I'm also using Oban for background jobs.

Specifically, would love guidance or examples on:

  • Configuring Libcluster for K8s service discovery

  • Best practices for running Elixir nodes in pods

  • Handling dynamic scaling while keeping the cluster healthy

Any tips, example configs, or experiences you can share would be really appreciated!

Thanks in advance

1 Like

I don’t know what your use cases are, but my advice would be to consider if you really need Kubernetes.

A single BEAM instance (or two, for fault tolerance) on DigitalOcean can serve a large number of concurrent users.

K8s adds a lot more concepts and moving parts to your deployment (I’m a former OpenShift dev), and while it does make sense in some cases, it is not always the best solution.

Are you using Phoenix? The official docs give you some guidance:

7 Likes

This is the right answer. And setting up a K8s cluster for distributed Elixir becomes more complex without the benefits.

I would only use K8s if I had to manage a mixture of different stacks and wanted a standardized way to deploy them.

4 Likes

Have you considered using this strategy?

2 Likes

Small correction: there's no fault tolerance with one or two instances. Always go with three or a larger odd number.

Context matters here. Maybe you’re thinking about distributed consensus?

Also, systemd provides most of the functionality that K8s would, without the complexity. So even if your application crashes, it will get restarted.
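To make the restart-on-crash point concrete, here is a minimal systemd unit sketch for a release; the app name, user, and install path are placeholders, not taken from the thread:

```
# /etc/systemd/system/myapp.service -- minimal sketch, names/paths are assumptions
[Unit]
Description=My Elixir app
After=network.target

[Service]
Type=simple
User=myapp
# `bin/myapp start` is the standard entry point of a mix release
ExecStart=/opt/myapp/bin/myapp start
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

With `Restart=on-failure`, systemd restarts the whole BEAM instance if it exits abnormally, which covers the basic supervision role K8s would otherwise play.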

K8s creating pods with dynamic hostnames and IPs makes Elixir clustering hard.

Our approach maps each hostname to a role (or ID) and stores the mapping on every node. Any node in the cluster can then look up other nodes by role/ID for RPC calls. This way we can scale out a service: multiple nodes share the same role, and one of them is selected to do the task.
Ex: :"pod1@dns" and :"pod2@dns" have role :web, and :"pod3@dns" has role :data.

I think another way is to use a prefix in the pod hostname and select nodes by checking that prefix (ex: :"web.pod1@dns", :"web.pod2@dns", :"data.pod@dns", …) after converting the Elixir node name to a string.
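The prefix idea above can be sketched in a few lines; this is an illustrative module (the `RoleNodes` name and the node-naming scheme are assumptions, not from the poster's lib):

```elixir
defmodule RoleNodes do
  @moduledoc """
  Sketch: select connected nodes by a role prefix in the node name,
  assuming nodes are named like :"web.pod1@dns" or :"data.pod@dns".
  """

  @doc "Return all connected nodes whose name starts with the given role prefix."
  def by_role(prefix) do
    Enum.filter(Node.list(), fn node ->
      node |> Atom.to_string() |> String.starts_with?(prefix <> ".")
    end)
  end

  @doc "Pick one node with the role at random, e.g. to spread RPC calls."
  def pick(prefix) do
    case by_role(prefix) do
      [] -> nil
      nodes -> Enum.random(nodes)
    end
  end
end

# Usage sketch: call a function on some :web node via :erpc
# case RoleNodes.pick("web") do
#   nil  -> {:error, :no_web_node}
#   node -> :erpc.call(node, MyMod, :my_fun, [])
# end
```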

I solved this issue for my MVP by writing a small lib that maps roles to node names. I shared it on my blog.

Hope you can get some idea for your case.

Updated: Add more details

1 Like

We are using BEAM clustering on Kubernetes both in production and in several development and testing environments. We use libcluster with Cluster.Strategy.Kubernetes.DNS (libcluster v3.5.0). It works quite well.
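For reference, the DNS strategy is configured along these lines; the service and application names here are placeholders (your headless Service must select the app's pods, and `application_name` must match the part of the node name before the `@`):

```elixir
# config/runtime.exs -- minimal sketch, "myapp" names are assumptions
config :libcluster,
  topologies: [
    k8s_dns: [
      strategy: Cluster.Strategy.Kubernetes.DNS,
      config: [
        # headless Kubernetes Service whose DNS A records list the pod IPs
        service: "myapp-headless",
        # node basename, i.e. nodes are named :"myapp@<pod-ip>"
        application_name: "myapp",
        polling_interval: 5_000
      ]
    ]
  ]
```

You then start `Cluster.Supervisor` with these topologies in your application's supervision tree, per the libcluster docs.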

My main issue with this setup (besides Kubernetes complexities) is that all BEAM nodes start quite a few :global-registered processes, which are shut down as soon as the node joins the cluster.
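This is the standard :global behavior: the same name cannot be registered twice cluster-wide, so when previously isolated nodes join, the name conflict is resolved and duplicates get terminated. A minimal sketch of such a process (module name is illustrative):

```elixir
defmodule MyApp.Singleton do
  @moduledoc """
  Sketch: a GenServer registered under a :global name. On an isolated
  node it starts fine; once nodes cluster, :global resolves duplicate
  registrations and all but one copy are shut down.
  """
  use GenServer

  def start_link(opts \\ []) do
    # If the name is already taken anywhere in the cluster, this returns
    # {:error, {:already_started, pid}} instead of starting a second copy.
    GenServer.start_link(__MODULE__, opts, name: {:global, __MODULE__})
  end

  @impl true
  def init(opts), do: {:ok, opts}
end
```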

We do have dynamic scaling using Karpenter, but are currently doing it by CPU load, which is not an ideal metric.

As others have also said, I would strongly consider whether Kubernetes is needed, as it does complicate things. But I think it works quite well together with the BEAM, even though the BEAM can live without it :slight_smile: