I'm trying to set up node clustering for my Elixir app using libcluster on Kubernetes (hosted on DigitalOcean). I'm running into some challenges with nodes discovering each other and maintaining cluster stability. I'm also using Oban for background jobs.
Specifically, I'd love guidance or examples on:
Configuring Libcluster for K8s service discovery
Best practices for running Elixir nodes in pods
Handling dynamic scaling while keeping the cluster healthy
Any tips, example configs, or experiences you can share would be really appreciated!
I don’t know what your use cases are, but my advice would be to consider if you really need Kubernetes.
A single BEAM instance (or two, for fault tolerance) on DigitalOcean can serve a large number of concurrent users.
K8s adds a lot more concepts and moving parts to your deployment (I’m a former OpenShift dev), and while it does make sense in some cases, it is not always the best solution.
Are you using Phoenix? The official deployment docs give you some guidance on clustering.
K8s creating pods with dynamic hostnames and IPs makes Elixir clustering hard.
Our approach maps each hostname to a role (or ID) and stores that mapping on every node. Any node in the cluster can then look up other nodes by role/ID to make RPC calls. This lets us scale out a service: multiple nodes share the same role, and we select one of them to do the task.
Ex: :"pod1@dns" and :"pod2@dns" have role :web, and :"pod3@dns" has role :data.
Another way is to put the role as a prefix in the pod hostname and select nodes by checking that prefix (ex: :"web.pod1@dns", :"web.pod2@dns", :"data.pod@dns", …), converting the Elixir node name to a string.
I solved this for my MVP by writing a small library that maps roles to node names. I shared it on my blog.
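The prefix idea above can be sketched roughly like this (the module name and the `<role>.<pod>@<host>` naming scheme are illustrative, not the library from the blog post):

```elixir
# Hypothetical sketch: derive a node's role from a prefix in its name,
# then pick any connected node carrying the wanted role.
defmodule RoleRegistry do
  # :"web.pod1@dns" -> :web  (assumes "<role>.<pod>@<host>" node names)
  def role(node) do
    node
    |> Atom.to_string()
    |> String.split(".", parts: 2)
    |> hd()
    |> String.to_atom()
  end

  # All currently connected nodes (plus self) that carry the given role.
  def nodes_for(wanted_role) do
    [Node.self() | Node.list()]
    |> Enum.filter(fn n -> role(n) == wanted_role end)
  end

  # Pick one node with the role and call an MFA on it.
  def call(wanted_role, m, f, a) do
    case nodes_for(wanted_role) do
      [] -> {:error, :no_node_with_role}
      nodes -> :erpc.call(Enum.random(nodes), m, f, a)
    end
  end
end
```

`Enum.random/1` gives a crude form of load spreading across nodes with the same role; a real implementation would likely want something smarter (round-robin, load-aware selection).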
We are using BEAM clustering on Kubernetes, both in production and in several development and testing environments, with libcluster and Cluster.Strategy.Kubernetes.DNS (libcluster v3.5.0). It works quite well.
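For reference, a minimal topology for that strategy looks roughly like this (the service and application names are placeholders; the strategy resolves the headless Service's DNS A records and connects to `<application_name>@<pod-ip>`, so the release must be named accordingly):

```elixir
# config/runtime.exs — sketch, with placeholder names
config :libcluster,
  topologies: [
    k8s_dns: [
      strategy: Cluster.Strategy.Kubernetes.DNS,
      config: [
        service: "myapp-headless",    # headless Service selecting your pods
        application_name: "myapp",    # node basename, i.e. myapp@<pod-ip>
        polling_interval: 10_000      # ms between DNS lookups
      ]
    ]
  ]
```

And in the application supervision tree:

```elixir
children = [
  {Cluster.Supervisor,
   [Application.fetch_env!(:libcluster, :topologies),
    [name: MyApp.ClusterSupervisor]]}
  # ... the rest of your children
]
```

Note that this requires the node to be started with a name matching the pod IP, e.g. `RELEASE_DISTRIBUTION=name` and `RELEASE_NODE=myapp@${POD_IP}` with the pod IP injected via the Downward API.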
My main issue with this setup (besides the general complexity of Kubernetes) is that every BEAM node starts quite a few processes registered via :global, and those get shut down as soon as the node joins the cluster and the duplicate names are resolved.
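For context on that shutdown behaviour: when clusters merge, :global resolves duplicate names by calling a resolve function, and the default, `:global.random_exit_name/3`, exits one of the two conflicting processes. You can register with a different resolver to control what happens instead (the name below is illustrative):

```elixir
# Instead of letting :global kill one process at random on a name clash,
# :global.notify_all_name/3 unregisters the name and sends both processes
# a {:global_name_conflict, name, other_pid} message so they can react.
:global.register_name(MyApp.Singleton, self(), &:global.notify_all_name/3)
```

That doesn't remove the fundamental race on startup, but it lets the surviving process decide how to hand over or shut down cleanly.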
We do have dynamic scaling using Karpenter, but we currently scale on CPU load, which is not an ideal metric.
As others have also said, I would strongly consider whether Kubernetes is needed, as it does complicate things. But I think it works quite well together with the BEAM, even though the BEAM can live without it.