We currently deploy our Elixir apps on Kubernetes and cluster them with libcluster.
I’ve identified some odd behaviours that I can confirm are related to nodes connecting to the cluster, along with other issues related to how we’ve set up our clusters and process groups.
I am confident the cause is how we’ve set up and configured our deploys, but I’ve been unable to find tools that make it easy to identify the anomalies and to investigate where we have misconfigured something.
Are there any tools I’ve missed that focus on solving these problems? Phoenix LiveDashboard and Observer are great for node-specific information, but they won’t help with checking how we’ve registered process groups or with finding orphaned nodes.
If there aren’t any tools available, I will try to build something useful. If there’s any reading or Erlang/Elixir documentation that you think would be useful, please let me know.
Some examples of behaviours that I’ve seen:
When a node fails to connect to the cluster, the application on that node still starts all of its processes and tools.
So for queue-based things like SQS or Oban, the disconnected node will still pull the job or handle the message, but if the handler uses Phoenix.PubSub or sends a message to a pid, that message can be silently lost.
Another scenario we’ve had: I think we have not made good use of process groups or global names, so with a tool like Quantum, even with the correct clustering strategy set, we get duplicated behaviour.
The tooling I’m looking for would provide visibility into which nodes are in and out of the cluster, and into how GenServers and similar processes are registered at the cluster level.
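To make that concrete, here is a rough sketch of the kind of snapshot I have in mind. It is not an existing tool: the ClusterSnapshot module and its take/1 function are names I made up for illustration, and it assumes the list of expected nodes is passed in by hand and that any process groups live in the default :pg scope. It only uses Node.list/0, :global and :pg from the standard library.

```elixir
defmodule ClusterSnapshot do
  @moduledoc """
  Hypothetical helper (not an existing library): collects a cluster-level
  view as seen from the node it runs on.
  """

  # `expected_nodes` is an assumption: in practice it might come from the
  # libcluster topology or the Kubernetes API; here it is supplied manually.
  def take(expected_nodes \\ []) do
    connected = [Node.self() | Node.list()]

    %{
      connected: connected,
      missing: expected_nodes -- connected,
      global_names: global_names(),
      pg_groups: pg_groups()
    }
  end

  # Names registered through :global, paired with the node that owns each pid.
  defp global_names do
    for name <- :global.registered_names(),
        pid = :global.whereis_name(name),
        is_pid(pid),
        do: {name, node(pid)}
  end

  # Groups in the default :pg scope, paired with the nodes of their members.
  # If the default scope was never started, report no groups instead of crashing.
  defp pg_groups do
    if Process.whereis(:pg) do
      for group <- :pg.which_groups(),
          do: {group, Enum.map(:pg.get_members(group), &node/1)}
    else
      []
    end
  end
end
```

Running ClusterSnapshot.take/1 from a remote shell on each pod and comparing the results should show which nodes disagree about membership or registrations, but it is manual and only covers :global and the default :pg scope, which is why I’m hoping a proper tool already exists.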
Any information you can add would be useful.