Are there any existing tools that monitor and visualise clusters?

We currently have our elixir apps deployed with kubernetes using libcluster.
I’ve identified some funny behaviours that I can confirm are related to nodes connecting to the cluster, plus other issues related to how we’ve set up our clusters and process groups.

I am confident it comes from how we’ve set up and configured our deploys, but I’ve been unable to find the right tools to make it easy to identify the anomalies and investigate where we have misconfigured something.

Are there any available tools I’ve missed that focus on solving these problems? Phoenix LiveDashboard and Observer are great for node-specific stuff, but they won’t help with checking how we’ve registered process groups or with finding orphaned nodes.

If there aren’t any tools available, I will try to build something useful. If there’s any reading or Erlang/Elixir docs that you think would be useful, please let me know.

Some examples of behaviours I’ve seen:
When nodes aren’t successfully connected to the cluster, the application still starts all its processes and tools.
So for queue-based things like SQS or Oban, the disconnected node will still pull the job or handle the message, but if it uses Phoenix.PubSub or sends a message to a pid, it can go into the ether.

Another scenario we’ve had: I think we have not made good use of process groups or global namespaces, so with a tool like Quantum, even with the correct clustering strategy set, we get duplicate behaviour (jobs running on more than one node).

The tooling I’m looking for would be focused on providing visibility into nodes joining and leaving the cluster, and additionally into how GenServers and the like are registered at the cluster level.

Any information you can add would be useful.


How would you show “out of cluster nodes”? You’re by definition not connected to them, hence have no knowledge of them. You cannot infer anything about a disconnected node’s status. It could be up, down, or anywhere in between, and that’s only if you know it exists in the first place. Things get a bit better if you can (somewhat) statically define how the cluster is meant to look, figure out which subset of the cluster has a quorum, and stop doing work if no quorum can be reached.
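
Roughly something like the sketch below, assuming you can statically configure the full list of expected node names (the `:expected_nodes` key and module name are made up):

```elixir
# A minimal quorum check, assuming something like
# config :my_app, expected_nodes: [:"app@pod-1", :"app@pod-2", :"app@pod-3"]
defmodule MyApp.Quorum do
  @doc "True when this node plus the expected peers it can see form a strict majority."
  def reached? do
    expected = Application.fetch_env!(:my_app, :expected_nodes)
    visible = MapSet.intersection(MapSet.new(expected), MapSet.new(Node.list()))

    # Count ourselves too, then require more than half of the expected cluster.
    MapSet.size(visible) + 1 > div(length(expected), 2)
  end
end
```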

These are quite classical distributed computing complexities. If those are new to you, there’s the Two Generals’ Problem thought experiment that’s commonly pulled out.


I largely have a similar experience to yours with Kubernetes and Elixir. It gets even trickier if you have autoscaling and the pods get resized horizontally and vertically.

I tend to report on an ongoing basis, from each node, the number of nodes it sees in the cluster. We report that using a custom Telemetry metric that gets sent to DataDog. If different pods keep reporting different numbers, it means something is wrong and we raise an alert. I think it can be further automated to mark the pod as broken using a liveness probe and get it to shut down/restart in such a case.
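
Roughly, the reporter is a sketch like this (module name, interval, and event name are made up; a telemetry reporter such as a StatsD/DataDog one picks the event up as a gauge):

```elixir
defmodule MyApp.ClusterReporter do
  use GenServer

  @interval :timer.seconds(30)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    schedule()
    {:ok, %{}}
  end

  @impl true
  def handle_info(:report, state) do
    # Emit the number of peers this node currently sees; alert when the
    # numbers reported by different pods disagree for too long.
    :telemetry.execute(
      [:my_app, :cluster, :status],
      %{connected_nodes: length(Node.list())},
      %{node: Node.self()}
    )

    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :report, @interval)
end
```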

I don’t think the PubSub should be used to send any mission-critical information. We pretty much use it nowadays to send a message along the lines of “something happened, the UI must be updated”. We don’t send important payloads over the wire; we send just a notice. And on top of that we have polling, so that if a message is in fact missed, a scheduled refresh happens in the worst case within 30s or so and the UI still updates for the user, just with a slight delay.
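
A sketch of that pattern in a LiveView (module names, the topic, and the Orders context are made up):

```elixir
defmodule MyAppWeb.OrderLive do
  use Phoenix.LiveView

  @poll_interval :timer.seconds(30)

  def mount(%{"id" => id}, _session, socket) do
    if connected?(socket) do
      Phoenix.PubSub.subscribe(MyApp.PubSub, "orders:#{id}")
      Process.send_after(self(), :poll, @poll_interval)
    end

    {:ok, assign(socket, id: id, order: MyApp.Orders.get!(id))}
  end

  def render(assigns) do
    ~H"<div><%= inspect(@order) %></div>"
  end

  # The broadcast carries no payload, only a notice that something changed,
  # so losing it is not fatal.
  def handle_info({:order_updated, _id}, socket) do
    {:noreply, assign(socket, order: MyApp.Orders.get!(socket.assigns.id))}
  end

  # Polling fallback: even if the PubSub notice never arrives, the UI
  # catches up within the poll interval.
  def handle_info(:poll, socket) do
    Process.send_after(self(), :poll, @poll_interval)
    {:noreply, assign(socket, order: MyApp.Orders.get!(socket.assigns.id))}
  end
end
```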

We use Oban more and more. Oban Pro can replace Quantum, and we have not been seeing the same problems we had where duplicate scheduled jobs were executed. Oban does not rely on the cluster being established; instead it synchronizes everything through PostgreSQL to ensure that a job executes on only one of the nodes at a time.
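
For illustration, a Quantum-style schedule can be expressed through Oban’s cron plugin; this sketch uses the open-source Oban.Plugins.Cron (Oban Pro layers its own scheduling and uniqueness features on top), and the repo and worker names are made up:

```elixir
import Config

config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [default: 10],
  plugins: [
    {Oban.Plugins.Cron,
     crontab: [
       # Coordinated through Postgres, so the job is inserted once per
       # schedule regardless of how many nodes are running.
       {"0 * * * *", MyApp.Workers.Cleanup}
     ]}
  ]
```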

We have observed quite a big delay between when the pod starts and when it connects to the cluster. We don’t want Oban jobs to be executed in that time, and we also don’t want some other things to happen. Instead of starting Oban in application.ex, we start a custom supervisor in the application callback module, plus a watchdog GenServer process. The watchdog process checks whether we have a successful connection to the database and whether we’ve managed to connect to the cluster, by checking if we see other nodes. Only then do we start Oban and some other processes, to increase the likelihood that the messages sent out from background jobs do not end up in the “ether”, as you said.
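
A rough sketch of that watchdog, assuming Ecto for the database check; the module names, the interval, and the deferred supervisor that Oban gets started under are all made up:

```elixir
defmodule MyApp.ClusterWatchdog do
  use GenServer

  @check_every :timer.seconds(5)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    send(self(), :check)
    {:ok, :waiting}
  end

  @impl true
  def handle_info(:check, :waiting) do
    if db_up?() and clustered?() do
      # Start Oban (and anything else that must not run in isolation) only now.
      # MyApp.DeferredSupervisor is a plain supervisor started empty in the
      # application callback module.
      {:ok, _pid} =
        Supervisor.start_child(
          MyApp.DeferredSupervisor,
          {Oban, Application.fetch_env!(:my_app, Oban)}
        )

      {:noreply, :started}
    else
      Process.send_after(self(), :check, @check_every)
      {:noreply, :waiting}
    end
  end

  def handle_info(:check, state), do: {:noreply, state}

  # Assumes MyApp.Repo is already in the supervision tree; the query only
  # tells us whether the database is reachable yet.
  defp db_up? do
    match?({:ok, _}, Ecto.Adapters.SQL.query(MyApp.Repo, "SELECT 1", []))
  end

  defp clustered?, do: Node.list() != []
end
```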

We got rid of all the cluster-unique GenServer processes, be it via :global or through Swarm or Horde; this has proven to be too much of a hassle and too unstable with multiple nodes and a quite rapidly changing number of pods connected over a cloud connection. Stuff like handing data over was unreliable and we never got it 100% right. Instead we again rely mostly on Oban unique jobs and on centralised communication via Postgres.
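
For example, a worker that leans on Oban’s Postgres-backed uniqueness instead of a cluster-wide singleton (the worker module and the domain call it makes are made up):

```elixir
defmodule MyApp.Workers.SyncAccount do
  use Oban.Worker,
    queue: :default,
    # Only one job with the same worker and args may exist within the
    # period, no matter how many nodes try to enqueue it.
    unique: [period: 60, fields: [:worker, :args]]

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"account_id" => account_id}}) do
    # Hypothetical domain call; the point is that uniqueness is enforced
    # by Postgres, not by the Erlang distribution.
    MyApp.Accounts.sync(account_id)
    :ok
  end
end

# Enqueued the same way from any node:
# %{account_id: 42} |> MyApp.Workers.SyncAccount.new() |> Oban.insert()
```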

This is probably not what you were looking for in terms of an answer, but it’s what I’ve figured out works in an environment like Kubernetes with an autoscaler, which is quite unpredictable. The Erlang clustering implementation is really best suited to several servers connected directly through a cable. I believe folks who are serious about running multi-dozen-node Erlang clusters doing mission-critical stuff have their own clustering layers; that’s what WhatsApp does, and others too, I believe.


How would you show “out of cluster nodes”?

I have not quite figured out exactly what the implementation would look like, but my current thinking is that the reason this is an issue for us is that there’s nothing wrong with connectivity to external sources. I could keep a simple node connection/cluster status registry in something like Redis that I’d be able to interrogate.
That, together with logs, would help me find where it’s failing, or do something to self-heal.
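
Something roughly like this is what I have in mind, assuming a Redix connection registered as :redix and Jason for encoding (the key layout is made up):

```elixir
defmodule MyApp.ClusterRegistry do
  @ttl_seconds 60

  @doc "Each node calls this periodically to record what it can see."
  def heartbeat do
    key = "cluster:#{Node.self()}"
    payload = Jason.encode!(%{peers: Node.list(), at: DateTime.utc_now()})
    Redix.command!(:redix, ["SET", key, payload, "EX", Integer.to_string(@ttl_seconds)])
  end

  @doc "Interrogate the registry from anywhere that can reach Redis."
  def snapshot do
    {:ok, keys} = Redix.command(:redix, ["KEYS", "cluster:*"])

    Map.new(keys, fn key ->
      {:ok, value} = Redix.command(:redix, ["GET", key])
      {key, Jason.decode!(value)}
    end)
  end
end
```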

If those are new to you, there’s the Two Generals’ Problem.

I don’t think we are quite there yet. I know for sure there are tracking and confirmation steps we have not implemented yet. That would be the next step.

For the process groups, I think it’s possible to confirm, for instance with Quantum (picking on it only because it’s in my example), whether each node is part of a group or has independently created its own process.
I’ve only done the reading, but it looks like I could use some built-in Erlang functions to surface this. Though I’d need to either simulate a cluster locally or have a way to run the interrogation in the environment. It could possibly lean on the registry component I mentioned above.
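
From my reading so far, something like the sketch below should work from a remote shell or a mix task, assuming a :pg scope is running (I use the default :pg scope here; Quantum’s actual registration details may differ):

```elixir
defmodule MyApp.ClusterInspect do
  @doc "For every connected node, list the pg groups and members it sees."
  def process_groups(scope \\ :pg) do
    for node <- [Node.self() | Node.list()], into: %{} do
      groups =
        for group <- :erpc.call(node, :pg, :which_groups, [scope]), into: %{} do
          {group, :erpc.call(node, :pg, :get_members, [scope, group])}
        end

      {node, groups}
    end
  end

  @doc "Compare globally registered names as seen from each node."
  def global_names do
    Map.new([Node.self() | Node.list()], fn node ->
      {node, :erpc.call(node, :global, :registered_names, [])}
    end)
  end
end
```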

Why would you think your nodes would be able to be perfectly connected to that Redis node, but wouldn’t be able to do the same for the Erlang cluster? An approach like this can help, but it is not a solution. It has the same problems you have right now.

One of the more recent OTP versions did indeed implement a feature that prevents overlapping cluster partitions. That’s something else, though: it kicks out nodes that are able to establish a connection to only some, not all, of the other nodes in the cluster.

This does not prevent true split brain situations where you have a node or even a subset of the cluster completely separate from other nodes/subsets.


Why would you think your nodes would be able to be perfectly connected to that Redis node

A big part of the reason I would try to build something is that this is exactly what we are seeing. The application starts perfectly and pulls from the DB/SQS perfectly, but then if we do anything that requires passing a message to another node, it disappears because the node isn’t in the cluster.
The node also wouldn’t start at all if it couldn’t connect to those services; I’m actually not sure why it starts when it can’t be part of the cluster.
We currently get around it by not depending on node-to-node communication, but at the same time it feels wrong not to use core capabilities of the VM/language.

I fully get that the idea could lead to a complete dead end, but I also don’t want to do nothing or avoid the problem.

This does not prevent true split brain situations where you have a node or even a subset of the cluster completely separate from other nodes/subsets

This touches on part of my question: what do people currently do to detect a split brain/partition? Surely the approach is not to wait for fires and sadness.

In hindsight it might have been better to split this into two topics, because I’m okay with the solutions being separate.

The first is about disconnecting nodes, and there may well not be a nice answer.

The second is about monitoring clusters and having visibility into behaviour and health.

Delay whatever starts processing these jobs until you see other nodes in the cluster. This is the lowest-hanging fruit; it may not solve all your problems, but probably 90% of them.


I tend to report on an ongoing basis, from each node, the number of nodes it sees in the cluster. We report that using a custom Telemetry metric that gets sent to DataDog. If different pods keep reporting different numbers, it means something is wrong and we raise an alert. I think it can be further automated to mark the pod as broken using a liveness probe and get it to shut down/restart in such a case.

This is a great idea/approach.

I don’t think the PubSub should be used to send any mission-critical information.

Agree, and nothing around “liveness” is mission-critical. I also have polling fallbacks. We do large volumes of processing for a “root” that reports the status, so if you have 2,000 Oban jobs and 10 workers, with 1 disconnected, you’d have roughly 200 events that don’t reach you by default. I work around some of that by reporting on the aggregate when an event comes in, but if the loss is at the tail end of the events, you notice a consistent need to fall back to polling.

Oban Pro can replace Quantum

It’s something we are talking about doing. I’m keen to do it regardless, but I don’t want to move on until I’m certain why it’s misbehaving and can decide whether to deal with it or not.

We have observed quite a big delay between when the pod starts and when it connects to the cluster. We don’t want Oban jobs to be executed in that time, and we also don’t want some other things to happen. Instead of starting Oban in application.ex, we start a custom supervisor in the application callback module, plus a watchdog GenServer process. The watchdog process checks whether we have a successful connection to the database and whether we’ve managed to connect to the cluster, by checking if we see other nodes. Only then do we start Oban and some other processes, to increase the likelihood that the messages sent out from background jobs do not end up in the “ether”, as you said.

I think there’s a strong chance that our problems are related to this, thanks!

We got rid of all the cluster-unique GenServer processes, be it via :global or through Swarm or Horde; this has proven to be too much of a hassle and too unstable with multiple nodes and a quite rapidly changing number of pods connected over a cloud connection.

Very interesting. What did you use to check the stability? More metrics sent to DataDog? I suspect we’ll lean this way as well.

This is probably not what you were looking for in terms of an answer, but it’s what I’ve figured out works in an environment like Kubernetes with an autoscaler, which is quite unpredictable.

This has been a fantastic and useful response! It’s very much along the lines of what I was looking for.

I’m definitely going to give this a try.