Libcluster intermittent disconnects - what can I configure to create a more reliable cluster?

aramallo · July 4, 2023, 6:37am

Ah nice I didn’t know that!

aramallo · July 4, 2023, 7:21am

BTW, Partisan doesn’t have support for global yet nor other capabilities like pg but I plan on adding those too. As always the challenge is implementing it correctly and efficiently when using HyParView or any user-defined membership strategy using partial-views/partial-mesh

chasers · July 14, 2023, 8:16pm

Was just reminded of this QUIC distribution protocol lib.

chasers · July 14, 2023, 8:17pm

Also fwiw, in the regions we’re in the Fly network has been a lot better in the last few months.

hubertlepicki · July 17, 2023, 7:05am

I’m interested in that too. I’ve seen similar issues with nodes disconnecting or traffic between them “hanging” from time to time and it’s been quite frustrating. I understand that Erlang distribution wasn’t built for this sort of deployment in mind, and yet people do run huge clusters in the cloud (see WhatsApp) and it somehow gets orchestrated more reliably than what fly.io and (Gigalixir)[Clustering | Gigalixir] docs suggest.

D4no0 · July 17, 2023, 7:24am

With the amount of money they have, they might have some specific in-house built tools to deal with these problems.

hubertlepicki · July 17, 2023, 7:36am

AFAIK they used to run a custom Erlang distribution mechanism called “WANDIST” but they gave up on the idea and the cluster as they currently have is a mostly standard Erlang distribution:

Maybe we can summon @max-au here to share some tips on things like tweaking the defaults and general approach to clustering over semi-reliable infrastructure that happens to be the cloud.

But overall, this should be a “solved problem” and not happening on public clouds, especially those that target Elixir/Erlang deployments like Fly.io or Gigalixir and it would be within their best interest to develop and share techniques that we can incorporate to avoid random clustering issues from time to time.

hubertlepicki · July 17, 2023, 7:56am

Wow, this is very interesting. Did you ever manage to test it on fly.io or not yet?

hubertlepicki · July 17, 2023, 7:59am

Generally this wouldn’t be a big issue but most folks here probably would like to use Phoenix PubSub, and it relies either on Redis or pg backends. Do you happen to know a Partisan PubSub backend?

aramallo · July 17, 2023, 2:17pm

Hi @hubertlepicki , not yet. Trying to find the time in the coming months.

aramallo · July 17, 2023, 2:19pm

Partisan doesn’t offer PubSub. I’ve developed a distributed PubSub implementation in Bondy which itself uses Partisan

aramallo · July 17, 2023, 2:22pm

What I am definitively planning on doing is adding QUIC support for Partisan (echoing @chasers mention of it)

max-au · July 18, 2023, 3:12am

I would not call it “gave up”. In fact it was a long process of explaining why maintaining a fork of Erlang distribution (and an OTP fork too) isn’t the best approach. On one side, it allows for quick hacks here and there - when we needed to change some distribution behaviour, we could easily do that.
On the other side, it allowed for quick, dirty and ugly hacks that were only good for short-term PSC - “performance review” - goals. For real - "I drilled that many holes and short-circuited some things, breaking signal ordering guarantees, but I saved 3% CPU globally.
Those unfortunate enough who wanted to work together with community, bring new OTP versions (not that easy with forks of everything!), fix bugs upstream instead of making local patches, were suffering the most. Often working nights when dirty hacks ruined servers.
Thankfully, a strong engineering lead succeeded in pushing that project through.

Now, back to original question, Erlang distribution works best with reliable networks, with predictable latency and throughput. There are, indeed, a large number of options one could leverage to make it work better across the pond. Overseas, I’d say
But the first thing to do would be to understand what issues are actually there, and whether something could be done even outside of the BEAM. To begin with, I’d recommend collecting network statistics (which we meticulously did!). That includes, but not limited to, TCP retransmit counters, send and receive buffer sizes (and utilisation), disconnection reasons. Generally make the underlying network more reliable. This is relatively easy within one datacenter, but could be challenging otherwise.

When there is some statistics, and some understanding of what are the problems, and what are the causes, I could probably help with figuring our some configuration changes. One was already mentioned, net_kernel tick timer, but you may want to go further and have a gen_server that would periodically “ping” other nodes and collect latency statistics - and even disconnect_node/connect_node as a mitigation.

hubertlepicki · July 18, 2023, 7:40am

Thank you for your time, it’s very helpful.

This is precisely why I mentioned that it’s in the hands of public cloud maintainers or even more so in the hands of PAAS platforms like Gigalixir or Fly.io to tweak their network infrastructure, which I hope they do. But things like Fly.io from the very start push towards a globally distributed network of nodes, and I strongly suspect ensuring network reliability in this context is a task that will only become possible once we have some sort of quantum entanglement near-instantaneous communication between data centers - and not underwater cables connecting those nodes.

I am yet to try Parisan @aramallo and it looks like it is now actively developed and maintained, which is great. I had a look at it before but it didn’t seem to be the case. One thing I mentioned above would be great to have, however, is Partisan-based Phoenix.PubSub backend (Phoenix.PubSub — Phoenix.PubSub v2.1.3). This is often heavily used by Phoenix apps, both having LiveView UIs but also doing any sort of GraphQL/Subscription or other tasks, as a notification layer and currently there’s pg and Redis backend, and I think pg doesn’t work with Partisan (? Partisan - high-performance, high-scalability distributed computing with Erlang and Elixir - #8 by aramallo - Libraries - Erlang Programming Language Forum - Erlang Forums). So obviously making pg work would make Phoenix.PubSub work, which then means that probably 90% of Phoenix apps can be switched over from either standard Erlang distribution clustering or Redis backend to Partisan, which would be very exciting, as I understand it’s designed to handle these sort of cloud networking infrastructure problems better than standard Erlang.

benwilson512 · July 18, 2023, 10:35am

Yes and no. Partisan does not make the network more reliable, it just handles a less reliable network with different trade offs. If your nodes are in fact not connected to one another, the Phoenix.PubSub paradigm flat won’t work, Partisan or not.

Indeed, but there’s a big difference between having a globally distributed set of stateless nodes and a globally distributed set of stateful nodes. The latter is an incredibly hard problem. If you take an Erlang / Elixir process architecture designed for a local reliable network and you scale it out globally you’re going to have a bad time. You’re looking at really cutting edge stuff in any language.

max-au · July 20, 2023, 5:56am

I’m afraid this applies to all programming languages and solutions. Agree with @benwilson512 that there could be an implementation with different tradeoffs. For example, forfeiting signal ordering guarantee with using non-synchronised UDP datagrams, which will alleviate packet loss. But complexity has to live somewhere - it is much harder to write software that does not rely on signal ordering.

Hence I’d think of defining a problem first. In most cases, disconnecting a node isn’t a problem per se, while lost redundancy (or consistency) caused by lost connectivity might be. Defining high-level metrics that need to stay below (or above) a certain treshold (e.g. “error ratio for read request”) could be the first step. It’s important to know what matters (e.g. user perceived latency) versus what is easy to observe (stream of log messages about disconnected nodes).
In the absence of such metrics it’d be hard to know whether changing distribution carrier moved the needle in the right direction.

low.sock · July 22, 2023, 12:26am

Thanks everyone for your responses. Other things have taken my focus for a while but I’m focusing on getting as much as I can measurement wise. That way I can more confidently understand the configuration changes that I’m making and the (potential) impact on users.

chasers · August 29, 2023, 2:33am

Relevant:

:erlang.system_monitor(self(), [:busy_dist_port])