Distributed features and dynamic cluster membership

stefanchrobot · January 6, 2020, 8:28pm

I’m going through “Elixir in Action” for the second time while I’m working on building my next project. Throughout the chapters, @sasajuric is building a distributed web application which has a process per business entity (to-do list) that synchronizes the access to the resource. The application uses a global process registry to keep track of where each process is. From what I understand, the application is built for availability rather than consistency.

I’m wondering if the same technique could be reliably used for an application built for consistency. In my use case I’m working with a daily timetable and multiple users scheduling events. I have a business requirement that only one event can take place at a time. The requirement is a bit more involved so I’d rather not solve this with a RDBMS constraint especially that I expect the requirements to change in this area. It’s very tempting to solve this with a functional model wrapped with a process as a means for synchronization and consistency.

The more I read about the challenges of distributed solutions the more paranoid I get. When I realized that in some cases it can take over a minute for a node to acknowledge that some other node went down I begin to question if things like :global or Swarm are even usable in my scenario. I have a feeling that I’m getting something like “it’s going to work in 9X% cases”, but at a certain scale those few missing percentages are going to be a pain. And then there’s the issue of deployments: how does :global work with rolling deployments (how does it handle the case of node being stopped but another node forwarding requests to a process inside of it)? Are hot code upgrades the only reasonable solution for distributed systems?

On top of that, I’m going to run my application on Gagalixir, where nodes come and go and there’s a period of time during deployment where there’s always one extra node running which makes it a bit difficult to decide what exactly the quorum is if I were to use one.

With all that being said, it seems that as a general advice it’s best to keep away from clustering nodes and to rely on external synchronization mechanism (RDBMS constraints, explicit RDBMS locks, Redis) if consistency is important. Am I getting this right?

I think that adopting a usual request-response/sync-via-DB approach might work in my case (I can impose a granularity of 15 minutes for event scheduling and use a unique constraint in RDBMS), but I hoped I could leverage some of the Erlang’s features here. If I were to sacrifice some of the consistency for availability, are the any good recipes for this? Should I run some kind of background auditing job that would catch all the inconsistencies?

Are there any resources (books maybe?) about building such systems and how to tackle those challenges?

PS. thanks if you made it this far.

keathley · January 7, 2020, 1:16am

I’d say that this is correct when it comes to storing important information. If you need consistency don’t try to write your own database.

But not all data in your system is crucial and in those scenarios clustering can be a useful tool. We use distributed erlang for caches and short lived user interactions, among other things. Both techniques are useful when used appropriately.

I wrote about some of these issues here and have given talks on it in the past.

stefanchrobot · January 8, 2020, 11:22am

Wow, thanks! Your blog post and the talk touch on exactly the points that I was confused about. Great stuff.