How to implement this distributed feature?

stefanchrobot · November 16, 2019, 5:10pm

Hi, for the fun and learning of it I’d like to implement the following feature in a distributed setup: an in-memory feature flags that can survive rolling deployments.

I have two instances with a load balancer in front. I want to keep a set of feature flags that can be turned on and off by API calls. Since the API call will arrive at only one service, the update has to be somehow seen by the other instance. I want to keep those feature flags only in memory (not persisted to disk). I’m doing rolling deploys: the nodes are restarted one by one in a way that at least one node is operating. I don’t really care about the case when both nodes are down.

I’d like to tackle this with distributed Erlang. I’d like to try two approaches:

Have only one process to keep the feature flags: whenever a node needs to check the status of a feature flag it has to find where the process is and talk to it. What would be the best tool for the job? Is swarm an overkill?
Have one process per node: upon joining the cluster, the process should sync the current state from the other processes and then be able to broadcast any updates. Is this even a good idea? What would be the right tool for broadcasting the updates? Should I look into :rpc?

keathley · November 16, 2019, 5:22pm

I built a version of this for an application at work: https://github.com/keathley/rollout. It solves the problem in a very specific way that worked for my use case. But it might be useful inspiration.

chasers · November 17, 2019, 4:33pm

I’d probably keep the flags on both boxes.

GenServer.multi_call the older one on init to sync state. multi_call both when you update the flags.

sb8244 · November 17, 2019, 7:25pm

Chris’s library has a caveat that I think is really important to note:

Flags are not maintained in between node restarts. New nodes added to your cluster will be caught up on the current state. But if you bring up an entirely new cluster your flags will revert to their default states. You can mitigate this problem by setting defaults for each flag in your Application.start/2 callback.

This problem is not specific to the rollout library, you’ll encounter it in any application that has in-memory persistence only. Be really sure you’re okay with potentially losing data. This is likely to happen at inconvenient times (like after some type of application outage), which can then be frustrating to fix.

Regarding sending messages across nodes, I’ve found that a simple send works really well. This requires knowing the remote process pid, which I use pg2 for. Something like this gist: Local/Distributed Caching · GitHub

I’ve used this technique in many projects now and it has worked really well in practice.

keathley · November 18, 2019, 3:24pm

I can provide some context for the decisions I made with Rollout, and that might help answer @stefanchrobot’s original questions.

I built Rollout for a clustered application that runs on a minimal number of nodes (< 10). We use these feature flags in the critical path of our requests, so checking a feature flag needed to be as fast as possible. The amount of data in these feature flags is small, so that meant that syncing the full state and putting a materialized view into ETS made the most sense here. There’s not that much data that needs to be sent to each node, so I opted to use consistent naming for each process on each node. This decision simplifies how we send messages to each node, and meant that we didn’t need to use a process registry.

We wanted these nodes to be available even if they couldn’t talk to other nodes. But we didn’t want to deal with race conditions from multiple commands being issued to different nodes or from stale data. So I opted to use last write win register CRDTS. We use a logical timestamp called a hybrid logical clock (HLC). These timestamps give us one-way causality tracking. When an update to a flag occurs, we broadcast the new state to the other processes with the same name. If another node has the same register, they’ll keep the register with the highest HLC. A similar version of this process occurs when a new node joins the cluster.

This design removes a lot of complexity around syncing data and ensuring that a node has the most current state. One of the main benefits is that it allows us to use casts in all of our communication since we don’t care if nodes miss specific messages. Each node will do a full sync of its flags to the other nodes in the cluster. Doing a full sync like this is a pretty naive way to try to have the nodes converge. This solution wouldn’t work for larger node sizes or if you had lots of data to send. But for my use case, it’s acceptable.

Hopefully, that helps provide a little more insight into my thought process and how I approach these kinds of problems.

Here’s the HLC library that we’re using in case you wanna check it out: https://github.com/toniqsystems/hlclock.