Erlang/Elixir native Etcd, Zookeeper alternative

Given the number of network round trips required by consensus protocols, I don’t think disk IOPS play much of a role there.

It seems to me, too, that there is no good building block for creating a distributed CP system in Erlang/Elixir, the way the underlying Raft implementation of etcd allows. Every time I was involved in building a distributed system, it had a self-written implementation of Paxos/Raft, or some simplified algorithm for its own needs. (riak_ensemble is no exception, since as far as I know it is only used in Riak.)

Just to note: etcd/raft is a building block for exactly this kind of application. It is used by many other projects (https://github.com/coreos/etcd/tree/master/raft#notable-users) thanks to a library architecture that makes it easy to embed and customise for your own needs. Nothing like that exists in the Erlang/Elixir ecosystem at the moment that would make it easy to build systems with CP properties.

rabbitmq/ra could perhaps become for Erlang/Elixir what etcd/raft is for Go. I really hope it becomes the de facto standard Raft implementation for the ecosystem: battle-tested and usable as a foundation for building software on top of it.

Then building something like etcd to run directly inside the VM would become a very easy task.

2 Likes

https://community.hortonworks.com/articles/62667/zookeeper-sizing-and-placement-draft.html

I’m not a Zookeeper expert, however, there are a few “Productionalizing Kafka” talks that caution against slow Zookeepers. The link above also advises that Zookeeper be given its own disk to avoid contention and multi-second delays.

Apparently, every piece of data that ZK acts upon is first written synchronously to a transaction log. Transaction log writes then need to be as fast as possible or everything else slows down.
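A minimal sketch of the pattern described above, making no claims about ZK’s actual implementation (all names here are invented for illustration): every entry is appended and fsynced before the write is acknowledged, so fsync latency directly bounds the update rate.

```elixir
# Sketch of a write-ahead log where every entry is forced to disk
# before the write is acknowledged -- the pattern that makes ZK's
# transaction-log disk speed matter. Illustrative names only.
defmodule WalSketch do
  def open(path), do: File.open!(path, [:append, :binary])

  # Append one entry and fsync before returning. The :file.sync/1
  # call is the per-write disk latency the thread is discussing.
  def append_sync(device, entry) do
    IO.binwrite(device, entry <> "\n")
    :ok = :file.sync(device)
  end
end

path = "/tmp/wal_sketch.log"
File.rm(path)
dev = WalSketch.open(path)
Enum.each(1..3, fn i -> WalSketch.append_sync(dev, "entry-#{i}") end)
File.close(dev)
IO.inspect(File.read!(path))  # "entry-1\nentry-2\nentry-3\n"
```

With one fsync per entry, a disk with ~10 ms sync latency caps you at roughly 100 acknowledged updates per second regardless of how small the entries are, which is why batching or a dedicated log device helps so much.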

1 Like

There is no need for it to be fast. It is generally used to store config data and cluster membership, and to conduct leader elections; it’s not used to write the actual data.

1 Like

Kafka doesn’t use ZK to store the data either. It “only” uses it for:

  • List of brokers
  • List of topics
  • List of replicas of a topic
  • Consumer offset

However, that’s potentially some rapidly changing data. Especially the consumer offset.

Now imagine you have a library that is leveraging Zookeeper to track your Distributed Erlang cluster. It might be tracking:

  • List of nodes
  • List of processes on a node
  • State offset of each process (maybe you’re doing CQRS)
  • Replica process tracking

I can see every sufficiently advanced clustering use case driving a lot of small transactions in order to coordinate distributed state, even if the state itself is held somewhere else. Then if you also have strict ordering guarantees, I think you quickly end up with a lot of little serial disk writes.
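As a rough back-of-the-envelope for the claim above, here is a hypothetical helper (not from any library) estimating how many small coordination writes per second a coordinator sees if each tracked entity commits its state offset at a given interval:

```elixir
# Hypothetical estimate of coordination write load: how many small
# serial log writes per second a coordinator absorbs if `entities`
# tracked things (processes, partitions, ...) each commit their
# offset every `interval_ms` milliseconds.
defmodule CoordLoad do
  def writes_per_second(entities, interval_ms) do
    entities * (1000 / interval_ms)
  end
end

# 10_000 tracked processes, each committing once per second:
IO.inspect(CoordLoad.writes_per_second(10_000, 1000))  # 1.0e4
```

Even with modest per-entity commit rates, the totals land squarely in “lots of little serial disk writes” territory once the entity count grows.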

Fast disks are the most critical factor for etcd deployment performance and stability.

etcd is different, but the same with respect to wanting a fast disk.

This can have some volume, but that was true for older versions of Kafka; as far as I remember, newer versions no longer store the consumer offset in ZK.

That’s good, I guess. But not super relevant. Of course, if you don’t actually drive ZK/etcd very hard, its performance doesn’t have as great an impact. But any Etcd/Zookeeper alternative (or Elixir implementation) would have to plan for heavy usage. And when that happens you end up with hard problems to solve. For both of these products, fast disk seems to be a requirement.

You’re also still back to the same quandary. These products need something special beyond what “normal” nodes are required to have. If you melt their functionality into your app/node, then your app/node also needs to assume this specialized characteristic, in addition to whatever product level requirements your distributed nodes have. That was all I was trying to point out in my original response. You can pass the buck around, but it still needs to be paid by something.

2 Likes

True, but I am not aware of any product that stores anything data-intensive in either etcd or ZK; it’s generally used for config and metadata. Even Kafka itself has pretty low requirements, since it’s basically writing to the log sequentially and batching writes.

It’s not about being data intensive, it’s about being coordination intensive. The more state you have to coordinate, the more coordination work you have to do. If your use case requires the serially deterministic coordination (PC/EC) that these products say they support, then your app is at the mercy of those products before, after, and maybe during the heavy lifting that it’s doing.

It’s similar to what happens to the Erlang scheduler when you sprinkle in NIFs. If those NIFs are super fast, you’re ok. If they aren’t, your whole system can become destabilized.

And this coordination load can be proportional to, or independent of, application write volume.

Ask Kafka if it would rather process 10x 1MB messages a second or 5 million 1 byte messages a second. I’m pretty sure the answer isn’t “it doesn’t matter”.
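The arithmetic behind that intuition: the two workloads move a comparable number of bytes, but differ by five orders of magnitude in message rate, and fixed per-message costs (headers, offset updates, index entries) are paid per message, not per byte.

```elixir
# Same ballpark of bytes per second, wildly different message rates:
big   = 10 * 1_000_000   # 10 msgs/s x 1 MB each -> bytes/s
small = 5_000_000 * 1    # 5M msgs/s x 1 B each  -> bytes/s
IO.inspect({big, small})        # {10000000, 5000000}

# The per-message fixed work scales with the message-rate ratio:
IO.inspect(div(5_000_000, 10))  # 500000x more per-message overhead
```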

No argument there, although I highly doubt anyone is in a hurry to put a burden on their team and try to implement Raft or Paxos themselves vs. using ZK or etcd.

If you want RAFT internal to your cluster, this looks pretty good:

3 Likes

When you want to access Kafka, you go to ZK and it points you directly to a node.
This way you can add nodes for scaling, and the client does not need to know about it …
Solr works the same way: http://lucene.apache.org/solr/

But ZK can also be problematic:
http://www.se-radio.net/2015/06/episode-229-flavio-junqueira-on-distributed-coordination-with-apache-zookeeper/

Yep, looks interesting, thank you.

I think others have also pointed out:

Swarm uses libcluster under the covers (it seems nice for stand-alone node discovery). You might also roll in rafted_value on top of one or both of these to build your solution.

You’ll just have to ask yourself what kind of solution you’re trying to bake into your app cluster. Is it distributed queueing with individual consumer offset (like Kafka)? Is it stateful actors? Is it stateless workers?

For distributed queueing with individual consumer offsets, Kafka is pretty popular. For stateless workers registered across a cluster, Swarm seems like it solves that.

It could be interesting to marry stateful persistent actors and an intelligent global registry. This appears to be missing in the Elixir world, although there may be some distributed Ecto project that I’m unaware of. The characteristics would be:

  • Actors/processes have state
  • They delegate to backends for save/restore functionality
  • They have a formal API for interaction (GenServer vs Agent – although a persistent agent wrapper would be ok for some projects)
  • They are sharded across your cluster
  • They are communicated to in a location transparent way
  • Node failures are detected immediately
  • Network splits are detected immediately and fall back to a well defined “mode”
  • Node rebalancing and recovery is fast and deterministic
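The first three bullets can be sketched in a few dozen lines. This is a hypothetical illustration, not an existing library; all module names are invented. The actor delegates persistence to a pluggable backend behaviour and restores its state on start:

```elixir
# Hypothetical sketch: a persistent actor that delegates save/restore
# to a pluggable backend. All names are invented for illustration.
defmodule Backend do
  @callback save(id :: term, state :: term) :: :ok
  @callback restore(id :: term) :: {:ok, term} | :not_found
end

defmodule MemoryBackend do
  @behaviour Backend
  def start_link, do: Agent.start_link(fn -> %{} end, name: __MODULE__)
  def save(id, state), do: Agent.update(__MODULE__, &Map.put(&1, id, state))

  def restore(id) do
    case Agent.get(__MODULE__, &Map.fetch(&1, id)) do
      {:ok, state} -> {:ok, state}
      :error -> :not_found
    end
  end
end

defmodule PersistentActor do
  use GenServer

  def start_link(id, backend),
    do: GenServer.start_link(__MODULE__, {id, backend})

  def init({id, backend}) do
    # Restore prior state from the backend, or start fresh.
    state =
      case backend.restore(id) do
        {:ok, saved} -> saved
        :not_found -> 0
      end

    {:ok, {id, backend, state}}
  end

  def bump(pid), do: GenServer.call(pid, :bump)
  def get(pid), do: GenServer.call(pid, :get)

  def handle_call(:bump, _from, {id, backend, state}) do
    new = state + 1
    # Persist before acknowledging the caller.
    :ok = backend.save(id, new)
    {:reply, new, {id, backend, new}}
  end

  def handle_call(:get, _from, {_, _, state} = s), do: {:reply, state, s}
end

{:ok, _} = MemoryBackend.start_link()
{:ok, a} = PersistentActor.start_link(:counter, MemoryBackend)
PersistentActor.bump(a)
PersistentActor.bump(a)
GenServer.stop(a)

# A "restarted" actor with the same id restores its state:
{:ok, b} = PersistentActor.start_link(:counter, MemoryBackend)
IO.inspect(PersistentActor.get(b))  # 2
```

The hard parts in the remaining bullets (sharding, location transparency, split detection, rebalancing) are exactly the coordination problems discussed above; none of them appear in this single-node sketch.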

This ought to be a pretty idiomatic strategy for “stateful elixir”, but it’s not easy to get right. It depends on a single authoritative writer and those aren’t easy to deterministically manage in a distributed system. Process crashes, node crashes, and netsplits need to be considered. If you want to increase read throughput you can add replicas, but this further complicates things. If you want to get really fancy you could configure dynamic support for ETS or FastGlobal reads.
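The single-writer/fast-reader split mentioned above can be sketched with stdlib pieces only (invented names, ETS only, no distribution or replication): all writes are serialized through one owning GenServer, while any process reads the ETS table directly.

```elixir
# Hypothetical single-writer sketch: one GenServer owns an ETS table
# and serializes all writes; readers hit the table concurrently.
defmodule SingleWriter do
  use GenServer

  def start_link, do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  def init(nil) do
    # :protected lets only the owner write, while any process reads;
    # read_concurrency tunes the table for many concurrent readers.
    table =
      :ets.new(:sw_cache, [
        :named_table, :set, :protected, read_concurrency: true
      ])

    {:ok, table}
  end

  # Writes are serialized through the owning process.
  def put(key, value), do: GenServer.call(__MODULE__, {:put, key, value})

  # Reads bypass the GenServer entirely.
  def get(key) do
    case :ets.lookup(:sw_cache, key) do
      [{^key, value}] -> {:ok, value}
      [] -> :not_found
    end
  end

  def handle_call({:put, key, value}, _from, table) do
    :ets.insert(table, {key, value})
    {:reply, :ok, table}
  end
end

{:ok, _} = SingleWriter.start_link()
:ok = SingleWriter.put(:leader, :node_a)
IO.inspect(SingleWriter.get(:leader))  # {:ok, :node_a}
```

Everything distributed-systems-hard (who is the authoritative writer, what happens on netsplit) is deliberately left out; this only shows the local read-path optimization.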

A lot of the complexity is in the distribution and coordination methodology. Let’s say you need to track 10 million persistent actors. That’s too big for gossip. How about Raft? I don’t know. What happens when the cluster detects a node has gone down? Do you have the ability to hand off work to another node immediately and deterministically? If the original node comes back, does it have the ability to take back over? Do nodes benefit from local state caching, or do nodes that fail need to regenerate state from a remote master, possibly flooding that system with millions of requests as they come back up?

This would be a PC/EC library that seeks to keep A as high as possible. You could probably tune this to be PC/EL if you wanted to support additional replicas and could track exactly what the latest state version of each actor was.

That would be a project I’d be interested in playing with.

2 Likes

I think some of the goals are shared by Erleans - it strives to implement a similar ecosystem of “virtual actors” to that available in .NET through Microsoft’s Orleans framework. It also uses Lasp underneath, and CRDTs to guarantee eventual consistency. It’s definitely an interesting project to look out for.

3 Likes

As has already been pointed out by many here, there are many, many great libraries / frameworks that provide the underpinnings for these things for Elixir already: riak_core, raft_fleet, libcluster, Mnesia, etc. etc.

But I think what is being missed a bit is the “batteries included” part of this. Yes, you can build pretty much any distributed system you need from those lego blocks but etcd/zookeeper are very specific applications: distributed configuration data.

It would not be hard to build something that is a config store with distribution, pub/sub on changes, initial load on application start, etc. but AFAIK that specific application does not exist. The lego pieces are still in the bucket waiting to be put together, and this pattern results in many higher-level applications (like distributed config management) left as an exercise to the individual developer rather than being pulled together into reusable projects.

Hopefully over time more of these sorts of projects will hit hex.pm, so Elixir can have more “finished pieces” to use alongside the underlying toolboxes.

1 Like

The answer is “it depends”. Kafka does a lot of batching :wink:

Just for fun, I wrote a Raft implementation as well (https://github.com/cdegroot/palapa/tree/master/apps/erix); it’s not too hard, and I religiously TDD’d it by basically converting the specifications from the paper into tests. I might give a fun demo about it at some meetup or conference this year or next…

But would I use it in any production-like case? Nope. Ditto for most of the other Elixir/Erlang based stuff out there, including riak_core. Look at the stream of work that goes into Zookeeper, Consul and etcd; getting this right is hard, and I think that if you want industrial strength, you have to leverage one of these large projects. Using your own or having something maintained by the tiny Elixir/Erlang community is most likely going to end up in tears.

Yes it would. Do a round through the issue lists of the aforementioned “big three” products.

The counterpoint to the size of the community is the size and mission-criticality of the products some of these components are used in, and the testing they get.

You mentioned Raft; we’re using an Elixir Raft implementation here at work that is written, used and maintained by a reasonably large Japanese company. They commit regularly, and they use the libraries themselves. And that’s a very small example compared to e.g. Riak libraries being used for core functionality at companies like Bet365.

On the flip side, the Javascript community is absolutely immense, and the quality of projects that come from there is often scary-low, with lifespans way shorter than one would hope for.

Size of community alone is really not enough to make decisions based on. It can help, but it needs to be put into the context of the technology and commitments to it.

I’m very familiar with them, but thanks! :slight_smile: The original poster mentioned etcd and zookeeper, and having something that provided that specific functionality out-of-the-box would really not be a huge effort within the existing Elixir ecosystem as a lot of the really hard work (distribution, global naming, CRDTs and other EC helpers …) has already been done either in the core libraries or widely used add-on libraries that have been proven stable over time, under load, and at scale.

So why are these things such monumental efforts for other communities? Because they don’t start with the tools Elixir benefits from when it comes to distributed computing. They have to build a lot more of the primitives from the ground up which we get to just pick off the ol’ hex.pm tree, or which are baked into the language design itself.

2 Likes

Well, the tiny Erlang community already maintains pretty much the most advanced commercially viable VM for distributed systems. CoreOS (or whatever the current name is) was a tiny company until a few days ago, when it was acquired by Red Hat; etcd obviously benefited greatly from being part of the k8s effort, though. This definitely would take a concentrated effort from larger players that have use cases for something like this, e.g. the likes of Toyota, bet365, Erlang Solutions, Ericsson, Pinterest, etc., or alternatively from a startup that was, for example, building an event-processing PaaS.

4 Likes

I’d grant you that for ZK ;-). etcd and Nomad are written in Golang, and while our respective communities may disagree about a lot, I think we can agree on the fact that Golang was created with distribution in mind.

My experience with these sorts of systems are that they’re 20% use case, 80% edge case. Apache ZooKeeper: What It Is & Does & Our Discovery of Poison Packets is a good example.

You fix an edge case, and another edge case appears on someone else’s system. Frankly, the couple of times I wrote low level networking stuff I had the feeling it was a fractal all the way down and it ended up costing me way more time than I cared for. I guess that if you have limited deployment and stay in the sweet spot, things will be relatively easy; something that works for all of the Erlang/Elixir community, I’m less sure. My take so far has been to just use ZK for the one or two use cases where we need network-wide singletons; pretty much everything else we do uses Kafka and thus relies on its consumer group functionality to divide up work, hence indirectly using ZK as well. One less headache for me. YMMV.

Most Javascript projects deserve a lifespan of zero. Starting with the language itself. As far as I have any hopes for the GUI platform called the web browser, it’s that webassembly takes off and proper languages can be used as first-class citizens :slight_smile:

2 Likes