The state of Riak Core, Lasp and distributed programming in BEAM

I suspect it is for a variety of reasons:

  1. When we are talking about orchestrating nodes, we are also talking about multiple technologies, and many prefer to avoid a VM-centric solution, be it BEAM- or JVM-centric. Since we would end up treating those things as black boxes, the technology choice ends up being less relevant

  2. If you want to run Paxos / Raft, you usually want to have a different set of machines running those services rather than embedding them in each BEAM VM instance you are running. Since we should treat those effectively as a separate service, this brings us to the previous point about black boxes

  3. The BEAM community has usually focused on solutions that lean towards AP rather than CP. Plus, most systems already depend on an external tool for CP, so we tend to just use that for our CP needs

Don't get me wrong, it would be great to have a battle-tested Raft implementation in BEAM (and maybe we do have one and nobody knows, which would be another problem in itself), but I personally prefer specialized solutions that stay close to our domain rather than large black boxes. I find that's what plays best to the BEAM's strengths. You can indeed use :global for resource allocation and planning and, if :global turns out to be a limitation once you reach 30-40 nodes, it is likely that you will write something that focuses on this problem rather than a generic solution.
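For readers unfamiliar with it, here is a minimal sketch (the module and names are hypothetical) of what using :global for that kind of resource allocation can look like: a single coordinator registered under a cluster-wide name that any connected node can reach.

```elixir
defmodule MyApp.Coordinator do
  use GenServer

  def start_link(opts \\ []) do
    # {:global, name} makes the process addressable from every connected node
    GenServer.start_link(__MODULE__, opts, name: {:global, __MODULE__})
  end

  # Returns the pid of the coordinator, wherever in the cluster it runs
  def whereis, do: :global.whereis_name(__MODULE__)

  @impl true
  def init(opts), do: {:ok, opts}
end

# From any node in the cluster:
# GenServer.call({:global, MyApp.Coordinator}, :allocate_resource)
```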

10 Likes

I think this might be a bit of a chicken-and-egg problem: we require outside services besides running the BEAM (or the JVM), which results in us needing other outside services to manage (/orchestrate) those services.

Do you know what the reasoning is for running your consensus software on a separate set of machines from your application? I cannot come up with an intuitive reason to do so, so please enlighten me! :smiley:

There is raft_fleet, but it probably is not battle-tested at this time, although it definitely is a starting point.

It might be interesting to note here that for many systems we 'fake' CP by taking an eventually-consistent AP system and defining a synthetic point in time that we call 'eventual', together with a risk calculation of how often this synthetic definition will not match reality (and how costly that would be). This is exactly what all cryptocurrency systems do: the blockchain-consumer application specifies a 'minimum number of confirmations' after which it expects the probability that what is currently thought to be agreed upon gets reverted/altered to be so low that we can treat it as nonexistent.
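As a toy illustration of that risk calculation (a hedged sketch; the module, the per-block revert probability, and the risk budget are made-up numbers, not a real chain model):

```elixir
defmodule Finality do
  # Crude model: assume each extra confirmation multiplies the chance of a
  # successful revert by a constant factor.
  def revert_probability(confirmations, p_revert_per_block \\ 0.1) do
    :math.pow(p_revert_per_block, confirmations)
  end

  # "Eventual" becomes a concrete threshold: final once the estimated revert
  # probability drops below the risk we are willing to accept.
  def final?(confirmations, risk_budget \\ 1.0e-6) do
    revert_probability(confirmations) < risk_budget
  end
end

Finality.final?(2)  #=> false
Finality.final?(7)  #=> true
```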

In this way we're able to move from an availability-centered system towards the consistency-centered side, with a direct trade-off in time-efficiency (and in general we'll never be fully CP, because 'eventual' might very well mean 'longer than we would like to wait for the results to fully propagate and for us to know that they have').

I actually think that more effort is spent on AP than on CP in general, not only on the BEAM, because AP systems are more scalable: a full CP system requires coordination of a majority of nodes at the same time (which already carries a significant bandwidth overhead), and if that is not possible at a given moment, the system has to wait until a majority can be reached again to keep everything consistent. On top of that, you can usually perform some risk calculation that lets you let go of the full consistency requirement of your app.
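To make the majority requirement concrete, here is a minimal sketch (the module is hypothetical) of the quorum arithmetic a CP system lives by:

```elixir
defmodule Quorum do
  # A majority quorum for a cluster of n nodes
  def size(n) when n > 0, do: div(n, 2) + 1

  # A CP system can only accept writes while a quorum is reachable
  def can_commit?(reachable, cluster_size), do: reachable >= size(cluster_size)
end

Quorum.size(5)            #=> 3
Quorum.can_commit?(2, 5)  #=> false (the system must wait for more nodes)
```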

I also think that it might be very interesting if some kind of hybrid system could be created, to keep everything that can be conflict-resolved in an AP-maintaining system, and only put into CP the things that really have to be there because their conflicts would be impossible to resolve.


Phoenix.PubSub and Phoenix.Presence are definitely wonderful tools that Elixir has without adding extra dependencies. The ease with which users can communicate with channels on different Elixir nodes (and how well the internals are abstracted away from the application programmer) is absolutely amazing.
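For context, the cross-node API surface is tiny; a minimal sketch (assuming a pubsub server named MyApp.PubSub has been started in the supervision tree, and a hypothetical topic name):

```elixir
# On any node in the cluster, the calling process subscribes to a topic:
Phoenix.PubSub.subscribe(MyApp.PubSub, "room:lobby")

# On any other (or the same) node:
Phoenix.PubSub.broadcast(MyApp.PubSub, "room:lobby", {:new_msg, "hello"})

# Every subscriber, regardless of which node it lives on, receives
# {:new_msg, "hello"} in its mailbox.
```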

@michalmuskala How is Firenest coming along, by the way?

3 Likes

Something that I don't think enough people talk about is that in certain systems eventually consistent solutions never converge. Rather, they are always in a state of inconsistency. That's OK for a lot of systems. But there are a number of problems that can't tolerate this sort of inconsistent view of the world.

I think this is exactly right.

I'm sorry but this is just patently false. Zookeeper is the only distributed store to pass Jepsen on its first try. Kafka and Cassandra aren't exactly toys. I think you may be conflating your feelings on Java as a runtime and language (feelings that I would probably totally agree with) with the reality.

Again, I'm not making a statement about the languages or runtimes or how suitable they are or whatever. I'm simply saying that there's a ton of movement to build solutions in these languages. I don't have a particular love of Go or debugging channel panics (which I've had to do a lot of). I'm just being a realist.

I don't. I'm also not sitting around waiting for anyone else to solve this for me. I work on distributed systems in Elixir as my full time job :slight_smile:.

Bro I'm here contributing solutions myself. I get it. :wink:

This just isn't true. These places absolutely exist. They just exist inside giant, monopolistic companies like Google, Amazon, Twitter, Facebook, LinkedIn, etc. Hell, Microsoft has produced some of the best distributed systems research of the last 10 years. But actually this sorta proves my point. Tools like Kafka, Zookeeper, Cassandra, etcd, consul, vault, storm, spark, etc. are written in "popular" languages not because they were the best technical choice across the board but because there's a massive amount of people working in those languages. Those communities are large enough to have companies that are dedicated to only working on these sorts of problems. We don't have that sort of embarrassment of riches in the Elixir community. At least not yet. That's not to say we can't solve these problems or provide robust solutions that rival anything else. Just that without a massive amount of company backing it's going to be harder.

I completely agree with this but with the added bit that marketing distributed systems solutions is incredibly hard. At least in my experience most app developers don't have the vocabulary to discuss these tradeoffs and design decisions. Generally everyone is (appropriately) distrustful of distributed systems solutions and tends to opt to build it themselves because they feel that way they'll at least understand what they're getting into.

6 Likes

I will still encourage people to actually go out and build things. Waiting around and expecting the core team to solve all problems isn't going to work :wink:. The main goal of the core team is to maintain Elixir itself: nothing more, nothing less. The language is designed to be extensible and it's up to the community to leverage that and build libraries for solving problems they face. Relying on a handful of people does not scale.

12 Likes

I fully agree and I apologize if I implied otherwise.

A fact is a fact. Here's to hoping that some of us will get actual free time and resources and will then give back, something this community deserves so much. I know I would if I found myself richer and with more free time.

3 Likes

This isn't realistic. You may not like the solutions that are available, but they are robust, proven, documented and supported by MANY commercial entities. While popularity isn't the best indicator of quality, it does indicate a level of capability that is still a question mark for other technologies. I would guess at least one of Cassandra, HBase, Hadoop, Spark or Kafka is present in the vast majority of large parallel data processing implementations you will find today. You would not be making a mistake using any of those technologies if they fit your use case, even if your app dev framework is on the BEAM.

1 Like

You are right of course but I believe something got lost in translation because I was talking about the features of the languages themselves (and their stdlib).

Needless to say, good software can be written in any language, it just takes more time and effort in some of them.

2 Likes

The only thing that would still be nice to have as a native BEAM thingy is a well-integrated solution that does not rely on ZK; that would put the BEAM in a very unique position for data pipelines. With FoundationDB open-sourced it might not be a huge feat to come up with such a solution.

1 Like

Maybe Horde will become the answer?
https://github.com/derekkraan/horde

1 Like

So… they do not actually exist? :wink:

I was talking about places where there's no commercial agenda, or at least not a direct one. The people who conceived most of the tech we use to this day weren't given orders like "devise a modular OS!" or "make a stateful layer on top of a stateless protocol!"; they were pretty much just handed lab space and tech so they could work on whatever they felt was worth doing, and they were paid unconditionally without measurement of what they achieved.

These benevolent labs definitely do not exist today. Or if they do and I don't know about it, then I might go sleep at their doors, because I need a huge break from CRUD apps trying to mask themselves as the best invention since the wheel.

Creative freedom almost does not exist today. And our profession definitely is not that of an assembly worker. It's a pretty weird animal; we definitely need a bunch of mechanical skills to be perfected, but we also need some of the freedom that all artists need in order to actually create something useful and valuable.

2 Likes

Now that you guys mentioned riak_core and bad marketing…

I feel the power of the BEAM is something similar to The Lisp Curse.

A lot of people can write their own distributed database in Erlang; it's not an easy thing, but it's just too easy compared to other languages. It's a little bit ironic.

Java, for me, is an extremely weak language, to the point that most people cannot write most of the infrastructure right. That's why people in the Java world have to work really hard to solve the problems they created. For a typical web solution, you'll need Tomcat or Netty, Redis, Kafka, Zookeeper, possibly Cassandra, etc., plus a whole bunch of Spring Cloud stuff, and then manage the runtime and deployment using Docker and Kubernetes.

In the Java world, people really appreciate those Apache components because it's painful to get them right.

Seriously, a lot of the components mentioned are too easy to achieve with Erlang, or are just already there. So people don't have as many nearly-unsolvable runtime issues, and thus the market is not growing fast.

Now that Java has pushed a lot of concerns to the Ops side, the only thing programmers have to solve is their domain model. But seriously, which general-purpose language cannot do that? (It seems Haskell can do a better job here…)

3 Likes

This is a great discussion. I signed up for the forum just to join this discussion :slight_smile:

The claim that Erlang/OTP is great for distributed systems has always concerned me when I see it. It is excellent for fault tolerance and concurrency; distributed is another story.

The distributed story for Erlang/OTP is not one that fits the modern cloud, which is what most people are talking about when they discuss "distributed". There has been research (http://www.dcs.gla.ac.uk/research/sd-erlang/publications.html) on the topic for years and it may be getting integrated into Erlang in the coming years, but the current offering is going to feel out of place running in Kubernetes. Even just epmd is out of place. I plan to focus on making epmdless on k8s seamless soon; the main annoying issue I've found is needing to change what relx gives you for connecting a remote console, since it can't just use the same port as defined in the vm.args of the already running node…

The next issue, which I think was already mentioned, is a lack of exposure. This is due both to solutions being libraries internal to companies that almost no one even realizes use Erlang, and to the fact that the Erlang community has long suffered from not having a package manager, with people simply re-implementing everything.

Especially with distributed systems, I don't think it is a bad thing to build your solution around your specific requirements and not bother making a generic lib that others can use to kind of make fit their requirements (a generic lib is especially dangerous when it encourages not actually considering what your requirements are: do you really need x, y, z?). But it is also a detriment when you end up having to implement a very complex algorithm and will end up building yet another partially broken solution for consensus, leaders, CRDTs, etc.

Anyway… after all that I have forgotten where I was going with this. So I'll stop for now and just mention: if you are looking for Paxos, check out riak_ensemble: http://marianoguerra.org/posts/multi-paxos-with-riak_ensemble-part-1.html And for Raft, there is one that should become production quality: https://github.com/rabbitmq/ra

And I would love to get an Elixir interface on top of Erleans if anyone is interested in helping get that started.

9 Likes

This is true. However, I still like the BEAM VM because what it does, it does excellently. With the rise of tools like Kubernetes, the distribution task seems to be in the hands of other people, and that is fine.

I cannot speak for the Erlang and Elixir teams but to me they look focused on improving the raw performance and the concurrency mechanisms so the BEAM can continue to excel at its strongest points.

As an aside, having a good transparent distribution mechanism looks to be mind-bogglingly hard to me: there is NAT traversal, firewall troubles, finding optimal routes, dealing with latency, dealing with packet loss… and that's only the network part.

1 Like

Just a side thought. "Your computer is already a distributed system. Why isn't your OS?"
https://www.usenix.org/legacy/event/hotos09/tech/full_papers/baumann/baumann_html/

Planetary-scale applications: there is a team (and not the least!) working on that.

We are right in the middle of a sea change from building monolithic applications to building true cloud-first distributed systems. And yet, of course, the thing about sea changes is that you seldom know it's happening until after it has happened.

The "configuration" situation, when viewed in the above light, makes sense. In the early days of VMs, we took our existing applications and tossed them over the fence for someone to add a little bit of INI or XML glue to get them running inside a VM for more flexible management. This approach to configuration stayed with us as we "lifted and shifted" these same VMs into the cloud. And it worked, because we got the boundaries approximately correct.

Expressing the relationships between container-based microservices, serverless functions, and fine-grained hosted services using this same style of configuration has led to significant accidental complexity. Turning an application into a distributed system can't be an afterthought. The cloud, it turns out, pervades your architecture and design. And the way that we know how to best express architecture and design in our programs is using code, written in real programming languages with abstractions, reuse, and great tooling.

Early on, Eric and I interviewed a few dozen customers. What we found was universal disillusionment from developers and DevOps engineers alike. We discovered extreme specialization, and even within the same team, engineers didn't speak the same language. I've been hearing this even more in recent weeks, and I expect a NoYAML movement any day now.

Specialization is a good thing, and we want our best and brightest cloud architects elevated into senior DevOps and SRE roles, but teams must be able to speak the same language when collaborating. Not having a common lingua franca introduces a hard, physical separation between teams rather than divvying up the work based on policy and circumstances. Pulumi aims to give all of us the tools we need to solve this problem too.

Pulumi is a multi-language and multi-cloud development platform. It lets you create all aspects of cloud programs using real languages and real code, from infrastructure on up to the application itself. Just write programs and run them, and Pulumi figures out the rest.

http://joeduffyblog.com/2018/06/18/hello-pulumi/
Support for Elixir could be built: https://github.com/pulumi/pulumi

2 Likes

Excellent point that has been left out so far. Speaking from the operational perspective, there is a lot to improve here in everyday usage.

Regarding the scalability perspective, I have some thoughts:

  • Many people say the limit is 60 nodes, which is more than enough for 99% of the cases. You can likely double this with careful coding.

  • My concern in this area is that, once you need more than 60-100 nodes, it is likely that whatever is built in won't be a good solution for you anyway. Back in Erlang/OTP 19, when they were talking about a Kademlia-based distribution, they were aiming to support 300-500 nodes, which wouldn't be enough for cases like Goldman Sachs, which is reportedly running on 4k+ nodes.

My concern regarding the distribution is that we may start a race that we wouldn't ever be able to win. I would rather focus on the semantic design of those solutions: is separating our nodes into groups meaningful in terms of design and operation? Could something be done to lift the single-TCP-connection restriction we have today? I would also ask about large messages, but IIRC this particular issue has been addressed in the latest releases?

3 Likes

About scaling to more nodes: I believe that was one of the goals that Partisan attempts to achieve. It implements a different way of distributing messages between processes (that live on different nodes), which results in the system as a whole being more scalable.

I have no experience with running Kubernetes or other orchestration tools, but I do want to point out that they are not a requirement for distributed software (but rather, for some situations, a 'nice to have').

What is "distribution"?

That is an important question as across this sprawling thread there are so many different uses of that term.

Does it mean to spread code around a bunch of VMs that are able to address each other? Is it a means of explicitly defining communication paths between instances of an application? Is it about application models? (As in: what makes an "application"; where are the dividing lines … which is not as trivial a question as it may seem at first …) Is it distributing individual computations / computational units across multiple systems? Is it doing that without configuration / elastically? Is it about sharding and replicating data so that it is later retrievable from arbitrary other "locations" in the application? …

These are not the same things, and as such they need different sets of solutions. Kubernetes, riak_core, Raft (p.s. check out CASPaxos if you are into Raft) … they are (potentially) complementary solutions that address very different parts of "distribution".

When people come together to talk about a topic but have different usages of the same word, it is unrealistic to expect useful results (as in: something that can resemble an actionable vision / plan) to emerge.

Probably requires different solutions for different scales, and certainly different aspects of distributed computing. I would worry less about winning the race upwards, and think more about compatibility between the various solutions: the interfaces and protocols that may bind them together when used to a single purpose.

In some cases, yes. In other, absolutely not. For the ā€œyesā€ cases, it is worth developing and making easy to accomplish. It should also be easy to define another application topology (which is what this question is about) without moving to an entirely foreign(-feeling) solution.

The real questions :slight_smile:

Between nodes? Not really. term_to_binary remains a horrible bottleneck. I am running ERTS 20.0 and Elixir 1.6.6 and we continue to have messages that nodes simply cannot even serialize (!) because of the inherent inefficiencies in the external term format. I actually came to the forum this morning to post about something I've been working on that we will be using here at work to work around this specific issue.

These issues still don't address "distribution" … they do, importantly, make it more possible to build those solutions.

IMHO, the minimum working start point is a system that provides:

  • A coherent, built-in, and common-tasks-easy/uncommon-needs-possible deployment solution with distribution in mind; something that can work next to Docker containers, but which is not one (because we have something much better already with releases; Docker is for applications that were built for the OS-centric model)

  • A distributed configuration system that is not based on file system artifacts extant on the nodes of a cluster

  • ā€œZeroā€-configuration cluster building that works with gossip, kubernetes, AWS, ā€¦ etc. There are a few Elixir/Erlang implementations of such things. But in the end, the cluster configuration can not realistically be shipped with the application even as configuration

  • Service discovery within and across clusters

  • An implementation of a performant, reliable, and easy-to-use consensus system (I recommend CASPaxos, for which there is a start of an Elixir implementation; the author seems to not have the time to continue it, as my 4 pull requests have sat for 2 months now. We exchanged emails and he just seems exceedingly busy …)

  • An efficient message passing system, which probably means one approach for small messages and another (side-channel?) for more complex/large structured data

  • A definition of, not necessarily an implementation of, how application units [can|should] be arranged. "Microservices", "serverless" (winner of "most ridiculous name in computing") and "cloud computing" are so poorly defined in the first place and do not map cleanly to how the BEAM or applications written on it can be (should be? ought to be? are?) implemented and distributed

Without that, it is really hard to get started on the, admittedly perhaps more interesting, higher-level parts such as data storage and retrieval on distributed systems …
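On the "zero"-configuration cluster building point above, here is a hedged sketch of what one of those existing implementations, a library such as libcluster, looks like in use (the application name, topology names, and selector values are hypothetical): the same release can form a cluster via UDP gossip locally and via the Kubernetes API in production.

```elixir
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    topologies = [
      # Local / on-prem: discover peers with multicast UDP gossip
      local: [strategy: Cluster.Strategy.Gossip],
      # Kubernetes: discover peers by asking the k8s API for matching pods
      k8s: [
        strategy: Cluster.Strategy.Kubernetes,
        config: [
          kubernetes_selector: "app=myapp",
          kubernetes_node_basename: "myapp"
        ]
      ]
    ]

    children = [
      {Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]}
      # ... the rest of the supervision tree
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```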

10 Likes

I should clarify this, as it is a bit more complex than what the above suggests.

The time to serialize is a lot better in 21.0, but the resulting message sizes are still very large. On machines with constrained memory (still talking about a couple of GBs, though) it is pretty easy to end up with non-serializable data structures. It takes ~500MB to serialize a map with 100,000 entries where the keys are tuples and the values are lists of tuples (so, about 10,000,000 tuples all-in), and at the end it generates a 500MB+ message binary. At least now it does that (more) quickly, and with less memory usage in the meantime! :slight_smile:

Encoding that same structure to a binary with an even moderately more efficient approach results in a message size of just 1.6MB for the same data.

One version obviously puts less strain on the network and is quicker to transmit between nodes. Not to mention that we can only afford so many 500MB processes at once, even on our stock nodes with 32GB RAM each (which is admittedly not a huge amount, but also not tiny). Even passing that data structure between two locally running nodes on a decently powerful workstation takes 1-3s (depending on the phase of the moon, etc.) on ERTS 20.0 / Elixir 1.6.6.
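For anyone who wants to get a feel for these numbers on their own data, a scaled-down, hedged sketch (the data shape is a made-up stand-in, roughly a tenth of the size described above) of measuring the external term format, with and without the built-in compression option:

```elixir
# Roughly 10_000 entries x 100 tuples = ~1,000,000 tuples
data =
  for i <- 1..10_000, into: %{} do
    {{:key, i}, Enum.map(1..100, fn j -> {i, j, :value} end)}
  end

plain = :erlang.term_to_binary(data)
compressed = :erlang.term_to_binary(data, [compressed: 6])

IO.puts("plain:      #{byte_size(plain)} bytes")
IO.puts("compressed: #{byte_size(compressed)} bytes")

# Compression shrinks the bytes on the wire, but the full binary still has
# to be built in memory, which is why a domain-specific encoding (as
# described above) can still win by a large margin.
```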

And for cases of even larger data structures (though still well within the realm of reason, one would think), it can exceed the max message size limit enforced between nodes. I don't know if that limit has been lifted in 21.0 (I haven't checked; it could still be there, or it could be gone), but it was a real issue for us up until now, as that is (was?) a hard limit that there was no way around, and it completely breaks the promise of message passing transparency.

So while there have been nice strides on this point in 21.0, there is still room for improvement, and it remains a barrier for distribution involving any non-trivial message passing.

EDIT: I kept writing 20.0 out of habit; I meant 21.0, which we have been testing since it came out. With 20.0 it was all even worse.

2 Likes

I thought there was work on serializing large messages in multiple parts? But I can't recall exactly. I will try to find more information.