Hi everyone! Just a general question about the state of distributed programming on the BEAM.
One of the strongest features of the BEAM is its concurrency and we all know that (and after this post I think the situation is highlighted clearly). One thing though that is sometimes mentioned but not always discussed is distributed computing.
The thing is: most of the state-based libraries we have published on Hex are not meant to be used on a distributed system. This responsibility is left to whoever depends on them. Take “cache” libraries for instance: most of them are not truly distributed in the sense that their state would be lost on a node down event.
Currently it appears we have 2 options to deal with that: Riak Core and Lasp lang.
Riak Core is living through different forks with the community making patches to support recent Erlang versions but I guess we are all waiting for the Bet365 position on the matter. Will they put someone in charge of accepting patches, reviewing code and things like that?
Lasp is a bigger departure from traditional Erlang. The project has a big motto (“Planetary Scale Applications”) but the documentation, samples and whatnot are really difficult to grasp (at least for me).
This is something that, in my opinion, should be included in the ecosystem. If Elixir maintainers could add some “flag” to mix new to make it more distributed friendly it would be awesome (like an easier riak-core). Maybe for Elixir 2.0?
Am I a lunatic? How would you people go about building a distributed app? What is the state of distributed programming in BEAM?
I agree here. I also found it odd that a language which claims to make distributed programming easier doesn’t have a good, solid distribution story.
I always started developing something and then I wanted to have high-availability and found that erlang doesn’t actually give you any ready solutions.
They do have distributed applications but they rely on global which is bad in terms of net-splits.
The problem is distributed computing is really hard and requires various trade-offs which often are project dependent. This makes it hard to come up with a generic solution.
riak_core is but one way to do distributed computing and it uses a set of “distributed primitives” to achieve this. Lasp is probably a much more comprehensive solution.
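To make “distributed primitives” a bit more concrete: riak_core is built around a consistent hash ring split into virtual nodes. Below is a minimal, language-neutral sketch of that idea in Python. It is not riak_core's implementation; the `HashRing` class, the `vnodes_per_node` parameter and the token scheme are assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent hash ring; each physical node owns several virtual nodes."""

    def __init__(self, nodes, vnodes_per_node=64):
        if not nodes:
            raise ValueError("a ring needs at least one node")
        # Each (token, node) pair is one virtual node placed on the ring.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _hash(key):
        # Use the first 8 bytes of SHA-256 as a position on the ring.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def owner(self, key):
        # The first virtual node clockwise from the key's position owns the key.
        idx = bisect.bisect_right(self._tokens, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"key{i}" for i in range(1000)]
ring3 = HashRing(["a", "b", "c"])
ring4 = HashRing(["a", "b", "c", "d"])

# Keys whose owner changed after node "d" joined the ring.
moved = [k for k in keys if ring3.owner(k) != ring4.owner(k)]
print(len(moved), "of", len(keys), "keys moved")
```

The property worth noticing is that when a node joins, the only keys that change owner are the ones the new node takes over; everything else stays where it was.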
I’m not qualified to answer
How to approach distributed computing depends on your requirements (which to some degree ties into the CAP theorem, https://en.wikipedia.org/wiki/CAP_theorem). For example: must all the nodes in your distributed environment be available for reading and writing? What is the latency between nodes? Are nodes up all the time or do they come and go? Must your state be strictly consistent? Can you afford to lose data? Must data be linearizable?
The distributed state of erlang is not ideal but this is likely the same as for lots of other languages. It is just that it is so much easier to “get” to the stage where you need distribution in erlang.
The distribution layer of erlang can be improved (and I think the OTP team is working on this). Things like using the same channel for data and control plane are not considered best practice. A full mesh node system limits the number of nodes you can have, and I’m sure I’ve heard a few more complaints about it which I can’t recall now.
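To put a rough number on the full-mesh concern: if every node keeps a connection to every other node, the number of links grows as n(n-1)/2. A quick back-of-the-envelope sketch (Python used purely as a calculator; the function name is made up for illustration):

```python
def full_mesh_connections(nodes):
    """Number of pairwise links when every node connects to every other node."""
    return nodes * (nodes - 1) // 2


for n in (5, 40, 100, 1000):
    print(n, "nodes ->", full_mesh_connections(n), "connections")
```

Going from 40 to 1000 nodes takes the link count from 780 to 499500, which is why the mesh topology, rather than the BEAM itself, tends to become the limit.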
It would be nice with more well tested and documented “distributed primitives libraries”. For example, various gossip and consensus protocols and replication strategies.
In the end, erlang is still very much suited for distributed programming. It is just that you, as for most other languages, need to implement a number of parts yourself.
I would say that Erlang has a good solid distribution story but that is not built-in. Riak Core and Lasp are just some examples.
The idea of distributed primitives seems like a nice compromise. CRDTs, vNodes in a hash ring, Active Anti-Entropy in the cluster, etc. are good examples, IMHO. That shouldn’t be in the standard library, but in an extra library (much like plug, ecto, etc.).
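For anyone who has not met CRDTs yet, the simplest one is probably the grow-only counter (G-Counter). Here is a minimal sketch in Python, only to show the shape of such a primitive; the `GCounter` class and the node names are hypothetical and not taken from Lasp or any other library mentioned here.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter: one slot per node, merged by element-wise max."""

    counts: dict = field(default_factory=dict)

    def increment(self, node, amount=1):
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[node] = self.counts.get(node, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Taking the max per node makes merge commutative, associative and idempotent.
        nodes = self.counts.keys() | other.counts.keys()
        return GCounter(
            {n: max(self.counts.get(n, 0), other.counts.get(n, 0)) for n in nodes}
        )


# Two replicas accept writes independently, e.g. during a net-split.
a = GCounter()
b = GCounter()
a.increment("node_a")
a.increment("node_a")
b.increment("node_b", 3)

merged_ab = a.merge(b)
merged_ba = b.merge(a)
print(merged_ab.value())
```

Because the merge order does not matter and merging twice changes nothing, replicas can exchange state in any order and as often as they like and still end up with the same value.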
Totally agree that is application dependent, but if there were a set of common primitives, we could build libraries aiming more at those than single node applications.
This work was supposed to land on OTP 20 (Zandra gave a talk about this in 2016 but I can’t find where to track this work … any help?)
That is my gut feeling but currently, even to set up a local development environment with a cluster and test how it behaves when you kill nodes is not a straightforward task (of course we have docker with compose, swarm, kubernetes, etc., but this is not a much talked about topic…). Perhaps this could also be in the distributed primitives.
Lasp was about that if I’m not mistaken: a set of tools for making planetary scale applications. Though I’m not sure the state of it as of today. The documentation is really lacking and I couldn’t find any sample, tutorial or whatever. (There is unir but it is still poor on documentation).
I remember chris and jose talking about a service discovery layer built on CRDTs (Presence) some time back - hopefully we will hear more about that idea one day… sounds like an awesome and intuitive way of building out a distributed system…
Mesosphere has also done some significant work in this area with Erlang. It’s just part of their overall DC/OS offerings.
IMHO, the BEAM makes a lot of sense to maximize available hardware on a single machine. The story becomes less good as you transition to a higher latency/availability boundary. After 25+ years working in various distributed computing systems, I firmly believe that you cannot hide this performance transition from the programmer and have a reliable system.
The next layer needs to be built with the appropriate assumptions. As Mesosphere has demonstrated, BEAM languages are well suited to build the components these larger meshes require. However, I think it’s a mistake to attempt to bind that mesh to a specific language/runtime.
Implement Protocols, not APIs.
- Given a 256-core server, running the BEAM will be much more efficient than running 256 docker containers. But I’m not convinced the benefits of the BEAM extend beyond the borders of a server.
Am I the only person who feels disappointed overall in the maturity of this space? From the way some people describe Erlang it seems like these kinds of solutions are the entire raison d’être, yet the actual state of it is fragmented and largely of dubious production value. If I want to build a large distributed application today in Elixir and the scale outstrips the capabilities of :global (maybe 30-40 nodes) I feel like I’m on my own and honestly would probably base it on Zookeeper or Consul and grpc rather than using a native beam solution.
I think this is a chicken and egg problem. Lots of people are working on solutions to these problems. But at the end of the day many people don’t feel comfortable using said solutions and also aren’t willing or able to get involved and support those solutions. Of everything that’s been mentioned in this thread riak_core is probably the one that’s best supported and battle tested in the industry and even it has problems with adoption in elixir because it had bad marketing for a few years. My impression is that very few companies in the elixir community are interested in building infrastructure tooling and are more interested in building apps. Most companies using elixir don’t need their engineers to build distributed systems tooling and are content (and probably better served) to use off the shelf and battle tested infrastructure like zookeeper, etcd, postgres, etc.
At the risk of sounding defeatist, I think the community is just too small, focused on other problem spaces, and at times working at cross purposes to have the number of robust solutions that you’ll find in Go or Java. My suspicion is that if we want to improve the state of tooling and adoption then we need more people with corporate backing solving those sorts of problems and contributing solutions back to the community.
IMO that’s true everywhere. Not going Elixir fanboy mode here but everywhere I was, be it a wage programmer, or consultant, or hired-for-a-project role, and no matter the language and stack, everybody was focused on delivering the next sprint or, in the best case scenario, mid-version release (say, going from 1.1.5 to 1.2.0). Strategic investments in tech blessed by the business people are as rare as flying unicorns.
There are no places like the original labs where UNIX, Plan-9 and a lot of other tech we use to this day were conceived. Too much short-term capitalism for those to ever see the light of day again.
I respectfully both agree and disagree.
Java doesn’t have robust solutions per se. And where it has good solutions they are very rarely out of the box and painless. The ecosystem has 500 metric tons of crap and is riding the horse named “look how many companies are using us, we have to be doing things right, we are sure!” – while in truth the slogan should be “everybody is too damn scared to move away from us because nobody has an idea how their systems work anymore”. Sorry for being a cynic but in 8 years in the Java space (2003 - 2011) I’ve never, ever, seen one Java team that legitimately knew what they were doing. And I’ve never seen a team actually being on top of all their business logic and code. Maybe only a bunch of very bright young people in SAP Bulgaria where I contracted once. And that was because the project wasn’t big.
Java has a lot of useful libraries for hundreds of protocols and integrations with many 3rd party APIs, I am not denying that – and that’s its main selling point in the last 7-8 years and is what maintains its momentum. But in terms of distributed computing Java is, to this day, worse off than Erlang/OTP and is almost on par with Go. Their only chance is their new functional constructs and producing a zero-fuss deployment of those to something like the cloud serverless functions or short-term specialized containers. But they can’t do that yet because the JVM takes a long time to start (there is a lightweight distribution though, so they might get there). Ad infinitum.
Go is not much better but I still like its community much more than Java’s. Go programmers seem to be a very particular bunch and they don’t mind getting their hands dirty. And most of them are very well aware of the pros and cons of the language and the runtime and they mostly stick to highly specialized libraries, CLI tools or microservices. Go’s community is a really excellent example of programmers who are realists. That being said, goroutines and channels are very far away from the robustness of the OTP. Deadlocks usually don’t happen but a subclass of the multicore problems still applies. This is well documented over the net; try googling for “goroutines deadlock” or “go panics channels” or something along those lines, you will find some quite disheartening articles… Gotta say, these discoveries shook my faith in the language’s multicore story.
I will agree that this community is small. But you must not underestimate the core team. Many things that have to be reinvented in every Erlang project are being gradually added to Elixir (Task.async_stream, GenStage and Flow being pretty good examples). Lately there was a huge thread with tons of discussion about how to solve the compile-time / runtime / app-startup-time configuration and a lot of progress has been made in terms of the core team gaining better understanding of the shape and the area of the problem.
The vibe I am getting from Jose and the others is that they are well-aware of the current pain points but are gradually creating the building blocks of the solutions. The Elixir core team is very judicious before adding stuff and many of us are applauding them for that. When they do it, it will be done right.
That being said, nobody forces you to wait. By all means, use ZooKeeper or a particular Kubernetes setup in the meantime; we all have paid jobs to do after all and we cannot wait indefinitely.
Last but not least, distributed computing is one of the most stubborn and complex problems in this computing age. It would be unfair to judge Elixir by its inability to provide transparently and out of the box functionality that 99% of all others cannot provide even with a bunch of tooling attached.
The main missing link currently, in my opinion, is to be able to quickly get these systems running in your own application, and understanding how they work/can be used (documentation).
riak_core is very stable but really difficult to start using because documentation is virtually non-existent.
Lasp seems to work very well and there is now some documentation at lasp-lang.org but I found the installation process to be very difficult and was not yet able to include it in my own app. Also, the only way to currently find out what kind of CRDTs are supported is to dive in the source.
CouchDB is more mature, well-documented and works well (it is used in huge scale applications). But although it is built on Erlang, it will not run in the same BEAM as your own app AFAIK (This has since been confirmed).
EDIT: Added links to resources now, and improved formatting. Much easier to do this on the PC than on the phone ^.^
It should be mentioned by the way that all of Partisan, LASP and Erleans are written and maintained by the same person, Ulf Wilger. Let’s not underestimate the amount of work this is, and what he already has done. Huge kudos to him!
EDIT: What I wrote above is false and confusing; LASP and Partisan are created and maintained by a group of people, first and foremost Christopher Meiklejohn. Erleans is maintained by Tristan Sloughter, and uses lasp_pg under the hood but is otherwise separate. There is definitely communication between these projects but they are separate.
Ulf Wiger (whose name I also spelled wrong…) was not involved with these packages at all, but is also a great person and author of the widely-used process registry package gproc, as well as some others such as unsplit that does mnesia conflict resolution.
The important point I wanted to make though, is that conceiving and creating these systems is a tremendous effort, and I would like to thank you for the enormous amount of work that has been put in thus far.
Are you sure you got the name right? Ulf has written a number of great libraries but as far as I know he has not been involved in any of the 3 mentioned (according to github contributors for the various projects)
Oh my… Indeed. I am completely misremembering. My apologies.
The (primary but not only) author of Partisan and LASP is Christopher Meiklejohn. Erleans is mainly maintained by Tristan Sloughter. Erleans internally uses lasp_pg for process discovery, so there is some communication between the projects, but they are not directly related.
My apologies for the confusion I created, and my apologies to the people whose name I used in the wrong context.
I would make a completely different assessment of this situation. We are talking about distributed systems; the space will be fragmented because there are too many different trade-offs you can make.
To me the benefit of the BEAM is that solving a problem in this space is much more accessible, which affords me to reach out to specialized solutions. I would rather have that than having to rely on something like ZooKeeper as a crutch on all of my systems. So what you’ll find in BEAM are small, specialized solutions that solve one problem well and are used in production within that scope.
We don’t even have to look far. Take how Phoenix.PubSub and Phoenix.Presence build really accessible and powerful abstractions without bringing in any third party dependencies.
So if you are building a large distributed system, the first question I would ask is why do you think you need :global in the first place. Or why do you think you need something like ZooKeeper?
If you really need ZooKeeper, then by all means, use ZooKeeper. But to me the main feature in BEAM is that I can ask these questions and maybe pick something simpler. It is the same when it comes to databases. I don’t lose sleep over the fact we don’t have a SQL database in Erlang/Elixir, I just use PostgreSQL, but I do rejoice whenever I am working on something and Elixir affords me to skip the database altogether.
How does Elixir help in skipping the database? Elixir is not a persistence layer, it’s a language runtime optimized for network communications (EDIT: strictly speaking, elixir is not a runtime, the BEAM is, of course). Do you implement your own persistence layer? Or do you work on projects which don’t need a persistence layer beyond flat files or something?
An example would be to support something like a cluster-level supervisor. There should be nodes doing various jobs, and exactly one controller that starts new nodes when old ones fail. There may also need to be a specific process doing resource allocation and planning. Waiting in the wings are nodes ready to take up these roles if the primary fails, but they need to smoothly mediate succession. When you boot your cluster, which one gets to be the controller? This is the sort of problem that Paxos / Raft are often used to solve. Most people don’t build systems like this, they just use Kubernetes or Hadoop and call it a day. I just find it surprising that BEAM doesn’t have a single well-established implementation of one of these algorithms.
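To illustrate the “which one gets to be the controller” question, here is a toy sketch of the majority-vote rule that Raft-style elections rest on. It is written in Python as a plain in-process simulation. It leaves out log comparison, timeouts, heartbeats and real networking, so it is not a usable consensus implementation; `Node`, `run_election` and the `reachable` set are names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    current_term: int = 0
    voted_for: Optional[str] = None

    def request_vote(self, candidate, term):
        """Grant at most one vote per term; a newer term resets the vote."""
        if term < self.current_term:
            return False
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False


def run_election(candidate, cluster, reachable):
    """Return True if the candidate wins a strict majority of the whole cluster."""
    term = candidate.current_term + 1
    votes = sum(
        1
        for node in cluster
        if node.name in reachable and node.request_vote(candidate.name, term)
    )
    return votes > len(cluster) // 2


cluster = [Node(f"n{i}") for i in range(1, 6)]
by_name = {node.name: node for node in cluster}

# A net-split leaves n1..n3 on one side and n4..n5 on the other.
majority_side = {"n1", "n2", "n3"}
minority_side = {"n4", "n5"}

n1_wins = run_election(by_name["n1"], cluster, majority_side)
n4_wins = run_election(by_name["n4"], cluster, minority_side)
print(n1_wins, n4_wins)
```

Since a controller needs votes from more than half of the full cluster, only one side of a split can ever elect one, which is the guarantee that :global style registration does not give you during a net-split.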
In conventional apps PG, Redis, Memcached etc. are often used to hold ephemeral data (e.g. form data between pages, presence information, cache), and synchronize it between nodes. On BEAM you can keep this data in memory and restrict database usage to things that actually need to be kept long-term. Chris McCord talked about this “ephemeral vs persistent state” in this keynote: