The state of Riak Core, Lasp and distributed programming in BEAM

This could well be, and it would be useful, but 500MB of data is still 500MB of data, even if chunked. The external term format is rather unfortunate. Hopefully this can be addressed in future updates. Keeping the old format for connections to nodes that use the older serialization would be necessary, but it is all versioned, so this should be possible.

That said, I am very happy to see the serialization time improvements, as they were truly horrendous for these large data structures up until now. So there is at least that! :slight_smile:

edit: In case people are curious / wondering why 500MB of data, even when chunked, is an issue given this is not disk-bound or anything like that … Gbit ethernet gives us 125MB/s. 10Gbit, which costs more but is at least a fixed cost (and still affordable), gives us 1.25GB/s. So with Gbit networking, it takes 4 seconds to transmit that 500MB. On 10Gbit this drops to a nicer 0.4s … buuut … we aren't sending one message at a time between two nodes. With even a small cluster of 10 machines, it is easy to end up in situations where the network is saturated with such messages, and that becomes the bottleneck. (We have hit this exact wall without really pushing too hard …) There is also the overhead of receiving and allocating that much memory for the messages. While all that happens, CPUs sit idle waiting for data to arrive. Which is, of course, a little disappointing knowing that it could be orders of magnitude smaller.
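For anyone who wants to check the arithmetic, here is the back-of-the-envelope version, assuming ideal line rate and ignoring protocol overhead:

```elixir
# Ideal transfer time for a 500 MB payload; real numbers will be worse
# (protocol overhead, contention, allocation on the receiving side).
payload_mb = 500
IO.puts("1 Gbit/s  (~125 MB/s):  #{payload_mb / 125} s")   # => 4.0 s
IO.puts("10 Gbit/s (~1250 MB/s): #{payload_mb / 1250} s")  # => 0.4 s
```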

Compression is another issue: being limited to zlib is a bit silly given the advances of recent years; we're currently using zstd, which is both faster and gives better compression ratios than zlib does. :confused: The plumbing underlying BEAM distribution is aging. It is good to see this getting attention in recent releases, though, so we remain hopeful …
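To make the comparison concrete: the built-in zlib path is easy to poke at from a shell via term_to_binary/2. The zstd line below is a placeholder for whatever NIF binding you use, not a real API:

```elixir
# Size of a term as plain ETF vs zlib-compressed ETF (built into OTP).
term  = for i <- 1..100_000, into: %{}, do: {i, {:value, i * i}}
plain = :erlang.term_to_binary(term)
zlib  = :erlang.term_to_binary(term, compressed: 6)

IO.puts("plain ETF: #{byte_size(plain)} bytes")
IO.puts("zlib ETF:  #{byte_size(zlib)} bytes")
# zstd = SomeZstdBinding.compress(plain)  # hypothetical zstd NIF binding
```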

Right, it is less an issue of how many nodes you can scale to, and more one of how much churn (if any) you can handle in cluster membership.

Currently, the Erlang full mesh, with a single channel and global state synced in full to each node, may work fine for near-zero churn and small control-plane messages (I don't think the large message/tick issue has been solved in recent Erlang?), but it can fall apart in the environments people expect to work in today.

Not that this isn't also a problem for other popular tools like ZooKeeper and Kafka, but I think it stands out more here because the use cases of Erlang and Elixir are more general and broad than theirs.

Though that only speaks to the underlying distribution; simply having a battle-tested library for consensus is still useful, even if in so many cases someone could just use a central database or ZooKeeper instead :slight_smile:

So I think you are right about the scalability perspective, and that work should be concentrated in that area. Operational needs will require work from all over; that is what I hope to focus on, while letting the smart people figure out the hard parts :slight_smile:

Ok, I found my "source". Note it is still a prototype though: https://www.youtube.com/watch?v=0E49Q5fbs5g

The important message here, though, is that folks can address many of those limitations without blocking on the OTP team, as the mechanism is now pluggable.

Neat stuff! I will have to watch it fully at home later. Though I only managed to quickly shuffle through the slides (heading out in a few minutes here…), the IS-IS networking looks potentially very good, and head-of-line blocking / out-of-order messaging -> yes! Not sure this will help with our issues related to serialized message sizes, but as I noted I haven't had a chance to properly watch the video… looking forward to doing so though. Thanks for the link, @josevalim :slight_smile:

Ah, yes, the pluggable distribution! I need to watch this talk.

Once upon a time I thought SCTP might help and took a shot at implementing that dist proto, but SCTP is currently blocked from being used as a distribution protocol: http://erlang.org/pipermail/erlang-questions/2016-January/087414.html

SCTP would have been a much better network transport layer for the BEAM mesh; TCP was a very non-scalable choice. With SCTP you can have distinct 'channels' (streams) between two servers, say with one dedicated to heartbeats, so transferring 60 gigs on one channel wouldn't disrupt heartbeats, for example (or other messages on other channels). In addition, SCTP has a lot more features. SCTP is an actual standard IP transport layer like TCP or UDP, but because Windows is stupid and never implemented it, it never got into popular use anywhere. However, usrSCTP (built on top of UDP) has recently taken over; it implements (most of) the same features, but at the application level, very cheaply.
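For the curious, OTP already ships a gen_sctp module, so the multi-streaming idea can be sketched directly. A minimal sketch, assuming an SCTP-capable OS and a peer listening on 127.0.0.1:9999 (this is not how the dist protocol wires it up, just the stream concept):

```elixir
# Minimal SCTP multi-streaming sketch with OTP's built-in :gen_sctp.
# Stream 0 could carry heartbeats, stream 1 bulk data, so a large
# transfer on one stream does not block the other.
{:ok, socket} = :gen_sctp.open(0, [:binary, {:active, false}])

# connect/4 returns an #sctp_assoc_change{} record, matched here as a tuple.
{:ok, {:sctp_assoc_change, :comm_up, _error, _out_streams, _in_streams, assoc_id}} =
  :gen_sctp.connect(socket, {127, 0, 0, 1}, 9999, [])

:ok = :gen_sctp.send(socket, assoc_id, 0, "heartbeat")                   # stream 0
:ok = :gen_sctp.send(socket, assoc_id, 1, :binary.copy("x", 1_000_000))  # stream 1
```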

After talking to Peer about what they are doing (Jose posted a link above to their talk about pluggable distribution), I gave SCTP another shot, now on OTP 21, and was able to complete the handshake.

Still a lot to figure out… but it's a start :slight_smile:

Edit: Link to gen_sctp_dist: https://github.com/tsloughter/sctp_dist
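For anyone wanting to try it: alternative carriers are selected with the -proto_dist flag, which makes the VM look for a module named <name>_dist on the code path. Something along these lines, with the paths as placeholders (check the repo's README for the exact invocation):

```
erl -proto_dist gen_sctp -pa /path/to/sctp_dist/ebin -sname test
```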

I did, when I worked with the Jini team. They very much knew what they were doing, and guess what they came up with: a toolkit, not a complete solution :slight_smile:

It is, for more and more shops, not "nice to have", but "what you have". And that removes the impetus to solve these sorts of problems, because the tools are there, they're mature, and reinventing that wheel is pretty much on the same level as replacing parts of your OS because you think it should live on the BEAM level. Sure, it'd be neat to have a TCP/IP stack with Raft humming inside my VM, and I'm probably going to spend spare time tinkering in that space at some point, but for my day job we need to rely on battle-hardened and proven stuff even if it is ugly like ZooKeeper.

But could we please have just one production-ready Raft library? :wink:

This is the first time I have heard about Lasp. So you gave me a new thing to learn?! :slight_smile:

Thanks for the share!

Edit: but wait, I heard that OTP alone was the answer to every problem of scalability, fault-tolerance and distribution.

Apples to oranges though. I was speaking about a regular dev team on a schedule to deliver commercial projects.

As for open source and library/framework designers, I am pretty sure most are bright folks no matter the programming language they work with.

I would imagine a lot of teams at Google know what they are doing as well as many teams related to Apache Foundation projects.

Ah, ok. In that case, I've yet to encounter a team working in any language that knows, as a whole, a ton about distributed computing. Nor should they, necessarily :wink:

Serves me right for not providing more context, I guess.

To avoid derailing this thread: what I meant in more detail was that many Java teams I knew and worked with tried to emulate OTP and never succeeded. I was also saying that a popular language does not guarantee the competence of the average programmer who uses it.

Maybe it is just far too hard for a regular dev team to do that, and maybe it should not be necessary for those teams to know a ton about distributed computing. They want to deliver business value, you know, and not compete over who has the biggest muscles.

bump

I stumbled across this thread from ~ 3 years ago and found it really insightful.

I am interested in hearing how things have changed in the past 3 years (especially new libraries that move things in a positive direction).

One thing that has definitely changed is OTP, which now contains pg and does not rely on global.
We've tested it at a scale of 40k (yes, 40,000 nodes connected to each other), connected over TLS (as it's hard to imagine such a huge application locked in a single secure datacenter).
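For those who have not used it yet, pg is pleasantly small. A minimal sketch:

```elixir
# pg (OTP 23+) replaces pg2 and does not depend on the global module.
{:ok, _} = :pg.start_link()                    # default scope (usually in a supervision tree)
:ok = :pg.join(:my_service, self())            # join a process group
members = :pg.get_members(:my_service)         # pids across all connected nodes
local   = :pg.get_local_members(:my_service)   # pids on this node only
```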

I am also entertaining the "lambda" project, implementing service discovery native to OTP, and a probabilistic load balancer (based on random sampling without replacement and credit control; a rough sketch of the sampling idea is below). So I'd be interested in a use case for "distributed programming in BEAM": what does it actually mean?

- Scaling 3-tier architecture (front/compute/storage)?
- Horizontal meshes, data planes, control planes (e.g. Envoy/Istio native to Erlang)?
- Data processing pipelines?
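To illustrate the sampling idea mentioned above: this is my own toy sketch, not the lambda project's actual code, and `load_fn` is a hypothetical per-node load metric you would supply:

```elixir
# "Power of k choices": sample k nodes without replacement,
# then route to the least loaded node of the sample.
defmodule SamplingBalancer do
  def pick(nodes, load_fn, k \\ 2) do
    nodes
    |> Enum.take_random(k)   # random sampling without replacement
    |> Enum.min_by(load_fn)
  end
end

# Usage (hypothetical metric): SamplingBalancer.pick(Node.list(), &MyMetrics.load/1)
```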

Interesting, when you have 40k distributed Erlang/Elixir nodes, how are they connected to each other? A connection between each pair of nodes (seems ridiculous)? Some type of tree? A fixed-degree layout where any two nodes are at most O(log n) hops away?

Just a fully connected mesh, for test purposes, with each pair of nodes connected. Done with out-of-the-box distribution over TLS. The production setup is smaller, but there are still meshes over 10k in size.
This greatly simplified everything else, avoiding whole classes of problems with multi-hop routing, and it enables very convenient logical service discovery (each pair of connected nodes automatically gets services published on each other).
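Worth spelling out what a full mesh means at that scale; quick arithmetic:

```elixir
# Full mesh: each of n nodes holds n - 1 distribution connections;
# the cluster as a whole has n * (n - 1) / 2 links.
n = 40_000
IO.puts("connections per node: #{n - 1}")               # => 39999
IO.puts("links in the mesh:    #{div(n * (n - 1), 2)}") # => 799980000
```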

Interesting. For some silly reason, I had a hard time seeing a single machine handling 40k TCP connections. However, after revisiting the C10k problem on Wikipedia, it appears a single machine can now handle 10,000,000 concurrent connections.