BEAM/OTP for distributed databases/transactions

zeroexcuses · March 19, 2021, 2:43pm

Is there something like OTP that, instead of focusing on distributed processes / servers, focuses on providing building blocks for distributed databases / distributed state / distributed transactions?

dimitarvp · March 19, 2021, 6:31pm

Well, what do you think a GenServer / Agent is? Each and every one of them can hold its own state which you can then persist somewhere else (not only in memory).

The building blocks are already there in the BEAM. It just all depends how you want to model your software. You don’t have to expose a server in the classical sense. You can make a distributed BEAM app that only responds to particular requests (like a DB).

kokolegorille · March 19, 2021, 7:03pm

Maybe Mnesia.

zeroexcuses · March 19, 2021, 10:07pm

Have you ever heard of DynamoDB, Cassandra, Voldemort, Redshift, HTable, Riak, …? I assure you the core technical challenge they solved was not “use GenServer”.

dimitarvp · March 19, 2021, 10:09pm

In that case it sounds like you made a decision then? Because I truly don’t understand your question.

If you are looking for a library that’s the building blocks of said databases, maybe their sources can help?

zeroexcuses · March 19, 2021, 10:44pm

I have no idea what ‘decision’ refers to. I agree that you don’t understand the question.

I know how to use git clone. I am asking because I believe someone in the Elixir community might know of a resource for building distributed databases / transactions that is more useful than just ‘read the source’.

stevensonmt · March 19, 2021, 11:19pm

I’m hesitant to jump in given the tone of this thread, but just to help clarify what you’re asking, is Noria (written in Rust) similar to what you are asking about? Are you asking if a similar tool has been written in Elixir already?

Maybe Sage is what you’re after:
https://hexdocs.pm/sage/readme.html

derek-zhou · March 19, 2021, 11:51pm

FYI, Riak is written in erlang.

zeroexcuses · March 19, 2021, 11:56pm

Based on a very superficial read of github readme, Noria dramatically speeds up MySQL reads through pre-computation / caching – which is a very cool idea, but not what I am looking for.

I still don’t understand what Sage/Sagas does – so I’m not sure.

To clarify a bit, let us consider compilers/parsing for a moment. At one point in time people wrote parsers by hand. Then people invented Regex, CFG, LL, LR, Peg, Packrat, Parsec – and now, it is not uncommon to write a parser for a toy language in 1-2 hours.

If we look at databases that run on multiple machines – these are currently herculean efforts – what I am curious about is whether Erlang/Elixir has libraries / building blocks that drastically simplify these tasks. The reason I believe this might be possible is that Elixir/Erlang already (via OTP) greatly simplifies writing distributed / fault-tolerant servers, so it seems plausible that there would be some similar library that also helps with distributed databases / distributed transactions.

stevensonmt · March 20, 2021, 12:38am

Another Erlang example that I think fits what you’re after:

Actually they depend on Riak, which you’ve already mentioned. Nevermind.

zeroexcuses · March 20, 2021, 1:00am

I appreciate the effort here, but I should probably clarify a bit more.

3 = stuff built on top of these databases

2 = DynamoDB, Cassandra, Voldemort, BigTable, Riak, Spark, RedShift, ...

1 = ?????

0 = Language / Runtime: C, C++, Rust, Go, Java, Spark, Elixir/Erlang/Beam

If we look at this diagram, what I’m really interested in is the stuff in “layer 1”. At layer 0, we have the languages / runtimes. At layer 2, we have these heroic efforts of distributed databases.

What I’m really after are ideas / libraries that exist in “layer 1” – stuff that we build on top of existing languages / runtimes, but are also useful in building these distributed databases.

mpope · March 20, 2021, 1:46am

Your question about level 1 is quite broad.

Does level 1 include storage engines? You could make a storage engine based on existing k/v stores written in Erlang / Elixir, like CubDB and Leveled. There are also tools that have Erlang wrappers, like eleveldb, which mnesia_eleveldb uses as a backend. That being said, Mnesia itself exists as a database. So does DETS.

Does level 1 include serialization? sext, message pack implementations, term_to_binary, BERT, etc all exist.

Does level 1 include distributed ordering and transactions? evc provides vector clocks. There are monotonically increasing values like ksuid implemented in Elixir.
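
To make the vector clock idea concrete, here is a minimal, language-neutral sketch in Python. It is not evc's API; the names `tick`, `merge`, `happened_before`, and `concurrent` are made up for this illustration, and a clock is assumed to be a plain map from node name to counter.

```python
from typing import Dict

# Hypothetical representation: node name -> number of events seen from that node.
VectorClock = Dict[str, int]


def tick(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of the clock with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Combine two clocks by taking the per-node maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def _leq(a: VectorClock, b: VectorClock) -> bool:
    """True if every counter in `a` is <= the matching counter in `b`."""
    return all(count <= b.get(node, 0) for node, count in a.items())


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """True if `a` causally precedes `b` (a <= b and the clocks differ)."""
    return _leq(a, b) and not _leq(b, a)


def concurrent(a: VectorClock, b: VectorClock) -> bool:
    """True if neither clock precedes the other, i.e. the events conflict."""
    return not _leq(a, b) and not _leq(b, a)
```

For example, if two replicas both start from the same clock and each records its own event, `concurrent` reports the two results as conflicting, and `merge` gives the clock a reconciled value would carry.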

riak_core is also a ‘level 1’ technology: a distributed consistent hash ring, much like what backs Dynamo.
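
For anyone unfamiliar with the term, a consistent hash ring can be sketched in a few lines. The Python below is only an illustration of the general technique, not riak_core's API or its actual ring layout; the `HashRing` class, its method names, and the `vnodes` parameter are assumptions made for this example.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _hash(key: str) -> int:
    """Map a string to a position on the ring (first 8 bytes of SHA-1)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent hash ring with virtual points per node."""

    def __init__(self, nodes: Iterable[str] = (), vnodes: int = 64) -> None:
        self.vnodes = vnodes
        # Sorted list of (ring position, node name).
        self._points: List[Tuple[int, str]] = []
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        """Place `vnodes` points for the node on the ring."""
        for i in range(self.vnodes):
            bisect.insort(self._points, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        """Drop every point that belongs to the node."""
        self._points = [p for p in self._points if p[1] != node]

    def owner(self, key: str) -> str:
        """Return the node owning the first point at or after the key's position."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect(self._points, (_hash(key), ""))
        return self._points[idx % len(self._points)][1]
```

The property that matters is that removing a node only moves the keys that node owned; every other key keeps its owner, which is what makes rebalancing cheap.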

Does level 1 include consensus? Paxos implementations exist, as does ra, a Raft implementation.

Does level 1 include distributing state? If it is locating remote processes, horde, swarm, and pg exist. For eventually consistent state with CRDTs, lasp and delta_crdt exist.
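
To show what "eventually consistent state with CRDTs" means in practice, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter. It is written in Python purely for illustration and is not the API of lasp or delta_crdt, which offer far richer types; the `GCounter` class and its method names are hypothetical.

```python
from typing import Dict, Optional


class GCounter:
    """Hypothetical grow-only counter: one slot per replica, merged by max."""

    def __init__(self, counts: Optional[Dict[str, int]] = None) -> None:
        self.counts: Dict[str, int] = dict(counts or {})

    def increment(self, replica: str, amount: int = 1) -> None:
        """Add to this replica's own slot; the counter can only grow."""
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[replica] = self.counts.get(replica, 0) + amount

    def value(self) -> int:
        """The counter's value is the sum of all replica slots."""
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> "GCounter":
        """Per-replica maximum: commutative, associative, and idempotent."""
        replicas = self.counts.keys() | other.counts.keys()
        return GCounter(
            {r: max(self.counts.get(r, 0), other.counts.get(r, 0)) for r in replicas}
        )
```

Because `merge` gives the same result regardless of order or repetition, replicas can exchange state in any order and still converge without coordination.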

Does level 1 include tools for building a query language? leex, yecc, nimble_parsec exist.

Does level 1 include data structures for caching? Building indexes? Query optimizers?

As you can see, level 1 is broad and my answer hardly scratches the surface. Your end goal isn’t clear. Narrowing your question will make it easier to get answers. Maybe skimming awesome-elixir or awesome-erlang will give some hints, as they’re well organized resources. But, the GenServer certainly is a ‘building block’ for a distributed system.

kokolegorille · March 20, 2021, 2:25am

Anyone willing to help should not have to feel this way. @stevensonmt, thank you for your answer.

Please consider respecting others, even if their answers do not fit your need. If this thread continues in this vein, I will have to close it for a moment…

zeroexcuses · March 20, 2021, 2:37am

With the exception of parsing/query optimization, I’m interested in everything else. (Storage engines, Distributed Clocks, CRDTs). Do you know of any resource that tries to break down the various trade-offs on orthogonal dimensions? This problem domain seems to fit Elixir/Erlang really well.

I feel like there is an interesting set of idioms here that exist at “one level above OTP”, but are still low level enough to be useful primitives for building these distributed databases / transactions.

tj0 · March 20, 2021, 10:41am

With mpope on this one, distributed systems are an extremely broad topic to cover.

I’m not aware of anyone actually breaking down the tradeoffs per se, but there are various good reads out there, which you can find by searching for these keywords.

Serializability vs Linearizability
Operational transforms vs CRDT’s
CAP / PACELC
Vector clocks (Lamport)
Paxos/Raft
TLA+
GitHub - lasp-lang/lasp: Prototype implementation of Lasp in Erlang.
https://www.foundationdb.org/ built an entire language to ensure correctness over the network.

Hope that helps.

antoine · March 20, 2021, 11:53am

Distributed systems is a fascinating topic!

This talk might be interesting to you: ElixirConf 2020 - Thomas Depierre - How we built a distributed datastore in Elixir - YouTube

Not to discourage you, but to get some feedback from people who did it.

An article that I think is good to start with: The dangers of the Single Global Process

To give some feedback from my side: I did some experiments a few years ago on pet projects with riak_core, riak_core_lite, jepsen, and other stuff.
I learned a lot about many topics: network, deployment, docker …

Then I felt like I needed to know more of the theory, so I dove into some very good resources:

Books:

Papers: Raft, Paxos, Dynamo

Then the question was: how to test it? To ensure that it does not only “work on my machine”!

That question led me to https://jepsen.io/ , the Riak test suite and some property-based testing tools.

It’s quite a huge journey to build a distributed database, even with the Erlang/Elixir/OTP stack.

But, even if you build 10% of it, you will have a lot of fun and get very valuable knowledge about:

coordinating many microservices
network latency, idempotency, retry
ordering, duplicate messages
…
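
To illustrate how idempotency, retries, and duplicate messages fit together, here is a small sketch in Python (chosen only for brevity). Everything in it is hypothetical: the `IdempotentAccount` receiver, the `send_with_retry` helper, and the use of `ConnectionError` to stand in for a lost acknowledgement. The idea is that the sender may safely resend because the receiver remembers which message ids it has already applied.

```python
from typing import Callable, Set


class IdempotentAccount:
    """Hypothetical receiver that applies each message id at most once."""

    def __init__(self) -> None:
        self.balance = 0
        self._seen: Set[str] = set()

    def apply(self, message_id: str, amount: int) -> bool:
        """Apply the deposit unless this id was seen before; report if applied."""
        if message_id in self._seen:
            return False  # duplicate delivery, ignore it
        self._seen.add(message_id)
        self.balance += amount
        return True


def send_with_retry(
    deliver: Callable[[str, int], None],
    message_id: str,
    amount: int,
    attempts: int = 3,
) -> bool:
    """Resend the same message id until delivery succeeds or attempts run out."""
    for _ in range(attempts):
        try:
            deliver(message_id, amount)
            return True
        except ConnectionError:
            # The message may or may not have arrived; retrying is only
            # safe because the receiver deduplicates by message id.
            continue
    return False
```

If the first attempt reaches the receiver but its acknowledgement is lost, the retry delivers a duplicate, and the balance still changes only once.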

derek-zhou · March 20, 2021, 3:23pm

Great article. However, starting to think in distributed-system terms too early is another evil. A single global process (SGP), possibly with ETS to speed up reads, can get you a long way.

zakimedina · March 21, 2021, 2:14pm

Your original post triggered conversations I have personally had around how Elixir & OTP present a different paradigm in writing distributed and concurrent software. Building an app using Elixir and OTP is kinda like building a monolith with built-in microservices. It is a different paradigm than developing a traditional microservices architecture which will have a different treatment for storing or persisting user or application state.

OTP itself can help with distributed state and distributed transactions.

Distributed databases are a different beast altogether. Here is a great article that reflects the more cloud-native distributed databases and persistence technologies.

Personally, I have been working with some success on leveraging Distributed SQL databases and their integrations with Ecto and Phoenix. Examples are Yugabyte (which leverages a Postgres query layer) and CockroachDB. You can check those out if you like.

WolfDan · March 21, 2021, 2:45pm

The closest I can think of is FoundationDB (https://www.foundationdb.org/). It’s not a database itself, because it lacks the query layer, planning, etc. that would make it an actual database, but it gives you an API on top of which you can build one. Though it handles all the distributed aspects, you still have to keep a lot of things in mind to maintain those distributed features, so I consider it a good building block.

entone · March 21, 2021, 9:47pm

There are some nice Erlang bindings for RocksDB, and riak_core is probably as close as you’ll get to an off-the-shelf library for a distributed DB.

BarrelDB is also an interesting project, barrel-db · GitLab