Massive distributed parallel processing of large data sets with Elixir?

quda · August 15, 2022, 1:41pm

Is there a library/framework in Elixir for massive distributed processing of large data sets (BigData) ?
Actually, I need something like Apache Hadoop.
I found some mentions that GenStage+Flow could do it, but I cannot figure it and cannot grasp the “massive distributed parallel processing” in these frameworks. How to spread it on tens on nodes/cluster to perform “massive” MapReduce to flows of +50TB of data?
Are there other Elixir solutions for this ?

PS: For the time being we are doing this big data processing using a (ancient) custom stack developed on PhP/Hadoop, but as we need to migrate the entire back-end to Elixir I have to look for an Elixir (easy to implement, performant, sustainable) solution for this back-end tool.

bdarla · August 15, 2022, 2:41pm

You could find some relevant material in Concurrent Data Processing in Elixir (PragProg).
In addition to Gestate and Flow, the book discusses Broadway. According to the book, Broadway “offers a convenient way to build data-ingestion pipelines that consume events from external message brokers, like RabbitMQ, Apache Kafka, Amazon SQS, and more”.
I have to admit that although I have not read the book myself yet, but it is certainly relevant to your interests.

LostKobrakai · August 15, 2022, 5:04pm

GenStage, Flow, Broadway can be building blocks for something like that, but they‘re certainly not a solution for that problem without a lot of additional work put into the parts not provided by them. Especially the distribution part does not exist in them.

quda · August 15, 2022, 5:51pm

Quite indeed. This could be a project per se that requires lots of time/resources for study, planning, design, testing etc. and not a side task to complete “by the end of the month”.

Now I have to convince the customer about this.

MrDoops · August 15, 2022, 6:44pm

You can also look at GitHub - elixir-nx/explorer: Series (one-dimensional) and dataframes (two-dimensional) for fast data exploration in Elixir for fast columnar / OLAP workloads - some sort of Broadway ETL setup that grabs a CSV, imports to Explorer, runs your OLAP workloads, then loads the aggregated/transformed results is a very common sort of pipeline Explorer would be good at.

I’d recommend also looking at solutions such as https://flink.apache.org/ or https://materialize.com/ depending on your use case.

Flow + Broadway might get you far enough though if PHP + Hadoop is working already. I have a sneaking suspicion the Elixir ecosystem, maybe via the Nx projects, will end up with some Apache BEAM / distributed dataflow type tooling. At the very least something similar to Explorer taking elixir expressions that execute via Rust NIFs but in streaming dataflow contexts like Apache BEAM (the other BEAM - not our BEAM, but which is also cool).

SirWerto · August 16, 2022, 8:00am

Maybe you can give a look to Vessel. It is a interface with Hadoop written in Elixir. I don’t know the status of the package but could be helpful if you are keeping Hadoop in your stack.

quda · August 16, 2022, 1:00pm

Vessel looks quite promising for our scope, but sadly it seems abandoned and un-released: “Vessel is currently in a pre-v1 state”. And not a word about distributed processing.