uranther
Big Data with Elixir
I want to write Big Data applications in Elixir using, for example, the Lambda architecture. However, most of these software packages are written in Java, sometimes with an Apache Thrift interface and/or a Python API. I am wondering if anyone else is interested in this sort of thing, and what they’ve discovered so far.
Wiki
Elixir / OTP primitives
- GenStage is a specification for exchanging events between producers and consumers. It also provides a mechanism to specify the computational flow between stages.
  - GenStage (docs) - a behaviour for implementing producer and consumer stages
  - Flow (separate repo - docs) - Flow allows developers to express computations on collections, similar to the Enum and Stream modules, although computations will be executed in parallel using multiple GenStages
  - ConsumerSupervisor (docs) - a supervisor designed for starting children dynamically. Besides being a replacement for the :simple_one_for_one strategy in the regular Supervisor, a ConsumerSupervisor can also be used as a stage consumer, making it straightforward to spawn a new process for every event in a stage pipeline
- GenEvent is a behaviour module for implementing event handling functionality.
The event handling model consists of a generic event manager process with an arbitrary number of event handlers which are added and deleted dynamically.
An event manager implemented using this module will have a standard set of interface functions and include functionality for tracing and error reporting. It will also fit into a supervision tree.
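To make the demand-driven exchange between producers and consumers more concrete, here is a minimal sketch. It is written in Python purely as a neutral illustration and is not GenStage's API. It assumes a synchronous, single-process model, whereas real stages are separate processes exchanging messages. The class names and the `max_demand` parameter are illustrative; `handle_demand` is only loosely borrowed from the GenStage callback of the same name.

```python
from typing import Callable, Iterator, List


class Producer:
    """Emits events only when asked, and never more than the demand."""

    def __init__(self, source: Iterator[int]) -> None:
        self._source = source

    def handle_demand(self, demand: int) -> List[int]:
        events: List[int] = []
        for _ in range(demand):
            try:
                events.append(next(self._source))
            except StopIteration:
                break  # source exhausted: return what we have
        return events


class ProducerConsumer:
    """Forwards demand upstream and transforms whatever comes back."""

    def __init__(self, upstream, transform: Callable[[int], int]) -> None:
        self._upstream = upstream
        self._transform = transform

    def handle_demand(self, demand: int) -> List[int]:
        return [self._transform(e) for e in self._upstream.handle_demand(demand)]


class Consumer:
    """Drives the pipeline by repeatedly asking for at most max_demand events."""

    def __init__(self, upstream, max_demand: int) -> None:
        self._upstream = upstream
        self._max_demand = max_demand
        self.batch_sizes: List[int] = []

    def run(self) -> List[int]:
        received: List[int] = []
        while True:
            batch = self._upstream.handle_demand(self._max_demand)
            if not batch:
                break  # nothing left upstream
            self.batch_sizes.append(len(batch))
            received.extend(batch)
        return received


# Hypothetical wiring: numbers -> doubled -> collected, three events at a time.
pipeline = Consumer(ProducerConsumer(Producer(iter(range(10))), lambda x: x * 2), max_demand=3)
results = pipeline.run()
```

The point of the sketch is that the consumer sets the pace: the producer never emits more than was asked for, which is how back-pressure falls out of the design.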
Basho Riak
- Bitcask is an Erlang application that provides an API for storing and retrieving key/value data using log-structured hash tables that provide very fast access. The design of Bitcask was inspired, in part, by log-structured filesystems and log file merging.
- Riak KV (key-value) is a distributed NoSQL database designed to deliver maximum data availability by distributing data across multiple servers. As long as your Riak KV client can reach one Riak server, it should be able to write data. Its default storage backend is Bitcask and it also supports LevelDB and memory backends.
Riak KV Enterprise includes multi-datacenter cluster replication, which ensures low latency and robust business continuity.
- Riak CS (cloud storage) is an object storage system built on top of Riak. It facilitates storing large objects in Riak and presents an S3-compatible interface. It also provides multi-tenancy features such as user accounts, authentication, access control mechanisms, and per account usage reporting.
- Riak TS (time-series) is a distributed NoSQL key/value store optimized for time series data. With TS, you can associate a number of data points with a specific point in time. TS uses discrete slices of time to co-locate data. For example, humidity and temperature readings from a meter reported during the same slice of time will be stored together on disk.
- Riak Pipelines is most simply described as “UNIX pipes for Riak.” In much the same way you would pipe the output of one program to another on the command line (e.g. find . -name *.hrl | xargs grep define | uniq | wc -l), riak_pipe allows you to pipe the output of a function on one vnode to the input of a function on another (e.g. kvget | xform | reduce).
- Riak Ensemble is a consensus library that supports creating multiple consensus groups (ensembles). Each ensemble is a separate Multi-Paxos instance with its own leader, set of members, and state.
- Basho Data Platform reduces the complexity of integrating and deploying the components of your technology stack, providing Riak KV in-product, NoSQL databases, caching, real-time analytics, and search. These features are required in order to run distributed active workloads across applications; BDP controls the replication and synchronization of data between components while also providing cluster management.
- Basho Data Platform (BDP) builds on Riak KV (Riak) to support your data-centric services. Ensure your application is highly available and scalable by leveraging BDP features such as:
  - Data replication & synchronization between components
  - Real-time analytics through Apache Spark integration
  - Cluster management
  - Caching with Redis for rapid performance (Enterprise only)
- Data Platform Core is the additive component to Riak KV, colloquially called “the Service Manager”, that enables Riak to run supervised 3rd party applications on a Riak+BDP cluster. More generally, it can be thought of as an application executor and watcher that provides service configuration metadata exchange in a distributed, scalable, fault-tolerant way. The service manager is the foundation of the Basho Data Platform. It provides a means for building a cluster of nodes that can deploy, run, and manage platform services.
- Cache Proxy (Enterprise-only) service uses Redis and Riak KV to provide pre-sharding and connection aggregation for your data platform cluster, which reduces latency and increases addressable cache memory space with lower cost hardware. Cache proxy has the following components:
  - Pre-sharding
  - Connection Aggregation
  - Command Pipelining
  - Read-through Cache
- Leader Election (Enterprise-only) service enables Spark clusters to run without a ZooKeeper instance. The Leader Election Service uses a simple, line-based ASCII protocol to interact with Spark. This protocol is incompatible with the ZooKeeper protocol, and requires a BDP-specific patch to Spark for compatibility purposes.
- Spark Cluster Manager (Enterprise-only) provides all the functionality required for Spark Master high availability without the need to manage yet another software system (ZooKeeper). This reduces the operational complexity of Basho Data Platform (BDP).
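The "log-structured hash table" idea behind Bitcask, mentioned at the top of this list, can be sketched roughly as follows. This is a Python illustration under simplifying assumptions, not Bitcask's actual file format or API: there is a single data file, no checksums or timestamps, no recovery on restart, and no merging of stale entries. The `TinyCask` name and the record layout are made up for the example.

```python
import os
import struct
import tempfile


class TinyCask:
    """Append-only log plus an in-memory index of key -> (value offset, value length)."""

    HEADER = struct.Struct(">II")  # key length, value length (hypothetical layout)

    def __init__(self, path: str) -> None:
        # Starts a fresh log; rebuilding the index by replaying an old log is omitted.
        self._file = open(path, "w+b")
        self._keydir = {}

    def put(self, key: bytes, value: bytes) -> None:
        # Writes always go to the end of the log; old values are never overwritten.
        self._file.seek(0, os.SEEK_END)
        record_offset = self._file.tell()
        self._file.write(self.HEADER.pack(len(key), len(value)) + key + value)
        self._file.flush()
        value_offset = record_offset + self.HEADER.size + len(key)
        # The index now points at the newest value; the older record becomes garbage.
        self._keydir[key] = (value_offset, len(value))

    def get(self, key: bytes):
        entry = self._keydir.get(key)
        if entry is None:
            return None
        value_offset, length = entry
        # One hash lookup and one positioned read per fetch.
        self._file.seek(value_offset)
        return self._file.read(length)

    def close(self) -> None:
        self._file.close()


# Hypothetical usage with a throwaway file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.log")
cask = TinyCask(demo_path)
cask.put(b"sensor-1", b"21.5")
cask.put(b"sensor-1", b"22.0")  # supersedes the first record in the index only
latest = cask.get(b"sensor-1")
cask.close()
```

Reads stay fast because the index lives in memory, and writes stay fast because they are sequential appends. The cost is the garbage left behind by overwritten keys, which is where the log file merging mentioned above comes in.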
Other
- Disco is a distributed map-reduce and big-data framework that is similar to Hadoop. It has its own distributed file system called DDFS and can interact with HDFS. It is written in Erlang and exposes a Python interface. I have not found any Erlang API docs, but it should be feasible to create an Elixir library from it.
- Skel is a streaming process-based skeleton library for Erlang. (Tutorial)
- Skel is a library produced as part of the Paraphrase project to assist in the introduction of parallelism for Erlang programs. It is a collection of algorithmic skeletons, a structured set of common patterns of parallelism, that may be used and customised for a range of different situations.
  - Workflow Items
  - A Recurring Example
  - Sequential Function
  - Pipe Skeleton
  - Farm Skeleton
  - Ord Skeleton
  - Reduce Skeleton
  - Map Skeleton
  - Feedback Skeleton
- CouchDB is a document-oriented NoSQL database architecture and is implemented in Erlang; it uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API.
- Paratize - Elixir library providing some handy parallel processing facilities, with a configurable number of workers and timeout.
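For anyone unfamiliar with algorithmic skeletons, here is a rough sketch of two of the patterns named in the Skel tutorial outline above: a pipe, where stages are applied in sequence, and a farm, where one worker function is applied to independent inputs in parallel. It is written in Python with a thread pool and is not Skel's API; the function names and the `n_workers` parameter are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import Any, Callable, Iterable, List


def pipe(*stages: Callable[[Any], Any]) -> Callable[[Iterable[Any]], List[Any]]:
    """Compose stages so that every input flows through them in order."""

    def run(inputs: Iterable[Any]) -> List[Any]:
        return [reduce(lambda acc, stage: stage(acc), stages, item) for item in inputs]

    return run


def farm(worker: Callable[[Any], Any], n_workers: int) -> Callable[[Iterable[Any]], List[Any]]:
    """Apply one worker function to independent inputs using a pool of workers."""

    def run(inputs: Iterable[Any]) -> List[Any]:
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            # pool.map returns results in input order in this sketch.
            return list(pool.map(worker, inputs))

    return run


# Hypothetical usage: skeletons are plain functions, so they nest.
square_then_inc = pipe(lambda x: x * x, lambda x: x + 1)
farmed = farm(lambda x: square_then_inc([x])[0], n_workers=4)
values = farmed([1, 2, 3])
```

The appeal of the approach is that the parallel structure is named and reusable, so you swap a pipe for a farm without rewriting the worker functions themselves.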
Hadoop
- HttpFS (REST API for HDFS)
- C API libhdfs - could be wrapped with an Elixir NIF
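Since HttpFS speaks the WebHDFS REST API, talking to HDFS from any language mostly comes down to building the right URLs and sending plain HTTP requests. A minimal sketch of the URL side, in Python: it assumes HttpFS is listening on its usual default port 14000 and that simple `user.name` pseudo-authentication is enabled. The host name, user, and the `httpfs_url` helper are hypothetical.

```python
from urllib.parse import quote, urlencode


def httpfs_url(host: str, path: str, op: str, user: str, port: int = 14000, **params) -> str:
    """Build a WebHDFS-style URL for an HttpFS gateway (no request is sent)."""
    query = urlencode({"op": op, "user.name": user, **params})
    # The HDFS path is appended after the fixed /webhdfs/v1 prefix.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical examples: read a file, list a directory.
open_url = httpfs_url("httpfs.example.com", "/data/logs/app.log", "OPEN", "alice")
list_url = httpfs_url("httpfs.example.com", "/data/logs", "LISTSTATUS", "alice")
```

An Elixir client would follow the same shape with any HTTP library, which makes this route much less work than wrapping libhdfs in a NIF.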
Apache Kafka
Kafka is a high-throughput distributed messaging system.
- KafkaEx - someone has dutifully made an Elixir library for Kafka using a binary interface
DOA
- Apache Spark - Java or Python interface
- Apache Storm - JVM + Clojure DSL; uses Apache Thrift, so non-JVM interfaces are feasible
- Apache Samza - JVM + Clojure DSL; non-JVM support is on the roadmap
- Apache Flink - JVM-only; example of Clojure implementation
From the above, you can see why I am interested in learning Clojure to interface with these libraries (I want to stay far, far away from Java, and maybe less far from Scala). But wouldn’t it be cool to have these available in Elixir?!
Most Liked
josevalim
This is a very bad attempt to grab the headlines.
Java is not becoming Erlang, rather Java has a more complete ecosystem than Erlang/Elixir, which is not news for anybody.
And this is not a zero-sum game. I have seen companies running hundreds of “heterogenous” Erlang nodes by using RabbitMQ or Apache Kafka for the communication. In my opinion, where Erlang shines is exactly in building a homogeneous sub-system that runs on my infrastructure. It is an absolute pleasure to build a distributed pubsub system for websockets communication or something like Phoenix.Presence without a need to bring in third party tools. Then it can integrate with the rest of the system using Thrift, a message queue, etc.
In the worst scenario, where you cannot rely on the Erlang distribution, then you have to pick up an off the shelf solution, as you would in any other language. And as the Erlang/Elixir communities grow, the quality of the packages that interface with those off-the-shelf brands will continue to improve.
I truly hope this is fading away because it is easy to point out that parallel, concurrent, and distributed computing have so many branches that there will be no single platform or ecosystem that can effectively tackle all options. Yes, you will end up mixing technologies. However, I do hope we will continue to improve on our side of things.
ejc123
I’ve given this question much consideration since my previous reply. Another thing to define is what you mean by ‘Big Data’. 5 years ago, it was all about Hadoop/Map-Reduce on data sets too large for a single machine.
More recently, people are tired of waiting for results and the tedium of actually programming Map-Reduce so you get things like Spark which use functional style to make the programming sane and close-enough-to-real-time processing.
I’m less familiar with Storm, Samza and Flink, but after a quick reading I’d say they all attack stream processing – such as log file analysis or credit fraud detection. Things where a decision is made on dynamic pools of data. I believe the same is possible with Spark, it’s just that these projects have their own specializations.
Just to muddy the waters further, we shouldn’t leave out Machine Learning and Deep Learning from the discussion.
I still have the feeling that Elixir could kill in all these areas. I have messed with Hadoop and Spark, and the machinations they go through to manage nodes are unbelievable.
My plan for today is to watch those talks and do some more research on what is missing from Elixir’s ecosystem. Like you, I suspect it’s popularity.
ssagaert
I think you are sadly mistaken: something like Spark cannot be duplicated in Elixir, easily or painfully.
The reason: Elixir (& Erlang) are simply not suited for scientific programming/data science programming because they do not have mutable arrays and the corresponding linear algebra, machine learning & DSP libraries that build on top of them. Without mutable (large & possibly distributed) arrays it’s simply not possible to do any efficient number crunching.
Big Data is not just about accessing lots of data (files on HDFS or in various NoSQL dbs -> this can be done in Elixir); it really is about what you do with that data once you’ve accessed it, and that’s where data science/machine learning comes in.
There was one promising project called Matlang for Erlang by one Ph.D. guy, but that seems dead. It was based on wrapping the native BLAS matrix library, just like MATLAB, SciPy, etc. do. It showed decent performance. One of the reasons this never became mainstream is that it required a patch to the Erlang VM for efficiently calling BLAS, and that patch was rejected by the Erlang team.
Also, the Erlang VM doesn’t do any jitting to native code like the JVM does, so (scalar) numerical performance, especially in tight loops that iterate over collections of numbers, is not great (without wrapping a native numerical lib).







