Big Data with Elixir

I want to write Big Data applications in Elixir using, for example, the Lambda architecture. However, most of these software packages are written in Java, sometimes with an Apache Thrift interface and/or a Python API. I am wondering if anyone else is interested in this sort of thing, and what they’ve discovered so far.

Wiki

Elixir / OTP primitives

  • GenStage is a specification for exchanging events between producers and consumers. It also provides a mechanism to specify the computational flow between stages.
    • GenStage (docs) - a behaviour for implementing producer and consumer stages
    • Flow (separate repo - docs) - Flow allows developers to express computations on collections, similar to the Enum and Stream modules, although computations will be executed in parallel using multiple GenStages (see the word-count sketch after this list)
    • ConsumerSupervisor (docs) - a supervisor designed for starting children dynamically. Besides being a replacement for the :simple_one_for_one strategy in the regular Supervisor, a ConsumerSupervisor can also be used as a stage consumer, making it straightforward to spawn a new process for every event in a stage pipeline
  • GenEvent is a behaviour module for implementing event handling functionality.
    The event handling model consists of a generic event manager process with an arbitrary number of event handlers which are added and deleted dynamically.
    An event manager implemented using this module will have a standard set of interface functions and include functionality for tracing and error reporting. It will also fit into a supervision tree.
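
To make Flow concrete, here is essentially the word-count example from the Flow README, lightly annotated (assumes gen_stage and flow as dependencies):

```elixir
# Count word frequencies across a large file in parallel.
# Each step runs as a set of GenStage processes under the hood.
File.stream!("words.txt")
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split(&1, " "))
|> Flow.partition()                      # route equal words to the same stage
|> Flow.reduce(fn -> %{} end, fn word, acc ->
  Map.update(acc, word, 1, &(&1 + 1))
end)
|> Enum.to_list()
```

The partition step is what makes the reduce safe: every occurrence of a given word is guaranteed to land on the same stage, so each stage can keep its own local map.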

Basho Riak

  • Bitcask is an Erlang application that provides an API for storing and retrieving key/value data using log-structured hash tables that provide very fast access. The design of Bitcask was inspired, in part, by log-structured filesystems and log file merging.
  • Riak KV (key-value) is a distributed NoSQL database designed to deliver maximum data availability by distributing data across multiple servers. As long as your Riak KV client can reach one Riak server, it should be able to write data. Its default storage backend is Bitcask, and it also supports LevelDB and memory backends (see the client sketch after this list).
    Riak KV Enterprise includes multi-datacenter cluster replication, which ensures low latency and robust business continuity.
  • Riak CS (cloud storage) is an object storage system built on top of Riak. It facilitates storing large objects in Riak and presents an S3-compatible interface. It also provides multi-tenancy features such as user accounts, authentication, access control mechanisms, and per account usage reporting.
  • Riak TS (time-series) is a distributed NoSQL key/value store optimized for time series data. With TS, you can associate a number of data points with a specific point in time. TS uses discrete slices of time to co-locate data. For example, humidity and temperature readings from a meter reported during the same slice of time will be stored together on disk.
  • Riak Pipe is most simply described as “UNIX pipes for Riak.” In much the same way you would pipe the output of one program to another on the command line (e.g. find . -name '*.hrl' | xargs grep define | uniq | wc -l), riak_pipe allows you to pipe the output of a function on one vnode to the input of a function on another (e.g. kvget | xform | reduce).
  • Riak Ensemble is a consensus library that supports creating multiple consensus groups (ensembles). Each ensemble is a separate Multi-Paxos instance with its own leader, set of members, and state.
  • Basho Data Platform reduces the complexity of integrating and deploying the components of your technology stack, providing NoSQL databases (Riak KV), caching, real-time analytics, and search in one product. These features are required in order to run distributed active workloads across applications; BDP controls the replication and synchronization of data between components while also providing cluster management.
    • Basho Data Platform (BDP) builds on Riak KV (Riak) to support your data-centric services. Ensure your application is highly available and scalable by leveraging BDP features such as:
      • Data replication & synchronization between components
      • Real-time analytics through Apache Spark integration
      • Cluster management
      • Caching with Redis for rapid performance (Enterprise only)
    • Data Platform Core is the additive component to Riak KV, colloquially called “the Service Manager”, that enables Riak to run supervised 3rd-party applications on a Riak+BDP cluster. More generally, it can be thought of as an application executor and watcher that provides service configuration metadata exchange in a distributed, scalable, fault-tolerant manner. The service manager is the foundation of the Basho Data Platform. It provides a means for building a cluster of nodes that can deploy, run, and manage platform services.
    • Cache Proxy (Enterprise-only) service uses Redis and Riak KV to provide pre-sharding and connection aggregation for your data platform cluster, which reduces latency and increases addressable cache memory space with lower cost hardware. Cache proxy has the following components:
      • Pre-sharding
      • Connection Aggregation
      • Command Pipelining
      • Read-through Cache
    • Leader Election (Enterprise-only) service enables Spark clusters to run without a ZooKeeper instance. The Leader Election Service uses a simple, line-based ASCII protocol to interact with Spark. This protocol is incompatible with the ZooKeeper protocol and requires a BDP-specific patch to Spark for compatibility purposes.
    • Spark Cluster Manager (Enterprise-only) provides all the functionality required for Spark Master high availability without the need to manage yet another software system (ZooKeeper). This reduces the operational complexity of the Basho Data Platform (BDP).
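
For a feel of what Riak KV looks like from Elixir today, here is a minimal round-trip sketch that calls Basho’s Erlang protocol-buffers client (riakc) directly, assuming riakc as a dependency and a local node on the default PB port 8087 (bucket and key names are made up):

```elixir
# Connect to a local Riak node over protocol buffers.
{:ok, pid} = :riakc_pb_socket.start_link('127.0.0.1', 8087)

# Build and store an object in the "readings" bucket.
obj = :riakc_obj.new("readings", "sensor-42", "temp=21.3")
:ok = :riakc_pb_socket.put(pid, obj)

# Fetch it back and print the stored value.
{:ok, fetched} = :riakc_pb_socket.get(pid, "readings", "sensor-42")
IO.puts(:riakc_obj.get_value(fetched))
```

Because riakc is plain Erlang it needs no wrapper at all, though an idiomatic Elixir client would make the API friendlier.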

Other

  • Disco is a distributed map-reduce and big-data framework that is similar to Hadoop. It has its own distributed file system called DDFS and can interact with HDFS. It is written in Erlang and exposes a Python interface. I have not found any Erlang API docs, but it should be feasible to create an Elixir library from it.
  • Skel is a streaming process-based skeleton library for Erlang. (Tutorial)
    • Skel is a library produced as part of the Paraphrase project to assist in the introduction of parallelism for Erlang programs. It is a collection of algorithmic skeletons, a structured set of common patterns of parallelism, that may be used and customised for a range of different situations (see the sketch after this list).
    • Workflow Items
      • A Recurring Example
      • Sequential Function
      • Pipe Skeleton
      • Farm Skeleton
      • Ord Skeleton
      • Reduce Skeleton
      • Map Skeleton
      • Feedback Skeleton
  • CouchDB is a document-oriented NoSQL database implemented in Erlang; it uses JSON to store data, JavaScript as its query language (via MapReduce), and HTTP for its API.
  • Paratize - an Elixir library providing some handy parallel processing facilities that support configuring the number of workers and a timeout.
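
Since Skel is plain Erlang, it should be callable from Elixir directly. A rough sketch of a pipe skeleton, based on the skel:do/2 entry point shown in the tutorial (I have not verified this exact invocation from Elixir):

```elixir
# Two sequential stages composed into a pipe skeleton: each input
# flows through inc and then double, with the stages running as
# separate, concurrent processes.
inc    = fn x -> x + 1 end
double = fn x -> x * 2 end

:skel.do([{:pipe, [{:seq, inc}, {:seq, double}]}], [1, 2, 3])
# should return [4, 6, 8]
```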

Hadoop

Apache Kafka

Kafka is a high-throughput distributed messaging system.

  • KafkaEx - someone has dutifully made an Elixir client for Kafka that speaks Kafka’s binary wire protocol
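
A quick sketch of produce/fetch with KafkaEx, going by its README (topic name and message are placeholders; brokers are assumed to be configured under :kafka_ex in config.exs):

```elixir
# Send one message to partition 0 of the "events" topic.
KafkaEx.produce("events", 0, "page_view:42")

# Read messages from that partition, starting at offset 0,
# and inspect the raw fetch response.
"events"
|> KafkaEx.fetch(0, offset: 0)
|> IO.inspect()
```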

DOA

From the above, you can see why I am interested in learning Clojure to interface with these libraries (I want to stay far, far away from Java, and maybe less far from Scala). But wouldn’t it be cool to have these available in Elixir?!

39 Likes

I admit that I’m very new to Elixir and Erlang, but it seems to me that these are mostly attempts to build Erlang/OTP on the JVM. Granted, I’m not aware of a filesystem like HDFS for Elixir, but I think that much of this is trying to catch up to where Erlang/OTP is.

3 Likes

I think you’re right! However, where is the Big Data revolution with Erlang/Elixir? It looks like the Erlang User Conference 2013 had a Big Data track where Erlang is considered to be the next big thing. (Each of those has a YouTube video, if you search the talk’s title.)

It must just be a matter of Java being more popular than Erlang, despite Erlang/OTP being more well-suited for the task. I guess it is up to us now to bring Elixir to the arena! :wink:

2 Likes

I’ve given this question much consideration since my previous reply. Another thing to define is what you mean by ‘Big Data’. 5 years ago, it was all about Hadoop/Map-Reduce on data sets too large for a single machine.
More recently, people are tired of waiting for results and the tedium of actually programming Map-Reduce, so you get things like Spark, which uses a functional style to make the programming sane and the processing close enough to real time.

I’m less familiar with Storm, Samza and Flink, but after a quick reading I’d say they all attack stream processing, such as log file analysis or credit fraud detection: things where a decision is made on dynamic pools of data. I believe the same is possible with Spark; it’s just that these projects have their own specializations.
Just to muddy the waters further, we shouldn’t leave out Machine Learning and Deep Learning from the discussion.

I still have the feeling that Elixir could kill in all these areas. I have messed with Hadoop and Spark, and the machinations they go through to manage nodes are unbelievable.

My plan for today is to watch those talks and do some more research on what is missing from Elixir’s ecosystem. Like you, I suspect it’s popularity.

8 Likes

Spark achieves results similar to Storm’s or Flink’s real-time stream processing by shortening its batch time to what it calls a micro-batch. Spark Streaming is not exactly the same as real-time streaming, but the micro-batch approach can get you pretty darned close (on the order of seconds).

I read a pretty interesting benchmark post about this from some engineers at Yahoo - https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
Also, one of the guys from Yahoo was on Software Engineering Daily podcast talking about the results: http://softwareengineeringdaily.com/2016/02/03/benchmarking-stream-processing-frameworks-with-bobby-evans/

I am very much in agreement with the group here… why are all these tools written in Java? I think some of it is a self-fulfilling prophecy in that Hadoop was written in Java and became an Apache Foundation project. The Apache Foundation, while not exclusively Java-only, is favorable to Java :wink: So as they take on new Hadoop-related (and derivative/alternative) projects, the projects one by one seem to go through Apache incubation and then promotion to official Apache projects. Maybe it’s this success in getting a proper home for your project that has attracted the authors to go with Java first? Who knows?

Maybe what we need is a similar thing for Elixir/Erlang (not sure, it may exist already)?

5 Likes

Very interesting. I’m thinking about doing something with HDF5 plus Disco and TensorFlow. I come from Python and am getting familiar with the Lambda architecture. See my recent post for an overview: https://lnkd.in/dwtt5PX
I have now started learning Elixir.

3 Likes

I am very interested in Disco (I <3 Donna Summer… oh wait)… I mean Disco Project :wink:

You actually write the programs in Python, so I guess it’s interesting from an Erlang/Elixir perspective under the covers, but not as much in its end use cases.

4 Likes

Hi,
Just found the Disco framework a few days ago. I have been looking at other options, though.

I am very interested in Disco

:smiley: Have you watched the movie The Martian?

Basically, I have studied the Hadoop stack, but Java is not really my thing. The (formal) complexity of MapReduce is too high for the purposes for which I am implementing this pipeline (small-to-medium projects running on cloud providers, with very segmented data; basically, every microservice would handle a few variables indexed by geopoints, plus their aggregates), and it is usually too resource-draining in terms of learning/working hours and compute costs. Something like the map and reduce command pipeline used by the MongoDB CLI (in former versions) would already be quite a powerful tool.
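
For that scale, plain Elixir might already cover it. A toy sketch of the kind of aggregation described, with made-up data (geopoint-keyed sensor readings) and nothing beyond the standard library:

```elixir
# Hypothetical readings as {geopoint, value} pairs.
readings = [{"50.1,14.4", 21.3}, {"50.1,14.4", 22.1}, {"48.2,16.4", 19.8}]

# "Map" phase: group the values by geopoint.
grouped =
  Enum.group_by(readings, fn {geo, _val} -> geo end, fn {_geo, val} -> val end)

# "Reduce" phase: collapse each group into an average.
Map.new(grouped, fn {geo, vals} -> {geo, Enum.sum(vals) / length(vals)} end)
#=> %{"48.2,16.4" => 19.8, "50.1,14.4" => 21.7} (modulo float rounding)
```

Swapping Enum for Flow would later parallelize the same shape of pipeline without changing the logic much.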

By now I am doing research using the netCDF4 Python module (https://netcdf4-python.googlecode.com/svn/trunk/docs/netCDF4-module.html); netCDF is Unidata’s format for exchanging array-oriented scientific data, built on top of HDF5 (the Hierarchical Data Format). I am also using open source GIS solutions for geoquerying.
My focus now is to design and develop simple ad-hoc solutions to map HDF5 data to Web services with Python.
Soon I would like to add some new layers to this pipeline: map-reduce (and I found the Disco framework’s Python API) or a basic homebrewed library with some simple methods (as mentioned above regarding early Mongo implementations), and a “compute unit” to help data scientists study aggregates.
I am thinking of integrating this system with Kubernetes, using multiple clusters of microservices. It’s a cool experiment, I think.
I am considering Elixir because I have always wanted to learn functional programming, and its concurrency could be the thing that makes this possible.
Any idea or comment welcome :slight_smile:

PS. Doing something with Elixir would make everything non-monolithic and easily scalable to any level of usage, lowering deployment barriers. Making Python and Elixir work together programmatically over HTTP would be great!

PPS. I am also on Slack as lorenzogotuned, I started a channel #web-and-data

2 Likes

This sounds exciting. I’m not really versed in Slack (slack.com, right?) yet. I tried to search for lorenzogotuned and #web-and-data, but didn’t find anything. What domain are you on, or am I getting something wrong?

Hi, cool!
You have to sign up for the Elixir Slack group at https://elixir-slackin.herokuapp.com/ and then sign in to Slack when you get the invitation in your inbox. Once you are signed in, you can look for the channel.
See you there.

1 Like

Thank you. I was missing the Elixir Slack group.

We can actually start collecting basic design conditions (tools, target size of the data, deployment hypotheses, etc.) and the links from our different posts into a repository. Are you OK if I start it on GitHub with a README? We can write a kind of “state of the art” there as an introduction, to see what can be done or planned.

1 Like

Excellent idea!

3 Likes

Don’t forget you can make wikis here too (which anyone at Trust Level 1 or higher can edit). If @uranther is ok with it we can make the first post here a wiki and you could all maintain it.

The main benefits are that it may encourage more discussion, it shows up if anyone searches for it, and it’s easier to make changes to (no need for PRs to be approved, etc.). Up to you :slight_smile:

4 Likes

I don’t think you can beat something like Apache Spark (Spark SQL, Spark Streaming, Machine Learning, GraphX).
But what would be nice is some kind of interoperability, for example Elixir -> Apache Spark or Apache Spark -> Elixir.

2 Likes

They have many more man-years of development invested, too. Because of OTP, the same functionality can be duplicated in much less Erlang/Elixir code and have better scalability and reliability.

I agree! In the meantime of some major Elixir Big Data projects, interoperability would be great.

5 Likes

I think you are sadly mistaken: something like Spark cannot be duplicated in Elixir, easily or painfully.

The reason: Elixir (and Erlang) are simply not suited for scientific/data-science programming because they do not have mutable arrays, nor the corresponding linear algebra, machine learning, and DSP libraries that build on top of them. Without mutable (large, and possibly distributed) arrays it’s simply not possible to do any efficient number crunching.

Big Data is not just about accessing lots of data (files on HDFS or in various NoSQL DBs; that part can be done in Elixir); it is really about what you do with that data once you’ve accessed it, and that’s where data science/machine learning comes in.

There was one promising project called Matlang for Erlang by a PhD student, but that seems dead. It was based on wrapping the native BLAS matrix library, just like MATLAB, SciPy, etc. do, and it showed decent performance. One of the reasons it never became mainstream is that it required a patch to the Erlang VM to call BLAS efficiently, and that patch was rejected by the Erlang team.

Also, the Erlang VM doesn’t do any JIT compilation to native code like the JVM does, so (scalar) numerical performance and tight loops iterating over collections of numbers are not great (without wrapping a native numerical library).

7 Likes

@ssagaert
Very interesting post.
We are in fact talking about creating a possible synergy between Elixir and Python in the same pipeline: leveraging Elixir to access HDFS-like storage, handing off to Python for the map-reduce or machine learning, and coming back to Elixir for database and Web interoperability (see the sketch at the end of this post). This could be done using protocol buffers or a common swap space.
Again, one of the weaknesses of the software you mentioned is being monolithic and enforcing the use of certain tools (above all, Java).
Python can have some overhead problems performing like the current tools on the market, so creating a pipeline with the characteristics above could be a great plus for the Python users (who are the great majority of data scientists) not using the Apache stack.
Neither I nor the others said it’s easy or a few weeks’ work, but there is quite a demand to “ease” the current big-data pipeline.
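
As a sketch of the Elixir-to-Python hand-off, here is what a call through ErlPort could look like (assuming erlport as a dependency; the Python module pipeline and its aggregate function are hypothetical):

```elixir
# Start a Python interpreter as a managed external process.
{:ok, py} = :python.start()

# Call a (hypothetical) aggregate/1 function in a pipeline.py
# module that must be on the Python path.
result = :python.call(py, :pipeline, :aggregate, [[1.0, 2.0, 3.0]])

:python.stop(py)
```

ErlPort serializes terms between the two runtimes, so plain lists and numbers cross the boundary without a custom protocol-buffer layer.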

2 Likes

I don’t think wrapping Python with Elixir is a very good idea. Python is slow and already a glue language, so that’s like two glue layers. For performance you should interface with a faster, lower-level language.

“there is quite a demand to “ease” the current big-data pipeline.”
I’m not sure this is true. Have you even studied Spark? You can already program it in four languages: Scala (the native language it’s written in; a high-level language that is fast, though also unfortunately complex), Java, R, and Python. So for the Pythonistas: there really is no need to recreate Spark in Python, since it already has a Python interface. Beyond that, you can do a lot of data retrieval/manipulation in SQL (for non-programmers), and there are interactive notebooks (like in IPython).

Nothing prevents you in principle from developing an Elixir interface to Spark, but to make this workable you would need efficient (mutable) data structures to represent dataframes and large arrays (vectors/matrices/tensors) in Elixir, and those do not exist at this time. And no, a (nested) (linked) list is NOT going to cut it for representing vectors/matrices/tensors in Elixir.

6 Likes