uranther

Big Data with Elixir

I want to write Big Data applications in Elixir using, for example, the Lambda architecture. However, most of the relevant software packages are written in Java, sometimes with an Apache Thrift interface and/or a Python API. I am wondering if anyone else is interested in this sort of thing, and what they’ve discovered so far.

Wiki

Elixir / OTP primitives

  • GenStage is a specification for exchanging events between producers and consumers. It also provides a mechanism to specify the computational flow between stages.
    • GenStage (docs) - a behaviour for implementing producer and consumer stages
    • Flow (separate repo - docs) - Flow allows developers to express computations on collections, similar to the Enum and Stream modules, although computations will be executed in parallel using multiple GenStages
    • ConsumerSupervisor (docs) - a supervisor designed for starting children dynamically. Besides being a replacement for the :simple_one_for_one strategy in the regular Supervisor, a ConsumerSupervisor can also be used as a stage consumer, making it straightforward to spawn a new process for every event in a stage pipeline
  • GenEvent is a behaviour module for implementing event handling functionality.
    The event handling model consists of a generic event manager process with an arbitrary number of event handlers which are added and deleted dynamically.
    An event manager implemented using this module will have a standard set of interface functions and include functionality for tracing and error reporting. It will also fit into a supervision tree.
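GenStage exchanges events between producers and consumers on a demand basis: a consumer asks for a number of events and the producer sends at most that many. The following is a minimal, language-neutral sketch of that idea in Python, not GenStage's actual API. The class names, the `handle_demand` method as written here, and the `limit` and `max_demand` parameters are assumptions made for illustration.

```python
class CounterProducer:
    """Emits consecutive integers, but only as many as were asked for."""

    def __init__(self, limit):
        self.next_value = 0
        self.limit = limit  # hypothetical cap so the sketch terminates

    def handle_demand(self, demand):
        # Return at most `demand` events; fewer once the source runs dry.
        end = min(self.next_value + demand, self.limit)
        events = list(range(self.next_value, end))
        self.next_value = end
        return events


class SummingConsumer:
    """Pulls events in batches of at most `max_demand` and sums them."""

    def __init__(self, producer, max_demand):
        self.producer = producer
        self.max_demand = max_demand
        self.total = 0
        self.batch_sizes = []

    def run(self):
        while True:
            # The consumer drives the exchange by stating its demand.
            events = self.producer.handle_demand(self.max_demand)
            if not events:
                break
            self.batch_sizes.append(len(events))
            self.total += sum(events)
        return self.total


consumer = SummingConsumer(CounterProducer(limit=10), max_demand=4)
total = consumer.run()
```

Because the consumer never receives more than it asked for, a slow consumer naturally slows the producer down. In a real GenStage pipeline the stages are separate processes and demand travels as messages, which this single-threaded sketch leaves out.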

Basho Riak

  • Bitcask is an Erlang application that provides an API for storing and retrieving key/value data using log-structured hash tables that provide very fast access. The design of Bitcask was inspired, in part, by log-structured filesystems and log file merging.
  • Riak KV (key-value) is a distributed NoSQL database designed to deliver maximum data availability by distributing data across multiple servers. As long as your Riak KV client can reach one Riak server, it should be able to write data. Its default storage backend is Bitcask and it also supports LevelDB and memory backends.
    Riak KV Enterprise includes multi-datacenter cluster replication, which ensures low-latency and robust business continuity.
  • Riak CS (cloud storage) is an object storage system built on top of Riak. It facilitates storing large objects in Riak and presents an S3-compatible interface. It also provides multi-tenancy features such as user accounts, authentication, access control mechanisms, and per account usage reporting.
  • Riak TS (time-series) is a distributed NoSQL key/value store optimized for time series data. With TS, you can associate a number of data points with a specific point in time. TS uses discrete slices of time to co-locate data. For example, humidity and temperature readings from a meter reported during the same slice of time will be stored together on disk.
  • Riak Pipelines is most simply described as “UNIX pipes for Riak.” In much the same way you would pipe the output of one program to another on the command line (e.g. find . -name *.hrl | xargs grep define | uniq | wc -l), riak_pipe allows you to pipe the output of a function on one vnode to the input of a function on another (e.g. kvget | xform | reduce).
  • Riak Ensemble is a consensus library that supports creating multiple consensus groups (ensembles). Each ensemble is a separate Multi-Paxos instance with its own leader, set of members, and state.
  • Basho Data Platform reduces the complexity of integrating and deploying the components of your technology stack, providing Riak KV in-product, NoSQL databases, caching, real-time analytics, and search. These features are required in order to run distributed active workloads across applications; BDP controls the replication and synchronization of data between components while also providing cluster management.
    • Basho Data Platform (BDP) builds on Riak KV (Riak) to support your data-centric services. Ensure your application is highly available and scalable by leveraging BDP features such as:
      • Data replication & synchronization between components
      • Real-time analytics through Apache Spark integration
      • Cluster management
      • Caching with Redis for rapid performance (Enterprise only)
    • Data Platform Core is the additive component to Riak KV, colloquially called “the Service Manager”, that enables Riak to run supervised third-party applications on a Riak+BDP cluster. More generally, it can be thought of as an application executor and watcher that provides service configuration metadata exchange in a distributed, scalable, fault-tolerant manner. The service manager is the foundation of the Basho Data Platform. It provides a means for building a cluster of nodes that can deploy, run, and manage platform services.
    • Cache Proxy (Enterprise-only) service uses Redis and Riak KV to provide pre-sharding and connection aggregation for your data platform cluster, which reduces latency and increases addressable cache memory space with lower cost hardware. Cache proxy has the following components:
      • Pre-sharding
      • Connection Aggregation
      • Command Pipelining
      • Read-through Cache
    • Leader Election (Enterprise-only) service enables Spark clusters to run without a ZooKeeper instance. The Leader Election Service uses a simple, line-based ASCII protocol to interact with Spark. This protocol is incompatible with the ZooKeeper protocol, and requires a BDP-specific patch to Spark for compatibility purposes.
    • Spark Cluster Manager (Enterprise-only) provides all the functionality required for Spark Master high availability without the need to manage yet another software system (ZooKeeper). This reduces the operational complexity of Basho Data Platform (BDP).
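The Bitcask entry above mentions log-structured hash tables and log file merging. A minimal sketch of the idea, in Python for neutrality: every write is appended to a log, and an in-memory hash table maps each key to the location of its latest value. The class `TinyCask` and its method names are hypothetical, and an in-memory buffer stands in for the data file. Real Bitcask records also carry details such as checksums and timestamps, which are omitted here.

```python
import io


class TinyCask:
    """Toy log-structured hash table: append-only log plus in-memory index."""

    def __init__(self):
        self.log = io.BytesIO()  # stands in for the append-only data file
        self.keydir = {}         # key -> (offset, length) of the latest value

    def put(self, key, value):
        # Always append; never overwrite earlier bytes in the log.
        offset = self.log.seek(0, io.SEEK_END)
        self.log.write(value)
        self.keydir[key] = (offset, len(value))

    def get(self, key):
        # One hash lookup, then one positioned read.
        offset, length = self.keydir[key]
        self.log.seek(offset)
        return self.log.read(length)

    def merge(self):
        # Rewrite the log keeping only the live value for each key.
        live = {key: self.get(key) for key in self.keydir}
        self.log = io.BytesIO()
        self.keydir = {}
        for key, value in live.items():
            self.put(key, value)

    def log_size(self):
        return len(self.log.getvalue())


cask = TinyCask()
cask.put("a", b"one")
cask.put("b", b"two")
cask.put("a", b"three")  # supersedes b"one", whose bytes remain in the log
size_before = cask.log_size()
cask.merge()
size_after = cask.log_size()
```

Reads stay fast because the index points straight at the value, and the stale bytes left behind by overwrites are reclaimed only when the log is merged.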

Other

  • Disco is a distributed map-reduce and big-data framework that is similar to Hadoop. It has its own distributed file system called DDFS and can interact with HDFS. It is written in Erlang and exposes a Python interface. I have not found any Erlang API docs, but it should be feasible to create an Elixir library from it.
  • Skel is a streaming process-based skeleton library for Erlang. (Tutorial)
    • Skel is a library produced as part of the Paraphrase project to assist in the introduction of parallelism for Erlang programs. It is a collection of algorithmic skeletons, a structured set of common patterns of parallelism, that may be used and customised for a range of different situations.
    • Workflow Items
      • A Recurring Example
      • Sequential Function
      • Pipe Skeleton
      • Farm Skeleton
      • Ord Skeleton
      • Reduce Skeleton
      • Map Skeleton
      • Feedback Skeleton
  • CouchDB is a document-oriented NoSQL database architecture and is implemented in Erlang; it uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API.
  • Paratize - Elixir library providing some handy parallel processing facilities that support configuring the number of workers and a timeout.
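To give a feel for the algorithmic skeletons listed under Skel, here is a rough Python sketch of three of them: a sequential function, a pipe, and a farm. The function names and signatures are made up for illustration and are not Skel's Erlang API. The sketch also simplifies: pipe stages run one after another rather than concurrently, and the farm returns results in input order.

```python
from concurrent.futures import ThreadPoolExecutor


def seq(fn):
    """Sequential function: apply fn to every input, one at a time."""
    return lambda inputs: [fn(x) for x in inputs]


def pipe(*skeletons):
    """Pipe: the output of each skeleton becomes the input of the next."""
    def run(inputs):
        for skeleton in skeletons:
            inputs = skeleton(inputs)
        return inputs
    return run


def farm(fn, n_workers):
    """Farm: apply fn to the inputs using a pool of n_workers workers."""
    def run(inputs):
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            # pool.map yields results in input order.
            return list(pool.map(fn, inputs))
    return run


# Square every input on four workers, then add one sequentially.
workflow = pipe(farm(lambda x: x * x, n_workers=4), seq(lambda x: x + 1))
result = workflow([1, 2, 3, 4])
```

Because every skeleton here is just a function from a list of inputs to a list of outputs, skeletons nest freely, which is the property that makes the pattern library composable.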

Hadoop

Apache Kafka

Kafka is a high-throughput distributed messaging system.

  • KafkaEx - someone has dutifully made an Elixir library for Kafka using a binary interface

DOA

From the above, you can see why I am interested in learning Clojure to interface with these libraries (I want to stay far, far away from Java, and maybe less far from Scala). But wouldn’t it be cool to have these available in Elixir?!

Most Liked

josevalim

Creator of Elixir

This is a very bad attempt to grab the headlines. :slight_smile: Java is not becoming Erlang, rather Java has a more complete ecosystem than Erlang/Elixir, which is not news for anybody.

And this is not a zero-sum game. I have seen companies running hundreds of “heterogeneous” Erlang nodes by using RabbitMQ or Apache Kafka for the communication. In my opinion, where Erlang shines is exactly in building a homogeneous sub-system that runs on my infrastructure. It is an absolute pleasure to build a distributed pubsub system for websockets communication or something like Phoenix.Presence without a need to bring in third-party tools. Then it can integrate with the rest of the system using Thrift, a message queue, etc.

In the worst scenario, where you cannot rely on the Erlang distribution, then you have to pick up an off the shelf solution, as you would in any other language. And as the Erlang/Elixir communities grow, the quality of the packages that interface with those off-the-shelf brands will continue to improve.

I truly hope this is fading away, because it is easy to point out that parallel, concurrent, and distributed computing have so many branches that no single platform or ecosystem can effectively tackle all options. Yes, you will end up mixing technologies. However, I do hope we will continue to improve on our side of things.

ejc123

I’ve given this question much consideration since my previous reply. Another thing to define is what you mean by ‘Big Data’. 5 years ago, it was all about Hadoop/Map-Reduce on data sets too large for a single machine.
More recently, people have tired of waiting for results and of the tedium of actually programming Map-Reduce, so you get things like Spark, which use a functional style to make the programming sane and offer close-enough-to-real-time processing.

I’m less familiar with Storm, Samza and Flink, but after a quick reading I’d say they all attack stream processing – such as log file analysis or credit fraud detection. Things where a decision is made on dynamic pools of data. I believe the same is possible with Spark, it’s just that these projects have their own specializations.
Just to muddy the waters further, we shouldn’t leave out Machine Learning and Deep Learning from the discussion.

I still have the feeling that Elixir could kill in all these areas. I have messed with Hadoop and Spark, and the machinations they go through to manage nodes are unbelievable.

My plan for today is to watch those talks and do some more research on what is missing from Elixir’s ecosystem. Like you, I suspect it’s popularity.

ssagaert

I think you are sadly mistaken: something like Spark cannot be duplicated in Elixir, easily or painfully.

The reason: Elixir (and Erlang) are simply not suited for scientific or data science programming because they do not have mutable arrays and the corresponding linear algebra, machine learning, and DSP libraries that build on top of them. Without mutable (large and possibly distributed) arrays it’s simply not possible to do any efficient number crunching.

Big Data is not just about accessing lots of data (files on HDFS or in various NoSQL databases, which can be done in Elixir); it really is about what you do with that data once you’ve accessed it, and that’s where data science and machine learning come in.

There was one promising project for Erlang called Matlang, by one Ph.D. guy, but that seems dead. It was based on wrapping the native BLAS matrix library, just like MATLAB, SciPy, etc. do. It showed decent performance. One of the reasons it never became mainstream is that it required a patch to the Erlang VM for efficiently calling BLAS, and that patch was rejected by the Erlang team.

Also, the Erlang VM doesn’t do any JIT compilation to native code like the JVM does, so (scalar) numerical performance and tight loops iterating over collections of numbers are not great (without wrapping a native numerical library).
