Big Data with Elixir

I didn’t say anything about wrapping… I was talking about creating a data pipeline that involves both, with a common data-exchange format, each one doing what it does best. I have used the word ‘interoperability’ many times, but never ‘wrapping’, I think.
What Spark does is only part of what a pipeline does; it’s one stage in the pipeline. We are talking about the possibility of using Python and Elixir to build an ALTERNATIVE pipeline. Please see the posts above.
Just a note: a market with only one option is not really choosing a tool, it’s enduring a tool. Finally, the “there is already another tool” argument doesn’t work in open source, because every tool is shaped to solve a particular problem and never every problem; there is always room for a tool that does the same thing in a different way.

2 Likes

This is an interesting thread. It would be exciting to further explore Machine Learning/Deep Learning and Big Data with Elixir.

I have heard Francesco Cesarini say in a YouTube video that number crunching may not be a strong area for Erlang. @ssagaert 's post is interesting on that account. I guess acknowledging that, and perhaps working on or advocating improvements in that area, would matter if Erlang/Elixir is to be used much more in distributed data computing.

4 Likes

I wonder what avenues have been explored to improve the Erlang VM’s performance for number crunching? I read that it may not have been a priority because of the lack of number crunching in telecom. Does that mean there is room for improvement, or is there an architectural problem that makes fast matrices impractical to implement?

While it would be nice to do everything in Elixir, for now we will have to offload the CPU/memory-intensive calculations to a specialized language such as Julia, Python, R, Mathematica, C, Rust, et al. I recently bumped into the Futhark language, which is “a statically typed, data-parallel, and purely functional array language, and comes with a heavily optimising ahead-of-time compiler that generates GPU code via OpenCL.” I like the idea of staying in a functional language, and of targeting the GPU even more!

As programmers, we should aim to use higher- and higher-level languages because it’s easier to write correct code (Rich Hickey said this was one of his reasons for creating Clojure). The industry average defect rate is “about 15 - 50 errors per 1000 lines of delivered code,” so we want to write fewer lines of code to get more reliable programs. The suckless community takes this idea to the extreme.

“One of my most productive days was throwing away 1,000 lines of code.” — Ken Thompson

We are wondering whether we can write similar Big Data tools in fewer lines of code using Elixir. It seems plausible, because OTP has already solved most of the big and distributed part of the problem.

One point I liked in that blog post was that larger codebases tend to move toward service-oriented or microservice architectures. This makes sense to me as a way of isolating defects and reducing leaky interfaces. So why bet everything on Spark, which has MLlib, Streaming, YARN, GraphX, etc, all in one codebase? It is an impressive tool, and certainly convenient, but that’s not enough reason to not try to do it better, differently.

I wholeheartedly agree!

5 Likes

I read somewhere just a few weeks ago that most libraries used for data science (extreme number crunching) use bindings to basically the same Fortran code. I’m struggling to find the quote, but it was in some discussion about Julia or Rust for data science versus R, Python and Matlab.

I wonder what the story would be for a NumPy in Erlang/Elixir built with NIFs or ports, for instance. There’s the whole issue of reductions, and the possibility that delegating too much work to number-crunching code in other languages hurts reduction counting and Erlang’s scheduler, which is IMO the real reason Phoenix beat everyone else in the Phoenix Showdown benchmarks. It’s something I’d like to experiment with, but I don’t have the time right now.
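To illustrate the port route: the heavy work runs in a separate OS process, so the BEAM schedulers are never blocked and reduction accounting stays honest. A rough sketch (the `./crunch` executable and its line-based protocol are made up for illustration):

```elixir
defmodule Crunch do
  # Send one line of space-separated numbers to an external program and
  # read back one line containing a float result.
  def sum_squares(numbers) do
    port = Port.open({:spawn, "./crunch"}, [:binary, :exit_status])
    Port.command(port, Enum.join(numbers, " ") <> "\n")

    receive do
      {^port, {:data, line}} ->
        Port.close(port)
        line |> String.trim() |> String.to_float()
    after
      5_000 ->
        Port.close(port)
        {:error, :timeout}
    end
  end
end
```

The trade-off versus a NIF is the serialization cost on every call, but a crashing external program can’t take the VM down with it.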

There is NumEr, a collection of Erlang NIF functions for BLAS operations on vectors and matrices with CUDA, but CUDA has overhead and must be used wisely.

5 Likes

I’ve been playing with rustler, which lets you create NIFs in Rust. My intention is to create a “NumEx” for working with real matrices from Elixir. I’m not 100% sold on Rust, since it is also a new language, but I find it much more pleasant than C/C++, and Rustler is the easiest way I’ve found to create a NIF.
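To give an idea, the Elixir side of a Rustler-backed NIF is just a module with stubs that get replaced when the compiled crate loads. A minimal sketch using Rustler’s documented conventions (the app, crate and function names here are hypothetical, not an actual NumEx API):

```elixir
defmodule NumEx.Native do
  # Compiles and loads the Rust crate; otp_app/crate names are made up.
  use Rustler, otp_app: :numex, crate: "numex_native"

  # Stub that raises unless the compiled NIF has been loaded over it.
  def dot(_a, _b), do: :erlang.nif_error(:nif_not_loaded)
end
```

The actual matrix math lives in the Rust crate; Elixir only sees ordinary functions.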

6 Likes

Hello,

I started playing around with building a library that mimics Apache Storm, but in Elixir and much simpler. For those unfamiliar, Apache Storm lets you build complex job topologies/pipelines and run them in a distributed fashion.

Currently, it’s capable of doing things like distributed word count and distributed join. When I say “distributed”, I really mean parallel (all processes run on a single BEAM VM for now).

The README gives a pretty good explanation of what it does and how it works:

Also see the examples:

So I just watched the ElixirConf.EU keynote and I’m a bit unclear on the uses of GenBroker/GenStage. Are they things that would supersede what I’m building, or things that would make what I’m building easier?

The keynote depicts pipelines as linear/straight, whereas my lib lets you build a directed acyclic graph rather than just a straight pipeline. Is my understanding of GenBroker/GenStage correct, or would they let you build complex graphs of pipelines? A minimal sketch of the linear case, as I understood it, is below.
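This is my own minimal sketch of a linear producer/consumer pipeline based on the announced GenStage API (stage names are made up, and it assumes the gen_stage package as a dependency):

```elixir
defmodule Numbers do
  use GenStage

  # Producer: hands out consecutive integers as demand arrives.
  def init(start), do: {:producer, start}

  def handle_demand(demand, counter) do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end

defmodule Printer do
  use GenStage

  # Consumer: subscribes to the named producer and prints each event.
  def init(:ok), do: {:consumer, :ok, subscribe_to: [Numbers]}

  def handle_events(events, _from, state) do
    Enum.each(events, &IO.inspect/1)
    {:noreply, [], state}
  end
end

# {:ok, _} = GenStage.start_link(Numbers, 0, name: Numbers)
# {:ok, _} = GenStage.start_link(Printer, :ok)
```

What I can’t tell from the keynote alone is whether fan-in/fan-out subscriptions are enough to express the kind of DAG topologies my library targets.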

Also, is there any interest in a library/framework like this? The motivation is that Storm, like Spark and Hadoop, is difficult to install and use, and is generally extremely heavy with lots of operational dependencies, etc. I want something extremely simple to use, albeit at the expense of crazy fault-tolerance requirements and maybe performance. You know… “Medium Data” problems… :slight_smile:

6 Likes

What a cool project (and cool name)! It’s exactly what I was thinking about when I posted this topic. The examples are so easy to comprehend because of Elixir’s expressiveness and the clean interfaces.

I’m sure this library would be useful to Alchemists with “Medium Data” problems :slight_smile: I would encourage you to release a Hex package for it. Fault-tolerance and performance can come later as more people depend on the library.


I haven’t watched the keynote about GenBroker/GenStage but I too am curious about their applications in this sense.

1 Like

We all like Elixir, but using it for “Big Data” would be a stretch: the BEAM is not optimized (and realistically cannot be optimized) for such workloads.

2 Likes

There is Apache Storm, Apache Spark, Apache Flink and Apache Apex, and there is also something called Apache Beam, a top layer over the existing platforms so you can easily move a flow between them. In summary, quite a lot to play with :slight_smile:

There will also be a free course: Big Data Analysis with Scala and Spark.

1 Like

On the BEAM not doing JIT to machine code… I thought I saw a roadmap for Erlang from a core team member saying they were working on a JIT using LLVM for a future release. Possibly it was Erlang Factory SF? (on my phone, so not looking too hard)

1 Like

Found it… http://www.erlang-factory.com/sfbay2016/kenneth-lundin

2 Likes

Any thoughts on using Elixir as the glue between Big Data applications written in other languages? As others have pointed out, numerical computation isn’t a strong suit of the EVM. However, I’m of the opinion so far that Elixir may very well be one of the best possible “glue languages” in existence, and if so, it seems like it would be much better to use it to piece together code in other languages more suitable to this domain.

Julia seems to be the most Elixir-like language suited to numerical computation / Big Data, though I’ve never personally used it (yet):
http://julialang.org/

2 Likes

I agree with the forum.

In order to process big amounts of data, you need a way to store it and a way to process it. Storage is solved with HDFS. Note that it’s not the only option; you can also store it with Ceph, GFS, etc. In fact, HDFS started as a way to replicate Google’s GFS.

Now, regarding processing, I still don’t get why Hadoop and Spark are so popular. As many have mentioned, Java is not the most suitable language. It’s fine for easily building a nice web service, but it’s not the best language for intensive data mining. The same applies to Spark and Scala. So why Scala? My guess is that Scala introduces functional programming with Java-like syntax on top of the JVM.

I believe Erlang and Elixir are far superior. You don’t even need Hadoop for serious data mining. You just need a storage engine (MongoDB, Riak, CouchDB or whatever). If you want to operate on files, you just need a distributed FS, and I believe Ceph, for example, is much better than HDFS. Then you implement the entire mining process in Erlang or Elixir, which provides much more than what Hadoop or Spark gives you (a fault-tolerant distributed system with native map-reduce). If you need to do any data-intensive work, you can implement it in C or C++, provide a C interface and call those functions from within Erlang/Elixir.
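To make “native map-reduce” concrete, here is a rough sketch of fanning a word count out over already-connected BEAM nodes (it assumes this module is loaded on every node; illustrative only, not production code):

```elixir
defmodule MiniMapReduce do
  # Map phase: count the words in one chunk of text.
  def map_chunk(chunk) do
    chunk
    |> String.split()
    |> Enum.reduce(%{}, fn word, acc -> Map.update(acc, word, 1, &(&1 + 1)) end)
  end

  # Round-robin the chunks over the connected nodes, then merge the partial counts.
  def word_count(chunks) do
    nodes = [node() | Node.list()]

    chunks
    |> Enum.with_index()
    |> Enum.map(fn {chunk, i} ->
      target = Enum.at(nodes, rem(i, length(nodes)))
      Task.async(fn -> :rpc.call(target, __MODULE__, :map_chunk, [chunk]) end)
    end)
    |> Enum.map(&Task.await/1)
    |> Enum.reduce(%{}, fn partial, acc ->
      Map.merge(acc, partial, fn _word, a, b -> a + b end)
    end)
  end
end
```

Fault tolerance, retries and data locality would still have to be layered on top, but the distribution primitives come with the VM for free.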

This makes much more sense than installing a few different frameworks that emulate what you can do in a few lines of code. Unfortunately, nobody seems to implement big data and data mining this way.

3 Likes

By the way, somebody suggested Julia, Rust or other languages for data-intensive work. No, only C or C++ apply here. The reason is that only these two languages give you full control over memory. You can align the data exactly as you want, allocate or free memory at will, etc. With those tools, and knowing the computer architecture and how computers work, you can increase data density, fully utilize the cache lines and substantially increase speed. Other languages can’t do that, because they are designed to manage the memory layout for you.

A lot of people use Python, R or Mathematica for data mining. The only reason they do is that those languages have libraries implementing most of the standard techniques, from clustering to neural networks. You just write a small script and you can play around with models. This is very good for uncovering data patterns and models. But if you want to apply those models to large data sets in near real time, you will have to move them to C or C++.

2 Likes

You are wrong. There is also ASM, which gives you at least as much control over memory as C, and even finer control over the CPU cycles used, assuming you know your target architecture.

There is also Rust. If you don’t limit yourself to releases but are willing to use nightlies, you can also use extensions that allow fine-grained memory control. You just have to convince the compiler that your definition is memory-safe.

And since Rust uses LLVM under the hood, it is as portable as C.

3 Likes

No, assembly gives you control over the microprocessor: which register you want to load, and so on. It doesn’t let you create complex structures and store them in a very specific order in main memory. Well, you actually can, but it’s so difficult that nobody does it.

Yes, if you want full control over the CPU cycles, you can embed some assembly with SIMD instructions in your C or C++ code, but I don’t see anybody writing large-scale software in assembly.

Regarding Rust: why would I use Rust when I can use C++, which is much more mature? My feeling is that many people and companies try to reinvent something like C++ but in a safe way, and they end up with something inferior.

1 Like

I second that no sane person would write large-scale programs in ASM :wink: But still, if you did, you could design data structures nicely aligned and even bit-packed, just as in C; you just don’t get syntax support for it.


Why would you use Elixir when you can use Erlang, which is much more mature?

Dismissing new languages simply because they are new, and because others with the same or similar goals already exist, is not well thought out.

Regardless of the existence of Rust, I wouldn’t use C++ unless I really had to. That’s a matter of personal preference. The existence of Rust just makes it easier to avoid C++.

And I really like the promises of Rust. You got a memory leak in your program? Blame the Mozilla Foundation and make them fix Rust!

Languages, computer systems and requirements evolve, and as they do, new languages will pop up. Some of them will die, others will remain. And Rust is a good candidate for survival.

2 Likes

I’ve worked with R a bit and most of the commonly used data mining libraries have the computationally intensive parts written in C. This is also a reason for going with R. Personally, I can’t get R to sink into my skull – I can never remember which syntax to use for anything.

1 Like

One opportunity for us is to build a service similar to ZooKeeper. Consider the following diagram, which depicts how Kafka and ZooKeeper interact; it is similar for other systems such as Apache Storm.

Brokers, producers and consumers use Zookeeper to manage and share state

It is a kind of coordination service, and coordination is exactly what OTP should take care of for us. I was thinking about @cjbottaro’s Tempest library: once that library wants to scale across many machines, it will need some coordination between the processes to handle failures, retries and so on.
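As a first cut, that coordination could be nothing more than a GenServer registered cluster-wide via :global. A rough sketch with a made-up API, assuming the nodes are already connected:

```elixir
defmodule Coordinator do
  use GenServer

  # One coordinator for the whole cluster, registered globally.
  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, %{}, name: {:global, __MODULE__})
  end

  # Any node can ask the coordinator to record who is working on a task.
  def assign(task), do: GenServer.call({:global, __MODULE__}, {:assign, task})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:assign, task}, {from, _ref}, state) do
    # Remember which node the calling worker lives on.
    {:reply, :ok, Map.put(state, task, node(from))}
  end
end
```

A real version would need leader election and state hand-off when the coordinator node dies, which is precisely the part ZooKeeper solves and OTP gives us the building blocks for.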

The key here is that the business need will drive the innovation; or, “scratch your own itch.” That is, if we can start to solve real problems by brewing our own Elixir data processing systems, then we have a real project to work from.

It’s also worth mentioning that Google recently open-sourced TensorFlow.

1 Like