Big Data with Elixir

cjbottaro · July 3, 2016, 3:19pm

I’ve been thinking about this a little too and thought of using Etcd instead of Zookeeper. But I’m unsure of how to actually use it. Using it keep track of the state of each processor, for every message they receive would absolutely destroy performance (I think) and also require a pretty beefy Etcd cluster to handle the traffic. I should start a topic in the “Your Libraries and Projects” area to brainstrom…

Also, can you explain the benefit of recreating something like Etcd in Elixir? Is it to remove the overhead of JSON over HTTP for each request? To reduce operational complexity of having to maintain a separate coordination cluster?

Thanks!

andre1sk · July 3, 2016, 4:12pm

The reason that a lot of projects rely on zookeeper is that properly implementing paxos, raft or zab in case of zookeeper is very hard and resource consuming project, fortunately there is fairly mature erlang implementation in
the form of riak_ensemble which could serve as foundation. Now implementing kafka like system that relyes on riak_ensemble would be a really cool project.

uranther · December 27, 2016, 6:05am

Elixir / OTP primitives

GenStage is a specification for exchanging events between producers and consumers. It also provides a mechanism to specify the computational flow between stages
- Experimental.GenStage (docs) - a behaviour for implementing producer and consumer stages
- Experimental.Flow (docs) - Flow allows developers to express computations on collections, similar to the Enum and Stream modules, although computations will be executed in parallel using multiple GenStages
- Experimental.DynamicSupervisor (docs) - a supervisor designed for starting children dynamically. Besides being a replacement for the :simple_one_for_one strategy in the regular Supervisor, a DynamicSupervisor can also be used as a stage consumer, making it straight-forward to spawn a new process for every event in a stage pipeline
GenEvent is a behaviour module for implementing event handling functionality.
The event handling model consists of a generic event manager process with an arbitrary number of event handlers which are added and deleted dynamically.
An event manager implemented using this module will have a standard set of interface functions and include functionality for tracing and error reporting. It will also fit into a supervision tree.

Basho Riak

Bitcask is an Erlang application that provides an API for storing and retrieving key/value data using log-structured hash tables that provide very fast access. The design of Bitcask was inspired, in part, by log-structured filesystems and log file merging.
Riak KV (key-value) is a distributed NoSQL database designed to deliver maximum data availability by distributing data across multiple servers. As long as your Riak KV client can reach one Riak server, it should be able to write data. Its default storage backend is Bitcask and it also supports LevelDB and memory backends.
Riak KV Enterprise includes multi-datacenter cluster replication, which ensures low-latency and robust business continuity.
Riak CS (cloud storage) is an object storage system built on top of Riak. It facilitates storing large objects in Riak and presents an S3-compatible interface. It also provides multi-tenancy features such as user accounts, authentication, access control mechanisms, and per account usage reporting.
Riak TS (time-series) is a distributed NoSQL key/value store optimized for time series data. With TS, you can associate a number of data points with a specific point in time. TS uses discrete slices of time to co-locate data. For example, humidity and temperature readings from a meter reported during the same slice of time will be stored together on disk.
Riak Pipelines is most simply described as “UNIX pipes for Riak.” In much the same way you would pipe the output of one program to another on the command line (e.g. find . -name *.hrl | xargs grep define | uniq | wc -l), riak_pipe allows you to pipe the output of a function on one vnode to the input of a function on another (e.g. kvget | xform | reduce).
Riak Ensemble is a consensus library that supports creating multiple consensus groups (ensembles). Each ensemble is a separate Multi-Paxos instance with its own leader, set of members, and state.
Basho Data Platform reduces the complexity of integrating and deploying the components of your technology stack, providing Riak KV in-product, NoSQL databases, caching, real-time analytics, and search. These features are required in order to run distributed active workloads across applications; BDP controls the replication and synchronization of data between components while also providing cluster management.
- Basho Data Platform (BDP) builds on Riak KV (Riak) to support your data-centric services. Ensure your application is highly available and scalable by leveraging BDP features such as:
  - Data replication & synchronization between components
  - Real-time analytics through Apache Spark integration
  - Cluster management
  - Caching with Redis for rapid performance (Enterprise only)
- Data Platform Core is the additive component to Riak KV, colloquially called “the Service Manager”, that enables Riak to run supervised 3rd party applications on a Riak+BDP cluster. More generally, it can be thought of an application executor and watcher, that provides service configuration meta-data exchange in a distributed, scalable, fault-tolerant method. The service manager is the foundation of the Basho Data Platform. It provides a means for building a cluster of nodes that can deploy, run, and manage platform services.
- Cache Proxy (Enterprise-only) service uses Redis and Riak KV to provide pre-sharding and connection aggregation for your data platform cluster, which reduces latency and increases addressable cache memory space with lower cost hardware. Cache proxy has the following components:
  - Pre-sharding
  - Connection Aggregation
  - Command Pipelining
  - Read-through Cache
- Leader Election (Enterprise-only) service enables Spark clusters to run without a ZooKeeper instance. The Leader Election Service uses a simple, line-based, ascii protocol to interact with Spark. This protocol is incompatible with the ZooKeeper protocol, and requires a BDP-specific patch to Spark for compatibility purposes.
- Spark Cluster Manager (Enterprise-only) provides all the functionality required for Spark Master high availability without the need to manage yet another software system (Zookeeper).This reduces operational complexity of Basho Data Platform (BDP).

Other

Skel is a streaming process-based skeleton library for Erlang. (Tutorial)
- Skel is a library produced as part of the Paraphrase project to assist in the introduction of parallelism for Erlang programs. It is a collection of algorithmic skeletons, a structured set of common patterns of parallelism, that may be used and customised for a range of different situations.
- Workflow Items
  - A Recurring Example
  - Sequential Function
  - Pipe Skeleton
  - Farm Skeleton
  - Ord Skeleton
  - Reduce Skeleton
  - Map Skeleton
  - Feedback Skeleton
CouchDB is a document-oriented NoSQL database architecture and is implemented in Erlang; it uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API.
Paratize - Elixir library providing some handy parallel processing facilities that supports configuring number of workers and timeout.

jwarlander · December 27, 2016, 11:09am

I’ll just put this out there:

dmoc · February 17, 2017, 4:49pm

Given the inherent size of Big Data operations I think it will be very difficult to get organisations to adopt alternative technologies without an extremely good business case. Still, interesting thread and thanks for the effort.

Some other sw you might find interesting, maybe for “Not So Big-Data” ™ projects, Apache NiFi and Apache Zeppelin.

I’m only half joking with “Not So Big-Data” tag because all the focus (generally and for $$$ obviously) is on the enterprise level. “Big Data” used to be a relative term and I think there’s still a lot of scope for providing tooling for the segment between Desktop…SME, without them having to jump straight to enterprise level solutions.

Question (from someone with little practical FP): I see “immutability” cropping up a lot regarding raw data processing speed and discussions typical end with fact the Beam doesn’t and possibly never will support mutable buffers. As someone pointed out, pick any language and it’s a good bet that it relies on some well established libs. So the question: what’s wrong with the idea of splitting a pipeline into immutable/mutable stages, implementing the latter in whatever is best/convenient and hooking everything up via unix pipes? Is this method slow or inflexible? Are there any benchmarks, guidelines, etc, for alternative methods?

(edit) PS: I understand there are a few scenarios where you would want to edit data in-place but, guessing, there are a lot more where the transformed data is going to copied elsewhere eventually.

dmitriid · February 21, 2017, 1:31pm

I’ve mentioned this before: https://medium.com/@dmitriid/erlang-is-dead-long-live-e-885ccbcbc01f#.fyc9oiwv2

But I’ll mention this again: Java is becoming Erlang much faster than Erlang is becoming Java.

The following are illusions of grandeur that have no place in the real world:

“Erlang/OTP being more well-suited for the task”
“I still have the feeling that Elixir could kill in all these areas”
“it seems that way because OTP has mostly solved the big and distributed part of the problem.”
“Java is not the most suitable language”
“Erlang or Elixir, which provides much more than what Hadoop of Spark gives you (fault tolerance distributed system with native map reduce)”
etc.

It’s one thing two spawn a million green threads on your laptop. It’a different thing running a couple of thousand of distributed nodes with multiple tasks per node, spread across several data centres.

Guess what, Erlang/OTP is as suited for this task as Java: it’s not. Erlang was never designed to handle thousands of nodes in heterogenous environments. There are multiple known limitations of its distribution protocol. There’s a reason RELEASE exists.

However, Java provides established libraries and approaches that solve these problems. It might not be easy, but definitely not writing everything from scratch (or relying on Basho’s riak_* libraries).

Distribution? Distributed storage? Distributed databases? Streaming? Distributed map-reduce? Real-time analytics on streams?

All these problems have been solved multiple times over in other languages (primarily Java). Or they have been solved by AWS (stream your data into Kinesis/Kinesis Firehose, analyse in real time with Kinesis Analytics, dump into RDS for warehousing). Aaaand to use AWS you’ll undoubtedly use Python or Java, not Erlang.

There are multiple reasons for that, obviously. The main is, definitely, that Java (Python or whatever else) are just so much popular than Erlang.

The other important on is: Erlang has too long prided itself on being oh so much superior to other languages in anything that comes to parallel and distributed computing. So long that it completely missed other languages marching on and improving in the same areas. If not on a language level, then on library and infrastructure level.

So, the reality of today is the following. Unless someone is smart enough and has enough resources to build a library/infrastructure on par with other languages (a new Kafka/Hadoop/Kubernetes/Cassandra/Spark/Fink/…the list just goes on and on and on, doesn’t it?..), the only way to deal with BigData in Erlang/Elixir is to write a proper library to interface with any of these systems. Anything else is just illusions of grandeur.

josevalim · February 21, 2017, 2:55pm

This is a very bad attempt to grab the headlines. Java is not becoming Erlang, rather Java has a more complete ecosystem than Erlang/Elixir, which is not news for anybody.

And this is not a zero sum game. I have seen companies running hundreds of “heterogenous” Erlang nodes by using RabbitMQ or Apache Kafka for the communication. In my opinion, the part that Erlang shines is exactly in building a homogenous sub-system that runs on my infrastructure. It is an absolute pleasure to build a distributed pubsub system for websockets communication or something like Phoenix.Presence without a need to bring in third party tools. Then it can integrate with the rest of the system using thrift, a message queue, etc.

In the worst scenario, where you cannot rely on the Erlang distribution, then you have to pick up an off the shelf solution, as you would in any other language. And as the Erlang/Elixir communities grow, the quality of the packages that interface with those off-the-shelf brands will continue to improve.

I truly hope this is fading away because it is easy to point out that parallel, concurrency and distributed computing have so many branches that there will be no single platform or ecosystem that can effectively tackle all options. Yes, you will end-up mixing technologies. However, I do hope we will continue to improve on our side of things.

bbense · February 21, 2017, 6:27pm

FWIW, Mesosphere has done a lot of work on the Erlang at scale problem.

https://github.com/dcos/lashup

In my experience, one of the strengths of Erlang/Elixir is that you can turn “big data” problems into medium data ones that actually work in a much smaller hardware footprint.

ssagaert · February 22, 2017, 11:52am

It’s not just Java. Scala is pretty popular within the big data/data science niche. Especially if you work with Spark since it’s written in Scala and hence the Scala API the actual ‘native’ Spark API.

There’s also Clojure that has some popularity within the data science community.

Unsurprisingly both these languages support functional programming.

krapans · February 23, 2017, 9:11pm

Have someone of you used Elixir for data mining? I have a case, where I have lots of data logs, lots of user input data, how they reacted to different user scenarios and would like to start playing around with that and get out maybe some useful insights using data mining.

petermorrow · May 3, 2017, 6:46am

What about a project that isn’t quite a “Big Data” project? Rather a Medium Data project - if you will.

Like an ETL / Data Warehousing project that required data to be collected from a dozen data sources that in total are not quite in PB territory. With Elixir it would be fairly straightforward to write an app that queries and processes the data sources concurrently. There would certainly be some number crunching involved but nothing at a scale that justifies building a Hadoop cluster.

Would the issues discussed on here about Elixir/Erlang’s weak spots in the number crunching arena be problematic at this “Medium Data” scale?

synthslave · January 25, 2018, 8:18pm

Sorry to bump this older thread but I couldn’t find any recent threads on big data/data science/machine learning. I’d like to play around with Spark/Hadoop with website I’m making with Elixir/Phoenix to compare against PostgreSQL. Is anybody else here doing this and/or experienced any problems/issues?

I’ll share my results but warning, I’m a noob to it all.

Thanks

greenz1 · February 5, 2018, 3:37pm

Boldly go where no one has ever gone before. ~Star Trek

This is one of those not-so-touched areas and your contribution will be very valuable to the community. I would be watching this thread for getting details about your experiment.

synthslave · February 12, 2018, 5:17pm

@greenz1 - Here’s what I’m considering doing:

Per the very helpful suggestion of @OvermindDL1 , make use of PostgreSQL’s GIS extension to collect city data from say, Manhattan, via their API endpoint URL, then use FDW foreign data wrapper to allow Spark(Python) to access that data to do it’s “black magic” on it to give me useful analysis (my first time ever doing big data/data science so hopefully Udemy course is good) Also per a suggestion from GIS forum, Python apparently has good analysis libraries like scipy, matplotlib, or scikit-learn to look for patterns so am thinking that is another reason to utilize Spark/Python combination to work with FDW on PostgreSQL

I have no idea if this will work but will give it a shot

sb3rg · August 16, 2019, 8:03pm

Definitely interested in this. It seems like Elixir/Erlang would be the perfect fit for the Lambda Architecture.

mkunikow · August 24, 2019, 6:14am

Lambda Architecture … is dead.
Now is only streaming architecture.

mattneel · September 3, 2019, 10:10pm

Use Bondy ( and optionally Apache Kafka or some other message streaming service) to tie all your microservices together so if you need to do basic analytics or data science in other languages, you can. Use the right tool for the job at the time. Bondy makes it relatively simple to go back and couple in a new implementation.

MeerKatDev · December 23, 2019, 12:22pm

I would guess they were talking about the BLAS libraries, which are the standard for C-based stuff.

onkara · August 19, 2020, 9:26pm

Being complete n00b I am trying to draw some take home conclusions from this thread …

Erlang/OTP can be used as glue for coordination/synchronization but not for number crunching.
Use Ports/NIFs to interface with number crunching tools written in C/C++/Rust/Julia

Please correct me if I am wrong.

IRLeif · January 3, 2021, 11:40am

Yes, more or less. That is my understanding as well.

I have provided some more details on my thoughts in this thread. Maybe it’s useful.