How to use dataframes in Elixir for pandas-style work?

I looked at the Elixir dataframe package at https://github.com/jordipolo/dataframe

The data_frame.ex file (https://github.com/JordiPolo/dataframe/blob/master/lib/data_frame.ex) on the GitHub page (for version 0.1.0) does not match the file actually downloaded as part of deps, so I am not able to get the examples to work. What is the correct way to flag this to the author?

Also, is there any other package or way to get pandas/NumPy-style work done in Elixir? I am a Ruby programmer, seduced by Elixir’s multi-processor and transparent networking possibilities (referring to the rooms in the portal example).

Thanks!

What is the correct way to flag this to the author?

Maybe try opening an issue on GitHub.

But I don’t think the BEAM (the VM Elixir runs on) is a good fit for pandas/NumPy-style computations; it just wasn’t made for that.


With blockchain technology on the rise and big data already being what it is, numerical crunching using native libraries should be part of Elixir’s toolkit. I want such a toolkit so that what I have in mind for the long term becomes feasible without having to invest in other languages.

@josevalim - I am sure you either have an architecture in mind to execute this, or robust reasons why this is the “wrong thing” for Elixir. Either way I would really appreciate your input. It looks like the individual parts are available, but not the sum of the parts. Is numerical crunching in Elixir missing some effort (development + architecture + supervision) to fit things together? Or are there certain aspects of the Erlang VM that make this a fool’s errand?

Below, the first list covers capabilities that already exist; the second covers capabilities that, from what I read on the forums, Elixir/Erlang still lacks in order to be a contender in numerical processing.

Capabilities that exist:
Input/output of data from disparate non-database sources

Capabilities that do not exist:
A native NumPy/SciPy equivalent

Other references from this forum:


I created an issue here. I’m also working on a plotting library for ML, using Dygraph and Electron. If any of you are interested, please PM me. I would really like to hear your suggestions.

Have you looked at Matrex?


Hello, sorry, I accidentally deleted my comment; here it is again:

Recently I have been working with Erlang integrated with Python, using pandas, seaborn, and other tools. You can review my work here: https://github.com/zgbjgg/jun. It’s not finished, but it’s working. You can also check out https://github.com/zgbjgg/kaa and https://github.com/zgbjgg/oox.

I hope this can be helpful!

Thanks!


To clarify one thing: the BEAM is not good for this type of work for the same reasons that Python isn’t. But, like Python, we can certainly make use of native libraries to overcome that deficiency. Also, it may not always be inefficient to do this natively on the BEAM; if the BEAM implemented a JIT, for example, that is how the JVM is able to be used for this type of work. There’s also the question of how much computation you need to do: small data sets may be fine to process natively on the BEAM.
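As a small illustration of that last point, simple vector math in plain Elixir is perfectly serviceable for small data. The SmallVec module below is a made-up sketch, not any library’s API:

```elixir
# Plain-Elixir vector math: no native deps, fine for small data sets.
defmodule SmallVec do
  # Dot product of two equal-length lists of numbers.
  def dot(a, b) do
    Enum.zip(a, b)
    |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
  end

  # Scale every element by a constant.
  def scale(v, s), do: Enum.map(v, &(&1 * s))
end
```

For example, `SmallVec.dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])` returns `32.0`; it only becomes a problem when the lists grow to millions of elements.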


I’m interested

I would’ve agreed more with your statement six months ago. Since then I’ve been doing more ML work on an IoT device and it’s been pretty nice. Streams, processes, and now LiveView make building real-time continuous data processing with a web user interface a charm.

Maybe not for dealing with huge datasets, but a few dozen gigabytes of data puts it well within the realm of many mid-sized or research-lab computation sizes.

I’ve contributed a bit to Matrex and highly recommend it. Of course, I still do my optimization and GPU work in Julia, but the two share a significant subset of syntax; e.g., they both provide |> and do/end blocks, which really helps lower my cognitive syntax overhead. Though I keep adding do’s to my Julia functions. :slight_smile:

For Elixir-native processing, HiPE is still around, plus some newer JIT projects. I also think macro-based syntax extensibility gives Elixir a pretty big head start for future optimization bridges. One example I thought of would be using macros to serialize a chain of math function calls to a fast C/C++ math engine as an extension to Matrex. Another would be a small Elixir AST DSL for a WASM-based computational engine. Not as fast as a pure JIT, but it could be safe, portable, and still fast.
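To make the macro idea concrete, here is a minimal, hypothetical sketch: a macro that walks a `|>` pipeline with `Macro.unpipe/1` and serializes it into a data structure that a native engine could execute in one call. The `add`/`mul` names are placeholders, and nothing here talks to C; it only shows the AST side:

```elixir
defmodule MathPlan do
  # Turn `input |> f(a) |> g(b)` into `{input, [f: [a], g: [b]]}`
  # at compile time, instead of calling f and g on the BEAM.
  defmacro plan(pipeline) do
    [{input, 0} | calls] = Macro.unpipe(pipeline)

    steps =
      for {{name, _meta, args}, 0} <- calls do
        {name, args}
      end

    quote do
      {unquote(input), unquote(Macro.escape(steps))}
    end
  end
end
```

`MathPlan.plan(10 |> add(2) |> mul(3))` expands to `{10, [add: [2], mul: [3]]}`; a real bridge would hand that plan to a NIF instead of returning it.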


I think we’re actually in agreement. Matrex is exactly the type of native library I was talking about. It’s a wrapper around CBLAS, which is the same thing NumPy uses.

When I wrote before about performance, I mentioned that a JIT could optimize away the problems. I actually spoke to one of the core contributors to Erlang, and he pointed out that part of the optimization for these operations requires in-place mutation. So, with the native immutable data structures, you can certainly do these operations; they’ll just be less efficient than with mutable ones. There’s an interesting project called Matrax, inspired by Matrex, that uses Erlang’s atomics to implement a matrix library. It’s interesting because it relies only on things native to Erlang.
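For reference, the core of that atomics-based idea fits in a few lines. This is my own illustration of the technique, not Matrax’s actual API; `:atomics` holds 64-bit integers and ships with OTP 21+:

```elixir
defmodule AtomicMatrix do
  # A rows x cols integer matrix backed by one shared, mutable
  # :atomics array; reads and writes are atomic, with no copying.
  defstruct [:ref, :rows, :cols]

  def new(rows, cols) do
    %__MODULE__{ref: :atomics.new(rows * cols, signed: true), rows: rows, cols: cols}
  end

  # :atomics indices are 1-based.
  defp index(%{cols: cols}, row, col), do: (row - 1) * cols + col

  def put(m, row, col, value), do: :atomics.put(m.ref, index(m, row, col), value)
  def get(m, row, col), do: :atomics.get(m.ref, index(m, row, col))
  def add_get(m, row, col, incr), do: :atomics.add_get(m.ref, index(m, row, col), incr)
end
```

Any process holding the struct sees the same cells: a `put/4` in one process is immediately visible to a `get/3` in another, which is exactly what immutable BEAM terms can’t give you.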


Ah, gotcha. Yes, pure Elixir works for some things, but large-scale number processing wouldn’t be competitive even with JIT’ing. Still, there’s some need to optimize the Elixir side of things. The same applies to NumPy: there are still lots of cases where you need to run a custom loop or function evaluation over a given matrix. That’s where I think Elixir AST/macros could come in pretty handy. The Matrax library is an interesting take, though not really that different from Matrex-style NIFs in the end. Aside from that, I do find the lack of NaNs (or Infs) on the BEAM annoying when dealing with number processing, though some custom math operators could deal with that to a degree using atoms.
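A quick sketch of that atom workaround (SafeMath is a made-up module name): BEAM floats raise ArithmeticError on division by zero instead of yielding NaN, so the result is tagged with the atom :nan and propagated through subsequent operations:

```elixir
defmodule SafeMath do
  # Division that yields :nan instead of raising on a zero divisor.
  def divide(_a, b) when b == 0, do: :nan
  def divide(a, b), do: a / b

  # :nan propagates through addition, NaN-style.
  def add(:nan, _b), do: :nan
  def add(_a, :nan), do: :nan
  def add(a, b), do: a + b
end
```

E.g. `SafeMath.add(SafeMath.divide(1.0, 0.0), 2.0)` returns `:nan` rather than crashing the process.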

Regarding a macro<->AST bridge: from what I understand, the TensorFlow library uses the Python code to build up a form of AST, which is then compiled and run by the C++ side of things. It sounds like TensorFlow is moving to Swift partly to gain a better ability to extract the computation graph from the code.


Thanks for mentioning Matrax.
It’s actually better suited for workloads where you need a shared, atomically mutable data structure accessed by multiple processes; its get/2 and put/3 operations are fast.
Say you generate a big 2D map and want to run some constraint-solving algorithms on it in parallel while continuously updating it.
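That concurrent-update pattern is easy to demonstrate with plain :atomics and tasks. Here 100 concurrent tasks bump four shared counters with no locks or message passing (a generic sketch, not Matrax’s API):

```elixir
# One shared, mutable array of four signed 64-bit counters (OTP 21+).
ref = :atomics.new(4, signed: true)

# 100 concurrent tasks; each :atomics.add/3 call is atomic, so
# concurrent increments never lose updates.
1..100
|> Task.async_stream(fn i -> :atomics.add(ref, rem(i, 4) + 1, 1) end)
|> Stream.run()

# Every cell has received exactly 25 increments.
:atomics.get(ref, 1)
# => 25
```

Doing the same with an immutable map would force the updates through a single serializing process; here the cells are updated in place.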

For matrix multiplication and similar workloads, where you continuously create and garbage-collect matrices, it is not well suited. But for those there is Matrex.

I would also love to see further developments in data science in Elixir. For the things I had to do at my previous workplace, like cosine-similarity calculation, Elixir did the job well.
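For flavor, cosine similarity really is only a few lines of dependency-free Elixir (the Cosine module is my own sketch):

```elixir
defmodule Cosine do
  # cos(theta) = (a . b) / (|a| * |b|) for two equal-length vectors.
  def similarity(a, b) do
    dot = Enum.zip(a, b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()
    dot / (norm(a) * norm(b))
  end

  # Euclidean length of a vector.
  defp norm(v), do: :math.sqrt(Enum.sum(Enum.map(v, &(&1 * &1))))
end
```

`Cosine.similarity([1.0, 0.0], [0.0, 1.0])` gives `0.0` (orthogonal vectors), while vectors pointing the same way give `1.0`.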

Hey, I also <3 Julia. I definitely don’t use it very much anymore (in one of my projects I used Julia macros to make functions that auto-generated equivalent Verilog code), but when I dip in I also keep writing do’s in my functions. Would an Elixir/Julia bridge be useful? I think it would be awesome to have Elixir spin up a Julia worker node with an async threaded message handler ready to receive Erlang terms (and, if I get ambitious, lambdas too), but sadly Julia’s worker mode is not well documented.