Share an Apache Arrow memory buffer between Elixir and Rust

Apache Arrow is a columnar format for data in memory, on the wire and on disk, and it is being adopted by a number of high-profile projects. You can loosely think of it as the successor to the widely used Pandas dataframe library in the Python data science ecosystem, with the big advantage of being cross-language.

I want to finally take a stab at bringing Arrow to Elixir natively, after having toyed with doing so a few years ago.

The problem is that Apache Arrow’s big selling point, apart from its columnar design, is its “zero copy” format: it doesn’t need to be serialised / deserialised to work across languages. This makes it extremely efficient to memory map files and, indeed, to share buffers between processes. The latter, of course, is a big no-no in Elixir unless you use ETS.

Now, Elixir is bad at maths, but Rust is excellent at maths, and Apache Arrow’s Rust implementation has a huge number of compute kernels which I would like to access. Given that Arrow’s columnar format presupposes possibly gigantic data sizes, serialization between Rust and Elixir, even without encoding, is potentially going to be very slow and inefficient. So I’d like to use the Rust compute kernels “in place” on the Elixir process buffer, which means the memory would somehow have to be shared between Rust and Elixir.

So before I embark on this project (and I already have someone who’s willing to help), I’d like to know whether Elixir is capable, somehow, of accessing an externally defined buffer: either via a memory address through Rustler, for example (i.e., can a Rust NIF share a buffer with an Elixir process, and if so, is this functionality exposed in Rustler?), or (second best) by memory mapping a file.
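To make the memory-mapping option concrete, here is a minimal sketch in C (used only because the POSIX calls are C APIs). It assumes a file that holds nothing but a raw, contiguous column of 64-bit integers, which is far simpler than a real Arrow IPC file with its schema and metadata. The function names `write_column` and `sum_column_mapped` are made up for illustration. The point is that the consumer computes directly on the mapped pages instead of reading and decoding into a second buffer.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write `n` int64 values to `path` as one raw,
   contiguous column. Returns 0 on success, -1 on failure. */
int write_column(const char *path, const int64_t *values, size_t n) {
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0) return -1;
    size_t bytes = n * sizeof(int64_t);
    ssize_t written = write(fd, values, bytes);
    close(fd);
    return (written == (ssize_t)bytes) ? 0 : -1;
}

/* Hypothetical helper: map the file read-only and sum the column
   straight from the mapping. There is no read() into a second buffer
   and no parsing step. Returns 0 on success, -1 on failure. */
int sum_column_mapped(const char *path, int64_t *out_sum) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    size_t bytes = (size_t)st.st_size;

    void *base = mmap(NULL, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (base == MAP_FAILED) return -1;

    /* mmap returns a page-aligned address, so it is aligned for int64_t. */
    const int64_t *col = (const int64_t *)base;
    size_t n = bytes / sizeof(int64_t);

    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += col[i];

    munmap(base, bytes);
    *out_sum = sum;
    return 0;
}
```

A writer would call `write_column` once, and any number of readers could then call `sum_column_mapped` on the same path; the operating system shares the underlying pages between them.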

If neither, what are my options for interoperating, fast, with external processes on large amounts of data (likely in-memory, but possibly memory-mapped from disk)?

Please note that I understand the risks of shared memory but I need performance.

For context, the wider project I will use this for is massively parallel data ingest from a large number of data APIs, all orchestrated by Elixir with its excellent capabilities and failover protections in this domain, into Arrow columns, with analytics performed externally in real time by something performant. Basically, I want the whole thing orchestrated by Elixir, with compute “help” from Rust. I further plan to integrate this into the federated Matrix protocol to provide a decentralised financial/crypto/other data infrastructure that competes, as open source, with the horribly expensive centralised providers.

Redis or other third-party brokers are not an option, for the same copy-overhead reason.


Yes, via the NIF API (see the Erlang erl_nif documentation). In a nutshell, you define the buffer as a NIF resource, and then you can retrieve its underlying contents as a binary. The resource will only be deallocated once the binaries have been garbage collected.
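The lifetime rule in that last sentence can be modelled in plain C. To be clear, the sketch below is not the erl_nif API: in a real NIF the corresponding calls would be `enif_alloc_resource`, `enif_make_resource_binary` and `enif_release_resource`, and the VM does the counting for you. Here the made-up `shared_buf` stands in for the resource, the made-up `buf_view` stands in for a binary term, and `freed_flag` exists only so the behaviour can be observed.

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for a NIF resource: owns the memory and counts its users. */
typedef struct {
    uint8_t *data;
    size_t   size;
    int      refs;        /* the creator plus every live view */
    int     *freed_flag;  /* set to 1 when the memory is released */
} shared_buf;

/* Stand-in for a binary term: points into the buffer, no copy made. */
typedef struct {
    shared_buf    *owner;
    const uint8_t *bytes;
    size_t         size;
} buf_view;

/* Allocate a zeroed buffer with one reference held by the creator. */
shared_buf *buf_alloc(size_t size, int *freed_flag) {
    shared_buf *b = malloc(sizeof *b);
    if (!b) return NULL;
    b->data = calloc(size, 1);
    if (!b->data) {
        free(b);
        return NULL;
    }
    b->size = size;
    b->refs = 1;
    b->freed_flag = freed_flag;
    return b;
}

/* Drop one reference; the memory is freed only when none remain. */
void buf_release(shared_buf *b) {
    if (--b->refs == 0) {
        *b->freed_flag = 1;
        free(b->data);
        free(b);
    }
}

/* Hand out a zero-copy view that keeps the buffer alive. */
buf_view buf_make_view(shared_buf *b) {
    b->refs++;
    buf_view v = { b, b->data, b->size };
    return v;
}

/* Models the garbage collector reclaiming the binary. */
void view_drop(buf_view *v) {
    buf_release(v->owner);
    v->owner = NULL;
    v->bytes = NULL;
    v->size = 0;
}
```

In this model the creator can call `buf_release` right after handing out a view, and the bytes stay readable until `view_drop` runs, which mirrors a NIF releasing its own reference while Elixir code still holds the binary.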

Also look into Explorer (elixir-explorer/explorer on GitHub), which is built on top of Polars, which is in turn built on top of Arrow. It provides series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir.


It basically looks like what I’m planning is already implemented by Explorer? Does Explorer natively understand the Arrow columnar format in Elixir code, or does it rely entirely on the backends for that?

It relies on backends at the moment.

There is a project within Membrane called shmex, which allows shared memory between processes, and another called Unifex, which facilitates the creation of C-based NIFs. For Rust-based NIFs, use Rustler.

Further information on shared memory:
https://man7.org/linux/man-pages/man7/shm_overview.7.html

I would expect that you could either

  • In the case of NIFs, wrap the allocated chunk of memory in an opaque struct. The Elixir side only holds the reference, which is unpacked and accessed properly on the NIF side (this may let you avoid shmem altogether), or

  • In the case of a C node or external process, use shared memory: establish a signalling layer and then use the shared memory like a frame buffer

It would take a lot of soak testing but should give you the performance you require. Bottom line: use shared memory APIs to facilitate exchange of data between OS processes and package it up for easy signalling with Elixir.
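As a rough illustration of the second option, here is a minimal C sketch of a producer and a consumer exchanging a "frame" through shared memory, with a pipe as the signalling layer. It makes some simplifying assumptions: it uses an anonymous shared mapping inherited across `fork()` instead of a named `shm_open` segment, so that it is self-contained; unrelated OS processes would need the named variant described in the man page above. The name `produce_and_consume` and the frame size are made up, and in an Elixir setup the signal would more likely travel over a Port or a C node message than over a raw pipe.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on glibc in strict modes */
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define FRAME_LEN 1024 /* assumed frame size, in int32 elements */

/* The child process (producer) fills the shared frame and then writes
   one byte to a pipe. The parent (consumer) blocks on the pipe and then
   sums the frame in place. Only the one-byte signal crosses the pipe;
   the data itself is never copied between the processes.
   Returns the sum, or -1 on failure. */
int64_t produce_and_consume(int32_t fill_value) {
    size_t bytes = FRAME_LEN * sizeof(int32_t);
    int32_t *frame = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (frame == MAP_FAILED) return -1;

    int sig[2];
    if (pipe(sig) != 0) {
        munmap(frame, bytes);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(sig[0]);
        close(sig[1]);
        munmap(frame, bytes);
        return -1;
    }

    if (pid == 0) {
        /* Producer: write the frame, then signal "frame ready". */
        close(sig[0]);
        for (size_t i = 0; i < FRAME_LEN; i++) frame[i] = fill_value;
        uint8_t ready = 1;
        ssize_t w = write(sig[1], &ready, 1);
        _exit(w == 1 ? 0 : 1);
    }

    /* Consumer: wait for the signal, then read the frame in place. */
    close(sig[1]);
    uint8_t ready = 0;
    int64_t sum = -1;
    if (read(sig[0], &ready, 1) == 1 && ready == 1) {
        sum = 0;
        for (size_t i = 0; i < FRAME_LEN; i++) sum += frame[i];
    }

    close(sig[0]);
    waitpid(pid, NULL, 0);
    munmap(frame, bytes);
    return sum;
}
```

A real frame buffer would add a ring of frames and a way for the consumer to hand frames back, which is where the soak testing mentioned above comes in.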

I am not fluent in Rust, but I have noted the existence of shmem and shmem_ipc.

Good luck :>


Awesome set of links, especially the POSIX stuff. Very handy, on it.
