Apache Arrow is a columnar disk / wire / memory format which is being adopted by a bunch of high-profile projects. It’s basically the successor to the memory model behind Pandas, the widely used dataframe library in the Python data science ecosystem, with the big advantage of being cross-language.
I want to finally take a stab at bringing Arrow to Elixir natively, after having toyed with doing so a few years ago.
The problem is that Apache Arrow’s big selling point, apart from its columnar design, is its “zero copy” format: the data doesn’t need to be serialised / deserialised to work across languages. This makes it extremely efficient to memory map files and, indeed, to share buffers between processes. The latter, of course, is a big no-no in Elixir unless you use ETS.
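To make the “zero copy” point concrete, here is a minimal sketch in Python (standard library only, not Arrow itself). The raw int64 layout and the file name column.bin are made up for illustration; a real Arrow file also carries schema metadata and padding. Two mappings of the same file stand in for two processes sharing one buffer.

```python
import mmap
import os
import struct
import tempfile

# A tiny "column" of four native-endian int64 values (hypothetical layout).
values = [10, 20, 30, 40]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.bin")  # illustrative file name
    with open(path, "wb") as f:
        f.write(struct.pack("4q", *values))

    with open(path, "r+b") as f:
        # Two independent shared mappings of the same file.
        writer = mmap.mmap(f.fileno(), 0)
        reader = mmap.mmap(f.fileno(), 0)

        # cast("q") reinterprets the mapped bytes as int64 values
        # without copying or deserialising them.
        raw = memoryview(reader)
        column = raw.cast("q")
        total_before = sum(column)

        # A write through one mapping is visible through the other,
        # because both are backed by the same pages.
        struct.pack_into("q", writer, 0, 11)
        total_after = sum(column)

        # Release the views before closing the mappings.
        column.release()
        raw.release()
        reader.close()
        writer.close()

print(total_before, total_after)
```

The reader never parses or copies the column; it just reinterprets the bytes already in memory, which is the property I’d like to keep between Rust and Elixir.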
Now, Elixir is slow at number crunching, but Rust is excellent at it, and Apache Arrow’s Rust implementation has a huge number of compute kernels which I would like to access. Given that Arrow’s columnar format presupposes possibly gigantic data sizes, copying data between Rust and Elixir, even without any encoding step, is potentially going to be very slow and inefficient. So I’d like to run the Rust compute kernels “in place” on the buffer the Elixir process holds, which means the memory would somehow need to be shared between Rust and Elixir.
So before I embark on this project (and I already have someone who’s willing to help), I’d like to know whether Elixir is capable, somehow, of accessing an externally defined buffer. Ideally this would be via a memory address passed through Rustler (i.e., can a Rust NIF share a buffer with an Elixir process, and if so, is this functionality exposed in Rustler?), or, second best, by memory mapping a file.
If neither, what are my options for inter-operating, fast, with external processes on large amounts of data (likely in-memory, but possibly memory-mapped from disk)?
Please note that I understand the risks of shared memory but I need performance.
For context, the wider project I will use this for is massively parallel data ingest from a large number of data APIs into Arrow columns, all orchestrated by Elixir with its excellent capabilities and failover protections in this domain, with analytics performed externally in real time by something performant. Basically, I want the whole thing orchestrated by Elixir, with compute “help” from Rust. I further plan to integrate this into the federated Matrix protocol to provide a decentralised financial/crypto/other data infrastructure, an open-source competitor to the horribly expensive centralised providers.
Redis or other third-party brokers are not an option, for the same copy-overhead reason.