Apache Arrow metadata deserialization

I would like to bring Apache Arrow to Elixir and plan to open source this project once I make some meaningful progress.

I have been following this section in the specification and using binary pattern matching to identify messages within an Arrow stream. A message’s metadata is serialized with FlatBuffers and contains, among other things, the size of the message body, which is needed to determine where the message ends in the stream. Unfortunately, there is no official implementation of FlatBuffers for Elixir.
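For context, the framing I am matching on looks roughly like the sketch below. It assumes the current encapsulated message format (a `0xFFFFFFFF` continuation marker, then a little-endian int32 metadata length, then the FlatBuffers metadata, then the body) and ignores the older format that has no continuation marker. The module and function names are placeholders of mine, not part of any library.

```elixir
defmodule ArrowFraming do
  # Hypothetical helper: peels one message header off the front of an Arrow stream.

  # End of stream: continuation marker followed by a zero metadata length.
  def next_message(<<0xFFFFFFFF::little-32, 0::little-signed-32, _rest::binary>>), do: :eos

  # A writer may also end the stream simply by closing it.
  def next_message(<<>>), do: :eos

  # Regular message: the metadata length already includes any padding.
  def next_message(
        <<0xFFFFFFFF::little-32, meta_size::little-signed-32,
          metadata::binary-size(meta_size), rest::binary>>
      )
      when meta_size > 0 do
    # `rest` starts at the message body; skipping the body still
    # requires bodyLength, which lives inside the FlatBuffers metadata.
    {:ok, metadata, rest}
  end

  def next_message(_other), do: {:error, :incomplete_or_invalid}
end
```

This gets me the raw metadata bytes, but not the body length stored inside them, which is why I am stuck without a FlatBuffers reader.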

I found an unofficial FlatBuffers library, but it doesn’t support the struct data type used in Arrow’s schema files, and the project is fairly inactive (the last commit was in 2017).

I had the idea of calling the FlatBuffers compiler from Elixir to return JSON, which I could then parse to retrieve the info I need, but this is a hacky solution and I don’t want to be making external calls to a compiler when Arrow is supposed to be fast.

Is there any other way I would be able to deserialize this information for use within Elixir?


Welcome to the forums @aaronhooper!

Have you taken a look at elixir-nx/explorer? See "GitHub - elixir-nx/explorer: Series (one-dimensional) and dataframes (two-dimensional) for fast data exploration in Elixir" and "Introducing Explorer :: Christopher Grainger's Blog".

It is an Elixir dataframe library that supports Arrow via a Polars backend. Polars is Rust-based, and one of its dependencies provides deep support for Arrow (GitHub - ritchie46/arrow2: transmute-free Rust library to work with the Arrow format). You could take a look at that as a reference. Pulling in Rust (e.g. via Rustler) would make your library more complex to maintain, but may, on balance, make your life easier!
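If a Rust dependency feels too heavy just to get at the body length, reading that one field by hand may also be feasible, since the FlatBuffers table layout is simple. The following is only a rough sketch under some assumptions: that `bodyLength` occupies vtable slot 3 of Arrow's `Message` table (after `version`, the union type tag and the union value), that the input is the bare metadata FlatBuffer, and that no verification beyond bounds checks is needed. `ArrowMessageMeta` and its functions are hypothetical names, not an existing library.

```elixir
defmodule ArrowMessageMeta do
  # Assumed vtable byte offset of bodyLength: 4 header bytes + 2 bytes * slot 3.
  @body_length_slot 10

  # Returns {:ok, body_length} or {:error, :malformed_metadata}.
  def body_length(metadata) when is_binary(metadata) do
    with <<table_pos::little-32, _::binary>> <- metadata,
         # A table starts with a signed offset pointing back to its vtable.
         {:ok, <<soffset::little-signed-32>>} <- slice(metadata, table_pos, 4),
         vtable_pos = table_pos - soffset,
         {:ok, <<vtable_size::little-16>>} <- slice(metadata, vtable_pos, 2),
         {:ok, field_off} <- field_offset(metadata, vtable_pos, vtable_size),
         {:ok, value} <- read_i64(metadata, table_pos, field_off) do
      {:ok, value}
    else
      _ -> {:error, :malformed_metadata}
    end
  end

  # A vtable too short to hold the slot means the field is absent.
  defp field_offset(_bin, _vt, size) when size < @body_length_slot + 2, do: {:ok, 0}

  defp field_offset(bin, vt, _size) do
    with {:ok, <<off::little-16>>} <- slice(bin, vt + @body_length_slot, 2), do: {:ok, off}
  end

  # Offset 0 means "not stored"; the schema default for a long is 0.
  defp read_i64(_bin, _table_pos, 0), do: {:ok, 0}

  defp read_i64(bin, table_pos, off) do
    with {:ok, <<v::little-signed-64>>} <- slice(bin, table_pos + off, 8), do: {:ok, v}
  end

  # Bounds-checked binary_part/3.
  defp slice(bin, pos, len) when pos >= 0 and pos + len <= byte_size(bin),
    do: {:ok, binary_part(bin, pos, len)}

  defp slice(_bin, _pos, _len), do: :error
end
```

Decoding the header union (schema, record batch and so on) by hand would take a fair bit more of this, so at that point a reference implementation such as arrow2 is probably the better guide.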
