Apache Arrow

I’ve been running some stream processing jobs with Broadway / Kafka, and it has been working pretty well. Thanks, Jose and Marlus! Now I would like to store the output in Parquet files (on object storage), but there are no Elixir libraries for doing that.

There is a really good project called Apache Arrow, which is intended to build, manage and exchange “dataframes” (like pandas or Spark DataFrames) in memory, over gRPC, and on disk (Parquet). I think if we could build and exchange dataframes using Arrow, it would be easier to leverage Elixir in the Data Engineering and Data Science world, which is one of the things we are getting behind.

Arrow has been implemented in C++, Java, Go and Rust (WIP). The Ruby, JS and Python libraries use the C++ bindings.

I haven’t worked with Rust or Rustler, but reading about how Discord managed to write their mutable SortedSet data structure makes me think we could do the same with Arrow.

What do you think? If it makes sense and someone is interested, we could work on it!

I see Elixir and Rust as brothers that are complementing each other and I’d love to bring some performance-critical (and useful!) pieces of technology to the Elixir ecosystem through Rust NIFs.

(When I settle in my new job – should be soon – I plan to finish my sqlite driver + Ecto 3.x adapter which uses a Rust library beneath.)

I have worked for some time with Rustler (0.22-rc) and I quite liked it. There are a few things it still needs to do better – like giving you an explicit API for yielding back to the BEAM so you can manually make sure you are not clogging the dirty NIF schedulers (scroll down and look for “Long-running NIFs” for context) – but in general it has been an absolutely excellent project! (@scrogson do let me know if I am misrepresenting Rustler, it has been a few months since I last looked at it.)

You only need to make sure to periodically yield to the BEAM VM, doing the work in chunks, and you’ll have a lag-free and lightning-fast program.
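To make the chunking idea concrete, here is a minimal sketch in plain Python. It is only an analogy and not Rustler or `erl_nif` code: the function names `sum_in_chunks` and `run` are made up for illustration, and the `yield` merely stands in for the point where a real NIF would hand control back to the scheduler before continuing.

```python
from typing import Iterator, List


def sum_in_chunks(values: List[int], chunk_size: int) -> Iterator[int]:
    """Sum `values` one chunk at a time, yielding the running total after each chunk.

    Hypothetical helper: each `yield` marks where a long-running native
    function would give control back instead of hogging its scheduler.
    """
    total = 0
    for start in range(0, len(values), chunk_size):
        # Do a bounded amount of work...
        total += sum(values[start:start + chunk_size])
        # ...then hand control back to the caller between chunks.
        yield total


def run(values: List[int], chunk_size: int) -> int:
    """Drive the chunked job to completion, the way a scheduler would resume it."""
    total = 0
    for total in sum_in_chunks(values, chunk_size):
        pass  # other work could be interleaved here between chunks
    return total
```

Calling `run(list(range(10)), 3)` does the sum in four small steps rather than one long one. The chunk size is the knob: smaller chunks mean more frequent hand-offs at the cost of more overhead per unit of work.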

To me Elixir and Rust pair naturally because Rust can make certain mutable data structures work extremely fast and Elixir can literally do everything else (including orchestration between services).

I might be overly optimistic here but I can see a future where Elixir and Rust converge together in a co-op environment / runtime, f.ex. the Lumen effort.

I don’t think this is necessarily overly optimistic. Hans (hansihe) works on both Lumen and Rustler. I know the Lumen folks are hoping to stay compatible with existing C NIFs so as not to lose large parts of the ecosystem. But I also think there is great value in strengthening the Elixir + Rust case. Lumen is built with Rust. I guess we might stick with the Rustler approach, though, so as not to cut ties with C land. I think cutting those ties would be a mistake.

Oh, definitely. We shouldn’t leave the old code behind – everything must JustWork™ and that’s that.

I am mostly saying that Rust can be made a first-class citizen in the BEAM land, and it should. Save for a few kinks yet to be ironed out, Rustler is quite amazing.

Following. Going to be building parquet files via Broadway soon also.

Did this get anywhere? I’m interested in using Broadway to process data from Kafka (serialized in Avro) and write parquet to HDFS.

Did anyone in this thread end up getting a Broadway batcher to write parquet?

Hey! Yes, we recently got big news!

Christopher Grainger has been working on Explorer, a dataframe library that currently uses Polars (Rust), which implements Arrow and can export to Parquet. Explorer is now part of the Elixir Nx family of libraries, so I think the community will focus on it, which is amazing! It is not ready yet, though.

The other approach I was considering before I knew about Explorer was DuckDB, a single C++ library / database (like SQLite, but columnar) that can also export to Parquet. But as I don’t have experience with C and NIFs, I was waiting for the ODBC driver, which should be ready soon.

If you try any of these approaches please let us know how it went.

Best.

I’m not sure how to use Explorer… Have you looked at pinterest/elixir-thrift, a pure Elixir Thrift implementation? Is there a way to encode my data using the Thrift protocol and have it output columnar Parquet files?
I’m still learning how to configure all these parts together.

Now you can use it!
Still WIP, but I bet it works well.
