I’ve been running some stream processing jobs with Broadway / Kafka, and it has been working pretty well. Thanks Jose and Marlus! Now I would like to store the output in Parquet files (on object storage), but there are no Elixir libraries for doing that.
There is a really good project called Apache Arrow, which is intended to build, manage, and interoperate “dataframes” (like pandas or Spark DataFrames) in memory, over gRPC, and on disk (Parquet). I think if we could build and interoperate dataframes using Arrow, it would be easier to leverage Elixir in the Data Engineering and Data Science world, which is one of the areas where we are falling behind.
Arrow has been implemented natively in C++, Java, Go, JavaScript, and Rust (WIP). The Ruby and Python libraries use bindings to the C++ implementation.
I haven’t worked with Rust or Rustler, but reading about how Discord managed to write its mutable SortedSet data structure makes me think we could do the same with Arrow.
What do you think? If it makes sense and someone is interested, we could work on it!