Seeking info about Explorer and Table Formats

Hello,

I am developing an Elixir library to work with Apache Iceberg called ExIceberg. My primary focus is to handle catalog responsibilities in Elixir while delegating the reading and writing processes to the Rust implementation.

Currently, we can manage some table formats using Trino through the req_trino library. However, I want to integrate Iceberg more closely with the Elixir ecosystem.

Here is some context about Apache Iceberg:

Apache Iceberg is a table format similar to Delta Lake and Hudi; it is not itself a compute engine. In Java, there are implementations that work with Spark; in Python, there is some support via PyArrow; and in Rust, it will be implemented using Apache DataFusion as the query engine.

My initial idea was to get the RecordBatch returned from the Rust implementation and pass it to Explorer. However, after discussing with @philss about the responsibilities of Explorer, I want to clarify what exactly Explorer intends to achieve.

Initially, I envisioned Explorer as similar to Ibis. But since the release of Remote DataFrame, I see that Explorer is aiming to be more like Daft.

So, I have some questions:

What is the focus of Explorer? Is it to be more of a DataFrame interface for multiple backends or a query engine with a DataFrame API?

If the focus is to be a DataFrame interface, I can return just a RecordBatch or a Table from Arrow and pass it to Explorer. For this, we might need something like Daft's daft.from_arrow. However, if the focus is on distributed reading, ExIceberg will focus solely on catalog interaction, Explorer will use ExIceberg to get the metadata, and Explorer itself will read the files.
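To make the first option concrete, here is a rough sketch of what that handoff could look like. Every function name here is an assumption for illustration: neither ExIceberg.Table.scan/1 nor Explorer.DataFrame.from_arrow/1 exists today.

```elixir
# Hypothetical flow: ExIceberg resolves the table through the catalog,
# the Rust side materializes an Arrow RecordBatch, and Explorer wraps it.
# All three calls below are assumed APIs, not real ones.
{:ok, table} = ExIceberg.Catalog.load_table(catalog, "db.events")
{:ok, record_batch} = ExIceberg.Table.scan(table)

df = Explorer.DataFrame.from_arrow(record_batch)
```

This mirrors Daft's from_arrow: the Iceberg side owns catalog and scan planning, and Explorer only receives already-materialized Arrow data.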

Hi @ndrluis! I have been learning about Iceberg recently, so it is really cool that you are tackling this.

Regarding the focus of Explorer, I think it is a performant (and potentially distributed) DataFrame engine first, with support for multiple backends. Multiple backends are a feature but not the priority. That’s my take though; other members of the Explorer team may have a different perspective.

What is next on my roadmap is to have something like Explorer.BatchFrame, which is a collection of dataframes, potentially spread across several nodes, on which we execute operations concurrently. The process to do so is:

  1. Get the location of all .parquet, .arrow, etc files
  2. Spawn several nodes, each pointing to one of the locations above
  3. Create a BatchFrame
  4. ??? (do all work)
  5. Collect or write the results back
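The fan-out in steps 2–5 can be sketched with plain Elixir, using Task.async_stream as a stand-in for real node placement. The file list and the per-file work below are placeholders (the "work" just measures the path); a real BatchFrame would load Explorer dataframes, possibly on other nodes.

```elixir
# Step 1: the locations of the data files (in the Iceberg case, these
# would come from the table's metadata and manifest files).
locations = ["foo/1.parquet", "foo/2.parquet", "foo/3.parquet"]

# Steps 2-4: process each location concurrently. Here the "work" is a
# placeholder that returns the byte size of the path; a real
# implementation would run the query plan on a dataframe instead.
results =
  locations
  |> Task.async_stream(fn path -> byte_size(path) end, max_concurrency: 3)
  |> Enum.map(fn {:ok, result} -> result end)

# Step 5: collect the partial results back on the coordinating node.
total = Enum.sum(results)
```

The same shape works whether the tasks run locally or on remote nodes; only the spawning mechanism changes.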

With this in mind, where do you think Iceberg would enter? I believe it can help with step 1, but what about step 5? Would you want to write the results back in Iceberg format too?

Thank you @josevalim for the response.

You are correct about the first step. Regarding writing, I believe it would be better to delegate it to the Iceberg implementation, because the metadata and commit process involve details that I don’t think make sense to have on the Explorer side.

Are you thinking of using Arrow in the Explorer.BatchFrame, or would it be an Elixir File IO?

If the idea is to be compatible with Arrow Table/RecordBatch, I can focus on using the Iceberg Rust implementation. However, if the idea is to have something only in Elixir, I believe it would be easier to write the entire Iceberg implementation in Elixir.

Rather, the BatchFrame will be a collection of existing DataFrames. For example, you could create it manually like this:

Explorer.BatchFrame.new([
  Explorer.DataFrame.from_parquet("foo/1.parquet"),
  Explorer.DataFrame.from_parquet("foo/2.parquet"),
  Explorer.DataFrame.from_parquet("foo/3.parquet"),
  # ...
])

Or you could use from_ipc, from_csv, whatever. So it will use whatever backend you have chosen (most likely Polars) and coordinate work across them. Does this make sense?

Sorry, my question about I/O does not make sense: since the idea is to be like Daft, the scan responsibility belongs to Explorer.

From my perspective, it would be easier to delegate these scans to the Iceberg Rust implementation and pass the Arrow RecordBatch/Table to an Explorer DataFrame. However, if the idea is to distribute the reading process, I will need to learn more about the scan process in the context of Iceberg.

I understand that we could use from_parquet (in the future, it would be good to support ORC files as well), but I believe it is not so simple.

There are details I don’t know well but know exist, such as the logical plan / physical plan, and how partitions and deleted files/rows in the Iceberg context would affect this scan process.

Another thing is that we would need some features that are not being used today, like predicate pushdown and statistics in from_parquet.
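For illustration, a pushed-down scan might look like the sketch below. Explorer's from_parquet does support column selection via :columns today, but the :predicate option here is an assumption, not a real Explorer API.

```elixir
# Sketch: push projection and a row filter down into the parquet scan,
# so only the needed columns and matching row groups are read.
# :columns is real; :predicate is hypothetical and shown only to
# illustrate what predicate pushdown could look like.
df =
  Explorer.DataFrame.from_parquet!("foo/1.parquet",
    columns: ["id", "amount"],
    # hypothetical filter, checked against row-group statistics
    predicate: [amount: {:>, 100}]
  )
```

In the Iceberg case, the same predicate would additionally be matched against the per-file statistics in the manifests, so whole files can be skipped before any parquet reading happens.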

But with this idea of Explorer being responsible for the reading process, I can start to dig into these lower-level details.

I missed an important point. Iceberg uses Avro files for some of its metadata, which contain low-level details specific to Iceberg’s implementation.

Additionally, all implementations (Go, Rust, Python, Java) use Arrow for data delivery to ensure compatibility with other tools. I understand that Explorer currently doesn’t support Arrow RecordBatch or Table, but I believe it’s best to follow the approach used by other implementations.

Once we have the basic implementation to return an Arrow RecordBatch/Table, we can revisit and enable data loading in Explorer in a way that makes sense for Explorer.

Agreed. Explorer may be able to do distribution in the future but it doesn’t mean everyone has to use it. Being able to receive directly from Arrow would be welcome too. We already do something similar in DataFrame.from_query, where we pass a pointer around.

Btw, here is an issue where we can keep this discussion moving forward: Add delta lake file support · Issue #752 · elixir-explorer/explorer · GitHub


I also wonder if we should leverage Iceberg via DuckDB’s Iceberg core extension.

You can already use DuckDB via elixir-explorer/adbc (Apache Arrow ADBC bindings for Elixir).
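As a sketch of that route: read an Iceberg table through DuckDB's iceberg extension and land it in Explorer over ADBC. This assumes the DuckDB ADBC driver can be fetched, the extension installs successfully, and the table path exists; the package versions and the 'warehouse/db/events' path are illustrative.

```elixir
# Sketch: DuckDB reads the Iceberg table, Explorer receives the result
# as a dataframe over ADBC. Versions and table path are placeholders.
Mix.install([{:adbc, "~> 0.6"}, {:explorer, "~> 0.9"}])

{:ok, db} = Adbc.Database.start_link(driver: :duckdb)
{:ok, conn} = Adbc.Connection.start_link(database: db)

# Load DuckDB's iceberg extension (downloads it on first use).
{:ok, _} = Adbc.Connection.query(conn, "INSTALL iceberg")
{:ok, _} = Adbc.Connection.query(conn, "LOAD iceberg")

df =
  Explorer.DataFrame.from_query!(
    conn,
    "SELECT * FROM iceberg_scan('warehouse/db/events')",
    []
  )
```

This sidesteps the Rust-implementation question entirely: DuckDB handles the scan planning, deletes, and statistics, and Explorer only consumes the Arrow result.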