Access behaviour for Explorer.DataFrame

mooreryan · October 13, 2024, 1:19am

I have a couple of questions about Explorer’s DataFrame and the Access behaviour.

In the selecting columns section of the Explorer manual, it mentions that the Access behaviour is implemented for DataFrames.

Because of this, I would expect accessing a column that doesn’t exist with the bracket notation to return nil, and using DataFrame.fetch to access a column that doesn’t exist to return :error.

However, this is not the case. Here are some examples.

Examples

alias Explorer.DataFrame

df =
  DataFrame.new(
    a: [1, 2, 3],
    b: [10, 20, 30]
  )

And here are some ways to access it.

Using brackets

iex> df["a"]
#Explorer.Series<
  Polars[3]
  s64 [1, 2, 3]
>

iex> df["c"]
** (ArgumentError) could not find column name "c". The available columns are: ["a", "b"].
If you are attempting to interpolate a value, use ^c.

Trying to access column c raises an ArgumentError, but I would expect that to return nil given the Access behaviour.

Using `fetch`

Using fetch was also surprising:

iex> Explorer.DataFrame.fetch(df, "a")
{:ok,
 #Explorer.Series<
   Polars[3]
   s64 [1, 2, 3]
 >}

iex> Explorer.DataFrame.fetch(df, "c")
** (ArgumentError) could not find column name "c". The available columns are: ["a", "b"].
If you are attempting to interpolate a value, use ^c.

The first makes sense ({:ok, val}), but the second raises an ArgumentError, where I would expect it to return :error given the Access behaviour.

I assume that this is the intended way for it to work given that it is included in the test suite. See this test for example, which shows that an ArgumentError is expected to be raised.

(The fact that it is an ArgumentError is also interesting given that fetch!/2 from Access raises a KeyError exception rather than ArgumentError.)

Other Access behaviour functions

Another interesting thing is that not all of the functions in the Access behaviour are available. E.g.,

iex> Explorer.DataFrame.fetch!(df, "c")
** (UndefinedFunctionError) function Explorer.DataFrame.fetch!/2 is undefined or private.

Summary

To summarize, here are my questions:

Why does accessing columns in a DataFrame not behave in the way that the Access behaviour docs imply that it should behave?
Why aren’t all the Access behaviour functions available on DataFrame?
How are you supposed to check if a column exists in a DataFrame other than using a try block?

josevalim · October 13, 2024, 2:44pm

The Explorer team has decided that it makes more sense to raise (catching errors early) than to return nil. How much this is a violation of the Access behaviour is a good question. Here is what we can say:

Keyword lists are used to model optional keys, but they raise if the key is not an atom
Maps always return nil for missing keys, but they also have a convenience API for being strict, such as map.foo
Nx tensors raise if you use access with an invalid dimension
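The first two points can be seen in plain Elixir. This is only a small sketch; the variable names (opts, config, keyword_error, map_error) are made up for the example.

```elixir
opts = Keyword.new(timeout: 5000)
config = Map.new(host: "localhost")

# Missing keys return nil through the bracket syntax.
opts[:retries]
#=> nil
config[:port]
#=> nil

# A keyword list raises when the key is not an atom.
keyword_error =
  try do
    opts["timeout"]
  rescue
    error in ArgumentError -> error
  end

# The strict map.key syntax raises when the key is missing.
map_error =
  try do
    config.port
  rescue
    error in KeyError -> error
  end
```

Here keyword_error ends up holding an ArgumentError and map_error a KeyError, so both built-in structures already mix lenient and strict behaviour.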

I’d have to put more thought into it but, in the face of the two conflicting positions below, I’d probably stick with the first one:

We should let data structures decide what is best for them
Access should require them to return nil

The Access behaviour only requires you to implement the fetch, get_and_update, and pop callbacks. The other functionality is made available through the Access module itself. It is similar to a GenServer, where you implement handle_call but you must invoke GenServer.call.
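To make the split between callbacks and the Access module concrete, here is a minimal sketch. StrictTable is a hypothetical module invented for this example, not Explorer's implementation; it only assumes that a structure may choose to raise from fetch/2, as DataFrame does. Access also defines a pop/2 callback, so it is included for completeness.

```elixir
defmodule StrictTable do
  # Hypothetical column store, NOT part of Explorer. It raises on unknown
  # columns to mimic the DataFrame behaviour discussed in this thread.
  @behaviour Access

  defstruct columns: %{}

  # Builds a table from a keyword list, storing column names as strings.
  def new(columns) do
    %__MODULE__{columns: Map.new(columns, fn {name, values} -> {to_string(name), values} end)}
  end

  @impl Access
  def fetch(%__MODULE__{columns: columns}, key) do
    case Map.fetch(columns, to_string(key)) do
      {:ok, values} -> {:ok, values}
      :error -> raise ArgumentError, "could not find column name #{inspect(key)}"
    end
  end

  @impl Access
  def get_and_update(%__MODULE__{columns: columns} = table, key, fun) do
    # Raises through fetch/2 if the column does not exist.
    {:ok, current} = fetch(table, key)

    case fun.(current) do
      {get, new_values} ->
        {get, %{table | columns: Map.put(columns, to_string(key), new_values)}}

      :pop ->
        pop(table, key)
    end
  end

  @impl Access
  def pop(%__MODULE__{columns: columns} = table, key) do
    {values, rest} = Map.pop(columns, to_string(key))
    {values, %{table | columns: rest}}
  end
end

table = StrictTable.new(a: [1, 2, 3])

# None of these are defined on StrictTable itself; the bracket syntax and
# the Access module dispatch to StrictTable.fetch/2.
table["a"]
#=> [1, 2, 3]
Access.fetch(table, :a)
#=> {:ok, [1, 2, 3]}
Access.fetch!(table, "a")
#=> [1, 2, 3]
```

With this sketch, table["c"] raises the ArgumentError from StrictTable.fetch/2 rather than returning nil, because Access simply forwards to the callback. Access.fetch!/2 only raises its own KeyError when the callback returns :error.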

We should add an API for checking whether a column exists.

billylanchantin · October 13, 2024, 3:15pm

You can also do:

df = Explorer.DataFrame.new(a: [1, 2, 3])

if "a" in df.names do
  # ...
end

It’s not perfect because Access is more inclusive. For example:

# Both of these work
df["a"] #=> #Explorer.Series<Polars[3] s64 [1, 2, 3]>
df[:a]  #=> #Explorer.Series<Polars[3] s64 [1, 2, 3]>

# Only the string version works
"a" in df.names #=> true
:a in df.names  #=> false

So it’s a bit of a leaky abstraction. But depending on what you’re doing it may be good enough.
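One way to paper over the string/atom difference is a small helper that normalises the name before checking. This is only a sketch: ColumnCheck and has_column?/2 are hypothetical names, not part of Explorer, and the helper only handles atom and string column names. It takes the list of names directly, so with a real DataFrame you would pass df.names.

```elixir
defmodule ColumnCheck do
  # Hypothetical helper, NOT part of Explorer.
  # `names` is a list of column names as strings, such as `df.names`.
  def has_column?(names, column)
      when is_list(names) and (is_atom(column) or is_binary(column)) do
    to_string(column) in names
  end
end

# Stands in for df.names in the examples above.
names = ["a", "b"]

ColumnCheck.has_column?(names, "a") #=> true
ColumnCheck.has_column?(names, :a)  #=> true
ColumnCheck.has_column?(names, "c") #=> false
```

With a DataFrame this could be used as `if ColumnCheck.has_column?(df.names, :c), do: df[:c]`, which returns nil for a missing column instead of raising.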

mooreryan · October 13, 2024, 4:15pm

Thanks for the responses josevalim and billylanchantin!!

That makes sense. I was not aware that the return types for a behaviour were flexible.

This is interesting. I’m new to Elixir, so I can’t really say how it ought to be, but at least for me, the DataFrame implementation of Access being different from the Access docs was surprising. But that could be addressed by adding something to the DataFrame docs.

Thanks for the explanation. That makes sense.

Thanks, I will go with something like this for now.