Performant data structure for a table (think 2D array or list of lists)

I am writing a library to deal with dataframes. Dataframes can be thought as excel tables. The main data is a 2D table and any operation can be row based or column based.
Operations include filtering by column, row, adding or deleting rows, columns, etc.

Right now, in a very naive way, I’ve implemented the main data as a list of lists, row based. But of course this makes column operations expensive (add a new element in a certain position to each list for instance).

If I switch to list of list, column based, then row operations would be expensive.
Tuples would not make it I believe because they may get huge and there may be lots of modifications
and Maps are not ordered, and I need order.

Any advise? Are out there any comparison of performance of each of the Elixir data structures?

1 Like

People very frequently use maps where the key is a tuple of coordinates, and the value is the value at that coordinate. IE:

%{
  {0, 0} => 1,
  {0, 1} => 2,
}
1 Like

Sounds good but I think in my case would only make sense if it is fast to get all the keys for a given row or column.

in your example:

%{
{0, 0} => 1,
{0, 1} => 2,
}

I’d possible like to get all the elements in column 1, for instance, ideally something like:

Map.get(map, {_, 1})

But it does not seem to be possible, it says _ is an unbound variable.

Extracting a single row or column I think is most efficient by using Enum.filter/2. This would make extracting a single column or row O(row * colums) though.

If you really need O(1) access to a complete row or column randomly, I do think, that a doubled DS would be the only thing to go, but it does double the amount of memory needed to hold the structure, also it makes writes expensive since they had to be done twice.

Basically you need to two nested maps %{row_id => %{col_id => data_sample}} and %{col_id => %{row_id => data_sample}}.

2 Likes

Yeah the duplication is definitely a possibility. Fortunately this is less duplication than it may look at first glance because cell values are just pointed to by each map location, not actually duplicated.

:ets could be another possibility because it does support a {_, 1} style match operation.

FWIW even traditional mutable 2d arrays suffer from this question. If you make row access primary then column access requires large jumps in memory which eats away at any cache locality you may have gotten with row access.

2 Likes

Okay, maybe I am not getting something very important here, but are you not, in essence, needing a simple two-dimensional array?

You probably are missing that there are no arrays in elixir.

But there are tuples with arbitrary sizes. And you have native functions to deal with them.

My point was that you could simulate any dimension of array in the single-dimensional tuple. I understand it’s not practical for changing data however. :unamused:

Tuples must be wholly copied on every update. The OP mentioned that updates are common which generally disqualifies large tuples.

1 Like

Yep. Seems like a managed data structure is in order. Probably list (or tuple) of tuples with a maximum size (probably 64).

Sorry that I can’t be more helpful.

The erlang :array module already does that tuple managing for very fast row access. I’d say use it, it is well tested.

For note, maps are ‘almost’ on par with :array in speed in almost all cases (sometimes even a bit faster) except iteration, which sounds like it might be something you need.

Overall though I’d say use maps, and if you setup your key just right then iteration can be very fast too.

2 Likes

I think you might want to look at the Tensor library that I made a while back, which allows for vectors, matrices and any higher-order tensors as well. It works with sparse maps, making it more efficient when it is only partially filled. It probably provides the functionality you want (a multitude of different ways of filtering, adding/removing rows or colums, flipping or rotating, fast access to a single element, row or column, easy ways to sort rows or columns, etc).

It is of course also available on Hex.pm.

6 Likes

Your library looks pretty nice. I’d put some benchmark comparisons versus the vanilla (or Erlang) data structures for some bragging rights. :wink: benchwarmer should do just fine.

2 Likes

I’ve been experimenting with using map with a two dimensional key(tuple) for a map as originally suggested.

But it seems that operations will be much slower. I’ve just tried to reimplement a couple of them and bear in mind that my code sucks, but sucks as much doing lists then doing maps I guess and still lists seems to be faster than maps.
This is the code, map implementation:


list implementation(has many more functions, do not care about most):

The benchmark:

Results:
transpose(list of lists) 314.23
transpose(maps) 123.85 - 2.54x slower
tail(maps) 3303.23
tail(list of lists) 1539.64 - 2.15x slower
rows(list of lists) 85.83
rows(maps) 3.86 - 22.21x slower
columns(list of lists) 198.51
columns(maps) 129.22 - 1.54x slower

I’m specially worried by the columns function as I know the list version was done with no love for performance whatsoever, still seems to outperform a map with tuples as keys.

I’ll try Tensor next, although I want to keep experimenting with map a little bit more.

1 Like

It’s very hard to make any reasonable comment without seeing the actual functions that do the transformations.

If performance is really the critical issue, then this is the kind of thing you write a NIF for. Elixir is simply not well suited to doing standard linear algebra type calculations as fast as the hardware can go. Even a naive implementation in C is going to beat the pants off of any Elixir implementation.

1 Like

Did you use a tuple for the key in the map? If so then I’d recommend trying to invert it and test again, I.E. put the, say, x coordinate as the map key then the value is another map of y values. I’m curious to see how/if that would help. :slight_smile:

(or y in the outer and x in the inner depending on access patterns)

What is actually funny is that most matrix math libraries that work with ‘native’ code is much faster than C code, as this is a piece of software written in Fortran that has had a history of twenty years of efficiency improvements. And the Fortran compiler can also do more rigorous rewriting than the C compiler can.

@dimitarvp: This is a great idea. Hopefully I can find some time in the near future to create a benchmark.

That is because Fortran has no aliasing constructs, which allow the compiler to optimize better. In C++ this can be forced, but generally only through template work. Rust on the other hand has similar no aliasing constructs as Fortran, hence why it can beat C++ in some benchmarks too. Though for Matrix work some C++ libraries like Eigen, which use heavy template magic, end up as some of the fastest executable code regardless, and are wonderfully easy to use, but hard to write, thankfully it is already written for you. ^.^