I am writing a library to deal with dataframes. Dataframes can be thought as excel tables. The main data is a 2D table and any operation can be row based or column based.
Operations include filtering by column, row, adding or deleting rows, columns, etc.
Right now, in a very naive way, I’ve implemented the main data as a list of lists, row based. But of course this makes column operations expensive (add a new element in a certain position to each list for instance).
If I switch to list of list, column based, then row operations would be expensive.
Tuples would not make it I believe because they may get huge and there may be lots of modifications
and Maps are not ordered, and I need order.
Any advise? Are out there any comparison of performance of each of the Elixir data structures?
Extracting a single row or column I think is most efficient by using Enum.filter/2. This would make extracting a single column or row O(row * colums) though.
If you really need O(1) access to a complete row or column randomly, I do think, that a doubled DS would be the only thing to go, but it does double the amount of memory needed to hold the structure, also it makes writes expensive since they had to be done twice.
Basically you need to two nested maps %{row_id => %{col_id => data_sample}} and %{col_id => %{row_id => data_sample}}.
Yeah the duplication is definitely a possibility. Fortunately this is less duplication than it may look at first glance because cell values are just pointed to by each map location, not actually duplicated.
:ets could be another possibility because it does support a {_, 1} style match operation.
FWIW even traditional mutable 2d arrays suffer from this question. If you make row access primary then column access requires large jumps in memory which eats away at any cache locality you may have gotten with row access.
But there are tuples with arbitrary sizes. And you have native functions to deal with them.
My point was that you could simulate any dimension of array in the single-dimensional tuple. I understand it’s not practical for changing data however.
The erlang :array module already does that tuple managing for very fast row access. I’d say use it, it is well tested.
For note, maps are ‘almost’ on par with :array in speed in almost all cases (sometimes even a bit faster) except iteration, which sounds like it might be something you need.
Overall though I’d say use maps, and if you setup your key just right then iteration can be very fast too.
I think you might want to look at the Tensor library that I made a while back, which allows for vectors, matrices and any higher-order tensors as well. It works with sparse maps, making it more efficient when it is only partially filled. It probably provides the functionality you want (a multitude of different ways of filtering, adding/removing rows or colums, flipping or rotating, fast access to a single element, row or column, easy ways to sort rows or columns, etc).
Your library looks pretty nice. I’d put some benchmark comparisons versus the vanilla (or Erlang) data structures for some bragging rights. benchwarmer should do just fine.
I’ve been experimenting with using map with a two dimensional key(tuple) for a map as originally suggested.
But it seems that operations will be much slower. I’ve just tried to reimplement a couple of them and bear in mind that my code sucks, but sucks as much doing lists then doing maps I guess and still lists seems to be faster than maps.
This is the code, map implementation:
list implementation(has many more functions, do not care about most):
The benchmark:
Results:
transpose(list of lists) 314.23
transpose(maps) 123.85 - 2.54x slower
tail(maps) 3303.23
tail(list of lists) 1539.64 - 2.15x slower
rows(list of lists) 85.83
rows(maps) 3.86 - 22.21x slower
columns(list of lists) 198.51
columns(maps) 129.22 - 1.54x slower
I’m specially worried by the columns function as I know the list version was done with no love for performance whatsoever, still seems to outperform a map with tuples as keys.
I’ll try Tensor next, although I want to keep experimenting with map a little bit more.
It’s very hard to make any reasonable comment without seeing the actual functions that do the transformations.
If performance is really the critical issue, then this is the kind of thing you write a NIF for. Elixir is simply not well suited to doing standard linear algebra type calculations as fast as the hardware can go. Even a naive implementation in C is going to beat the pants off of any Elixir implementation.
Did you use a tuple for the key in the map? If so then I’d recommend trying to invert it and test again, I.E. put the, say, x coordinate as the map key then the value is another map of y values. I’m curious to see how/if that would help.
(or y in the outer and x in the inner depending on access patterns)
What is actually funny is that most matrix math libraries that work with ‘native’ code is much faster than C code, as this is a piece of software written in Fortran that has had a history of twenty years of efficiency improvements. And the Fortran compiler can also do more rigorous rewriting than the C compiler can.
@dimitarvp: This is a great idea. Hopefully I can find some time in the near future to create a benchmark.
That is because Fortran has no aliasing constructs, which allow the compiler to optimize better. In C++ this can be forced, but generally only through template work. Rust on the other hand has similar no aliasing constructs as Fortran, hence why it can beat C++ in some benchmarks too. Though for Matrix work some C++ libraries like Eigen, which use heavy template magic, end up as some of the fastest executable code regardless, and are wonderfully easy to use, but hard to write, thankfully it is already written for you. ^.^