Strange argument error when calling List.to_tuple with a large list of ints

Hi, I have a ~200 MB binary file that I am trying to load into a tuple of ints. My previous code used the binary directly to get an int, like so:

def get_table() do
    File.read('data/HandRanks.dat')
end

def get_int(table, index) do
    :binary.part(table, {index * 4, 4})
    |> :binary.decode_unsigned(:little)
end

This code worked fine so and was reasonably performant but I wanted to test other methods of accessing the ints. There are around 32 million ints total in the file. First I built a list that contained all of the ints, starting from the end of the binary, then when done I tried to call List.to_tuple to convert to a tuple, like so:

def gen_tuple(table) do
    length = div(byte_size(table), @int_size)
    gen_tuple([], length - 1, table)
end
def gen_tuple(list, -1, _) do
    List.to_tuple(list)
end
def gen_tuple(list, current, table) do
    new_list = [get_int(table, current) | list]
    gen_tuple(new_list, current - 1, table)
end

with @int_size being 4. When I run my gen_tuple I get this error:

** (ArgumentError) argument error
:erlang.list_to_tuple([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ā€¦])

Iā€™ve tried testing a few different things. The list is the correct size, and running things like length(list), Enum.slice, first(), last() on the list donā€™t throw errors and give the answers I would expect. is_list returns true, both for the elixir version and :erlang.is_list. If I use Enum.take to only use the first elements of the list to_tuple runs without error sometimes. I figured out that if the list is taken down to length 16,777,215 it can run without error. Anything with length 16,777,216 and above gives me the error. I sliced the list at that spot and didnā€™t notice anything improper. Element 16,777,215 is the number 24788 and the next element is the number 24792. I can even slice the list to take 50 or so elements in the middle that include this ā€˜problem spotā€™ and to_tuple runs fine, so Iā€™m really confused about what is going on. The length of the list shouldnā€™t cause any problems, the list isnā€™t improper as far as I can tell, and the ā€˜problem elementā€™ doesnā€™t seem to be a problem when it is part of a smaller list. Can anyone help me figure out what Iā€™m doing wrong here? I get the feeling Iā€™m missing something obvious!

In case anyone is feeling unreasonably helpful the binary file can be downloaded here: https://github.com/chenosaurus/poker-evaluator/raw/master/data/HandRanks.dat

If anyone is familiar with poker math, itā€™s the really awesome lookup table used in the 2+2 forum poker evaluator algorithm, used for lightning-fast poker hand evaluations.

1 Like

16,777,215 is 0xffffff (24 bits) and therefore probably marks a limit in your adressable space.

Current OTP version is documented to use 26 bits though. Perhaps your version does have a smaller limit?

2 Likes

Ah, thanks, that makes sense.

When I run elixir --version I get

$ elixir --version
Erlang/OTP 20 [erts-9.2.1] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:10] [hipe] 
[kernel-poll:false] [dtrace]

Elixir 1.6.1 (compiled with OTP 20)

I installed through brew on MacOS. So is there anyway around this limitation, or is that going to be the maximum size of a tuple?

Thats probably the maximum, but to be honest, why a tuple and not a list or even keep it as a binary on the bin_heap?

Well a list doesnā€™t work well because I need O(1) random access. If I canā€™t use a tuple then keeping it as a binary will be the way to go. I just wanted to try to improve performance because it seemed that :binary.part() was being somewhat slow. Perhaps there is some way I can implement the get_int() function using pattern matching on the binary instead.

A list is O(N) access for read, so it suites your requirements :wink: I just assume you meant O(1) here :wink:

For read performance, Iā€™d go for an ETS first, storing in the form {index, integer}. A NIF second which holds the raw data as an &[u32] (or equivalent type in the implementing language, this was rust) in memory and just accesses this array by a given index.

Ha yeah, I edited it to O(1) right after I posted, sorry about that. Okay, Iā€™ll try the ETS table out next. A nif in C should be fairly easy, since I used the original C code to figure out how to use the lookup table, so Iā€™ll try that as well.

Iā€™m also going to see if there is some other way I can get a tuple to go up to the 26 bit limit, since that would give me plenty of room to spare. Even a slow way to generate the tuple is fine, since I only need to do it once.

Thanks for the help!

Perhaps using erlangs :array module might help as well. It nests tuples and has still o(1) read but is a bit heavier in memory foot print (probably smaller than an ETS though). The main culprit is, that with any form of non-ETS or non-binary you will need to copy the whole thing around when crossing the process-boundary. This alone would make me avoid anything but binary or ETS for this huge thingā€¦

1 Like

Okay, Iā€™ll test that one too. Yes, crossing the process-boundary would be a huge issue, but I was thinking of perhaps making a dedicated gen_server to handle the calls. I will definitely try out the ETS first though, along with perhaps trying to optimize way I access the binary.

One question I have though, if you have iex handy, could you try doing a call to :erlang.make_tuple(17_000_000, 0) to see if itā€™s just my system that seems to be limited to 24 bit tuple address space?

Erlang/OTP 20 [erts-9.2] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:10] [hipe] [kernel-poll:false]

Interactive Elixir (1.6.1) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> :erlang.make_tuple(17_000_000, 0)
** (ArgumentError) argument error
    :erlang.make_tuple(17000000, 0)
iex(1)> :erlang.make_tuple(0xFFFFFF, 0)  
{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...}
iex(2)> :erlang.make_tuple(0xFFFFFF + 1, 0)
** (ArgumentError) argument error
    :erlang.make_tuple(16777216, 0)

and erl:

2> erlang:make_tuple(17000000, 0).
** exception error: bad argument
     in function  erlang:make_tuple/2
        called as erlang:make_tuple(17000000,0)
3> erlang:make_tuple(16#ffffff, 0).
{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...}
4> erlang:make_tuple(16#ffffff + 1, 0).
** exception error: bad argument
     in function  erlang:make_tuple/2
        called as erlang:make_tuple(16777216,0)
1 Like

Interesting! Perhaps erlang needs to update their docs here then:

http://erlang.org/doc/efficiency_guide/advanced.html#id71800

I filed an issue: https://bugs.erlang.org/browse/ERL-577

Isnā€™t it technically O(log(n))? IF you nest tuples itā€™s O(1) to get to a tuple, but then you may need to follow another pointer if itā€™s nested.

I think someone did the benchmark and maps w/ integer keys as positions wins out over :array these days.

That would make a really whopping heap for that process. Donā€™t try and send it in a message. :wink: It would be better to keep it in an ETS table or a binary and not as local data to a process.

4 Likes

Yeah, it is. Technically. But here n is not the position of the element but rather the size of the array overall.

So accessing the first or the last or any other element will ā€œtake the same timeā€. This is different to a list, where accessing the first element is instant while the last one will take as long as it needs to traverse the list.

Its very similar to a prefix-tree where access times are bound to the length of the key and not the size of the collection.

/me has a feeling heā€™s missing somethingā€¦

Wouldnā€™t this be easily and efficiently done by just keeping it as a binary and doing something like:

  def get_int(index, table) do
    index = index * 4 * 8
    <<_skip::integer-size(index), result::little-integer-size(32), _rest::binary>> = table
    result
  end
1 Like

Awesome, thanks! Iā€™m still pretty new to elixir and I felt there might be a binary pattern-matching way to do it but I wasnā€™t sure. Iā€™m definitely going to see how performant that is compared to the other methods.