Performance of random access to binaries

Hi, I’m trying to port a piece of C code to Elixir.

I have a binary that contains a tree structure. I keep the binary in memory since it has a limited size (I read it from a bigger file). The relevant documentation seems to suggest that the binary can be traversed just once from start till end to build the desired output. In practice it turns out it’s not the case and I should rather rely on offsets: each node has a count of children followed by their offsets. Each offset designates the number of bytes from the start of the binary where the child node starts. So what I would need is to have random access to the binary.

The C code uses memory-mapped files, where you just move the memory pointer and that (I imagine) means moving the file position underneath. While I could do the same with :file.position, I’d rather work with the binary and not the whole file.

Now the question is - how efficient are the following operations:

  • byte_size(binary) and
  • :binary.part(binary, pos, len)

Are those constant time? Is :binary.part/3 just moving the pointers around?

If so, I could write seek(binary, pos) for accessing the binary in a random way.

All of this operations are constant time.

Though :binary.part/3 isn’t moving pointers “around”, it creates a new fat pointer that points into the old binary and it will also increase its parents ref counter. Due to this trick you avoid copying the sub binary, but the “parent” can’t be GC’d as long as the “child” lives. You can use :binary.copy/1 to explicitely create a copy if necessary.

Random access to a file though is massively dependant on the underlying diskspeed and filesystem, even though its algorythmically considered constant time, I’d not rely on it for random access if I can afford to hold a copy of the file in memory or I can consume the file in a stream to create the result.

3 Likes

Sounds like an interesting problem. :slight_smile: Maybe a sample repo and several samples of the binary will help?

If not possible, then I fully agree with @NobbZ’s more general advice.

For reference:

http://erlang.org/doc/efficiency_guide/binaryhandling.html#sub_binary