Speedup bitwise decoding

Hello guys,

I’m currently struggling with a problem that should be easy but isn’t working out as it should.
I need to decode an integer and output a list of the indices whose bits are set to 1.

I came up with a first recursive implementation:
def listOfIndexOf({key, _size}),
  do: listOfIndexOf(key, 0, [])

defp listOfIndexOf(key, index, acc) do
  if key == 0 do
    acc
  else
    val = key &&& 1

    listOfIndexOf(key >>> 1, index + 1,
                       if(val == 1,
                          do: [index | acc],
                          else: acc))
  end
end

as well as a second one based on reduce_while:

def listOfIndexOf({key, size}) do
  {_, _, ret} =
    Enum.reduce_while(1..size, {key, 0, []},
      fn _, {key, index, acc} = input ->
        if key == 0 do
          {:halt, input}
        else
          val = key &&& 1
          updatedKey = key >>> 1
          ret = if(val == 1, do: [index | acc], else: acc)
          {:cont, {updatedKey, index + 1, ret}}
        end
      end)

  ret
end

(still not a functional expert, so remarks are welcome :))
(I’ve been playing with guards as well, to avoid the ifs)

Unfortunately, I need a faster implementation; currently, with a key of 50,000 bits, the processing time is around 30 to 40 ms.

Can you think of a faster way of decoding?



I’m not entirely sure I follow, but based on your code example, I would write something along these lines:

defmodule Example do
  use Bitwise

  def list_of_index_of({key, _size}),
    do: list_of_index_of(key, 0, [])

  def list_of_index_of(0, _, acc), do: acc

  def list_of_index_of(key, index, acc) do
    acc = append_acc(key &&& 1, index, acc)
    list_of_index_of(key >>> 1, index + 1, acc)
  end

  defp append_acc(1, index, acc), do: [index | acc]
  defp append_acc(_, _, acc), do: acc
end

IO.inspect Example.list_of_index_of({50_000, 0})

This outputs [15, 14, 9, 8, 6, 4]; is that correct?

This looks like something that could be optimised very well by HiPE (the partial native compiler for Erlang/Elixir).

You can enable native compilation for a module by adding the @compile :native attribute (and having a HiPE-enabled Erlang build; there should be [hipe] in the banner when you launch iex).

It’s worth noting that it’s generally not a good idea to call regular modules from HiPE-compiled ones; it’s often slower than just using regular modules everywhere. If you have functions in this module that call other modules, I would recommend extracting this function into a separate module.
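For illustration, a minimal sketch of what that could look like (the module name is mine, not from the thread; on Erlang builds without HiPE the attribute is simply inert):

```elixir
defmodule BitDecoder do
  # Request native (HiPE) compilation for this module.
  # Only takes effect on a HiPE-enabled Erlang build.
  @compile :native

  use Bitwise

  # All the bit crunching stays inside this one module, with no calls
  # out to other modules, to avoid the HiPE <-> BEAM call overhead.
  def list_of_index_of(0, _index, acc), do: acc

  def list_of_index_of(key, index, acc) do
    acc = if (key &&& 1) == 1, do: [index | acc], else: acc
    list_of_index_of(key >>> 1, index + 1, acc)
  end
end
```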


Hi @swelham,

thanks for your answer. Yes, your code is right (and definitely more Elixir-ish than mine ;))

the idea is to extract a list of bit indices from a pretty large bitfield (actually the input value is more like 1 <<< 50000 than 50000 :)))


Hi @michalmuskala
it seems like a good idea, and even if it does not solve the problem (yeah, unfortunately no gain), it does not hurt. The module is a bit-manipulation one, so it is called by everyone but calls no one.


I wonder how quick operating over it as a binary would be. I’d imagine it would be a lot faster for VERY large numbers for sure, but I’m curious how this fares for small numbers. Anyone want to benchmark?

defmodule Testering do
  def list_of_index_of(bin) when is_binary(bin),
    do: list_of_index_of_(bin, :erlang.bit_size(bin) - 1, [])

  def list_of_index_of(int) when is_integer(int) do
    bin = :binary.encode_unsigned(int)
    list_of_index_of(bin)
  end

  defp list_of_index_of_(<<>>, _idx, acc), do: acc

  defp list_of_index_of_(<<1::integer-size(1), rest::bitstring>>, idx, acc),
    do: list_of_index_of_(rest, idx - 1, [idx | acc])

  defp list_of_index_of_(<<_::integer-size(1), rest::bitstring>>, idx, acc),
    do: list_of_index_of_(rest, idx - 1, acc)
end
iex> Testering.list_of_index_of(50_000)
[4, 6, 8, 9, 14, 15]

You’d be surprised at the speed of bitstrings in certain situations, and this may be one of them.

Although iterating over an integer without ‘mutating’ it might be faster still, unsure…

Hey @OvermindDL1, you definitely got it. In the whole program it is faster (between 2 and 5 ms) for my use case. I’m extracting the code from the project to build a clean Benchfella test and will post the results.


(between 2 and 5 ms)

How much of a change is that relatively? Absolute times are not much info without something to compare it to.


These are the results using the @OvermindDL1 vs. @swelham implementations:


benchmark name iterations average time
Binary conversion on 50 bits 1000000 2.61 µs/op
Binary conversion on 100 bits 500000 4.80 µs/op
Binary conversion on 200 bits 200000 8.75 µs/op
Binary conversion on 500 bits 100000 21.49 µs/op
Binary conversion on 1000 bits 50000 41.97 µs/op
Binary conversion on 2000 bits 20000 83.68 µs/op
Binary conversion on 5000 bits 10000 225.66 µs/op
Binary conversion on 10000 bits 5000 439.66 µs/op
Binary conversion on 20000 bits 2000 895.38 µs/op
Binary conversion on 50000 bits 1000 2140.63 µs/op
Binary conversion on 100000 bits 500 4707.48 µs/op


benchmark name iterations average time
Integer conversion on 50 bits 1000000 2.84 µs/op
Integer conversion on 100 bits 200000 9.32 µs/op
Integer conversion on 200 bits 100000 24.09 µs/op
Integer conversion on 500 bits 50000 70.48 µs/op
Integer conversion on 1000 bits 10000 153.59 µs/op
Integer conversion on 2000 bits 5000 346.40 µs/op
Integer conversion on 5000 bits 1000 1106.29 µs/op
Integer conversion on 10000 bits 1000 2902.19 µs/op
Integer conversion on 20000 bits 200 8711.02 µs/op
Integer conversion on 50000 bits 50 41193.44 µs/op
Integer conversion on 100000 bits 10 150411.00 µs/op

Both implementations take an integer as input.

The binary solution quickly becomes more efficient, even with the :binary.encode_unsigned(int) conversion of the input, so I guess there are some cycles to be saved here by providing a binary directly.

As the bench shows, it’s roughly a 20x gain on pretty large data (50,000 bits).

Thanks :wink:

That call should be pretty fast, just a memcpy really. I then just iterate over the index of the binary without ever mutating it, which on the BEAM is ‘pretty dang fast’, as it is highly optimized for specifically this kind of work. ^.^

If there is a function to get a bit of an integer directly (not that I recall one…) then it might be faster, but otherwise I’d think this is the best method as-is?

Yes I guess it should be fast enough for now :))) Thanks :wink:

I’m on mobile, so auto-correction might have kicked the code around…

def bit_set?(i, n) do
  (i &&& (1 <<< n)) != 0
end

One needs to benchmark, though, whether it is really faster than working on a binary/bitstring.

I’m not sure it would be, since these integers can be very long and it would be constructing a new very long integer in a tight loop, one for each bit, thus O(N^2); the binary version has a slight overhead (a single memcpy) at the start but is O(N) after that.

You could make it fast with a NIF that fetched a specific bit from the very long integer, but then you may as well do everything in the NIF; plus the NIF call itself has overhead that would probably make it slower than the binary version anyway.

Benchmark benchmark though! ^.^
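As a starting point for that benchmark, here is a rough sketch comparing the two approaches (module and function names are mine; :timer.tc is only a crude measure next to a proper tool like Benchfella):

```elixir
defmodule BitBench do
  use Bitwise

  # Mask-based scan: test each bit of the (never-mutated) integer with a fresh mask.
  def by_mask(key, size),
    do: for(i <- 0..(size - 1), (key &&& (1 <<< i)) != 0, do: i)

  # Binary-based scan: convert once, then pattern match one bit at a time.
  def by_binary(int) when is_integer(int) do
    bin = :binary.encode_unsigned(int)
    scan(bin, bit_size(bin) - 1, [])
  end

  defp scan(<<>>, _idx, acc), do: acc
  defp scan(<<1::1, rest::bitstring>>, idx, acc), do: scan(rest, idx - 1, [idx | acc])
  defp scan(<<_::1, rest::bitstring>>, idx, acc), do: scan(rest, idx - 1, acc)
end

key = :erlang.bor(:erlang.bsl(1, 50_000), 50_000)

{t_mask, r1} = :timer.tc(fn -> BitBench.by_mask(key, 50_001) end)
{t_bin, r2} = :timer.tc(fn -> BitBench.by_binary(key) end)

# Both scans should agree on which bits are set.
true = Enum.sort(r1) == Enum.sort(r2)
IO.puts("mask: #{t_mask} µs, binary: #{t_bin} µs")
```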

Yes, definitely. I had strange results this afternoon after porting the whole module to binaries. Soooooooo, gonna extract everything and bench. Stay tuned :))


Ok, so here are some results. Obviously there is some problem with the binary bitAnd implementation.
I assume I designed the matching mechanism poorly, leading the BEAM VM to copy far too much data each time (I’m just supposing that, having read the Erlang performance recommendations…)


benchmark name iterations average time
Binary bitAnd on 50 bits 100000 23.36 µs/op
Binary bitAnd on 100 bits 50000 52.74 µs/op
Binary bitAnd on 200 bits 10000 127.56 µs/op
Binary bitAnd on 500 bits 5000 379.15 µs/op
Binary bitAnd on 1000 bits 1000 1423.59 µs/op
Binary bitAnd on 2000 bits 500 6233.36 µs/op
Binary bitAnd on 5000 bits 100 16600.74 µs/op
Binary bitAnd on 10000 bits 50 55308.88 µs/op
Binary bitAnd on 20000 bits 10 190132.70 µs/op
Binary bitAnd on 50000 bits 1 1078100.00 µs/op
Binary bitAnd on 100000 bits 1 4287743.00 µs/op


benchmark name iterations average time
Binary conversion on 50 bits 1000000 2.40 µs/op
Binary conversion on 100 bits 500000 4.60 µs/op
Binary conversion on 200 bits 200000 9.24 µs/op
Binary conversion on 500 bits 100000 21.66 µs/op
Binary conversion on 1000 bits 50000 44.89 µs/op
Binary conversion on 2000 bits 20000 82.13 µs/op
Binary conversion on 5000 bits 10000 226.59 µs/op
Binary conversion on 10000 bits 5000 461.89 µs/op
Binary conversion on 20000 bits 2000 846.78 µs/op
Binary conversion on 50000 bits 1000 2125.82 µs/op
Binary conversion on 100000 bits 500 4420.54 µs/op


benchmark name iterations average time
Binary setBit on 50 bits 10000000 0.59 µs/op
Binary setBit on 100 bits 10000000 0.60 µs/op
Binary setBit on 200 bits 10000000 0.65 µs/op
Binary setBit on 500 bits 10000000 0.76 µs/op
Binary setBit on 1000 bits 1000000 1.07 µs/op
Binary setBit on 2000 bits 1000000 1.30 µs/op
Binary setBit on 5000 bits 1000000 2.17 µs/op
Binary setBit on 10000 bits 1000000 2.72 µs/op
Binary setBit on 20000 bits 500000 4.46 µs/op
Binary setBit on 50000 bits 200000 9.38 µs/op
Binary setBit on 100000 bits 100000 17.54 µs/op


benchmark name iterations average time
Integer bitAnd on 50 bits 10000000 0.25 µs/op
Integer bitAnd on 200 bits 10000000 0.30 µs/op
Integer bitAnd on 500 bits 10000000 0.30 µs/op
Integer bitAnd on 1000 bits 10000000 0.31 µs/op
Integer bitAnd on 100 bits 10000000 0.31 µs/op
Integer bitAnd on 2000 bits 10000000 0.32 µs/op
Integer bitAnd on 5000 bits 10000000 0.36 µs/op
Integer bitAnd on 10000 bits 10000000 0.42 µs/op
Integer bitAnd on 20000 bits 10000000 0.54 µs/op
Integer bitAnd on 50000 bits 10000000 0.87 µs/op
Integer bitAnd on 100000 bits 1000000 1.42 µs/op


benchmark name iterations average time
Integer conversion on 50 bits 1000000 2.83 µs/op
Integer conversion on 100 bits 200000 9.23 µs/op
Integer conversion on 200 bits 100000 23.92 µs/op
Integer conversion on 500 bits 50000 71.62 µs/op
Integer conversion on 1000 bits 10000 154.14 µs/op
Integer conversion on 2000 bits 5000 348.22 µs/op
Integer conversion on 5000 bits 1000 1110.15 µs/op
Integer conversion on 10000 bits 1000 2888.23 µs/op
Integer conversion on 20000 bits 200 8732.87 µs/op
Integer conversion on 50000 bits 50 41148.96 µs/op
Integer conversion on 100000 bits 10 149090.90 µs/op


benchmark name iterations average time
Integer setBit on 50 bits 10000000 0.32 µs/op
Integer setBit on 100 bits 10000000 0.36 µs/op
Integer setBit on 200 bits 10000000 0.40 µs/op
Integer setBit on 500 bits 10000000 0.42 µs/op
Integer setBit on 1000 bits 10000000 0.43 µs/op
Integer setBit on 2000 bits 10000000 0.44 µs/op
Integer setBit on 5000 bits 10000000 0.49 µs/op
Integer setBit on 10000 bits 10000000 0.57 µs/op
Integer setBit on 20000 bits 10000000 0.74 µs/op
Integer setBit on 50000 bits 1000000 1.27 µs/op
Integer setBit on 100000 bits 1000000 2.23 µs/op

Bench available here:

BinaryBitAndBench: Yeah, this is definitely inefficient. Replace the entire bitAnd/2 code in ExBitShift.BitUtilsBinary with this if you want it to take a binary (untested and not run, but spelling errors aside it should be better):

  def bitAnd(bits1, bits2) do
    :erlang.band(:binary.decode_unsigned(bits1), :binary.decode_unsigned(bits2))
    |> :binary.encode_unsigned()
  end

BinaryConversionBench: The original bit of this thread. ^.^

BinarySetBitBench: You could convert from binary to integer and then call the integer version easily enough, but keeping it as a binary, as it does now, should be the fastest binary style. Then again, converting to integer and back might be faster; one would need to benchmark. For note, you could pass the result back as an iolist if you align things right and the caller is fine with that.

IntegerBitAndBench: It is as expected; the binary version should do the same thing.

IntegerConversionBench: As discussed earlier in the thread, it should convert to binary and then operate over that.

IntegerSetBitBench: It is as expected.
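Coming back to BinarySetBitBench, the convert-to-integer-and-back variant mentioned above could be sketched like this (my own function name, not code from the repo):

```elixir
defmodule SetBitSketch do
  use Bitwise

  # Set bit n: binary -> integer -> bor with a mask -> binary.
  # Two copies (decode + encode), but the bor itself is a native bignum op.
  def set_bit(bin, n) when is_binary(bin) and n >= 0 do
    bin
    |> :binary.decode_unsigned()
    |> :erlang.bor(1 <<< n)
    |> :binary.encode_unsigned()
  end
end
```

Beware that :binary.encode_unsigned/1 drops leading zero bytes, so the result can be shorter (or longer) than the input binary; whether that matters depends on how the rest of the module interprets widths.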

And of course turning on HiPE can entirely change many of these results. ^.^

But overall yeah, the results are what I would expect. :slight_smile:

Hi @OvermindDL1

All in all, I guess the easiest (and most efficient) approach is to keep an int as the main internal structure and do the conversion to binary inside list_of_index_of.
Most of my operations are bit set and bit and, and I only have one conversion per call.

Concerning the bitAnd, I came to this solution yesterday :

def bitAnd(bits1, bits2) do
  sz1 = bit_size(bits1)
  <<val1::size(sz1)>> = bits1
  sz2 = bit_size(bits2)
  <<val2::size(sz2)>> = bits2
  :binary.encode_unsigned(val1 &&& val2)
end

(decode_unsigned only accepts byte-aligned binaries)
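For non-aligned bitstrings, one workaround is to left-pad with zero bits up to the next byte boundary before decoding; a sketch (helper name is mine):

```elixir
defmodule AlignSketch do
  # Left-pad a bitstring to a whole number of bytes so that
  # :binary.decode_unsigned/1 accepts it. Zero bits on the left
  # do not change the numeric value.
  def decode_bits(bits) when is_bitstring(bits) do
    pad = rem(8 - rem(bit_size(bits), 8), 8)
    :binary.decode_unsigned(<<0::size(pad), bits::bitstring>>)
  end
end
```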

Until this morning I was wondering why you keep on using the Erlang libs; it finally flashed (ok, that’s not a storm :))): to avoid the Elixir overhead… guess I need more sleep :smiley:

Anyway, will modify the bench, maybe it will help in the near future :slight_smile:

Hi guys,

ok so, after some trials on the bench:

bitAnd on binaries:
converting to an int, using the native band, and encoding back is far more efficient than playing with binary matching (x100k)
setBit on binaries:
the same does not apply here; the binary implementation is more efficient (x1.7)
int implementation:
the clear winner for the setBit and bitAnd operations

Concerning the HiPE optimization, it strangely shows degraded performance in most cases (only the binary implementation of the conversion benefits at all).

GitHub is updated and I’ve also provided the results :wink:


Hmm, I definitely need to understand more in depth when data copying occurs…

I’ve been able to almost halve the time needed to process an int with the following code:

def listOfIndexOf({key, size}), do: listOfIndexOfNB(key, size - 1, [])

defp listOfIndexOfNB(_, index, acc) when index < 0, do: acc

defp listOfIndexOfNB(key, index, acc),
  do: listOfIndexOfNB(key, index - 1,
        updateAcc(:erlang.band(key, :erlang.bsl(1, index)), index, acc))

defp updateAcc(val, index, acc) when val != 0, do: [index | acc]
defp updateAcc(_, _, acc), do: acc

Basically, from what I see, creating a new big mask each time is less expensive than repeatedly shifting the same input…


There is not really any Elixir overhead; I come from Erlang, not the Ruby world, so I just know Erlang’s libs by heart. ^.^

For note, Elixir’s &&& compiles down to :erlang.band, and you can call band directly if you want to save a use Bitwise, but it does not matter either way; same speed. :slight_smile:

HiPE is faster with math in general and slower with some other things. DO NOTE, though: when a module that is not compiled with HiPE calls one that is, or vice versa, there is a fairly substantial speed hit, so be aware.

Probably just because it can quickly zero out most of the allocated memory in the mask, I’d imagine?