Optimization challenge: bcrypt in pure Elixir

So I’ve been working on a pure Elixir version of bcrypt for mostly fun and to learn how it works. Currently I have the expensive part (the key setup) running, but I’m missing the call to Blowfish and checking that it actually returns the correct results.

The code is here: bcrypt ($2304082) · Snippets · Snippets · GitLab (it’s too long for this forum post due to the initial dataset)

Currently it takes around 7 seconds on my machine to process a password with difficulty factor 12. Here’s an eprof run with difficulty 8 (only the top results):

#                                                   CALLS     %    TIME µS/CALL
:binary.part/3                                     272916  1.69   54136    0.20
Enum.reduce_range/5                                282472  2.44   77988    0.28
Salakala.Bcrypt.xor/2                              267273  2.57   82112    0.31
:erlang.setelement/3                               797202  7.43  237795    0.30
anonymous fn/5 in Salakala.Bcrypt.expand_key/4     262656  7.60  243322    0.93
Salakala.Bcrypt.encipher/3                         267273 18.06  578222    2.16
Salakala.Bcrypt.f/2                               4276368 25.29  809543    0.19
Salakala.Bcrypt.blfrnd/5                          4276368 34.59 1107212    0.26

What I have done so far:

  1. I started with :arrays for the s/p data structures but that was very slow. Changing them to tuples has provided the most speed benefit. The values are mostly read so tuples have very good access performance.
  2. I started with binaries but it turned out to be faster to use integers for everything and only use binaries when integers would be cumbersome. For example Bitwise.bxor is much faster than :crypto.exor or my own binary xor implementation (which is still faster than :crypto.exor o_O).

Now, I know this is a useless effort in that it will always be much much slower than a NIF. This started as a curiosity project when I had problems deploying my software with burrito due to bcrypt_elixir NIF, I wondered exactly how much slower it would be. Turns out if we could make it 7 times faster than it is now, it would actually be kind of there for some usages. But I’m not saying that we’ll get a good general purpose bcrypt from this. It’s just for fun and learning how bcrypt works.

Oh, and as said, it’s still missing the 64 iterations of Blowfish once the key has been formed. :crypto does have Blowfish but I haven’t looked yet into how it should be used.

So, if you’re looking for something to waste your time on honing your BEAM optimisation skills, here’s a project for you. :slight_smile: I must warn that if it turns out even remotely good, I’d like to publish and license it as BSD/MIT/similar, so keep that in mind if you don’t want to share your code like that.

12 Likes

A little progress. There’s this function:

  def f(s, x) do
    <<
      b4::unsigned-integer-8,
      b3::unsigned-integer-8,
      b2::unsigned-integer-8,
      b1::unsigned-integer-8
    >> = <<x::32>>

    b4_i = elem(elem(s, 0), b4)
    b3_i = elem(elem(s, 1), b3)
    b2_i = elem(elem(s, 2), b2)
    b1_i = elem(elem(s, 3), b1)

    Bitwise.bxor(b4_i + b3_i, b2_i + b1_i)
  end

Along with blfrnd it’s called a lot of times. I was annoyed by that casting required to split the 4 bytes out of the 32-bit value. So with the help of the Elixir Discord, I changed it to this which is somewhat faster and now the time for difficulty 12 is around 5 seconds:

  def f(s, x) when x <= 4_294_967_296 do
    b1 = Bitwise.band(x, 0xFF)
    b2 = x |> Bitwise.bsr(8) |> Bitwise.band(0xFF)
    b3 = x |> Bitwise.bsr(16) |> Bitwise.band(0xFF)
    b4 = x |> Bitwise.bsr(24)

    b4_i = elem(elem(s, 0), b4)
    b3_i = elem(elem(s, 1), b3)
    b2_i = elem(elem(s, 2), b2)
    b1_i = elem(elem(s, 3), b1)

    Bitwise.bxor(Bitwise.band(b4_i + b3_i, 0xFFFFFFFF), Bitwise.band(b2_i + b1_i, 0xFFFFFFFF))
  end

(You may see there’s also a slight logic change to match the original code’s overflow mechanics.)

My next thought was to convert f/2 and blfrnd/5 to macros see if there’s any function call overhead that could be removed (they’re called over 4 million times for a difficulty 8 hashing).

6 Likes

I did try converting f/2 and blfrnd/5 to macros, but it didn’t have any noticeable impact on the runtime. At this point I ran out of ideas to try and thus around 5 seconds was the last result. This is too slow for general usage but at least I had fun.

Let me know if you come up with ideas to make it faster. :slight_smile: Ideas that don’t involve NIFs!

I haven’t tried it in OTP 25 yet, to see if there are any JIT improvements.

4 Likes