Why does the String.valid? :fast_ascii guard clause check 7 bytes at a time?

Looking over how String.valid? implements the :fast_ascii option, what’s the significance of pattern matching in groups of 7 bytes?

when Bitwise.band(0x80808080808080, a) == 0 (source)

I think I understand using Bitwise.band/2 to check the value of the leading bit, but since all ASCII characters are 1 byte, what does checking in groups of 7 bytes accomplish?
Does it have to do with CPU design?

1 Like

On 64-bit systems, a word on the BEAM is 8 bytes. Because the BEAM uses some bits to indicate the type, and binaries are boxed types, the actual available payload in the first word is 7 bytes. Beyond that it ideally could be 8 bytes at a time, but the constant used for the AND limits each batch to 7 bytes.

Doing the computation a word at a time is efficient compared to a byte at a time. Going beyond this by using an 8-byte constant in a BIF, or even SSE instructions, would be faster still, so there appears to be an opportunity for further optimisation.
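
For illustration, here is a rough sketch (made-up module and function names, not the actual stdlib code) of what checking an ASCII run byte-by-byte versus 7 bytes at a time looks like:

    defmodule AsciiRun do
      import Bitwise

      # Byte-at-a-time: one guard test per byte.
      def bytewise?(<<b, rest::binary>>) when band(b, 0x80) == 0, do: bytewise?(rest)
      def bytewise?(<<>>), do: true
      def bytewise?(_), do: false

      # 7-bytes-at-a-time: one guard test covers a whole 56-bit chunk,
      # falling back to the byte-wise clauses for short tails or a chunk
      # that contains a non-ASCII byte.
      def chunked?(<<chunk::56, rest::binary>>) when band(chunk, 0x80808080808080) == 0,
        do: chunked?(rest)

      def chunked?(rest), do: bytewise?(rest)
    end

Both return the same answer; the chunked version simply does roughly a seventh of the guard tests and recursive calls on long ASCII runs.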

6 Likes

Further to this, there is an article that covers fast UTF-8 validation in terms of cycles per byte: Validating UTF-8 bytes using only 0.45 cycles per byte (AVX edition) – Daniel Lemire's blog

There is also the fastest JSON parser here, which can validate UTF-8 and ASCII at 13 GB/s. Its JSON processing is 25x faster than the fastest known C++ implementation.

Based on those numbers, I would say there is room to improve the Elixir/Erlang performance.

3 Likes

Is there an agreed-upon best practice for handling the first iteration differently?

For example, my understanding is that it’s better to use descriptive atoms rather than true and false.

    defp valid_utf8_fast_ascii?(<<a::56, rest::binary>>, :first)
         when Bitwise.band(0x80808080808080, a) == 0 do
      valid_utf8_fast_ascii?(rest, :next)
    end

    defp valid_utf8_fast_ascii?(<<a::64, rest::binary>>, :next) ...

Also, would :erlang.system_info(:wordsize) work to check whether the BEAM is running on 32-bit vs. 64-bit?
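i.e. something along these lines (just a guess at how it might be used):

    # hypothetical: derive the stride from the VM word size
    # (:erlang.system_info(:wordsize) should return 8 on a 64-bit BEAM and 4 on 32-bit)
    stride_bits = (:erlang.system_info(:wordsize) - 1) * 8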

Thanks again for the great answer.

The issue is the “extract the first X bits into a” part, not the representation of the binary, so subsequent iterations should be the same size. Additional discussion:

1 Like

I read through that thread. This gives me all the context I was looking for, plus some. Thanks!

Author of this implementation here.

We can easily test for the ‘common’ case of single-byte UTF-8 code points (i.e. ASCII) by looking at the high bit of each byte to see if it’s clear, ANDing it with 0x80 (i.e. 0b10000000). For contiguous runs of such bytes, we can also check whether ALL of the bytes’ high bits are clear by ANDing them with 0x80808080… in a similar manner.
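
For example, in iex (with illustrative values):

    iex> import Bitwise
    iex> band(?A, 0x80)              # ASCII byte: high bit clear
    0
    iex> band(0xC3, 0x80)            # lead byte of a 2-byte UTF-8 sequence
    128
    iex> <<a::56>> = "abcdefg"       # 7 ASCII bytes as one 56-bit integer
    iex> band(0x80808080808080, a)
    0
    iex> <<a::56>> = "abcdéf"        # still 7 bytes, since "é" encodes as 2 bytes
    iex> band(0x80808080808080, a) == 0
    false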

On a 64-bit system, you would expect the optimal stride to be 8 bytes. However, experimentation doesn’t seem to bear that out, and 7-byte strides seem to be optimal on ARM and x64 platforms (my hunch is that the VM does tagging with some of the higher bits in a word). Details of all this are in the PR thread that got linked above.
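
If you want to reproduce that kind of comparison yourself, a rough sketch looks like this (made-up module name; the measurements for the PR were done more carefully than this):

    defmodule StrideCompare do
      import Bitwise

      # 7-byte stride (what the implementation uses)
      def ascii7?(<<a::56, rest::binary>>) when band(0x80808080808080, a) == 0, do: ascii7?(rest)
      def ascii7?(<<>>), do: true
      def ascii7?(_), do: false

      # 8-byte stride (the "obvious" word-sized alternative)
      def ascii8?(<<a::64, rest::binary>>) when band(0x8080808080808080, a) == 0, do: ascii8?(rest)
      def ascii8?(<<>>), do: true
      def ascii8?(_), do: false
    end

    # input length divisible by both strides, so tail handling can be ignored here
    data = String.duplicate("a", 7 * 8 * 200_000)
    {t7, true} = :timer.tc(fn -> StrideCompare.ascii7?(data) end)
    {t8, true} = :timer.tc(fn -> StrideCompare.ascii8?(data) end)
    IO.inspect({t7, t8}, label: "microseconds, 7-byte vs 8-byte stride")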

7 Likes

It does. 4 bits of every integer are reserved for the tag.
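
Which, doing the arithmetic on a 64-bit VM, would explain why 56 bits is the sweet spot: 64 bits minus 4 tag bits leaves 60 bits of payload, so a 7-byte chunk still fits in an immediate small integer while an 8-byte chunk would not:

    iex> word_bits = 8 * :erlang.system_info(:wordsize)
    64
    iex> payload_bits = word_bits - 4    # 4 bits reserved for the tag
    60
    iex> 56 <= payload_bits              # 7-byte chunk fits in a small integer
    true
    iex> 64 <= payload_bits              # 8-byte chunk would need a bignum
    false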

1 Like

That was a ton of work and it made for a very interesting read. Can’t beat empirical data.