Looking over how String.valid? implements the :fast_ascii option, what’s the significance of pattern matching in groups of 7 bytes?
`when Bitwise.band(0x80808080808080, a) == 0` (source)
I think I understand using Bitwise.band/2 to check the value of the leading bit, but since all ASCII characters are 1 byte, what does checking in groups of 7 bytes do?
Does it have to do with CPU design?
On 64-bit systems, a word on the BEAM is 8 bytes. Because the BEAM uses some bits of a word to tag the type, and binaries are boxed types, the actual available payload in the first word is 7 bytes. Beyond that it ideally should be 8 bytes at a time, but the constant used for the AND limits each batch to 7 bytes.
Doing the computation a word at a time is much more efficient than a byte at a time. Going beyond this, using an 8-byte constant in a BIF, or even SSE instructions, would be faster still, so there appears to be opportunity for further optimisation.
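To make the batching concrete, here’s a minimal sketch (the anonymous function names are mine, not from the stdlib) of the single-byte check versus the 7-bytes-at-once check. A single AND against the repeated 0x80 constant catches a set high bit anywhere in a 56-bit chunk:

```elixir
import Bitwise

# One byte at a time: one AND per byte.
ascii_byte? = fn b -> band(b, 0x80) == 0 end

# Seven bytes at a time: a single AND tests all seven high bits at once.
ascii_chunk? = fn chunk -> band(chunk, 0x80808080808080) == 0 end

true = ascii_byte?.(?A)
false = ascii_byte?.(0xC3)

# "Elixir!" is exactly 7 ASCII bytes.
<<chunk::56>> = "Elixir!"
true = ascii_chunk?.(chunk)

# "é" encodes as two UTF-8 bytes (0xC3 0xA9) with high bits set,
# so the whole chunk fails the batched check.
<<chunk2::56>> = <<"café", 0, 0>>
false = ascii_chunk?.(chunk2)
```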
There is also the fastest JSON parser here, which can validate UTF-8 and ASCII at 13 GB/s; its JSON processing is 25x faster than the previously fastest known C++ implementation.
Based on those numbers, I would say there is room to improve Elixir/Erlang performance.
Is there an agreed upon best practice for doing the first iteration differently?
For example, my understanding is that it’s better to use descriptive atoms other than true and false.
```elixir
defp valid_utf8_fast_ascii?(<<a::56, rest::binary>>, :first)
     when Bitwise.band(0x80808080808080, a) == 0 do
  valid_utf8_fast_ascii?(rest, :next)
end

defp valid_utf8_fast_ascii?(<<a::64, rest::binary>>, :next) ...
```
The issue is the “extract the first X bits into `a`” part, not the representation of the binary, so subsequent iterations should be the same size. Additional discussion:
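For comparison, here is a self-contained, simplified sketch of that shape with a uniform 7-byte stride (the module and function names are mine; the real String.valid?/2 falls back to full UTF-8 validation rather than returning false when a chunk contains non-ASCII bytes):

```elixir
defmodule FastAscii do
  import Bitwise

  # Consume 7 bytes (56 bits) per iteration; one AND tests all
  # seven high bits at once.
  def ascii?(<<a::56, rest::binary>>) when band(0x80808080808080, a) == 0,
    do: ascii?(rest)

  # Tail shorter than 7 bytes (or a chunk that failed the guard above):
  # check one byte at a time.
  def ascii?(<<b, rest::binary>>) when band(b, 0x80) == 0, do: ascii?(rest)
  def ascii?(<<>>), do: true
  def ascii?(_), do: false
end

true = FastAscii.ascii?("hello, fast ascii world")
false = FastAscii.ascii?("héllo")
```

Because every iteration extracts 56 bits into `a`, the extracted integer always fits comfortably in a BEAM small integer, which is the point being made above about keeping all iterations the same size.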
We can easily test for the ‘common’ case of single-byte UTF-8 code points (i.e. ASCII) by looking at the high bit of each byte to see if it’s clear, ANDing it with 0x80 (i.e. 0b10000000). For contiguous runs of such bytes, we can also check if ALL of the bytes’ high bits are clear by ANDing them with 0x80808080… in a similar manner.
On a 64-bit system, you would expect the optimal stride value to be 8 bytes. However, experimentation doesn’t seem to bear that out, and 7-byte strides seem to be optimal on ARM and x64 platforms (my hunch is that the VM does tagging with some of the higher bits in a word). Details of all this are in the PR thread that got linked above.
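As a rough, non-rigorous experiment (the module name and input size are mine; relative results vary by platform and BEAM version), the two stride widths can be compared directly. One consistent observation: a 64-bit unsigned extraction can exceed the BEAM’s small-integer range and force a bignum allocation, which fits the tagging hunch above:

```elixir
defmodule Stride do
  import Bitwise

  # 7-byte (56-bit) stride, the shape used in the stdlib.
  def ascii7?(<<a::56, rest::binary>>) when band(a, 0x80808080808080) == 0,
    do: ascii7?(rest)
  def ascii7?(<<b, rest::binary>>) when b < 128, do: ascii7?(rest)
  def ascii7?(<<>>), do: true
  def ascii7?(_), do: false

  # Hypothetical 8-byte (64-bit) stride for comparison; the extracted
  # integer may no longer fit in a small integer.
  def ascii8?(<<a::64, rest::binary>>) when band(a, 0x8080808080808080) == 0,
    do: ascii8?(rest)
  def ascii8?(<<b, rest::binary>>) when b < 128, do: ascii8?(rest)
  def ascii8?(<<>>), do: true
  def ascii8?(_), do: false
end

data = String.duplicate("a", 1_000_000)
{t7, true} = :timer.tc(fn -> Stride.ascii7?(data) end)
{t8, true} = :timer.tc(fn -> Stride.ascii8?(data) end)
IO.puts("7-byte stride: #{t7}µs  8-byte stride: #{t8}µs")
```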