Best way to compare graphemes to codepoints from a binary pattern match

I am trying to compare the first character in a very long input string (like the text of a book) to some arbitrary character in a shorter string. The algorithm has an index for a grapheme in the shorter string, and just wants to check equality: is the first character of my long string equal to the grapheme or not? I am trying to avoid doing something expensive under the hood, like calling String.graphemes/1 on the longer string. But I am having trouble figuring out the best way to compare the codepoint I get from binary pattern matching to the grapheme I get from other String functions.

To give a simple example:

def my_func(<<first::utf8, _rest::binary>> = _long_string) do
  last = String.at("string", 5)
  first == last
end

# => false

I understand that basically this function is doing:

?g == "g"

so it makes sense that it returns false. But I am struggling to find the most efficient way to make this comparison.

I can do something like:

def my_func(<<first::utf8, _rest::binary>> = _long_string) do
  [last] = String.to_charlist(String.at("string", 5))
  first == last
end

# => true

It works - but it feels like I should be able to do better :slight_smile:. What I wanted to do was use the codepoint operator on the variable, like first == ?last, but it didn’t take long for me to understand that won’t fly.

My impression is that String.next_grapheme/1 and friends are going to process the whole string just to give me one value. Is that the case? Seems like using Stream won’t help if it requires me to process the long string into an enumerable first.

Should I stick with my second method above, or is there a better way?


@Dusty How about replacing your function body with this code:

String.ends_with?("string", <<first::utf8>>)
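In context, a minimal sketch of this suggestion (reusing the hypothetical my_func wrapper and example strings from the question):

```elixir
defmodule Matcher do
  # Re-encode the first codepoint of the long string as a binary,
  # then ask whether "string" ends with that one-grapheme binary.
  def my_func(<<first::utf8, _rest::binary>> = _long_string) do
    String.ends_with?("string", <<first::utf8>>)
  end
end

Matcher.my_func("giant book text...")   # => true ("string" ends with "g")
Matcher.my_func("another book text...") # => false
```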

OK, this is fascinating. It happens that I might need to match at any point in the string, so String.ends_with?/2 might not work, but I am much more interested in what you’ve done with the variable by wrapping it in <<>>.

iex(1)> <<first::utf8, rest::binary>> = "string"
"string"
iex(2)> first
115
iex(3)> <<first::utf8>>
"s"
iex(4)> first == "s"
false
iex(5)> <<first::utf8>> == "s"
true
iex(6)> ?s
115
iex(7)> <<?s>>
"s"

This is kind of crazy. It obviously solves my problem, but what is happening here? I clearly don’t fully understand what Elixir strings are. Is Elixir actually converting between a charlist and a binary when I add <<>>? Or is it simply that no conversion is needed, because a charlist wrapped in <<>> is all a binary is? Is there a difference between what I’ve done here and calling Kernel.to_string/1?
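For reference, here is what Kernel.to_string/1 gives me on the same values (my own iex exploration, so take it with a grain of salt):

```elixir
first = ?s                             # 115, an integer codepoint
to_string([first])                     # => "s"   (a charlist converts via String.Chars)
to_string(first)                       # => "115" (a bare integer converts to its decimal digits!)
<<first::utf8>> == to_string([first])  # => true
```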

@Dusty Simply take a look at Kernel.SpecialForms.<<>>/1 documentation.

The utf8, utf16, and utf32 types are for Unicode code points. They can also be applied to literal strings and charlists:

iex> <<"foo"::utf16>>
<<0, 102, 0, 111, 0, 111>>
iex> <<"foo"::utf32>>
<<0, 0, 0, 102, 0, 0, 0, 111, 0, 0, 0, 111>>

Charlists and binaries don’t really have much in common. Your code just happens to extract the codepoint for ?s from the binary. If you segmented by a different type, things would look different:

iex(16)> IO.inspect("string", base: :binary)
<<0b1110011, 0b1110100, 0b1110010, 0b1101001, 0b1101110, 0b1100111>>
iex(17)> IO.inspect("string", base: :hex)
<<0x73, 0x74, 0x72, 0x69, 0x6E, 0x67>>

Binaries are just a sequence of bytes. To apply meaning, you need to know how to interpret those bytes as something higher level. Charlists, on the other hand, are always a list of integers in the range the spec allows for codepoints: [0..0x10FFFF].
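To make the difference concrete, here is a sketch with a multi-byte character (ü, codepoint 252, which UTF-8 encodes as two bytes):

```elixir
<<first::utf8, _rest::binary>> = "über"

first                       # => 252, the codepoint for "ü"
<<first::utf8>>             # => "ü", re-encoded as the two bytes <<195, 188>>
byte_size(<<first::utf8>>)  # => 2
String.to_charlist("über")  # => [252, 98, 101, 114]
<<first>>                   # => <<252>>, a single raw byte - NOT valid UTF-8 on its own
```

So <<first::utf8>> is not "wrapping a charlist": it builds a fresh binary by UTF-8-encoding the integer codepoint, which is why it compares equal to the grapheme.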