I am trying to compare the first character in a very long input string (like the text of a book) to some arbitrary character in a shorter string. The algorithm has an index for a grapheme in the shorter string, and just wants to check equality: is the first character of my long string equal to that grapheme or not? I am trying to avoid doing something expensive under the hood, like calling String.graphemes/1 on the longer string. But I am having trouble figuring out the best way to compare the codepoint I get from binary pattern matching to the grapheme I get from other String functions.
To give a simple example:
def my_func(<<first::utf8, _rest::binary>> = _long_string) do
  last = String.at("string", 5)
  first == last
end
my_func("green")
# => false
I understand that basically this function is doing:
?g == "g"
so it makes sense that it returns false. But I am struggling to find the most efficient way to make this comparison.
I can do something like:
def my_func(<<first::utf8, _rest::binary>> = _long_string) do
  [last] = String.to_charlist(String.at("string", 5))
  first == last
end
my_func("green")
# => true
It works, but it feels like I should be able to do better. What I wanted to do was use the codepoint operator on the variable, like first == ?last, but it didn’t take long for me to understand that won’t fly.
My impression is that String.next_grapheme/1 and friends are going to process the whole string just to give me one value. Is that the case? Seems like using Stream won’t help if it requires me to process the long string into an enumerable first.
Should I stick with my second method above, or is there a better way?
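For what it's worth, here is a minimal sketch of one option, not a definitive answer. String.next_grapheme/1 returns a {grapheme, rest} tuple (or nil for an empty string), so it only needs to look at the start of the binary and hands the remainder back unprocessed. The module and function names (FirstGrapheme, matches_at?/3) are made up for illustration.

```elixir
defmodule FirstGrapheme do
  # Hypothetical helper: does the first grapheme of `long` equal the
  # grapheme at `index` in `short`?
  def matches_at?(long, short, index) do
    case String.next_grapheme(long) do
      # Only the leading grapheme is extracted; `_rest` is the remainder.
      {first, _rest} -> first == String.at(short, index)
      # An empty string has no first grapheme.
      nil -> false
    end
  end
end

IO.inspect(FirstGrapheme.matches_at?("green", "string", 5))
# => true
IO.inspect(FirstGrapheme.matches_at?("blue", "string", 5))
# => false
```

Both sides of the comparison are binaries here, so no conversion to a charlist is needed.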
OK, this is fascinating. It happens that I might need to match at any point in the string, so String.ends_with?/2 might not work, but I am much more interested in what you’ve done with the variable by wrapping it in <<>>.
This is kind of crazy. It obviously solves my problem, but what is happening here? I clearly don’t fully understand what Elixir strings are. Is Elixir actually converting between a charlist and a binary when I add <<>>? Or is it simply that no conversion is needed, because a charlist wrapped in <<>> is all a binary is? Is there a difference between what I’ve done here and calling Kernel.to_string/1?
Charlists and binaries don’t really have things in common. Your code just happens to extract the codepoint for ?s from the binary. If you segmented by a different type, things might look different:
Binaries are just a sequence of bytes. To give them meaning, you need to know how to interpret those bytes as something higher level. Charlists, on the other hand, are always a list of integers matching the spec [0..0x10FFFF].
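As a rough illustration of segmenting by different types (a sketch with made-up names, SegmentDemo and its functions are not from any library), the same leading bytes can be read as a UTF-8 codepoint, a single byte, or a 16-bit integer:

```elixir
defmodule SegmentDemo do
  # Hypothetical helpers: each one reads the start of the same binary
  # using a different segment type.

  # Decode the first UTF-8 encoded codepoint.
  def as_utf8(<<cp::utf8, _rest::binary>>), do: cp

  # Read only the first byte (the default segment is an 8-bit integer).
  def as_byte(<<byte, _rest::binary>>), do: byte

  # Read the first two bytes as one 16-bit big-endian integer.
  def as_16_bits(<<int::16, _rest::binary>>), do: int
end

# For an ASCII character, the byte and the codepoint coincide.
IO.inspect(SegmentDemo.as_utf8("s"))           # => 115
IO.inspect(SegmentDemo.as_byte("s"))           # => 115

# "\u00E9" (é) is encoded as the two bytes 0xC3 0xA9 in UTF-8.
IO.inspect(SegmentDemo.as_utf8("\u00E9"))      # => 233
IO.inspect(SegmentDemo.as_byte("\u00E9"))      # => 195
IO.inspect(SegmentDemo.as_16_bits("\u00E9"))   # => 50089
```

The bytes never change; only the interpretation requested in the pattern does.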
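To make the distinction concrete, here is a small sketch (the variable names are just for illustration). It also suggests that wrapping a codepoint in <<cp::utf8>> is not a charlist conversion: it builds a new binary by UTF-8 encoding that one integer, which is different from what to_string/1 does with a bare integer.

```elixir
string = "h\u00E9"  # "hé"

# The binary holds three bytes, because é takes two bytes in UTF-8.
IO.inspect(byte_size(string))                               # => 3
IO.inspect(:binary.bin_to_list(string) == [104, 195, 169])  # => true

# The charlist holds two integers, one per codepoint.
IO.inspect(String.to_charlist(string) == [104, 233])        # => true

# <<cp::utf8>> builds a binary by UTF-8 encoding a single codepoint.
cp = ?g
IO.inspect(<<cp::utf8>> == "g")                             # => true
IO.inspect(<<233::utf8>> == <<195, 169>>)                   # => true

# to_string/1 on a bare integer gives its decimal digits, while on a
# charlist it encodes the codepoints.
IO.inspect(to_string(cp) == "103")                          # => true
IO.inspect(to_string([cp]) == "g")                          # => true
```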