How do I split off the first UTF-8 character of a string and get the remainder?
I do it like this: << first :: utf8, rem :: binary >> = buffer.
Now I see that I will split UTF-8 characters apart that way. Is it possible to get the first grapheme
with a match operation and get the remaining binary?
iex(55)> << x :: utf8, rem :: binary >> = "ä"
"ä"
iex(56)> x
228
iex(57)> << "ä", 0 >>
<<195, 164, 0>>
Yes, but I thought it would be complicated, so I wanted to look up the implementation of normalize/2 and accidentally found String.next_grapheme/1, which does exactly what OP is asking for:
iex(1)> {first, rest} = String.next_grapheme("ä")
{"ä", ""}
iex(2)> i first
Term
"ä"
Data type
BitString
Byte size
2
Description
This is a string: a UTF-8 encoded binary. It's printed surrounded by
"double quotes" because all UTF-8 encoded codepoints in it are printable.
Raw representation
<<195, 164>>
Reference modules
String, :binary
Implemented protocols
IEx.Info, String.Chars, Inspect, Collectable, List.Chars
iex(3)> i rest
Term
""
Data type
BitString
Byte size
0
Description
This is a string: a UTF-8 encoded binary. It's printed surrounded by
"double quotes" because all UTF-8 encoded codepoints in it are printable.
Raw representation
<<>>
Reference modules
String, :binary
Implemented protocols
IEx.Info, String.Chars, Inspect, Collectable, List.Chars
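The difference only shows up when a grapheme is made of more than one codepoint. Here is a small sketch, assuming the "ä" is written in decomposed form (the letter "a" followed by the combining diaeresis U+0308); the variable names are just for illustration:

```elixir
# "ä" written as two codepoints: "a" (U+0061) plus a combining diaeresis (U+0308).
decomposed = "a\u0308"

# Matching on utf8 takes exactly one codepoint, so the combining mark is left behind.
<<cp::utf8, rest_after_cp::binary>> = decomposed
IO.inspect(cp)                         # 97
IO.inspect(rest_after_cp == "\u0308")  # true

# String.next_grapheme/1 keeps the base letter and the combining mark together.
{grapheme, rest_after_grapheme} = String.next_grapheme(decomposed)
IO.inspect(byte_size(grapheme))        # 3
IO.inspect(rest_after_grapheme)        # ""
```

Note that String.next_grapheme/1 returns nil for an empty string, so a loop built on it needs a clause for that case.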
I’m not sure how to explain it without going too deep into how UTF-8 encodes values…
I can only say that 228 is hex 0xe4, which is codepoint U+00e4, which in UTF-8 gets encoded as two bytes, the first being 195 (0xc3) and the second being 164 (0xa4).
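For anyone who wants to see where 195 and 164 come from, the arithmetic can be sketched roughly like this. It only covers the two-byte form of UTF-8 (codepoints U+0080 to U+07FF), and the variable names are made up for the example:

```elixir
import Bitwise

codepoint = 0xE4  # 228, i.e. U+00E4

# Codepoints from U+0080 to U+07FF use the two-byte layout 110xxxxx 10xxxxxx.
first_byte = 0b11000000 ||| (codepoint >>> 6)            # marker bits plus the upper bits
second_byte = 0b10000000 ||| (codepoint &&& 0b00111111)  # marker bits plus the lower 6 bits

IO.inspect({first_byte, second_byte})                # {195, 164}
IO.inspect(<<first_byte, second_byte>> == "\u00E4")  # true
IO.inspect(<<codepoint::utf8>> == <<195, 164>>)      # true
```

The last line shows that the utf8 modifier does the same encoding when building a binary.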
I use that splitting of Unicode binaries in the scanner of my XML lib (elixml), so I am more or less just copying text. In that case I guess it is OK to work on codepoints. Correct me if I’m wrong.
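As a rough check of that assumption: copying a valid UTF-8 binary one codepoint at a time should give back the same bytes, including any combining marks. The module below is a hypothetical helper written for this post, not code from elixml:

```elixir
defmodule CopyByCodepoint do
  # Hypothetical helper: take one codepoint, append it to the accumulator, recurse.
  def copy(<<cp::utf8, rest::binary>>, acc), do: copy(rest, <<acc::binary, cp::utf8>>)
  # Stop when the input is empty and return what was collected.
  def copy(<<>>, acc), do: acc
end

# Decomposed "ä" followed by some markup-like text.
input = "a\u0308 <tag>"
IO.inspect(CopyByCodepoint.copy(input, "") == input)  # true
```

This sketch has no clause for invalid UTF-8, so such input raises a FunctionClauseError. It also says nothing about cases where the scanner has to count or cut "characters" as a user sees them; there graphemes still matter.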