Split off the first UTF char of a string

marcuslankenau · August 27, 2017, 5:50pm

How do I split off the first UTF-8 character of a string and get the remainder?

I do it like << first :: utf8, rem :: binary >> = buffer.
Now I see that I split UTF-8 characters that way. Is it possible to get the first grapheme
with a match operation and get the remaining binary?

iex(55)> << x :: utf8, rem :: binary >> = "ä"
"ä"
iex(56)> x
228
iex(57)> << "ä", 0 >>
<<195, 164, 0>>

idi527 · August 27, 2017, 6:13pm

What exactly don’t you like about the code that you’ve provided?

iex(2)> <<x::utf8, rest::bytes>> = "ä"
"ä"
iex(3)> rest
""
iex(4)> <<x::utf8>>
"ä"

LostKobrakai · August 27, 2017, 6:20pm

He probably meant this:

iex(1)> string = "\u0065\u0301"
"é"
iex(2)> <<x::utf8, rest::bytes>> = string
"é"
iex(3)> x
101
iex(4)> rest
"́"

But I doubt that graphemes can be matched via binary matches. I’d expect extracting them to be more involved than what pattern matching can do.

NobbZ · August 27, 2017, 6:26pm

For a grapheme there are two ways…

iex(1)> String.graphemes "ä"
["ä"]
iex(2)> first = hd(v(1))
"ä"
iex(3)> rest = tl(v(1)) |> Enum.join
""

The other one requires normalising the string beforehand, and it only works when NFC composes the grapheme into a single codepoint:

iex(1)> <<first_graph::utf8, rest::binary>> = String.normalize("ä", :nfc)
"ä"
iex(2)> first = <<first_graph::utf8>>
"ä"
iex(3)> rest
""

Also, the integer 228 is the codepoint of ä; its UTF-8 encoding is the byte sequence <<195, 164>>.
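The normalisation approach has a limit worth seeing in code: not every grapheme has a precomposed form, so NFC cannot always reduce it to one codepoint. The following is a minimal sketch, using "x" plus a combining acute accent as an example chosen for illustration (it is not from the thread).

```elixir
# "x" followed by U+0301 (combining acute accent) has no precomposed codepoint,
# so NFC normalisation leaves it as two codepoints.
string = "x\u0301"
normalized = String.normalize(string, :nfc)

# The codepoint match therefore still splits the grapheme.
<<first_cp::utf8, rest::binary>> = normalized
codepoint_split = {<<first_cp::utf8>>, rest}

# One grapheme, two codepoints.
counts = {String.length(normalized), length(String.codepoints(normalized))}

IO.inspect(codepoint_split)
IO.inspect(counts)
```

Here the match returns the bare "x" and leaves the accent in the remainder, which is the same split the normalisation was supposed to avoid.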

NobbZ · August 27, 2017, 6:27pm

This gives the first codepoint, not the first grapheme!

Try i rest in iex after your example: its raw representation is <<204, 129>>, which encodes the combining acute accent (U+0301), that funny little stroke above the e.

Qqwy · August 27, 2017, 6:41pm

Both ways you mentioned process the whole string. Is there a way to get the first grapheme without running over the string as a whole?

NobbZ · August 27, 2017, 6:49pm

Yes, but I thought it would be complicated. I wanted to look up normalize/2’s implementation and accidentally found String.next_grapheme/1, which does exactly what the OP is asking for:

iex(1)> {first, rest} = String.next_grapheme("ä")
{"ä", ""}
iex(2)> i first
Term
  "ä"
Data type
  BitString
Byte size
  2
Description
  This is a string: a UTF-8 encoded binary. It's printed surrounded by
  "double quotes" because all UTF-8 encoded codepoints in it are printable.
Raw representation
  <<195, 164>>
Reference modules
  String, :binary
Implemented protocols
  IEx.Info, String.Chars, Inspect, Collectable, List.Chars
iex(3)> i rest
Term
  ""
Data type
  BitString
Byte size
  0
Description
  This is a string: a UTF-8 encoded binary. It's printed surrounded by
  "double quotes" because all UTF-8 encoded codepoints in it are printable.
Raw representation
  <<>>
Reference modules
  String, :binary
Implemented protocols
  IEx.Info, String.Chars, Inspect, Collectable, List.Chars
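Since "ä" typed directly is usually a single precomposed codepoint, it does not show the difference between a codepoint and a grapheme. A small sketch with the decomposed "e" plus U+0301 from earlier in the thread, followed by some extra text (the "rest" suffix is just an illustrative value), makes the contrast visible:

```elixir
# "e" + combining acute accent, followed by more text.
string = "e\u0301rest"

# next_codepoint/1 splits after the first codepoint only.
by_codepoint = String.next_codepoint(string)

# next_grapheme/1 keeps the base letter and its combining mark together.
by_grapheme = String.next_grapheme(string)

# Both return nil when the string is empty.
empty_result = String.next_grapheme("")

IO.inspect(by_codepoint)
IO.inspect(by_grapheme)
IO.inspect(empty_result)
```

Both functions hand back the untouched remainder as a binary, so they can be called repeatedly to walk a string one step at a time.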

marcuslankenau · August 28, 2017, 4:48am

I don’t understand that. It would be cool if you could explain it.

I use next_grapheme now and it works. Thx

NobbZ · August 28, 2017, 5:47am

I’m not sure how to explain it without going too deep into how UTF-8 encodes values…

I can only say that 228 is hex 0xe4, which is codepoint U+00e4, which in UTF-8 gets encoded as two bytes, the first being 195 (0xc3) and the second being 164 (0xa4).

iex(1)> <<228::utf8>> === <<195, 164>>
true
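For anyone who wants to see where 195 and 164 come from, the two-byte UTF-8 layout can be reproduced by hand. This is only a sketch for codepoints in the range U+0080 to U+07FF; the variable names are made up for illustration.

```elixir
import Bitwise

codepoint = 228 # U+00E4, "ä"

# Two-byte UTF-8 layout: 110xxxxx 10xxxxxx
# The lead byte carries the upper 5 bits, the continuation byte the lower 6 bits.
lead = 0b11000000 ||| (codepoint >>> 6)
continuation = 0b10000000 ||| (codepoint &&& 0b00111111)

manual = <<lead, continuation>>

IO.inspect({lead, continuation})
IO.inspect(manual == <<codepoint::utf8>>)
```

The ::utf8 modifier does this bit shuffling (and the one-, three- and four-byte variants) automatically, which is why x comes back as the integer 228 rather than as two separate bytes.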

marcuslankenau · August 28, 2017, 6:33am

Ok, got it. Now I see my mistake. I was too stupid to read idi527’s response. My mistake was
that I did not put the ::utf8 when using the grapheme.

Thx a lot.

NobbZ · August 28, 2017, 6:36am

Remember, @idi527’s example only takes the first codepoint, not the first grapheme!

marcuslankenau · August 28, 2017, 6:52am

@NobbZ thx for clarifying that. After reading parts of Codepoints vs Grapheme I think I got the difference.

I use this splitting of Unicode binaries in the scanner of my XML lib (elixml). I am more or less just copying text, so in that case I guess it is OK to work on codepoints. Correct me if I’m wrong.
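A grapheme is just a sequence of codepoints, so copying every codepoint in order reproduces the graphemes unchanged. The sketch below illustrates that for a scanner-like loop. The elixml scanner itself is not shown in this thread, so the module ScanSketch and the function take_text/1 are hypothetical names, and stopping at "<" is an assumed rule chosen only for the example.

```elixir
defmodule ScanSketch do
  # Hypothetical helper: copies text up to the next "<", one codepoint at a time.
  # Invalid UTF-8 input matches no clause and raises a FunctionClauseError.
  def take_text(buffer), do: do_take_text(buffer, "")

  # Stop at the start of a tag and return it untouched.
  defp do_take_text(<<"<", _::binary>> = rest, acc), do: {acc, rest}

  # Copy one codepoint to the accumulator and continue.
  defp do_take_text(<<cp::utf8, rest::binary>>, acc),
    do: do_take_text(rest, <<acc::binary, cp::utf8>>)

  # End of input.
  defp do_take_text(<<>>, acc), do: {acc, ""}
end

# The decomposed "e" + U+0301 survives the copy byte for byte.
{text, rest} = ScanSketch.take_text("e\u0301t\u00E9<b/>")
IO.inspect({text, rest})
```

Working on codepoints only becomes a problem when the output is cut or inspected at a position that may fall between a base character and its combining marks, for example when truncating text to a fixed number of characters.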