\r\n as a grapheme: is it 1 or 2?

I am assuming this is a Windows thing. Regex seems to treat \r\n as 2 characters, but String treats it as one.

I'm trying to strip whitespace off the front of a string:

iex(17)> regex = ~r/^\s+/
~r/^\s+/
iex(18)> junk = "\r\n\r\nabc"
"\r\n\r\nabc"
iex(19)> Regex.run(regex, junk, return: :index)
[{0, 4}]
iex(20)> String.slice(junk,  4..-1//1)
"c"
iex(21)> String.graphemes(junk)
["\r\n", "\r\n", "a", "b", "c"]
iex(22)>

Am I reading this correctly? (I'm an absolute Elixir noob.) What's the correct way to fix it?

Well … that's odd. I can reproduce it on Linux as well … One thing you can do is convert the string to a list and join it back when you're done. For this you could use a for comprehension with a binary generator, like:

junk = "\r\n\r\nabc"
list = for <<char::utf8 <- junk>>, do: <<char::utf8>>
sliced_list = Enum.slice(list,  4..-1//1)
Enum.join(sliced_list)
# "abc"

I worked out that using String.trim was much simpler, but it's still really odd behaviour. Rust, which uses the same UTF-8 mechanism and distinguishes characters (aka graphemes in Elixir) from bytes, treats \r\n as 2 characters.

Your solution is indeed better, but note that I was replying to your attempt at slicing, using the specific input from your post. Besides for comprehensions and other binary pattern matching with the utf8 modifier, I do not see any other safe way to slice the data the way you asked. :light_bulb:

Edit: I completely forgot that you can also convert a string to a charlist (Erlang's string type, which is simply a plain list of integers, not a binary). It's also not the best solution, but it would work as well, since each character would then be seen separately too.
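
A minimal sketch of the charlist approach, assuming the same `junk` input as above. Note that the offset 4 lines up with the Regex result only because the leading characters are all single-byte ASCII; with multi-byte characters in front, a codepoint offset and a byte offset would differ.

```elixir
junk = "\r\n\r\nabc"

# A charlist is a list of codepoints, so "\r" and "\n" are separate elements.
chars = String.to_charlist(junk)

# Drop the first four codepoints and turn the remainder back into a binary.
rest = chars |> Enum.drop(4) |> List.to_string()

IO.inspect(rest)
```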

I'm no expert on this, but it looks like Unicode Standard Annex #29 (mentioned in the String docs) defines CARRIAGE RETURN and LINE FEED as separate characters, since each is listed in its own row of a table whose first column is named "character". Maybe it's an Elixir bug? :bug:

As the Regex docs mention, :index "returns byte index and match length", not the grapheme index.

This works:

iex> binary_slice("\r\n\r\nabc", 4..-1//1)
"abc"

Bytes != codepoints != graphemes.

iex> junk = "\r\n\r\nabc é"
"\r\n\r\nabc é"
iex> for <<byte <- junk>>, do: <<byte>>
["\r", "\n", "\r", "\n", "a", "b", "c", " ", <<195>>, <<169>>]
iex> String.codepoints(junk)
["\r", "\n", "\r", "\n", "a", "b", "c", " ", "é"]
iex> String.graphemes(junk)
["\r\n", "\r\n", "a", "b", "c", " ", "é"]

The String documentation explains this topic well.
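
Putting the byte-index point into practice, here is a rough sketch of stripping leading whitespace by pairing the Regex result with a byte-based function. The module and function name `LeadingWhitespace.strip/1` are hypothetical and only for illustration; in real code `String.trim_leading/1` is likely the simpler choice.

```elixir
defmodule LeadingWhitespace do
  # Hypothetical helper for illustration, not part of Elixir's standard library.
  # Regex.run/3 with return: :index yields byte offsets, so the remainder is
  # taken with binary_part/3 (byte-based) rather than String.slice/2
  # (grapheme-based).
  def strip(string) when is_binary(string) do
    case Regex.run(~r/^\s+/, string, return: :index) do
      [{start, len}] ->
        offset = start + len
        binary_part(string, offset, byte_size(string) - offset)

      nil ->
        # No leading whitespace: return the input unchanged.
        string
    end
  end
end

IO.inspect(LeadingWhitespace.strip("\r\n\r\nabc"))
```

Because both the match offsets and `binary_part/3` count bytes, the two agree regardless of how the String module groups codepoints into graphemes.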


Oh, that makes sense. I didn't look at the Regex documentation, so I blindly assumed the author expected a grapheme index, and that's what my answer was based on. :sweat_smile:

However this does not change a thing that only String sees \r\n as a single character. String is basically a set of helper functions for UTF8 binaries (hence the utf8 modifier in the generator I proposed). It's a serious gotcha that 2 UTF8 characters are seen as 1, and that happens only in String. :icon_confused:

String is basically a set of helper functions for UTF8 binaries

Exactly, it is the same underlying structure (binaries containing valid UTF-8), but what changes is whether you choose to process it as bytes, codepoints (UTF-8 characters) or graphemes (groups of codepoints which will be displayed as a single character to the human reader).

However this does not change a thing that only String sees \r\n as a single character.

This depends on the function. Many functions of the String module explicitly work with graphemes, such as String.length, String.at, String.first…
String.codepoints, String.next_codepoint, String.to_charlist, and ::utf8 in bitstring patterns all work with codepoints.
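
To make the split concrete, a small sketch using the `junk` string from the original post; the variable names are only illustrative:

```elixir
junk = "\r\n\r\nabc"

# Grapheme-based functions: each "\r\n" pair counts as one unit.
grapheme_count = String.length(junk)
first_grapheme = String.at(junk, 0)

# Codepoint-based functions: "\r" and "\n" are separate.
codepoint_count = length(String.codepoints(junk))
{first_cp, rest} = String.next_codepoint(junk)

# Byte-based: equal to the codepoint count here only because it is all ASCII.
bytes = byte_size(junk)

IO.inspect({grapheme_count, first_grapheme, codepoint_count, first_cp, rest, bytes})
```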

It's a serious gotcha that 2 UTF8 characters are seen as 1

Graphemes and codepoints are arguably a gotcha in any language, so I'm not disagreeing, but I wouldn't pin it on the String library :sweat_smile:


FYI flags are multiple codepoints too.

They are the country code offset by 127397.

iex> for <<cp::utf8 <- "UN">>, do: <<(cp + 127397)::utf8>>, into: ""
"🇺🇳"
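
As a quick check of the flag example, this sketch counts the same string three ways. It assumes the String module groups a pair of regional indicator codepoints into a single grapheme, as the Unicode segmentation rules describe.

```elixir
# Build the flag from the two-letter code by offsetting each letter
# into the regional indicator range.
flag = for <<cp::utf8 <- "UN">>, into: "", do: <<(cp + 127_397)::utf8>>

graphemes = String.length(flag)              # one visible flag
codepoints = length(String.codepoints(flag)) # two regional indicators
bytes = byte_size(flag)                      # four UTF-8 bytes per indicator

IO.inspect({graphemes, codepoints, bytes})
```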