\r\n as a grapheme: is it 1 or 2?

pm100 · July 27, 2025, 12:36am

I am assuming this is a Windows thing. Regex seems to treat \r\n as 2 characters, but String treats it as one.

I'm trying to strip whitespace off the front of a string:

iex(17)> regex = ~r/^\s+/
~r/^\s+/
iex(18)> junk = "\r\n\r\nabc"
"\r\n\r\nabc"
iex(19)> Regex.run(regex, junk, return: :index)
[{0, 4}]
iex(20)> String.slice(junk, 4..-1//1)
"c"
iex(21)> String.graphemes(junk)
["\r\n", "\r\n", "a", "b", "c"]

Am I reading this correctly? (I'm an absolute Elixir noob.) What's the correct way to fix it?

Eiji · July 27, 2025, 1:00am

Well … that’s odd. I can reproduce it on Linux as well … One thing you can do is convert the string to a list and join it back when you’re done. For this you could use a for comprehension like:

junk = "\r\n\r\nabc"
# Split into codepoints (not graphemes), so "\r" and "\n" stay separate
list = for <<char::utf8 <- junk>>, do: <<char::utf8>>
sliced_list = Enum.slice(list, 4..-1//1)
Enum.join(sliced_list)
# "abc"

pm100 · July 27, 2025, 1:24am

I worked out that using String.trim was much simpler, but it's still really odd behaviour. Rust, which uses the same UTF-8 mechanism and distinguishes characters (aka graphemes in Elixir) from bytes, treats \r\n as 2 characters.
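For reference, a minimal sketch of the trimming approach mentioned above. It uses String.trim_leading/1 rather than String.trim/1 (an assumption about the intent, since only the front of the string needed stripping), and the variable names are only illustrative:

```elixir
junk = "\r\n\r\nabc"

# String.trim_leading/1 removes leading Unicode whitespace, so no index is needed.
trimmed = String.trim_leading(junk)
IO.inspect(trimmed)
# "abc"
```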

Eiji · July 27, 2025, 2:04am

Your solution is indeed better, but note that I replied to your attempt at slicing, where the input was explicitly given in your post. Besides for comprehensions and other binary pattern matching with the utf8 modifier, I do not see any other safe way to slice the data as you asked.

Edit: I completely forgot that you can also convert a string to a charlist (Erlang’s string type, which is simply a plain list of integers, not a binary). It’s also not the best solution, but it would work as well, since each character would be seen separately too.

I’m not an expert in this, but it looks like Unicode Standard Annex #29 (mentioned in the String docs) defines CARRIAGE RETURN and LINE FEED as separate characters, as they are listed in their own rows where the first column is named “character”. Maybe it’s an Elixir bug?
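The charlist idea could look roughly like the sketch below. The offset of 4 is taken from the earlier Regex result and is assumed to line up with the codepoint count only because the skipped characters are all single-byte ASCII:

```elixir
junk = "\r\n\r\nabc"

# A charlist is a list of codepoints, so "\r" and "\n" are separate elements.
chars = String.to_charlist(junk)

# Dropping 4 elements matches the regex result here only because the
# skipped characters are all single-byte ASCII (assumption for this input).
rest = chars |> Enum.drop(4) |> List.to_string()
IO.inspect(rest)
# "abc"
```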

sabiwara · July 27, 2025, 2:09am

As the Regex docs mention, :index “returns byte index and match length”, not the grapheme index.

This works:

iex> binary_slice("\r\n\r\nabc", 4..-1//1)
"abc"
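A slightly more general sketch, assuming you want to reuse whatever offset the regex reports instead of hard-coding 4. It relies only on Regex.run/3, byte_size/1 and binary_part/3; the variable names are illustrative:

```elixir
junk = "\r\n\r\nabc é"

# With return: :index, each tuple is {byte_offset, byte_length}.
[{start, len}] = Regex.run(~r/^\s+/, junk, return: :index)

# Take everything after the match, measured in bytes.
offset = start + len
rest = binary_part(junk, offset, byte_size(junk) - offset)
IO.inspect(rest)
# "abc é"
```

Because both the offset and the slice are measured in bytes, the multi-byte "é" later in the string does not throw the result off.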

Bytes != codepoints != graphemes.

iex> junk = "\r\n\r\nabc é"
"\r\n\r\nabc é"
iex> for <<byte <- junk>>, do: <<byte>>
["\r", "\n", "\r", "\n", "a", "b", "c", " ", <<195>>, <<169>>]
iex> String.codepoints(junk)
["\r", "\n", "\r", "\n", "a", "b", "c", " ", "é"]
iex> String.graphemes(junk)
["\r\n", "\r\n", "a", "b", "c", " ", "é"]

The String documentation explains this topic well.
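To make the difference concrete, here is a small sketch that counts the same string three ways. The "é" is written as the escape \u00E9 so the example assumes the precomposed form (a single codepoint encoded as two bytes):

```elixir
junk = "\r\n\r\nabc \u00E9"

bytes = byte_size(junk)
codepoints = junk |> String.codepoints() |> length()
graphemes = String.length(junk)

IO.inspect({bytes, codepoints, graphemes})
# {10, 9, 7}
```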

Eiji · July 27, 2025, 2:18am

Oh, that makes sense. I didn’t look at the Regex documentation, so I blindly assumed the author expected a grapheme index, and that’s what my answer was based on.

However, this does not change the fact that only String sees \r\n as a single character. String is basically a set of helper functions for UTF8 binaries (hence the utf8 modifier in the generator I proposed). It’s a serious gotcha that 2 UTF8 characters are seen as 1, and that happens only in String.

sabiwara · July 27, 2025, 2:55am

String is basically a set of helper functions for UTF8 binaries

Exactly, it is the same underlying structure (binaries containing valid UTF-8), but what changes is whether you choose to process it as bytes, codepoints (UTF-8 characters), or graphemes (groups of codepoints which are displayed as a single character to the human reader).

However, this does not change the fact that only String sees \r\n as a single character.

This depends on the function. Many functions of the String module explicitly work with graphemes, such as String.length, String.at, String.first…
String.codepoints, String.next_codepoint, String.to_charlist, and ::utf8 in bitstring patterns all work with codepoints.
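A quick sketch of that split, using a short illustrative string. The functions are the ones from the String module listed above (the codepoint stepping function is String.next_codepoint/1):

```elixir
s = "\r\nabc"

# Grapheme-based functions treat "\r\n" as one unit.
IO.inspect(String.length(s))   # 4
IO.inspect(String.first(s))    # "\r\n"
IO.inspect(String.at(s, 1))    # "a"

# Codepoint-based functions see "\r" and "\n" separately.
IO.inspect(String.next_codepoint(s))        # {"\r", "\nabc"}
IO.inspect(length(String.codepoints(s)))    # 5
IO.inspect(length(String.to_charlist(s)))   # 5
```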

It’s a serious gotcha that 2 UTF8 characters are seen as 1

Graphemes and codepoints are arguably a gotcha in any language, so I’m not disagreeing, but I wouldn’t pin it on the String library.

adamu · July 27, 2025, 3:01pm

FYI flags are multiple codepoints too.

They are the country code offset by 127397.

iex> for <<cp::utf8 <- "UN">>, do: <<(cp + 127397)::utf8>>, into: ""
"🇺🇳"
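Going the other way is a matter of subtracting the same offset. The sketch below also counts the flag at each level, assuming the flag is the two regional indicator codepoints produced above:

```elixir
flag = "🇺🇳"

# Subtract the same offset from each regional indicator codepoint.
code = for <<cp::utf8 <- flag>>, into: "", do: <<(cp - 127_397)::utf8>>
IO.inspect(code)
# "UN"

# One grapheme, two codepoints, eight bytes.
IO.inspect({String.length(flag), length(String.codepoints(flag)), byte_size(flag)})
# {1, 2, 8}
```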

https://gist.github.com/adamu/d9770af3265351ae866d9d5ffffa2a76

flags.csv

🇦🇦,🇦🇧,🇦🇨,🇦🇩,🇦🇪,🇦🇫,🇦🇬,🇦🇭,🇦🇮,🇦🇯,🇦🇰,🇦🇱,🇦🇲,🇦🇳,🇦🇴,🇦🇵,🇦🇶,🇦🇷,🇦🇸,🇦🇹,🇦🇺,🇦🇻,🇦🇼,🇦🇽,🇦🇾,🇦🇿,🇧🇦,🇧🇧,🇧🇨,🇧🇩,🇧🇪,🇧🇫,🇧🇬,🇧🇭,🇧🇮,🇧🇯,🇧🇰,🇧🇱,🇧🇲,🇧🇳,🇧🇴,🇧🇵,🇧🇶,🇧🇷,🇧🇸,🇧🇹,🇧🇺,🇧🇻,🇧🇼,🇧🇽,🇧🇾,🇧🇿,🇨🇦,🇨🇧,🇨🇨,🇨🇩,🇨🇪,🇨🇫,🇨🇬,🇨🇭,🇨🇮,🇨🇯,🇨🇰,🇨🇱,🇨🇲,🇨🇳,🇨🇴,🇨🇵,🇨🇶,🇨🇷,🇨🇸,🇨🇹,🇨🇺,🇨🇻,🇨🇼,🇨🇽,🇨🇾,🇨🇿,🇩🇦,🇩🇧,🇩🇨,🇩🇩,🇩🇪,🇩🇫,🇩🇬,🇩🇭,🇩🇮,🇩🇯,🇩🇰,🇩🇱,🇩🇲,🇩🇳,🇩🇴,🇩🇵,🇩🇶,🇩🇷,🇩🇸,🇩🇹,🇩🇺,🇩🇻,🇩🇼,🇩🇽,🇩🇾,🇩🇿,🇪🇦,🇪🇧,🇪🇨,🇪🇩,🇪🇪,🇪🇫,🇪🇬,🇪🇭,🇪🇮,🇪🇯,🇪🇰,🇪🇱,🇪🇲,🇪🇳,🇪🇴,🇪🇵,🇪🇶,🇪🇷,🇪🇸,🇪🇹,🇪🇺,🇪🇻,🇪🇼,🇪🇽,🇪🇾,🇪🇿,🇫🇦,🇫🇧,🇫🇨,🇫🇩,🇫🇪,🇫🇫,🇫🇬,🇫🇭,🇫🇮,🇫🇯,🇫🇰,🇫🇱,🇫🇲,🇫🇳,🇫🇴,🇫🇵,🇫🇶,🇫🇷,🇫🇸,🇫🇹,🇫🇺,🇫🇻,🇫🇼,🇫🇽,🇫🇾,🇫🇿,🇬🇦,🇬🇧,🇬🇨,🇬🇩,🇬🇪,🇬🇫,🇬🇬,🇬🇭,🇬🇮,🇬🇯,🇬🇰,🇬🇱,🇬🇲,🇬🇳,🇬🇴,🇬🇵,🇬🇶,🇬🇷,🇬🇸,🇬🇹,🇬🇺,🇬🇻,🇬🇼,🇬🇽,🇬🇾,🇬🇿,🇭🇦,🇭🇧,🇭🇨,🇭🇩,🇭🇪,🇭🇫,🇭🇬,🇭🇭,🇭🇮,🇭🇯,🇭🇰,🇭🇱,🇭🇲,🇭🇳,🇭🇴,🇭🇵,🇭🇶,🇭🇷,🇭🇸,🇭🇹,🇭🇺,🇭🇻,🇭🇼,🇭🇽,🇭🇾,🇭🇿,🇮🇦,🇮🇧,🇮🇨,🇮🇩,🇮🇪,🇮🇫,🇮🇬,🇮🇭,🇮🇮,🇮🇯,🇮🇰,🇮🇱,🇮🇲,🇮🇳,🇮🇴,🇮🇵,🇮🇶,🇮🇷,🇮🇸,🇮🇹,🇮🇺,🇮🇻,🇮🇼,🇮🇽,🇮🇾,🇮🇿,🇯🇦,🇯🇧,🇯🇨,🇯🇩,🇯🇪,🇯🇫,🇯🇬,🇯🇭,🇯🇮,🇯🇯,🇯🇰,🇯🇱,🇯🇲,🇯🇳,🇯🇴,🇯🇵,🇯🇶,🇯🇷,🇯🇸,🇯🇹,🇯🇺,🇯🇻,🇯🇼,🇯🇽,🇯🇾,🇯🇿,🇰🇦,🇰🇧,🇰🇨,🇰🇩,🇰🇪,🇰🇫,🇰🇬,🇰🇭,🇰🇮,🇰🇯,🇰🇰,🇰🇱,🇰🇲,🇰🇳,🇰🇴,🇰🇵,🇰🇶,🇰🇷,🇰🇸,🇰🇹,🇰🇺,🇰🇻,🇰🇼,🇰🇽,🇰🇾,🇰🇿,🇱🇦,🇱🇧,🇱🇨,🇱🇩,🇱🇪,🇱🇫,🇱🇬,🇱🇭,🇱🇮,🇱🇯,🇱🇰,🇱🇱,🇱🇲,🇱🇳,🇱🇴,🇱🇵,🇱🇶,🇱🇷,🇱🇸,🇱🇹,🇱🇺,🇱🇻,🇱🇼,🇱🇽,🇱🇾,🇱🇿,🇲🇦,🇲🇧,🇲🇨,🇲🇩,🇲🇪,🇲🇫,🇲🇬,🇲🇭,🇲🇮,🇲🇯,🇲🇰,🇲🇱,🇲🇲,🇲🇳,🇲🇴,🇲🇵,🇲🇶,🇲🇷,🇲🇸,🇲🇹,🇲🇺,🇲🇻,🇲🇼,🇲🇽,🇲🇾,🇲🇿,🇳🇦,🇳🇧,🇳🇨,🇳🇩,🇳🇪,🇳🇫,🇳🇬,🇳🇭,🇳🇮,🇳🇯,🇳🇰,🇳🇱,🇳🇲,🇳🇳,🇳🇴,🇳🇵,🇳🇶,🇳🇷,🇳🇸,🇳🇹,🇳🇺,🇳🇻,🇳🇼,🇳🇽,🇳🇾,🇳🇿,🇴🇦,🇴🇧,🇴🇨,🇴🇩,🇴🇪,🇴🇫,🇴🇬,🇴🇭,🇴🇮,🇴🇯,🇴🇰,🇴🇱,🇴🇲,🇴🇳,🇴🇴,🇴🇵,🇴🇶,🇴🇷,🇴🇸,🇴🇹,🇴🇺,🇴🇻,🇴🇼,🇴🇽,🇴🇾,🇴🇿,🇵🇦,🇵🇧,🇵🇨,🇵🇩,🇵🇪,🇵🇫,🇵🇬,🇵🇭,🇵🇮,🇵🇯,🇵🇰,🇵🇱,🇵🇲,🇵🇳,🇵🇴,🇵🇵,🇵🇶,🇵🇷,🇵🇸,🇵🇹,🇵🇺,🇵🇻,🇵🇼,🇵🇽,🇵🇾,🇵🇿,🇶🇦,🇶🇧,🇶🇨,🇶🇩,🇶🇪,🇶🇫,🇶🇬,🇶🇭,🇶🇮,🇶🇯,🇶🇰,🇶🇱,🇶🇲,🇶🇳,🇶🇴,🇶🇵,🇶🇶,🇶🇷,🇶🇸,🇶🇹,🇶🇺,🇶🇻,🇶🇼,🇶🇽,🇶🇾,🇶🇿,🇷🇦,🇷🇧,🇷🇨,🇷🇩,🇷🇪,🇷🇫,🇷🇬,🇷🇭,🇷🇮,🇷🇯,🇷🇰,🇷🇱,🇷🇲,🇷🇳,🇷🇴,🇷🇵,🇷🇶,🇷🇷,🇷🇸,🇷🇹,🇷🇺,🇷🇻,🇷🇼,🇷🇽,🇷🇾,🇷🇿,🇸🇦,🇸🇧,🇸🇨,🇸🇩,🇸🇪,🇸🇫,🇸🇬,🇸🇭,🇸🇮,🇸🇯,🇸🇰,🇸🇱,🇸🇲,🇸🇳,🇸🇴,🇸🇵,🇸🇶,🇸🇷,🇸🇸,🇸🇹,🇸🇺,🇸🇻,🇸🇼,🇸🇽,🇸🇾,🇸🇿,🇹🇦,🇹🇧,🇹🇨,🇹🇩,🇹🇪,🇹🇫,🇹🇬,🇹🇭,🇹🇮,🇹🇯,🇹🇰,🇹🇱,🇹🇲,🇹🇳,🇹🇴,🇹🇵,🇹🇶,🇹🇷,🇹🇸,🇹🇹,🇹🇺,🇹🇻,🇹🇼,🇹🇽,🇹🇾,🇹🇿,🇺🇦,🇺🇧,🇺🇨,🇺🇩,🇺🇪,🇺🇫,🇺🇬,🇺🇭,🇺🇮,🇺🇯,🇺🇰,🇺🇱,🇺🇲,🇺🇳,🇺🇴,🇺🇵,🇺🇶,🇺🇷,🇺🇸,🇺🇹,🇺🇺,🇺🇻,🇺🇼,🇺🇽,🇺🇾,🇺🇿,🇻🇦,🇻🇧,🇻🇨,🇻🇩,🇻🇪,🇻🇫,🇻🇬,🇻🇭,🇻🇮,🇻🇯,🇻🇰,🇻🇱,🇻🇲,🇻🇳,🇻🇴,🇻🇵,🇻🇶,🇻🇷,🇻🇸,🇻🇹,🇻🇺,🇻🇻,🇻🇼,🇻🇽,🇻🇾,🇻🇿,🇼🇦,🇼🇧,🇼🇨,🇼🇩,🇼🇪,🇼🇫,🇼🇬,🇼🇭,🇼🇮,🇼🇯,🇼🇰,🇼🇱,🇼🇲,🇼🇳,🇼🇴,🇼🇵,🇼🇶,🇼🇷,🇼🇸,🇼🇹,🇼🇺,🇼🇻,🇼🇼,🇼🇽,🇼🇾,🇼🇿,🇽🇦,🇽🇧,🇽🇨,🇽🇩,🇽🇪,🇽🇫,🇽🇬,🇽🇭,🇽🇮,🇽🇯,🇽🇰,🇽🇱,🇽🇲,🇽🇳,🇽🇴,🇽🇵,🇽🇶,🇽🇷,🇽🇸,🇽🇹,🇽🇺,🇽🇻,🇽🇼,🇽🇽,🇽🇾,🇽🇿,🇾🇦,🇾🇧,🇾🇨,🇾🇩,🇾🇪,🇾🇫,🇾🇬,🇾🇭,🇾🇮,🇾🇯,🇾🇰,🇾🇱,🇾🇲,🇾🇳,🇾🇴,🇾🇵,🇾🇶,🇾🇷,🇾🇸,🇾🇹,🇾🇺,🇾🇻,🇾🇼,🇾🇽,🇾🇾,🇾🇿,🇿🇦,🇿🇧,🇿🇨,🇿🇩,🇿🇪,🇿🇫,🇿🇬,🇿🇭,🇿🇮,🇿🇯,🇿🇰,🇿🇱,🇿🇲,🇿🇳,🇿🇴,🇿🇵,🇿🇶
,🇿🇷,🇿🇸,🇿🇹,🇿🇺,🇿🇻,🇿🇼,🇿🇽,🇿🇾,🇿🇿