Possible bug in String.split/1's handling of Unicode whitespace?

dogweather · September 21, 2022, 10:17pm

According to the docs, String.split/1…

Divides a string into substrings at each Unicode whitespace occurrence…

(emphasis mine)

But I found that it doesn’t do this for U+00A0 “NO-BREAK SPACE” (UTF-8 <<194, 160>>). Maybe others too? No-break space is Unicode whitespace, and it’s marked as such in the Elixir Unicode property list.

Here’s the unexpected behavior I see (Elixir 1.14):

iex(1)> nbsp = <<194, 160>>
" "
iex(2)> s = Enum.join ["one", "two", "three"], nbsp
"one two three"
iex(3)> String.split s
["one two three"]

I expected it to split on the nbsp, producing the input, ["one", "two", "three"].

I traced the code for String.split/1 to break() here: elixir/lib/elixir/unicode/unicode.ex at main · elixir-lang/elixir · GitHub. It looks ok to me, though.

Background: This caused problems for me parsing some text. I didn’t realize that this is valid Unicode because String.split/1 and Regex.run were failing on it (although for two different reasons).

brettbeatty · September 21, 2022, 10:30pm

The String.split/1 docs also have this line:

Divisions do not occur on non-breaking whitespace.
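To make that line of the docs concrete, here is a small sketch contrasting a regular space with a no-break space. The variable names (`nbsp`, `plain`, `kept`) are only illustrative and are not from the thread.

```elixir
# U+00A0 NO-BREAK SPACE, written as a Unicode escape
# (the same bytes as <<194, 160>> in UTF-8).
nbsp = "\u00A0"

# A regular space (U+0020) is breaking whitespace, so split/1 divides there.
plain = String.split("one two three")

# NBSP is whitespace too, but non-breaking, so split/1 leaves it in place
# and only divides at the regular space.
kept = String.split("one" <> nbsp <> "two three")

IO.inspect(plain)
IO.inspect(kept)
```

With this input, `plain` should have three elements and `kept` only two, because "one" and "two" stay joined by the NBSP.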

dogweather · September 21, 2022, 10:30pm

Urp! Thank you, I didn’t catch that.

kip · September 22, 2022, 1:56am

The ‘break’ property and the ‘whitespace’ property are orthogonal in Unicode. There is definitely significant overlap when looked at from the perspective of Latin scripts, but it is definitely not universal, especially for languages that do not use whitespace between words (like Japanese, Chinese, Thai, Lao, Khmer, Myanmar scripts).

The Elixir default for String.split/1 is “breaking whitespace”. By definition, “non breaking” characters should not break words, lines or sentences.
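If you do want to split on a no-break space (as in the parsing case from the original post), one option is to pass the separator explicitly rather than rely on the default. This is a minimal sketch, not the only way to do it; the variable names are illustrative, and the regex variant assumes the input is valid UTF-8 and uses the `u` modifier.

```elixir
nbsp = "\u00A0"
s = Enum.join(["one", "two", "three"], nbsp)

# Option 1: String.split/2 with the NBSP itself as the pattern.
by_nbsp = String.split(s, nbsp)

# Option 2: String.split/3 with a regex whose character class lists U+00A0
# explicitly alongside \s, so both ordinary and no-break spaces separate words.
# `trim: true` drops empty strings from leading or trailing separators.
mixed = "one" <> nbsp <> "two three"
by_regex = String.split(mixed, ~r/[\s\x{00A0}]+/u, trim: true)

IO.inspect(by_nbsp)
IO.inspect(by_regex)
```

Both calls should yield the three words. Whether treating NBSP as a separator is appropriate depends on the data, since the character exists precisely to say "don't break here".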

dogweather · September 22, 2022, 4:54am

Fascinating, thank you.

dogweather · September 22, 2022, 5:44am

I found an interesting consequence of the absence of whitespace in certain languages: some software handles line-breaking better than others.

Here’s Firefox maybe correctly line-wrapping a Japanese Wikipedia page. It won’t break up words:

[Screenshot: Firefox, Screen Shot 2022-09-21 at 11.39.10 PM]

But Safari does break up some words, wrapping character-by-character:

[Screenshot: Safari, Screen Shot 2022-09-21 at 11.38.29 PM]

Those two extra characters after 1973 seem to be part of the next word. We can see how Firefox prefers to keep a word’s characters together, at the expense of a pretty flush layout. (See the line above 1973 with extra whitespace at the end.)

[My wife, who knows Japanese, told me about Japanese particles and how they’re used to mark word boundaries.]