Background: This caused problems for me when parsing some text. I didn’t realize that this is valid Unicode because String.split/1 and Regex.run were failing on it (although for two different reasons).
The ‘break’ property and the ‘whitespace’ property are orthogonal in Unicode. There is significant overlap when looked at from the perspective of Latin scripts, but it is definitely not universal, especially for languages that do not use whitespace between words (like Japanese, Chinese, Thai, Lao, Khmer and Myanmar).
The Elixir default for String.split/1 is to split on “breaking whitespace”. By definition, “non-breaking” characters should not break words, lines or sentences.
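To make the String.split/1 behaviour concrete, here is a minimal sketch using a no-break space (U+00A0). The module name SplitSketch, the helper split_all_whitespace/1 and the sample text are hypothetical, made up for illustration; the regex is just one possible way to opt in to splitting on non-breaking separators as well, not something the standard library prescribes.

```elixir
defmodule SplitSketch do
  # Hypothetical helper (not part of Elixir's String module): splits on any
  # run of whitespace or Unicode separator characters, including the
  # non-breaking ones that String.split/1 deliberately leaves alone.
  def split_all_whitespace(text) do
    # \p{Z} covers the Unicode separator categories, which include U+00A0.
    # The `u` modifier makes the pattern operate on code points, not bytes.
    String.split(text, ~r/[\s\p{Z}]+/u, trim: true)
  end
end

# Sample input: U+00A0 between "no" and "break", an ordinary space before "here".
text = "no\u00A0break here"

# Default behaviour: only the ordinary (breaking) space splits the string.
IO.inspect(String.split(text))
# → ["no break", "here"]   (the first element still contains U+00A0)

# Opting in to splitting on non-breaking separators too.
IO.inspect(SplitSketch.split_all_whitespace(text))
# → ["no", "break", "here"]
```

Whether the second form is what you want depends on the text: splitting on U+00A0 ignores the author’s request that those two words stay together.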
I found an interesting consequence of the absence of whitespace in certain languages: some software handles line-breaking better than others.
Here’s Firefox line-wrapping a Japanese Wikipedia page, possibly correctly. It won’t break up words:
But Safari does break up some words, wrapping character-by-character:
Those two extra characters after 1973 seem to be part of the next word. We can see how Firefox prefers to keep a word’s characters together, at the expense of a pretty flush layout. (See the line above 1973 with extra whitespace at the end.)
[My wife, who knows Japanese, told me about Japanese particles and that they’re used to mark word boundaries.]