How to detect if a given character (grapheme) is whitespace?

fireproofsocks · November 12, 2019, 6:06am

While working on an Elixir parser (for Handlebars syntax), I came across the need to determine whether or not a given character (a grapheme) is whitespace. I know that there are various String functions (like String.trim_leading), but I’m hoping to be able to write a guard clause on a function along the lines of this:

defp custom_trim(<<h :: binary - size(1)>> <> tail) when h not in [" ", "\t", "\n"], do: # ...

But I don’t know how to specify all the other whitespace characters. Is there a function that I could leverage to tell me whether a given character represents whitespace? If I use a regular expression pattern like ~r/\s/ would that match all whitespace? If it did I could do something like this:

String.split(my_str, ~r/\s/, parts: 2)

Thanks!

NobbZ · November 12, 2019, 6:24am

As you are splitting off a single byte only, you have already covered most of the available whitespace. In that range only \v and \r are missing, AFAIR…

A better approach that you can extend to work with all whitespace is this one:

def f(<<h::utf8, tail::binary>>) when h not in ~c[ \t\n\r\v…], do: …
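To make the idea concrete, here is a minimal sketch of a leading-whitespace trimmer built on that pattern. The module name, function name, and the (deliberately partial) whitespace list are assumptions for illustration, not something from the thread; extend the list with whichever code points you need.

```elixir
defmodule WhitespaceSketch do
  # Hypothetical helper: drops leading whitespace one code point at a time.
  # The list is a partial, illustrative set: space, tab, newline,
  # carriage return, vertical tab, form feed and no-break space.
  @whitespace [?\s, ?\t, ?\n, ?\r, ?\v, ?\f, 0x00A0]

  # <<h::utf8, ...>> binds h to the first code point as an integer,
  # so multi-byte characters such as U+00A0 are matched correctly.
  def trim_leading(<<h::utf8, tail::binary>>) when h in @whitespace,
    do: trim_leading(tail)

  # Anything else (including the empty binary) is returned unchanged.
  def trim_leading(rest) when is_binary(rest), do: rest
end
```

Calling WhitespaceSketch.trim_leading(" \t\u00a0abc ") should return "abc " because only the leading run is consumed. Matching with ::utf8 rather than binary-size(1) is what lets the guard see whole code points instead of single bytes.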

You can find a list of all characters that the Unicode standard marks as whitespace on Wikipedia.

fireproofsocks · November 12, 2019, 3:36pm

Thank you for the info. If I wanted to reference one of the codepoints listed in the Wiki document, e.g. U+00A0 for a “No-break space”, how would I do that in Elixir? From https://stackoverflow.com/questions/54731429/convert-a-single-character-string-to-its-codepoint I can see how to do that using pattern matching or the String.to_charlist function, but if you give me a list of codepoint numbers (as listed in the Wiki reference), I can’t see how to take an alpha-numerical representation (or an integer) and convert it back to a string.

I can solve my immediate problem but I’m still not understanding the bigger picture here, so any clarifications would be appreciated!

NobbZ · November 12, 2019, 3:41pm

"\u00a0" as a string, or ~c[\u00a0] as a charlist (the single-quote syntax works as well).
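If the code point arrives as a number or as hex text rather than as a literal in the source, a sketch along these lines should work. The variable names are made up for the example; the calls themselves are standard library functions.

```elixir
# Hex text as it appears in a code point table (the "00A0" in U+00A0).
codepoint = String.to_integer("00A0", 16)

# Integer code point -> UTF-8 encoded string, via binary construction.
nbsp = <<codepoint::utf8>>

# The same conversion via a one-element charlist.
nbsp_from_charlist = List.to_string([codepoint])

# And back again: the string's code point as an integer.
<<roundtrip::utf8>> = nbsp
```

Here nbsp and nbsp_from_charlist are both equal to "\u00a0", and roundtrip is the original integer again.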

kip · November 12, 2019, 6:43pm

As of Unicode 12.1 the following are categorised as space separators (general category Zs), shown here as a list of code point ranges. Note that control characters such as tab and newline are not part of Zs:

iex> Cldr.Unicode.Category.categories[:Zs] 
[
  {32, 32},
  {160, 160},
  {5760, 5760},
  {8192, 8202},
  {8239, 8239},
  {8287, 8287},
  {12288, 12288}
]

I have a lib that defines some guards to help with this sort of thing, but for whitespace specifically you can write a guard like this:

when codepoint == 32 or codepoint == 160 or codepoint == 5760 or codepoint in 8192..8202 or codepoint == 8239 or codepoint == 8287 or codepoint == 12288
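As a rough sketch, that condition can be wrapped in a custom guard with defguard and then used in a function head. The module, guard, and function names below are hypothetical and are not taken from any library; the ranges are copied from the Zs list above.

```elixir
defmodule ZsGuardSketch do
  # Hypothetical guard built from the Unicode 12.1 Zs ranges listed above.
  defguard is_zs(codepoint)
           when codepoint == 32 or codepoint == 160 or codepoint == 5760 or
                  codepoint in 8192..8202 or codepoint == 8239 or
                  codepoint == 8287 or codepoint == 12288

  # Skips leading space separators, one code point at a time.
  def skip_spaces(<<c::utf8, rest::binary>>) when is_zs(c), do: skip_spaces(rest)

  # Stops at the first code point outside Zs (tab and newline included).
  def skip_spaces(rest) when is_binary(rest), do: rest
end
```

With this, ZsGuardSketch.skip_spaces(" \u00a0x") should give "x", while a leading "\t" is left in place because tab is a control character rather than a space separator.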

Or in a regex you can match on Unicode character categories:

iex> Regex.match? ~r/\p{Zs}/u, "   "
true
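For comparison, here is a small sketch of how \p{Zs} and \s behave with the u modifier, which, per the Regex documentation, makes classes like \s match on Unicode as well. The variable names and sample strings are made up for the example.

```elixir
nbsp = "\u00a0"

# \p{Zs} covers space separators only, so a no-break space matches
# but a tab (a control character) does not.
zs_matches_nbsp = Regex.match?(~r/\p{Zs}/u, nbsp)
zs_matches_tab = Regex.match?(~r/\p{Zs}/u, "\t")

# With the u modifier, \s matches Unicode whitespace as well as
# the ASCII characters such as tab.
s_matches_nbsp = Regex.match?(~r/\s/u, nbsp)
s_matches_tab = Regex.match?(~r/\s/u, "\t")

# Splitting on the first whitespace character, as in the original question.
parts = String.split("foo\u00a0bar baz", ~r/\s/u, parts: 2)
```

So for the String.split idea from the first post, ~r/\s/u looks like the closer fit, since it also handles tabs and newlines, which \p{Zs} alone would miss.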

fireproofsocks · November 13, 2019, 6:38pm

Thanks! What’s your lib? It sounds relevant.