Unicode-savvy `letter_or_digit?` function?

scouten · February 18, 2019, 10:54pm

I’m parsing a file format by recursing on a charlist. One of the challenges I have in parsing this format is determining whether the current Unicode codepoint is a letter or digit. So far I’ve written this:

# HELP: This is not Unicode-savvy. Is there such a thing?
defp letter_or_digit?(c) when c >= ?0 and c <= ?9, do: true
defp letter_or_digit?(c) when c >= ?A and c <= ?Z, do: true
defp letter_or_digit?(c) when c >= ?a and c <= ?z, do: true
defp letter_or_digit?(_), do: false

This is called via Enum.split_while(&letter_or_digit?/1) and similar constructions.

What I’ve written is obviously not savvy about non-ASCII letters or digits. Is there such a thing handy? My Google-fu has failed me so far.

The obvious answer would be to parse via Regex match. Unfortunately, some other parts of the file format are far better consumed via charlist recursion and I don’t want to pay the cost of bouncing back and forth between strings and charlists.

hauleth · February 18, 2019, 11:39pm

You absolutely can run regexes on charlists, since in Erlang charlists are the strings.
If you want to check whether a codepoint is a digit, here is the full list; you need to check all of these groups. For the rest of the character groups, check here.

BTW, you do not need to convert the binary (I assume you get a binary from the file) in order to iterate over it. You can pattern match on binary fragments as well: after <<byte, rest :: binary>> = "abba", byte == ?a and rest == "bba".
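For Unicode input, the same idea can work one codepoint at a time with the `utf8` binary modifier rather than one byte at a time. The following is only a minimal sketch: `BinaryScan` and its `split_while/2` are hypothetical names made up for illustration, not part of Elixir or any library.

```elixir
defmodule BinaryScan do
  # Hypothetical helper: splits a UTF-8 binary at the first codepoint for
  # which `pred` returns false, without converting to a charlist first.
  def split_while(binary, pred) when is_binary(binary) and is_function(pred, 1) do
    do_split(binary, pred, <<>>)
  end

  # `cp::utf8` matches one whole codepoint, however many bytes it occupies.
  defp do_split(<<cp::utf8, rest::binary>> = input, pred, acc) do
    if pred.(cp) do
      do_split(rest, pred, <<acc::binary, cp::utf8>>)
    else
      {acc, input}
    end
  end

  # End of input (or bytes that are not valid UTF-8): stop here.
  defp do_split(rest, _pred, acc), do: {acc, rest}
end
```

Any one-argument predicate on codepoints can be passed in, for example `BinaryScan.split_while("é1@x", fn c -> c != ?@ end)`.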

scouten · February 19, 2019, 2:08am

I think you’re mistaken. Strings and charlists are not the same thing. Just to confirm:

iex(1)> x = '1234'  # note: single-quote to make it a charlist, not a binary / string
'1234'
iex(2)> Regex.match?(~r/23/, x)
** (FunctionClauseError) no function clause matching in Regex.match?/2    
    
    The following arguments were given to Regex.match?/2:
    
        # 1
        ~r/23/
    
        # 2
        '1234'
    
    Attempted function clauses (showing 1 out of 1):
    
        def match?(%Regex{re_pattern: compiled}, string) when is_binary(string)
    
    (elixir) lib/regex.ex:231: Regex.match?/2

mgwidmann · February 19, 2019, 2:52am

To make it Unicode-aware, you have to consider all the non-ASCII characters. Is "й" going to return the right value for your function? Since listing them all out is difficult, the easiest way is to check for numbers first and then check whether the upcased character differs from the downcased one.
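The case comparison described here could be sketched roughly as below. `CaseHeuristic.cased_letter?/1` is a hypothetical name, and the sketch covers only the letter half of the suggestion. One caveat worth keeping in mind: letters from scripts without case (for example "段") have identical upper- and lowercase forms, so this heuristic reports them as non-letters.

```elixir
defmodule CaseHeuristic do
  # Hypothetical helper: treats a codepoint as a letter when its uppercase
  # and lowercase forms differ. Caseless letters are NOT detected.
  def cased_letter?(c) when is_integer(c) do
    s = <<c::utf8>>
    String.upcase(s) != String.downcase(s)
  end
end
```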

Nicd · February 19, 2019, 7:58am

Try Erlang regular expressions for charlists instead; I think this is what hauleth meant: http://erlang.org/doc/man/re.html

You probably want a regex that matches the Unicode character property Number (\p{N}).
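As a rough illustration of that suggestion, Erlang's `:re.run/3` accepts a charlist subject when the `:unicode` option is given, so a single codepoint can be tested without building a binary. The module name `CharlistRe` is an assumption made for this sketch. Note that `\p{N}` covers every numeric category (including characters such as "½"); `\p{Nd}` would restrict the match to decimal digits.

```elixir
defmodule CharlistRe do
  # Hypothetical helper: true when the codepoint has the Unicode general
  # category Letter (\p{L}) or Number (\p{N}).
  def letter_or_digit?(c) when is_integer(c) do
    # [c] is a one-element charlist; :unicode makes :re read it as codepoints.
    :re.run([c], "^[\\p{L}\\p{N}]$", [:unicode, {:capture, :none}]) == :match
  end
end
```

This plugs straight into the charlist-based approach from the question, e.g. `Enum.split_while(~c"héllo1 wörld", &CharlistRe.letter_or_digit?/1)`. The pattern is recompiled on every call here; compiling it once with `:re.compile/2` would likely be worthwhile in a hot loop.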

kip · February 21, 2019, 1:32pm

The regex examples below may help (correctly using the u flag as reminded by @NobbZ).

iex> char = "È"
"È"
iex> Regex.match? ~r/\p{Lu}/u, char
true
iex> Regex.match? ~r/\p{Ll}/u, char
false
iex> char = "é"
"é"
iex> Regex.match? ~r/\p{Ll}/u, char
true
iex> Regex.match? ~r/\p{Lu}/u, char
false
iex> char = "1"
"1"
iex> Regex.match? ~r/\p{N}/u, char
true
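Putting those properties together for the original question, one possible shape for a Unicode-aware predicate is sketched below. `UnicodeClass` is a hypothetical module name, and the codepoint is turned into a one-character string only for the duration of the match.

```elixir
defmodule UnicodeClass do
  # \p{L} matches any letter, \p{N} any number; the `u` flag enables
  # Unicode handling in the regex.
  @letter_or_digit ~r/^[\p{L}\p{N}]$/u

  # Hypothetical helper: accepts an integer codepoint, as found in a charlist.
  def letter_or_digit?(c) when is_integer(c) do
    Regex.match?(@letter_or_digit, <<c::utf8>>)
  end
end
```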

NobbZ · February 21, 2019, 1:33pm

You forgot to enable unicode. Use the u flag on the regex.

kip · February 21, 2019, 1:35pm

Sigh. Thanks for the reminder. I hate that it's not the default, but I understand why.

kip · February 22, 2019, 7:23pm

I’ve just published the first version of ex_cldr_unicode that might be helpful to you. It builds functions at compile time using data from the Unicode database.

There's a bunch of fun stuff you can do, but for your use case it includes some guards that you may find useful. For example:

defmodule MyModule do
  require Cldr.Unicode.Guards
  alias Cldr.Unicode.Guards

  def my_function(codepoint) when Guards.is_upper(codepoint) do
    IO.puts "It's Uppercase!"
  end
end

See hex docs for further info.

I’ve defined only the following guards so far, but it’s trivial to add more, so let me know if it’s useful:

is_upper
is_lower
is_digit
is_currency_symbol

Oh, and it’s more than twice as fast as using a regex for this kind of matching.

There are a bunch of classifier functions as well. From my understanding of your use case the following may also apply:

iex> Cldr.Unicode.Property.alphanumeric? "1234"
true
iex> Cldr.Unicode.Property.alphanumeric? "KeyserSöze1995"
true
iex> Cldr.Unicode.Property.alphanumeric? "3段"
true
iex> Cldr.Unicode.Property.alphanumeric? "dragon@example.com"
false

scouten · February 24, 2019, 3:05am

Thanks, Kip, that looks like it will fit my need very well.

kip · February 24, 2019, 8:17am

Updated to version 0.2.0. Main changes are:

Moves the public API to the Cldr.Unicode module.
Updates and adds documentation to all public functions.
Removes the text annotations from the compiled functions which materially reduces the size of the beam files.

Feedback welcome as are feature requests and PRs.