Unicode-savvy `letter_or_digit?` function?

I’m parsing a file format by recursing on a charlist. One of the challenges I have in parsing this format is determining whether the current Unicode codepoint is a letter or digit. So far I’ve written this:

# HELP: This is not Unicode-savvy. Is there such a thing?
defp letter_or_digit?(c) when c >= ?0 and c <= ?9, do: true
defp letter_or_digit?(c) when c >= ?A and c <= ?Z, do: true
defp letter_or_digit?(c) when c >= ?a and c <= ?z, do: true
defp letter_or_digit?(_), do: false

This is called via Enum.split_while(&letter_or_digit?/1) and similar constructions.

What I’ve written is obviously not savvy about non-ASCII letters or digits. Is there such a thing handy? My Google-fu has failed me so far.

The obvious answer would be to parse via Regex match. Unfortunately, some other parts of the file format are far better consumed via charlist recursion and I don’t want to pay the cost of bouncing back and forth between strings and charlists.

  1. You absolutely can run regexes on charlists, because in Erlang charlists are the strings.
  2. If you want to check whether a code point is a digit, here you have the full list; you need to check all of these groups. For the rest of the character groups, check here.

BTW, you do not need to convert the binary (I assume you get a binary from the file) in order to iterate over it. You can pattern match on binary fragments as well: after <<byte, rest :: binary>> = "abba", byte == ?a and rest == "bba".
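A minimal sketch of that idea, extended from bytes to whole codepoints: the `utf8` segment modifier takes one UTF-8 encoded codepoint off the front of a binary at a time. The module and function names (`BinaryWalk`, `take_while/2`) are made up for illustration.

```elixir
defmodule BinaryWalk do
  # Splits `binary` into {codepoints, rest}: the leading codepoints for which
  # `pred` returns true, and the remainder of the binary left untouched.
  def take_while(binary, pred) when is_binary(binary) and is_function(pred, 1) do
    do_take(binary, pred, [])
  end

  # `cp::utf8` matches one UTF-8 encoded codepoint, not one byte.
  defp do_take(<<cp::utf8, rest::binary>> = bin, pred, acc) do
    if pred.(cp), do: do_take(rest, pred, [cp | acc]), else: {Enum.reverse(acc), bin}
  end

  # Empty binary or invalid UTF-8: stop and return what has been collected.
  defp do_take(bin, _pred, acc), do: {Enum.reverse(acc), bin}
end
```

Calling `BinaryWalk.take_while("abba!", &(&1 in ?a..?z))` returns the four matching codepoints as a list together with the unconsumed binary `"!"`, so no up-front conversion of the whole input is needed.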


I think you’re mistaken. Strings and charlists are not the same thing. Just to confirm:

iex(1)> x = '1234'  # note: single-quote to make it a charlist, not a binary / string
'1234'
iex(2)> Regex.match?(~r/23/, x)
** (FunctionClauseError) no function clause matching in Regex.match?/2    
    
    The following arguments were given to Regex.match?/2:
    
        # 1
        ~r/23/
    
        # 2
        '1234'
    
    Attempted function clauses (showing 1 out of 1):
    
        def match?(%Regex{re_pattern: compiled}, string) when is_binary(string)
    
    (elixir) lib/regex.ex:231: Regex.match?/2

To make it Unicode-aware, you have to consider characters from every script. Is "й" going to return the right value for your function? Since listing them all out is difficult, the easiest way is to check for digits first and then check whether the upper-case form differs from the lower-case form.
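A rough sketch of that heuristic, assuming only ASCII digits count as digits. The module name `CaseHeuristic` is hypothetical. Note the limits: letters from scripts without case (for example "段") come back as false, as do non-ASCII digits.

```elixir
defmodule CaseHeuristic do
  # ASCII digits are checked first.
  def letter_or_digit?(c) when c in ?0..?9, do: true

  # Otherwise, treat the codepoint as a letter if upcasing and downcasing it
  # give different results. Caseless letters are NOT detected by this test.
  def letter_or_digit?(c) when is_integer(c) do
    s = <<c::utf8>>
    String.upcase(s) != String.downcase(s)
  end
end
```

Because it still takes a single codepoint, it can be passed to `Enum.split_while/2` over a charlist in the same way as the ASCII-only version.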

Try Erlang regular expressions for charlists instead; I think this is what hauleth meant: http://erlang.org/doc/man/re.html

You probably want to use a regex that matches the Unicode character property Number (\p{N}).
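A sketch of what that could look like, calling Erlang's `:re.run/3` directly on a one-element charlist. The module name `ReCharlist` and the choice of `\p{L}` plus `\p{N}` as "letter or digit" are assumptions for illustration.

```elixir
defmodule ReCharlist do
  # \p{L} = any letter, \p{N} = any number. Anchored so that the subject
  # must consist of exactly one such character.
  @pattern "^[\\p{L}\\p{N}]$"

  def letter_or_digit?(c) when is_integer(c) do
    # :unicode makes :re treat the subject as a list of Unicode codepoints;
    # capture: :none makes :re.run/3 return just :match or :nomatch.
    :re.run([c], @pattern, [:unicode, capture: :none]) == :match
  end
end
```

With this in place, `Enum.split_while(charlist, &ReCharlist.letter_or_digit?/1)` works on a charlist without converting it to a binary first.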


The regex examples below may help (correctly using the u flag as reminded by @NobbZ).

iex> char = "È"
"È"
iex> Regex.match? ~r/\p{Lu}/u, char
true
iex> Regex.match? ~r/\p{Ll}/u, char
false
iex> char = "é"
"é"
iex> Regex.match? ~r/\p{Ll}/u, char
true
iex> Regex.match? ~r/\p{Lu}/u, char
false
iex> char = "1"
"1"
iex> Regex.match? ~r/\p{N}/u, char
true
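Putting those properties together for the original question, here is a sketch of a single-codepoint check built on `Regex.match?/2`. The module name `RegexCheck` is hypothetical, and each call builds a small binary from the codepoint, which is the string round trip the original post hoped to avoid.

```elixir
defmodule RegexCheck do
  def letter_or_digit?(c) when is_integer(c) do
    # <<c::utf8>> encodes the codepoint as a one-character string.
    # The u flag is required for \p{...} to work on Unicode codepoints.
    Regex.match?(~r/^[\p{L}\p{N}]$/u, <<c::utf8>>)
  end
end
```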

You forgot to enable unicode. Use the u flag on the regex.


Sigh. Thanks for the reminder. I hate that it's not the default, but I understand why.

I’ve just published the first version of ex_cldr_unicode that might be helpful to you. It builds functions at compile time using data from the Unicode database.

There's a bunch of fun stuff you can do, but for your use case it includes some guards that you may find useful. For example:

defmodule MyModule do
  require Cldr.Unicode.Guards
  alias Cldr.Unicode.Guards

  # The guard is a macro, hence the `require` above.
  def my_function(codepoint) when Guards.is_upper(codepoint) do
    IO.puts "It's Uppercase!"
  end
end

See hex docs for further info.

I’ve defined only the following guards so far, but it’s trivial to add more, so let me know if it's useful:

  • is_upper
  • is_lower
  • is_digit
  • is_currency_symbol

Oh, and it’s more than twice as fast as using a regex for this kind of matching.

There are a bunch of classifier functions as well. From my understanding of your use case the following may also apply:

iex> Cldr.Unicode.Property.alphanumeric? "1234"
true
iex> Cldr.Unicode.Property.alphanumeric? "KeyserSöze1995"
true
iex> Cldr.Unicode.Property.alphanumeric? "3段"
true
iex> Cldr.Unicode.Property.alphanumeric? "dragon@example.com"
false

Thanks, Kip, that looks like it will fit my need very well.


Updated to version 0.2.0. Main changes are:

  • Moves the public API to the Cldr.Unicode module.

  • Updates and adds documentation to all public functions.

  • Removes the text annotations from the compiled functions which materially reduces the size of the beam files.

Feedback welcome as are feature requests and PRs.
