I’m parsing a file format by recursing on a charlist. One of the challenges I have in parsing this format is determining whether the current Unicode codepoint is a letter or digit. So far I’ve written this:
# HELP: This is not Unicode-savvy. Is there such a thing?
defp letter_or_digit?(c) when c >= ?0 and c <= ?9, do: true
defp letter_or_digit?(c) when c >= ?A and c <= ?Z, do: true
defp letter_or_digit?(c) when c >= ?a and c <= ?z, do: true
defp letter_or_digit?(_), do: false
This is called via Enum.split_while(&letter_or_digit?/1) and similar constructions.
What I’ve written is obviously not savvy about non-ASCII letters or digits. Is there such a thing handy? My Google-fu has failed me so far.
The obvious answer would be to parse via Regex match. Unfortunately, some other parts of the file format are far better consumed via charlist recursion and I don’t want to pay the cost of bouncing back and forth between strings and charlists.
You absolutely can run regexes on charlists: in Erlang, charlists are the strings.
If you want to check whether a code point is a digit, here is the full list; you need to check all of these groups. For the rest of the character groups, check here.
BTW, you do not need to convert the binary (I assume you get a binary from the file) in order to iterate over it. You can pattern match on binary fragments as well: after <<byte, rest::binary>> = "abba", byte == ?a and rest == "bba".
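A minimal sketch of that binary approach, in case it helps. The module and function names here are made up for illustration, and the guard is the same ASCII-only check from the question. Note the ::utf8 modifier, which pulls off one whole codepoint rather than one byte:

```elixir
defmodule BinaryScan do
  # Hypothetical helper: splits a UTF-8 binary into a leading run of
  # ASCII letters/digits (returned as a charlist) and the remaining binary.
  def take_word(binary) when is_binary(binary), do: take_word(binary, [])

  # <<c::utf8, rest::binary>> matches one codepoint, not one byte.
  defp take_word(<<c::utf8, rest::binary>>, acc)
       when c in ?0..?9 or c in ?A..?Z or c in ?a..?z do
    take_word(rest, [c | acc])
  end

  # Anything else (including end of input) stops the scan.
  defp take_word(rest, acc), do: {Enum.reverse(acc), rest}
end
```

Calling BinaryScan.take_word("abc1 def") should give back the charlist for "abc1" together with the remaining binary " def", so no up-front conversion to a charlist is needed.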
I think you’re mistaken. Strings and charlists are not the same thing. Just to confirm:
iex(1)> x = '1234' # note: single-quote to make it a charlist, not a binary / string
'1234'
iex(2)> Regex.match?(~r/23/, x)
** (FunctionClauseError) no function clause matching in Regex.match?/2
    The following arguments were given to Regex.match?/2:
        # 1
        ~r/23/
        # 2
        '1234'
    Attempted function clauses (showing 1 out of 1):
        def match?(%Regex{re_pattern: compiled}, string) when is_binary(string)
    (elixir) lib/regex.ex:231: Regex.match?/2
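For what it's worth, both observations can hold at once: Elixir's Regex functions only accept binaries, but Erlang's :re module (which Regex is built on) takes a charlist as the subject. Here is a rough sketch, assuming the :unicode and :ucp options and the \p{L} / \p{Nd} property classes are what you want for "letter or digit". The module and function names are hypothetical:

```elixir
defmodule CharlistRegex do
  # Hypothetical helper: true when the charlist starts with a Unicode
  # letter (\p{L}) or decimal digit (\p{Nd}).
  def starts_with_letter_or_digit?(charlist) when is_list(charlist) do
    # :unicode makes :re treat the charlist as Unicode codepoints;
    # {:capture, :none} makes it return just :match or :nomatch.
    opts = [:unicode, :ucp, {:capture, :none}]
    :re.run(charlist, "^[\\p{L}\\p{Nd}]", opts) == :match
  end
end
```

This goes through the regex engine on every call, so it is probably better suited to matching a whole run at once than to a per-codepoint predicate passed to Enum.split_while.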
To make it Unicode-aware, you have to consider all the non-ASCII characters. Will "й" return the right value from your function? Since listing them all out is difficult, the easiest way is to check for digits first and then check whether the uppercase form differs from the lowercase form.
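A quick sketch of that case-comparison heuristic, with the caveat that it is only a heuristic. The module name is made up, the digit check is ASCII-only, and letters from scripts without case (CJK, Hebrew, Arabic and so on) will come back false:

```elixir
defmodule CaseHeuristic do
  # ASCII digits only; non-ASCII digits are not handled by this sketch.
  def letter_or_digit?(c) when c in ?0..?9, do: true

  # Treat a codepoint as a letter if upcasing and downcasing it differ.
  # Assumes c is a valid Unicode scalar value; <<c::utf8>> raises otherwise.
  def letter_or_digit?(c) when is_integer(c) do
    s = <<c::utf8>>
    String.upcase(s) != String.downcase(s)
  end
end
```

With this, CaseHeuristic.letter_or_digit?(?й) is true and CaseHeuristic.letter_or_digit?(?-) is false, but so is CaseHeuristic.letter_or_digit?(?中), which is the limitation mentioned above.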
I’ve just published the first version of ex_cldr_unicode that might be helpful to you. It builds functions at compile time using data from the Unicode database.
There's a bunch of fun stuff you can do, but for your use case it includes some guards that you may find useful. For example:
defmodule MyModule do
  require Cldr.Unicode.Guards
  alias Cldr.Unicode.Guards

  def my_function(codepoint) when Guards.is_upper(codepoint) do
    IO.puts "It's Uppercase!"
  end