Ex_unicode - Fun with Unicode (introspection, lookup, sets, guards, transforms...)

Following on from my CLDR libraries, I started work on Unicode transforms. But like everything related to CLDR, there is a lot of yak-shaving and rabbit-hole travelling required.

The net result is a bunch of new libraries designed to make it easier to work with Unicode blocks, scripts, categories, properties and sets. These are:

  • ex_unicode introspects a string or code point and tells you a lot more than you probably want to know. But it is a good building block for other libraries.
  • unicode_set supports the Unicode Set syntax and provides the macro Unicode.Set.match?/2 that can be used to build clever guards to match on Unicode blocks, scripts, categories and properties.
  • unicode_guards uses ex_unicode and unicode_set to provide a set of prepackaged Unicode-friendly guards, such as is_upper/1, is_lower/1, is_currency_symbol/1, is_whitespace/1 and is_digit/1.
  • unicode_transform is a work in progress to implement the Unicode transform specification and to generate transformation modules.
  • unicode_string will be the last part of this series and will provide functions to split and replace strings based upon Unicode sets. Work hasn’t yet started but it’s going to be a fun project.

Unicode sets in particular allow some cool expressions. For example:

require Unicode.Set

# Is a given code point a digit? This is the
# digit `3` in the Thai script
iex> Unicode.Set.match?(?๓, "[[:digit:]]")
true

# What if we want to match on digits, but not Thai digits?
# Use set difference!
iex> Unicode.Set.match?(?๓, "[[:digit:]-[:thai:]]")
false
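
To make the set difference above concrete, here is a minimal sketch using only the standard library. The ranges are hand-picked assumptions (three decimal digit ranges, and the Thai block U+0E00..U+0E7F standing in for the Thai script), not the data that unicode_set actually uses:

```elixir
# Hand-picked stand-ins (assumptions), not the library's data:
# a few decimal digit ranges, and the Thai block in place of the Thai script.
digits = MapSet.new(Enum.concat([?0..?9, 0x0660..0x0669, 0x0E50..0x0E59]))
thai = MapSet.new(0x0E00..0x0E7F)

# "[[:digit:]-[:thai:]]" read as: digits with everything Thai removed
non_thai_digits = MapSet.difference(digits, thai)

IO.inspect(MapSet.member?(digits, ?๓))          # true
IO.inspect(MapSet.member?(non_thai_digits, ?๓)) # false
IO.inspect(MapSet.member?(non_thai_digits, ?1)) # true
```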

Since Unicode.Set.match?/2 is a macro, all the work of parsing, extracting code points, doing set operations and generating the guard code is done at compile time. The resulting code runs about 3 to 8 times faster than the equivalent regex (although of course regex has a much larger problem domain).
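
As a rough illustration of why this is fast, a guard built from code point ranges boils down to plain integer comparisons. The sketch below is hand-written, with a small assumed subset of digit ranges; the module and guard names are hypothetical, and the code the macro really generates may look different:

```elixir
defmodule DigitGuardSketch do
  # Hypothetical guard covering only three digit ranges
  # (ASCII, Arabic-Indic, Thai). The real set is much larger.
  defguard is_digit_sketch(codepoint)
           when is_integer(codepoint) and
                  (codepoint in ?0..?9 or
                     codepoint in 0x0660..0x0669 or
                     codepoint in 0x0E50..0x0E59)

  # The guard can be used in function heads like any built-in guard.
  def classify(<<codepoint::utf8, _rest::binary>>) when is_digit_sketch(codepoint), do: :digit
  def classify(_other), do: :other
end

IO.inspect(DigitGuardSketch.classify("๓abc")) # :digit
IO.inspect(DigitGuardSketch.classify("abc"))  # :other
```

Because the ranges are literals, the range checks compile down to simple comparisons with no regex engine involved at runtime.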

You should think of renaming your modules so that Unicode.Set becomes UnicodeSet instead, or at least Unicode.UnicodeSet if you want to make it clear that everything is under the Unicode namespace. The original name (Unicode.Set) doesn’t play well with aliasing.

Added two helpful functions in version 0.2.0 that:

  • Generate compiled patterns for speedy String.split/3 and String.replace/3.
  • Generate a list of code point ranges that can be used in nimble_parsec.

Generating compiled patterns for String matching

String.split/3 and String.replace/3 accept both patterns and compiled patterns, with compiled patterns being the more performant approach. Unicode Set supports the generation of patterns and compiled patterns:

iex> pattern = Unicode.Set.compile_pattern "[[:digit:]]"
{:ac, #Reference<0.2128975411.2135031811.52710>}
iex> list = String.split("abc1def2ghi3jkl", pattern)
["abc", "def", "ghi", "jkl"]
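
For comparison, the same kind of compiled pattern can be built by hand with the standard library's :binary.compile_pattern/1. This is only a sketch under the assumption that a short, hand-picked list of digits (ASCII and Thai) stands in for the full "[[:digit:]]" set; it does not show how Unicode Set builds its pattern internally:

```elixir
# Hand-picked stand-in for "[[:digit:]]": ASCII and Thai digits only.
digit_strings =
  [?0..?9, 0x0E50..0x0E59]
  |> Enum.concat()
  |> Enum.map(fn codepoint -> <<codepoint::utf8>> end)

# Compile once, then reuse the pattern across many calls.
pattern = :binary.compile_pattern(digit_strings)

IO.inspect(String.split("abc1def2ghi3jkl", pattern))
# ["abc", "def", "ghi", "jkl"]

IO.inspect(String.replace("abc๓def", pattern, "#"))
# "abc#def"
```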

Generating NimbleParsec ranges

The parser generator nimble_parsec allows a list of codepoint ranges as parameters to several combinators. Unicode Set can generate such ranges:

iex> Unicode.Set.utf8_char("[[^abcd][mnb]]")
[{:not, 97}, {:not, 98}, {:not, 99}, {:not, 100}, 98, 109, 110]

This can be used as shown in the following example:

defmodule MyCombinators do
  import NimbleParsec

  @digit_list Unicode.Set.utf8_char("[[:digit:]]")
  def unicode_digit do
    utf8_char(@digit_list)
    |> label("a digit in any Unicode script")
  end
end

Good suggestion and will do for the next version.

This is very useful. I can use it to add proper support for Unicode names in variables in my Elixir lexer.
