Following on from my CLDR libraries, I started work on Unicode transforms. But, like everything related to CLDR, there is a lot of yak-shaving and rabbit-hole travelling required.
The net result is a bunch of new libraries designed to make it easier to work with Unicode blocks, scripts, categories, properties and sets. These are:
- `ex_unicode` introspects a string or code point and tells you a lot more than you probably want to know. But it's a good building block for other libraries.
- `unicode_set` supports the Unicode Set syntax and provides the macro `Unicode.Set.match?/2` that can be used to build clever guards to match on Unicode blocks, scripts, categories and properties.
- `unicode_guards` uses `ex_unicode` and `unicode_set` to provide a set of prepackaged Unicode-friendly guards, such as `is_upper/1`, `is_lower/1`, `is_currency_symbol/1`, `is_whitespace/1` and `is_digit/1`.
- `unicode_transform` is a work in progress to implement the Unicode transform specification and to generate transformation modules.
- `unicode_string` will be the last part of this series and will provide functions to split and replace strings based upon Unicode sets. Work hasn't yet started, but it's going to be a fun project.
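To give a flavour of the introspection `ex_unicode` offers, here is a minimal sketch. The function names `Unicode.category/1`, `Unicode.script/1` and `Unicode.block/1` come from the library's documentation; the exact return values vary by library version, so they are omitted rather than guessed:

```elixir
# Inspecting a single code point with ex_unicode.
# U+0E53 is the Thai digit three, written ๓.

Unicode.category(?๓)  # the general category (a "number" category)
Unicode.script(?๓)    # the script the code point belongs to (Thai)
Unicode.block(?๓)     # the Unicode block containing U+0E53
```

The same functions also accept a string, in which case they describe every code point in it.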
Unicode sets in particular allow some cool expressions. For example:
```elixir
require Unicode.Set

# Is a given code point a digit? `?๓` is the
# digit `3` in the Thai script.
iex> Unicode.Set.match?(?๓, "[[:digit:]]")
true

# What if we want to match on digits, but not Thai digits?
# Use set difference!
iex> Unicode.Set.match?(?๓, "[[:digit:]-[:thai:]]")
false
```
Since `Unicode.Set.match?/2` is a macro, all the work of parsing the set expression, extracting code points, performing set operations and generating the guard code is done at compile time. The resulting code runs about 3 to 8 times faster than an equivalent regex (although, of course, regexes cover a much larger problem domain).
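Because the macro expands to guard-safe code, the sets above can sit directly in function heads. A minimal sketch, using only the two set expressions already shown; the module and function names here are mine, not part of the library:

```elixir
defmodule Digits do
  require Unicode.Set

  # Each clause is selected by a guard that was fully
  # generated at compile time from the set expression.
  def classify(cp) when Unicode.Set.match?(cp, "[[:digit:]-[:thai:]]"),
    do: :digit_but_not_thai

  def classify(cp) when Unicode.Set.match?(cp, "[[:digit:]]"),
    do: :thai_digit

  def classify(_cp), do: :not_a_digit
end
```

Given the semantics demonstrated above, `Digits.classify(?๓)` falls through the first clause and matches the second, returning `:thai_digit`, while an ASCII `?3` matches the first clause.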