Ex_unicode - Fun with Unicode (introspection, lookup, sets, guards, transforms...)

Following on from my CLDR libraries, I started work on Unicode transforms. But like everything related to CLDR, there is a lot of yak-shaving and rabbit-hole travelling required.

The net result is a bunch of new libraries designed to make it easier to work with Unicode blocks, scripts, categories, properties and sets. These are:

  • ex_unicode that introspects a string or code point and tells you a lot more than you probably want to know. But it's a good building block for other libraries.
  • unicode_set supports the Unicode Set syntax and provides the macro Unicode.Set.match?/2 that can be used to build clever guards to match on Unicode blocks, scripts, categories and properties.
  • unicode_guards uses ex_unicode and unicode_set to provide a set of prepackaged, Unicode-friendly guards such as is_upper/1, is_lower/1, is_currency_symbol/1, is_whitespace/1 and is_digit/1.
  • unicode_transform is a work in progress to implement the unicode transform specification and to generate transformation modules.
  • unicode_string will be the last part of this series, providing functions to split and replace strings based upon Unicode sets. Work hasn't yet started but it's going to be a fun project.
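As a taste of the guard library, here is a minimal sketch of how the prepackaged guards might be used in function heads. It assumes the guards are exposed via `import Unicode.Guards` (the module name is an assumption; check the unicode_guards docs):

```elixir
# Sketch only: assumes unicode_guards exposes its guards through
# `import Unicode.Guards` (module name is an assumption).
defmodule Classifier do
  import Unicode.Guards

  # The guards can appear in function heads because they expand
  # to plain comparisons on the code point.
  def classify(<<cp::utf8, _rest::binary>>) when is_upper(cp), do: :starts_upper
  def classify(<<cp::utf8, _rest::binary>>) when is_digit(cp), do: :starts_digit
  def classify(_other), do: :other
end
```

`Classifier.classify("Hello")` would then return `:starts_upper` for an uppercase letter in any script, not just A-Z.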

Unicode sets in particular allow some cool expressions. For example:

require Unicode.Set

# Is a given code point a digit? This is the
# digit `3` in the Thai script
iex> Unicode.Set.match?(?๓, "[[:digit:]]")
true

# What if we want to match on digits, but not Thai digits?
# Use set difference!
iex> Unicode.Set.match?(?๓, "[[:digit:]-[:thai:]]")
false

Since Unicode.Set.match?/2 is a macro, all the work of parsing, extracting code points, performing set operations and generating the guard code is done at compile time. The resulting code runs about 3 to 8 times faster than the equivalent regex (although of course regex covers a much larger problem domain).
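Because the macro expands to guard-safe code, the set expressions above can sit directly in a function head. A minimal sketch using the same set difference as the earlier example:

```elixir
defmodule DigitCheck do
  require Unicode.Set

  # Matches any Unicode digit that is not in the Thai script,
  # using the set-difference syntax shown above.
  def non_thai_digit?(cp) when Unicode.Set.match?(cp, "[[:digit:]-[:thai:]]"), do: true
  def non_thai_digit?(_cp), do: false
end
```

The generated guard is a series of integer range comparisons, which is where the speedup over a regex comes from.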

8 Likes

You should think of renaming your modules so that Unicode.Set becomes UnicodeSet instead, or at least Unicode.UnicodeSet if you want to make it clear that everything is under the Unicode namespace. The original name (Unicode.Set) doesn’t play well with aliasing.

3 Likes

Version 0.2.0 adds two helpful functions that:

  • Generate compiled patterns for speedy String.split/3 and String.replace/3.
  • Generate a list of code point ranges that can be used in nimble_parsec.

Generating compiled patterns for String matching

String.split/3 and String.replace/3 accept both patterns and compiled patterns, with compiled patterns being the more performant option. Unicode Set supports generating both:

iex> pattern = Unicode.Set.compile_pattern "[[:digit:]]"
{:ac, #Reference<0.2128975411.2135031811.52710>}
iex> list = String.split("abc1def2ghi3jkl", pattern)
["abc", "def", "ghi", "jkl"]

Generating NimbleParsec ranges

The parser generator nimble_parsec allows a list of codepoint ranges as parameters to several combinators. Unicode Set can generate such ranges:

iex> Unicode.Set.utf8_char("[[^abcd][mnb]]")
[{:not, 97}, {:not, 98}, {:not, 99}, {:not, 100}, 98, 109, 110]

This can be used as shown in the following example:

defmodule MyCombinators do
  import NimbleParsec

  @digit_list Unicode.Set.utf8_char("[[:digit:]]")
  def unicode_digit do
    utf8_char(@digit_list)
    |> label("a digit in any Unicode script")
  end
end
2 Likes

Good suggestion and will do for the next version.

This is very useful. I can use it to add proper support for Unicode names in variables in my Elixir lexer

1 Like

The Unicode consortium today introduced Unicode version 13.0, which adds 5,930 characters for a total of 143,859 characters. These additions include four new scripts, for a total of 154 scripts, as well as 55 new emoji characters. As a result there are some updates to ex_unicode and related packages.

  • ex_unicode version 1.4.0 adds support for Unicode 13. It also adds some additional derived categories for detecting quote marks of varying kinds (left, right, double, single, ambidextrous, all). Changelog

  • unicode_set version 0.5.0 adds support for quote-related unicode sets such as [[:quote_mark:]], [[:quote_mark_left:]], [[:quote_mark_double:]] and so on. Changelog

  • unicode_guards version 0.2.0 adds guards for quote marks. Changelog

    • is_quote_mark/1
    • is_quote_mark_left/1
    • is_quote_mark_right/1
    • is_quote_mark_ambidextrous/1
    • is_quote_mark_single/1
    • is_quote_mark_double/1

Have fun with Unicode!

5 Likes

unicode_set version 0.6.0 is released today, with a primary focus on underpinning some upcoming basic Unicode regex capabilities.

Enhancements

  • Unicode sets are now a %Unicode.Set{} struct

  • Add Unicode.Set.Sigil implementing sigil_u

  • Add support for String.Chars and Inspect protocols
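A hedged sketch of what the new sigil and protocol support enable; the exact import path is inferred from the module name above and should be checked against the docs:

```elixir
# Assumes the ~u sigil comes from `import Unicode.Set.Sigil`
# (inferred from the module name above; an assumption).
import Unicode.Set.Sigil

set = ~u"[[:digit:]]"
# `set` is a %Unicode.Set{} struct; String.Chars support means
# it can be interpolated back into a string with "#{set}"
```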

Bug Fixes

  • Fixes parsing sets to ignore non-encoded whitespace

  • Fixes intersection and difference set operations for sets that include string ranges like {abc}

1 Like

Introducing unicode_string, which in this initial release implements the Unicode Case Folding algorithm and provides a case-insensitive string matching function.

Unicode.String.equals_ignoring_case?/2 has the same performance as calling String.downcase/1 on both arguments and comparing the results, with the added benefit of being Unicode-aware.

Usage: Unicode.String.equals_ignoring_case?/2

Compares two strings in a case insensitive manner.

Case folding is applied to the two string arguments which are then compared with the == operator.
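The comparison described can be sketched using Elixir's built-in String.downcase/1 as a stand-in for true case folding. This is a simplification: downcasing is not the Case Folding algorithm, and notably leaves "ß" unchanged where full folding maps it to "ss".

```elixir
defmodule CaseInsensitive do
  # Simplified stand-in for the library function: real case
  # folding differs from downcasing (e.g. "ß" folds to "ss"
  # under :full folding, but downcases to itself).
  def equals_ignoring_case?(a, b) do
    String.downcase(a) == String.downcase(b)
  end
end
```

This sketch returns true for "ABC" vs "abc", but unlike the library function it returns false for "beißen" vs "beissen", which is exactly the gap that proper case folding closes.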

Arguments

  • string_a and string_b are two strings to be compared

  • type is the case folding type to be applied. The alternatives are :full, :simple and :turkic. The default is :full.

Returns

  • true or false

Notes

  • This function applies the Unicode Case Folding algorithm

  • The algorithm does not apply any treatment to diacritical marks, so “compare strings ignoring accents” is not part of this function.

Examples

  iex> Unicode.String.equals_ignoring_case? "ABC", "abc"
  true

  iex> Unicode.String.equals_ignoring_case? "beißen", "beissen"
  true

  iex> Unicode.String.equals_ignoring_case? "grüßen", "grussen"
  false
2 Likes

Introducing the Unicode.Regex module that leverages all of the unicode sets supported by unicode_set. It is published on hex as unicode_set version 0.7.0.

This means you can use the power of unicode_set in regular expressions, in addition to guard clauses, compiled patterns and the nimble_parsec combinator utf8_char/2.

This works by pre-processing the regular expression and expanding any unicode sets in place before calling Regex.compile/2.

This functionality allows a developer to more fully use the power of the Unicode database, introspecting blocks, scripts, combining classes and a whole lot more.

Examples

# Posix and Perl forms are supported
iex> Unicode.Regex.compile("[:Zs:]")
{:ok, ~r/[\x{20}\x{A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]/u}

iex> Unicode.Regex.compile("\\p{Zs}")
{:ok, ~r/[\x{20}\x{A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]/u}

# These are unicode sets supported by `unicode_set` that are not
# supported by `Regex.compile/2`
iex> Unicode.Regex.compile("[:visible:]")
{:ok,
 ~r/[\x{20}-~\x{A0}-\x{AC}\x{AE}-\x{377}\x{37A}-\x{37F}\x{384}-\x{38A} .../u}

iex> Unicode.Regex.compile("[:ccc=230:]")
{:ok,
 ~r/[\x{300}-\x{314}\x{33D}-\x{344}\x{346}\x{34A}-\x{34C} ...]/u}

iex> Unicode.Regex.compile("[:diacritic:]")
{:ok,
 ~r/[^`\x{A8}\x{AF}\x{B4}\x{B7}-\x{B8}\x{2B0}-\x{34E}\x{350}-\x{357}\x{35D}-\x{362} ...]/u}

Enhancements

  • Add Unicode.Set.character_class/1, which returns a string compatible with Regex.compile/2. This supports the idea of expanded Unicode Sets being used in standard Elixir/Erlang regular expressions and will underpin the implementation of Unicode Transforms in the unicode_transform package.

  • Add Unicode.Regex.compile/2 to pre-process a regex, expanding its Unicode Sets, and then compile it with Regex.compile/2. Unicode.Regex.compile!/2 is also added.
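A sketch of the character_class/1 workflow: expand a Unicode Set into a character-class string, then compile it yourself. The exact shape of the returned string (with or without enclosing brackets) is an assumption here.

```elixir
# Sketch: character_class/1 is documented as returning a string
# compatible with Regex.compile/2; whether it already includes
# the enclosing brackets is an assumption.
class = Unicode.Set.character_class("[[:digit:]]")
{:ok, regex} = Regex.compile(class, "u")

Regex.match?(regex, "٣")  # Arabic-Indic digit three
```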

Bug Fixes

  • Fixes a bug whereby a Unicode Set intersection would fail with a character class that starts at the same codepoint as the Unicode set.

Have fun with Unicode!

2 Likes

Today's update is Unicode String version 0.2.0, which adds an implementation of the Unicode Segmentation Algorithm supporting the detection of grapheme, word, line and sentence break boundaries.

Next steps

This work will support the next phase of the text library work on part-of-speech tagging which requires word segmentation as a precursor.

This work also marks another milestone. In order to implement the break algorithm I needed to implement Unicode Regular Expressions. That in turn required implementation of Unicode Sets which, finally, required the implementation of Unicode Properties. The standards are implemented across ex_unicode, unicode_set and unicode_string packages.

It's been a long road and, while not finished, the work is sufficiently advanced to be useful.

Examples

# Break a string by words and sentences
iex> Unicode.String.split "There is a letter. I will get it from the post office."  
["There", " ", "is", " ", "a", " ", "letter", ".", " ", "I", " ", "will", " ",  
"get", " ", "it", " ", "from", " ", "the", " ", "post", " ", "office", "."]

# Omit breaks that are all white space.
iex> Unicode.String.split "There is a letter. I will get it from the post office.", 
...> trim: true
["There", "is", "a", "letter", ".", "I", "will", "get", "it", "from", "the",
 "post", "office", "."]

# Break by sentence
iex> Unicode.String.split "There is a letter. I will get it from the post office.", 
...> break: :sentence
["There is a letter. ", "I will get it from the post office."]

# Sentence breaking that uses only character classes
# will break on well-known abbreviations
iex> Unicode.String.split "I went to see Mr. Smith today. He earned his Ph.D from Harvard.",
...>  break: :sentence
["I went to see Mr. ", "Smith today. ", "He earned his Ph.D from Harvard."]

# However several locales also have "suppressions": language-dependent
# abbreviations that suppress a break. Suppressions are supplied for "en",
# "fr", "it", "es", "ru", "de" and other locales.
iex> Unicode.String.split "I went to see Mr. Smith today. He earned his Ph.D from Harvard.", 
...> break: :sentence, locale: "en"
["I went to see Mr. Smith today. ", "He earned his Ph.D from Harvard."]

# Different break rules apply to different languages. For example,
# Japanese doesn't use whitespace between words but we can still
# break on words.
iex> text = "助生レ和給ぴだそ更祈ーとどあ日丹サ申園たを大克リヘ円士マヌ一紙ごひなは団歳りン日予医ヨク従送コス反第ウ閣更内み暮81打ケ嘆乗アエセチ人字列え。19戸サシユ再回ウマヨカ日事ハレ属画核っル職追作モラネ容載フサ得注ぐで南最陸ぽへ玲訓リ八母式色ぎ。"
iex> Unicode.String.split text, break: :word, locale: "ja"
["助生", "レ", "和給", "ぴだそ", "更祈", "ー", "とどあ", "日丹",
 "サ", "申園", "たを", "大克", "リヘ", "円士", "マヌ", "一紙",
 "ごひなは", "団歳", "り", "ン", "日予医", "ヨク", "従送",
 "コス", "反第", "ウ", "閣更内", "み", "暮", "81", "打", "ケ",
 "嘆乗", "アエセチ", "人字列", "え", "。", "19", "戸", "サシユ",
 "再回", "ウマヨカ", "日事", "ハ
2 Likes