Unicode libraries - Fun with Unicode (introspection, lookup, sets, guards, transforms...)

kip · November 22, 2019, 9:55pm

Following on from my CLDR libraries I started work on Unicode transforms. But like everything related to CLDR, there is a lot of yak-shaving and rabbit-hole travelling required.

The net result is a bunch of new libraries designed to make it easier to work with Unicode blocks, scripts, categories, properties and sets. These are:

ex_unicode that introspects a string or code point and tells you a lot more than you probably want to know. But it is a good building block for other libraries.
unicode_set supports the Unicode Set syntax and provides the macro Unicode.Set.match?/2 that can be used to build clever guards to match on Unicode blocks, scripts, categories and properties.
unicode_guards uses ex_unicode and unicode_set to provide a set of prepackaged unicode-friendly guards, such as is_upper/1, is_lower/1, is_currency_symbol/1, is_whitespace/1 and is_digit/1.
unicode_transform is a work in progress to implement the unicode transform specification and to generate transformation modules.
unicode_string will be the last part of this series and will provide functions to split and replace strings based upon unicode sets. Work hasn’t yet started but it’s going to be a fun project.

Unicode sets in particular allow some cool expressions. For example:

require Unicode.Set

# Is a given code point a digit? This is the
# digit `3` in the Thai script
iex> Unicode.Set.match?(?๓, "[[:digit:]]")
true

# What if we want to match on digits, but not Thai digits?
# Use set difference!
iex> Unicode.Set.match?(?๓, "[[:digit:]-[:thai:]]")
false

Since Unicode.Set.match?/2 is a macro, all the work of parsing, extracting code points, doing set operations and generating the guard code is done at compile time. The resulting code runs about 3 to 8 times faster than a regex case (although of course regex has a much larger problem domain).
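To give a feel for what compile-time guard generation buys, here is a hand-written sketch of the kind of range-based guard such a macro could emit. The module name, the guard name and the two code point ranges are assumptions chosen for illustration; the code the library actually generates is not shown in this thread.

```elixir
defmodule GuardSketch do
  # Hypothetical guard written by hand for illustration only.
  # It covers just ASCII digits and Thai digits (U+0E50..U+0E59),
  # whereas a real Unicode set would expand to many more ranges.
  defguard is_ascii_or_thai_digit(cp)
           when is_integer(cp) and (cp in ?0..?9 or cp in 0x0E50..0x0E59)

  # Because the ranges are literals, the check compiles down to
  # plain integer comparisons with no parsing at runtime.
  def classify(cp) when is_ascii_or_thai_digit(cp), do: :digit
  def classify(_cp), do: :other
end
```

Calling GuardSketch.classify(?๓) takes the first clause because ๓ is U+0E53, which falls inside the Thai digit range.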

tmbb · November 23, 2019, 9:24am

You should think of renaming your modules so that Unicode.Set becomes UnicodeSet instead, or at least Unicode.UnicodeSet if you want to make it clear that everything is under the Unicode namespace. The original name (Unicode.Set) doesn’t play well with aliasing.

kip · November 23, 2019, 9:55pm

Added two helpful functions in version 0.2.0 that:

Generate compiled patterns for speedy String.split/3 and String.replace/3.
Generate lists of code point ranges that can be used in nimble_parsec.

Generating compiled patterns for String matching

String.split/3 and String.replace/3 allow for patterns and compiled patterns to be used, with compiled patterns being the more performant approach. Unicode Set supports the generation of patterns and compiled patterns:

iex> pattern = Unicode.Set.compile_pattern "[[:digit:]]"
{:ac, #Reference<0.2128975411.2135031811.52710>}
iex> list = String.split("abc1def2ghi3jkl", pattern)
["abc", "def", "ghi", "jkl"]
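The same mechanism is available in the standard library for a hand-written list of strings. The sketch below is an assumption-laden illustration, not the library's code: the module name is hypothetical and it only knows about ASCII and Thai digits.

```elixir
defmodule DigitSplit do
  # Build a compiled pattern from a small, hand-picked list of digits.
  # A compiled pattern contains a reference, so it is built at runtime
  # here rather than stored in a module attribute.
  def pattern do
    digits = for cp <- Enum.concat(?0..?9, 0x0E50..0x0E59), do: <<cp::utf8>>
    :binary.compile_pattern(digits)
  end

  # Split a string on any of the digits in the compiled pattern.
  def split(string), do: String.split(string, pattern())
end
```

In real code the pattern would normally be built once and reused, since avoiding repeated compilation is the point of a compiled pattern.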

Generating NimbleParsec ranges

The parser generator nimble_parsec allows a list of codepoint ranges as parameters to several combinators. Unicode Set can generate such ranges:

iex> Unicode.Set.utf8_char("[[^abcd][mnb]]")
[{:not, 97}, {:not, 98}, {:not, 99}, {:not, 100}, 98, 109, 110]

This can be used as shown in the following example:

defmodule MyCombinators do
  import NimbleParsec

  @digit_list Unicode.Set.utf8_char("[[:digit:]]")
  def unicode_digit do
    utf8_char(@digit_list)
    |> label("a digit in any Unicode script")
  end
end

kip · November 23, 2019, 9:57pm

Good suggestion and will do for the next version.

tmbb · November 25, 2019, 11:16am

This is very useful. I can use it to add proper support for unicode names in variables in my Elixir lexer.

kip · March 11, 2020, 5:05am

The Unicode Consortium today introduced Unicode version 13.0 that adds 5,930 characters, for a total of 143,859 characters. These additions include four new scripts, for a total of 154 scripts, as well as 55 new emoji characters. As a result there are some updates to ex_unicode and related packages.

ex_unicode version 1.4.0 adds support for Unicode 13. It also adds some additional derived categories for detecting quote marks of varying kinds (left, right, double, single, ambidextrous, all). Changelog
unicode_set version 0.5.0 adds support for quote-related unicode sets such as [[:quote_mark:]], [[:quote_mark_left:]], [[:quote_mark_double:]] and so on. Changelog
unicode_guards version 0.2.0 adds guards for quote marks. Changelog
- is_quote_mark/1
- is_quote_mark_left/1
- is_quote_mark_right/1
- is_quote_mark_ambidextrous/1
- is_quote_mark_single/1
- is_quote_mark_double/1

Have fun with Unicode!

kip · May 12, 2020, 10:32pm

unicode_set version 0.6.0 is released today with a primary focus on underpinning some upcoming basic unicode regex capabilities.

Enhancements

Unicode sets are now a %Unicode.Set{} struct
Add Unicode.Set.Sigil implementing sigil_u
Add support for String.Chars and Inspect protocols

Bug Fixes

Fixes parsing sets to ignore non-encoded whitespace
Fixes intersection and difference set operations for sets that include string ranges like {abc}

kip · May 17, 2020, 2:09am

Introducing unicode_string which in this initial release implements the Unicode Case Folding algorithm and also provides a case insensitive string matching function.

Unicode.String.equals_ignoring_case?/2 has the same performance as calling String.downcase/1 on both arguments and comparing, with the added benefit of being Unicode aware.

Usage: Unicode.String.equals_ignoring_case?/2

Compares two strings in a case insensitive manner.

Case folding is applied to the two string arguments which are then compared with the == operator.

Arguments

string_a and string_b are two strings to be compared
type is the case folding type to be applied. The alternatives are :full, :simple and :turkic. The default is :full.

Returns

true or false

Notes

This function applies the Unicode Case Folding algorithm
The algorithm does not apply any treatment to diacritical marks, hence “compare strings without accents” is not part of this function.

Examples

  iex> Unicode.String.equals_ignoring_case? "ABC", "abc"
  true

  iex> Unicode.String.equals_ignoring_case? "beißen", "beissen"
  true

  iex> Unicode.String.equals_ignoring_case? "grüßen", "grussen"
  false
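For comparison, here is a sketch that uses only the standard library to show why lowercasing alone is not enough. The module and function names are hypothetical. The second function relies on Erlang's :string.equal/3, which case folds when its third argument is true; it stands in for the idea of case folding and is not a claim about how Unicode.String is implemented.

```elixir
defmodule CaseCompareSketch do
  # Naive comparison: lowercasing leaves "ß" as it is, so "beißen"
  # and "beissen" are still different strings afterwards.
  def downcase_equal?(a, b), do: String.downcase(a) == String.downcase(b)

  # Case-folded comparison using the Erlang standard library.
  # The third argument asks :string.equal/3 to ignore case.
  def casefold_equal?(a, b), do: :string.equal(a, b, true)
end
```

With these definitions CaseCompareSketch.downcase_equal?("beißen", "beissen") is false, which is the gap that full case folding closes.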

kip · May 18, 2020, 7:52am

Introducing the Unicode.Regex module that leverages all of the unicode sets supported by unicode_set. It is published on hex as unicode_set version 0.7.0.

This means you can use the power of unicode_set in regular expressions in addition to guard clauses, compiled patterns and the nimble_parsec combinator utf8_char/2.

This works by pre-processing the regular expression and expanding any unicode sets in place before calling Regex.compile/2.

This functionality allows a developer to more fully use the power of the Unicode database, introspecting blocks, scripts, combining classes and a whole lot more.
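The pre-processing idea can be sketched in a few lines. Everything here is an assumption made for illustration: the module name, the tiny lookup table (the Zs entry is copied from the example output in this post, the thai_digit entry is invented) and the simple token regex. The real library parses full Unicode Set syntax and derives the ranges from the Unicode database.

```elixir
defmodule ExpandSketch do
  # Hypothetical, hand-written table of set name => character class.
  @classes %{
    "Zs" => "[\\x{20}\\x{A0}\\x{1680}\\x{2000}-\\x{200A}\\x{202F}\\x{205F}\\x{3000}]",
    "thai_digit" => "[\\x{E50}-\\x{E59}]"
  }

  # Replace each top-level [:name:] token with its expanded class,
  # then hand the result to the standard Regex.compile/2.
  # Unknown names are left untouched. Nested sets are not handled.
  def compile(source, options \\ "u") do
    expanded =
      Regex.replace(~r/\[:(\w+):\]/, source, fn whole, name ->
        Map.get(@classes, name, whole)
      end)

    Regex.compile(expanded, options)
  end
end
```

Because the expansion happens before compilation, the resulting regex is an ordinary one and costs nothing extra at match time.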

Examples

# Posix and Perl forms are supported
iex> Unicode.Regex.compile("[:Zs:]")
{:ok, ~r/[\x{20}\x{A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]/u}

iex> Unicode.Regex.compile("\\p{Zs}")
{:ok, ~r/[\x{20}\x{A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]/u}

# These are unicode sets supported by `unicode_set` that are not
# supported by `Regex.compile/2`
iex> Unicode.Regex.compile("[:visible:]")
{:ok,
 ~r/[\x{20}-~\x{A0}-\x{AC}\x{AE}-\x{377}\x{37A}-\x{37F}\x{384}-\x{38A} ...]/u}

iex> Unicode.Regex.compile("[:ccc=230:]")
{:ok,
 ~r/[\x{300}-\x{314}\x{33D}-\x{344}\x{346}\x{34A}-\x{34C} ...]/u}

iex> Unicode.Regex.compile("[:diacritic:]")
{:ok,
 ~r/[^`\x{A8}\x{AF}\x{B4}\x{B7}-\x{B8}\x{2B0}-\x{34E}\x{350}-\x{357}\x{35D}-\x{362} ...]/u}

Enhancements

Add Unicode.Set.character_class/1 which returns a string compatible with Regex.compile/2. This supports the idea of expanded Unicode Sets being used in standard Elixir/Erlang regular expressions and will underpin implementation of Unicode Transforms in the package unicode_transform.
Add Unicode.Regex.compile/2 to pre-process a regex to expand Unicode Sets and then compile it with Regex.compile/2. Unicode.Regex.compile!/2 is also added.

Bug Fixes

Fixes a bug whereby a Unicode Set intersection would fail with a character class that starts at the same codepoint as the Unicode set.

Have fun with Unicode!

kip · July 12, 2020, 12:44pm

Today’s update is Unicode String version 0.2.0, which adds an implementation of the Unicode Segmentation Algorithm that supports the detection of grapheme, word, line and sentence break boundaries.

Next steps

This work will support the next phase of the text library work on part-of-speech tagging which requires word segmentation as a precursor.

This work also marks another milestone. In order to implement the break algorithm I needed to implement Unicode Regular Expressions. That in turn required implementation of Unicode Sets which, finally, required the implementation of Unicode Properties. The standards are implemented across ex_unicode, unicode_set and unicode_string packages.

It’s been a long road and, while not finished, the work is sufficiently advanced to be useful.

Examples

# Break a string by words (the default break type)
iex> Unicode.String.split "There is a letter. I will get it from the post office."  
["There", " ", "is", " ", "a", " ", "letter", ".", " ", "I", " ", "will", " ",  
"get", " ", "it", " ", "from", " ", "the", " ", "post", " ", "office", "."]

# Omit breaks that are all white space.
iex> Unicode.String.split "There is a letter. I will get it from the post office.", 
...> trim: true
["There", "is", "a", "letter", ".", "I", "will", "get", "it", "from", "the",
 "post", "office", "."]

# Break by sentence
iex> Unicode.String.split "There is a letter. I will get it from the post office.", 
...> break: :sentence
["There is a letter. ", "I will get it from the post office."]

# Sentence breaking that uses only character classes
# will break on well-known abbreviations
iex> Unicode.String.split "I went to see Mr. Smith today. He earned his Ph.D from Harvard.",
...>  break: :sentence
["I went to see Mr. ", "Smith today. ", "He earned his Ph.D from Harvard."]

# However several locales also have "suppressions", which are language-dependent
# abbreviations that suppress a break. Suppressions are supplied for "en", "fr", "it", "es",
# "ru", "de" and other locales.
iex> Unicode.String.split "I went to see Mr. Smith today. He earned his Ph.D from Harvard.", 
...> break: :sentence, locale: "en"
["I went to see Mr. Smith today. ", "He earned his Ph.D from Harvard."]
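The effect of suppressions can be approximated with a short sketch. This is not how Unicode.String implements them: the module name, the three-entry suppression list and the naive break regex are all assumptions for illustration. It first breaks after sentence-ending punctuation followed by whitespace, then re-joins any segment that ends in a suppressed abbreviation with the segment after it.

```elixir
defmodule SuppressionSketch do
  # Hypothetical, tiny suppression list. CLDR supplies far longer
  # per-locale lists.
  @suppressions ["Mr.", "Mrs.", "Dr."]

  def sentences(text) do
    # Naive candidate breaks: punctuation followed by whitespace or
    # end of input. Any trailing text without punctuation is kept.
    ~r/.*?[.!?](?:\s+|\z)|.+\z/s
    |> Regex.scan(text)
    |> List.flatten()
    |> Enum.reduce([], fn segment, acc ->
      case acc do
        [previous | rest] ->
          if suppressed?(previous),
            do: [previous <> segment | rest],
            else: [segment, previous | rest]

        [] ->
          [segment]
      end
    end)
    |> Enum.reverse()
  end

  # A segment is suppressed when it ends with a known abbreviation.
  defp suppressed?(segment) do
    trimmed = String.trim_trailing(segment)
    Enum.any?(@suppressions, &String.ends_with?(trimmed, &1))
  end
end
```

Run on the "Mr. Smith" sentence used in these examples, the sketch keeps "Mr. Smith today. " together and leaves "Ph.D" alone because no whitespace follows the full stop.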

# Other language rules are appropriate for different languages. For example
# Japanese doesn't use whitespace between words but we can still
# break on words.
iex> text = "助生レ和給ぴだそ更祈ーとどあ日丹サ申園たを大克リヘ円士マヌ一紙ごひなは団歳りン日予医ヨク従送コス反第ウ閣更内み暮81打ケ嘆乗アエセチ人字列え。19戸サシユ再回ウマヨカ日事ハレ属画核っル職追作モラネ容載フサ得注ぐで南最陸ぽへ玲訓リ八母式色ぎ 。"
"助生レ和給ぴだそ更祈ーとどあ日丹サ申園たを大克リヘ円士マヌ一紙ごひなは団歳りン日予医ヨク従送コス反第ウ閣更内み暮81打ケ嘆乗アエセチ人字列え。19戸サシユ再回ウマヨカ日事ハレ属画核っル職追作モラネ容載フサ得注ぐで南最陸ぽへ玲訓リ八母式色ぎ。。"
iex> Unicode.String.split text, break: :word, locale: "ja"
["助生", "レ", "和給", "ぴだそ", "更祈", "ー", "とどあ", "日丹",
 "サ", "申園", "たを", "大克", "リヘ", "円士", "マヌ", "一紙",
 "ごひなは", "団歳", "り", "ン", "日予医", "ヨク", "従送",
 "コス", "反第", "ウ", "閣更内", "み", "暮", "81", "打", "ケ",
 "嘆乗", "アエセチ", "人字列", "え", "。", "19", "戸", "サシユ",
 "再回", "ウマヨカ", "日事", "ハ

kip · October 11, 2020, 8:29am

Released today is Unicode Set version 0.11.0, which is primarily a bug fix release. The API, test coverage and overall stability are much improved. A version 1.0 can be expected before the end of the year.

Two functional improvements may be useful:

Unicode sets for blank, graphic and print

From time-to-time on the forum there is the question “how can I detect if a string or character is printable”. In Unicode this is not a simple matter but Unicode Regular Expressions provide a portable definition of three unicode sets that may prove useful:

# `\p{blank}` is the set of "horizontal space characters" 
# and is defined as `\p{gc=Space_Separator}\N{CHARACTER TABULATION}`
iex> Unicode.Set.match? "K", "[:blank:]"
false
iex> Unicode.Set.match? " ", "[:blank:]"
true
# Non breaking space
iex> Unicode.Set.match? << 0xa0 :: utf8 >>, "[:blank:]"
true

# Graph is the set of characters that create an impression
# and is defined as `[^\p{space}\p{gc=Control}\p{gc=Surrogate}\p{gc=Unassigned}]`
iex> Unicode.Set.match? << 0xa0 :: utf8 >>, "[:graph:]"
false
iex> Unicode.Set.match? " ", "[:graph:]"
false
iex> Unicode.Set.match? "克", "[:graph:]"              
true

# Print is the combination of the graph and blank sets minus control characters
# and is defined as `\p{graph}\p{blank}-\p{cntrl}`
iex> Unicode.Set.match? "克", "[:print:]"
true
iex> Unicode.Set.match? << 0xa0 :: utf8 >>, "[:print:]"
true

Unicode Regular Expressions

Unicode.Regex.compile/2 is now largely compliant with the Unicode Regular Expression standard. It operates by expanding unicode sets before compiling in the usual manner with Regex.compile/2.

sheharyarn · January 20, 2021, 10:23pm

Just tried Unicode.Regex.compile("[[:Emoji:]]") in my console and the regex correctly matched all the latest emojis. Thank you for your excellent work!

kip · January 20, 2021, 11:54pm

Glad it does what you need it to do! And I appreciate the feedback, it’s great for motivation.

Next version of Unicode, 14.0, will be out in September and I will have my Unicode libraries up-to-date at launch time.

kip · March 29, 2021, 2:36am

Introducing the very first version of Unicode Transform which implements the CLDR Transform specification.

In this first version it implements only the Latin to ASCII transform. This is commonly thought of as “remove accents”, so although it’s only one very small step, it’s possible this transform has some use to the community.

Examples

iex> Unicode.Transform.LatinAscii.transform "Considérant que la reconnaissance de la dignité inhérente à tous les membres"
"Considerant que la reconnaissance de la dignite inherente a tous les membres"

iex> Unicode.Transform.LatinAscii.transform "Da die Anerkennung der angeborenen Würde und der gleichen und unveräußerlichen Rechte aller Mitglieder"
"Da die Anerkennung der angeborenen Wurde und der gleichen und unverausserlichen Rechte aller Mitglieder"

Text which is not in the Latin script (technically not in the set [[:Latin:][:Common:][:Inherited:][〇]]) is passed through unchanged.

Background

This is a fun project, like most of the Unicode projects - if you can believe it! - with a lot of rabbit holes. Implementing this library first required:

Introspection of Unicode character properties in ex_unicode.
Unicode Sets and Regular Expressions in unicode_set

Implementation

The implementation is in two parts:

Generate an Elixir module from the CLDR transform .xml file. For example, the XML for the Latin to ASCII transform goes from:

# This handles only Latin, Common, and IDEOGRAPHIC NUMBER ZERO (Han).
#
:: [[:Latin:][:Common:][:Inherited:][〇]] ;
#
:: NFD() ;
[[:Latin:][0-9]] { [:Mn:]+ → ; # maps to nothing; remove all Mn following Latin letter/digit
:: NFC() ;
#
# Some of the following mappings (noted) are from CLDR ‹character-fallback› data.
# (Note, here "‹character-fallback›" uses U+2039/U+203A to avoid XML issues)
#
# Latin letters and IPA
#
Æ → AE ; # 00C6;LATIN CAPITAL LETTER AE (from ‹character-fallback›)
Ð → D ; # 00D0;LATIN CAPITAL LETTER ETH
Ø → O ; # 00D8;LATIN CAPITAL LETTER O WITH STROKE
Þ → TH ; # 00DE;LATIN CAPITAL LETTER THORN
...

becomes:

defmodule Unicode.Transform.LatinAscii do
  use Unicode.Transform

  # This file is generated. Manual changes are not recommended
  # Source: Latin
  # Target: ASCII
  # Transform direction: both
  # Transform alias: und-t-d0-ascii und-Latn-t-s0-ascii

  # This handles only Latin, Common, and IDEOGRAPHIC NUMBER ZERO (Han).
  #
  filter("[[:Latin:][:Common:][:Inherited:][〇]]")
  #
  transform("NFD")
  # maps to nothing; remove all Mn following Latin letter/digit
  replace("[:Mn:]+", "", preceeded_by: "[[:Latin:][0-9]]")
  #
  transform("NFC")
  #
  # Some of the following mappings (noted) are from CLDR ‹character-fallback› data.
  # (Note, here "‹character-fallback›" uses U+2039/U+203A to avoid XML issues)
  #
  # Latin letters and IPA
  #
  # 00C6;LATIN CAPITAL LETTER AE (from ‹character-fallback›)
  replace("Æ", "AE")
  # 00D0;LATIN CAPITAL LETTER ETH
  replace("Ð", "D")
  # 00D8;LATIN CAPITAL LETTER O WITH STROKE
  replace("Ø", "O")
  ...
end

Step two is the implementation of the macros (filter/1, convert/3, transform/1 and others) that generate the final code. This approach lets developers define their own transforms in an Elixir-friendly way.
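To make the rule sequence concrete, here is a rough standard-library sketch of what the first few Latin to ASCII rules do at runtime. It is not the generated code: the module name is hypothetical, the filter step is omitted, and only the four explicit mappings quoted in the rule excerpt are included.

```elixir
defmodule LatinAsciiSketch do
  # Only the explicit mappings shown in the excerpt of the rules.
  @mappings %{"Æ" => "AE", "Ð" => "D", "Ø" => "O", "Þ" => "TH"}

  def transform(string) do
    string
    # :: NFD() ; decompose so accents become separate combining marks
    |> String.normalize(:nfd)
    # [[:Latin:][0-9]] { [:Mn:]+ → ; drop marks after a Latin letter or digit
    |> String.replace(~r/(?<=[\p{Latin}0-9])\p{Mn}+/u, "")
    # :: NFC() ; recompose whatever is left
    |> String.normalize(:nfc)
    # Explicit one-to-one and one-to-many replacements
    |> String.replace(Map.keys(@mappings), &Map.fetch!(@mappings, &1))
  end
end
```

Characters outside the small mapping table, such as ß, pass through unchanged in this sketch, whereas the full transform has rules for them.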

Next steps

The current version implements a minimal part of the standard. Although parsing and generating the module is largely conformant, the code generation is not yet complete. Therefore version 0.1.0 is only useful to people who can benefit from Unicode.Transform.LatinAscii.transform/1.

amnu3387 · March 29, 2021, 7:04am

Hi kip,

I don’t have much knowledge in this. Is the objective of CLDR Transform to allow one to parse text like those accented examples into basic ASCII while allowing one to revert it losslessly back into its original “encoding”?

kip · August 27, 2021, 1:03am

Unicode 14 is due for release in September. As a preview, I have released ex_unicode version 1.12.0-rc.0. The key features of Unicode 14 are:

Add 838 characters, for a total of 144,697 characters. These additions include 5 new scripts, for a total of 159 scripts, as well as 37 new emoji characters.
Add support for lesser-used languages and unique written requirements worldwide, including numerous symbols additions. Funds from the Adopt-a-Character program provided support for some of these additions. The new scripts and characters include:
- Toto, used to write the Toto language in India near Bhutan
- Cypro-Minoan, an undeciphered historical script primarily used on the island of Cyprus
- Vithkuqi, an historic script used to write Albanian, and undergoing a modern revival
- Old Uyghur, an historic script used in Central Asia and elsewhere to write Turkic, Chinese, Mongolian, Tibetan, and Arabic languages
- Tangsa, a modern script used to write the Tangsa language, which is spoken in India and Myanmar
- Many Latin additions for extended IPA
- Arabic script additions used to write languages across Africa and in Iran, Pakistan, Malaysia, Indonesia, Java, and Bosnia, and to write honorifics, and additions for Quranic use
- Other character additions support languages of the Philippines, North America, India, and Mongolia
Popular symbol additions:
- 37 emoji characters. For complete statistics regarding all emoji as of Unicode 14.0, see Emoji Counts. For more information about emoji additions in version 14.0, including new emoji ZWJ sequences and emoji modifier sequences, see Emoji Recently Added, v14.0.

kip · August 27, 2021, 1:19am

I’m very sorry for not replying to you earlier!

What you are describing - lossless bi-directional conversion - sounds like you’re referring to character set encoding. Transforms ≠ Encoding.

Transforms are about changing representations and transliteration. For example, can we represent Chinese language in a romanised way? This is not a replacement of one Chinese ideogram with a Latin character. It’s a phonetic transform (in this case, it can be pinyin or some other method).

kip · September 14, 2021, 12:04am

Today is Unicode 14 release day. All the related libraries I maintain are tested and updated against the new version. Several are released as 1.0 versions since the APIs are stable.

ex_unicode 1.12.0

Implements Unicode introspection. Helpful to identify the script, character categories and character properties. For example:

iex> Unicode.script ?خ
"arabic"
iex> Unicode.category ?ä
:Ll
iex> Unicode.category ?A
:Lu
iex> Unicode.category ?🧐
:So
iex> Unicode.properties ?A
[
  :alphabetic,
  :ascii_hex_digit,
  :cased,
  :changes_when_casefolded,
  :changes_when_casemapped,
  :changes_when_lowercased,
  :grapheme_base,
  :hex_digit,
  :id_continue,
  :id_start,
  :uppercase,
  :xid_continue,
  :xid_start
]

Unicode Set 1.0

Implements Unicode Sets that allow for flexible definition of unicode characters that can then be used as guards, to generate patterns, ranges for nimble_parsec and regular expressions. Some examples:

# The character "๓" is the Thai digit `3`
iex> Unicode.Set.match? ?๓, "[[:digit:]]"
true

# Set operations allow union, intersection and difference
# This example matches on digits, but not the Thai script
iex> Unicode.Set.match? ?๓, "[[:digit:]-[:thai:]]"
false

iex> Unicode.Set.to_pattern("[{👦🏻}-{👦🏿}]")
{:ok, ["👦🏻", "👦🏼", "👦🏽", "👦🏾", "👦🏿"]}

iex> Unicode.Regex.compile("\\p{Zs}")
{:ok, ~r/[\x{20}\x{A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]/u}

iex> Unicode.Regex.compile("[:graphic:]")
{:ok,
 ~r/[\x{20}-\x{7E}\x{A0}-\x{AC}\x{AE}-\x{377}\x{37A}-\x{37F}...]/u}

Unicode Guards 1.0

Leverages unicode_set to define a set of guards for common use. Examples:

defmodule Guards do
  import Unicode.Guards

  def upper(<< x :: utf8, _rest :: binary >>) when is_upper(x), do: :upper
  def lower(<< x :: utf8, _rest :: binary >>) when is_lower(x), do: :lower
  def digit(<< x :: utf8, _rest :: binary >>) when is_digit(x), do: :digit
  def whitespace(<< x :: utf8, _rest :: binary >>) when is_whitespace(x), do: :whitespace
  def currency(<< x :: utf8, _rest :: binary >>) when is_currency_symbol(x), do: :currency
end

Unicode String 1.0

Implements:

The Unicode Case Folding algorithm to provide case-independent equality checking irrespective of language or script.
The Unicode Segmentation algorithm to detect, break or split strings into grapheme clusters, words and sentences.

Examples:

iex> Unicode.String.equals_ignoring_case? "ABC", "abc"
true

iex> Unicode.String.equals_ignoring_case? "beißen", "beissen"
true

iex> Unicode.String.split "This is a sentence. And another.", break: :word
["This", " ", "is", " ", "a", " ", "sentence", ".", " ", "And", " ", "another", "."]

iex> Unicode.String.split "This is a sentence. And another.", break: :word, trim: true
["This", "is", "a", "sentence", ".", "And", "another", "."]

iex> Unicode.String.split "This is a sentence. And another.", break: :sentence
["This is a sentence. ", "And another."]

kip · September 14, 2021, 9:00pm

Way back in 2016, @Qqwy launched the Unicode package on hex. It was the original inspiration for ex_unicode - which is a project that I started in 2019 because I needed to support Unicode Sets and Level 1 of Unicode Regular Expressions in order to work on CLDR Transforms. Everything in Unicode and CLDR ends up being a loooooong journey down many unexpected but very rewarding paths.

As @wojtekmach once said to me “CLDR is … vast”. Only later on did I truly understand just how vast. I still haven’t finished CLDR transforms.

Anyway, @Qqwy and I are combining efforts with the following changes:

ex_unicode will, from the next release, be published as unicode, replacing the currently published package. Since Qqwy’s original work was the inspiration for mine the APIs are consistent and upgrades will be easy.
Qqwy becomes a co-owner of the elixir-unicode GitHub organisation.

kip · September 17, 2022, 2:14am

This week Unicode 15 was announced.

Today I’ve published unicode 1.15.0 that is based upon Unicode 15. There are no required changes to unicode_set, unicode_string, unicode_guards or unicode_transform.