Unicode libraries - Fun with Unicode (introspection, lookup, sets, guards, transforms...)

kip · August 16, 2023, 4:09am

Following on from the spirited conversation about string capitalisation, I was motivated to finish my work on the Unicode Case Mapping Algorithm. A release candidate is now published as part of unicode_string version 1.3.0-rc.0.

Difference from the Elixir String casing functions

The implementation is locale-aware and implements all of the transforms in SpecialCasing.txt, including the conditional mappings for Greek, Turkish, Azeri and Lithuanian (Elixir’s string functions do not handle Lithuanian, since its mappings are conditional transforms).

Supports uppercasing Greek (not specified in Unicode). Feedback is requested on the implementation, see the last section of the post on this topic.
Supports titlecasing the “IJ” diphthong in Dutch (also not specified in Unicode).
Titlecasing works on individual words which is different to String.capitalize/2.
Support for other languages is planned as I learn more about common practice. Feedback and feature requests are welcome.
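For comparison, here is a small sketch of how the standard library behaves on similar input. It uses only `String` functions that ship with Elixir; the `:turkic` and `:greek` modes are the standard library's own options and are not part of unicode_string.

```elixir
# Standard library only; nothing here calls unicode_string.

# String.capitalize/2 upcases the first character of the whole string
# and downcases the rest. It does not work word by word.
capitalized = String.capitalize("the quick brown fox")

# The :turkic mode handles dotted and dotless i.
turkic = String.upcase("ıi", :turkic)

# The :greek mode handles the final sigma when downcasing.
greek = String.downcase("ΣΣ", :greek)

IO.inspect({capitalized, turkic, greek})
# => {"The quick brown fox", "Iİ", "σς"}
```

Note that the standard library selects these behaviours with a mode argument rather than a locale, so the caller has to know which mode applies to the text.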

Examples

# Basic case transformation
iex> Unicode.String.upcase("the quick brown fox")
"THE QUICK BROWN FOX"

# Dotted-I in Turkish and Azeri
iex> Unicode.String.upcase("Diyarbakır", locale: :tr)
"DİYARBAKIR"

# Upper case in Greek removes diacritics
iex> Unicode.String.upcase("Πατάτα, Αέρας, Μυστήριο", locale: :el)
"ΠΑΤΑΤΑ, ΑΕΡΑΣ, ΜΥΣΤΗΡΙΟ"

# Lower case Greek with a final sigma
iex> Unicode.String.downcase("ὈΔΥΣΣΕΎΣ", locale: :el)
"ὀδυσσεύς"

# Title case Dutch with leading diphthong
iex> Unicode.String.titlecase("ijsselmeer", locale: :nl)
"IJsselmeer"

Seeking feedback on Greek uppercasing

There is conflicting information on how to uppercase Greek. I would greatly welcome any feedback from native Greek speakers on what is correct in common usage. Here is what I have consolidated (extracted from the docs):

CLDR algorithm (current implementation)

According to CLDR, all accents on all characters are omitted when upcasing. This is based upon the CLDR el-Upper text transform:

  Remove 0301 following Greek, with possible intervening 0308 marks.
  ::NFD();
  For uppercasing (not titlecasing!) remove all greek accents from greek letters.
  This is done in two groups, to account for canonical ordering.
  [:Greek:] [^[:ccc=Not_Reordered:][:ccc=Above:]]*? { [\u0313\u0314\u0301\u0300\u0306\u0342\u0308\u0304] → ;
  [:Greek:] [^[:ccc=Not_Reordered:][:ccc=Iota_Subscript:]]*? { \u0345 → ;
  ::NFC();

That transform basically says remove all accents except a subscripted iota. It doesn’t handle diphthongs correctly.
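To make the first quoted rule concrete, here is a rough sketch using only the Elixir standard library. It is not the unicode_string implementation. `GreekUpperSketch` is a hypothetical module name, and the sketch makes two simplifying assumptions: a Greek base letter is any code point in U+0370..U+03FF after NFD, and the canonical-combining-class conditions in the rule are ignored. The second rule, which deals with U+0345, is left out to keep it short.

```elixir
defmodule GreekUpperSketch do
  # Hypothetical module for illustration; not part of unicode_string or CLDR.

  # Combining marks named in the first quoted rule.
  @accents [0x0313, 0x0314, 0x0301, 0x0300, 0x0306, 0x0342, 0x0308, 0x0304]

  def upcase(string) do
    string
    |> String.normalize(:nfd)
    |> String.graphemes()
    |> Enum.map_join(&upcase_grapheme/1)
    |> String.normalize(:nfc)
  end

  # Assumption: after NFD, a Greek base letter sits in U+0370..U+03FF.
  defp upcase_grapheme(grapheme) do
    case String.to_charlist(grapheme) do
      [base | marks] when base in 0x0370..0x03FF ->
        # Drop the listed accents, keep any other combining marks.
        kept = Enum.reject(marks, &(&1 in @accents))
        String.upcase(<<base::utf8>>) <> List.to_string(kept)

      _other ->
        # Non-Greek graphemes get the default uppercase mapping.
        String.upcase(grapheme)
    end
  end
end

IO.puts(GreekUpperSketch.upcase("Πατάτα, Αέρας, Μυστήριο"))
# => ΠΑΤΑΤΑ, ΑΕΡΑΣ, ΜΥΣΤΗΡΙΟ
```

Because U+0308 is in the list, a diaeresis is stripped along with the other accents, which is where this approach and the diphthong handling described elsewhere part ways.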

Mozilla algorithm

Mozilla has a thread on a bug report that states:

Greek accented letters should be converted to the respective non-accented uppercase
letters. The required conversions are the following (in Unicode):

ά → Α
έ → Ε
ή → Η
ί → Ι
ΐ → Ϊ
ό → Ο
ύ → Υ
ΰ → Ϋ
ώ → Ω

Also diphthongs (two-vowel constructs) should be converted as follows, when the
first vowel is accented:

άι → ΑΪ
έι → ΕΪ
όι → ΟΪ
ύι → ΥΪ
άυ → ΑΫ
έυ → ΕΫ
ήυ → ΗΫ
όυ → ΟΫ
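The two tables above can be encoded almost literally. The following is a minimal sketch using only the Elixir standard library; `MozillaTableSketch` is a hypothetical module name, and it covers nothing beyond the lowercase letters and pairs listed in the bug thread. It assumes the string literals in the source file are stored in NFC form.

```elixir
defmodule MozillaTableSketch do
  # Hypothetical module for illustration; it only encodes the two tables
  # quoted from the bug thread. Assumes these literals are in NFC form.

  @diphthongs %{
    "άι" => "ΑΪ", "έι" => "ΕΪ", "όι" => "ΟΪ", "ύι" => "ΥΪ",
    "άυ" => "ΑΫ", "έυ" => "ΕΫ", "ήυ" => "ΗΫ", "όυ" => "ΟΫ"
  }

  @vowels %{
    "ά" => "Α", "έ" => "Ε", "ή" => "Η", "ί" => "Ι", "ΐ" => "Ϊ",
    "ό" => "Ο", "ύ" => "Υ", "ΰ" => "Ϋ", "ώ" => "Ω"
  }

  def upcase(string) do
    string
    |> String.normalize(:nfc)
    # Diphthongs first, so the accented first vowel is still present.
    |> replace_all(@diphthongs)
    |> replace_all(@vowels)
    # Everything not covered by the tables gets the default mapping.
    |> String.upcase()
  end

  defp replace_all(string, table) do
    String.replace(string, Map.keys(table), &Map.fetch!(table, &1))
  end
end

IO.puts(MozillaTableSketch.upcase("Μάιος"))
# => ΜΑΪΟΣ
```

The order of the two replacement passes matters: if the single-vowel table ran first, the accent would already be gone and the diphthong pairs would never match.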

That thread seems to align with current-day Mozilla which says the rules are:

In Greek (el), vowels lose their accent when the whole word is in
uppercase (ά/Α), except for the disjunctive eta (ή/Ή). Also, diphthongs
with an accent on the first vowel lose the accent and gain a diaeresis
on the second vowel (άι/ΑΪ).