Ex_unicode - Fun with Unicode (introspection, lookup, sets, guards, transforms...)

Up next is unicode_unihan, a new library that introspects the Unihan database.

If you thought that Latin-1, with its 159 code points, was tricky enough with majuscule and minuscule forms and diacritics, wait until you see Unihan.

Unihan is 98,000 code points (and counting), built upon shared history, culture and politics amongst China, Japan, Korea and Vietnam. Han refers to the Chinese people; Unihan refers to the monumental task undertaken to try to produce a single set of code points that respects these historical, cultural and political contexts.

The resulting Unihan database encapsulates a wealth of information, which can now be introspected with the new unicode_unihan library.

Motivated and inspired by @jkwchui and the work he is doing at https://visual-fonts.com, this library is a collaborative work in progress. It’s early days, but for anyone interested, you’re very welcome to provide suggestions and feedback. Here’s a simple example:

iex> Unicode.Unihan.unihan("人")
%{
  kUnihanCore2020: "GHJKMPT",
  kCNS1992: "1-4429",
  kFourCornerCode: ["8000.0"],
  kIRGHanyuDaZidian: ["10101.100"],
  kFrequency: "1",
  kGradeLevel: "1",
  codepoint: 20154,
  kCowles: ["5115.5", "5117"],
  kCNS1986: "1-4429",
  kHanyuPinlu: ["rén(16866)", "ren(280)"],
  kBigFive: "A448",
  kHanyuPinyin: ["10101.100:rén"],
  kTotalStrokes: %{"zh-Hans": 2, "zh-Hant": 2},
  kIICore: ["AGTJHKMP"],
  kSBGY: ["102.23"],
  kCihaiT: ["80.201"],
  kKPS0: ["FCC5"],
  kTGHZ2013: ["313.110:rén"],
  kIRGDaiKanwaZiten: ["00344"],
  kTaiwanTelegraph: ["0086"],
  kMorohashi: ["00344"],
  kDaeJaweon: "0190.010",
  kTGH: ["2013:10"],
  kKSC0: ["7649"],
  kMainlandTelegraph: ["0086"],
  kTang: ["*njin", "njin"],
  kHangul: [%{grapheme: "인", source: "0E"}],
  kIRG_TSource: "T1-4429",
  kJapaneseKun: ["HITO"],
  kIRGKangXi: ["0091.010"],
  kIRG_JSource: "J0-3F4D",
  kHanYu: ["10101.100"],
  kXerox: ["241:051"],
  kRSAdobe_Japan1_6: ["C+2579+9.2.0"],
  kIRG_KSource: "K0-6C51",
  kCangjie: "O",
  kFenn: ["429A"],
  kCantonese: ["jan4"],
  kVietnamese: ["nhân"],
  kLau: ["3328"],
  kGB1: "4043",
  kIRG_KPSource: "KP0-FCC5",
  kKoreanEducationHanja: ["2007"],
  kKorean: ["IN"],
  kJapaneseOn: ["JIN", "NIN"],
  kRSUnicode: ["9.0"],
  kGB0: "4043",
  kFennIndex: ["226.01"],
  kXHC1983: ["0959.010:rén"],
  kMatthews: [...],
  ...
}

I’m pleased to report the first Hex release of unicode_unihan. I’m not a domain expert, but thankfully @jkwchui is, and the work and credit are all his. Thanks very much for a valuable contribution to the community.

I think this may be the most comprehensive introspection library of the Unicode Unihan database in any computer language.

You can try it out in Livebook. Just note that the Unihan database is huge, and compile times for the library are correspondingly slow, since all the parsing and decoding is done at compile time in order to deliver a very snappy runtime experience.
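For the curious, here is a minimal sketch of that compile-time pattern. It is not the library’s actual code; the file path, parsing and module name are illustrative assumptions (Unihan source files are tab-separated "U+XXXX<tab>kField<tab>value" lines):

defmodule UnihanSketch do
  # Recompile when the data file changes (the path is an assumption).
  @external_resource "priv/Unihan_Readings.txt"

  # Parse the tab-separated lines once, at compile time,
  # into a map keyed by codepoint.
  @unihan "priv/Unihan_Readings.txt"
          |> File.read!()
          |> String.split("\n", trim: true)
          |> Enum.reject(&String.starts_with?(&1, "#"))
          |> Enum.map(fn line ->
            [codepoint, field, value] = String.split(line, "\t", parts: 3)
            {codepoint, {field, value}}
          end)
          |> Enum.group_by(&elem(&1, 0), &elem(&1, 1))

  # Runtime lookups are then just a map fetch.
  def unihan(codepoint), do: Map.get(@unihan, codepoint)
end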


Thank you @kip for doing most of the intelligent heavy lifting, and for being such a gracious teacher. I have learnt much through this. The code is feature-complete, and the hefty documentation should be fleshed out over the next few weeks.

Besides shifting the burden to compile time, and hence delivering snappy runtime performance, the Elixir Unihan library has a pair of interesting features that are more than the sum of their parts:

  1. Decoding details into Elixir data structures. For example, the Unihan database may supply a binary of 1187.061 for a particular field, expecting the user to decode it based on the specs (then repeat for the other 90+ fields!). In Unicode.Unihan we decode every field for you as a map with suitable keys and types:
%{
  page: 1187,
  position: 6,
  virtual: true
}
  2. Availability of filter/1 and reject/1. These are similar to Enum.filter/2 and Enum.reject/2, taking a function as the argument and returning a subset of Unihan.

The combination of decoded sub-fields and the ability to act upon them means you can search through the database and chain the outcomes into pipelines:

Unicode.Unihan.filter(
  &(&1[:kGradeLevel] <= 2 and
    &1[:kCantonese][:tone] == "1")
)
|> Enum.sort_by(
  fn {_codepoint, map} ->
    map[:kTotalStrokes][:Hant]
  end,
  :asc
)

This is very flexible and, AFAIK, offers unique access to this trove of knowledge compiled over two decades. The included Livebook should offer an easy way to give this a spin with low commitment.


How about this instead?

&get_in(&1, [Access.elem(1), :kTotalStrokes, :Hant])

Also, :asc is the default, so you can leave it out.

Thanks for the suggestion. I’m not sure that would work, though. The input is a single tuple, so &2 doesn’t apply (there is only one term).

We can rewrite this as &get_in(elem(&1, 1), [:kTotalStrokes, :Hant]), or pipe it as

Enum.sort_by(&(
    &1
    |> elem(1) 
    |> get_in([:kTotalStrokes, :Hant])
  ))

but I find the original more readable.

(About the :asc: I was toggling it while tinkering with the first-sight output for a new user. I might be the one who got the most fun out of this :stuck_out_tongue:)


yes, my bad - I have updated my reply


Following on from the spirited conversation about string capitalisation, I was motivated to finish up my work on the Unicode Case Mapping Algorithm. A release candidate is now published as part of unicode_string version 1.3.0-rc.0.

Differences from the Elixir String casing functions

The implementation is locale-aware and implements all of the transforms in SpecialCasing.txt, including the conditional mappings for Greek, Turkish, Azeri and Lithuanian (Elixir’s string functions do not handle Lithuanian since its mappings are conditional transforms).

  • Supports uppercasing Greek (not specified in Unicode). Feedback is requested on the implementation, see the last section of the post on this topic.

  • Supports titlecasing the “IJ” diphthong in Dutch (also not specified in Unicode).

  • Titlecasing works on individual words, which is different from String.capitalize/2 (see the contrast example after this list).

  • Planned support for other languages as I learn more about common practice. Feedback and feature requests are welcome.
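
For contrast, Elixir’s built-in String.capitalize/2 titlecases only the first grapheme of the whole string and downcases the rest:

# String.capitalize/2 operates on the string as a whole,
# not on each word:
iex> String.capitalize("ijsselmeer olympic")
"Ijsselmeer olympic"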

Examples

# Basic case transformation
iex> Unicode.String.upcase("the quick brown fox")
"THE QUICK BROWN FOX"

# Dotted-I in Turkish and Azeri
iex> Unicode.String.upcase("Diyarbakır", locale: :tr)
"DİYARBAKIR"

# Upper case in Greek removes diacritics
iex> Unicode.String.upcase("Πατάτα, Αέρας, Μυστήριο", locale: :el)
"ΠΑΤΑΤΑ, ΑΕΡΑΣ, ΜΥΣΤΗΡΙΟ"

# Lower case Greek with a final sigma
iex> Unicode.String.downcase("ὈΔΥΣΣΕΎΣ", locale: :el)
"ὀδυσσεύς"

# Title case Dutch with leading diphthong
iex> Unicode.String.titlecase("ijsselmeer", locale: :nl)
"IJsselmeer"

Seeking feedback on Greek uppercasing

There is conflicting information on how to uppercase Greek. I would greatly welcome any feedback from native Greek speakers on what is correct in common usage. Here is what I have consolidated so far (extracted from the docs):

CLDR algorithm (current implementation)

According to CLDR, all accents on all characters are omitted when upcasing. This is based upon the CLDR el-Upper text transform:

  # Remove 0301 following Greek, with possible intervening 0308 marks.
  ::NFD();
  # For uppercasing (not titlecasing!) remove all greek accents from greek letters.
  # This is done in two groups, to account for canonical ordering.
  [:Greek:] [^[:ccc=Not_Reordered:][:ccc=Above:]]*? { [\u0313\u0314\u0301\u0300\u0306\u0342\u0308\u0304] → ;
  [:Greek:] [^[:ccc=Not_Reordered:][:ccc=Iota_Subscript:]]*? { \u0345 → ;
  ::NFC();

That transform basically says: remove all accents except a subscripted iota. It doesn’t handle diphthongs correctly.
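
As a rough illustration of the strip-accents-on-upcase idea (not the library’s implementation, and deliberately ignoring the only-Greek-letters, iota-subscript and canonical-ordering subtleties in the rules above), one could decompose, drop the accent marks, recompose and upcase:

defmodule GreekUpcaseSketch do
  # The combining marks listed in the CLDR rule above.
  @greek_accents [0x0313, 0x0314, 0x0301, 0x0300, 0x0306, 0x0342, 0x0308, 0x0304]

  # Naive: strips these marks from every character, not only Greek letters.
  def upcase(string) do
    string
    |> String.normalize(:nfd)
    |> String.to_charlist()
    |> Enum.reject(&(&1 in @greek_accents))
    |> List.to_string()
    |> String.normalize(:nfc)
    |> String.upcase()
  end
end

# GreekUpcaseSketch.upcase("Πατάτα, Αέρας, Μυστήριο")
# => "ΠΑΤΑΤΑ, ΑΕΡΑΣ, ΜΥΣΤΗΡΙΟ"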

Mozilla algorithm

Mozilla has a thread on a bug report that says:

Greek accented letters should be converted to the respective non-accented uppercase
letters. The required conversions are the following (in Unicode):

ά → Α
έ → Ε
ή → Η
ί → Ι
ΐ → Ϊ
ό → Ο
ύ → Υ
ΰ → Ϋ
ώ → Ω

Also diphthongs (two-vowel constructs) should be converted as follows, when the
first vowel is accented:

άι → ΑΪ
έι → ΕΪ
όι → ΟΪ
ύι → ΥΪ
άυ → ΑΫ
έυ → ΕΫ
ήυ → ΗΫ
όυ → ΟΫ

That thread seems to align with current-day Mozilla, which says the rules are:

In Greek (el), vowels lose their accent when the whole word is in
uppercase (ά/Α), except for the disjunctive eta (ή/Ή). Also, diphthongs
with an accent on the first vowel lose the accent and gain a diaeresis
on the second vowel (άι/ΑΪ).
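
Those rules are simple enough to sketch with plain string replacements over the tables from the bug report (a toy example assuming precomposed NFC input; it ignores the disjunctive-eta exception):

defmodule MozillaGreekSketch do
  # Diphthongs: the accent on the first vowel becomes a diaeresis on the second.
  @diphthongs %{"άι" => "ΑΪ", "έι" => "ΕΪ", "όι" => "ΟΪ", "ύι" => "ΥΪ",
                "άυ" => "ΑΫ", "έυ" => "ΕΫ", "ήυ" => "ΗΫ", "όυ" => "ΟΫ"}

  # Single vowels: accents are dropped.
  @vowels %{"ά" => "Α", "έ" => "Ε", "ή" => "Η", "ί" => "Ι", "ΐ" => "Ϊ",
            "ό" => "Ο", "ύ" => "Υ", "ΰ" => "Ϋ", "ώ" => "Ω"}

  def upcase(string) do
    string
    |> String.replace(Map.keys(@diphthongs), &@diphthongs[&1])
    |> String.replace(Map.keys(@vowels), &@vowels[&1])
    |> String.upcase()
  end
end

# MozillaGreekSketch.upcase("ώρα")  => "ΩΡΑ"
# MozillaGreekSketch.upcase("άυλη") => "ΑΫΛΗ" (see the reply below)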


I think I’m going to agree with the Mozilla bug report (caveat: not a linguist). In common usage there are two diacritic marks that survive:

  • the accent (as in ά or Ά), applicable to all vowels
  • the dialytics (as in ϋ or Ϋ), applicable only to some vowels (ι and υ); also referred to as a diaeresis in the text quoted above

The accent is used to denote where to stress the pronunciation. When capitalizing (e.g. άλλος -> Άλλος) the accent survives if it’s on the first letter of the word, but not when uppercasing (e.g. άλλος -> ΑΛΛΟΣ).

The dialytics are also used to inform pronunciation: for example, αυλή (meaning yard / backyard, pronounced ‘avlee’) does not require the dialytics because αυ is meant to be read as a diphthong, like ‘av’.

However, in the word άυλη (meaning without matter / ethereal), notice that the accent is now on ά, which changes the pronunciation to something like ‘aelee’. When uppercasing we would write it as ΑΫΛΗ: the dialytics are inserted to preserve the pronunciation, since there is no longer an accent on Α to disambiguate. (But if we were capitalizing, it would become Άυλη; the accent mark is preserved because it’s on the first letter of the word, and dialytics don’t have to be inserted for disambiguation.)

There’s also the case where both an accent and dialytics are present on the same vowel, but the same rules as above apply for the dialytics: the accent goes away when uppercasing, and the dialytics either stay or get added as needed.

I hope that helps some


In the word ὀδυσσεύς you posted above, the ὀ has a diacritic mark called a ‘psili’, but that is no longer in use, so you probably shouldn’t spend any time on this or other such marks.

I’ve just published unicode_string version 1.4.0. The primary update is the addition of dictionary-based word breaking for certain locales.

Not all languages use whitespace to separate words (most commonly East Asian languages), so a dictionary lookup is more appropriate, although not perfect.

The motivation to get this done now is to support the release, in 2 weeks, of ex_cldr_person_names.

Supported locales for dictionary-based word breaking

This implementation supports dictionary-based word breaking for:

  • Chinese (zh, zh-Hant, zh-Hans, zh-Hant-HK, yue, yue-Hans) locales,
  • Japanese (ja) using the same dictionary as for Chinese,
  • Thai (th),
  • Lao (lo),
  • Khmer (km) and
  • Burmese (my).

The dictionaries implemented are those used in CLDR, since they are under an open source license, and also for consistency with ICU.

Downloading dictionaries

Note that these dictionaries need to be downloaded with mix unicode.string.download.dictionaries prior to use. Each dictionary will be parsed and loaded into :persistent_term on demand.
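
A minimal sketch of that on-demand :persistent_term pattern (the module, key shape and parser here are illustrative assumptions, not the library’s internals):

defmodule DictionaryCacheSketch do
  # Return the parsed dictionary, parsing and caching it on first use.
  def dictionary(locale) do
    key = {__MODULE__, locale}

    case :persistent_term.get(key, nil) do
      nil ->
        dict = parse_dictionary!(locale)
        :persistent_term.put(key, dict)
        dict

      dict ->
        dict
    end
  end

  # Hypothetical parser standing in for the real dictionary loading.
  defp parse_dictionary!(_locale), do: %{}
end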

Memory usage

Note that each dictionary has a sizable memory footprint, as measured by :persistent_term.info/0:

Dictionary   Memory (MB)
Chinese            104.8
Thai                 9.6
Lao                 11.4
Khmer               38.8
Burmese             23.1

Examples

iex> Unicode.String.split("明德", locale: :en)
["明", "德"]

iex> Unicode.String.split("明德", locale: :zh_Hant_HK)
["明德"]

iex> Unicode.String.split("สวัสดีเจ้านาย", locale: :th) 
["สวัสดี", "เจ้า", "นาย"]

iex> Unicode.String.split("ສະບາຍດີນາຍຈ້າງ", locale: :lo)
["ສະບາຍດີ", "ນາຍ", "ຈ້າງ"]

iex> Unicode.String.split("မင်္ဂလာပါ သူဌေး", locale: :my)
["မင်္ဂလာ", "ပါ", " ", "သူဌေး"]

iex> Unicode.String.split("ជំរាបសួរចៅហ្វាយ", locale: :km)
["ជំរាបសួរ", "ចៅហ្វាយ"]