Cldr Collation - language-aware string sorting and comparison with opt-in NIF

Announcing ex_cldr_collation version 1.0

ex_cldr_collation is an Elixir implementation of the Unicode Collation Algorithm (UCA) as extended by CLDR, providing language-aware string sorting and comparison. An opt-in NIF is provided for high-performance collation.

ex_cldr_collation has no dependency on ex_cldr, but if ex_cldr is configured in your app, Cldr.LanguageTag.t locales can be passed as the :locale option.

Features

  • Full Unicode Collation Algorithm implementation in pure Elixir.

  • CLDR root collation based on the Unicode DUCET table.

  • Locale-specific tailoring for 10+ languages (Danish, German phonebook, Spanish, Swedish, Finnish, etc.)

  • All BCP47 -u- extension collation keys supported.

  • Optional high-performance NIF backend using ICU4C.

  • Sort key generation for efficient repeated comparisons.

Examples

Many options are available to handle locale-specific and user-specific sorting requirements. Here are just a few basic examples:

iex> Cldr.Collation.sort(["café", "cafe", "Cafe"])
["cafe", "Cafe", "café"]

# Cased comparisons
iex> Cldr.Collation.sort(["café", "cafe", "Cafe"], case_first: :upper)
["Cafe", "cafe", "café"]

iex> Cldr.Collation.compare("café", "cafe")
:gt

iex> Cldr.Collation.compare("a", "A", casing: :insensitive)
:eq
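Because compare returns :lt, :eq or :gt, it should also slot into Elixir's own sorting functions, which accept a module exporting compare/2 as the sorter. A minimal sketch, assuming Cldr.Collation exports compare/2 (as the two-argument call above suggests) and that the package is available on Hex as :ex_cldr_collation; the people list is made up for illustration:

```elixir
# Assumption: the package is published on Hex as :ex_cldr_collation.
Mix.install([{:ex_cldr_collation, "~> 1.0"}])

# Made-up records, sorted by their :name field.
people = [%{name: "café"}, %{name: "cafe"}, %{name: "Cafe"}]

# Enum.sort_by/3 takes a sorter; when given a module it calls
# that module's compare/2 and treats anything other than :gt as "in order".
sorted = Enum.sort_by(people, & &1.name, Cldr.Collation)

IO.inspect(Enum.map(sorted, & &1.name))
```

This is handy when the strings live inside structs or maps and Cldr.Collation.sort can't be applied to the list directly.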

# Numeric ordering. Note that the default order compares digit by digit,
# so the 1 in "Level 10" sorts before the 2 in "Level 2"
iex> Cldr.Collation.sort(["Level 10", "Level 2"], numeric: false)
["Level 10", "Level 2"]

# But numeric sorting treats consecutive digits as a single number,
# and not just Indo-Arabic digits - any digits in any script.
iex> Cldr.Collation.sort(["Level 10", "Level 2"], numeric: true)
["Level 2", "Level 10"]

# German phonebook ordering
iex> words = ["Ärger", "Alter", "Ofen", "Öl", "Über", "Ulm"]

iex> Cldr.Collation.sort(words)
["Alter", "Ärger", "Ofen", "Öl", "Über", "Ulm"]

iex> Cldr.Collation.sort(words, locale: "de-u-co-phonebk")
["Ärger", "Alter", "Öl", "Ofen", "Über", "Ulm"]

# Collation options taken from the locale's -u- extension
# (ks-level2 sets secondary strength, which ignores case)
iex> Cldr.Collation.compare("a", "A", locale: "en-u-ks-level2")
:eq

# Sort key generation
iex> Cldr.Collation.sort_key("hello")
<<36, 196, 36, 83, 37, 40, 37, 40, 37, 152, 0, 0, 0, 32, 0, 32, 0, 32, 0, 32, 0,
  32, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2>>
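Sort keys pay off when the same strings are compared many times: the key is computed once, and after that ordering is a plain binary comparison. A minimal sketch of that pattern, assuming the package is available on Hex as :ex_cldr_collation and using only the sort_key/1 call shown above; the result should agree with Cldr.Collation.sort/1 for the same input:

```elixir
# Assumption: the package is published on Hex as :ex_cldr_collation.
Mix.install([{:ex_cldr_collation, "~> 1.0"}])

words = ["café", "cafe", "Cafe"]

# Compute each sort key once and keep it next to its string.
keyed = Enum.map(words, fn word -> {Cldr.Collation.sort_key(word), word} end)

# Sort keys are binaries, so ordinary term comparison orders them bytewise.
sorted =
  keyed
  |> Enum.sort_by(fn {key, _word} -> key end)
  |> Enum.map(fn {_key, word} -> word end)

IO.inspect(sorted)
```

The keyed list can be stored (for example alongside database rows) so later sorts or lookups never need to run the collation algorithm again.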

Implementation notes

As with unicode_transform, Claude was a very valuable co-developer for this release. The same two factors make this a very powerful combination for development and testing:

  • The specification is clear and complete, so it is easy for an LLM to ingest.
  • There is a reference implementation against which test validation can run automatically. The NIF-based interface to ICU makes this almost trivial.
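
The second point boils down to differential testing: run the same inputs through two implementations and report every disagreement. A rough sketch of the idea, not the project's actual test suite; DifferentialCheck and both stand-in comparison functions are hypothetical, and in practice the two functions would wrap the pure Elixir collator and the ICU-backed NIF:

```elixir
defmodule DifferentialCheck do
  @moduledoc """
  Hypothetical helper: runs two implementations of the same function over
  a list of inputs and collects every input on which they disagree.
  """

  @spec mismatches([input], (input -> term()), (input -> term())) ::
          [{input, term(), term()}]
        when input: term()
  def mismatches(inputs, candidate, reference) do
    Enum.flat_map(inputs, fn input ->
      got = candidate.(input)
      expected = reference.(input)
      if got == expected, do: [], else: [{input, got, expected}]
    end)
  end
end

# Stand-in "reference": plain bytewise comparison of the two strings.
reference = fn {a, b} ->
  cond do
    a < b -> :lt
    a > b -> :gt
    true -> :eq
  end
end

# Stand-in "candidate": the same comparison, but ignoring case.
candidate = fn {a, b} -> reference.({String.downcase(a), String.downcase(b)}) end

inputs = [{"a", "b"}, {"b", "a"}, {"a", "A"}]

# Prints [{{"a", "A"}, :eq, :gt}], the only pair the stand-ins disagree on.
IO.inspect(DifferentialCheck.mismatches(inputs, candidate, reference))
```

An empty result over a large generated input set is the signal that the implementation under test matches the reference.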