Unicode Transform - transliterate text between scripts, un-accenting and case mapping

unicode_transform has reached the 1.0 milestone after a complete rewrite. The library transliterates text between scripts, applies normalization and case mappings, and executes arbitrary CLDR transform rule sets at runtime.

An opt-in NIF is included for high-performance transformations (see the performance section of the readme for details; the NIF is not always the fastest transformer). A fast-path Latin-to-ASCII module is also included. It is faster than the NIF and is used automatically.

unicode_transform ships with all 394 CLDR transforms covering script conversions (Greek, Cyrillic, Arabic, Devanagari, Thai, Hangul, and many more), Indic cross-script transliterations, BGN/PCGN romanizations, and specialized transforms like Any-Publishing and Fullwidth-Halfwidth.

Examples

Here are some examples of what unicode_transform can do. The primary public API is Unicode.Transform.transform/2.

Script-to-Latin transliteration

Convert text from non-Latin scripts to Latin characters:

# Greek to Latin
iex> Unicode.Transform.transform("Ελληνικά", from: :greek, to: :latin)
{:ok, "Ellēniká"}

# Cyrillic to Latin
iex> Unicode.Transform.transform("Москва", from: :cyrillic, to: :latin)
{:ok, "Moskva"}

# Korean to Latin
iex> Unicode.Transform.transform("한글", from: :hangul, to: :latin)
{:ok, "hangeul"}

# Thai to Latin
iex> Unicode.Transform.transform("กรุงเทพ", from: :thai, to: :latin)
{:ok, "krungtheph"}

# Arabic to Latin
iex> Unicode.Transform.transform("عربي", from: :arabic, to: :latin)
{:ok, "ʿrby"}
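When romanizing user-supplied text, it can be convenient to fall back to the original string if a transform does not succeed. The following is a minimal sketch, not part of unicode_transform: the module name RomanizeSketch and the function to_latin/3 are hypothetical, and since the error return shape is not described in this post, any result other than {:ok, _} is treated as a failure.

```elixir
defmodule RomanizeSketch do
  @moduledoc "Hypothetical helper for illustration; not part of unicode_transform."

  # The transformer is passed as an argument so the helper can be exercised
  # with a stub; by default it is the transform/2 call shown in the examples.
  def to_latin(text, script, transform \\ &Unicode.Transform.transform/2) do
    case transform.(text, from: script, to: :latin) do
      # Success: return the bare transliterated string.
      {:ok, latin} -> latin
      # Assumption: anything else is a failure, so keep the input unchanged.
      _other -> text
    end
  end
end
```

Going by the Cyrillic example above, RomanizeSketch.to_latin("Москва", :cyrillic) should return "Moskva".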

Latin-ASCII (accent stripping)

Remove diacritics and convert to plain ASCII:

iex> Unicode.Transform.transform("Ä Ö Ü ß", from: :latin, to: :ascii)
{:ok, "A O U ss"}

iex> Unicode.Transform.transform("café résumé", from: :latin, to: :ascii)
{:ok, "cafe resume"}
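A common use for accent stripping is building URL slugs. The sketch below assumes only the transform/2 call with from: :latin, to: :ascii shown above; the SlugSketch module and slugify/2 are hypothetical names, and non-{:ok, _} results are passed through unchanged because the error shape is not covered in this post.

```elixir
defmodule SlugSketch do
  @moduledoc "Hypothetical helper for illustration; not part of unicode_transform."

  # The transformer is injectable so the pipeline can be tried without the
  # library installed; by default it is the Latin-ASCII call shown above.
  def slugify(text, transform \\ &Unicode.Transform.transform/2) do
    case transform.(text, from: :latin, to: :ascii) do
      {:ok, ascii} ->
        slug =
          ascii
          |> String.downcase()
          # Collapse every run of non-alphanumeric characters into one hyphen.
          |> String.replace(~r/[^a-z0-9]+/, "-")
          # Drop hyphens left at either end.
          |> String.trim("-")

        {:ok, slug}

      # Assumption: the error shape is unknown here, so return it untouched.
      other ->
        other
    end
  end
end
```

Stripping accents before lowercasing and filtering means characters such as é survive as e rather than being replaced by a hyphen.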

German-specific ASCII transliteration

Uses context-sensitive rules (e.g., uppercase Ä becomes AE, lowercase ä becomes ae):

iex> Unicode.Transform.transform("Ä ö ü", transform: "de-ASCII")
{:ok, "AE oe ue"}

iex> Unicode.Transform.transform("Ä ö ü", from: :de, to: :ASCII)
{:ok, "AE oe ue"}

iex> Unicode.Transform.transform("Ä ö ü", from: "de", to: "ASCII")
{:ok, "AE oe ue"}

Cross-script Indic transliteration

Convert between Indic scripts without going through Latin:

iex> Unicode.Transform.transform("हिन्दी", from: :devanagari, to: :bengali)
{:ok, "হিন্দী"}

iex> Unicode.Transform.transform("বাংলা", from: :bengali, to: :gujarati)
{:ok, "બাંলা"}

Japanese script conversion

iex> Unicode.Transform.transform("あいうえお", from: :hiragana, to: :katakana)
{:ok, "アイウエオ"}

# Options accept strings too (case-insensitive)
iex> Unicode.Transform.transform("あいうえお", from: "Hiragana", to: "Katakana")
{:ok, "アイウエオ"}

iex> Unicode.Transform.transform("tokyo", from: :latin, to: :katakana)
{:ok, "トキョ"}

Normalization and case transforms

Built-in transforms for Unicode normalization forms and case mapping:

iex> Unicode.Transform.transform("hello world", to: :upper)
{:ok, "HELLO WORLD"}

iex> Unicode.Transform.transform("hello world", to: :title)
{:ok, "Hello World"}

iex> Unicode.Transform.transform("A\u0308", to: :nfc)
{:ok, "Ä"}

Migration

If you’re using a unicode_transform version before 1.0.0, note that the API has changed, though not dramatically. You will need to make some modifications to use the updated Unicode.Transform.transform/2 function.

Implementation notes

The implementation was very strongly supported by Claude. I think this kind of project is a really good fit for LLM-supported development:

  • The specification is well-written and complete, so the LLM can readily derive an implementation from it.
  • There is a reference implementation in ICU, so the implementation can be tested against it. Having the NIF interface to ICU definitely helps speed up development and testing.