Ex_cldr - Common Locale Data Repository (CLDR) functions for Elixir

ex_cldr provides localisation and internationalisation support based upon the data from the Unicode CLDR project.

Unicode released CLDR version 34 this week, and ex_cldr has been updated to reflect that data, which now covers 537 locales that can be used in Elixir.

The full list of updated packages (core and optional) is:

Also updated is ex_money, since it uses ex_cldr and friends under the covers for localisation and formatting:

This is expected to be the last functional release of ex_cldr version 1.x, with the release of 2.0 expected by the end of this year. Bugs in 1.x will of course continue to be eradicated as quickly as possible.


Cldr version 2.0 has just been released on hex. It's a major version bump with breaking changes - primarily to restructure the code in the manner of Ecto, Phoenix and Gettext by requiring a <backend> module into which most of the public API that hosts the CLDR content is generated. It also gets rid of the horrid :ex_cldr compiler.

ex_cldr provides the underlying data that powers number, date, time, list, unit and territory formatting in over 500 different locales. It's also used to underpin ex_money.

The changelog contains the changes. Several of the dependent packages are also updated:

Some packages will be updated over the next two weeks:


Hey Kip - thanks for this library!

I’m trying to determine whether I should use CLDR or IANA, or both, to specify languages in my app.

Your library/CLDR follows RFC 5646.
The W3C also recommends RFC 5646 -
https://www.w3.org/International/articles/language-tags/ … but also refers to https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry as the source of truth for RFC 5646.

However, some language codes in IANA’s registry are not available in CLDR’s data:

iex(27)> Enum.member?(MyApp.Cldr.Language.all_languages(), "adq")
false

CLDR still parses it and gets the language key filled out, though:

iex(28)> Cldr.Locale.new("adq", MyApp.Cldr)
{:ok,
 %Cldr.LanguageTag{
   canonical_locale_name: "adq-Latn-US",
   cldr_locale_name: nil,
   extensions: %{},
   gettext_locale_name: nil,
   language: "adq",
   language_subtags: [],
   language_variant: nil,
   locale: %{},
   private_use: [],
   rbnf_locale_name: nil,
   requested_locale_name: "adq",
   script: "Latn",
   territory: "US",
   transform: %{}
 }}

so that’s great. But that language isn’t found:

iex(29)> MyApp.Cldr.Number.to_string 12345, locale: "adq"
{:error, {Cldr.UnknownLocaleError, "The locale \"adq\" is not known."}}

Do you know why CLDR does not have a language code that IANA has, given they’re following the same spec?

I think it’s because the spec is BCP 47, which they both follow… But I’m not sure how IANA has a higher quantity of language tags, whereas CLDR has language-REGION tag combos which are not specified in IANA’s registry.

I’m ultimately trying to ensure I’m using languages properly as my organization has a massive amount of languages.

I’m wondering about storing IANA’s codes separately from CLDR, and using your library to augment IANA’s codes where possible with the wealth of extra data your lib provides. I was just interested in your thoughts on IANA vs. CLDR, if you have any, and whether you think there’s room to use both.

Thanks!


All but one of the companion libs is now updated. Cldr.DatesTimes is being actively updated to reflect the new Calendar functions in Elixir 1.8 and will be out by the end of January.


@gdub01, thanks for your interest. The difference is primarily that CLDR is not a registry but a data repository (the Common Locale Data Repository). Whilst ex_cldr will parse adq as a valid language tag, CLDR doesn’t have any translation data for it, which is why the cldr_locale_name field in the struct is nil.

There are 533 languages supported in the current CLDR version 34 repository.

If your primary objective is to detect valid language tags then I think the two choices are:

  1. Use IANA data alone
  2. Use CLDR in conjunction with IANA data. CLDR will detect obsolete tags and update them to the modern version, apply default sub tags where known and also apply known aliases. This may (or may not) be useful for you in having as lenient a parse as possible.

If your primary objective is application localisation then I think CLDR is the most comprehensive repository available and it underpins most of the application domain globally - often through the libs icu4c and icu4j.

ex_cldr is an Elixir implementation that largely matches the functionality of icu4j for output but does not implement parsing.
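For the "IANA data alone" option, a minimal sketch of checking a language subtag against the registry could look like the following. It assumes the registry's plain-text record format (records separated by `%%`, one `Key: value` field per line). The module name `IanaRegistrySketch` and the abbreviated sample text are hypothetical and only for illustration; real records carry more fields, and folded continuation lines are not handled here.

```elixir
defmodule IanaRegistrySketch do
  # Hypothetical helper, not part of ex_cldr.
  # Returns a MapSet of all subtags whose record has "Type: language".
  def language_subtags(registry_text) do
    registry_text
    |> String.split("%%", trim: true)
    |> Enum.map(&parse_record/1)
    |> Enum.filter(&(&1["Type"] == "language"))
    |> Enum.map(& &1["Subtag"])
    |> MapSet.new()
  end

  # Turns one record into a map of field name => value.
  # Lines without a "Key: value" shape are skipped.
  defp parse_record(record) do
    record
    |> String.split("\n", trim: true)
    |> Enum.flat_map(fn line ->
      case String.split(line, ": ", parts: 2) do
        [key, value] -> [{key, String.trim(value)}]
        _ -> []
      end
    end)
    |> Map.new()
  end
end

# Abbreviated, illustrative sample in the registry's record format.
sample = """
%%
Type: language
Subtag: adq
%%
Type: language
Subtag: en
%%
Type: region
Subtag: US
"""

subtags = IanaRegistrySketch.language_subtags(sample)
IO.inspect(MapSet.member?(subtags, "adq"))
IO.inspect(MapSet.member?(subtags, "US"))
```

A tag that passes this check can then be handed to Cldr.Locale.new/2 to find out whether CLDR also has locale data for it.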


Thanks so much @kip! I think I’ll go with CLDR & IANA. Appreciate it.

A few long flights have given an opportunity to make some updates to the ex_cldr set of libraries:

  • ex_cldr_print provides C-compatible printf/3 and sprintf/3 functions for formatting strings. Since it's built on CLDR data, it includes localisation of grouping characters, decimal points and exponent characters. It also means you can output in different digit systems (like thai, arab and so on). So far it is only on GitHub; it needs further tuning and some development before a hex.pm release.

  • ex_cldr_collation, which implements locale-specific collations (sorting). It is NIF-based and currently only supports the default CLDR collation. This lib is based upon the Erlang library erlang-ucol. The next step is to support the full range of collations for configured locales.


CLDR version 35 was released on March 27th and is now incorporated into updates to the cldr_* family on hex. CLDR supports localisation of numbers, dates, times, lists and units for 540 locales. It supports multiple calendars (coming soon in ex_cldr_calendars) as well.

Summary of CLDR 35.0.0 update

Data: 70,000+ new data fields, 13,400+ revised data fields
Basic coverage: New languages at Basic coverage: Cebuano (ceb), Hausa (ha), Igbo (ig), Yoruba (yo)
Modern coverage: Languages Somali (so) and Javanese (jv) increased coverage from Moderate to Modern
Emoji 12.0: Names and annotations (search keywords) for 90+ new emoji; also includes fixes for previous names & keywords
Collation: Collation updated to Unicode 12.0, including new emoji; Japanese single-character (ligature) era names added to collation and search collation
Measurement units: 23 additional units
Date formats: Two additional flexible formats, and 20 new interval formats
Japanese calendar: In Japanese locale, updated to use Gannen (元年) year numbering for non-numeric formats (which include 年), and to consistently use narrow eras in numeric date formats such as “H31/3/27”.
Region names: Many names updated to local equivalents of “North Macedonia” (MK) and “Eswatini” (SZ).
Segmentation: Enhanced Grapheme Cluster Boundary rules for 6 Indic scripts: Gujr, Telu, Mlym, Orya, Beng, Deva.

Related Cldr releases on Hex

Migration from earlier versions of Cldr

No code changes are expected for client applications. However, since CLDR is a data repository, underlying data may have changed.


I have pushed to hex a new member of the ex_cldr_* family of packages: ex_cldr_calendars.

From the readme:

Cldr Calendars builds on Elixir’s standard Calendar module to provide additional calendars and calendar functionality intended to be of practical use. In particular Cldr Calendars:

  • Provides support for configurable month-based and week-based calendars that are in common use as “Fiscal Year” calendars for countries and organizations around the world. See Cldr.Calendar.new/3

  • Supports localisation of common calendar terms such as “day of the week” and “month of the year” using the CLDR data that is available for over 500 locales. See Cldr.Calendar.localize/3

  • Supports locale-specific knowledge of what is a weekend or a workday. See Cldr.Calendar.weekend/1, Cldr.Calendar.weekend?/2, Cldr.Calendar.weekdays/1 and Cldr.Calendar.weekday?/2.

  • Provides convenient Date.Range calculators for years, quarters, months and weeks for calendars and provides the means to move to the next and previous period in a calendar where a period may be a year, quarter, month, week or day.

  • Supports adding or subtracting periods to dates and date ranges. See Calendar.plus/3 and Calendar.minus/3

  • Includes pre-defined calendars for Gregorian (compatible with the built-in Calendar module), ISOWeek and National Retail Federation (NRF) calendars

  • Includes functions to find the first, last, nearest and nth days of the week from a date. For example, find the 2nd Tuesday in November.
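To make the last item concrete, here is a minimal sketch of the "nth weekday" calculation using only Elixir's standard Date module. It is not the ex_cldr_calendars implementation; the module and function names are hypothetical, weekdays follow Date.day_of_week/1 (Monday = 1 .. Sunday = 7), and no check is made that the result stays within the requested month.

```elixir
defmodule NthWeekdaySketch do
  # Hypothetical helper, for illustration only.
  # Finds the nth occurrence of `weekday` (1 = Monday .. 7 = Sunday)
  # counting from the first day of the given month.
  def nth_weekday(year, month, weekday, n) do
    first = Date.new!(year, month, 1)
    # Days from the 1st of the month to the first occurrence of `weekday`.
    offset = Integer.mod(weekday - Date.day_of_week(first), 7)
    Date.add(first, offset + 7 * (n - 1))
  end
end

# The 2nd Tuesday in November 2019.
second_tuesday = NthWeekdaySketch.nth_weekday(2019, 11, 2, 2)
IO.inspect(second_tuesday)
```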


ex_cldr_calendars_format is a new member of the cldr_* family of libs that formats calendars - either standard Calendar.ISO calendars or any calendar that implements the Cldr.Calendar behaviour. Because it is built on CLDR, it provides localised formatting. Formatting is done by any module that implements the Cldr.Calendar.Formatter behaviour. Formatters for HTML (basic formatting and another formatter for week-based calendars) and Markdown are included. It's easy to implement a formatter: there are only four callbacks.

A couple of examples:

iex> Cldr.Calendar.Format.month 2019, 4, formatter: Cldr.Calendar.Formatter.Markdown          
"### April 2019\n\nMon | Tue | Wed | Thu | Fri | Sat | Sun\n :---:  |  :---:  |  :---:  |  :---:  |  :---:  |  :---:  |  :---: \n**1** | **2** | **3** | **4** | **5** | **6** | **7**\n**8** | **9** | **10** | **11** | **12** | **13** | **14**\n**15** | **16** | **17** | **18** | **19** | **20** | **21**\n**22** | **23** | **24** | **25** | **26** | **27** | **28**\n**29** | **30** | 1 | 2 | 3 | 4 | 5\n6 | 7 | 8 | 9 | 10 | 11 | 12\n"

Which when rendered produces:

Mon Tue Wed Thu Fri Sat Sun
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30 1 2 3 4 5
6 7 8 9 10 11 12

And it's easy to localise:

iex> Cldr.Calendar.Format.month 2019, 4, formatter: Cldr.Calendar.Formatter.Markdown, locale: "ar-SA"
"### أبريل ٢٠١٩\n\nالاثنين | الثلاثاء | الأربعاء | الخميس | الجمعة | السبت | الأحد\n :---:  |  :---:  |  :---:  |  :---:  |  :---:  |  :---:  |  :---: \n**١** | **٢** | **٣** | **٤** | **٥** | **٦** | **٧**\n**٨** | **٩** | **١٠** | **١١** | **١٢** | **١٣** | **١٤**\n**١٥** | **١٦** | **١٧** | **١٨** | **١٩** | **٢٠** | **٢١**\n**٢٢** | **٢٣** | **٢٤** | **٢٥** | **٢٦** | **٢٧** | **٢٨**\n**٢٩** | **٣٠** | ١ | ٢ | ٣ | ٤ | ٥\n٦ | ٧ | ٨ | ٩ | ١٠ | ١١ | ١٢\n"

Which renders as:

الاثنين الثلاثاء الأربعاء الخميس الجمعة السبت الأحد
١ ٢ ٣ ٤ ٥ ٦ ٧
٨ ٩ ١٠ ١١ ١٢ ١٣ ١٤
١٥ ١٦ ١٧ ١٨ ١٩ ٢٠ ٢١
٢٢ ٢٣ ٢٤ ٢٥ ٢٦ ٢٧ ٢٨
٢٩ ٣٠ ١ ٢ ٣ ٤ ٥
٦ ٧ ٨ ٩ ١٠ ١١ ١٢

That's all for calendars for now. Back to finally finishing up ex_cldr_dates_times, which needed ex_cldr_calendars to be done as a prerequisite.


Unicode has released CLDR version 35.1.0.

In addition to updates related to the new Reiwa era, the CLDR 35.1 release includes a small number of other updates, including more localized name updates for North Macedonia, and support for tzdata 2019a.

As a result, ex_cldr version 2.7.0 is now published to support CLDR version 35.1.0.


Great! Thanks so much for providing such essential libs to elixir :smiley:


Finally updated ex_cldr_dates_times to version 2.0. This completes the migration of the Cldr_* family to version 2.0.

Please note that ex_cldr_dates_times requires Elixir 1.8 or later since it leverages capabilities of the Calendar module.

It also depends on ex_cldr_calendars, which implements enhanced calendar functionality including week-based calendars, financial year calendars and user-defined calendars based upon the Gregorian calendar. In future releases it will support additional calendars, with a first focus on Persian and Islamic.

With this release out the door the Cldr_* family has only two packages left on my “todo” list:


A pretty large update to ex_cldr_units today. The primary focus is on converting units from one measurement system (e.g. :metric) to the system customary for a territory (country). I think this is pretty cool :slight_smile: The changelog entry is:

Enhancements

  • Add Cldr.Unit.localize/3 to support converting a given unit into units that are familiar to a given territory. For example, given a unit of #Unit<:meter, 1.8> it would normally be expected to show this as [#Unit<:foot, 5>, #Unit<:inch, 11>] in the US. The data to support these conversions is returned by Cldr.Unit.unit_preferences/0. An example:
  iex> height = Cldr.Unit.new(1.8, :meter)
  iex> us_height = Cldr.Unit.localize height, :person, territory: :US, style: :informal
  [#Unit<:foot, 5>, #Unit<:inch, 11>]
  iex> Cldr.Unit.to_string us_height
  "5 feet and 11 inches"
  • Note that conversion is dependent on context. The context above is :person, reflecting that we are referring to the height of a person. For units in the length category, the other contexts available are :rainfall, :snowfall, :vehicle, :visibility and :road. Using the above example with the context of :rainfall we see
  iex> rainfall = Cldr.Unit.localize height, :rainfall, territory: :US
  [#Unit<:inch, 71>]
  iex> Cldr.Unit.to_string rainfall
  "71 inches"
  • Adds a :per option to Cldr.Unit.to_string/3. This option leverages the per formatting style to allow compound units to be printed. For example, assume we want to emit a string which represents “kilograms per second”. There is no such unit defined in CLDR (or perhaps anywhere!). But if we define the unit unit = Cldr.Unit.new(:kilogram, 20) we can then execute Cldr.Unit.to_string(unit, per: :second). Each locale defines a specific way to format such a compound unit. Usually it will return something like 20 kilograms/second

  • Adds Cldr.Unit.unit_preferences/0 to map units to a territory's preferred alternative unit(s)

  • Adds Cldr.Unit.measurement_systems/0 that identifies the unit system in use for a territory

  • Adds Cldr.Unit.measurement_system_for/1 that returns the measurement system in use for a given territory. The result will be one of :metric, :US or :UK.
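The arithmetic behind the 1.8 metre example can be sketched in plain Elixir. This is only an illustration of the conversion, not how ex_cldr_units implements it (the library drives the choice of target units from the CLDR unit preference data); the module name is hypothetical and the result is rounded to whole inches.

```elixir
defmodule HeightSketch do
  # Hypothetical helper, for illustration only.
  @metres_per_inch 0.0254
  @inches_per_foot 12

  # Total length in whole inches, as in the :rainfall example.
  def to_inches(metres), do: round(metres / @metres_per_inch)

  # Length split into {feet, inches}, as in the :person example.
  def to_feet_and_inches(metres) do
    total = to_inches(metres)
    {div(total, @inches_per_foot), rem(total, @inches_per_foot)}
  end
end

IO.inspect(HeightSketch.to_inches(1.8))
IO.inspect(HeightSketch.to_feet_and_inches(1.8))
```

Both results match the changelog examples: 71 inches for rainfall, and 5 feet 11 inches for a person's height.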


ex_cldr_messages version 0.1.0 is now published. This package implements the ICU Message Format. A good explanation of the motivation for the ICU Message format is in this presentation by Mark Davis from Google.

Eventually this package will be an alternative to Gettext; however, this first release is simply a message formatter.

The ICU message format leverages the localisation data from CLDR. This implementation uses the capabilities of ex_cldr, ex_cldr_numbers, ex_cldr_dates_times, ex_cldr_units and ex_cldr_lists to simplify the localisation of messages.

A couple of examples:

 # Simple binding interpolation
 iex> Cldr.Message.to_string! "My name is {name}", name: "Kip"
 "My name is Kip"

 # Pluralization uses the CLDR pluralisation rules implemented in ex_cldr
 iex> Cldr.Message.to_string! "On {taken_date, date, short} {name} took {num_photos, plural,
        =0 {no photos.}
        =1 {one photo.}
        other {# photos.}}", 
      taken_date: Date.utc_today(), name: "Kip", num_photos: 10
 "On 8/26/19 Kip took 10 photos."

 # ICU Messages can manage grammatical gender and nested message formats
 iex> Cldr.Message.to_string! "{gender_of_host, select,
      female {
        {num_guests, plural, offset: 1
          =0 {{host} does not give a party.}
          =1 {{host} invites {guest} to her party.}
          =2 {{host} invites {guest} and one other person to her party.}
          other {{host} invites {guest} and # other people to her party.}}}
      male {
        {num_guests, plural, offset: 1
          =0 {{host} does not give a party.}
          =1 {{host} invites {guest} to his party.}
          =2 {{host} invites {guest} and one other person to his party.}
          other {{host} invites {guest} and # other people to his party.}}}
      other {
        {num_guests, plural, offset: 1
          =0 {{host} does not give a party.}
          =1 {{host} invites {guest} to their party.}
          =2 {{host} invites {guest} and one other person to their party.}
          other {{host} invites {guest} and # other people to their party.}}}}",
      gender_of_host: "male", host: "Kip", guest: "Jim", num_guests: 4
"Kip invites Jim and 3 other people to his party."

The package is under active development with expected frequent updates over the next couple of months.
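The `offset: 1` in the party example is worth spelling out. In ICU plural formats, explicit `=N` branches are matched against the raw value first; otherwise the offset is subtracted, the plural category is chosen from the adjusted value, and `#` is replaced by that adjusted value. A minimal sketch of that selection, assuming English plural rules only and a hypothetical module name (this is not the ex_cldr_messages implementation):

```elixir
defmodule PluralOffsetSketch do
  # Hypothetical helper, for illustration only.
  # `branches` maps {:exact, n} or a plural category atom to a template.
  def select(count, offset, branches) do
    adjusted = count - offset

    template =
      # Exact matches use the raw count, before the offset is applied.
      Map.get(branches, {:exact, count}) ||
        # Otherwise pick the plural category of the adjusted count.
        Map.get(branches, english_category(adjusted)) ||
        Map.fetch!(branches, :other)

    String.replace(template, "#", Integer.to_string(adjusted))
  end

  # English cardinal rules for integers: 1 is :one, everything else :other.
  defp english_category(1), do: :one
  defp english_category(_), do: :other
end

branches = %{
  {:exact, 0} => "Kip does not give a party.",
  {:exact, 1} => "Kip invites Jim to his party.",
  {:exact, 2} => "Kip invites Jim and one other person to his party.",
  :other => "Kip invites Jim and # other people to his party."
}

IO.puts(PluralOffsetSketch.select(4, 1, branches))
```

With four guests the `other` branch is selected and `#` becomes 3, matching the output shown in the example.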


Do you compile the strings or interpret them every time you call Message.to_string!?

This first version is interpreted on each call. It’s clearly not ready for real world use.

The next version, which I’ll have done in a few days, introduces the f/1 and f/2 macros which will compile the format.

What I’m aiming for is to make it as easy as possible to update existing strings to be localised. So something like just converting "this is a string" to f"this is a string" would be one way to do that. With interpolation it would be f"this is a {number}" and it will assume number is defined in the scope and therefore resolve to var!/1. This way the behaviour is similar to standard Elixir string interpolation.

Then after that it's all of the translation scaffolding. Rather than take the PO file approach, I am looking at pluggable backends, with a primary intent to build a backend translation service so that translations can be updated without code building and release cycles, and to allow good support for crowd-sourced translations.

I know you’ve done a lot of work in this area so any feedback, thoughts, suggestions and criticisms are most welcome.

I’m not sure I really like that. It means coupling the name in the translatable string to code concerns, a.k.a. a variable name. I guess I’d rather it worked like gettext, where you supply a keyword list for substitution.

This is especially relevant if the same string is used in multiple contexts, where variables might be named differently.


Definitely there will be an option to supply a binding in the same way as for gettext. I’m still in experimentation mode. For standard string interpolation the binding is already implicit though. So there is a behaviour difference between normal strings and gettext strings.

Yeah, but the important difference is that for normal string interpolation the scope of the variable name in the string is the scope of the variable itself. For translations the scope of variable names extends to being useful names for translators as well. E.g. I just read the Medium post on Slack implementing translation using ICU messages, and they have a rule to always suffix _possessive to their variables if the variable should be in possessive form.

Another difference is that string interpolation understands Elixir code, e.g. "#{helper.name} helped #{receiver.name}", which I’m not sure is possible in ICU messages, where I guess you need a plain variable without nesting.

I don’t think the first form below is preferable to the second:

helper_name = helper.name
receiver_name = receiver.name
…
f("{helper_name} helped {receiver_name}")

f("{helper_name} helped {receiver_name}", helper_name: helper.name, receiver_name: receiver.name)
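To illustrate the explicit-binding style, here is a minimal sketch of gettext-like substitution from a keyword list. It only handles plain `{name}` placeholders, not the full ICU syntax, and the module name is hypothetical; it is not the planned f/2 macro.

```elixir
defmodule BindingSketch do
  # Hypothetical helper, for illustration only.
  # Replaces each {name} placeholder with the value bound to `name`.
  # Raises KeyError if a placeholder has no binding.
  def format(message, bindings) do
    by_name = Map.new(bindings, fn {key, value} -> {Atom.to_string(key), to_string(value)} end)

    Regex.replace(~r/\{(\w+)\}/, message, fn _whole, name ->
      Map.fetch!(by_name, name)
    end)
  end
end

result =
  BindingSketch.format("{helper_name} helped {receiver_name}",
    helper_name: "Ann",
    receiver_name: "Bob"
  )

IO.puts(result)
```

The placeholder names here belong to the message and its translators; the calling code decides separately which values to bind to them.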