Proposal: ICU/Unicode calendar/datetime formatting

josevalim · December 3, 2018, 10:18am

NOTE: this is a focused thread, so we would appreciate it if everybody stayed on topic. Feel free to comment on anything regarding calendar formatting, but avoid off-topic or loosely related topics. For example, if you would like to discuss or propose other Calendar/DateTime features, please use a separate thread.

Hi everyone,

This is a proposal for calendar/datetime formatting in Elixir. The formatting will use the markup specified by Unicode’s Locale Data Markup Language (LDML).

Here is what the API will look like:

Calendar.format(date_or_time_or_datetime, "yyyy-MM-dd HH:mm:ss.SSSSSS")
#=> {:ok, "2018-11-29 13:19:41.032412"}

The Calendar.format/2 entry point accepts any calendar type, using structural typing. This means we will be able to format any map that has the fields being formatted.
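To illustrate what structural typing means here, the following is a minimal sketch, not the proposed implementation. The module StructuralFormatSketch and its iso_date/1 function are hypothetical names; the point is only that the function matches on the fields it needs rather than on a struct name, so a Date, a NaiveDateTime or a plain map all work.

```elixir
defmodule StructuralFormatSketch do
  # Hypothetical helper, not part of the proposal: renders "yyyy-MM-dd"
  # for any map or struct that exposes :year, :month and :day.
  # Assumes a non-negative year.
  def iso_date(%{year: year, month: month, day: day}) do
    {:ok, "#{pad(year, 4)}-#{pad(month, 2)}-#{pad(day, 2)}"}
  end

  # Anything without the required fields is rejected.
  def iso_date(_other), do: {:error, :missing_fields}

  defp pad(integer, width) do
    integer |> Integer.to_string() |> String.pad_leading(width, "0")
  end
end

StructuralFormatSketch.iso_date(~D[2018-11-29])
StructuralFormatSketch.iso_date(%{year: 2018, month: 11, day: 29})
StructuralFormatSketch.iso_date(~T[13:19:41])
```

The first two calls succeed with the same result, while the Time value is rejected because it has no date fields.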

We will also support a third argument, which is a formatter module, that can be used for translation of month names, eras, etc:

Calendar.format(date_or_time_or_datetime, "yyyy-MM-dd HH:mm:ss.SSSSSS", MyApp.PTBR)

You would also be able to proxy all of this to Gettext or CLDR if you want to:

Calendar.format(date_or_time_or_datetime, "yyyy-MM-dd HH:mm:ss.SSSSSS", Gettext.Calendar)

In other words, the third argument supports translation but not localization. For example, CLDR specifies the short and long formats for dates and times that each region/locale uses but we won’t support those as we believe those can be trivially built on top of Elixir:

defmodule CLDR do
  def long_date(date_or_time_or_datetime) do
    Calendar.format(date_or_time_or_datetime, CLDR.Locale.long_date, CLDR.Locale)
  end
end
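For reference, a formatter module such as MyApp.PTBR could be as small as the sketch below. The function names are taken from the Source column of the format table in this post; the argument types (an integer month and an integer hour) and the Portuguese strings are assumptions made for illustration, not CLDR data.

```elixir
defmodule MyApp.PTBR do
  # Assumption: callbacks receive plain integers (month 1..12, hour 0..23).
  # The month names are illustrative and were not copied from CLDR.
  @wide_months ~w(janeiro fevereiro março abril maio junho julho
                  agosto setembro outubro novembro dezembro)

  def wide_month(month) when month in 1..12 do
    Enum.at(@wide_months, month - 1)
  end

  # Illustrative abbreviation: the first three graphemes of the wide name.
  def abbreviated_month(month) when month in 1..12 do
    month |> wide_month() |> String.slice(0, 3)
  end

  def am_pm(hour) when hour in 0..11, do: "AM"
  def am_pm(hour) when hour in 12..23, do: "PM"
end

MyApp.PTBR.wide_month(11)
MyApp.PTBR.abbreviated_month(11)
```

Note that such a module only translates names. Choosing which pattern a region uses remains the job of a layer on top, as in the CLDR example above.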

Note: this proposal was written by José Valim and Michał Muskała.

The ICU syntax

As described in the ICU page:

A date pattern is a string of characters, where specific strings of characters are replaced with date and time data from a calendar when formatting.

The Date Field Symbol Table below contains the characters used in patterns to show the appropriate formats for a given locale, such as yyyy for the year. Characters may be used multiple times. For example, if y is used for the year, ‘yy’ might produce ‘99’, whereas ‘yyyy’ produces ‘1999’. For most numerical fields, the number of characters specifies the field width. For example, if h is the hour, ‘h’ might produce ‘5’, but ‘hh’ produces ‘05’. For some characters, the count specifies whether an abbreviated or full form should be used, but may have other choices.

Two single quotes represent a literal single quote, either inside or outside single quotes. Text within single quotes is not interpreted in any way (except for two adjacent single quotes). Otherwise all ASCII letters from a to z and A to Z are reserved as syntax characters, and require quoting if they are to represent literal characters.

“Stand Alone” values refer to those designed to stand on their own, as opposed to being with other formatted values. “2nd quarter” would use the stand alone format (qqqq), whereas “2nd quarter 2007” would use the regular format (QQQQ yyyy).
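The quoting rules quoted above can be sketched as a small tokenizer. This is only an illustration under the assumption that a pattern is split into field tokens (a letter plus its repeat count) and literal tokens; PatternQuotingSketch and the token shapes are hypothetical and not part of the proposal.

```elixir
defmodule PatternQuotingSketch do
  # Splits a pattern into {:field, letter, count} and {:literal, text} tokens.
  def tokenize(pattern), do: pattern |> outside([]) |> Enum.reverse()

  # Outside quotes: '' is a literal quote, ' opens a quoted run,
  # ASCII letters are field symbols, everything else is literal text.
  defp outside("", acc), do: acc
  defp outside("''" <> rest, acc), do: outside(rest, literal(acc, "'"))
  defp outside("'" <> rest, acc), do: inside(rest, acc)

  defp outside(<<c, rest::binary>>, acc) when c in ?a..?z or c in ?A..?Z do
    case acc do
      # Same letter as the previous field: bump its repeat count.
      [{:field, ^c, n} | tail] -> outside(rest, [{:field, c, n + 1} | tail])
      _ -> outside(rest, [{:field, c, 1} | acc])
    end
  end

  defp outside(<<c::utf8, rest::binary>>, acc) do
    outside(rest, literal(acc, <<c::utf8>>))
  end

  # Inside quotes: only '' (literal quote) and ' (closing quote) are special.
  defp inside("''" <> rest, acc), do: inside(rest, literal(acc, "'"))
  defp inside("'" <> rest, acc), do: outside(rest, acc)
  defp inside(<<c::utf8, rest::binary>>, acc), do: inside(rest, literal(acc, <<c::utf8>>))
  defp inside("", _acc), do: raise(ArgumentError, "unterminated quote in pattern")

  # Appends text to the current literal token or starts a new one.
  defp literal([{:literal, text} | tail], more), do: [{:literal, text <> more} | tail]
  defp literal(acc, more), do: [{:literal, more} | acc]
end

PatternQuotingSketch.tokenize("HH 'o''clock'")
#=> [{:field, ?H, 2}, {:literal, " o'clock"}]
```

Keeping the repeat count on the field token is what later lets a formatter decide between padding widths or between abbreviated and wide names.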

The complete specification can be found here. Elixir will implement a subset of those formats, outlined below.

Format	Description	Examples	Source
G	abbreviated_era	AD; BC	Calendar.year_of_era/1 + Formatter.abbreviated_era/1
GG	wide_era	Anno Domini; Before Christ	Calendar.year_of_era/1 + Formatter.wide_era/1
GGG	narrow_era	A; B	Calendar.year_of_era/1 + Formatter.narrow_era/1
u+	year	2004	struct.year
yy	two_digits_year_of_era	4, 14, 14, 14	Calendar.year_of_era/1
y+	year_of_era	4, 14, 214, 2014	Calendar.year_of_era/1
D+	day_of_year	189	Calendar.day_of_year/3
M, MM	month	1, 01	struct.month
MMM	abbreviated_month	Nov	struct.month + Formatter.abbreviated_month/1
MMMM	wide_month	November	struct.month + Formatter.wide_month/1
MMMMM	narrow_month	N	struct.month + Formatter.narrow_month/1
d+	day	1, 14, 31	struct.day
Q, QQ	quarter	2, 02	Calendar.quarter_of_year/3
QQQ	abbreviated_quarter	Q2	Calendar.quarter_of_year/3 + Formatter.abbreviated_quarter/1
QQQQ	wide_quarter	2nd Quarter	Calendar.quarter_of_year/3 + Formatter.wide_quarter/1
QQQQQ	narrow_quarter	2	Calendar.quarter_of_year/3 + Formatter.narrow_quarter/1
YY	two_digits_week_based_year	4, 14, 14, 14	Calendar.week_in_year/3
Y+	week_based_year	4, 14, 214, 2014	Calendar.week_in_year/3
w+	week_in_year	1, 9, 13, 42	Calendar.week_in_year/3
W+	week_in_month	1, 3, 5	Calendar.week_in_month/3
E	abbreviated_day_of_week	Tue	Calendar.day_of_week/3 + Formatter.abbreviated_day_of_week/1
EE	wide_day_of_week	Tuesday	Calendar.day_of_week/3 + Formatter.wide_day_of_week/1
EEE	narrow_day_of_week	T	Calendar.day_of_week/3 + Formatter.narrow_day_of_week/1
H+	hour (0-23)	1, 01, 23	struct.hour
h+	am_pm_hour (1-12)	1, 01, 11	struct.hour
a	am_pm	AM, PM	struct.hour + Formatter.am_pm/1
m+	minute	1, 11, 59	struct.minute
s+	second	1, 11, 59	struct.second
S+	fraction_of_second	1, 001, 123456	struct.microsecond
VV	time_zone	America/Sao_Paulo	struct.time_zone
zz	zone_abbr	BRT	struct.zone_abbr
x	zone_offset_basic_optional	-08, +0530, +00	struct.std_offset + struct.utc_offset
xx	zone_offset_basic	-0800, +0530, +0000	struct.std_offset + struct.utc_offset
xxx	zone_offset_extended	-08:00, +05:30, +00:00	struct.std_offset + struct.utc_offset
X	zone_offset_basic_optional_with_z	-08, +0530, Z	struct.std_offset + struct.utc_offset
XX	zone_offset_basic_with_z	-0800, +0530, Z	struct.std_offset + struct.utc_offset
XXX	zone_offset_extended_with_z	-08:00, +05:30, Z	struct.std_offset + struct.utc_offset

Whenever there is a plus sign at the end, it means the number of repetitions specifies the minimum number of digits. The exceptions are yy and YY, which are always shown as two digits regardless of how many digits the year has, and S, which truncates.
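The three width rules above (minimum digits, two-digit years, truncated fractions) can be sketched as follows. FieldWidthSketch and its function names are hypothetical. The sketch also assumes that yy zero-pads, as in LDML, so year 4 renders as 04, and that S accepts between one and six digits.

```elixir
defmodule FieldWidthSketch do
  # Fields marked with "+": the repeat count is the minimum number of digits.
  def min_digits(value, count) when value >= 0 do
    value |> Integer.to_string() |> String.pad_leading(count, "0")
  end

  # "yy" / "YY": always the last two digits of the year.
  # Assumption: zero-padded, so year 4 becomes "04".
  def two_digit_year(year) when year >= 0 do
    year |> rem(100) |> min_digits(2)
  end

  # "S+": keeps the first `count` fractional digits and truncates the rest.
  def fraction(microsecond, count) when microsecond in 0..999_999 and count in 1..6 do
    microsecond |> min_digits(6) |> binary_part(0, count)
  end
end

FieldWidthSketch.min_digits(5, 2)       #=> "05"
FieldWidthSketch.two_digit_year(2014)   #=> "14"
FieldWidthSketch.fraction(32_412, 3)    #=> "032"
```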

Besides all entries above, we also support the following stand-alone formats: L for months (same as M) and q for quarters (same as Q).

The source column is used as a reference for the implementation and it won’t be present in the final documentation.
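As an example of how a Source entry maps to output, the zone offset rows list struct.std_offset + struct.utc_offset. A possible derivation of the x and X variants from those two fields is sketched below. ZoneOffsetSketch is a hypothetical module, the style is passed as the pattern string itself, offsets are assumed to be in seconds as in DateTime, and any sub-minute part is dropped.

```elixir
defmodule ZoneOffsetSketch do
  # `style` is one of "x", "xx", "xxx", "X", "XX", "XXX".
  def format(%{utc_offset: utc_offset, std_offset: std_offset}, style) do
    render(utc_offset + std_offset, style)
  end

  # The upper-case variants print "Z" for a zero offset.
  defp render(0, style) when style in ["X", "XX", "XXX"], do: "Z"

  defp render(total_seconds, style) do
    sign = if total_seconds < 0, do: "-", else: "+"
    total_minutes = total_seconds |> abs() |> div(60)
    hours = pad(div(total_minutes, 60))
    minutes = rem(total_minutes, 60)

    case {style, minutes} do
      # One letter: minutes are optional and omitted when zero.
      {s, 0} when s in ["x", "X"] -> sign <> hours
      {s, _} when s in ["x", "X", "xx", "XX"] -> sign <> hours <> pad(minutes)
      # Three letters: extended form with a colon.
      {s, _} when s in ["xxx", "XXX"] -> sign <> hours <> ":" <> pad(minutes)
    end
  end

  defp pad(n), do: n |> Integer.to_string() |> String.pad_leading(2, "0")
end

ZoneOffsetSketch.format(%{utc_offset: 19_800, std_offset: 0}, "xxx")
#=> "+05:30"
```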

Rationale

Last but not least, it is worth discussing the rationale for date/time formatting. If you have an application that works with calendar types, it is likely that you have to format them at some point. If your application mostly interfaces with other systems, then there is a chance the built-in ISO format is enough, but not always. For example, some HTTP headers use a different format than the recommended ISO one. Therefore adding formatting to the standard library feels like a natural next step to the existing functionality.

Of course, there are some downsides to adding this functionality. First of all, there are many syntaxes for date/time formatting, and they all feel unnatural to some extent. While we could easily argue that the ICU/Unicode syntax is the one that makes the most sense for Elixir, as Elixir follows the Unicode standard on many other occasions, it is hard to argue it is the best approach generally (or even that there is such a thing as best). One approach, which is orthogonal to the one above, is to allow the format to also be given as a list of atoms instead of cryptic single-letter definitions.

Another discussion, which may or may not impact this one, is about parsing. The parsing specification is often the same as the formatting specification, but we have explicitly decided not to support parsing in Elixir. First of all, it is really hard to support a general but efficient runtime date/time parsing strategy. If you expect certain formats, it is almost always better to define functions that parse specifically those formats. Things get trickier if we consider the fact that we need to support internationalization, which is trivial for formatting, but quite expensive for parsing. In other words, while we can provide a general and efficient implementation for formatting, we can’t do so for parsing. Since different trade-offs can be made here, ranging from performance to flexibility, we are not comfortable picking one or another.
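To make the "parse specifically those formats" point concrete, here is a minimal sketch of a parser that accepts exactly one layout, "yyyy-MM-dd HH:mm:ss", using binary pattern matching. SpecificParserSketch is a hypothetical module written for this thread; it is deliberately narrow and only loosely validates the digit groups before handing them to NaiveDateTime.new/6.

```elixir
defmodule SpecificParserSketch do
  # Accepts exactly "yyyy-MM-dd HH:mm:ss" (19 bytes) and nothing else.
  def parse(<<y::binary-size(4), "-", m::binary-size(2), "-", d::binary-size(2), " ",
              h::binary-size(2), ":", mi::binary-size(2), ":", s::binary-size(2)>>) do
    with {:ok, [y, m, d, h, mi, s]} <- to_integers([y, m, d, h, mi, s]) do
      # NaiveDateTime.new/6 rejects impossible dates and times.
      NaiveDateTime.new(y, m, d, h, mi, s)
    end
  end

  def parse(_other), do: {:error, :invalid_format}

  # Converts every part to an integer or fails on the first bad one.
  defp to_integers(parts) do
    parts
    |> Enum.reduce_while([], fn part, acc ->
      case Integer.parse(part) do
        {int, ""} -> {:cont, [int | acc]}
        _ -> {:halt, :error}
      end
    end)
    |> case do
      :error -> {:error, :invalid_format}
      ints -> {:ok, Enum.reverse(ints)}
    end
  end
end

SpecificParserSketch.parse("2018-11-29 13:19:41")
#=> {:ok, ~N[2018-11-29 13:19:41]}
```

Because the layout is fixed at compile time, there is no pattern to interpret at runtime, which is the trade-off a general parser cannot make.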

Roadmap

We don’t plan to add this functionality directly to Elixir. Instead we will develop it as a library and collect feedback. The complexity of the implementation will also dictate if this will become part of core or not, but we believe the implementation will be relatively simple.

The first step is validation of this proposal and then a library will be developed as part of github.com/elixir-lang for further validation and feedback.

Feedback

Your turn.

Qqwy · December 3, 2018, 12:11pm

Interesting!
I can see nothing wrong with the proposal itself; it seems very well-written, clear and well-defined in scope. However, I do have one question to ask as “devil’s advocate”:

There exist widely-used libraries such as Timex that have a very complete implementation of formatting (as well as parsing). (And I presume the proposal is based on this).

So why is this something Elixir (specifically speaking, the ‘elixir-lang’ team) should concern itself with? What would the advantage of an official library be in this case, and would it be strong enough to warrant the creation and maintenance of this library?

dimitarvp · December 3, 2018, 12:13pm

I would guess marketing. There are a lot of people out there who judge languages based on how much official support they have in a number of areas. Gradually pulling and adopting more and more battle-hardened libraries into the core of Elixir – or officially endorsed 3rd party libraries – would send a message that the language and its ecosystem are ready for mainstream usage and widespread adoption.

I don’t speak for the Elixir core team though, just guessing.

wmnnd · December 3, 2018, 1:35pm

I have a question about the G and GG formats:
The Unicode specification mentions the more neutral names CE, BCE / Common Era and Before Common Era as “variants” for these fields and alternatives to AD et al.

Will there be a way to determine which of these two variants should be used? I think there are many contexts in which the religious connotation of AD and BC is undesirable and CE and BCE would be more appropriate.
Whether you make this configurable or not, I think the secular versions should be the default.

fertapric · December 3, 2018, 1:47pm

Will the library allow configuring a default formatter at the application level (similar to what was done for the default time zone database)?

josevalim · December 3, 2018, 5:25pm

Those are very good questions. We generally want Elixir core to be lean. We want the community to rely on third party packages. In fact, I have talked with @bitwalker about breaking Timex apart (for example, parsing/formatting could be its own lib).

The only possible rationale for including this in Elixir is that it belongs to the feature set already part of the language and that it is small enough to warrant its inclusion, but I agree it is a blurry line.

To avoid any debate, I will just follow whatever Unicode and/or CLDR recommends as default. You will be able to customize by passing a third argument to format.

I haven’t thought about this yet and to be honest, I don’t think so. A timezone database is meant to provide the same results. What would change between them is functionality such as live updates, performance, etc. So while it is global, it is not meant to cause a difference in behaviour.

Setting a different formatter will change behaviour. So for now I would keep it explicitly opt-in.

wmnnd · December 3, 2018, 6:41pm

This is what the CLDR page on the issue says:

There are only two values for an era in a Gregorian calendar. The common presentation of these era names in English are the more religious forms “BC” (Before Christ) and “AD (Anno Domini)” - from the Latin for “The year of our Lord”. The secular equivalents of these two era names are “BCE” (Before Common Era) and “CE” (Common Era).

As of CLDR 24 it is now possible for a locale to supply both forms, if both are used in the locale. You will need to consider whether the religious (BC/AD) form or the secular (BCE/CE) form is more commonly used in your language, and make the most common form the default form (code 0, 1). The alternate form, if used, can be provided under the entries for codes 0-variant, 1-variant. If your locale does not commonly use an alternate form, do not provide any entries for these.
– http://cldr.unicode.org/translation/date-time-names

Unfortunately it’s not quite so simple and I don’t think the issue can be ignored.

josevalim · December 3, 2018, 7:09pm

Thanks for the reference! If we need to support variants, then maybe it is better to skip G, GG and GGG for now and see how everything will align with existing libraries, such as @kip’s CLDR, before making a decision on how variants will work.

wmnnd · December 3, 2018, 7:24pm

Sounds good! By the way, the ISO 8601 norm has years BCE prefixed with a minus sign. Year 1 BCE is year 0 in ISO 8601 and the year before that is -1.

Maybe using this system would be an alternative?
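For clarity, the mapping between ISO 8601 year numbers and era-based years described above can be sketched like this. EraSketch and the :ce / :bce atoms are hypothetical names used only for illustration.

```elixir
defmodule EraSketch do
  # ISO 8601 years 1 and up are the same number in the Common Era.
  def year_of_era(year) when is_integer(year) and year >= 1, do: {year, :ce}

  # ISO year 0 is 1 BCE, ISO year -1 is 2 BCE, and so on.
  def year_of_era(year) when is_integer(year), do: {1 - year, :bce}
end

EraSketch.year_of_era(2018)  #=> {2018, :ce}
EraSketch.year_of_era(0)     #=> {1, :bce}
EraSketch.year_of_era(-1)    #=> {2, :bce}
```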

kip · December 3, 2018, 8:41pm

Good proposal, although I would like to see Elixir embrace the principle of locales, even though explicitly Elixir lang supports only the en-US locale. An optional 4th param for a locale would be wonderful (in my opinion of course). And the default formatter would support only en-US.

I observe that the codes: E+, M+, c+ and other textual codes aren’t supported in the proposal, presumably because they are locale-specific and therefore outside the scope of this proposal. If locales were embraced they could easily be supported without undue burden.

Does that mean the full set of ICU codes is outside the scope of the proposal, or only outside the scope of the default formatter? That is, does the proposal permit the full set of codes if the specified formatter supports them?

I think leaving out the G+ codes for now makes sense for the default formatter.

michalmuskala · December 3, 2018, 9:34pm

I believe that you can easily build something that has the notion of locales at the forefront based on the abstractions proposed here. I could imagine a package that exposes a function Localize.format_date(date, :long | :short | :narrow, locale) that would call Calendar.format/3 underneath. In that way, I consider Calendar.format/3 to be a low-level primitive and a building block, and something actually more suitable for inclusion in the core language.

The proposal is based on Unicode Locale Data Markup Language (LDML) Part 4: Dates. I’m also not 100% sure what you mean by E+, M+ and c+ in this context.

Cochonours · December 3, 2018, 9:41pm

I know it can be a sensitive subject for some people, but the CE/BCE naming is not used much in American English in general, and is almost nonexistent elsewhere. Its equivalent is never used in French, for example, and when French people look up the English translation of “après JC”, Google won’t even mention this alternative. Thus, even if Americans could understand both alternatives easily, I think introducing such a convention is a bad idea.

kip · December 3, 2018, 9:50pm

The full table of supported codes for ICU is in the reference you supplied, more specifically at https://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table

E, EE and EEE are week day formats like “Tuesday, Tue, T”
MMM, MMMM are month names like “September, Sept”

There are others that are primarily textual and therefore locale specific like a, b, B that I recognise are outside of the scope of this proposal.

I’m just trying to work out if they remain valid formatting codes for this proposal, but not supported by the default formatter. Or if they are considered invalid codes for this proposal.

On the locales front - yes, your API strategy makes a lot more sense. The core of what I was trying to express, badly, is that even when a package or a language doesn’t support multiple locales, it is implicitly supporting one, which for Elixir is en-US. Today, each package that intends to help support multiple locales, like Gettext, Cldr or Trans, has its own notion of the current locale and of what constitutes a supported locale or locale name format. A unifying strategy for current locale and locale name would be helpful. But outside the scope of this proposal of course.

josevalim · December 4, 2018, 5:19pm

Thanks everyone for the feedback!

I am bringing the proposal down for now because the discussion has made it clear that, if we want to support this, then we should either go ahead and fully support ICU (with all formats, variants and whatnot) or not support it at all.

After all, if we only support part of it, then the community will have to implement everything again, like in @kip’s great projects, or we will have to define many interfaces so libraries are able to continue the work we started. None of those options feel great and then I would rather make sure that the Calendar functionality provides everything @kip needs in his libs.

So here are the next steps: we will look into simpler formatting syntaxes, possibly strftime, and see if we can write a proposal with reduced scope and fewer pitfalls that provides a minimum set of functionality without overlapping with more complex use cases. In case we can’t find anything that fits those constraints, we will give up on having formatting in core for now.