Sort strings without taking into account the accents

thojanssens1 · March 12, 2020, 1:08pm

I want to sort a list of countries. Unfortunately, the sort function will set “Åland Islands” as last element, because it starts with an accent. I would like to sort without taking into account the accents, so that “Åland Islands” will appear after “Afghanistan”:

Enum.sort(["Thailand", "Åland Islands", "Afghanistan"])

Anyone has a solution for this?

thojanssens1 · March 12, 2020, 1:15pm

I found the solution. I have to apply the following on the country names when sorting:

:unicode.characters_to_nfd_binary(country_name)  |> String.replace(~r/\W/u, "")

hauleth · March 12, 2020, 1:38pm

This is not a correct solution. This is some solution, but for sure not correct one. The best (and correct) solution is to use proper collation, for example via Cldr.Collation.

NobbZ · March 12, 2020, 1:41pm

This is not as easy as it sounds.

Lets for example take a look at the German language.

We have Ä, Ö, Ü and ẞ, and all of them also have a lower case variant.

Usually the umlauts Ä, Ö, Ü are treated as their regular counterparts. This is usually done in simple lists that list “existence” like inventories.

Though lists which are used to “look up”, are usually sorted as if Ä were AE, Ö OE, and Ü UE.

ẞ is always treated like SS.

This is called “collation” and mostly dealt with through I18n/L21n libraries.

kip · March 12, 2020, 7:47pm

As @hauleth notes, ex_cldr_collation applies the UCA which, in the basic form implemented in this lib, does a reasonable job across many languages. For your example (on Elixir 1.10):

iex> Enum.sort(["Thailand", "Åland Islands", "Afghanistan"], Cldr.Collation)
["Afghanistan", "Åland Islands", "Thailand"]

It is NIF-based though. Also credit where due - its built from erlang-ucol. I am very very slowly extending it to support locale-specific collations, not just the DUCET.

thojanssens1 · March 13, 2020, 3:00am

So a collation is a (standardized) set of rules? And what is the link then between the collation and the encoding? Because I remember in the MySQL database for example, the name of the collation included the name of the encoding (e.g. utf8_general_ci, latin1_general_ci, latin1_swedish_ci, etc.). But in your example, that Ä is sorted as AE, how does the encoding here matters?

kip · March 13, 2020, 3:12am

Yes, collation is a set of rules for ordering characters. In fact with UCA the Default Unicode Collation Element Table (DUCET) is expressed as a set of rules. Ordering is never about the numerical order of codepoints in UCA.

Encoding is, to my understanding, orthogonal to collation since ordering is applied to characters. The algorithms in UCA are related to codepoint(s) that form characters (either NFC or NFD).

The implementation in ex_cldr_collation only applies the DUCET with the optional flag for case sensitivity. The UCA itself also provides for ordering ignoring modifiers (accents for example) and other options like handling expansions and contractions (think ß for example which has its own sort order but in upper case may be treated as SS). Its very clever stuff - which one day I will implement in a native Elixir version.

Lastly, there are rules that varying ordering based upon language and locale preferences. These are what utf8_swedish_ci implements, for example.

I think in the MySQL case, the encoding is mentioned because under the covers it will need to know what transforms to apply before ordering. In Elixir the assumption is UTF8 - and thats also the assumption in ex_cldr_collation. Note that PostgresQL (and probably also MySQL) use libicu to implement collation.

An extract of what the rules looks like is this:

0020 ; [*0209.0020.0002] # SPACE
02DA ; [*0209.002B.0002] # RING ABOVE
0041 ; [.06D9.0020.0008] # LATIN CAPITAL LETTER A
3373 ; [.06D9.0020.0017] [.08C0.0020.0017] # SQUARE AU
00C5 ; [.06D9.002B.0008] # LATIN CAPITAL LETTER A WITH RING ABOVE
212B ; [.06D9.002B.0008] # ANGSTROM SIGN
0042 ; [.06EE.0020.0008] # LATIN CAPITAL LETTER B
0043 ; [.0706.0020.0008] # LATIN CAPITAL LETTER C
0106 ; [.0706.0022.0008] # LATIN CAPITAL LETTER C WITH ACUTE
0044 ; [.0712.0020.0008] # LATIN CAPITAL LETTER D

And a ruleset (like the DUCET) can be tailored into another collation by applying some modifier rules like:

Syntax	Description
& y < x	Make x primary-greater than y
& y << x	Make x secondary-greater than y
& y <<< x	Make x tertiary-greater than y
& y = x	Make x equal to y

where x and y are two code points (or glyphs) that are ordered differently than the base data.

I suspect thats a lot more than you wanted to know

NobbZ · March 13, 2020, 5:56am

This was mainly an performance optimisation.

If the collation was encoding unaware, then the input had to get encoded into a normalised encoding (probably one of the UTF variants) by another step. This would have caused additional space and time costs.

Therefore collations were made encoding aware.

thojanssens1 · March 13, 2020, 6:32am

I see. Actually it would have been better to have two fields then. One for collation, and another one for the encoding of input. But instead, they (MySQL and maybe others) provide only one confusing field.

Btw thanks @kip for all the additional details provided, awesome

NobbZ · March 13, 2020, 8:15am

Nothing in the database actually needs to know the encoding, except for the collation, for every other aspects of the database a string or a text is nothing but a binary blob with nicer syntax in the REPL.