Collation again - Enum.sort and Enum.sort_by

Sorry to be such a bore, but I feel something is slightly wrong with how Enum.sort and Enum.sort_by work. It would be acceptable if Elixir used only ASCII, but since strings are encoded in UTF-8, alphabetical sorting order should reflect that. As it stands, all characters with diacritics end up after the ASCII characters, which is not acceptable in any language, not even in English. The obvious solution would be to use locales, but I guess that is not going to happen in Elixir any time soon. Another solution that comes to mind is a library with new versions of Enum.sort and Enum.sort_by that use locales, or even a very simplified solution for Latin scripts only that disregards diacritics, which is the correct way to sort alphabetically in English. What do you think? Maybe there are simple workarounds that I don’t know about?

Enum.sort and friends don’t sort strings as text; for efficiency reasons they compare them as binaries, byte by byte.
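
To illustrate, here is a minimal sketch with a made-up word list. It shows why an accented word ends up after every ASCII word:

```elixir
# Made-up word list, purely for illustration.
words = ["zebra", "éclair", "apple"]

# Enum.sort/1 uses Erlang term ordering, which compares binaries byte by byte.
sorted = Enum.sort(words)

# A precomposed "é" is encoded in UTF-8 as the bytes 0xC3 0xA9, both greater
# than "z" (0x7A), so "éclair" lands after "zebra".
IO.inspect(sorted)
# prints ["apple", "zebra", "éclair"]
```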

If you really need an ordering that respects collations you will need to roll your own or find a library at hex.pm.

Also, a string isn’t aware of its collation, so the sort function would need an extra parameter for it. That parameter, in turn, would make no sense when sorting a list of numbers.
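
If you do roll your own, a very simplified version for Latin scripts could look like the sketch below. The module and function names (FoldSort, key/1, sort/1) are hypothetical, and this is not real collation: it only strips combining marks and ignores case, so it will be wrong for languages where an accented letter is a separate letter of the alphabet.

```elixir
defmodule FoldSort do
  # Hypothetical helper: builds a sort key by decomposing each character (NFD),
  # dropping the combining marks (Unicode category Mn) and lowercasing the rest.
  def key(string) do
    string
    |> String.normalize(:nfd)
    |> String.replace(~r/\p{Mn}/u, "")
    |> String.downcase()
  end

  # Sorts by the folded key; the original strings are returned unchanged.
  def sort(strings), do: Enum.sort_by(strings, &key/1)
end

IO.inspect(FoldSort.sort(["zebra", "éclair", "apple", "Émile", "eel"]))
# prints ["apple", "éclair", "eel", "Émile", "zebra"]
```

Enum.sort_by/2 computes the key once per element, so the normalization cost is not paid again on every comparison.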

Great, thanks! Now I see it’s by design. The reference says that Enum.sort/1:

Sorts the enumerable according to Erlang’s term ordering.

Is it possible to influence Erlang’s term ordering then? I found only one library that seems to deal with collation in Erlang: barrel-db/erlang-ucol (“ICU based collation Erlang module”) on GitHub. Is it possible to use Erlang libraries in Elixir?

Yes, you can include it as a regular Hex dependency (https://hex.pm/packages/ucol).
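
As a sketch, the dependency entry would go in the deps function of mix.exs. The version requirement below is only a placeholder (an assumption on my part), so check the Hex page for the current release:

```elixir
# Fragment of mix.exs. ">= 0.0.0" is a placeholder, not a recommended
# requirement; see https://hex.pm/packages/ucol for the actual versions.
defp deps do
  [
    {:ucol, ">= 0.0.0"}
  ]
end
```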

Elixir usage (adapted from the README) would be:

iex> :ucol.compare("foo", "bar")
1
iex> :ucol.compare("foo", "foo")
0
iex> :ucol.compare("A", "aai")
-1

…so you should be able to use it with Enum.sort/2 like so, for an ascending sort:

iex> Enum.sort(["foo", "bar"], fn (x, y) -> :ucol.compare(x, y) != 1 end)
["bar", "foo"]
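
The same wrapping works for any comparator that returns -1, 0 or 1. Here is a minimal sketch; the names ThreeWaySort and compare_length/2 are hypothetical, and it uses a pure-Elixir stand-in comparator so that it runs without the NIF. With the library installed you would pass &:ucol.compare/2 instead:

```elixir
defmodule ThreeWaySort do
  # Hypothetical helper: adapts a comparator returning -1, 0 or 1 to the
  # boolean function Enum.sort/2 expects. Returning true for 0 (equal)
  # keeps the sort stable.
  def sort(list, compare) when is_function(compare, 2) do
    Enum.sort(list, fn x, y -> compare.(x, y) != 1 end)
  end

  # Same idea for Enum.sort_by/3: the comparator is applied to the mapped keys.
  def sort_by(list, key_fun, compare) when is_function(compare, 2) do
    Enum.sort_by(list, key_fun, fn x, y -> compare.(x, y) != 1 end)
  end

  # Stand-in comparator for illustration only: orders strings by length.
  def compare_length(a, b) do
    la = String.length(a)
    lb = String.length(b)

    cond do
      la < lb -> -1
      la > lb -> 1
      true -> 0
    end
  end
end

IO.inspect(ThreeWaySort.sort(["ccc", "a", "dd", "bb"], &ThreeWaySort.compare_length/2))
# prints ["a", "dd", "bb", "ccc"] ("dd" stays ahead of "bb": equal elements keep their order)
```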

Note that I needed to install the ICU dev files on Ubuntu so that the NIF extension would compile:

sudo apt-get install libicu-dev

EDIT: Updated Enum.sort/2 usage example to return true for equal elements, so that we get stable sorting…

Interestingly, the current version of that library doesn’t seem to allow selecting a particular locale; it does…

…which, according to the ICU API reference docs for ucol_open(), causes it to use the “root collator”:

Special values for locales can be passed in - if NULL is passed for the locale, the default locale collation rules will be used. If empty string ("") or "root" are passed, the root collator will be returned.
