String compare depends on upper/lower case?

I am wondering why I get this behavior, which seems counterintuitive to me:

iex(1)> "x" > "a"
true
iex(2)> "X" > "a"
false

When I sort a list of strings, I don’t expect this result:

iex(1)> Enum.sort(["b", "a", "X"])
["X", "a", "b"]

I couldn’t find where the documentation explains this behavior. If somebody knows, I’m interested.
Thanks!

Bitstrings are compared byte by byte; incomplete bytes are compared bit by bit.

Uppercase letters use smaller byte values than lowercase ones.

https://hexdocs.pm/elixir/operators.html#term-ordering
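To make the byte-by-byte comparison concrete, here is a small sketch that prints the raw bytes of the strings from the question. The variable names are only for illustration.

```elixir
# Elixir strings are UTF-8 binaries, so we can look at their raw bytes.
x_upper = :binary.bin_to_list("X")
a_lower = :binary.bin_to_list("a")
x_lower = :binary.bin_to_list("x")

# Print the byte lists as integers instead of as charlists.
IO.inspect({x_upper, a_lower, x_lower}, charlists: :as_lists)
# prints {[88], [97], [120]}

# 88 < 97, so "X" sorts before "a"; 120 > 97, so "x" sorts after "a".
x_before_a = "X" < "a"
x_after_a = "x" > "a"
IO.inspect({x_before_a, x_after_a})
# prints {true, true}
```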

Thanks!

How do you easily sort by alphabetical order then?

Just off the cuff… maybe something like this.

iex(4)> Enum.sort(["b", "X", "a"], &(String.downcase(&1) <= String.downcase(&2)))
["a", "b", "X"]

There could be some string handling caveats that I’m not thinking about, but it’s the general idea.
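The same idea can also be sketched with Enum.sort_by/2, which computes the lowercase key once per element instead of on every comparison. This is only an alternative spelling of the approach above, and it still compares bytes after downcasing, so it is not locale-aware.

```elixir
words = ["b", "X", "a"]

# Sort by a lowercase key; the original strings are returned unchanged.
sorted = Enum.sort_by(words, &String.downcase/1)

IO.inspect(sorted)
# prints ["a", "b", "X"]
```

Strings that differ only in case have equal keys, so they stay in their original relative order.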

JavaScript behaves the same way:

["a", "b", "X"].sort();          // ["X", "a", "b"]

Internally the byte values are being compared (for these letters, their ASCII codes), which you can check in iex with the ? code point operator:

iex(15)> ?X
88
iex(16)> ?a
97
iex(17)> ?x
120

So, always convert to lowercase before comparing, to avoid running into edge cases.

For instance, see how the uppercase variants sort first:

iex(22)> ["derpycoder", "Derpycoder", "DerpyCoder"] |> Enum.sort()
["DerpyCoder", "Derpycoder", "derpycoder"]

See the answer by @sbuttgereit.

Also be aware that if you are sorting non-ASCII strings, you should normalise each string first. For example String.downcase(string) |> String.normalize(:nfkd).

Lastly, collation rules are language and culture dependent, even for the same strings, so depending on what you’re trying to do, this is a much more complex topic than it seems on the surface.
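As a rough sketch of the normalisation step, the example below sorts a list containing an accented word, first with the default byte ordering and then with a downcased, NFKD-normalised key. The word list and variable names are made up for illustration. This is still plain byte comparison on the normalised key, not real collation, so it does not apply any language-specific rules.

```elixir
# "\u00E9" is a precomposed "é"; in UTF-8 its first byte is 0xC3,
# which is larger than the byte for "z".
words = ["zebra", "\u00E9clair", "apple"]

# Default ordering compares raw bytes, so the accented word ends up last.
naive = Enum.sort(words)
IO.inspect(naive)
# prints ["apple", "zebra", "éclair"]

# NFKD decomposes "é" into "e" followed by a combining accent,
# so the sort key now starts with the plain byte for "e".
sort_key = fn string ->
  string |> String.downcase() |> String.normalize(:nfkd)
end

normalized = Enum.sort_by(words, sort_key)
IO.inspect(normalized)
# prints ["apple", "éclair", "zebra"]
```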

Very interesting, thanks for all those details. Much more complex topic than what I expected indeed!