I'm wondering why I get this behavior, which seems counterintuitive to me:
iex(1)> "x" > "a"
true
iex(2)> "X" > "a"
false
When I sort a list of strings, I don't expect this result:
iex(1)> Enum.sort(["b", "a", "X"])
["X", "a", "b"]
I couldn't find where this behavior is explained in the documentation. If somebody knows, I'm interested.
Thanks!
Bitstrings are compared byte by byte, incomplete bytes are compared bit by bit.
Uppercase letters use smaller byte values than lowercase ones.
https://hexdocs.pm/elixir/operators.html#term-ordering
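To make the byte-by-byte rule concrete, here is a small sketch that pulls out the first byte of each string and compares the two. The variable names are only illustrative.

```elixir
# Elixir strings are UTF-8 binaries, so we can pattern match on their bytes.
<<first_upper, _rest::binary>> = "X"
<<first_lower, _rest2::binary>> = "a"

# For ASCII letters the byte value equals the code point.
IO.inspect({first_upper, first_lower})  # {88, 97}

# Comparison follows the byte values: 88 < 97, so "X" sorts before "a".
IO.inspect("X" < "a")                   # true
IO.inspect("X" > "a")                   # false
```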
Thanks!
How do you easily sort by alphabetical order then?
Just off the cuff… maybe something like this.
iex(4)> Enum.sort(["b", "X", "a"], &(String.downcase(&1) <= String.downcase(&2)))
["a", "b", "X"]
There could be some string handling caveats that I’m not thinking about, but it’s the general idea.
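A variation on the same idea, as a sketch: Enum.sort_by/2 computes the lowercased key once per element instead of on every comparison, and returns the original strings unchanged.

```elixir
words = ["b", "X", "a"]

# Sort on a lowercased key; the original casing is preserved in the result.
sorted = Enum.sort_by(words, &String.downcase/1)

IO.inspect(sorted)  # ["a", "b", "X"]
```

This only handles case. It is still a byte-wise comparison of the lowercased keys, so the caveats about non-ASCII text apply.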
JavaScript has the same behavior:
["a", "b", "X"].sort(); // ["X", "a", "b"]
Internally the byte values (the ASCII codes, for these letters) are being compared, which you can check in iex with the ? operator:
iex(15)> ?X
88
iex(16)> ?a
97
iex(17)> ?x
120
So, when you want a case-insensitive order, convert both strings to lowercase before comparing.
For instance, see how uppercase letters sort first:
iex(22)> ["derpycoder", "Derpycoder", "DerpyCoder"] |> Enum.sort()
["DerpyCoder", "Derpycoder", "derpycoder"]
See the answer by @sbuttgereit.
kip, January 26, 2023, 3:53pm
Also be aware that if you are sorting non-ASCII strings, you should normalise the string first. For example: String.downcase(string) |> String.normalize(:nfkd).
Lastly, collation rules are language and culture dependent even for the same strings so depending on what you’re trying to do this is a much more complex topic than it seems on the surface.
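As a rough sketch of the normalisation point, the example below compares sorting on a lowercased key with sorting on a lowercased and NFKD-normalised key. The word list and the sort_key helper are made up for illustration. This is still a byte-wise comparison, not locale-aware collation.

```elixir
# "\u00E9" is a precomposed "é" (bytes 0xC3 0xA9 in UTF-8).
words = ["Zebra", "\u00E9clair", "eclair"]

# Hypothetical helper: lowercase, then decompose "é" into "e" + combining accent.
sort_key = fn string ->
  string |> String.downcase() |> String.normalize(:nfkd)
end

# Without normalisation the precomposed "é" sorts after every ASCII letter.
by_downcase = Enum.sort_by(words, &String.downcase/1)
IO.inspect(by_downcase)    # ["eclair", "Zebra", "éclair"]

# With NFKD the key starts with a plain "e", so it lands next to "eclair".
by_normalized = Enum.sort_by(words, sort_key)
IO.inspect(by_normalized)  # ["eclair", "éclair", "Zebra"]
```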
Very interesting, thanks for all those details. A much more complex topic than I expected indeed!