How to lowercase the Greek letter 'Σ' correctly, as Rust does for example?

vjorjo · August 6, 2017, 12:46pm

Hi everybody,

In Rust:
println!("{}", "ΣΑΝ ΠΑΣ ΣΕ ΣΑΣ".to_lowercase());
prints:
σαν πας σε σας

which is correct, but in Elixir this:
String.downcase("ΣΑΝ ΠΑΣ ΣΕ ΣΑΣ")

returns:
"σαν πασ σε σασ"

which is wrong.

Does anybody know a way to get the correct result, as in Rust?

Thanks,
George

NobbZ · August 6, 2017, 1:19pm

The Elixir result does conform to Unicode 9.0; please file a bug report with the Rust team.

To get the same behaviour as in Rust, you need to find a library that follows locale-specific rules for upcasing and downcasing.

But as far as I know there aren’t any.

josevalim · August 6, 2017, 1:52pm

It is a bug, can you please file a report? Elixir does not support language-specific mappings, but the behaviour above is not language-specific according to Unicode.

NobbZ · August 6, 2017, 2:05pm

Oh… I just assumed it was correct here because, as far as I understood the process so far, you generate those from files that are part of the Unicode standard…

josevalim · August 6, 2017, 8:55pm

This entry is a conditional mapping. Those cannot be generated automatically because their conditions have to be implemented manually. There are also language-sensitive conditional mappings, and those we don’t support at all.

NobbZ · August 6, 2017, 8:57pm

I think I do understand. So the document you use to generate the module from is only partial and omits some special/edge cases?

josevalim · August 7, 2017, 8:07am

Here is the document: http://unicode.org/Public/UNIDATA/SpecialCasing.txt

You can see this rule is inside Conditional Mappings, in the language-insensitive section. The name of the condition is Final_Sigma, and you need to look up its definition in another document, which I don’t have handy right now. Roughly, it means the rule applies when the Σ is preceded by a cased letter and is not followed by one.
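A rough sketch of the Final_Sigma condition in Rust, as I understand it from the Unicode Standard: 'Σ' becomes 'ς' when a cased letter comes before it and no cased letter comes after it, skipping case-ignorable characters on both sides. The helper names are made up for illustration, and both property checks are simplified assumptions: `is_cased` misses titlecase letters, and `is_case_ignorable` knows only a handful of punctuation marks instead of the full Unicode property.

```rust
/// Rough stand-in for the Unicode `Cased` property (titlecase letters are missed).
fn is_cased(c: char) -> bool {
    c.is_uppercase() || c.is_lowercase()
}

/// Tiny, incomplete stand-in for `Case_Ignorable`: a few punctuation marks only.
fn is_case_ignorable(c: char) -> bool {
    matches!(c, '.' | '\'' | ':' | '\u{2019}' | '\u{00B7}')
}

/// True if the first character that is not case-ignorable is a cased letter.
fn ignorable_then_cased(mut iter: impl Iterator<Item = char>) -> bool {
    iter.find(|&c| !is_case_ignorable(c)).map_or(false, is_cased)
}

/// Final_Sigma check for the 'Σ' at index `i`: cased before, not cased after.
fn is_final_sigma(chars: &[char], i: usize) -> bool {
    ignorable_then_cased(chars[..i].iter().rev().copied())
        && !ignorable_then_cased(chars[i + 1..].iter().copied())
}

/// Lowercases `input`, choosing between 'σ' and 'ς' with the check above.
fn downcase_final_sigma(input: &str) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::with_capacity(input.len());
    for (i, &c) in chars.iter().enumerate() {
        if c == 'Σ' {
            out.push(if is_final_sigma(&chars, i) { 'ς' } else { 'σ' });
        } else {
            // Every other character takes its default lowercase mapping.
            out.extend(c.to_lowercase());
        }
    }
    out
}
```

With these assumptions, `downcase_final_sigma("ΣΑΝ ΠΑΣ ΣΕ ΣΑΣ")` gives the expected "σαν πας σε σας".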

vjorjo · August 7, 2017, 2:10pm

I haven’t seen how the rule is specified in the Unicode documents, but I know the rule from my native language, Greek.
The rule is very simple.
There is one capital letter (‘Σ’) for the two small letters ‘σ’ and ‘ς’.
Upcasing is easy: it’s always ‘Σ’ for both small letters.
Downcasing ‘Σ’ always gives ‘σ’, unless the ‘Σ’ is the last letter of a word of at least two letters. In that case it becomes ‘ς’, which for this reason is called final sigma.
e.g.
“Σ” -> “σ”
“ΣΣ” -> “σς”
“ΣΑΣ ΜΑΣ ΜΑΣΑΣ” -> “σας μας μασας”
“ΣΑΣ. ΜΑΣ, ΜΑΣΑΣ.” -> “σας. μας, μασας.”
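The rule as stated here can be sketched in a few lines of Rust. This is only an illustration of the simple word-final rule, not of what any library does: the function name is hypothetical, and "letter" is assumed to mean whatever `char::is_alphabetic` accepts.

```rust
/// Lowercases `input`; a 'Σ' becomes 'ς' only when a letter precedes it
/// and no letter follows it, i.e. it ends a word of two or more letters.
fn downcase_greek_simple(input: &str) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::with_capacity(input.len());
    for (i, &c) in chars.iter().enumerate() {
        if c == 'Σ' {
            let after_letter = i > 0 && chars[i - 1].is_alphabetic();
            let before_letter = chars.get(i + 1).map_or(false, |n| n.is_alphabetic());
            out.push(if after_letter && !before_letter { 'ς' } else { 'σ' });
        } else {
            // Every other character takes its default lowercase mapping.
            out.extend(c.to_lowercase());
        }
    }
    out
}
```

Called on the four examples above, it returns the lowercase forms listed there.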

NobbZ · August 7, 2017, 2:20pm

Back in the old days, we had similar rules for “round s” and “long ſ (s)” in Germany. AFAIK even other Latin-alphabet languages had something similar.

But in Germany, we had to look not only at the end of the complete word; in compound words we had to look at the composition boundaries as well.

E.g. there were the words “Wachſtube” (guardroom) and “Wachstube” (tube of wax). Today you have to conclude from context what is meant… It was so much easier 60 years ago… Is there a similar rule for the final sigma?

vjorjo · August 7, 2017, 2:33pm

No, there isn’t.
The final sigma occurs only at the end of a word.
However, there is an exception when you cut the word short (an abbreviation).
E.g.
You have the word “ΑΘΑΝΑΣΙΑΔΗΣ”, which is just a surname.
And you want to cut it at the first ‘Σ’:
“ΑΘΑΝΑΣ.”
Then if you lowercase it, you’ll get: “αθανασ.”
In this case you cannot put a final ‘ς’. It changes the word. You no longer think it is an abbreviation if it has a final sigma.
As far as I can see, in Rust if you downcase this:
“ΑΘΑΝΑΣ. ΑΘΑΝΑΣ.ΧΡΗΣΤΟΣ”
you get this:
“αθανας. αθανασ.χρηστος”
Maybe they catch this case: if a dot is immediately followed by letters, they treat it as an abbreviation.
Every other dot is considered the end of a sentence.
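The output quoted above can be checked with the Rust standard library alone. The small helper below is hypothetical; it only wraps `str::to_lowercase` and counts how many final sigmas came out. A plausible reading, consistent with the Final_Sigma condition, is that Rust is not detecting abbreviations at all: the dot is simply skipped as a case-ignorable character, so a 'Σ' followed by ".Χ" still counts as being followed by a letter.

```rust
/// Lowercases `text` with the standard library and also returns the number
/// of final sigmas ('ς') that appear in the result.
fn lowercase_and_count_final(text: &str) -> (String, usize) {
    let lowered = text.to_lowercase();
    let finals = lowered.chars().filter(|&c| c == 'ς').count();
    (lowered, finals)
}
```

For "ΑΘΑΝΑΣ. ΑΘΑΝΑΣ.ΧΡΗΣΤΟΣ" this yields two final sigmas: one before the dot that is followed by a space, and one at the very end of the string.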