I wanted to capitalize a string, and tried using String.capitalize().
That generally works well, until you try to capitalize a word like ATM, which then comes out as Atm.
That is correct according to the documentation, as it states that String.capitalize() does this:
Converts the first character in the given string to uppercase and the remainder to lowercase according to mode. [emphasis mine]
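For illustration, here is a quick check of that behaviour in IEx or a script. The sample strings are mine, not from the documentation:

```elixir
# String.capitalize/1 upcases the first character and downcases the rest.
IO.inspect(String.capitalize("atm"))          # "Atm"
IO.inspect(String.capitalize("ATM"))          # "Atm"
IO.inspect(String.capitalize("black BEANS"))  # "Black beans"
```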
Now that is not what I expected from a function called capitalize(). I naïvely thought it would only touch the first letter, as I assumed similarly named functions do in other programming languages (although Python's str.capitalize(), for one, also lowercases the rest).
Elixir is probably mimicking Ruby here, and the point of this post is not to say that the implementation is wrong; it’s too late to change now anyway.
However, it would be handy if there was a :first_letter_only or similar mode, that would only touch the first letter, so developers everywhere won’t have to write their own capitalize function to be able to safely capitalize words without mangling abbreviations.
(If you come across this topic wanting to just capitalize the first letter, you can use String.Casing.titlecase_once(word, nil), although this is undocumented.) Don’t do this; see the comment from josevalim below.
The reason we uppercase only the first letter and lowercase the rest is that Unicode has specific rules that treat the rest as being lowercased.
For the behaviour that you want, you should be able to implement it using String.next_grapheme/1: get the first grapheme, upcase it, and then concatenate the rest. Just make sure to take empty strings into account.
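A minimal sketch of that suggestion, assuming a hypothetical module and function name (CapFirst.upcase_first/1 is not part of the standard library):

```elixir
defmodule CapFirst do
  # Upcases the first grapheme and leaves the rest of the string untouched.
  def upcase_first(string) when is_binary(string) do
    case String.next_grapheme(string) do
      # String.next_grapheme/1 returns nil for an empty string.
      nil -> ""
      {first, rest} -> String.upcase(first) <> rest
    end
  end
end

IO.inspect(CapFirst.upcase_first("ATM"))          # "ATM"
IO.inspect(CapFirst.upcase_first("black beans"))  # "Black beans"
IO.inspect(CapFirst.upcase_first(""))             # ""
```

This only uses public String functions, so it does not depend on the private String.Casing module.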
Don’t do this. String.Casing is private API and may change at any time, breaking your code.
Since this is a fairly common thing to do, and one with a bunch of tricky edge cases, wouldn’t it be the kind of thing that belongs in the standard library?
At least as a relative newcomer to Elixir, I find it a bit surprising to have to write my own String.cap_first() implementation, because the language does not provide one.
Anyway, Erlang’s titlecase() function (added in OTP 20) works for my use case at least. It can be called like this: :string.titlecase(text), although you should probably wrap it to filter out non-binary strings and such things.
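Such a wrapper could look roughly like this. The module and function names are made up for the example, and rejecting non-binaries with a guard is just one possible way of filtering them out:

```elixir
defmodule Titlecase do
  # Only accepts binaries; charlists and other terms raise FunctionClauseError.
  def first(text) when is_binary(text) do
    # :string.titlecase/1 changes only the first grapheme cluster.
    :string.titlecase(text)
  end
end

IO.inspect(Titlecase.first("ATM"))          # "ATM"
IO.inspect(Titlecase.first("black beans"))  # "Black beans"
```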
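Note that title case in the everyday sense means that “black beans” will become “Black Beans”, which is usually what you want. Erlang’s :string.titlecase/1, however, only changes the first grapheme cluster, so it returns “Black beans”.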
I have been looking at capitalize quite closely and wondering what use case it is trying to be helpful for. It seems like titleize would have been more useful.
Elixir does a good job of conforming to the Unicode standard. Unicode doesn’t have anything particular to say about “capitalisation” that I can find but it is clear about what title casing means and String.capitalize/2 appears to conform to that expectation (upcase the first grapheme, downcase the rest).
Capitalization (North American English) or capitalisation (British English) is writing a word with its first letter as a capital letter (uppercase letter) and the remaining letters in lower case, in writing systems with a case distinction. The term also may refer to the choice of the casing applied to text.
Which is what String.capitalize/2 does. Wikipedia goes on to suggest Capital Case and Title Case are synonyms.
The issue of casing multi-word strings (in any form: up, down, title, other) is complicated. What is a word boundary? In some languages, like Chinese, there is no way to determine word boundaries without a lexicon and heuristics. Elixir’s String.split/1 uses mostly Unicode whitespace and a few other graphemes to break words. I have a more complete implementation of word (and sentence) breaking in the lib unicode_string (for which I must improve the docs), but even that is limited to a smallish number of languages.
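To make the limitation concrete, here is a deliberately naive sketch that title-cases each whitespace-separated word with String.split/1. The NaiveTitle module is hypothetical, and the approach assumes that whitespace marks word boundaries, which is exactly what fails for languages like Chinese:

```elixir
defmodule NaiveTitle do
  # Splits on whitespace with String.split/1, upcases the first grapheme of
  # each piece, and rejoins with single spaces (so runs of whitespace collapse).
  def title(text) when is_binary(text) do
    text
    |> String.split()
    |> Enum.map(&upcase_first/1)
    |> Enum.join(" ")
  end

  # Upcases the first grapheme only; the rest of the word is left as-is.
  defp upcase_first(word) do
    case String.next_grapheme(word) do
      nil -> ""
      {first, rest} -> String.upcase(first) <> rest
    end
  end
end

IO.inspect(NaiveTitle.title("black beans"))      # "Black Beans"
IO.inspect(NaiveTitle.title("the ATM is down"))  # "The ATM Is Down"
```

Text written without spaces would be treated as a single word here, so only its first grapheme would be touched.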
Maybe a proposal to add a new function like String.upcase_first/2 would be a reasonable suggestion - at least its intent would be unambiguous.
The above is an epic resource for capitalizing words in a sentence based on well-known writing standards. It’s nowhere near as simple as just having a blacklist of words not to capitalize. Luckily, all of those styles are well described, so it should be possible to port it to another language. That’s what that website does. No ML required (I asked the author).
The default (Chicago) is used in a lot of places for headlines and sub-heading capitalization. That would be a good one to begin with.
That’s a cool site. But I don’t think it’s the role of the standard lib to provide that level of casing. It’s specific to the English language, for one, and as the site says, it expresses opinions on the rules.