String.capitalize() should have a “leave the rest of the word alone” option

I wanted to capitalize a string, and tried using String.capitalize().

That generally works well, until you try to capitalize a word like ATM, which then comes out as Atm.

That is correct according to the documentation, as it states that String.capitalize() does this:

Converts the first character in the given string to uppercase and the remainder to lowercase according to mode. [emphasis mine]
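
For anyone who wants to see the documented behaviour first-hand, here is a minimal sketch using only String.capitalize/1 from the standard library:

```elixir
# String.capitalize/1 upcases the first character and downcases the rest.
IO.inspect(String.capitalize("atm"))       # "Atm"
IO.inspect(String.capitalize("ATM"))       # "Atm" (the abbreviation is mangled)
IO.inspect(String.capitalize("mcDonald"))  # "Mcdonald"
```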

Now that is not what I expected from a function called capitalize(). I naïvely thought it would only touch the first letter, as I assumed capitalize() does in Python and as similarly named functions do in other programming languages.

Elixir is probably mimicking Ruby here, and the point of this post is not to say that the implementation is wrong; it’s too late to change now anyway.

However, it would be handy if there were a :first_letter_only or similar mode that would only touch the first letter, so developers don’t have to write their own capitalize function to safely capitalize words without mangling abbreviations.

(If you come across this topic wanting to capitalize just the first letter, you can use String.Casing.titlecase_once(word, nil), although this is undocumented.) Edit: don’t do this; see the comment from josevalim below.

The reason we upcase only the first letter is that Unicode has specific rules that treat the rest as being lowercased.

For the behaviour that you want, you should be able to implement it using String.next_grapheme/1: get the first grapheme, upcase it, and then concatenate the rest. Just make sure to take empty strings into account.
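
A minimal sketch of that suggestion follows. The module and function names (UpcaseFirst.upcase_first/1) are made up for illustration; only String.next_grapheme/1 and String.upcase/1 come from the standard library.

```elixir
defmodule UpcaseFirst do
  # Hypothetical helper: upcases the first grapheme, leaves the rest untouched.
  def upcase_first(string) when is_binary(string) do
    case String.next_grapheme(string) do
      # next_grapheme/1 returns {first_grapheme, rest} for non-empty input...
      {first, rest} -> String.upcase(first) <> rest
      # ...and nil for the empty string.
      nil -> ""
    end
  end
end

IO.inspect(UpcaseFirst.upcase_first("atm machine"))  # "Atm machine"
IO.inspect(UpcaseFirst.upcase_first("ATM"))          # "ATM"
IO.inspect(UpcaseFirst.upcase_first(""))             # ""
```

Note that this uses the uppercase form of the first grapheme. For a handful of digraph characters the Unicode titlecase form differs from the uppercase form, and this sketch does not try to handle that.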

Don’t do this. :) String.Casing is private API and may change at any time, breaking your code.

Since this is a fairly common thing to do, and one with a bunch of tricky edge cases, wouldn’t it be the kind of thing that belongs in the standard library?

At least as a relative newcomer to Elixir, I find it a bit surprising to have to write my own String.cap_first() implementation, because the language does not provide one.

Anyway, Erlang’s titlecase() function (added in OTP 20) works for my use case at least. It can be called like this: :string.titlecase(text). You should probably wrap it, though, to filter out non-binary input and the like.
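
Such a wrapper could look roughly like this. It is only a sketch: the module name, the function name and the {:ok, _} / :error return shape are my own choices, and only :string.titlecase/1 and String.valid?/1 are real library calls.

```elixir
defmodule TitlecaseWrapper do
  # Hypothetical wrapper around Erlang's :string.titlecase/1.
  # The empty string is handled up front so the result is always a binary.
  def titlecase_first(""), do: {:ok, ""}

  def titlecase_first(text) when is_binary(text) do
    # Reject binaries that are not valid UTF-8 instead of letting the call raise.
    if String.valid?(text) do
      {:ok, :string.titlecase(text)}
    else
      :error
    end
  end

  # Charlists, atoms, nil and anything else are filtered out.
  def titlecase_first(_other), do: :error
end

IO.inspect(TitlecaseWrapper.titlecase_first("atm machine"))  # {:ok, "Atm machine"}
IO.inspect(TitlecaseWrapper.titlecase_first("ATM"))          # {:ok, "ATM"}
IO.inspect(TitlecaseWrapper.titlecase_first(nil))            # :error
```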

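Note that title case, in the usual sense, means that “black beans” becomes “Black Beans”, which is usually what you want. As far as I can tell from the Erlang docs, though, :string.titlecase/1 only changes the first grapheme cluster of the string, so it would give “Black beans”.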

I have been looking at capitalize quite closely and wondering what use case it is trying to be helpful for. It seems like titleize would have been more useful.

Elixir does a good job of conforming to the Unicode standard. Unicode doesn’t have anything particular to say about “capitalisation” that I can find but it is clear about what title casing means and String.capitalize/2 appears to conform to that expectation (upcase the first grapheme, downcase the rest).

Wikipedia notes on the subject:

Capitalization (North American English) or capitalisation (British English) is writing a word with its first letter as a capital letter (uppercase letter) and the remaining letters in lower case, in writing systems with a case distinction. The term also may refer to the choice of the casing applied to text.

Which is what String.capitalize/2 does. Wikipedia goes on to suggest Capital Case and Title Case are synonyms.

The issue of casing multi-word strings (in any form: up, down, title, other) is complicated. What is a word boundary? In some languages, like Chinese, there is no way to determine word boundaries without a lexicon and heuristics. Elixir’s String.split/1 mostly uses Unicode whitespace and a few other graphemes to break words. I have a more complete implementation of word (and sentence) breaking in the unicode_string library (whose docs I must improve), but even that is limited to a smallish number of languages.
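
To make the word-boundary problem concrete, here is a small sketch using only String.split/1, Enum.map_join/3 and String.capitalize/1 from the standard library. The sample strings are my own.

```elixir
# String.split/1 breaks on Unicode whitespace, which works for English...
IO.inspect(String.split("black beans"))  # ["black", "beans"]

# ...but text written without spaces between words comes back as one chunk.
IO.inspect(String.split("黑豆很好吃"))  # ["黑豆很好吃"]

# A naive per-word capitalize built on top of it only helps where
# whitespace actually marks the word boundaries.
"black beans"
|> String.split()
|> Enum.map_join(" ", &String.capitalize/1)
|> IO.inspect()  # "Black Beans"
```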

Maybe a proposal to add a new function like String.upcase_first/2 would be a reasonable suggestion. At least its intent would be unambiguous :)

I’d really like to see this, but have it be written in a way that it can be practically applied.

For example: https://titlecaseconverter.com/

The above is an epic resource for capitalizing words in a sentence based on well-known writing standards. It’s nowhere near as simple as just having a blacklist of words not to capitalize. Luckily, all of those styles are well described, so it should be possible to port them to another language. That’s what that website does. No ML required (I asked the author).

The default (Chicago) is used in a lot of places for headlines and sub-heading capitalization. That would be a good one to begin with.

That’s a cool site. But I don’t think it’s the role of the standard lib to provide that level of casing. It’s specific to the English language, for one, and as the site says, it expresses opinions on the rules.

To implement it in Elixir I think one would need (a) a language detector, (b) a text segmenter, (c) a part-of-speech tagger, and then (d) a way to apply styles. I’ve written a version of (a) and (b), and writing (c) is on my list. I hadn’t thought about (d), but it’s an interesting area. I think some rules could be written to drive it. I’ll add it to the long list of things to experiment with!

Recase may do what you want, and if not, you may be able to submit a change to do what you want.

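  def capitalize_first_grapheme(string) when is_binary(string) do
    # Split off the first UTF-8 codepoint; rest is everything after it.
    <<_first_grapheme::utf8, rest::binary>> = string
    # Capitalize the first character of the string and re-attach the remainder.
    String.capitalize(String.slice(string, 0..0)) <> rest
  end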

In case anyone else needs a quick function to do this in a UTF-8 compatible way, and doesn’t want to introduce a new dependency.

Nice! No point in slicing twice though:

    <<first_grapheme::utf8, rest::binary>> = string
    String.capitalize(<<first_grapheme::utf8>>) <> rest

EDIT: Fixed a bug pointed out by @adamu thanks!

Related SO Q&A: “Elixir upcase only first letter of a word” (Stack Overflow)

Needs to specify ::utf8:

String.capitalize(<<first_grapheme::utf8>>) <> rest

Otherwise it will give you the equivalent of <<rem(first_grapheme, 256)>>.

edit: also first_grapheme is a codepoint, not a grapheme :)
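
Both points can be checked in a few lines. This is just an illustrative sketch: CodepointDemo and its two functions are hypothetical names, and the sample strings are my own.

```elixir
defmodule CodepointDemo do
  # Without ::utf8 the integer is written as a single byte, truncated to 8 bits.
  def as_byte(codepoint), do: <<codepoint>>

  # With ::utf8 the integer is encoded as a UTF-8 codepoint.
  def as_utf8(codepoint), do: <<codepoint::utf8>>
end

<<first::utf8, _rest::binary>> = "łódź"
IO.inspect(first)                         # 322
IO.inspect(CodepointDemo.as_byte(first))  # "B", because rem(322, 256) == 66
IO.inspect(CodepointDemo.as_utf8(first))  # "ł"

# One grapheme can span several codepoints: "e" plus a combining acute accent.
IO.inspect(length(String.codepoints("e\u0301")))  # 2
IO.inspect(length(String.graphemes("e\u0301")))   # 1
```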

Python has the same behaviour as Ruby (at least in Python 3):

>>> "THIS IS A SENTENCE".capitalize()
'This is a sentence'

>>> "THIS IS A SENTENCE".title()
'This Is A Sentence'

Found another potential bug (if not checked for): if the string is "", the binary pattern match will raise a MatchError. (String.capitalize, however, accepts "" as an argument and simply returns "".)
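
One way to cover that case is to add a function clause for the empty string and move the binary pattern into the function head. A minimal sketch, with a made-up module and function name:

```elixir
defmodule Capitalizer do
  # Hypothetical helper: the empty string is returned unchanged.
  def capitalize_first_codepoint(""), do: ""

  # Capitalize only the first codepoint and keep the rest as it is.
  def capitalize_first_codepoint(<<first::utf8, rest::binary>>) do
    String.capitalize(<<first::utf8>>) <> rest
  end
end

IO.inspect(Capitalizer.capitalize_first_codepoint(""))     # ""
IO.inspect(Capitalizer.capitalize_first_codepoint("atm"))  # "Atm"
IO.inspect(Capitalizer.capitalize_first_codepoint("ATM"))  # "ATM"
```

Input that is not a binary, or a binary that does not start with a valid UTF-8 codepoint, matches neither clause and raises a FunctionClauseError, which may or may not be what you want.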