Person name formatting / normalisation - feedback request

TL;DR: Seeking feedback on what, if any, normalisations and validations should be provided when accepting input that represents a person’s name.

Context

When you accept input for the name of a person, what validation or normalisation do you normally do? CLDR 48 will be out very soon and with it new releases of many of the ex_cldr libraries.

One of these is ex_cldr_person_names which supports formatting names in a locale-sensitive way. I have an opportunity to provide some validations or normalisations as optional capabilities.

Person Name Validation - current practice

When implementations allow entry of person names, they can be too strict; there are instances where people can’t enter their real names, such as O’Brian, Stéphanie, Wałęsa, Þjóðólfr.

When you accept input for a person’s name do you:

  • Do nothing - as long as it’s UTF-8 it’s OK?
  • Restrict to a single script? Maybe that’s too lenient - even the Latin script has over 1_400 characters in it
  • Do you allow emoji characters, like the name B🅾️b?
  • What about the always fun Zalgo text?

Possible normalisations

  • Transform to Unicode NFC format
  • Replacement of arbitrary sequences of whitespace characters by a single space.
    • \p{whitespace}{2,∞} → U+0020
  • Replacement of U+2010 HYPHEN and U+2011 NON-BREAKING HYPHEN by U+002D HYPHEN-MINUS
    • [‐‑] → -
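To make the three normalisations above concrete, here is a minimal sketch. It is written in Python purely for illustration and is not the ex_cldr_person_names API; the function name `normalise_name` is hypothetical. It also assumes Python's `\s` is a close enough stand-in for `\p{whitespace}`, which is only approximately true.

```python
import re
import unicodedata

# U+2010 HYPHEN and U+2011 NON-BREAKING HYPHEN, as listed above.
_HYPHEN_VARIANTS = re.compile("[\u2010\u2011]")

# Runs of two or more whitespace characters, mirroring {2,∞} above.
# Assumption: Python's \s approximates the Unicode White_Space property.
_WHITESPACE_RUN = re.compile(r"\s{2,}")


def normalise_name(name: str) -> str:
    """Apply the three candidate normalisations (hypothetical helper)."""
    name = unicodedata.normalize("NFC", name)   # compose to NFC
    name = _HYPHEN_VARIANTS.sub("-", name)      # hyphen variants to U+002D
    name = _WHITESPACE_RUN.sub(" ", name)       # whitespace runs to U+0020
    return name
```

Because the pattern follows the `{2,∞}` rule as written, a single tab or non-breaking space between name parts is left untouched; using `\s+` instead would replace those as well.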

Additional possible constraints

  • Too many identical grapheme clusters in a sequence
    • (Tóóóóóm)
  • Too many non-letters in a row
    • (Jean—Luc Jr.., MD)
  • Too many combining marks in a row
    • Faruq̣̣̈̈
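As a rough sketch of how these three constraints could be measured (again Python for illustration only, not the library's API; all function names are hypothetical): each function returns the longest offending run, and the caller would compare it against whatever threshold they choose. Grapheme clusters are approximated here as a base character plus any following combining marks, which is simpler than full Unicode grapheme segmentation.

```python
import unicodedata
from itertools import groupby


def _is_mark(ch: str) -> bool:
    # General categories Mn, Mc, Me: combining marks.
    return unicodedata.category(ch).startswith("M")


def _is_letter(ch: str) -> bool:
    return unicodedata.category(ch).startswith("L")


def clusters(name: str) -> list[str]:
    """Approximate grapheme clusters: a base character plus trailing marks."""
    out: list[str] = []
    for ch in unicodedata.normalize("NFC", name):
        if out and _is_mark(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def _longest_run(chars, predicate) -> int:
    """Length of the longest consecutive run where predicate is true."""
    return max(
        (len(list(group)) for flag, group in groupby(chars, key=predicate) if flag),
        default=0,
    )


def longest_identical_cluster_run(name: str) -> int:
    return max((len(list(g)) for _, g in groupby(clusters(name))), default=0)


def longest_non_letter_run(name: str) -> int:
    # Marks are treated as part of the letter they attach to.
    return _longest_run(name, lambda ch: not (_is_letter(ch) or _is_mark(ch)))


def longest_combining_run(name: str) -> int:
    return _longest_run(unicodedata.normalize("NFC", name), _is_mark)
```

Normalising to NFC first means a legitimately accented letter such as é contributes no combining marks, so only marks that cannot be composed count toward the combining-mark run.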

Not in scope

Removing profane words or otherwise accepting or rejecting words according to some dictionary.

I usually just leave it free form. Because if someone wants to have fun with your name input, they will be able to, even with only ASCII characters. And once you allow a bit of Unicode, which you need to anyway, it already opens up too many possibilities.

Thanks José. For sure anything I do will be opt-in.

I suspect there are certain domains - thinking mostly of governments and legal registries - where the requirements are more strict. For example, the state of NSW in Australia has a formal policy that is quite restrictive. And it must be a very very old system since it says:

The Registry is currently unable to include diacritical marks or accents in any name registered in NSW.

Which suggests even your name wouldn’t be able to be registered!

I have some experience dealing with names (and addresses, for that matter) back in Russia, and I am absolutely sure any restriction you provide would be violated once your DB has at least 100 customers, frustrating the next son of Musk. The same goes for email validation: you check it has at least one letter before a mandatory @ and one after, that’s it.

Check Mongolian names for Zalgo, for instance.

No assumption can be made when it comes to names. E.g. I had issues registering on the IKEA Spain website, because Spaniards all have two last names, and I have only one. IKEA still ships my furniture to Aleksei Matiushkin Idonthaveasecondsurname, fwiw.

In my humble opinion, a general-purpose library should not try to satisfy the needs of any weird restrictive software behind business/govt websites/apps. If they don’t allow anything but ASCII-7, they would be perfectly able to filter it out on their own.

If I understand correctly, Kip is not asking what should be generally normalized/validated (this has been discussed very often), but what is often done, so that his library can provide the means.

That’s indeed my primary objective. If there are common community transforms I am open to including them.

That’s a very reasonable expectation. But it seems to be very uncommon - including your own example. My objective is always to make it easier for a developer to deliver a better experience for an end user and a more reliable and understandable implementation.

Perhaps I am too unrealistic (wouldn’t be the first time!)

I see a logic gap here. The best UX ever would be to allow the user to enter their name, even if it looks weird to anybody else (meaning no restriction whatsoever). On the other hand, the developer might work for a business/govt that decided to limit UX to whatever they consider valid (usually it is dictated by their ancient accounting software or the like). In this case, each developer would meet unique rules, which no one can ever predict, let alone implement in a general-purpose lib.

What I am afraid of is that if the library provides a handy way to screw up the input, 20% of devs would switch it on, ruining the UX of the 80% of users.

I think names in general are so screwed up in so many applications that it’s possible my ego is telling me I can do better (my own name gets screwed up all the time and it’s far from a difficult case).

I agree with the sentiment that a user should be able to call themselves whatever they want, and write it any way they like. Government ID verification and regulation make that unrealistic in so many situations.

I think you’re saying that I can’t realistically do anything to improve that situation. I’m not (quite) ready to give up on it yet!

Honestly, if there is anybody on the Green Earth who can, that’s you. I could never have thought/expressed anything like a doubt in this, really. I am just saying this is a minefield and providing easy ways to restrict the names is literally planting another mine into it.

Looking forward to published attribution in cldr_routes.

So far I have never restricted usernames, other than rejecting ‘@’ to prevent user mistakes. There are too many gotchas involved.

That being said, I do believe that there are legacy systems and database constraints. But that can be solved with conversion. Maybe a standard for how to approach it would be nice though, and your expertise would be invaluable in crafting such a guide/behaviour.

I always hate it when a site tries some kind of validation enforcement on a name you enter. In our country it is usual that a last name starts with a capital letter if it is a single word; well, mine doesn’t, and more than half of the sites either do not allow me to enter my name correctly or capitalize it automatically.

Mandatory link: Falsehoods Programmers Believe About Names | Kalzumeus Software
