How to replace accented letters with ASCII letters?

DanielRS · May 7, 2016, 7:35pm

I’m implementing a blogging system in my website, and I’m trying to generate post slugs from the title, like this:

def slugify(string) do
    string
    |> String.normalize(:nfd)
    |> String.replace(~r/[^A-z\s]/u, "")
    |> String.replace(~r/\s/, "-")
end

If I try something like slugify("árboles más grandes") I get arboles-ms-grandes

Trying slugify("los árboles más grandes") returns los-rboles-ms-grandes

My slugify function seems to only work with accented letters at the start of the string.

Best regards,
Daniel Rivas.

KronicDeth · May 7, 2016, 9:06pm

Is it a requirement that you ascii-ify the slugs? There is support in modern browsers for non-ascii URLs being percent encoded in the HTML source, but displayed as the unicode characters as described on stackoverflow. Certain languages have words that can be confused with other words if you strip accents. I’m aware of it in Polish, I don’t know if it affects any of your target languages. If you’re sure you want to strip the accents, then you can disregard this and we’ll move on to a technical solution to the loss of accented characters.
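For reference, here is a small sketch of what percent-encoding a non-ASCII slug looks like, using the standard library's URI module. The slug value is just an example taken from the question, written with escapes so the accented letters are unambiguously precomposed.

```elixir
# Example slug "árboles-más-grandes", with á written as the precomposed U+00E1
slug = "\u00E1rboles-m\u00E1s-grandes"

# Percent-encode the UTF-8 bytes so the slug is safe in an HTML href
encoded = URI.encode(slug)
IO.puts(encoded)

# Decoding restores the original characters, which is what browsers display
IO.puts(URI.decode(encoded))
```

The ASCII letters and the dashes pass through untouched; only the bytes of the accented letters are escaped.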

DanielRS · May 7, 2016, 9:13pm

It is definitely not required, as I can do just fine stripping out the punctuation and replacing whitespace with dashes (my target languages are English and Spanish); however, I would still like to learn how to get this done, for future reference.

Best regards,
Daniel Rivas.

KronicDeth · May 7, 2016, 9:21pm

I’m not sure what is going on. When I tested it, it worked:

iex> "árboles más grandes" |> String.normalize(:nfd) |> String.replace(~r/[^A-z\s]/u, "") |> String.replace(~r/\s/, "-")
"arboles-mas-grandes"
iex> "los árboles más grandes" |> String.normalize(:nfd) |> String.replace(~r/[^A-z\s]/u, "") |> String.replace(~r/\s/, "-")
"los-arboles-mas-grandes"

I got the test string by copying the string from your posting, so I don’t know if that changed the encoding.

hubertlepicki · May 7, 2016, 9:32pm

@KronicDeth this does not seem to work on all systems, at least not on mine:

"los árboles más grandes" |> String.normalize(:nfd) |> String.replace(~r/[^A-z\s]/u, "") |> String.replace(~r/\s/, "-")
"los-rboles-ms-grandes"

There is a Unix library, libiconv, that does that. Erlang has a few wrappers; one of them can be installed from Hex and is called iconv.

iex(1) > :application.start(:iconv)
:ok
iex(2) > :iconv.convert "utf-8", "ascii//translit", "Hubert Łępicki"
"Hubert Lepicki"
iex(3) > :iconv.convert "utf-8", "ascii//translit", "árboles más grandes"
"arboles mas grandes"

I think iconv transliteration also replaces some national characters with recognized ASCII substitutes; for example, I think the German ß is replaced with “ss”.

After you have transliterated the string to the closest ASCII equivalents, you can downcase it and replace whitespace with dashes.
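A minimal sketch of that last step, assuming the input has already been transliterated to ASCII (for example by iconv). The module and function names are made up for illustration.

```elixir
defmodule AsciiSlug do
  # Hypothetical helper: expects text that is already plain ASCII,
  # e.g. the result of an iconv "ascii//translit" conversion.
  def from_ascii(text) do
    text
    |> String.downcase()
    |> String.trim()
    # Collapse each run of whitespace into a single dash
    |> String.replace(~r/\s+/, "-")
  end
end

IO.inspect(AsciiSlug.from_ascii("Hubert Lepicki"))
```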

KronicDeth · May 7, 2016, 9:59pm

Let’s (both @hubertlepicki and @DanielRS) back up and compare each part of the pipeline. I’m curious where the difference is. Here’s my output for each stage:

iex> "árboles más grandes" |> String.normalize(:nfd)
"a´rboles ma´s grandes"
iex> "árboles más grandes" |> String.normalize(:nfd) |> String.replace(~r/[^A-z\s]/u, "")
"arboles mas grandes"
iex> "árboles más grandes" |> String.normalize(:nfd) |> String.replace(~r/[^A-z\s]/u, "") |> String.replace(~r/\s/, "-")       
"arboles-mas-grandes"

NOTE: I had to manually type in the acute accents separate from the a for the first stage because although iex prints them separate, when I copied and pasted into Chrome they were recombined, so the above is visually what I saw

Also, let’s not use the abbreviated form of the range in the regex just in case that makes a difference:

iex> "árboles más grandes" |> String.normalize(:nfd) |> String.replace(~r/[^A-Za-z\s]/u, "")
"arboles mas grandes"

In my mind, A-z covers A-Z, then [, \, ], ^, _ and the backtick, and then a-z, and I would think we don’t really want those symbols in the range.
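A quick way to check this, as a sketch: test each of the six ASCII characters that sit between "Z" and "a" against both character classes.

```elixir
# The six ASCII characters between "Z" (0x5A) and "a" (0x61)
symbols = ["[", "\\", "]", "^", "_", "`"]

for s <- symbols do
  # {character, matched by [A-z]?, matched by [A-Za-z]?}
  IO.inspect({s, Regex.match?(~r/^[A-z]$/, s), Regex.match?(~r/^[A-Za-z]$/, s)})
end
```

With [A-z] in a negated class, these symbols would survive the cleanup and end up in the slug.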

hubertlepicki · May 7, 2016, 10:02pm

@KronicDeth it’s String.normalize. I do not think it does what you think it does, I quite frankly do not understand what it should do. But it does not seem to convert UTF-8 national characters to matching ASCII ones at all on my system:

iex(10)> String.normalize "Łępicki", :nfd
"Łępicki"
iex(11)> "árboles más grandes" |> String.normalize(:nfd)
"árboles más grandes"

(And above is exactly what I see on my IEX terminal). I’m on Linux, en_US.UTF-8 LANG.

DanielRS · May 7, 2016, 10:48pm

@hubertlepicki String.normalize(:nfd) decomposes each accented character into several code points (a base letter plus combining marks) whose combination represents the original character. Simple example:

iex(11)> "á" |> String.codepoints
["á"]
iex(12)> "á" |> String.normalize(:nfd) |> String.codepoints
["a", "́"]

However, for some reason it doesn’t work when the accented character is not the first one in the string:

iex(7)> "aá" |> String.normalize(:nfd) |> String.codepoints
["a", "á"]

@KronicDeth Here’s my output:

iex(15)> "árboles más grandes" |> String.normalize(:nfd)
"árboles más grandes"
iex(16)> "árboles más grandes" |> String.normalize(:nfd) |> String.replace(~r/[^A-z\s]/u, "")
"arboles ms grandes"
iex(17)> "árboles más grandes" |> String.normalize(:nfd) |> String.replace(~r/[^A-z\s]/u, "") |> String.replace(~r/\s/, "-")
"arboles-ms-grandes"

My machine is running Archlinux, this is the output of running locale in the terminal:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

I wonder what the problem could be…

hubertlepicki · May 8, 2016, 4:51am

OK, so I would definitely use iconv instead. It will allow you to work with a broader range of characters, and it works as expected.

Possibly you found a bug; it may be worth submitting a GitHub issue on elixir-lang/elixir.

gringocl · May 11, 2016, 12:59am

I ran into this issue today, after upgrading from 1.2.3. This was a bug, and I submitted a PR to fix it on elixir-lang/elixir. Hopefully it will get merged soon!

yurko · January 27, 2017, 4:33pm

If anyone stumbles upon this issue like I did: it might not be clear right away, but it is now possible to slugify a string using only the String functions discussed here. String.normalize(:nfd) splits each accented character into a base letter plus combining marks, so the marks can be removed and the ASCII parts remain. That leaves a reasonable slug (not a grammatically correct transliteration, just the ASCII parts of the special characters).

Here is a changeset function I came up with:

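# Assumes `import Ecto.Changeset` for get_field/2 and put_change/3
defp normalize_slug(changeset) do
  slug =
    changeset
    |> get_field(:slug)
    |> String.normalize(:nfd)
    |> String.downcase()
    # Keep only a-z, hyphens and whitespace; this drops the combining marks
    |> String.replace(~r/[^a-z-\s]/u, "")
    |> String.replace(~r/\s/, "-")

  put_change(changeset, :slug, slug)
end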

A few tests using strings from above:

Hubert Łępicki > hubert-epicki
árboles más grandes > arboles-mas-grandes
Übel wütet der Gürtelwürger > ubel-wutet-der-gurtelwurger
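To try these cases without a changeset, the same pipeline can be wrapped in a plain function. This is only a sketch; the Slug module name is made up, and the only change is that runs of whitespace collapse into a single dash.

```elixir
defmodule Slug do
  # Hypothetical standalone version of the pipeline above, without Ecto
  def slugify(title) do
    title
    |> String.normalize(:nfd)
    |> String.downcase()
    # Anything that is not a-z, whitespace or a hyphen is dropped,
    # including combining marks and letters with no decomposition
    |> String.replace(~r/[^a-z\s-]/u, "")
    |> String.replace(~r/\s+/, "-")
  end
end

for title <- ["Hubert Łępicki", "árboles más grandes", "Übel wütet der Gürtelwürger"] do
  IO.puts("#{title} > #{Slug.slugify(title)}")
end
```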

vyachkonovalov · July 24, 2017, 1:03pm

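str = "Órbita 9"
# The "u" option compiles the pattern in Unicode mode, so the range covers the
# code points U+0300..U+036F (combining diacritical marks) rather than raw bytes
diacritics = Regex.compile!("[\u0300-\u036f]", "u")
String.normalize(str, :nfd) |> String.replace(diacritics, "")
# => "Orbita 9"

# https://stackoverflow.com/a/37511463/1878180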

massimo · September 18, 2017, 11:07am

Elixir supports the Unicode flag in Regex.

You can simply use:

String.normalize("NäytẗkuvaèüÀÁÂÃĀĂȦÄẢÅǍȀȂĄẠḀẦẤàáâä", :nfd) |> String.replace(~r/\W/u, "")
"NayttkuvaeuAAAAAAAAAAAAAAAAAAaaaa"

This one will strip whitespace and diacritical marks (accents and such) but keep numbers.

hubertlepicki · September 18, 2017, 12:38pm

String.normalize("Łępicki", :nfd) |> String.replace(~r/\W/u, "")       
=> "Łepicki"

Why does this convert “ę” to “e” correctly, but not “Ł” to “L”?

massimo · September 18, 2017, 1:17pm

Why does this convert “ę” to “e” correctly, but not “Ł” to “L”?

because

"Ł" |> String.normalize(:nfd) |> String.codepoints   
["Ł"]

while

"ę" |> String.normalize(:nfd) |> String.codepoints
["e", "̨"]

EDIT:

ü decomposes like ę does, while ø and ß stay as they are, like Ł:

"ü ø ß" |> String.normalize(:nfd) |> String.codepoints     
["u", "̈", " ", "ø", " ", "ß"]

:iconv transliterates Unicode characters to their closest ASCII equivalents, whereas what we’re doing here is just removing the diacritical marks.
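If you want to stay with plain string functions, one possible workaround is a small lookup table for the letters that have no decomposition. This is only a sketch: the Translit module and its table are made up for illustration, and the table is deliberately incomplete.

```elixir
defmodule Translit do
  # Hypothetical, incomplete table of letters that NFD leaves untouched
  @fallbacks %{"Ł" => "L", "ł" => "l", "Ø" => "O", "ø" => "o", "ß" => "ss"}

  def to_ascii(text) do
    text
    |> String.normalize(:nfd)
    # Remove combining marks (Unicode category Mn) left behind by NFD
    |> String.replace(~r/\p{Mn}/u, "")
    |> String.graphemes()
    # Replace the remaining special letters using the table, keep the rest
    |> Enum.map_join(fn g -> Map.get(@fallbacks, g, g) end)
  end
end

IO.inspect(Translit.to_ascii("Hubert Łępicki"))
IO.inspect(Translit.to_ascii("ü ø ß"))
```

Anything missing from the table passes through unchanged, so a transliteration library is still the safer choice for arbitrary input.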

hubertlepicki · September 18, 2017, 1:22pm

So… I’d still use iconv myself, if you don’t mind the extension. It was created for the purpose of converting between encodings, and removing code points is just a hack.

massimo · September 18, 2017, 1:27pm

Yes, it’s correct.

The original poster was asking how to remove accented characters

:iconv is overkill for just that

hubertlepicki · September 18, 2017, 1:29pm

ah yes that’s correct. It’ll work for Spanish just fine :). Hope they don’t use Polish

jmurphyweb · February 8, 2020, 3:03pm

By the way String.normalize is now deprecated.

:unicode.characters_to_nfd_binary("NöytẗkuvaèüÀÁÂÃĀĂȦÄẢÅǍȀȂĄẠḀẦẤàáâä") 
|> String.replace(~r/\W/u, "")
# "NoyttkuvaeuAAAAAAAAAAAAAAAAAAaaaa"

yuchunc · June 10, 2022, 12:34pm

I don’t think it is.
https://hexdocs.pm/elixir/1.12/String.html#normalize/2
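The two spellings can be compared side by side. A small sketch, assuming Erlang/OTP 20 or later, where :unicode.characters_to_nfd_binary/1 is available:

```elixir
# "árboles más grandes", with á written as the precomposed U+00E1
text = "\u00E1rboles m\u00E1s grandes"

a = String.normalize(text, :nfd)
b = :unicode.characters_to_nfd_binary(text)

# Both calls should return the same decomposed binary
IO.inspect(a == b)
IO.inspect(String.codepoints(a) |> Enum.take(2))
```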