Is it a requirement that you ascii-ify the slugs? There is support in modern browsers for non-ascii URLs being percent encoded in the HTML source, but displayed as the unicode characters as described on stackoverflow. Certain languages have words that can be confused with other words if you strip accents. I’m aware of it in Polish, I don’t know if it affects any of your target languages. If you’re sure you want to strip the accents, then you can disregard this and we’ll move on to a technical solution to the loss of accented characters.
It is definitely not required, as I can do just fine stripping out the punctuation and replacing whitespace with dashes (my target languages are English and Spanish); however, I would still like to learn how to get this done, for future reference.
I think using iconv transliteration also replaces some of the national characters with recognized ASCII replacements; in German, for example, ß is replaced with “ss”, etc.
After you’ve transliterated the string to the closest-matching ASCII equivalents, you can downcase it and replace whitespace with dashes.
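A minimal pure-Elixir sketch of that pipeline, skipping iconv and using the `String.normalize/2` approach instead (the `slugify` name and the exact regexes are just for illustration):

```elixir
# Hypothetical slugify: decompose accented characters, strip everything
# that isn't an ASCII letter, digit, or whitespace (which drops the
# now-separate combining accents), downcase, and join on dashes.
slugify = fn string ->
  string
  |> String.normalize(:nfd)
  |> String.replace(~r/[^A-Za-z0-9\s]/u, "")
  |> String.downcase()
  |> String.replace(~r/\s+/, "-")
end

slugify.("Árboles más grandes")
# "arboles-mas-grandes"
```

Note this is the code-point-stripping hack, not a real transliteration: ß would simply disappear rather than become “ss”.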
Let’s (both @hubertlepicki and @DanielRS) back up and compare each part of the pipeline. I’m curious where the difference is. Here’s my output for each stage:
iex> "árboles más grandes" |> String.normalize(:nfd)
"a´rboles ma´s grandes"
iex> "árboles más grandes" |> String.normalize(:nfd) |> String.replace(~r/[^A-z\s]/u, "")
"arboles mas grandes"
iex> "árboles más grandes" |> String.normalize(:nfd) |> String.replace(~r/[^A-z\s]/u, "") |> String.replace(~r/\s/, "-")
"arboles-mas-grandes"
NOTE: I had to manually type the acute accents separately from the “a” for the first stage, because although iex prints them separately, when I copied and pasted into Chrome they were recombined, so the above is visually what I saw.
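For anyone curious, the recombination described in that NOTE is exactly what `:nfc` normalization does, so you can round-trip it yourself (a small check, not part of the slug pipeline):

```elixir
decomposed = String.normalize("á", :nfd)

String.length(decomposed)  # 1 grapheme
byte_size(decomposed)      # 3 bytes: "a" plus the combining acute (U+0301)

# :nfc recombines the pair into the single precomposed codepoint,
# which is effectively what Chrome did on paste.
String.normalize(decomposed, :nfc) == "á"
# true
```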
Also, let’s not use the abbreviated form of the range in the regex just in case that makes a difference:
iex> "árboles más grandes" |> String.normalize(:nfd) |> String.replace(~r/[^A-Za-z\s]/u, "")
"arboles mas grandes"
In my mind, `A-z` covers `A-Z`, `[`, `\`, `]`, `^`, `_`, the backtick, and `a-z`, and I would think we don’t want those symbols in the range really.
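To make the difference concrete, here’s a quick check (the sample string is made up) showing that `A-z` quietly keeps those in-between ASCII symbols while `A-Za-z` drops them:

```elixir
# "_" (95) and "^" (94) sit between "Z" (90) and "a" (97) in ASCII,
# so [^A-z\s] does not match them and they survive:
String.replace("foo_bar^baz", ~r/[^A-z\s]/u, "")
# "foo_bar^baz"

# The explicit two-part range removes them as intended:
String.replace("foo_bar^baz", ~r/[^A-Za-z\s]/u, "")
# "foobarbaz"
```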
@KronicDeth it’s String.normalize that I don’t think does what you think it does. Quite frankly, I don’t understand what it should do, but it does not seem to convert UTF-8 national characters to matching ASCII ones at all on my system:
iex(10)> String.normalize "Łępicki", :nfd
"Łępicki"
iex(11)> "árboles más grandes" |> String.normalize(:nfd)
"árboles más grandes"
(And the above is exactly what I see in my IEx terminal.) I’m on Linux, with LANG=en_US.UTF-8.
@hubertlepicki String.normalize separates each special character into multiple characters in such a way that their combination represents the original character. Simple example:
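(The example itself seems to have gone missing from the post; here is what it presumably showed: the precomposed character decomposing into a base letter plus a combining mark.)

```elixir
String.codepoints("á")
# ["á"]  — one precomposed codepoint (U+00E1)

"á" |> String.normalize(:nfd) |> String.codepoints()
# ["a", "́"]  — base letter plus a combining acute accent (U+0301)
```

This is why the strings look identical in the terminal while comparing unequal: the display stacks the combining accent back onto the “a”.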
I ran into this issue today, after upgrading from 1.2.3. This was a bug, and I submitted a PR to fix it on elixir-lang/elixir. Hopefully it will get merged soon!
If anyone stumbles upon this issue like I did: it might not be clear right away, but at the moment it is doable to slugify a string using only the string functions mostly discussed here. String.normalize(:nfd) splits the string into separate characters so that the accents can be removed while the ASCII parts remain, leaving us with a reasonable slug (not a grammatically correct transcription, but the ASCII parts of the special characters).
So… I’d still use iconv myself if you don’t mind the extension; it was created for the purpose of converting between encodings, and removing code points is just a hack.