The trouble lies in the :extract step; that’s where some of the invalid characters appear.
How would you approach this? What’s the best way to force the encoding to UTF-8, discarding invalid characters? Should this be done in Elixir or in Postgres?
I believe the answer you are looking for is “use iconv”.
You can either use the command-line tool iconv and pipe the text through it, or use the hex package. In both cases, if you want to discard the unmatched characters, you should use the //IGNORE option.
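For instance, assuming GNU iconv is on the PATH (the //IGNORE suffix and the -c flag are GNU extensions, so behaviour can vary across libiconv builds; the file names are placeholders):

```shell
# Re-encode a file to UTF-8, discarding characters the target set cannot represent
iconv -f UTF-8 -t UTF-8//IGNORE broken.txt > clean.txt

# GNU iconv also has -c, which silently drops invalid input sequences
printf 'abc\377def' | iconv -c -f UTF-8 -t UTF-8
# -> abcdef (the stray 0xFF byte is dropped)
```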
If you are only struggling with one or two ill-formatted bytes, and you know that the text you have is otherwise UTF-8, you could reject any non-UTF-8 bytes. Something like:
```elixir
defmodule Strip do
  @doc """
  iex> Strip.strip_utf "Tallak\xc3\xb1 Tveide"
  "Tallakñ Tveide"
  """
  def strip_utf(str) do
    strip_utf_helper(str, [])
  end

  # A valid UTF-8 codepoint: keep it
  defp strip_utf_helper(<<x::utf8>> <> rest, acc) do
    strip_utf_helper(rest, [x | acc])
  end

  # Any other byte: drop it
  defp strip_utf_helper(<<_x>> <> rest, acc), do: strip_utf_helper(rest, acc)

  defp strip_utf_helper("", acc) do
    acc
    |> :lists.reverse()
    |> List.to_string()
  end
end
```
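For example, given a string with one stray non-UTF-8 byte:

```elixir
Strip.strip_utf("abc" <> <<0xFF>> <> "def")
# => "abcdef" – the invalid 0xFF byte is dropped
```

Note that strip_utf/1 takes just the string; the two-argument strip_utf_helper/2 is a private accumulator you never call directly.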
For an iconv replacement in pure Elixir, check out my codepagex library on hex.pm.
This is very cool, thank you! Did you write it to get rid of the iconv system dependency? I actually found a few people have had trouble installing it or compiling the Erlang bindings.
Actually, from my Ruby experience I always had trouble with native gems, and also with dealing with non-UTF-8 text all the time on a Windows machine. So I created codepagex. It was also an exercise in macro programming. I’m quite happy with it, but there’s always room for improvement. I started making a performance comparison with iconv, but unfortunately it is rather slow in comparison right now. It has roughly equal performance for short strings and high parallelism…
But I can guarantee it installs in a few seconds with no issues. Also, the options for handling encoding errors are super flexible, and the interface is very Elixirish (I hope, at least), not just a mapping over a C interface.
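As a quick sketch of the basic usage, decoding a known single-byte encoding into a UTF-8 Elixir string looks roughly like this (double-check the exact return shapes and encoding aliases against the codepagex docs on hexdocs):

```elixir
# 0xE6, 0xF8, 0xE5 are "æøå" in ISO 8859-1 (latin-1)
{:ok, string} = Codepagex.to_string(<<0xE6, 0xF8, 0xE5>>, :iso_8859_1)
# string == "æøå"

# Bang variant that raises on failure instead of returning a tuple
Codepagex.to_string!(<<0xE6>>, :iso_8859_1)
```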
Thank you @tallakt - avoiding the iconv dependency would be great!
I’ve tried the example you posted, but I can’t make it work. How do you call it? The function has 2 parameters, but all I have is a string with unknown encoding (as it comes from an :extract of a random page) that contains invalid UTF-8 characters.
I also tried the Codepagex library, but I can’t use to_string/2 because I don’t know which encoding the string might be in.
The null character and the other ASCII control characters should always be escaped. Space and punctuation marks should be HTML-encoded to prevent XSS and code-injection attacks from users. Someone should come up with embedded filters that do this automatically and can’t be fooled by URL-encoding tricks.
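A minimal sketch of that kind of filter in Elixir (hand-rolled for illustration only; in a real app you would reach for something like Phoenix.HTML.html_escape/1 rather than rolling your own):

```elixir
defmodule Sanitize do
  # Strip ASCII control characters (including NUL), then HTML-escape
  # the characters that enable markup injection. Note "&" must be
  # escaped first, or the later entities would be double-escaped.
  def clean(input) do
    input
    |> String.replace(~r/[\x00-\x1f\x7f]/, "")
    |> String.replace("&", "&amp;")
    |> String.replace("<", "&lt;")
    |> String.replace(">", "&gt;")
    |> String.replace("\"", "&quot;")
    |> String.replace("'", "&#39;")
  end
end
```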