Force encoding to UTF8

I have a Phoenix / Ecto app where I take validation results from the W3C validator and I insert them on a Postgres database.

I’ve run into a problem inserting them when the results include non-UTF8 characters:

** (Postgrex.Error) ERROR (character_not_in_repertoire): invalid byte sequence for encoding "UTF8": 0x00

For example, when the results come from:

The trouble lies in the :extract field; that’s where some of the invalid characters appear.

How would you approach this? What’s the best way to force the encoding to UTF8, discarding invalid characters? Should this be done in Elixir or on Postgres?


I believe the answer you are looking for is “use iconv”.

You can either use the command-line iconv tool and pipe the text through it, or use the hex package. In both cases, if you want to discard the unmatched characters, you should use the //IGNORE option.
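If the external iconv dependency is a concern, a similar effect can be sketched in plain Elixir/OTP with `:unicode.characters_to_binary/2`, which stops at the first invalid byte. This is a minimal sketch, not the iconv package's API; the module name is made up:

```elixir
defmodule UtfScrub do
  @moduledoc "Sketch: drop invalid UTF-8 bytes without iconv (hypothetical module)."

  def scrub(bin) when is_binary(bin), do: scrub(bin, [])

  defp scrub(bin, acc) do
    case :unicode.characters_to_binary(bin, :utf8) do
      good when is_binary(good) ->
        # The whole remainder was valid UTF-8.
        IO.iodata_to_binary(Enum.reverse([good | acc]))

      {:error, good, <<_bad, rest::binary>>} ->
        # Keep the valid prefix, skip the offending byte, retry on the rest.
        scrub(rest, [good | acc])

      {:incomplete, good, _rest} ->
        # Truncated multi-byte sequence at the end of input: drop it.
        IO.iodata_to_binary(Enum.reverse([good | acc]))
    end
  end
end
```

This is roughly what iconv's //IGNORE does, but with no system dependency at all.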

Here are 2 topics from this forum you may find useful:


That’s it! Thanks @hubertlepicki


If you are only struggling with a character or two of ill-formatted bytes, and you know that the text you have is otherwise UTF-8, you could reject any non-UTF-8 characters. Something like:

defmodule Strip do
  @doc """
      iex> Strip.strip_utf "Tallak\xc3\xb1 Tveide"
      "Tallakñ Tveide"
  """
  def strip_utf(str) do
    strip_utf_helper(str, [])
  end

  # A valid UTF-8 codepoint: keep it.
  defp strip_utf_helper(<<x::utf8>> <> rest, acc) do
    strip_utf_helper(rest, [x | acc])
  end

  # Any other byte: drop it.
  defp strip_utf_helper(<<_x>> <> rest, acc), do: strip_utf_helper(rest, acc)

  defp strip_utf_helper("", acc) do
    acc
    |> :lists.reverse()
    |> List.to_string()
  end
end
For an iconv replacement in pure Elixir, check out my codepagex library on


This is very cool, thank you! Did you write it to get rid of the iconv system dependency? I’ve actually found a few people who have had trouble installing it or compiling the Erlang bindings.


Actually, from my Ruby experience I always had trouble with native gems, and I was also on a Windows machine dealing with non-UTF-8 text all the time. So I created codepagex. It was also an exercise in macro programming. I’m quite happy with it, but there’s always room for improvement. I started making a performance comparison with iconv, but unfortunately codepagex is rather slow in comparison right now. It has roughly equal performance for short strings and high parallelism…


But I can guarantee it installs in a few seconds with no issues. Also, the options for handling encoding errors are super flexible, and the interface is very Elixir-ish (I hope, at least), not just a mapping over a C interface.


Thank you @tallakt - avoiding the iconv dependency would be great!

I’ve tried the example you posted, but I can’t make it work. How do you call it? The function has 2 parameters, but all I have is a string with unknown encoding (as it comes from an extract of a random page) that has invalid UTF-8 characters.

I also tried the Codepagex library, but I can’t use to_string/2 because I don’t know what encoding the string might be in.


Sorry. That initial code was written on my iPad and would never work. I have since updated it.


Thanks a lot, now it compiles and makes sense :slight_smile:

I’m still unable to insert this into Postgres after passing the string through the UTF-8 stripper. I’m investigating to find the exact example:

** (Postgrex.Error) ERROR (character_not_in_repertoire): invalid byte sequence for encoding "UTF8": 0x00


The trouble character is on line 131 of - it looks like the UTF8 stripper isn’t removing it?

[debug] CLEAN UP UTF, before: -actions"><input type="submit" id="edit-submit" name="op" value="� Search" class="form-submit" /></div>
[debug] CLEAN UP UTF,  after: -actions"><input type="submit" id="edit-submit" name="op" value="� Search" class="form-submit" /></div>

I don’t think that’s the problem. To me that seems like a valid UTF-8 character that won’t render because the font doesn’t support it.

If you look at the error Postgrex returns, you can see that it complains about null bytes (0x00), which Postgres refuses in text values even though a null byte is technically valid UTF-8.
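A quick iex check confirms why a UTF-8 stripper passes the null byte through while rejecting genuinely invalid bytes:

```elixir
# U+0000 encodes to the single byte <<0>>, which is valid UTF-8,
# so a stripper that only rejects invalid bytes will keep it.
String.valid?(<<0>>)    # => true
# 0xFF can never occur in a UTF-8 sequence, so it gets stripped.
String.valid?(<<0xFF>>) # => false
```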

After looking at the W3C’s validator’s output, there are indeed some entries like:

"message":"Saw U+0000 in stream.","extract":"icle></div>\n\u0000\u0000\u0000\u0000\u0000</div></div><"

Although the validator has escaped these null bytes to \u0000, somewhere in your codepath these characters get unescaped again.

To illustrate what I mean:

iex> "\u0000"
<<0>>

iex> "\\u0000"
"\\u0000"

Hope that helps finding the actual issue.



Thanks @zambal, this rings a bell - Nokogiri also has trouble with the null char:


Yep, String.replace(str, "\u0000", "") fixed the issue.
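For later readers, here is a minimal sketch combining the two fixes from this thread (the module name is hypothetical): first drop bytes that aren’t valid UTF-8, then drop null bytes, which are valid UTF-8 but rejected by Postgres text columns.

```elixir
defmodule DbText do
  # Hypothetical helper combining the two fixes from this thread.
  def clean(str) do
    str
    |> strip_invalid()
    |> String.replace("\u0000", "")  # <<0>> is valid UTF-8, but Postgres rejects it
  end

  # Keep only bytes that form valid UTF-8 codepoints.
  defp strip_invalid(<<x::utf8, rest::binary>>), do: <<x::utf8>> <> strip_invalid(rest)
  defp strip_invalid(<<_bad, rest::binary>>), do: strip_invalid(rest)
  defp strip_invalid(<<>>), do: <<>>
end
```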

Thanks everyone!


Here I am after many years stumbling on the same issue, and this simple fix seems to solve it for me :sweat_smile:


The null character and the other ASCII control characters should always be escaped. Spaces and punctuation marks should be HTML-encoded to prevent XSS and code-injection attacks from users. Someone should come up with embedded filters that do this automatically and can’t be fooled by URL-encoding tricks.
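To illustrate the escaping idea, here is a toy HTML escaper covering the five classic entities. In a real Plug/Phoenix app you should rely on Plug.HTML.html_escape/1 or Phoenix.HTML instead of rolling your own:

```elixir
defmodule Esc do
  # Toy HTML escaper for illustration only; use Plug.HTML.html_escape/1 in practice.
  @entities %{"&" => "&amp;", "<" => "&lt;", ">" => "&gt;", "\"" => "&quot;", "'" => "&#39;"}

  def html(str) do
    # Each match is replaced independently, so "&" is not double-escaped.
    String.replace(str, ~r/[&<>"']/, fn match -> Map.fetch!(@entities, match) end)
  end
end
```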