Can't save HTML from a website to Postgres UTF-8 field. Any tips?

sergio · September 19, 2017, 2:46am

This is the content I’m trying to save to my Postgres text field.

<<60, 98, 111, 100, 121, 62, 60, 104, 114, 47, 62, 60, 104, 49, 62, 70, 108,
  117, 105, 100, 32, 67, 97, 109, 101, 114, 97, 50, 32, 84, 114, 97, 99, 107,
  101, 114, 194, 160, 84, 111, 103, 103, 108, 105, 110, 103, 60, 47, 104, 49,
  ...>>

And this is the error:

** (Postgrex.Error) ERROR 22021 (character_not_in_repertoire): invalid byte sequence for encoding “UTF8”: 0xe3 0xaf 0xe2

I just need a way to ask if my_html is valid UTF-8? I’ve been looking all night and haven’t made a lick of progress on something seemingly simple haha, one of those days.

kylethebaker · September 19, 2017, 3:01am

There is String.valid? which will return true/false depending on if the string is valid utf8. I feel like postgres is probably accurate though so you might just get ‘false’, but it’s worth checking to see what elixir says first and then going from there.

sergio · September 19, 2017, 3:03am

String.valid? is returning true and it’s still giving me the same error in Postgres.

sergio · September 19, 2017, 3:08am

I would be happy even checking the variable type.

If it’s a normal string "foobar baz", return true.

If it’s a binary like this, return false.

<<60, 98, 111, 100, 121, 62, 60, 104, 114, 47, 62, 60, 104, 49, 62, 70, 108,
  117, 105, 100, 32, 67, 97, 109, 101, 114, 97, 50, 32, 84, 114, 97, 99, 107,
  101, 114, 194, 160, 84, 111, 103, 103, 108, 105, 110, 103, 60, 47, 104, 49,
  ...>>

Is there a function for this perhaps?

kip · September 19, 2017, 3:34am

If by normal you mean printable then you could:
String.chunk(foo, :printable) and see what non-printable chunks you have that might be a source of error.

kylethebaker · September 19, 2017, 3:38am

The variable type for both “foo” and for the binary will both be ‘binary’. There isn’t a formal ‘string’ type in elixir, ‘strings’ are just binaries that can be expressed as valid utf8. You can test yourself by doing:

binary_html = <<60, 98, 111, 100, 121, 62>>
string_html = "<body>"
binary_html === string_html # true, notice triple '===' for type equality
is_binary(binary_html)  # true
is_binary(string_html) # true

What I’m wondering is if there is something in the string contents itself, like the literal string “0xe3afe2”, that postgres interprets to mean ‘the byte value of 0xe3afe2’ or the bytes <<227, 175, 226>> rather than the actual string. That is just a hunch, I’m not sure if postgres does anything like that.

sergio · September 19, 2017, 3:45am

@kip: String.chunk divided up the html string into three parts.

["normal html", <<227, 175, 226, 189, 226, 191>>, "normal html"]

I’m not entirely sure how I can use this though to achieve my end goal. I just want to remove all these strange bytes from my string.

kip · September 19, 2017, 3:51am

Try:

string
|> String.chunk(:printable)
|> Enum.filter(&String.printable?/1)
|> Enum.join

kylethebaker · September 19, 2017, 3:51am

It’s interesting that String.valid? gave you true for the whole string, but it returns false for the non-printable sequence. This should give you a string with the invalid chunks filtered out:

the_html |> String.chunk(:printable) |> Enum.filter(&String.valid?) |> Enum.join()

kip · September 19, 2017, 3:53am

According to the unicode table these appear in the valid utf-8 range, but there is no character assignment (that I could find). Hence valid as encoding, but not printable since there is no character assignment.

May be worth filing a issue on Elixir?

sergio · September 19, 2017, 3:54am

I’ll do that good idea. I’ll link to your post explaining the issue.

kip · September 19, 2017, 4:00am

Can you PM me the entire string? Now I’m curious to look into this. The code for String.valid? is really simple:

  def valid?(<<_::utf8, t::binary>>), do: valid?(t)
  def valid?(<<>>), do: true
  def valid?(_), do: false

Relying about binary decomposition so if its in error, its in error way on down the stack

sergio · September 19, 2017, 4:01am

@kip: IO.inspect outputs this:

<<60, 98, 111, 100, 121, 62, 60, 104, 114, 47, 62, 60, 104, 49, 62, 70, 108,
  117, 105, 100, 32, 67, 97, 109, 101, 114, 97, 50, 32, 84, 114, 97, 99, 107,
  101, 114, 194, 160, 84, 111, 103, 103, 108, 105, 110, 103, 60, 47, 104, 49,
  ...>>

I’m guessing it’s truncating it? How do I make it show everything so I can copy and paste it for you?

kip · September 19, 2017, 4:03am

File.write!(the_string, "a_file_name.txt") should do it, then put the file on GitHub in a snippet?

sergio · September 19, 2017, 4:17am

@kip: Done:

I pasted the contents of the file verbatim into this gist.

hildersantos · September 19, 2017, 4:36am

On raw file, I saw 3 weird chars… maybe it’s something with encoding of this specific part? So Elixir is ok with the string, but complaining about the encoding in UTF-8?

kylethebaker · September 19, 2017, 4:49am

Hmm, I’m not able to reproduce the chunks. Doing:

"gistfile1.txt" |> File.read! |> String.chunk(:printable) |> length

gives me 1 instead of the 3 that you had.

I do see those three weird characters too though. I looked at them as hex and they are as follows:

efbf bdef bfbd efbf bd

which is the following repeated three times:

ef bf bd

I was kind of expecting them to be the bytes that postgres was complaining about (0xe3 0xaf 0xe2), but that doesn’t seem to be the case. Although, ef bf bd appears to be the unicode replacement character, so maybe those aren’t the original bytes but just the representation they got converted to along the way. I tried downloading the raw gist from github using curl to avoid copying the browsers rendition of them or anything like that, but I see the same result. I’m wondering if copy pasting somewhere has converted the original bytes you were dealing with into the replacement character that we’re seeing now?

kylethebaker · September 19, 2017, 5:04am

Just to clarify: when you wrote the string to a file did you write the original string to the file, or did you write the string result from the chunk/filter/join pipe?

NobbZ · September 19, 2017, 5:20am

This is the replacement character, so it is copy and pasted.

kylethebaker · September 19, 2017, 5:27am

I see, so I think the easiest thing to do is just:

the_string |> Base.encode64 |> File.write!("a_file_name.txt")

and then copy paste that to a gist.