Can't save HTML from a website to Postgres UTF-8 field. Any tips?

This is the content I’m trying to save to my Postgres text field.

<<60, 98, 111, 100, 121, 62, 60, 104, 114, 47, 62, 60, 104, 49, 62, 70, 108,
  117, 105, 100, 32, 67, 97, 109, 101, 114, 97, 50, 32, 84, 114, 97, 99, 107,
  101, 114, 194, 160, 84, 111, 103, 103, 108, 105, 110, 103, 60, 47, 104, 49,
  ...>>

And this is the error:

** (Postgrex.Error) ERROR 22021 (character_not_in_repertoire): invalid byte sequence for encoding “UTF8”: 0xe3 0xaf 0xe2

I just need a way to ask if my_html is valid UTF-8? I’ve been looking all night and haven’t made a lick of progress on something seemingly simple haha, one of those days.

There is String.valid? which will return true/false depending on if the string is valid utf8. I feel like postgres is probably accurate though so you might just get ‘false’, but it’s worth checking to see what elixir says first and then going from there.

String.valid? is returning true and it’s still giving me the same error in Postgres.

I would be happy even checking the variable type.

If it’s a normal string "foobar baz", return true.

If it’s a binary like this, return false.

<<60, 98, 111, 100, 121, 62, 60, 104, 114, 47, 62, 60, 104, 49, 62, 70, 108,
  117, 105, 100, 32, 67, 97, 109, 101, 114, 97, 50, 32, 84, 114, 97, 99, 107,
  101, 114, 194, 160, 84, 111, 103, 103, 108, 105, 110, 103, 60, 47, 104, 49,
  ...>>

Is there a function for this perhaps?

If by normal you mean printable then you could:
String.chunk(foo, :printable) and see what non-printable chunks you have that might be a source of error.

The variable type for both “foo” and for the binary will both be ‘binary’. There isn’t a formal ‘string’ type in elixir, ‘strings’ are just binaries that can be expressed as valid utf8. You can test yourself by doing:

binary_html = <<60, 98, 111, 100, 121, 62>>
string_html = "<body>"
binary_html === string_html # true, notice triple '===' for type equality
is_binary(binary_html)  # true
is_binary(string_html) # true

What I’m wondering is if there is something in the string contents itself, like the literal string “0xe3afe2”, that postgres interprets to mean ‘the byte value of 0xe3afe2’ or the bytes <<227, 175, 226>> rather than the actual string. That is just a hunch, I’m not sure if postgres does anything like that.

@kip: String.chunk divided up the html string into three parts.

["normal html", <<227, 175, 226, 189, 226, 191>>, "normal html"]

I’m not entirely sure how I can use this though to achieve my end goal. I just want to remove all these strange bytes from my string.

Try:

string
|> String.chunk(:printable)
|> Enum.filter(&String.printable?/1)
|> Enum.join
6 Likes

It’s interesting that String.valid? gave you true for the whole string, but it returns false for the non-printable sequence. This should give you a string with the invalid chunks filtered out:

the_html |> String.chunk(:printable) |> Enum.filter(&String.valid?) |> Enum.join()
1 Like

According to the unicode table these appear in the valid utf-8 range, but there is no character assignment (that I could find). Hence valid as encoding, but not printable since there is no character assignment.

May be worth filing a issue on Elixir?

1 Like

I’ll do that good idea. I’ll link to your post explaining the issue.

Can you PM me the entire string? Now I’m curious to look into this. The code for String.valid? is really simple:

  def valid?(<<_::utf8, t::binary>>), do: valid?(t)
  def valid?(<<>>), do: true
  def valid?(_), do: false

Relying about binary decomposition so if its in error, its in error way on down the stack :slight_smile:

@kip: IO.inspect outputs this:

<<60, 98, 111, 100, 121, 62, 60, 104, 114, 47, 62, 60, 104, 49, 62, 70, 108,
  117, 105, 100, 32, 67, 97, 109, 101, 114, 97, 50, 32, 84, 114, 97, 99, 107,
  101, 114, 194, 160, 84, 111, 103, 103, 108, 105, 110, 103, 60, 47, 104, 49,
  ...>>

I’m guessing it’s truncating it? How do I make it show everything so I can copy and paste it for you?

File.write!(the_string, "a_file_name.txt") should do it, then put the file on GitHub in a snippet?

@kip: Done:

I pasted the contents of the file verbatim into this gist.

On raw file, I saw 3 weird chars… maybe it’s something with encoding of this specific part? So Elixir is ok with the string, but complaining about the encoding in UTF-8?

Hmm, I’m not able to reproduce the chunks. Doing:

"gistfile1.txt" |> File.read! |> String.chunk(:printable) |> length

gives me 1 instead of the 3 that you had.

I do see those three weird characters too though. I looked at them as hex and they are as follows:

efbf bdef bfbd efbf bd

which is the following repeated three times:

ef bf bd

I was kind of expecting them to be the bytes that postgres was complaining about (0xe3 0xaf 0xe2), but that doesn’t seem to be the case. Although, ef bf bd appears to be the unicode replacement character, so maybe those aren’t the original bytes but just the representation they got converted to along the way. I tried downloading the raw gist from github using curl to avoid copying the browsers rendition of them or anything like that, but I see the same result. I’m wondering if copy pasting somewhere has converted the original bytes you were dealing with into the replacement character that we’re seeing now?

Just to clarify: when you wrote the string to a file did you write the original string to the file, or did you write the string result from the chunk/filter/join pipe?

This is the replacement character, so it is copy and pasted.

I see, so I think the easiest thing to do is just:

the_string |> Base.encode64 |> File.write!("a_file_name.txt")

and then copy paste that to a gist.