List.to_string sometimes returning <<60, 33, 68...>> type results

sergio · December 30, 2018, 5:26pm

12:20:46.435 [info]  Web request: https://1337x.to/torrent/3490955/-Irozuku-Sekai-no-Ashita-kara/
<<60, 33, 68, 79, 67, 84, 89, 80, 69, 32, 104, 116, 109, 108, 62, 10, 60, 104,
  116, 109, 108, 62, 10, 60, 104, 101, 97, 100, 62, 10, 60, 109, 101, 116, 97,
  32, 99, 104, 97, 114, 115, 101, 116, 61, 34, 117, 116, 102, 45, 56, ...>>

12:20:46.609 [info]  Web request: https://1337x.to/torrent/3490541/DB-Seishun-Buta-Yarou-wa-Bunny-Girl-Senpai-no-Yume-wo-Minai-Rascal-Does-Not-Dream-of-Bunny-Girl-Senpai-10bit-1080p-HEVC-x265/
"<!DOCTYPE html>\n<html>\n<head>\n<meta charset=\"utf-8\">\n<meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">\n<title>Download  [DB] Seishun Buta Yarou wa Bunny Girl Senpai no Yume wo Minai | Rascal Does Not Dream of Bunny Girl S...

I’m using httpc to make a GET request to this URL and converting the result to a string.

{:ok, {{_http, 200, 'OK'}, _headers, body}} = :httpc.request(:get, {url, headers}, [], [])

{:ok, List.to_string(body)}

The raw body that httpc returns is a charlist. Before I call List.to_string on it, it looks like this:

[60, 33, 68, 79, 67, 84, 89, 80, 69, 32, 104, 116, 109, 108, 62, 10, 60, 104,
 116, 109, 108, 62, 10, 60, 104, 101, 97, 100, 62, 10, 60, 109, 101, 116, 97,
 32, 99, 104, 97, 114, 115, 101, 116, 61, 34, 117, 116, 102, 45, 56, ...]

The bodies of other URLs convert normally to the expected HTML payload. Only this one converts to this type of result.

Using a diff tool, the only meaningful difference I see between the broken and the working response is that the broken one contains Japanese characters.

What is this kind of result?
Can I detect if this type of result is returned and skip it?

NobbZ · December 30, 2018, 5:34pm

The reply probably contains unprintable characters or byte sequences that are not valid UTF-8.

As the server does not set a content-type with an encoding parameter or otherwise specify the encoding, the result has to be treated as “ISO-8859-1”. You need to re-encode the result.

sergio · December 30, 2018, 5:38pm

This works:

body |> IO.iodata_to_binary() |> IO.inspect()

But I have no idea why. Is this an expensive process? I may have to do this for every charlist returned by httpc.

NobbZ · December 30, 2018, 5:58pm

Now I get it.

The returned list contains the bytes [232, 137, 178]. List.to_string treats them as three separate codepoints and converts them to <<195, 168, 194, 137, 194, 178>>, that is ["è", <<194, 137>>, "²"]. The middle one is in fact a non-printable control character, a special version of tab.

List.to_string is not working as expected.

IO.iodata_to_binary/1, though, interprets the list as a list of bytes instead of as a list of codepoints.
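
To make the difference concrete, here is a minimal sketch using only the three bytes mentioned above. The variable names are just for illustration. It also uses String.printable?/1, which is one possible way to detect the kind of result the OP asked about:

```elixir
bytes = [232, 137, 178]

# List.to_string/1 treats every integer as a Unicode code point and
# UTF-8 encodes each one separately, so three integers become six bytes.
as_codepoints = List.to_string(bytes)

# IO.iodata_to_binary/1 copies every integer into the binary as one raw byte,
# so the three bytes stay together as a single UTF-8 encoded character.
as_bytes = IO.iodata_to_binary(bytes)

IO.inspect(as_codepoints)
IO.inspect(as_bytes)

# inspect/1 falls back to the <<...>> notation when a binary is not printable,
# which is what happens with the control character U+0089 in as_codepoints.
IO.inspect(String.printable?(as_codepoints))
IO.inspect(String.printable?(as_bytes))
```

The first binary is still valid UTF-8; it just contains a control character, which is why it is shown as <<...>> rather than as a quoted string.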

The best thing to do though, obviously, is to pass {:body_format, :binary} as an option to :httpc.request/x.
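
A rough sketch of what that could look like. The module name Fetcher and the function fetch are hypothetical and only for illustration. The sketch assumes the url is given as an Elixir string, that headers are {charlist, charlist} tuples as httpc expects, and it sets no TLS options, which you may want to add for https URLs:

```elixir
defmodule Fetcher do
  # Hypothetical helper, not part of httpc or of the original code.
  def fetch(url, headers \\ []) when is_binary(url) do
    # httpc lives in :inets; :ssl is needed for https URLs.
    {:ok, _} = Application.ensure_all_started(:inets)
    {:ok, _} = Application.ensure_all_started(:ssl)

    # The fourth argument holds the client options. body_format: :binary
    # makes httpc return the body as a binary instead of a charlist,
    # so no List.to_string/1 call is needed afterwards.
    case :httpc.request(:get, {String.to_charlist(url), headers}, [], body_format: :binary) do
      {:ok, {{_http_version, 200, _reason}, _resp_headers, body}} ->
        {:ok, body}

      {:ok, {{_http_version, status, _reason}, _resp_headers, _body}} ->
        {:error, {:http_status, status}}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

With this, body is the bytes exactly as the server sent them, so UTF-8 content such as Japanese text arrives intact.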

kip · December 30, 2018, 9:26pm

The docs for List.to_string/1 do say that it expects a list of code points, not bytes, and that if you have the latter, :binary is the module of choice. Viz.:

iex> :binary.list_to_bin [232, 137, 178]
"色"

NobbZ · December 30, 2018, 9:28pm

Perhaps I should have written, “not as expected by the OP”.

No, the way of choice should be to pass the correct options to :httpc.request/x.

kip · December 30, 2018, 9:32pm

Agreed. I should have been clearer that the List.to_string/1 docs make the recommendation of :binary in the case that List.to_string/1 isn’t the right thing.