Converting a list of bytes from UTF-8 or ISO-8859-1 to Elixir string?

I have a list of bytes that might be in UTF-8, might be in ISO-8859-1, or might be in some other encoding. In some cases I have a hint as to the encoding, but it’s not guaranteed. (I am not in control of the format I’m parsing … don’t judge, please!)

How would I design a function to decode this myriad of possibilities? Are there built-in libraries or third-party libraries that I should be examining?

I’d want to try decoding as UTF-8 first; if that fails, then follow the hint that I may or may not have (and if no hint, then fall back to ISO-8859-1).
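Roughly the shape I have in mind (just a sketch — `ByteDecoder`, `decode_bytes`, and the optional `hint` argument are my own names, and the hint is assumed to arrive as an encoding atom such as `:latin1`):

```elixir
defmodule ByteDecoder do
  # Hypothetical sketch of the fallback order: UTF-8 first, then the
  # hint (if any), then ISO-8859-1 (:latin1) as the last resort.
  def decode_bytes(bytes, hint \\ nil) when is_list(bytes) do
    raw = :erlang.list_to_binary(bytes)

    case :unicode.characters_to_binary(raw) do
      # Valid UTF-8 comes back unchanged as a binary.
      utf8 when is_binary(utf8) -> utf8
      # :error / :incomplete tuples mean it wasn't UTF-8; re-decode
      # using the hint, or :latin1 when there is no hint.
      _ -> :unicode.characters_to_binary(raw, hint || :latin1)
    end
  end
end
```

So `ByteDecoder.decode_bytes([66, 106, 246, 114, 110])` would fail the UTF-8 pass and fall back to latin1.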

I think Latin-1 is a subset of UTF-8, so there’s no way to differentiate UTF-8 from a bunch of Latin-1 characters.
Here is a nice UTF-8 validator I found from Nine Nines and converted to Elixir: https://ninenines.eu/articles/erlang-validate-utf8/

  # This function returns 0 on success, 1 on error, and 2..8 on incomplete data.
  # Note: stacked `when` clauses mean OR in Elixir, so the range checks
  # must use `and` (the original Erlang guards used `,`, which is AND).
  def validate_utf8(<<>>, state), do: state
  def validate_utf8(<< c, rest :: bits >>, 0) when c < 128, do: validate_utf8(rest, 0)
  def validate_utf8(<< c, rest :: bits >>, 2) when c >= 128 and c < 144, do: validate_utf8(rest, 0)
  def validate_utf8(<< c, rest :: bits >>, 3) when c >= 128 and c < 144, do: validate_utf8(rest, 2)
  def validate_utf8(<< c, rest :: bits >>, 5) when c >= 128 and c < 144, do: validate_utf8(rest, 2)
  def validate_utf8(<< c, rest :: bits >>, 7) when c >= 128 and c < 144, do: validate_utf8(rest, 3)
  def validate_utf8(<< c, rest :: bits >>, 8) when c >= 128 and c < 144, do: validate_utf8(rest, 3)
  def validate_utf8(<< c, rest :: bits >>, 2) when c >= 144 and c < 160, do: validate_utf8(rest, 0)
  def validate_utf8(<< c, rest :: bits >>, 3) when c >= 144 and c < 160, do: validate_utf8(rest, 2)
  def validate_utf8(<< c, rest :: bits >>, 5) when c >= 144 and c < 160, do: validate_utf8(rest, 2)
  def validate_utf8(<< c, rest :: bits >>, 6) when c >= 144 and c < 160, do: validate_utf8(rest, 3)
  def validate_utf8(<< c, rest :: bits >>, 7) when c >= 144 and c < 160, do: validate_utf8(rest, 3)
  def validate_utf8(<< c, rest :: bits >>, 2) when c >= 160 and c < 192, do: validate_utf8(rest, 0)
  def validate_utf8(<< c, rest :: bits >>, 3) when c >= 160 and c < 192, do: validate_utf8(rest, 2)
  def validate_utf8(<< c, rest :: bits >>, 4) when c >= 160 and c < 192, do: validate_utf8(rest, 2)
  def validate_utf8(<< c, rest :: bits >>, 6) when c >= 160 and c < 192, do: validate_utf8(rest, 3)
  def validate_utf8(<< c, rest :: bits >>, 7) when c >= 160 and c < 192, do: validate_utf8(rest, 3)
  def validate_utf8(<< c, rest :: bits >>, 0) when c >= 194 and c < 224, do: validate_utf8(rest, 2)
  def validate_utf8(<< 224, rest :: bits >>, 0), do: validate_utf8(rest, 4)
  def validate_utf8(<< c, rest :: bits >>, 0) when c >= 225 and c < 237, do: validate_utf8(rest, 3)
  def validate_utf8(<< 237, rest :: bits >>, 0), do: validate_utf8(rest, 5)
  def validate_utf8(<< c, rest :: bits >>, 0) when c === 238 or c === 239, do: validate_utf8(rest, 3)
  def validate_utf8(<< 240, rest :: bits >>, 0), do: validate_utf8(rest, 6)
  def validate_utf8(<< c, rest :: bits >>, 0) when c >= 241 and c <= 243, do: validate_utf8(rest, 7)
  def validate_utf8(<< 244, rest :: bits >>, 0), do: validate_utf8(rest, 8)
  def validate_utf8(_, _), do: 1

So I think you can use that to determine whether your string is valid UTF-8 (a return value of 0), invalid UTF-8 (1), or incomplete, i.e. truncated mid-sequence (2..8).
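Wrapped in a module for a quick check (the `Utf8` module name is mine; note the range guards use `and`, since stacked `when` clauses mean OR in Elixir). The initial state is 0:

```elixir
defmodule Utf8 do
  # Same state machine as in the Nine Nines article; returns 0 on
  # success, 1 on error, and 2..8 on incomplete data.
  def validate_utf8(<<>>, state), do: state
  def validate_utf8(<<c, rest::bits>>, 0) when c < 128, do: validate_utf8(rest, 0)
  def validate_utf8(<<c, rest::bits>>, 2) when c >= 128 and c < 144, do: validate_utf8(rest, 0)
  def validate_utf8(<<c, rest::bits>>, 3) when c >= 128 and c < 144, do: validate_utf8(rest, 2)
  def validate_utf8(<<c, rest::bits>>, 5) when c >= 128 and c < 144, do: validate_utf8(rest, 2)
  def validate_utf8(<<c, rest::bits>>, 7) when c >= 128 and c < 144, do: validate_utf8(rest, 3)
  def validate_utf8(<<c, rest::bits>>, 8) when c >= 128 and c < 144, do: validate_utf8(rest, 3)
  def validate_utf8(<<c, rest::bits>>, 2) when c >= 144 and c < 160, do: validate_utf8(rest, 0)
  def validate_utf8(<<c, rest::bits>>, 3) when c >= 144 and c < 160, do: validate_utf8(rest, 2)
  def validate_utf8(<<c, rest::bits>>, 5) when c >= 144 and c < 160, do: validate_utf8(rest, 2)
  def validate_utf8(<<c, rest::bits>>, 6) when c >= 144 and c < 160, do: validate_utf8(rest, 3)
  def validate_utf8(<<c, rest::bits>>, 7) when c >= 144 and c < 160, do: validate_utf8(rest, 3)
  def validate_utf8(<<c, rest::bits>>, 2) when c >= 160 and c < 192, do: validate_utf8(rest, 0)
  def validate_utf8(<<c, rest::bits>>, 3) when c >= 160 and c < 192, do: validate_utf8(rest, 2)
  def validate_utf8(<<c, rest::bits>>, 4) when c >= 160 and c < 192, do: validate_utf8(rest, 2)
  def validate_utf8(<<c, rest::bits>>, 6) when c >= 160 and c < 192, do: validate_utf8(rest, 3)
  def validate_utf8(<<c, rest::bits>>, 7) when c >= 160 and c < 192, do: validate_utf8(rest, 3)
  def validate_utf8(<<c, rest::bits>>, 0) when c >= 194 and c < 224, do: validate_utf8(rest, 2)
  def validate_utf8(<<224, rest::bits>>, 0), do: validate_utf8(rest, 4)
  def validate_utf8(<<c, rest::bits>>, 0) when c >= 225 and c < 237, do: validate_utf8(rest, 3)
  def validate_utf8(<<237, rest::bits>>, 0), do: validate_utf8(rest, 5)
  def validate_utf8(<<c, rest::bits>>, 0) when c in [238, 239], do: validate_utf8(rest, 3)
  def validate_utf8(<<240, rest::bits>>, 0), do: validate_utf8(rest, 6)
  def validate_utf8(<<c, rest::bits>>, 0) when c in 241..243, do: validate_utf8(rest, 7)
  def validate_utf8(<<244, rest::bits>>, 0), do: validate_utf8(rest, 8)
  def validate_utf8(_, _), do: 1
end

Utf8.validate_utf8("Björn", 0)                    # 0 — valid UTF-8
Utf8.validate_utf8(<<66, 106, 246, 114, 110>>, 0) # 1 — latin1 bytes, not UTF-8
Utf8.validate_utf8(<<195>>, 0)                    # 2 — truncated sequence
```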


You can use the unicode module in OTP to test for valid UTF-8 encoding.

Here is an example:

iex(1)> latin1_list = [66,106,246,114,110]
[66, 106, 246, 114, 110]
iex(2)> utf8_list = [66,106,195,182,114,110]
[66, 106, 195, 182, 114, 110]
iex(3)> :unicode.characters_to_binary(:erlang.list_to_binary(latin1_list))
{:error, "Bj", <<246, 114, 110>>}
iex(4)> :unicode.characters_to_binary(:erlang.list_to_binary(utf8_list))
"Björn"
iex(5)>

The unicode:characters_to_binary/1 function accepts either a list or a binary as input. If the input is a list, it expects that each element in the list is a single Unicode codepoint. Since your input list is a list of bytes, it is not appropriate to use the input list directly. Instead, it must be converted to a binary. When given a binary, unicode:characters_to_binary/1 expects that the characters in the binary are encoded in UTF-8. If the binary is indeed encoded in UTF-8, the return value will be the same binary. If it is not properly encoded in UTF-8, an error tuple will be returned instead.
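Note that a binary that is merely truncated mid-sequence comes back as an `:incomplete` tuple rather than `:error`:

```elixir
# <<195>> is the first byte of a two-byte UTF-8 sequence with no
# continuation byte following it, so conversion stops there.
:unicode.characters_to_binary(<<66, 106, 195>>)
# => {:incomplete, "Bj", <<195>>}
```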

If the list was not encoded in UTF-8, here is how to convert it to a UTF-8 encoded binary, using unicode:characters_to_binary/2:

iex(5)> :unicode.characters_to_binary(:erlang.list_to_binary(latin1_list), :latin1)
"Björn"

It is impossible to infer the encoding just from the bytes.

<<64, 65, 66>> is probably valid in any 8-bit encoding that ever existed; you will never know whether it is meant to be ASCII, Latin-1, or EBCDIC without someone telling you! If it’s ASCII or Latin-1 you can treat both the same, but if it’s EBCDIC you are lost.


That blog post was written in 2015. At the time, the suggested solution might have been the best way to test for UTF-8 encoding.

Today, it’s easier to use the unicode module or binary matching:

  def valid_utf8?(<<_ :: utf8, rest :: binary>>), do: valid_utf8?(rest)
  def valid_utf8?(<<>>), do: true
  def valid_utf8?(<<_ :: binary>>), do: false

(Using the unicode module as shown in my previous post is probably faster, but will build more garbage.)
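Wrapped in a module for a quick check (the `Utf8Check` module name is mine):

```elixir
defmodule Utf8Check do
  # Each <<_::utf8>> match consumes exactly one well-formed UTF-8
  # code point; anything left over that doesn't match is invalid.
  def valid_utf8?(<<_::utf8, rest::binary>>), do: valid_utf8?(rest)
  def valid_utf8?(<<>>), do: true
  def valid_utf8?(<<_::binary>>), do: false
end

Utf8Check.valid_utf8?("Björn")                    # true
Utf8Check.valid_utf8?(<<66, 106, 246, 114, 110>>) # false — latin1 bytes
```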


Hey I’m all about using language built-ins. I learned something.


Thank you all. This implementation (slightly tuned from Björn’s answer) seems to do what I need it to do:

def decode(b) when is_list(b) do
  raw = :erlang.list_to_binary(b)

  case :unicode.characters_to_binary(raw) do
    utf8 when is_binary(utf8) -> utf8
    _ -> :unicode.characters_to_binary(raw, :latin1)
  end
end

Latin-1 is not a subset of UTF-8. Only the ASCII parts match (0–127), the higher values do not.


Well, latin1 is a subset of UTF8 in the sense that every character that you can encode in latin1 is also encodable in UTF8.

That would make pretty much every charset a subset of UTF-8 since UTF-8 encodes Unicode and that includes all characters in the world.

Yup, all I wanted to say is that one needs to be careful when saying something “is a subset of” something else, or is not…

Yes, latin-1 is a subset of UTF-8 in the sense that all latin-1 characters can be expressed in UTF-8. However, the encodings are significantly different for code points >127.

Everything in latin-1 is a single byte; single-byte values above 127 are never legal on their own in UTF-8. To continue using Björn’s example (his name), the ö character is encoded in latin-1 as the single byte 246. In UTF-8, a byte with that bit pattern (0xF6) would introduce a four-byte sequence, but RFC 3629 caps lead bytes at 0xF4, so 246 can never appear anywhere in valid UTF-8 and the decoder must return an error.
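The byte-level difference is easy to see in iex:

```elixir
# "ö" in an Elixir source file is UTF-8 encoded: two bytes.
<<195, 182>> = "ö"

# Re-encoding the same character as latin1 gives the single byte 246.
<<246>> = :unicode.characters_to_binary("ö", :utf8, :latin1)
```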

If that error occurs when attempting to decode as UTF-8, we then fall back to latin-1.


Yes, that’s why I tend to say X is a subset of UTF-8, but their encodings are incompatible.
