Newbie with UnicodeConversionError

Matthew · January 20, 2017, 9:47am

Hello, I am new to Elixir and so far finding it awesome. However, I have just stumbled across a problem with non-ASCII strings. Please see below for the error message I am getting:

iex(4)> "하이"
** (UnicodeConversionError) invalid encoding starting at <<199, 207, 192, 204, 34, 10>>
    (elixir) lib/string.ex:1801: String.to_charlist/1
iex(4)>

I am using Windows 10 and running iex from a Cmder shell (the same error occurs in Windows cmd and in PowerShell).

Any suggestions for how I can fix this? (Besides developing on a Linux system.)

Marcus · January 20, 2017, 10:19am

You can try to use the raw representation of the string.

iex(30)> i "하이"
...
Raw representation
  <<237, 149, 152, 236, 157, 180>>
...
iex(31)> i <<237, 149, 152, 236, 157, 180>>
Term
  "하이"
...
Raw representation
  <<237, 149, 152, 236, 157, 180>>
...
iex(32)> str = <<237, 149, 152, 236, 157, 180>>
"하이"
iex(33)>

As you can see, your string just has the wrong encoding: the bytes iex received from the shell are not valid UTF-8.

iex(43)> String.to_charlist(<<199, 207, 192, 204, 34, 10>>)
** (UnicodeConversionError) invalid encoding starting at <<199, 207, 192, 204, 34, 10>>
    (elixir) lib/string.ex:1801: String.to_charlist/1
iex(43)>
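
A quick way to check this from iex is String.valid?/1, which reports whether a binary is valid UTF-8. The sketch below compares the bytes from the error message (with the trailing quote and newline dropped) against the UTF-8 bytes shown above; the variable names are only for illustration.

```elixir
# Bytes the shell sent to iex, taken from the error message
# (the trailing 34 and 10 are the closing quote and newline).
from_shell = <<199, 207, 192, 204>>

# UTF-8 bytes for the same two characters, as shown by `i "하이"`.
utf8 = <<237, 149, 152, 236, 157, 180>>

String.valid?(from_shell) # false: 199 starts a 2-byte sequence, but 207 is not a continuation byte
String.valid?(utf8)       # true

# Only the valid binary can be converted to a charlist of code points.
String.to_charlist(utf8)  # [54616, 51060]
```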

michalmuskala · January 20, 2017, 1:54pm

While I have very little Windows experience, I would expect this error to be caused by the shell not using a Unicode encoding.

In general, any Unicode issues on Windows related to the shell were solved by using the graphical shell, iex --werl.

CharlesO · May 21, 2018, 11:24am

This is not always the case. For example, I see the same or similar issues reading data from SQL. I'm not printing to the console; my solution generates XML files from SQL data, and I run into the UnicodeConversionError as well.

For example:

%UnicodeConversionError{encoded: " <serial-no>1</serial-no>\n <full-name>UBA Pensions Custodian Limited </full-name>\n <contact-address>3rd Floor, 22B,", message: "invalid encoding starting at <<160, 73, 100, 111, 119, 117, 32, 84, 97, 121, 108, 111, 114, 32, 83, 60, 47, 99, 111, 110, 116, 97, 99, 116, 45, 97, 100, 100, 114, 101, 115, 115, 62, 10>>"}

code:

xml =
  if add_serial do
    for {ix, v} <- data,
        do:
          "#{indent}<data>\n#{xml_row([{"serial-no", ix} | v], indent <> @t1)}#{indent}</data>\n"
  else
    for {_, v} <- data, do: "#{indent}<data>\n#{xml_row(v, indent <> @t1)}#{indent}</data>\n"
  end

kip · May 21, 2018, 11:50am

That looks like a UTF-8 encoding error (which is what the exception is reporting).

<< 160 >> is the ISO 8859-1 encoding for a non-breaking space. For UTF-8 the encoding is << 0xc2, 160 >> (or << 0xc2, 0xa0 >>). I know the exception isn't showing the full binary, but I assume that the byte before the referenced binary is not << 0xc2 >>.

I wonder if your database connection is set to ISO8859 instead of UTF-8?
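
To make the difference concrete, here is a small sketch comparing the two encodings of the non-breaking space. The last step uses Erlang's :unicode.characters_to_binary/3 to convert from Latin-1 to UTF-8, and it assumes the data really is ISO 8859-1. If the source is actually Windows-1252, bytes in the range 0x80 to 0x9F would need a different mapping.

```elixir
nbsp_latin1 = <<160>>        # non-breaking space in ISO 8859-1
nbsp_utf8 = <<0xC2, 0xA0>>   # the same character in UTF-8

String.valid?(nbsp_latin1)   # false: a lone 0xA0 is not valid UTF-8
String.valid?(nbsp_utf8)     # true

# Encoding code point 0xA0 as UTF-8 gives the two-byte form.
<<0xA0::utf8>> == nbsp_utf8  # true

# Assumption: the incoming bytes are ISO 8859-1 (Latin-1).
converted = :unicode.characters_to_binary(nbsp_latin1, :latin1, :utf8)
converted == nbsp_utf8       # true
```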

CharlesO · May 21, 2018, 11:53am

I totally did not check; I just went with the defaults.

kip · May 21, 2018, 12:01pm

If you have access to psql try:

show client_encoding

It would also be helpful to see the full binary (if possible) in order to see if it's maybe the wrong encoding, or whether it's really an encoding error.

The default encoding for PostgreSQL is, iirc, UTF-8. So, although a possibility, it would be a little surprising if the database encoding was something else. But definitely possible.

CharlesO · May 21, 2018, 12:02pm

Oh, I'm connecting to MS-SQL, not PostgreSQL.

kip · May 21, 2018, 12:03pm

Ahhhhhhhhhhh. No idea about the defaults there but since I think Windows has a predilection for ISO-8859, checking the encoding would definitely be worth doing first.

CharlesO · May 21, 2018, 12:06pm

The safe bet would be to have a simple function on String that outputs just printable chars, like to_printable/1. That would eliminate guesswork (in my case).

kip · May 21, 2018, 12:08pm

Actually you can: String.codepoints/1 will decode what it can and return the bytes it can't decode as-is.

iex> x = <<160, 73, 100, 111, 119, 117, 32, 84, 97, 121, 108, 111, 114, 32, 83, 60, 47,
  99, 111, 110, 116, 97, 99, 116, 45, 97, 100, 100, 114, 101, 115, 115, 62, 10>>
iex> String.codepoints x
[
  <<160>>,
  "I",
  "d",
  "o",
  "w",
  "u",
  " ",
  "T",
  "a",
  "y",
  "l",
  "o",
  "r",
  " ",
  "S",
  "<",
  "/",
  "c",
  "o",
  "n",
  "t",
  "a",
  "c",
  "t",
  "-",
  "a",
  "d",
  "d",
  "r",
  "e",
  "s",
  "s",
  ">",
  "\n"
]
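
Building on that output, a to_printable/1 along the lines suggested earlier could be sketched as follows. The module name Sanitize and the function itself are hypothetical, not part of the String module. The sketch keeps only the code points for which String.printable?/1 returns true, so invalid bytes such as <<160>> are dropped. Note that this silently discards data, so fixing the encoding at the source is usually the better option when it is possible.

```elixir
defmodule Sanitize do
  # Hypothetical helper: keeps only valid, printable code points.
  # Invalid bytes come back from String.codepoints/1 as single-byte
  # binaries, and String.printable?/1 returns false for them.
  def to_printable(binary) when is_binary(binary) do
    for cp <- String.codepoints(binary), String.printable?(cp), into: "", do: cp
  end
end

# First few bytes of the binary from the exception message.
raw = <<160, 73, 100, 111, 119, 117>>

Sanitize.to_printable(raw) # "Idowu"
```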