How to detect non-UTF8 characters in a BitString?

Rich_Morin · October 31, 2018, 6:01am

I have some files which contain text copied from web sites. In consequence, some of them contain UTF-16 characters. File.read/1 can read these files just fine, but the resulting BitString causes toml-elixir to crash. So, I’d like a way to detect the presence of non-UTF8 characters.

I tried using the following code, but it seems to return a list that includes the UTF-16 characters, rather than an error tuple, as documented. Suggestions?

tmp_cl =  to_charlist(binary)
result  = :unicode.characters_to_list(tmp_cl, :utf8)

-r

Nicd · October 31, 2018, 6:43am

Maybe you can check with String.valid?(str): https://hexdocs.pm/elixir/String.html#valid?/1
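For reference, a minimal sketch of what String.valid?/1 returns for a few inputs. The sample binaries are illustrative assumptions, not taken from the files in question.

```elixir
# Plain ASCII is valid UTF-8.
ascii_ok = String.valid?("hello")
# U+2019 (right single quotation mark) is encoded as the bytes e2 80 99.
curly_ok = String.valid?(<<0xE2, 0x80, 0x99>>)
# The byte 0xFF never appears in well-formed UTF-8.
bad_ok = String.valid?(<<0xFF, 0xFF>>)
# A multi-byte sequence that is cut short is rejected as well.
truncated_ok = String.valid?(<<0xE2, 0x80>>)

IO.inspect({ascii_ok, curly_ok, bad_ok, truncated_ok})
```

Because the check works on whole encoded sequences rather than single bytes, a non-ASCII character such as U+2019 passes as long as its bytes are well-formed.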

Nicd · October 31, 2018, 6:45am

Btw for :unicode.characters_to_list/2, you may want to give the original binary as argument, see if that works better.
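A rough sketch of why passing the original binary matters, using made-up sample data. When the argument is a list, :unicode.characters_to_list/2 treats each integer as a code point that is already decoded, so a charlist is handed back unchanged. The error tuples only show up when the raw bytes are given.

```elixir
valid = "it’s"
invalid = <<"ab", 0xFF>>
truncated = <<"ab", 0xE2, 0x80>>

# Valid UTF-8 bytes decode to a list of code points.
decoded = :unicode.characters_to_list(valid, :utf8)
# Invalid bytes give a tuple tagged :error, carrying the decoded prefix and the rest.
error_result = :unicode.characters_to_list(invalid, :utf8)
# A sequence that stops mid-character gives a tuple tagged :incomplete.
incomplete_result = :unicode.characters_to_list(truncated, :utf8)
# A list of integers is read as code points, so nothing is re-validated as bytes.
from_charlist = :unicode.characters_to_list([8217], :utf8)

IO.inspect(decoded, charlists: :as_lists)
IO.inspect(error_result)
IO.inspect(incomplete_result)
IO.inspect(from_charlist, charlists: :as_lists)
```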

Rich_Morin · October 31, 2018, 7:05am

No, String.valid?(str) throws an exception:

** (EXIT from #PID<0.174.0>) shell process exited with reason:
    an exception was raised:
** (ArgumentError) argument error
    :erlang.iolist_to_binary([84, ..., 32, 34, 8217, 34, 10, 32, 32])

-r

CharlesO · October 31, 2018, 7:25am

You can use this:

non_utf8 = for <<c <- source_string>>, c not in 32..255, into: "", do: <<c>>

If non_utf8 != "", then you have non-UTF-8 chars.

Rich_Morin · October 31, 2018, 8:44am

Hmmmm. The character in question is actually UTF-8, just not US-ASCII (7-bit):

$ echo "’" | od -t x1
0000000    e2  80  99  0a

So, for now I’m using a variation on CharlesO’s approach:

    eight_bit = for <<c <- binary>>, c not in 0..127, into: "", do: <<c>>

-r

NobbZ · October 31, 2018, 8:51am

So according to your opinion, "\n" is not valid UTF-8, while <<0xff, 0xff>> is? That filter does not work very well…
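To make the objection concrete, here is a small sketch that runs the 32..255 byte filter on both inputs. The `flagged` helper is a hypothetical wrapper added for illustration.

```elixir
# The byte filter from the earlier post, wrapped in an anonymous function.
flagged = fn bin -> for <<c <- bin>>, c not in 32..255, into: "", do: <<c>> end

# A newline is byte 10, which falls outside 32..255, so it is reported.
newline_hits = flagged.("line one\n")
# Both 0xFF bytes fall inside 32..255, so nothing is reported.
ff_hits = flagged.(<<0xFF, 0xFF>>)

IO.inspect(newline_hits)
IO.inspect(ff_hits)
IO.inspect(String.valid?(<<0xFF, 0xFF>>))
```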

Rich_Morin:

** (EXIT from #PID<0.174.0>) shell process exited with reason:
    an exception was raised:
** (ArgumentError) argument error
    :erlang.iolist_to_binary([84, ..., 32, 34, 8217, 34, 10, 32, 32])

Can you provide an example of how you read the file?

PS: It might be helpful if you were identifying those files on download, the HTTP header should contain all necessary information to identify them, and convert the encoding in an intermediate step.
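If a file's encoding is known, the intermediate conversion step could look roughly like this. The sketch assumes UTF-16 little-endian input and uses a made-up two-byte sample. In practice the encoding would come from the HTTP header or a byte order mark.

```elixir
# "é" (U+00E9) encoded as UTF-16 little-endian; an assumed sample, not real file data.
utf16 = <<0xE9, 0x00>>

# These bytes are not well-formed UTF-8 as they stand.
valid_before = String.valid?(utf16)

# Re-encode from UTF-16LE to UTF-8 with the Erlang :unicode module.
utf8 = :unicode.characters_to_binary(utf16, {:utf16, :little}, :utf8)
valid_after = String.valid?(utf8)

IO.inspect({valid_before, utf8, valid_after})
```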

CharlesO · October 31, 2018, 8:56am

You can redefine the range, depending on which characters you wish to allow or disallow.

NobbZ · October 31, 2018, 8:58am

The problem with that approach is that it validates byte by byte, while UTF-8 is a multi-byte encoding. A single code point is encoded as 1 to 4 bytes, so looking at one byte at a time cannot tell you whether the sequence as a whole is valid.
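A short sketch of the difference, using the curly quote from earlier in the thread (bytes e2 80 99):

```elixir
quote_mark = "’"

# Iterating over the binary byte by byte yields three separate integers.
bytes = for <<b <- quote_mark>>, do: b
# Decoding as UTF-8 yields a single code point, U+2019.
codepoints = String.to_charlist(quote_mark)
# None of the three bytes is valid UTF-8 when taken alone.
each_byte_valid = Enum.map(bytes, fn b -> String.valid?(<<b>>) end)

IO.inspect(bytes)
IO.inspect(codepoints, charlists: :as_lists)
IO.inspect(each_byte_valid)
IO.inspect({byte_size(quote_mark), String.length(quote_mark)})
```

A range check on individual bytes sees 226, 128, and 153 as three unrelated values, which is why it can neither accept this character reliably nor reject malformed sequences.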

josevalim · October 31, 2018, 10:03am

Rich_Morin:

No, String.valid?(str) throws an exception:

** (EXIT from #PID<0.174.0>) shell process exited with reason:
    an exception was raised:
** (ArgumentError) argument error
    :erlang.iolist_to_binary([84, ..., 32, 34, 8217, 34, 10, 32, 32])

-r

What is the input being given to String.valid?? Assuming it is a binary, it should never raise. Otherwise we have a bug.

OvermindDL1 · October 31, 2018, 2:52pm

Wouldn’t something simple like this work well enough?

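defmodule IsUTF8 do
  # Empty binary: every code point seen so far was valid.
  def is?(<<>>), do: :ok
  # A well-formed UTF-8 code point leads the binary: keep scanning.
  def is?(<<_c::utf8, rest::binary>>), do: is?(rest)
  # Otherwise the leading byte is not part of valid UTF-8.
  def is?(<<c::size(8), _::binary>>), do: {:error, c}
end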

Used like:

iex(1)> s = "hello"
"hello"
iex(2)> b = <<1, 2, 3, 4, 5, 255, 255>>
<<1, 2, 3, 4, 5, 255, 255>>
iex(3)> IsUTF8.is?(s)
:ok
iex(4)> IsUTF8.is?(b)
{:error, 255}

Adjust the error reporting as you wish of course (like byte location, etc… etc…).
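One possible way to report the byte location, as a sketch only. The module name `Utf8Check` and the `{:error, offset, byte}` return shape are assumptions made for this example.

```elixir
defmodule Utf8Check do
  # Returns :ok, or {:error, offset, byte} for the first byte that is not
  # part of a well-formed UTF-8 sequence (offset is zero-based).
  def first_invalid(bin) when is_binary(bin), do: scan(bin, 0)

  defp scan(<<>>, _offset), do: :ok

  # A valid code point: advance the offset by its encoded length (1 to 4 bytes).
  defp scan(<<c::utf8, rest::binary>>, offset) do
    scan(rest, offset + byte_size(<<c::utf8>>))
  end

  # No valid code point starts here: report where and which byte.
  defp scan(<<byte, _rest::binary>>, offset), do: {:error, offset, byte}
end

IO.inspect(Utf8Check.first_invalid("hello"))
IO.inspect(Utf8Check.first_invalid(<<1, 2, 3, 4, 5, 255, 255>>))
IO.inspect(Utf8Check.first_invalid("a’" <> <<0xFF>>))
```

The offset counts bytes, not characters, so it can be used directly with binary_part/3 or a hex dump of the file.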

Rich_Morin · October 31, 2018, 6:10pm

Per José’s comment on String.valid?/1:

My bad; the exception came from Toml.decode/1. Nothing to see here; move along…

-r