How to detect non-UTF8 characters in a BitString?

I have some files which contain text copied from web sites. As a consequence, some of them contain UTF-16 characters. File.read/1 can read these files just fine, but the resulting BitString causes toml-elixir to crash. So, I’d like a way to detect the presence of non-UTF-8 characters.

I tried using the following code, but it seems to return a list that includes the UTF-16 characters, rather than an error tuple, as documented. Suggestions?

tmp_cl = to_charlist(binary)
result = :unicode.characters_to_list(tmp_cl, :utf8)

-r

Maybe you can check with String.valid?(str): https://hexdocs.pm/elixir/String.html#valid?/1
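A minimal sketch of that check, assuming the input really is a binary (the `check` helper and the sample values are just for illustration, not from your code):

```elixir
valid = "it’s"              # contains U+2019, which is valid UTF-8
invalid = <<0xFF, 0xFF>>    # 0xFF can never appear in UTF-8

String.valid?(valid)        # true
String.valid?(invalid)      # false

# Hypothetical helper: wrap the check so callers get a tagged tuple.
check = fn bin ->
  if String.valid?(bin), do: {:ok, bin}, else: {:error, :invalid_utf8}
end

check.(invalid)             # {:error, :invalid_utf8}
```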

Btw for :unicode.characters_to_list/2, you may want to give the original binary as argument, see if that works better.
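Something along these lines, as a rough sketch. As far as I can tell, to_charlist/1 already decodes the binary into a list of code points, so the second call has nothing left to reject. The `check` function and the sample binaries below are made up for illustration:

```elixir
# Pass the raw binary, not a charlist, so the decoder sees the actual bytes.
check = fn bin ->
  case :unicode.characters_to_list(bin, :utf8) do
    list when is_list(list) -> :ok
    # Invalid byte sequence: `rest` starts at the offending byte.
    {:error, _decoded, rest} -> {:error, rest}
    # Input ends in the middle of a multi-byte sequence.
    {:incomplete, _decoded, rest} -> {:incomplete, rest}
  end
end

check.("it’s")               # :ok
check.(<<104, 105, 255>>)    # {:error, <<255>>}
check.(<<0xE2, 0x80>>)       # {:incomplete, <<226, 128>>}
```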

No, String.valid?(str) throws an exception:

** (EXIT from #PID<0.174.0>) shell process exited with reason:
    an exception was raised:
** (ArgumentError) argument error
    :erlang.iolist_to_binary([84, ..., 32, 34, 8217, 34, 10, 32, 32])

-r

You can use this:

non_utf8 = for <<c <- source_string>>, c not in 32..255, into: "", do: <<c>>

If non_utf8 != "", then you have non-UTF-8 chars.

Hmmmm. The character in question is actually UTF-8, just not US-ASCII (7-bit):

$ echo "’" | od -t x1
0000000    e2  80  99  0a

So, for now I’m using a variation on CharlesO’s approach:

    eight_bit = for <<c <- binary>>, c not in 0..127, into: "", do: <<c>>

-r

So according to your opinion, "\n" is not valid UTF-8, while <<0xff, 0xff>> is? That filter does not work very well…

Can you provide an example of how you read the file?

PS: It might be helpful to identify those files on download. The HTTP headers should contain all the information necessary to identify the encoding, so you could convert it in an intermediate step.

You can redefine the range, depending on which characters you wish to allow or disallow.

The problem with that approach is that it validates byte by byte, while UTF-8 is a multi-byte encoding. A single code point is encoded in 1 to 4 bytes, so looking at one byte at a time cannot tell you whether the sequence as a whole is valid.
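To illustrate with the character from this thread (a small sketch; the variable names are mine):

```elixir
quote_char = "’"             # U+2019, encoded as the bytes e2 80 99

byte_size(quote_char)        # 3
String.length(quote_char)    # 1

# Byte-wise filter: all three bytes are above 127, so the whole
# (perfectly valid) character ends up flagged.
flagged = for <<c <- quote_char>>, c not in 0..127, into: "", do: <<c>>

# Code-point-wise match: the three bytes are consumed as one unit.
<<cp::utf8, rest::binary>> = quote_char
# cp is 8217 (0x2019), rest is ""
```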

What is the input being given to String.valid?? Assuming it is a binary, it should never raise. Otherwise we have a bug.

Wouldn’t something simple like this work well enough?

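defmodule IsUTF8 do
  # Returns :ok for valid UTF-8, or {:error, byte} with the first offending byte.
  def is?(bin)
  # A valid code point is at the front: skip it and keep scanning.
  def is?(<<_::utf8, rest::binary>>), do: is?(rest)
  # No valid code point matched, so report the leading byte.
  def is?(<<c::size(8), _::binary>>), do: {:error, c}
  # Nothing left to check.
  def is?(""), do: :ok
end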

Used like:

iex(1)> s = "hello"
"hello"
iex(2)> b = <<1, 2, 3, 4, 5, 255, 255>>
<<1, 2, 3, 4, 5, 255, 255>>
iex(3)> IsUTF8.is?(s)
:ok
iex(4)> IsUTF8.is?(b)
{:error, 255}

Adjust the error reporting as you wish, of course (byte location, etc.).
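For example, a variation that also reports the byte offset might look like this. It is only a sketch: the module and function names (`UTF8Check`, `first_invalid/1`) are hypothetical.

```elixir
defmodule UTF8Check do
  # Returns :ok, or {:error, offset, byte} for the first invalid byte,
  # where offset is the zero-based position in the binary.
  def first_invalid(bin) when is_binary(bin), do: scan(bin, 0)

  # Valid code point: advance the offset by its encoded length.
  defp scan(<<cp::utf8, rest::binary>>, offset),
    do: scan(rest, offset + byte_size(<<cp::utf8>>))

  # Anything else is an invalid (or truncated) sequence starting here.
  defp scan(<<byte, _::binary>>, offset), do: {:error, offset, byte}

  defp scan(<<>>, _offset), do: :ok
end
```

With the binaries from above, `UTF8Check.first_invalid("hello")` gives `:ok` and `UTF8Check.first_invalid(<<1, 2, 3, 4, 5, 255, 255>>)` gives `{:error, 5, 255}`.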

Per José’s comment on String.valid?/1:

My bad; the exception came from Toml.decode/1. Nothing to see here; move along…

-r
