I have some files which contain text copied from web sites. As a consequence, some of them contain UTF-16-encoded text. File.read/1 can read these files just fine, but the resulting BitString causes toml-elixir to crash. So I'd like a way to detect the presence of non-UTF-8 bytes.
I tried using the following code, but it seems to return a list that includes the UTF-16 characters, rather than an error tuple, as documented. Suggestions?
```elixir
tmp_cl = to_charlist(binary)
result = :unicode.characters_to_list(tmp_cl, :utf8)
```
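For what it's worth, the charlist round-trip may be what hides the error: `to_charlist/1` already decodes the bytes into codepoints, so `:unicode.characters_to_list/2` has nothing left to reject. One way to avoid that is to validate the raw binary directly; a minimal sketch (the `Utf8Check` module name is just an illustration):

```elixir
defmodule Utf8Check do
  # Validate the raw binary, not a charlist derived from it.
  # Returns :ok for valid UTF-8, an error tuple otherwise.
  def check(binary) when is_binary(binary) do
    case :unicode.characters_to_binary(binary, :utf8) do
      converted when is_binary(converted) -> :ok
      {:error, _converted_so_far, _rest} -> {:error, :invalid_utf8}
      {:incomplete, _converted_so_far, _rest} -> {:error, :truncated_utf8}
    end
  end
end

Utf8Check.check("hello\n")       # → :ok
Utf8Check.check(<<0xFF, 0xFF>>)  # → {:error, :invalid_utf8}
```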
So, in your opinion, "\n" is not valid UTF-8, while <<0xff, 0xff>> is? That filter does not work very well…
Can you provide an example of how you read the file?
PS: It might be helpful to identify those files at download time; the HTTP Content-Type header should contain all the necessary charset information, letting you convert the encoding in an intermediate step.
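If a file turns out to be UTF-16, the intermediate conversion step can be done with Erlang's :unicode module. A sketch, assuming little-endian byte order (that's an assumption — in practice you'd check a BOM or the charset header):

```elixir
# Hypothetical UTF-16 little-endian input; real data would come from the file.
utf16_data = <<?h::utf16-little, ?i::utf16-little>>

# Convert UTF-16 LE to UTF-8; adjust {:utf16, :little} to match the actual
# byte order of the source data.
utf8_data = :unicode.characters_to_binary(utf16_data, {:utf16, :little}, :utf8)
# utf8_data is now the UTF-8 binary "hi"
```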
The problem with that approach is that it tries to validate byte by byte, while UTF-8 is a multi-byte encoding. A single glyph can consist of 1 to 4 bytes, so looking at one byte at a time tells you nothing.
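To illustrate: "é" is a single glyph encoded as the two bytes <<0xC3, 0xA9>>, and neither byte on its own is valid UTF-8, so any per-byte filter misclassifies it. String.valid?/1 validates whole sequences instead:

```elixir
String.valid?("é")        # true  — the full two-byte sequence is valid
String.valid?(<<0xC3>>)   # false — a lead byte with no continuation
String.valid?(<<0xA9>>)   # false — a stray continuation byte
String.valid?("\n")       # true  — plain ASCII is valid UTF-8
```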