Detecting non-ASCII characters in a binary

ijdickinson · April 11, 2024, 10:29am

As part of our data validation, I’d like to spot unexpected UTF8 characters in user supplied details. Some non-ASCII characters can be anticipated (é for example), but I’d like to spot unintended changes like UTF-8 single quote mark in place of ASCII ', invisible spaces, etc.

So the question I have is: what’s a good way to scan an Elixir string to detect characters that are in UTF8 but outside the range of ASCII (i.e. which would require File.open/2 to use :utf8 mode when writing)?

dimitarvp · April 11, 2024, 11:40am

You can just make an allow-list of characters (since you said you are not only after ASCII characters but a few more as well), open a file in :utf8 mode and then work on each character via the in operator and the String.codepoints function. Seems easy.
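A minimal sketch of that allow-list idea, assuming the input is already an Elixir string. The module name `AllowListSketch` and the contents of `@extra_allowed` are made up for illustration; only `String.codepoints/1` and the `in` operator come from the suggestion above.

```elixir
defmodule AllowListSketch do
  # Hypothetical allow-list: ASCII plus a handful of accented letters.
  @extra_allowed ["é", "è", "ü", "Å", "ö"]

  # Returns the code points (as single-character strings) that are
  # neither ASCII nor in the extra allow-list.
  def unexpected(string) when is_binary(string) do
    string
    |> String.codepoints()
    |> Enum.reject(fn cp -> ascii?(cp) or cp in @extra_allowed end)
  end

  # A code point is ASCII when it encodes to a single byte below 128.
  defp ascii?(<<byte>>) when byte < 128, do: true
  defp ascii?(_), do: false
end

AllowListSketch.unexpected("café")     # => []
AllowListSketch.unexpected("o’brien")  # => ["’"]
```

An empty result means nothing needs review; anything else can be shown to a human.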

dwark · April 11, 2024, 12:50pm

In addition to @dimitarvp's reply, when rolling your own, maybe this String.valid? thread could be helpful.

cevado · April 11, 2024, 12:58pm

I have this library for transliterating Unicode to ASCII. It’s based on the Perl and Ruby versions of it. From your question, it seems to be good enough:
https://hexdocs.pm/unidecode/Unidecode.html

edit:
Also, if you’re going from UTF-8 to a specific encoding, maybe using codepagex might work better.

dwark · April 11, 2024, 2:09pm

I had a use-case to simply detect non-ascii and used:

name != for(<<c <- name>>, c < 128, into: "", do: <<c>>)

which seems to do the trick. Didn’t need to be fast per se.
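The comparison above works because every byte of a multi-byte UTF-8 sequence is 128 or higher, so filtering bytes below 128 drops all non-ASCII characters and leaves a different string. As a rough alternative sketch (the module and function names here are hypothetical, not from any library), the same check can be written with binary pattern matching, which stops at the first non-ASCII byte:

```elixir
defmodule AsciiCheck do
  # True when every byte of the binary is below 128, i.e. pure ASCII.
  def ascii_only?(<<byte, rest::binary>>) when byte < 128, do: ascii_only?(rest)
  def ascii_only?(<<>>), do: true
  # Reached only when the next byte is 128 or higher.
  def ascii_only?(binary) when is_binary(binary), do: false
end

AsciiCheck.ascii_only?("jane")     # => true
AsciiCheck.ascii_only?("o’brien")  # => false
```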

kip · April 12, 2024, 9:20pm

Is こんにちは unexpected UTF-8? Is नमस्ते? I’m curious what the issue is, in your use case, with valid UTF-8?

I think the answer to that would help decide whether whitelisting or blacklisting is the better choice.

ijdickinson · April 12, 2024, 9:54pm

Which of these is more likely to be correct input from a user:

jane.o'brien@example.com
jane.o’brien@example.com

?

Turns out, one of them is in ASCII, and one is not. We can tell this because an IO.write/2 with one of those strings will fail unless the output is opened in :utf8 mode.

It’s unlikely (not impossible, but unlikely) that the UTF8 apostrophe is part of a correct email address. However, we can’t just reject UTF8 outright because Anders.Ångström@example.com is entirely legitimate. So the goal is to detect unexpected input that might be a mistake, and flag it for a human reviewer to check. Hence the task: point out characters that are valid UTF8 codepoints, but outside ASCII, allowing for a list of common exceptions. The exceptions are basically Roman alphabet characters with accents.

Given the current audience for our app, at the moment it’s very unlikely we’ll get a string of entirely Arabic or Kanji or whatever. If that becomes an issue in the future, we’ll have to redesign the approach, but for today yagni.
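One way the "flag for a human reviewer" step might look, as a sketch under stated assumptions: the input is valid UTF-8 (check with `String.valid?/1` first, since `String.to_charlist/1` raises otherwise), and the allowed ranges below are a guess at "Roman alphabet characters with accents", not a vetted list. The module name `ReviewFlags` is hypothetical. Normalizing to NFC first means a decomposed "A" plus combining ring is treated the same as a precomposed "Å".

```elixir
defmodule ReviewFlags do
  # Assumed allow-list of code point ranges; adjust to the real audience.
  @allowed [
    # ASCII
    0x00..0x7F,
    # Latin-1 Supplement letters (this range also contains × and ÷)
    0xC0..0xFF,
    # Latin Extended-A
    0x100..0x17F
  ]

  # Returns {character, "U+XXXX"} pairs for every code point that falls
  # outside the allowed ranges, in order of appearance.
  def flag(string) when is_binary(string) do
    string
    |> String.normalize(:nfc)
    |> String.to_charlist()
    |> Enum.reject(fn cp -> Enum.any?(@allowed, &(cp in &1)) end)
    |> Enum.map(fn cp -> {<<cp::utf8>>, "U+" <> hex(cp)} end)
  end

  defp hex(cp) do
    cp |> Integer.to_string(16) |> String.pad_leading(4, "0")
  end
end

ReviewFlags.flag("jane.o’brien@example.com")     # => [{"’", "U+2019"}]
ReviewFlags.flag("Anders.Ångström@example.com")  # => []
```

Reporting the code point alongside the character helps with invisible spaces and look-alike punctuation, which are hard to spot by eye.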

kip · April 12, 2024, 10:13pm

こんにちは@example.com is also entirely legitimate, as you know. I empathise with the intent to help check user-supplied input, but other than actually sending a validation email, you risk ending up with as many complaints about blocking valid email addresses as you do fixing unintentional input errors.

You might get some additional ideas from the Unicode Security Guide which overlaps, in part, with your objectives.

dimitarvp · April 13, 2024, 9:44am

Confusing part is why would you do IO.write without :utf8 mode. Still, OK:

defmodule AllowlistUnicode do
  def allowset_to_list(list) when is_list(list) do
    list |> allowset_element_to_list() |> List.flatten()
  end

  def allowset_element_to_list([]), do: []

  def allowset_element_to_list([range | rest]) when is_struct(range, Range) do
    [Range.to_list(range) | allowset_element_to_list(rest)]
  end

  def allowset_element_to_list([allowed | rest]) when is_list(allowed) or is_integer(allowed) do
    [allowed | allowset_element_to_list(rest)]
  end

  def get_allowed_and_blocked_characters(string, allow_list)
      when is_binary(string) and is_list(allow_list) do
    allow_list = allowset_to_list(allow_list)

    string
    |> String.to_charlist()
    |> Enum.reduce({[], []}, fn char, {allowed, blocked} ->
      if char in allow_list do
        {[char | allowed], blocked}
      else
        {allowed, [char | blocked]}
      end
    end)
    |> then(fn {allowed, blocked} ->
      {Enum.reverse(allowed), Enum.reverse(blocked)}
    end)
  end
end

This allows you to test with arbitrary mixes of lists, Ranges and separate integers like so:

iex(11)> AllowlistUnicode.get_allowed_and_blocked_characters("гzжf", [0..127, 8217, [0x2D, 0x2011]])
{~c"zf", [1075, 1078]}

I.e. the result is a tuple whose first element is the list of allowed characters and whose second element is the list of blocked characters.

That way you can easily filter out stuff you dislike.

I’ll agree with @kip that this is a slippery slope and carries the potential of you having to respond to human support requests for a while until you nail your audience… and then indeed any Chinese / Korean / Japanese / Arabic name will trip your code up again. But maybe you’re OK with it, hence the code above.

Or if you only want to check against a pre-constructed allow-list + only need either the allowed or the blocked characters then @dwark’s code is a literal one-liner that gets the job done just fine.

ijdickinson · April 13, 2024, 10:46am

Well I didn’t expect the Spanish Inquisition. It just happened that way. We’ve had six months of processing bulk customer data, which was all fine until it wasn’t. The IO.write is fixed, obviously, but the interesting part - to me - is that missing the :utf8 flag exposed an error case in the data pipeline that we hadn’t come across before. Production code has bugs sometimes. You find them, you fix them, and move on.

ijdickinson · April 13, 2024, 11:01am

Thanks for all the comments and suggestions, folks. I have enough now to make our data pipeline a bit more robust (at least to the error cases we know about!)

dimitarvp · April 13, 2024, 12:59pm

Heh. Consider that a good amount of posters arrive here with their variant of the XY problem so we do our best to establish a good foundation for the discussion – which means dispelling potential myths and bad practices from the get go, because they usually prolong the discussion and tend to make it unfocused. Nobody is criticizing you in particular or saying that real code doesn’t have problems – of course it has.

Hindsight, 20/20, and all that.

Hope we were helpful.