Regex and Unicode

Anyone who uses regex with Elixir (especially with Unicode) would be wise to read this essay by Dr. Jamie Jennings on the subject and why it can be such a minefield of problems:

One more reason to like Rust’s regex library: it doesn’t support look-behind (or look-around of any kind).

Quite the interesting read!
I’m looking forward to part two, which is hinted to actually cover the regex look-behind operation.

As someone who has done some work on parsers, lexers and state-machines, I completely understand the frustration that arises from an over-use of “regular” expressions that attempt to add extra features to support a set of languages other than the set of regular languages, with subtle carnage as a result.

If other people like reading about this topic, I can also recommend this blog post about ripgrep. It goes into detail on how it works, how it can be more efficient than other text-search tools out there, and why many of those tools go wrong in their search implementations when Unicode is involved.

Very interesting read, and given the work I’ve been doing in Unicode and localisation it resonates strongly. In fact it’s part of why I got interested in this topic at all. There are tools and techniques in Unicode and CLDR that can help. They are not the same as regex, or general string matching, but at least they have some formal underpinnings. Here’s a template example of how I approach this using some of the libs I maintain.

defmodule UnicodeString do
  @doc """
  Compares two strings for equality.

  The comparison makes use of CLDR transforms to
  transform the input text to ASCII. This is loosely
  called "un-accenting" but it is more than that.

  Secondly, it uses the Unicode case algorithm
  to compare the strings in a case insensitive way.
  It does so for all scripts, not just Latin scripts.

  ## Examples

  These examples are taken from the very interesting
  article on [regexes and Unicode](https://jamiejennings.com/posts/2021-09-07-dont-look-back-1/).

      iex> no_accent = "Bogota"
      iex> latin_accent = "Bogotá"
      iex> unicode_accent = "Bogotá"
      iex> UnicodeString.equals? no_accent, latin_accent
      true
      iex> UnicodeString.equals? no_accent, unicode_accent
      true
      iex> UnicodeString.equals? latin_accent, unicode_accent
      true

  """
  def equals?(a, b) do
    Unicode.String.equals_ignoring_case? normalize(a), normalize(b)
  end

  def normalize(string) do
    Unicode.Transform.LatinAscii.transform(string)
  end
end
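
For anyone wondering why the two accented spellings differ in the first place, here is a minimal sketch using only Elixir's standard library. It is not the CLDR-based approach above: the module name `NormalizationSketch` and its functions are made up for illustration, and the two strings are written with explicit escapes so the code points are unambiguous.

```elixir
defmodule NormalizationSketch do
  # Hypothetical module and function names, for illustration only.

  # "á" written as one precomposed code point (U+00E1).
  def precomposed, do: "Bogot\u00E1"

  # "a" followed by the combining acute accent (U+0301).
  def combining, do: "Bogota\u0301"

  # Plain binary comparison: the two spellings are different byte sequences.
  def same_bytes?, do: precomposed() == combining()

  # Canonical equivalence check from the standard library.
  def equivalent?, do: String.equivalent?(precomposed(), combining())

  # With the `u` modifier, `.` consumes one code point, not one grapheme,
  # so the combining spelling leaves the accent unmatched before `$`.
  def dot_matches?(string), do: Regex.match?(~r/^Bogot.$/u, string)
end
```

Under these assumptions `same_bytes?/0` is false while `equivalent?/0` is true, and `dot_matches?/1` accepts the precomposed spelling but rejects the combining one. Note that canonical equivalence alone does not make "Bogota" equal to "Bogotá"; that is where a transform like the one above comes in.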

Looking forward to the next couple of parts of the article series.

In my experience, regex tends to be overused, even when it’s the right tool for the job.

What I mean by this is that it’s often the most costly option in terms of performance, and complex regex expressions can be very tricky to get right and to troubleshoot.

I had a code base with some pretty straightforward regexes on hot paths. They worked correctly, but when we had to tune for performance, one of our biggest wins was to reduce the number of actual regex calls with some sort of “gatekeeper” test. For example, if your regex is " STORE (#)?\d+$" and most of the strings you’ll examine won’t match, it’s FAR faster, even if your regex engine allows you to precompile, to do something like this:

if (someString.IndexOf(" STORE ") > -1 && Regex.IsMatch(someString, @" STORE (#)?\d+$")) {
     // Due to short-circuit evaluation, if STORE isn't in the string, the Regex is never called
     someString = RemoveStoreNumber(someString);
}

Once you are clear that regex is an embedded just-in-time compiled DSL, these sorts of patterns write themselves. I get the impression that many devs think regex comes for free in terms of performance. I was aware that couldn’t be true, but even so I was surprised just how costly it actually is.
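
Since this is an Elixir forum, here is a rough translation of the same gatekeeper idea. Treat it as a sketch: the module name `StoreNumber` and the function `strip/1` are hypothetical, stripping the suffix with `Regex.replace/3` is my stand-in for `RemoveStoreNumber`, and whether the cheap check pays off on the BEAM is something to measure rather than assume.

```elixir
defmodule StoreNumber do
  # Hypothetical module and function names, for illustration only.

  # Same pattern as above: " STORE ", an optional "#", digits, end of string.
  defp store_suffix, do: ~r/ STORE #?\d+$/

  # Removes a trailing store number, e.g. "ACME STORE #123" -> "ACME".
  def strip(string) do
    # Cheap substring test first; the regex only runs on likely candidates.
    if String.contains?(string, " STORE ") do
      # Returns the string unchanged when the pattern does not match.
      Regex.replace(store_suffix(), string, "")
    else
      string
    end
  end
end
```

Strings without " STORE " never reach the regex engine at all, and strings that contain it but do not end in a store number come back unchanged.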

Also keep in mind that, regardless of topic, if what you’re implementing carries a lot of cognitive burden and “gotchas”, then it may not be the best answer. Regex is rife with these. Nontrivial regex is hard to read, easy to screw up, and has occult behaviors that are easy to forget you are triggering.

All that said, when you need regex, you REALLY need it. It’s like any number of things that can easily get away from you if you’re not careful – inheritance in OOP, for example. All things in moderation. They’re in the toolbox for a reason, but that reason isn’t to go off on a bender and use them for the wrong things.

On the topic of Unicode, well, I had the luxury in my most recent long-running assignment to be working effectively with 7-bit ASCII and so could simply ignore the issue. Mixed case slows you down and accents and ligatures greatly expand the problem space. All the more reason to keep regex simple and straightforward to the maximum possible extent.

Dr. Jennings has put up the second part of her blog post:

This one is on the idea of look-behind in a regex and the inherent issues with that feature. Again, another excellent read from someone who’s done a lot of research (and thinking) on regexes!
