Regex and Unicode

Anyone who uses regex with Elixir (especially with Unicode) would be wise to read this essay by Dr. Jamie Jennings on the subject and why it can be such a minefield of problems: https://jamiejennings.com/posts/2021-09-07-dont-look-back-1/


One more reason to like Rust’s regex library – because it doesn’t do lookbacks.


Quite the interesting read!
I’m looking forward to part two, which the essay hints will actually talk about the regex lookback operation.

As someone who has done some work on parsers, lexers and state-machines, I completely understand the frustration that arises from an over-use of “regular” expressions that attempt to add extra features to support a set of languages other than the set of regular languages, with subtle carnage as a result.

If other people like reading about this topic, I can also recommend this blog post about ripgrep, which goes into detail on how it works, how it can be more efficient than other text-search tools out there, and why many of those tools go wrong in their search implementations when Unicode is involved.


Very interesting read, and given the work I’ve been doing in Unicode and localisation it resonates strongly. In fact it’s part of why I got interested in this topic at all. There are tools and techniques in Unicode and CLDR that can help. They are not the same as regex or general string matching, but at least they have some formal underpinnings. Here’s a template example of how I approach this using some of the libs I maintain.

defmodule UnicodeString do
  @doc """
  Compares two strings for equality.

  The comparison makes use of CLDR transforms to
  transform the input text to ASCII. This is loosely
  called "un-accenting" but it is more than that.

  Secondly, it uses the Unicode case algorithm
  to compare the strings in a case insensitive way.
  It does so for all scripts, not just Latin scripts.

  ## Examples

  These examples are taken from the very interesting
  article on [regexes and Unicode](https://jamiejennings.com/posts/2021-09-07-dont-look-back-1/).

      iex> no_accent = "Bogota"
      iex> latin_accent = "Bogotá"
      iex> unicode_accent = "Bogotá"
      iex> UnicodeString.equals? no_accent, latin_accent
      true
      iex> UnicodeString.equals? no_accent, unicode_accent
      true
      iex> UnicodeString.equals? latin_accent, unicode_accent
      true

  """
  def equals?(a, b) do
    Unicode.String.equals_ignoring_case? normalize(a), normalize(b)
  end

  def normalize(string) do
    Unicode.Transform.LatinAscii.transform(string)
  end
end
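For anyone who wants to see the underlying problem without pulling in any extra libs, here is a minimal sketch using only Elixir's standard library. It assumes the two accented spellings in the doctest above differ in the way the article describes: one uses the precomposed codepoint U+00E1, the other a plain "a" followed by the combining acute accent U+0301. The module name `NormalizationSketch` and its functions are hypothetical and only exist for illustration. Note that this only handles canonical equivalence; it does not strip accents or fold case the way the CLDR transform above does.

```elixir
defmodule NormalizationSketch do
  # Hypothetical helper module for illustration only; not part of any library.

  # Compare two strings after normalizing both to NFC.
  def same_text?(a, b) do
    String.normalize(a, :nfc) == String.normalize(b, :nfc)
  end

  # Return {number of codepoints, number of graphemes} for a string.
  def counts(string) do
    {length(String.codepoints(string)), String.length(string)}
  end
end

# Assumption: these are the two encodings of the same visible text.
precomposed = "Bogot\u00E1"   # á as a single codepoint
decomposed = "Bogota\u0301"   # a + combining acute accent

IO.inspect(precomposed == decomposed)                           # false
IO.inspect(String.equivalent?(precomposed, decomposed))         # true
IO.inspect(NormalizationSketch.same_text?(precomposed, decomposed)) # true

IO.inspect(NormalizationSketch.counts(precomposed))  # {6, 6}
IO.inspect(NormalizationSketch.counts(decomposed))   # {7, 6}

# With the /u modifier, "." matches one codepoint, not one grapheme,
# so the same pattern gives different answers for the two encodings.
IO.inspect(Regex.match?(~r/^Bogot.$/u, precomposed))  # true
IO.inspect(Regex.match?(~r/^Bogot.$/u, decomposed))   # false
```

Normalizing both the subject and any literal text in the pattern to the same form before matching avoids this particular surprise, though it does nothing for the accent-insensitive or case-insensitive comparisons shown above.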

Looking forward to the next couple of parts of the article series.
