How to pattern match (or use guards) to check if a string is alphanumeric

I have the following code:


defp has_char_in_string?(value), do: Regex.match?(~r/[^\d]/, value)

def somefun(arg) do
  
  case has_char_in_string?(arg) do
    true -> foo()
    false -> bar()
  end

end

I really want to keep this regex within this module and not externalize this logic. But I also want to use pattern matching or guards instead of a case. Is this even possible? And if it is, how?

Thanks in advance

No, there is no way to turn this regex, or even something semantically equivalent, into a guard-safe predicate.


Probably not what you are looking for:

def somefun(true),
  do: foo()

def somefun(false),
  do: bar()

def somefun(arg) when is_binary(arg) do
  Regex.match?(~r/[^\d]/, arg)
  |> somefun()
end

or

defp p_somefun(true),
  do: foo()

defp p_somefun(false),
  do: bar()

def somefun(arg) when is_binary(arg) do
  Regex.match?(~r/[^\d]/, arg)
  |> p_somefun()
end
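Wiring the second variant together into a runnable sketch (the module name `Dispatch` and the `foo/0`/`bar/0` bodies are illustrative stand-ins returning atoms, since the originals are not shown):

```elixir
defmodule Dispatch do
  # Boolean-dispatch: pattern match on the result of the regex check.
  defp p_somefun(true), do: foo()
  defp p_somefun(false), do: bar()

  def somefun(arg) when is_binary(arg) do
    Regex.match?(~r/[^\d]/, arg)
    |> p_somefun()
  end

  # Stand-ins so the result of the dispatch is visible.
  defp foo(), do: :has_non_digit
  defp bar(), do: :all_digits
end

Dispatch.somefun("123")   # => :all_digits
Dispatch.somefun("12a3")  # => :has_non_digit
```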

It’s possible, especially if the regular expression is simple - you can translate the regex into its corresponding finite state machine and represent that with pattern matching:

defmodule RegexRecursion do
  def somefun(arg) do
    call_bar(arg)
  end

  defp call_bar("0" <> rest), do: call_bar(rest)
  defp call_bar("1" <> rest), do: call_bar(rest)
  defp call_bar("2" <> rest), do: call_bar(rest)
  defp call_bar("3" <> rest), do: call_bar(rest)
  defp call_bar("4" <> rest), do: call_bar(rest)
  defp call_bar("5" <> rest), do: call_bar(rest)
  defp call_bar("6" <> rest), do: call_bar(rest)
  defp call_bar("7" <> rest), do: call_bar(rest)
  defp call_bar("8" <> rest), do: call_bar(rest)
  defp call_bar("9" <> rest), do: call_bar(rest)

  defp call_bar(""), do: bar()
  defp call_bar(x) when is_binary(x), do: foo()

  defp foo(), do: IO.puts("foo")
  defp bar(), do: IO.puts("bar")
end

Here, the regex [^\d] translates to a state machine that stays in call_bar as long as each character is 0-9, actually calls bar given an empty string, and calls foo otherwise.
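The ten digit clauses can also be collapsed into a single clause with a guard over the code point. A sketch of the same state machine (with `foo/0` and `bar/0` returning atoms here, rather than printing, so the result is easy to inspect):

```elixir
defmodule RegexRecursionCompact do
  def somefun(arg), do: call_bar(arg)

  # One guard clause replaces the ten literal-prefix clauses:
  # consume one byte as long as it is an ASCII digit.
  defp call_bar(<<c, rest::binary>>) when c in ?0..?9, do: call_bar(rest)
  defp call_bar(""), do: bar()
  defp call_bar(x) when is_binary(x), do: foo()

  defp foo(), do: :foo
  defp bar(), do: :bar
end

RegexRecursionCompact.somefun("12345")  # => :bar (all digits)
RegexRecursionCompact.somefun("12a45")  # => :foo (non-digit found)
```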

If the regex doesn’t involve backreferences or lookaheads (so it’s a theory-of-languages regular expression), it’s always possible to do this.

HOWEVER

The example above is a good example of how this approach obfuscates what should have been code like this:

defmodule RegexRecursionSimple do
  def somefun(arg) when is_binary(arg) do
    if Regex.match?(~r/[^\d]/, arg) do
      foo()
    else
      bar()
    end
  end

  defp foo(), do: IO.puts("foo")
  defp bar(), do: IO.puts("bar")
end

I’m very curious what’s motivating the preference for pattern matching here; it’s not the right tool for the job.


TBH I would say that Regex is overkill there. And your pattern match can be improved:

def somefun(<<num>> <> rest) when num in ?0..?9, do: somefun(rest)
def somefun(""), do: bar()
def somefun(bin) when is_binary(bin), do: foo()

The problem with regular expressions that use a backtracking engine (and PCRE is such an engine) is that matching can explode to O(n²) or even exponential time with some expressions (namely nested wildcard matches). Such cases have been spotted in the wild with pretty simple expressions.
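A quick sketch of such a blow-up, using the textbook nested-quantifier example (not an expression from this thread). On a backtracking engine each extra `a` roughly doubles the work, so keep `n` small when experimenting:

```elixir
# ^(a+)+$ has a nested quantifier; the trailing "!" forces the match to
# fail, so a backtracking engine retries every way of splitting the "a"s
# between the inner and outer repetition before giving up.
n = 20
subject = String.duplicate("a", n) <> "!"

{micros, result} = :timer.tc(fn -> Regex.match?(~r/^(a+)+$/, subject) end)
IO.puts("match? #{result} after #{micros} µs for n=#{n}")
```

The single-pass pattern-matching approach above, by contrast, looks at each byte exactly once.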


Well, I think there is no pretty solution here. Though Hauleth presented something that would work, I now see that the best way is through a well-crafted regex. Thank you all for the answers.

I have a package called ex_cldr_unicode that includes guards for

These all operate on the Unicode character classes so it covers what passes for a digit in a more complete sense. It might help or give you some ideas.

Note that it works on code points since there is a limited set of underlying functions that can be used in guards.

There is also a set of functions that might be helpful, including Cldr.Unicode.alphanumeric?/1, which returns a boolean and also uses the full Unicode definitions (not just Latin-1):

iex> Cldr.Unicode.alphanumeric? "1st"
true

iex> Cldr.Unicode.alphanumeric? "KeyserSöze1995"
true

iex> Cldr.Unicode.alphanumeric? "3段"                 
true

The above might be written as:

defmodule RegexRecursion do
  Enum.each(?0..?9, fn char ->
    defp call_bar(<<unquote(char), rest::binary>>), do: call_bar(rest)
  end)
end

Is there a way to use guard clauses from Erlang? I just found this in the Erlang masterclass course from the University of Kent and I'm trying to solve the problems in Elixir; it looks similar to the OP's question.

https://hexdocs.pm/elixir/guards.html

e.g.

def parse([ch | rest]) when ?a <= ch and ch <= ?z do
  {succeeds, remainder} = get_while(&is_alpha/1, rest)
  {{:var, List.to_atom([ch|succeeds])}, remainder}
end

Thanks @peerreynders. I found a workaround which can probably also be used (I'm using double-quoted strings for it):

defmodule Guards do
  defguard is_lower(ch) when ch in ~w(q w e r t y u i o p a s d f g h j k l z x c v b n m)
  defguard is_digit(ch) when ch in ~w(1 2 3 4 5 6 7 8 9 0)
end

That checks whether a one-character string (a binary) is in that list, not whether a character (code point) is one of those characters:

iex(1)> ~w(q w e r t y u i o p a s d f g h j k l z x c v b n m)
["q", "w", "e", "r", "t", "y", "u", "i", "o", "p", "a", "s", "d", "f", "g", "h",
 "j", "k", "l", "z", "x", "c", "v", "b", "n", "m"]
iex(2)> ?a in ~w(q w e r t y u i o p a s d f g h j k l z x c v b n m)
false


I was comparing just the first letter in a string and parsing it using recursion. This can be done:

iex(14)> ?a in ~c(a b c)
true

but I need to switch from strings to charlists because it's easier to follow the course :slight_smile:

It would be more readable as:

defmodule Guards do
  defguard is_lower(ch) when ch in ?a..?z
  defguard is_digit(ch) when ch in ?0..?9
end
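A usage sketch of those guards in function heads (the `Classify` module and `classify/1` are hypothetical names for illustration; the guards operate on code points, so a single byte is matched off the front of the binary):

```elixir
defmodule Guards do
  defguard is_lower(ch) when ch in ?a..?z
  defguard is_digit(ch) when ch in ?0..?9
end

defmodule Classify do
  import Guards

  # The guards take a code point, so bind the first byte of the binary.
  def classify(<<ch, _rest::binary>>) when is_lower(ch), do: :lower
  def classify(<<ch, _rest::binary>>) when is_digit(ch), do: :digit
  def classify(_), do: :other
end

Classify.classify("abc")  # => :lower
Classify.classify("123")  # => :digit
Classify.classify("ABC")  # => :other
```

Note that defguard generates a macro, so the defining module must be imported (or required and called qualified) before use in a guard position.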

Technically that's different too: they had a list of binaries, yours is a list of characters. :slight_smile:


That charlist also contains two spaces, since ~c does not split on whitespace the way ~w does…

I know I'm probably more focused on i18n than many, but Elixir strings are Unicode strings. So if the use case is ASCII-only, then I think the intent would be clearer if the guards were called is_ascii_lower/1 and is_ascii_digit/1.

Unicode has 2,151 lower case and 630 digit characters as of Unicode 12.1.


Oh yeah! Plus we should not forget about combining diacritics, should we?

String.normalize("ä", :nfc) == String.normalize("ä", :nfd)
#⇒ false

While technically naïve should be considered a word, regardless of whether it's composed or decomposed.

Not forgotten! Both forms have canonical equivalence, but they're not identical, as you say. Hence it's quite important to normalise to :nfc before checking casing for consistent results (unless implementing a full casing algorithm that is normal-form independent). By the way, this is not only an issue for diacritics; it also applies to Hangul. And not all decompositions are canonically equivalent.

For example, half-width and full-width katakana characters have the same compatibility decomposition and are thus compatibility equivalents; however, they are not canonical equivalents. They also aren't cased, so at least that's not an issue here.
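The normalisation point above can be illustrated with the "ä" example (written here with escape sequences so the source encoding doesn't matter):

```elixir
composed = "\u00E4"      # "ä" as a single precomposed code point
decomposed = "a\u0308"   # "a" followed by U+0308 COMBINING DIAERESIS

composed == decomposed                          # => false, different code points
String.normalize(decomposed, :nfc) == composed  # => true, canonically equivalent
String.equivalent?(composed, decomposed)        # => true, normalizes internally
```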
