Create a regexp from a string

Hello.
I need to create a RegExp from any string. I tried to use strings with special characters, e.g. "\d"
But when I try to compile this string, it escapes the symbols and shows something like ~r/\x7F/
The code to try:

regex_str = "^\d+,$"
Regex.compile(regex_str) # {:ok, ~r/^\x7F+,$/}

Thanks in advance for any suggestions!

That is because a double-quoted string literal processes escape sequences, and \d is the escape for the DEL character (0x7F). :slight_smile:

There are a few ways to turn that off; the usual one is the ~S (capital S) sigil, which disables both interpolation and escape sequences, thus:

iex> "\d"
"\d"
iex> ~s"\d"
"\d"
iex> ~S"\d"
"\\d"

:slight_smile:

EDIT1: Note that " basically behaves like an implied ~s before it. You can pick the delimiter from a fixed set if you want: ", ', / and a half dozen others are all valid, so ~s/blah/ == "blah" is true. :slight_smile:

EDIT2: Also, the ~r sigil both defines the pattern and compiles it in one step. Inside ~r the backslash is passed through to the regex engine, so it is written only once. Your original example could be done like:

iex> regex_str = "^\\d+,$"
"^\\d+,$"
iex> Regex.compile(regex_str)
{:ok, ~r/^\d+,$/}
iex> ~r"^\d+,$"
~r/^\d+,$/
iex> ~r/^\d+,$/
~r/^\d+,$/
iex> ~r{^\d+,$}
~r/^\d+,$/

The inspection protocol for compiled regexes just converts the compiled regex into sigil form for easy copy/pasting into the shell, but that is not how it is stored internally. The inspection protocol is there to make it easy for ‘you’ to read, not to show the internal representation. :slight_smile:
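To make the difference between the inspected form and the underlying pattern concrete, here is a small sketch. It only uses Regex.compile/1, Regex.source/1 and inspect/1; the variable names are just for illustration.

```elixir
# Compile from a plain string; in source code the backslash must be doubled.
{:ok, regex} = Regex.compile("^\\d+,$")

# inspect/1 renders the regex in sigil form, with a single backslash.
inspected = inspect(regex)
IO.puts(inspected)

# Regex.source/1 returns the pattern string the regex was built from.
source = Regex.source(regex)
IO.puts(source)

# The pattern is six bytes long: ^ \ d + , $
IO.inspect(byte_size(source))
```

Both IO.puts calls print a single backslash, because that is what the pattern actually contains; the doubled backslash only exists in Elixir source code.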

EDIT3: There are lots of sigils, you can even make your own, all documented at:

In my case I get a string from users input, and they expect to use a ‘normal’ regexp.
So I can’t use sigils, like ~S , because I can’t pass a variable to the macro.
I have this string “^\d” as an input from users (or a record from a DB) and I need to convert it to regexp.

Well doing regex_str = "^\d+,$" is most certainly not user input. ^.^
It would be more like regex_str = get_user_blah() or so. Strings are only escaped in ‘source code’, not from anywhere else. So if it is user input then it is already fine. :slight_smile:

The issue is that the content bound to regex_str was already escaped. It was not Regex.compile/1 doing the escaping, it was the " delimiters on the line above it.

It is exactly the same in JavaScript, C, C++, OCaml, etc… etc… etc…, almost every language out there. :slight_smile:

If you get \d from an external source, it will be represented as "\\d" as a string, and this will work fine with Regex.compile.

The Elixir string literal "\d" does not represent two characters, but a single one. A user entering text into, say, a text field does not use Elixir syntax, though, so when they type \d, the result is a two-byte string, which is written as "\\d" in Elixir source. When you hard-code a string in a test, you are using Elixir syntax, so the additional escaping is necessary there.
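A quick way to check this yourself is sketched below. The ~S sigil stands in for text typed by a user, since it keeps the backslash as-is; simulated_input and the other variable names are made up for the example.

```elixir
# In Elixir source, "\d" is the single DEL character (0x7F) ...
one_char = "\d"
# ... while "\\d" is a backslash followed by the letter d.
two_chars = "\\d"

IO.inspect(byte_size(one_char))
IO.inspect(byte_size(two_chars))

# ~S disables escape processing, so this mimics what arrives from a
# text field or a database column when the user typed ^\d+,$
simulated_input = ~S"^\d+,$"

# A runtime string is passed to Regex.compile/1 like any other variable.
{:ok, regex} = Regex.compile(simulated_input)

IO.inspect(Regex.match?(regex, "123,"))
IO.inspect(Regex.match?(regex, "abc,"))
```

In a test, writing the pattern as ~S"^\d+,$" or as "^\\d+,$" gives the same string a real user would have submitted.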

Thanks for the explanation @michalmuskala.
Indeed, I used it in tests and ran into this problem. Now it’s clear to me.

As a side note, be careful running user input as a regex. It’s very easy to create a pathologically expensive regex that will DoS your server.
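One possible way to limit the damage is to run the match in a separate process and give up after a timeout. The sketch below is only an illustration: SafeRegex and match_with_timeout/3 are hypothetical names, the 1000 ms default is arbitrary, and it relies on a long-running match being interruptible, which is worth verifying on your OTP version.

```elixir
defmodule SafeRegex do
  # Hypothetical helper, not part of Elixir: compiles a user-supplied
  # pattern and runs the match in a Task that is killed on timeout.
  def match_with_timeout(pattern, subject, timeout_ms \\ 1000) do
    # An invalid pattern falls through `with` as the {:error, _} tuple
    # returned by Regex.compile/1.
    with {:ok, regex} <- Regex.compile(pattern) do
      task = Task.async(fn -> Regex.match?(regex, subject) end)

      # Wait for the result; if none arrives in time, kill the task.
      case Task.yield(task, timeout_ms) || Task.shutdown(task, :brutal_kill) do
        {:ok, result} -> {:ok, result}
        {:exit, reason} -> {:error, reason}
        nil -> {:error, :timeout}
      end
    end
  end
end

IO.inspect(SafeRegex.match_with_timeout("^\\d+,$", "123,"))
```

This bounds how long a single request can be stuck on a match, but it does not cap memory use or stop a user from submitting many such patterns, so rate limiting would still be needed on top.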

You know, I just got to thinking: is it possible to create a function that does a rough calculation of the ‘cost’ of a compiled regex? That way we could deny user-supplied regexes that go above a certain ‘cost’. Sounds like a hard problem to catch all the detrimental cases… I wonder…

That sounds to me an awful lot like the halting problem.

To evaluate it fully, possibly. I was thinking more of just flagging known detrimental patterns, like ones that can match unbounded input with expensive loops and such.
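As a very rough illustration of the "known detrimental patterns" idea, the sketch below flags one classic shape only: a quantified group whose body also contains a quantifier, such as (a+)+. RegexLint and suspicious?/1 are hypothetical names, and the check is a textual heuristic, so it will miss many expensive patterns and can flag harmless ones (for example when the inner + is escaped).

```elixir
defmodule RegexLint do
  # Hypothetical heuristic, not a real cost model: looks for a group
  # without nested parentheses that contains + or * and is itself
  # followed by +, * or a {n,m} quantifier.
  def suspicious?(pattern) when is_binary(pattern) do
    nested_quantifier = ~r/\([^()]*[+*][^()]*\)[+*{]/
    Regex.match?(nested_quantifier, pattern)
  end
end

IO.inspect(RegexLint.suspicious?("(a+)+$"))
IO.inspect(RegexLint.suspicious?("^\\d+,$"))
```

A check like this would only make sense as a first filter in front of a hard limit such as a timeout, never as the sole protection.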