How to port this regular expression from JavaScript

Hi all, I need to port this from JavaScript to Elixir:

/[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]/

When I try:

Regex.compile!("[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]", "u")

I get:

** (ArgumentError) invalid or reserved Unicode codepoint 55296
    (elixir) src/elixir_interpolation.erl:200: :elixir_interpolation.append_codepoint/5
    (elixir) src/elixir_interpolation.erl:81: :elixir_interpolation."-unescape_tokens/2-lc$^0/1-0-"/2
    (elixir) src/elixir_interpolation.erl:81: :elixir_interpolation.unescape_tokens/2
    (elixir) src/elixir_tokenizer.erl:673: :elixir_tokenizer.handle_strings/6
    (elixir) lib/code.ex:669: Code.string_to_quoted/2

and when I try

~r/[\0-\x{D7FF}\x{E000}-\x{FFFF}]|[\x{D800}-\x{DBFF}][\x{DC00}-\x{DFFF}]|[\x{D800}-\x{DBFF}](?![\x{DC00}-\x{DFFF}])|(?:[^\x{D800}-\x{DBFF}]|^)[\x{DC00}-\x{DFFF}]/u

I get

** (Regex.CompileError) disallowed Unicode code point (>= 0xd800 && <= 0xdfff) at position 39
    (elixir) lib/regex.ex:172: Regex.compile!/2
    (elixir) expanding macro: Kernel.sigil_r/2
    iex:43: (file)

Looks like the problem is with \uD800. Does anybody have an idea how to solve it?

Looks like you are trying to find surrogate pairs? Elixir strings are UTF-8 encoded binaries, and AFAIK UTF-8 is not allowed to contain surrogate code points (U+D800 to U+DFFF), so a binary containing them would not be a valid string. Since Regex operates on Elixir strings, it does not make sense to ask it to find them.
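To illustrate the point about surrogates, here is a minimal sketch that can be pasted into iex. The variable names are only illustrative, and the three bytes are an assumption about what a naive encoder would emit for U+D800.

```elixir
# U+D800 (a high surrogate) written out as the three bytes a naive UTF-8
# encoder would produce. It is a binary, but not a valid Elixir string.
lone_surrogate = <<0xED, 0xA0, 0x80>>
surrogate_valid = String.valid?(lone_surrogate)

# An astral character (U+1F600) is one code point stored in four bytes,
# not a surrogate pair as in JavaScript's UTF-16 strings.
smiley = "\u{1F600}"
smiley_codepoints = String.to_charlist(smiley)
smiley_bytes = byte_size(smiley)

# With the `u` modifier, `.` consumes the whole code point at once.
smiley_match = Regex.run(~r/./u, smiley)

IO.inspect({surrogate_valid, smiley_codepoints, smiley_bytes, smiley_match})
```

So the surrogate branches of the JavaScript pattern have nothing to match against in a valid Elixir string.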

What is the actual thing you are trying to accomplish here?

Thanks!

What is the actual thing you are trying to accomplish here?

I need to turn plain-text links into HTML links (linkify). I’m trying to port https://github.com/markdown-it/linkify-it.

Where is that regex there?

I think you can avoid running it if it only tries to detect and skip surrogate pairs (since they can’t be in Elixir strings), but it would be nice if someone more knowledgeable than me replies also.
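If the pattern really just means "any single code point", one possible stand-in (a sketch, not something taken from linkify-it) is a dot with dotall enabled, compiled with the `u` modifier. The names `src_any` and `three_any` below are hypothetical.

```elixir
# Hypothetical source fragment standing in for the JavaScript pattern.
# (?s:.) turns on dotall for just this dot, so it also matches "\n".
src_any = "(?s:.)"

any = Regex.compile!(src_any, "u")
scanned = Regex.scan(any, "a\u{1F600}\n")

# The fragment can be interpolated into a larger pattern, similar to how
# the JavaScript library concatenates regex source strings.
three_any = Regex.compile!("^#{src_any}{3}$", "u")
three_match = Regex.match?(three_any, "a\u{1F600}\n")

IO.inspect(scanned)
IO.inspect(three_match)
```

Each element of `scanned` is one whole code point, including the astral character and the newline.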

Where is that regex there?

https://github.com/markdown-it/linkify-it/blob/master/lib/re.js#L8

https://github.com/markdown-it/uc.micro/blob/master/properties/Any/regex.js

I think you can avoid running it if it only tries to detect and skip surrogate pairs (since they can’t be in Elixir strings), but it would be nice if someone more knowledgeable than me replies also.

Thanks, I hope you’re right.

Perhaps this could be useful?

Thanks! Looks good, but it is too simple for my case (it doesn’t support URLs with http://, internationalized domain names, etc.).

Perhaps a set of PRs to buff it up could be useful? It could be a great generic library for handling such things.

Opened a PR: https://github.com/smpallen99/auto_linker/pull/1

Ooo very cool, hope it gets accepted soon!
