egor
January 31, 2019, 12:48pm
1
Hi all, I need to port this from JavaScript to Elixir:
/[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]/
When I try:
Regex.compile!("[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]", "u")
I get:
** (ArgumentError) invalid or reserved Unicode codepoint 55296
(elixir) src/elixir_interpolation.erl:200: :elixir_interpolation.append_codepoint/5
(elixir) src/elixir_interpolation.erl:81: :elixir_interpolation."-unescape_tokens/2-lc$^0/1-0-"/2
(elixir) src/elixir_interpolation.erl:81: :elixir_interpolation.unescape_tokens/2
(elixir) src/elixir_tokenizer.erl:673: :elixir_tokenizer.handle_strings/6
(elixir) lib/code.ex:669: Code.string_to_quoted/2
and when I try
~r/[\0-\x{D7FF}\x{E000}-\x{FFFF}]|[\x{D800}-\x{DBFF}][\x{DC00}-\x{DFFF}]|[\x{D800}-\x{DBFF}](?![\x{DC00}-\x{DFFF}])|(?:[^\x{D800}-\x{DBFF}]|^)[\x{DC00}-\x{DFFF}]/u
I get
** (Regex.CompileError) disallowed Unicode code point (>= 0xd800 && <= 0xdfff) at position 39
(elixir) lib/regex.ex:172: Regex.compile!/2
(elixir) expanding macro: Kernel.sigil_r/2
iex:43: (file)
Looks like the problem is with \uD800. Does anybody have an idea how to solve it?
Nicd
January 31, 2019, 1:15pm
2
Looks like you are trying to find surrogate pairs? Elixir strings are UTF-8, which AFAIK is not allowed to contain surrogate code points (U+D800 to U+DFFF), so binaries containing them would not be valid strings. And since Regex operates on Elixir strings, it does not make sense to ask it to find them.
What is the actual thing you are trying to accomplish here?
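To illustrate the point about surrogates, here is a minimal sketch. The variable names are only for illustration, and the three bytes are what U+D800 would look like if UTF-8 were allowed to encode it:

```elixir
# Hypothetical byte sequence: U+D800 forced into the 3-byte UTF-8 layout.
lone_surrogate = <<0xED, 0xA0, 0x80>>

# Elixir rejects it, because surrogates are not valid in UTF-8.
surrogate_valid? = String.valid?(lone_surrogate)
# false

# A character outside the BMP (two UTF-16 code units in JavaScript)
# is a single code point in an Elixir string.
emoji = "😀"
emoji_valid? = String.valid?(emoji)
# true
emoji_codepoints = String.to_charlist(emoji)
# [128512]
emoji_bytes = byte_size(emoji)
# 4
```

So the surrogate-pair branches of the JavaScript pattern have nothing to match against in a valid Elixir string.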
egor
January 31, 2019, 1:22pm
3
Thanks!
Nicd:
What is the actual thing you are trying to accomplish here?
I need to turn plain-text links into HTML links. I’m trying to port https://github.com/markdown-it/linkify-it.
Nicd
January 31, 2019, 1:24pm
4
Where is that regex there?
I think you can avoid running it if it only tries to detect and skip surrogate pairs (since they can’t be in Elixir strings), but would be nice if someone more knowledgeable than me replies also.
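If the JavaScript pattern is only there to mean "any single character", a rough Elixir equivalent might look like the sketch below. The module and function names are made up for this example, and I have not checked it against everything linkify-it does with that pattern:

```elixir
defmodule AnyChar do
  # Hypothetical helper, not part of linkify-it or of any Elixir library.

  # With the `u` flag, `.` matches one whole code point; `s` lets it match "\n" too.
  def regex, do: ~r/./su

  # The same idea spelled out as explicit ranges that skip U+D800..U+DFFF.
  def explicit_regex, do: ~r/[\x{0}-\x{D7FF}\x{E000}-\x{10FFFF}]/u

  # Returns every match as a flat list of one-character strings.
  def scan(text, regex \\ regex()) do
    regex |> Regex.scan(text) |> List.flatten()
  end
end

AnyChar.scan("a\n😀")
# ["a", "\n", "😀"]
```

Both patterns leave the surrogate range out entirely, which is why they compile where the direct port does not.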
egor
January 31, 2019, 1:33pm
5
Nicd:
Where is that regex there?
https://github.com/markdown-it/linkify-it/blob/master/lib/re.js#L8
https://github.com/markdown-it/uc.micro/blob/master/properties/Any/regex.js
Nicd:
I think you can avoid running it if it only tries to detect and skip surrogate pairs (since they can’t be in Elixir strings), but would be nice if someone more knowledgeable than me replies also.
Thanks, I hope you’re right.
Perhaps this could be useful?
egor
February 1, 2019, 4:57am
7
Thanks! Looks good, but too simple for my case (doesn’t support urls with http://, internationalized domain names, etc).
egor:
Thanks! Looks good, but too simple for my case (doesn’t support urls with http://, internationalized domain names, etc).
Perhaps a set of PRs to buff it up could be useful? It could be a great generic library for handling such things.
egor
February 5, 2019, 2:05pm
9
Ooo very cool, hope it gets accepted soon!