Regex and unicode returning wrong indexes

Tags: regexp, unicode, regex

#1

Hello,

I need to find the position of a match using Regex.run on a text with unicode characters.

This is an example:

iex> Regex.run(~r/x/u, "áéíóúx", return: :index)
[{10, 1}]

But I expected the output to be: [{5, 1}]

Shouldn’t the unicode modifier take care of this? How could I get the right result?

Thanks!


#2

how could I get the right result?

For exactly this example

iex(1)> [before, _after] = :binary.split("áéíóúx", "x")
["áéíóú", ""]
iex(2)> String.length(before)
5

seems to work.
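If the pattern might be absent, the same idea needs a fallback, because matching the result against a two-element list raises a MatchError when nothing is found. A minimal sketch, where SplitSketch and index_of/2 are names made up for illustration:

```elixir
defmodule SplitSketch do
  # Hypothetical helper: character index of the first occurrence of
  # `pattern` in `text`, or nil when there is no occurrence.
  def index_of(text, pattern) do
    # Without the :global option, :binary.split/2 splits only at the first
    # occurrence, so the result has either one or two elements.
    case :binary.split(text, pattern) do
      [before, _rest] -> String.length(before)
      [_unsplit] -> nil
    end
  end
end

IO.inspect(SplitSketch.index_of("áéíóúx", "x"))
IO.inspect(SplitSketch.index_of("áéíóúx", "z"))
```

The first call should print 5 and the second nil. An empty pattern is not handled here.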


#3

The index returned is a byte index, and that is ideal for further processing alongside the original input, e.g. when cutting out the matched substring: indexing by byte is O(1), whereas indexing by character/glyph is O(n).
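As a small illustration of using the byte index directly (the variable names are only for this sketch), the tuple from Regex.run can be passed straight to binary_part/3:

```elixir
text = "áéíóúx"

# With return: :index, Regex.run yields {byte_offset, byte_length} tuples.
[{offset, len}] = Regex.run(~r/x/u, text, return: :index)

# binary_part/3 also works in bytes, so no conversion is needed.
match = binary_part(text, offset, len)

IO.inspect({offset, len, match})
```

This should print {10, 1, "x"}.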


#4
iex(1)> byte_size("áéíóúx")
11

Looks correct to me.

It returns byte positions, not character positions, because UTF-8 is a multibyte encoding.
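If a character position is really needed, one way to get it is to count the characters in the bytes that precede the match. A rough sketch, assuming the byte offset falls on a character boundary (it does for offsets returned by a /u regex on valid UTF-8); IndexSketch and char_index/2 are hypothetical names:

```elixir
defmodule IndexSketch do
  # Hypothetical helper: converts a byte offset into a character (grapheme)
  # index by measuring the prefix that ends at that offset. This walks the
  # prefix, so it is O(n) in the offset.
  def char_index(text, byte_offset) do
    text
    |> binary_part(0, byte_offset)
    |> String.length()
  end
end

text = "áéíóúx"
[{byte_offset, _len}] = Regex.run(~r/x/u, text, return: :index)

IO.inspect({byte_size(text), String.length(text)})
IO.inspect(IndexSketch.char_index(text, byte_offset))
```

This should print {11, 6} and then 5.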


#5

Or maybe

defmodule Test do

  @spec find_index(String.t, String.t) :: non_neg_integer | nil
  def find_index(target, <<char::utf8>>) do
    find_index(target, char, 0)
  end

  @spec find_index(String.t, pos_integer, non_neg_integer) :: non_neg_integer | nil
  defp find_index(<<target::utf8, _rest::bytes>>, target, current_index) do
    current_index
  end
  defp find_index(<<_target::utf8, rest::bytes>>, target, current_index) do
    find_index(rest, target, current_index + 1)
  end
  defp find_index(<<>>, _target, _current_index) do
    nil
  end
end
iex(5)> Test.find_index("áéíóúx", "x")
5
iex(6)> Test.find_index("áéíóúx", "í")
2

This approach might be a bit faster than the one I suggested above. There’s probably a cleaner way, though.
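One candidate for a shorter version, not necessarily a faster one since it builds an intermediate list, is to split the string into graphemes and use Enum.find_index/2. Note this sketch compares graphemes rather than codepoints, and GraphemeSketch is a made-up module name:

```elixir
defmodule GraphemeSketch do
  # Returns the grapheme index of the first occurrence of `char`
  # (expected to be a single grapheme), or nil if it does not occur.
  def find_index(text, char) do
    text
    |> String.graphemes()
    |> Enum.find_index(&(&1 == char))
  end
end

IO.inspect(GraphemeSketch.find_index("áéíóúx", "x"))
IO.inspect(GraphemeSketch.find_index("áéíóúx", "í"))
```

This should print 5 and 2, matching the results above.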


#6

Thank you all for your answers. @idiot, your solution looks very good to me. In the end I did something inspired by it, but simpler for my use case.

I needed to replace a tag in the text, so I had to find the tag name and replace it with the new value. It is a kind of template engine where I can substitute values into “Hello {{ first_name }}”.

The engine first finds a tag, taking its name and byte indexes, and then replaces it with the corresponding value, repeating until no tags are left.

This is the code I used:

@regex ~r/{{\s*([a-zA-Z0-9_ ]*?)\s*}}/iu

def replace_tags(text, data) when is_binary(text) and is_map(data) do
  result = Regex.run(@regex, text, return: :index)

  case result do
    nil ->
      text

    [{tag_start, tag_len}, {start, len}] ->
      tag_name = slice_and_clean_tags(text, start, len)

      value = Map.get(data, tag_name, "")

      replace_binary(text, tag_start, tag_len, value)
      |> replace_tags(data)
  end
end

def replace_tags(text, _) when is_binary(text), do: text
def replace_tags(_, _), do: ""

defp slice_and_clean_tags(text, start, len) do
  <<_before::binary-size(start), tag::binary-size(len), _after::binary>> = text

  slugify(tag)
end

defp replace_binary(text, start, len, value)
     when is_binary(text) and is_integer(start) and is_integer(len) do
  until = start + len
  <<first::binary-size(start), _::binary>> = text
  <<_::binary-size(until), rest::binary>> = text
  first <> value <> rest
end

defp replace_binary(text, _, _, _), do: text

The important line here is:

<<_before::binary-size(start), tag::binary-size(len), _after::binary>> = text

Thanks a lot for your help!
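For comparison, a different technique that avoids manual index handling is Regex.replace/3 with a replacement function. This is only a sketch under some assumptions: TemplateSketch is a made-up module, String.trim/1 stands in for the slugify helper (which is not shown in this thread), and the braces are escaped explicitly. Unlike the recursive version, substituted values are not scanned again for tags.

```elixir
defmodule TemplateSketch do
  # Same tag pattern as above, with the braces escaped explicitly.
  defp tag_regex, do: ~r/\{\{\s*([a-zA-Z0-9_ ]*?)\s*\}\}/u

  def replace_tags(text, data) when is_binary(text) and is_map(data) do
    # The function receives the whole match and the first capture group.
    Regex.replace(tag_regex(), text, fn _whole, name ->
      # String.trim/1 is an assumption standing in for slugify/1.
      Map.get(data, String.trim(name), "")
    end)
  end
end

IO.puts(TemplateSketch.replace_tags("Hello {{ first_name }}!", %{"first_name" => "José"}))
```

This should print "Hello José!"; a tag with no entry in the map is replaced by an empty string.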


#7

Have you considered using https://github.com/plataformatec/nimble_parsec?


#8

I didn’t know that existed! It’s probably the way to go.