What's the best way to slice a string given a codepoint offset/length?

pejrich · August 4, 2022, 2:09am

Let’s take this string: "before 🇨🇴 after"

I can slice the flag out with String.slice(str, 7, 1) or :binary.part(str, 7, 8)
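To make the three units concrete, here is a small sketch that measures the same string in graphemes, codepoints, and bytes. It assumes the flag is encoded as the two regional-indicator codepoints U+1F1E8 and U+1F1F4, which is how it appears in the example string.

```elixir
str = "before 🇨🇴 after"

# The flag is two codepoints of 4 bytes each in UTF-8,
# but the pair is treated as a single grapheme.
String.length(str)               # => 14 (graphemes)
length(String.codepoints(str))   # => 15 (codepoints)
byte_size(str)                   # => 21 (bytes)

# Same slice, expressed in grapheme units and in byte units.
String.slice(str, 7, 1)          # => "🇨🇴"
:binary.part(str, 7, 8)          # => "🇨🇴"
```

So an offset/length pair of 7 and 2 in codepoints corresponds to 7 and 1 in graphemes and 7 and 8 in bytes for this string.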

However, the values I’m receiving are based on codepoints. Is there a better way to handle codepoint slicing other than the following?

codepoint_slice = fn string, offset, length ->
  string
  |> to_charlist()
  |> Enum.drop(offset)
  |> Enum.take(length)
  |> to_string()
end

codepoint_slice.("before 🇨🇴 after", 7, 2)  # => "🇨🇴"

pejrich · August 4, 2022, 2:20am

Copying the Elixir standard library's String.slice implementation and changing :unicode_util.gc to :unicode_util.cp does yield much faster results, but because the helpers are private functions I have to carry all of this code:

defmodule String2 do
  def slice(_, _, 0) do
    ""
  end

  def slice(string, start, length)
      when is_binary(string) and is_integer(start) and is_integer(length) and start >= 0 and
             length >= 0 do
    do_slice(string, start, length)
  end

  def slice(string, start, length)
      when is_binary(string) and is_integer(start) and is_integer(length) and start < 0 and
             length >= 0 do
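    # Count codepoints here; a bare length(string) would call Kernel.length/1, which fails on a binary.
    start = max(length(String.codepoints(string)) + start, 0)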
    do_slice(string, start, length)
  end

  defp byte_size_remaining_at(unicode, 0) do
    byte_size_unicode(unicode)
  end

  defp byte_size_remaining_at(unicode, n) do
    case :unicode_util.cp(unicode) do                    # <- only change
      [_] -> 0
      [_ | rest] -> byte_size_remaining_at(rest, n - 1)
      [] -> 0
      {:error, <<_, bin::bits>>} -> byte_size_remaining_at(bin, n - 1)
    end
  end

  defp do_slice(string, start, length) do
    from_start = byte_size_remaining_at(string, start)
    rest = binary_part(string, byte_size(string) - from_start, from_start)

    from_end = byte_size_remaining_at(rest, length)
    binary_part(rest, 0, from_start - from_end)
  end

  defp byte_size_unicode(binary) when is_binary(binary), do: byte_size(binary)
  defp byte_size_unicode([head]), do: byte_size_unicode(head)
  defp byte_size_unicode([head | tail]), do: byte_size_unicode(head) + byte_size_unicode(tail)
end

dimitarvp · August 8, 2022, 1:56pm

I don’t quite get your goal though – what exactly are you after? Are you looking for national flags?

Also, you likely know this, but String has a codepoints function, which I’d use instead of to_charlist (more explicit IMO).
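As a rough sketch of that suggestion, the anonymous function from the first post could be written with String.codepoints and Enum.slice. The name codepoint_slice is carried over from the original post. This still builds an intermediate list of the whole string, so it is about readability rather than speed.

```elixir
codepoint_slice = fn string, offset, length ->
  string
  |> String.codepoints()          # list of one-codepoint binaries
  |> Enum.slice(offset, length)   # keep `length` codepoints starting at `offset`
  |> Enum.join()                  # glue them back into a single binary
end

codepoint_slice.("before 🇨🇴 after", 7, 2)  # => "🇨🇴"
```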

In terms of performance optimization, have you tried the next_codepoint function?
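For illustration, a minimal sketch of what a next_codepoint-based slice might look like. The module name CodepointSlice and the helper skip/2 are hypothetical, not part of the standard library, and only non-negative offsets are handled. The idea is to walk past the unwanted codepoints and then cut the result out by byte size, without building a list.

```elixir
defmodule CodepointSlice do
  # Hypothetical helper: slice `string` by codepoint offset and length.
  def slice(string, offset, length)
      when is_binary(string) and is_integer(offset) and is_integer(length) and
             offset >= 0 and length >= 0 do
    rest = skip(string, offset)
    # Bytes to keep = what remains after the offset minus what remains after the slice.
    binary_part(rest, 0, byte_size(rest) - byte_size(skip(rest, length)))
  end

  # Drop up to `n` codepoints from the front and return the remaining binary.
  defp skip(string, 0), do: string

  defp skip(string, n) do
    case String.next_codepoint(string) do
      {_codepoint, rest} -> skip(rest, n - 1)
      nil -> ""
    end
  end
end

CodepointSlice.slice("before 🇨🇴 after", 7, 2)  # => "🇨🇴"
```

Whether this beats the copied String.slice variant would need to be benchmarked; it is only meant to show the shape of the approach.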