Let’s take this string: "before 🇨🇴 after"
I can slice the flag out with String.slice(str, 7, 1), which counts graphemes, or with :binary.part(str, 7, 8), which counts bytes.
However, the offsets I’m receiving are based on codepoints. Is there a better way to handle codepoint slicing than the following?
codepoint_slice = fn string, offset, length ->
  string
  |> to_charlist()
  |> Enum.drop(offset)
  |> Enum.take(length)
  |> to_string()
end

codepoint_slice.(str, 7, 2)
# => "🇨🇴"
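For reference, here is a quick sketch of how the three units differ on this string. The counts in the comments assume the flag is encoded as the two regional indicator codepoints U+1F1E8 and U+1F1F4, which is why the codepoint length is 2 while the grapheme length is 1.

```elixir
str = "before 🇨🇴 after"

# Graphemes: the flag is a single user-perceived character.
grapheme_count = String.length(str)
# => 14

# Codepoints: the flag is two regional indicator symbols.
codepoint_count = length(String.codepoints(str))
# => 15

# Bytes: each regional indicator takes 4 bytes in UTF-8.
byte_count = byte_size(str)
# => 21
```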
Copying the Elixir standard library’s String.slice implementation and changing :unicode_util.gc to :unicode_util.cp does yield much faster results, but because it relies on private functions I have to carry all of this code:
defmodule String2 do
  def slice(_, _, 0) do
    ""
  end

  def slice(string, start, length)
      when is_binary(string) and is_integer(start) and is_integer(length) and start >= 0 and
             length >= 0 do
    do_slice(string, start, length)
  end

  def slice(string, start, length)
      when is_binary(string) and is_integer(start) and is_integer(length) and start < 0 and
             length >= 0 do
    # A negative start counts back from the end, measured in codepoints.
    # A bare length(string) would resolve to Kernel.length/1 here, which only accepts lists.
    start = max(length(String.codepoints(string)) + start, 0)
    do_slice(string, start, length)
  end

  # Returns how many bytes remain after skipping n codepoints.
  defp byte_size_remaining_at(unicode, 0) do
    byte_size_unicode(unicode)
  end

  defp byte_size_remaining_at(unicode, n) do
    case :unicode_util.cp(unicode) do # <- only change
      [_] -> 0
      [_ | rest] -> byte_size_remaining_at(rest, n - 1)
      [] -> 0
      {:error, <<_, bin::bits>>} -> byte_size_remaining_at(bin, n - 1)
    end
  end

  defp do_slice(string, start, length) do
    from_start = byte_size_remaining_at(string, start)
    rest = binary_part(string, byte_size(string) - from_start, from_start)

    from_end = byte_size_remaining_at(rest, length)
    binary_part(rest, 0, from_start - from_end)
  end

  defp byte_size_unicode(binary) when is_binary(binary), do: byte_size(binary)
  defp byte_size_unicode([head]), do: byte_size_unicode(head)
  defp byte_size_unicode([head | tail]), do: byte_size_unicode(head) + byte_size_unicode(tail)
end
I don’t quite get your goal though – what exactly are you after? Are you looking for national flags?
Also, you likely know this, but String has the codepoints function, which I’d use instead of to_charlist (more explicit IMO).
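A minimal sketch of that variant, assuming the same offset and length semantics as the original anonymous function (the name codepoint_slice is carried over from the question, and Enum.slice/3 stands in for the drop/take pair):

```elixir
codepoint_slice = fn string, offset, len ->
  string
  # Split into a list of single-codepoint binaries.
  |> String.codepoints()
  # Keep `len` codepoints starting at the zero-based `offset`.
  |> Enum.slice(offset, len)
  # Concatenate the pieces back into one binary.
  |> Enum.join()
end

flag = codepoint_slice.("before 🇨🇴 after", 7, 2)
# => "🇨🇴"
```

This still builds an intermediate list for the whole string, so it is mainly a readability change and not a performance one.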
In terms of performance optimization, have you tried the next_codepoint function?
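As a rough sketch of what that could look like: the module below walks the binary with String.next_codepoint/1 and only extracts the final slice with binary_part/3, so it avoids building a list. The module name CodepointSlice and the helper skip/2 are hypothetical, it handles non-negative offsets only, and whether it is actually faster than the copied String.slice variant would need benchmarking.

```elixir
defmodule CodepointSlice do
  # Hypothetical helper module, not part of Elixir's standard library.

  # Returns `len` codepoints of `string`, starting at codepoint index `start`.
  def slice(string, start, len)
      when is_binary(string) and is_integer(start) and start >= 0 and
             is_integer(len) and len >= 0 do
    rest = skip(string, start)
    # Bytes to keep = bytes before the cut minus bytes left after skipping `len` more.
    kept_bytes = byte_size(rest) - byte_size(skip(rest, len))
    binary_part(rest, 0, kept_bytes)
  end

  # Drops up to `n` codepoints from the front and returns the remaining binary.
  defp skip(string, 0), do: string

  defp skip(string, n) do
    case String.next_codepoint(string) do
      {_codepoint, rest} -> skip(rest, n - 1)
      # The string ended before `n` codepoints were consumed.
      nil -> ""
    end
  end
end

CodepointSlice.slice("before 🇨🇴 after", 7, 2)
# => "🇨🇴"
```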