Is this a bug in safe_to_string()?

I wrote this function to capture hashtags and convert them into links:

  def link_caption_hashtags(caption) when is_binary(caption) do
    Regex.replace(~r/#([[:alnum:]]+)/, caption, fn tag, word ->
      link(tag, to: "#{@tag_explorer_uri}#{word}")
      |> safe_to_string()
    end)
    |> raw()
  end

If you pass it the caption "I am a caption with a #hashtag", you get back: I am a caption with a <a href="http://foo.com/hashtag">#hashtag</a>

However, it breaks down when passed a hashtag with diacritics (accents). It seems that safe_to_string() converts accented characters into unknown characters, which are displayed as a diamond with a question mark. So it returns #�.

Is this a bug? Is there a better way to do what I want?

Phoenix.HTML.safe_to_string/1 calls IO.iodata_to_binary/1, which is not Unicode safe according to its documentation.

But for this purpose it should be enough to write your own small helper:

defp my_safe_to_string({:safe, str}), do: IO.chardata_to_string(str)

But in theory, even though it is not Unicode safe, it shouldn't replace a valid codepoint with U+FFFD… Have you tried adding some IO.inspect calls in your Regex function to check whether the problem is already there?

Also, which versions of Erlang, Elixir and Phoenix are you using? That might give a clue as well.

Maybe the documentation needs clarification, but iodata_to_binary is Unicode safe for binaries, since binaries are passed through as raw bytes. It is not Unicode safe for charlists, because integers are treated as bytes, not codepoints. If you think the documentation could be better, a PR is definitely appreciated.
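
To make that distinction concrete, here is a minimal sketch using only IO.iodata_to_binary/1, IO.chardata_to_string/1 and String.valid?/1. The variable names are just for illustration:

```elixir
# A binary that holds UTF-8 text passes through unchanged.
utf8 = IO.iodata_to_binary(["#", "á"])
IO.inspect(utf8 == "#á")            # true

# In a charlist, 225 is the codepoint of "á" (U+00E1).
# iodata_to_binary/1 emits it as the single byte 225, which is not valid UTF-8.
as_bytes = IO.iodata_to_binary([?#, 225])
IO.inspect(String.valid?(as_bytes)) # false

# chardata_to_string/1 treats integers as codepoints and encodes them as UTF-8.
as_text = IO.chardata_to_string([?#, 225])
IO.inspect(as_text == "#á")         # true
```

So if the iodata only contains binaries and ASCII integers, iodata_to_binary should not damage anything that was valid going in.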

Right. At home I’m using Elixir 1.5, Phoenix 1.3. On this computer, Elixir 1.5.2, Phoenix 1.3. Here’s an example from this machine. For the string:

Hashtag with #áccentš

    res =
      Regex.replace(~r/#([[:alnum:]]+)/, caption, fn tag, word ->
        link(tag, to: "#{@tag_explorer_uri}#{word}")
      end)
    IO.inspect res

Outputs:

["Hashtag with ", {:safe, [60, "a", [[32, "href", 61, 34, <<104, 116, 116, 112, 115, 58, 47, 47, 119, 119, 119, 46, 105, 110, 115, 116, 97, 103, 114, 97, 109, 46, 99, 111, 109, 47, 101, 120, 112, 108, 111, 114, 101, 47, 116, 97, 103, 115, 47, ...>>, 34]], 62, <<35, 195>>, 60, 47, "a", 62]} | <<161, 99, 99, 101, 110, 116, 197, 161>>]

Once it’s fed to safe_to_string(), this is the output:

<<72, 97, 115, 104, 116, 97, 103, 32, 119, 105, 116, 104, 32, 60, 97, 32, 104,
  114, 101, 102, 61, 34, 104, 116, 116, 112, 115, 58, 47, 47, 119, 119, 119, 46,
  105, 110, 115, 116, 97, 103, 114, 97, 109, 46, 99, 111, 109, 47, 101, 120,
  ...>>

And when it’s passed to raw(), I get this:

Hashtag with <a href="https://www.instagram.com/explore/tags/�">#�</a>�ccentš

Interestingly, only the á is mangled. The š has survived. Why?
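
Looking at the IO.inspect output above, the damage seems to be there before safe_to_string() runs: the matched tag is <<35, 195>>, that is "#" plus only the first byte of "á", and the second byte (161) is left at the start of the unmatched remainder. My assumption is that the regex is matching byte by byte and considers byte 195 alphanumeric but not byte 161. I have not verified why. A small sketch to check the byte split, using only the bytes copied from that output:

```elixir
# Bytes copied from the inspected result above.
matched = <<35, 195>>
rest = <<161, 99, 99, 101, 110, 116, 197, 161>>

# "á" is two bytes in UTF-8, and the match ends between them.
IO.inspect("á" == <<195, 161>>)            # true
IO.inspect(String.valid?(matched))         # false
IO.inspect(String.valid?(rest))            # false

# Glued back together, the bytes are still the original tag.
IO.inspect(matched <> rest == "#áccentš")  # true

# "š" was never part of the match, so its two bytes stay adjacent.
IO.inspect(binary_part(rest, byte_size(rest) - 2, 2) == "š")  # true
```

Once the link markup is placed between bytes 195 and 161, each half is an invalid sequence on its own, which would explain the two � characters, while š is untouched.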

If you’re using a regex with Unicode replacement, don’t you need to use the u flag at the end? Or is that not relevant here?
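
A quick way to try the u modifier without Phoenix is sketched below. HashtagDemo, tags/1 and mark_tags/1 are made-up names for this example, and the square brackets are only a stand-in for the real link/2 call:

```elixir
defmodule HashtagDemo do
  # Returns every hashtag found in the caption.
  # The trailing `u` makes the regex match on Unicode codepoints
  # instead of raw bytes.
  def tags(caption) when is_binary(caption) do
    ~r/#([[:alnum:]]+)/u
    |> Regex.scan(caption, capture: :first)
    |> List.flatten()
  end

  # Same replace shape as in the question, but wrapping each tag in
  # square brackets instead of building an <a> element.
  def mark_tags(caption) when is_binary(caption) do
    Regex.replace(~r/#([[:alnum:]]+)/u, caption, fn tag, _word ->
      "[" <> tag <> "]"
    end)
  end
end

caption = "Hashtag with #áccentš"
IO.inspect(HashtagDemo.tags(caption))
IO.inspect(HashtagDemo.mark_tags(caption))
IO.inspect(String.valid?(HashtagDemo.mark_tags(caption)))
```

If the whole tag comes back intact here, the same one-character change (adding u after the closing slash) should be worth trying in link_caption_hashtags/1.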
