Converting hex chars to unicode equivalent

Hey there :wave:,

I fetch some HTML content with Req and extract plain text with Floki.

The retrieved text contains hex characters such as "Bient\xf4t" where I would expect "Bientôt" or the unicode equivalent "Bient\u00f4t"

I need the following test to pass:

"Bient\xf4t" |> some_magic() |> String.contains?("Bientôt") == true

\xf4 is a single-byte escape that Elixir accepts in string literals, but on its own that byte is not valid UTF-8, which is what Elixir expects strings to be. This kind of representation shows up in some languages that only support ASCII source but still need to represent text in other encodings.
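You can see the mismatch directly in IEx. The following is a minimal sketch; the key point is that `0xF4` is a single Latin-1 byte, while the UTF-8 encoding of "ô" takes two bytes:

```elixir
# "Bient\xf4t" contains the raw byte 0xF4 (Latin-1 "ô"),
# which is not a valid UTF-8 sequence on its own.
s = "Bient\xf4t"

String.valid?(s)
# => false (the 0xF4 byte is not valid UTF-8)
byte_size(s)
# => 7 (one byte per character)

# The proper UTF-8 encoding of "ô" is two bytes (0xC3 0xB4):
String.valid?("Bientôt")
# => true
byte_size("Bientôt")
# => 8
```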

Something like this might help:

defmodule XDecode do
  # Split the string into alternating chunks of valid UTF-8 and
  # invalid bytes, then re-encode the invalid bytes as UTF-8.
  def decode do
    "Bient\xf4t"
    |> String.chunk(:valid)
    |> decode_codepoints()
  end

  # A valid chunk followed by a chunk of invalid bytes: keep the valid
  # part, treat each invalid byte as a code point, and recurse.
  def decode_codepoints([utf8, codepoints | rest]) do
    utf8 <> List.to_string(:binary.bin_to_list(codepoints)) <> decode_codepoints(rest)
  end

  # A single trailing valid chunk.
  def decode_codepoints([utf8]) do
    utf8
  end

  # Nothing left to process.
  def decode_codepoints([]) do
    ""
  end
end
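To see why this works, here is a minimal sketch of the two building blocks on their own (assuming, as in the example string, that the invalid bytes happen to be Latin-1 and so coincide with Unicode code points):

```elixir
# String.chunk/2 with :valid splits a binary into alternating
# chunks of valid UTF-8 and invalid bytes:
String.chunk("Bient\xf4t", :valid)
# => ["Bient", <<0xF4>>, "t"]

# Each invalid chunk can then be turned into a list of byte values,
# which List.to_string/1 treats as code points and re-encodes as UTF-8:
List.to_string(:binary.bin_to_list(<<0xF4>>))
# => "ô"
```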

As you would expect, this will only work if the hex bytes in fact resolve to valid Unicode code points. Otherwise, as @hauleth says, you need a more generalised converter.

2 Likes

You need to use some library that provides translation between different codepages (you also need to know what the codepage of the input is). If it is Latin-1 then there is some stuff in OTP that can help you, but if it is another codepage, then you need to find a suitable conversion library.

1 Like

This is exactly what I need, thank you!

In this particular case (where it is Latin-1) you can use a built-in function instead of @kip’s code:

:unicode.characters_to_binary("Bient\xf4t", :latin1)

And you are good to go.
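With that, the original test from the question passes directly. A minimal check:

```elixir
# :unicode.characters_to_binary/2 reinterprets the input as Latin-1
# bytes and re-encodes them as UTF-8 (the default output encoding):
decoded = :unicode.characters_to_binary("Bient\xf4t", :latin1)
# => "Bientôt"

String.contains?(decoded, "Bientôt")
# => true
```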

3 Likes

Even better, thanks!