Fix the wrong encoding in a sub-string

monte_claro · July 19, 2024, 6:33am

My project is in Erlang; however, this should make no difference.

I have a black-box function:

my_format_error(input1, input2) ->
  % [...................]

which returns something like “some error: msg1”. And sometimes it will return something like “some error: Ð´Ð°Ð´Ð°Ð´Ð°Ð´Ð°Ð´Ð°Ð´Ð°” instead of “some error: дадададада”. That is, the output contains a sub-string with invalid encoding, which is a bug.

For now, to make it simple, I need to create a function that’ll fix it:

    fix_encoding_for_my_format_error(input3)

which will be fed the output of the 1st function.

However, I haven’t been able to do it. I’ve tried multiple combinations of

unicode:characters_to_binary(…)
unicode:characters_to_list(…)
binary_to_list(…)
list_to_binary(…)

To no avail.

How do I do it? I need to reliably convert a string that may or may not contain a sub-string in the wrong encoding into a properly encoded string.

NobbZ · July 19, 2024, 6:40am

First of all, you have to find a reliable way to distinguish the “broken” and the “correct” encoding from each other without having a human look at it.

Once you know that you have to fix a substring, you have to extract it and then hand it over to a converter; iconv is quite popular for that. I do not know if a native implementation in Erlang or Elixir exists, though NIFs exist, and it seems as if the iconv package on Hex should work with Erlang as well as Elixir.

Then assemble the string back together.

Be aware that there are a lot of encodings out there that you cannot tell apart from each other, as a byte is just a byte, and there is usually no metadata available when working with those bytes.

If your expected encoding is UTF-8, then you can at least assume that any invalid UTF-8 byte sequence is very likely in some other encoding that you want to transform. Which one that is? I hope you know…
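As a minimal sketch of that validity check, assuming the data is available as a binary: `unicode:characters_to_binary/3` returns a binary for well-formed input and an `{error, ...}` or `{incomplete, ...}` tuple otherwise. The module and function names here are made up for illustration.

```erlang
-module(utf8_check).
-export([is_valid_utf8/1]).

%% Hypothetical helper: true when Bin is a well-formed UTF-8 byte sequence.
%% unicode:characters_to_binary/3 returns a binary on success and an
%% {error, ...} or {incomplete, ...} tuple when the input is not valid UTF-8.
is_valid_utf8(Bin) when is_binary(Bin) ->
    is_binary(unicode:characters_to_binary(Bin, utf8, utf8)).
```

For example, `utf8_check:is_valid_utf8(<<208,180,208,176>>)` is `true`, while `utf8_check:is_valid_utf8(<<228,246>>)` (Latin-1 "äö") is `false`.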

dimitarvp · July 19, 2024, 7:20am

Where does this string come from? External system that produces NUL-terminated strings and does not give it to you in any UTF-* form?

NobbZ · July 19, 2024, 8:02am

The alternating pattern in both the “wrong” and the “correct” string makes me assume that the source uses some strict 8-bit encoding where actual UTF-8 is expected.

The look of it is already common when using the wrong ISO-8859 variant for German texts, or when trying to interpret UTF-8 as ISO-8859-1/15 or vice versa.

Though here we don’t have occasional umlauts as in most European languages, but a completely different alphabet (Cyrillic).

Therefore the problem is much more visible.

dimitarvp · July 19, 2024, 8:13am

I assumed the same but it’s always best to ask, sometimes people just forget to configure something somewhere. Happened to the best of us.

monte_claro · July 19, 2024, 12:55pm

The 1st, black-box function produces it.

Also, the 1st function always takes valid UTF-8 input. However, it may produce a sub-string with incorrect encoding whenever it’s fed input containing something like “дададада”.

The task for now, then, is to create a function that fixes the output.

monte_claro · July 19, 2024, 12:58pm

Most likely the issue is in the 1st black-box function. However, for now I need to fix the output rather than the original function itself. Hence, black-box.

monte_claro · July 19, 2024, 1:03pm

That is,

fix_the_encoding_output_of_black_box(...)

when fed something like “some error: Ð´Ð°Ð´Ð°Ð´Ð°Ð´Ð°Ð´Ð°Ð´Ð°” should produce “some error: дадададада”.

when fed “some error: jajajaja” should produce “some error: jajajaja”.

Will it be possible at all, though, without having to fix the 1st function itself?

dimitarvp · July 19, 2024, 1:49pm

OK, so the black box part is apparently non-negotiable.

So is it always the case that you need a proper UTF-8 Cyrillic string but get this garbled input instead? Or are there other scenarios?

monte_claro · July 19, 2024, 2:41pm

…Yes…

NobbZ · July 19, 2024, 4:48pm

Do you get the result back as binary string or as a char list?

And what does the “internal” representation actually look like?

Is it [208, 180, 208, 176] or <<208, 180, 208, 176>>?

This difference is important.
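A small sketch of why it matters: `unicode:characters_to_list/2` decodes the bytes of a binary as UTF-8, but treats the integers of a list as code points that are already decoded. The module and function names are hypothetical.

```erlang
-module(repr_demo).
-export([as_code_points/1]).

%% Hypothetical helper: returns the code points Erlang sees in a value.
%% For a binary, the bytes are decoded as UTF-8.
%% For a list, each integer is already taken to be one code point.
as_code_points(Value) ->
    unicode:characters_to_list(Value, utf8).
```

With this, `repr_demo:as_code_points(<<208,180,208,176>>)` gives `[1076,1072]` (“да”), whereas `repr_demo:as_code_points([208,180,208,176])` gives `[208,180,208,176]` back unchanged, which prints as “Ð´Ð°”.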

Also, do you need the “fixed” string as a char list or as a binary?

And last but not least… The abstractness of the problem sounds pretty much like it was a homework assignment. Can you please verify whether this is or is not a homework or other kind of assignment?

monte_claro · July 22, 2024, 2:35pm

Let’s say, I said “no”. Would you be able to verify my word? If not, what’s the point of your question? If yes - how?

monte_claro · July 22, 2024, 2:38pm

In the form I presented earlier:

“some error: Ð´Ð°Ð´Ð°Ð´Ð°Ð´Ð°Ð´Ð°Ð´Ð°”

which is a list of integers with incorrect encoding.

NobbZ · July 22, 2024, 2:39pm

In the past we have seen waves of “homework related questions” that generally came with low effort problems and even lower effort solutions (or less), asking for a ready made answer to their problem.

We generally try to avoid “doing the homework” but instead prefer to help people solve it on their own.

So it’s just a matter of fact that “homework assignments” have a bitter taste for a lot of us due to what has happened in the past.

And if this is indeed a homework task, I might take a different approach didactically than if it wasn’t.

NobbZ · July 22, 2024, 2:45pm

In which case, you probably just need to iterate and “fuse” individual “characters” with an ordinal above 127 with the next element, according to the rules of UTF-8.

Take a brief look at this table I have taken from Wikipedia (link below):

First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4
U+0000	U+007F	0xxxxxxx
U+0080	U+07FF	110xxxxx	10xxxxxx
U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
U+010000	U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx
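
Read as code, the table above could be sketched roughly like this. It assumes the input is a flat list in which each element up to 255 is really one byte; the module name is made up, and the sketch does not reject overlong 3- or 4-byte forms or surrogates. Elements that do not start a complete sequence are kept as they are.

```erlang
-module(utf8_fuse).
-export([fuse/1]).

%% A continuation byte has the form 10xxxxxx, i.e. 16#80..16#BF.
-define(IS_CONT(B), (B >= 16#80 andalso B =< 16#BF)).

%% Sketch: fuse lead and continuation "characters" into real code points.
fuse([B | Rest]) when B < 16#80 ->
    %% 0xxxxxxx: plain ASCII, keep as is.
    [B | fuse(Rest)];
fuse([B1, B2 | Rest])
  when B1 >= 16#C2, B1 =< 16#DF, ?IS_CONT(B2) ->
    %% 110xxxxx 10xxxxxx
    [((B1 band 16#1F) bsl 6) bor (B2 band 16#3F) | fuse(Rest)];
fuse([B1, B2, B3 | Rest])
  when B1 >= 16#E0, B1 =< 16#EF, ?IS_CONT(B2), ?IS_CONT(B3) ->
    %% 1110xxxx 10xxxxxx 10xxxxxx
    [((B1 band 16#0F) bsl 12) bor ((B2 band 16#3F) bsl 6)
         bor (B3 band 16#3F) | fuse(Rest)];
fuse([B1, B2, B3, B4 | Rest])
  when B1 >= 16#F0, B1 =< 16#F4, ?IS_CONT(B2), ?IS_CONT(B3), ?IS_CONT(B4) ->
    %% 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    [((B1 band 16#07) bsl 18) bor ((B2 band 16#3F) bsl 12)
         bor ((B3 band 16#3F) bsl 6) bor (B4 band 16#3F) | fuse(Rest)];
fuse([Other | Rest]) ->
    %% Not the start of a complete sequence: leave the element untouched.
    [Other | fuse(Rest)];
fuse([]) ->
    [].
```

For instance, `utf8_fuse:fuse([208,180,208,176])` yields `[1076,1072]`, and a pure ASCII string such as "some error: jajajaja" comes back unchanged.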

My assumption about the black box is that it incorrectly converts between binary and list strings and back at least once, and this is the result of that.
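
If that assumption holds, undoing the conversion might look like the sketch below: turn the list back into bytes and decode those bytes as UTF-8. It assumes the black box returns a flat list of integers; the module and function names are hypothetical. When the list contains code points above 255, or the bytes are not valid UTF-8, the input is returned unchanged.

```erlang
-module(fix_encoding).
-export([fix/1]).

%% Hypothetical helper, assuming the black box returns a flat list of integers.
fix(Chars) when is_list(Chars) ->
    case lists:all(fun is_byte/1, Chars) of
        false ->
            %% Some element does not fit in a byte, so the list already
            %% holds real code points: leave it alone.
            Chars;
        true ->
            case unicode:characters_to_list(list_to_binary(Chars), utf8) of
                Fixed when is_list(Fixed) -> Fixed;
                %% {error, ...} or {incomplete, ...}: not UTF-8 bytes.
                _NotUtf8 -> Chars
            end
    end.

is_byte(E) ->
    is_integer(E) andalso E >= 0 andalso E =< 255.
```

Two caveats: a list that mixes garbled and correct Cyrillic is left untouched by this sketch, and a genuine Latin-1 string whose code points happen to form valid UTF-8 would be “fixed” by mistake, which is the ambiguity mentioned earlier in the thread.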