Fix the wrong encoding in a sub-string

My project is in Erlang; however, this should make no difference.

I have a black-box function:

    my_format_error(Input1, Input2) ->
        % [...................]

which returns something like “some error: msg1”. Sometimes, however, it returns something like “some error: дададададада” instead of “some error: дадададада”, that is, a sub-string with invalid encoding, which is a bug.

For now, to make it simple, I need to create a function that’ll fix it:

    fix_encoding_for_my_format_error(input3)

which will be fed the output of the 1st function.

However, I haven’t been able to do it. I’ve been trying multiple combinations of

  • unicode:characters_to_binary(…)
  • unicode:characters_to_list(…)
  • binary_to_list(…)
  • list_to_binary(…)

To no avail.

How can I do it, that is, reliably convert a string that may or may not contain a sub-string in the wrong encoding into a properly encoded string?

First of all, you have to find a reliable way to distinguish the “broken” encoding from the “correct” one without having a human look at it.

Once you know that you have to fix a substring, you have to extract it and hand it over to a converter; iconv is quite popular for that. I do not know whether a native implementation in Erlang or Elixir exists, though NIFs do, and it seems that the iconv package on Hex should work with Erlang as well as Elixir.

Then assemble the string back together.

Be aware that there are a lot of encodings out there that you cannot distinguish from each other, as a byte is just a byte and there is usually no metadata available when working with those bytes.

If your expected encoding is UTF-8, then you can at least assume that any invalid UTF-8 byte sequence is very likely in some other encoding that you want to transform. Which one that is? I hope you know…
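A minimal sketch of such a validity check for a binary, assuming the expected encoding is UTF-8. The module name `utf8_check` and the function `is_valid_utf8/1` are made up for illustration; the only library call is `unicode:characters_to_binary/3`, which returns an `error` or `incomplete` tuple instead of a binary when the input is not well-formed.

```erlang
-module(utf8_check).
-export([is_valid_utf8/1]).

%% Hypothetical helper: true when Bin is well-formed UTF-8, false otherwise.
is_valid_utf8(Bin) when is_binary(Bin) ->
    case unicode:characters_to_binary(Bin, utf8, utf8) of
        Result when is_binary(Result) -> true;   % whole input converted cleanly
        {error, _Converted, _Rest} -> false;     % an invalid byte was found
        {incomplete, _Converted, _Rest} -> false % input ends in the middle of a sequence
    end.
```

Note that this only tells you that a binary is not UTF-8; it cannot tell you which encoding it is in instead.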

Where does this string come from? An external system that produces NUL-terminated strings and does not give them to you in any UTF-* form?

The alternating pattern in both the “wrong” and the “correct” string makes me assume that the source uses some strict 8-bit encoding where actual UTF-8 is expected.

This look is common when using the wrong ISO-8859 variant for German texts, or when interpreting UTF-8 as ISO-8859-1/15 or vice versa.

Though here we don’t have occasional umlauts as in most European languages, but a completely different alphabet (Cyrillic).

Therefore the problem is much more visible.

I assumed the same but it’s always best to ask, sometimes people just forget to configure something somewhere. Happened to the best of us. :smiley:


The 1st, black-box, function produces it.

Also, the 1st function always takes valid UTF-8 input. However, it may produce a sub-string with incorrect encoding whenever it’s fed input containing something like “дададада”.

The task for now, then, is to create a function that fixes the output.

Most likely the issue is in the 1st, black-box, function. However, for now I need to fix the output rather than the original function itself. Hence, black-box.

That is,

fix_the_encoding_output_of_black_box(...)

when fed something like “some error: дададададада” should produce “some error: дадададада”.

when fed “some error: jajajaja” should produce “some error: jajajaja”.

Will it be possible at all, though, without having to fix the 1st function itself?

OK, so the black box part is apparently non-negotiable.

So is it always the case that you need a proper UTF-8 Cyrillic string but get this garbled input instead? Or are there other scenarios?

…Yes…

Do you get the result back as binary string or as a char list?

And what does the “internal” representation actually look like?

Is it [208, 180, 208, 176] or <<208, 180, 208, 176>>?

This difference is important.
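To illustrate why it matters, here is a small sketch (the module name `repr_demo` is made up) that shows the same UTF-8 bytes for “да” as a byte list, as a list of code points, and what happens when the byte list is mistaken for a list of code points and encoded again:

```erlang
-module(repr_demo).
-export([demo/0]).

demo() ->
    Bin = <<208, 180, 208, 176>>,                   % UTF-8 bytes of "да"
    Bytes = binary_to_list(Bin),                    % one list element per byte
    Chars = unicode:characters_to_list(Bin),        % one list element per code point
    Garbled = unicode:characters_to_binary(Bytes),  % bytes treated as code points
    {Bytes, Chars, Garbled}.
```

`Bytes` is `[208,180,208,176]`, `Chars` is `[1076,1072]`, and `Garbled` is an 8-byte binary in which every original byte has been encoded as a character of its own.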

Also, do you need the “fixed” string as a char list or as a binary?

And last but not least… The abstractness of the problem sounds pretty much like it was a homework assignment. Can you please verify whether this is or is not a homework or other kind of assignment?

Let’s say, I said “no”. Would you be able to verify my word? If not, what’s the point of your question? If yes - how?

In the form I’ve presented earlier:

“some error: дададададада”

which is a list of integers with incorrect encoding.

In the past we have seen waves of “homework related questions” that generally came with low effort problems and even lower effort solutions (or less), asking for a ready made answer to their problem.

We generally try to avoid “doing the homework” but instead prefer to help people solve it on their own.

So it’s just a matter of fact that “homework assignments” leave a bitter taste for a lot of us because of what has happened in the past.

And if this is indeed a homework task, I might take a different approach didactically than if it wasn’t.

In which case, you probably just need to iterate and “fuse” individual “characters” with an ordinal above 127 with the next element, according to the rules of UTF-8.

Take a brief look at this table I have taken from wikipedia (link below):

First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4
U+0000            U+007F           0xxxxxxx
U+0080            U+07FF           110xxxxx  10xxxxxx
U+0800            U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
U+010000          U+10FFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

My assumption about the black box is that it incorrectly converts between binary and list strings and back at least once, and this is the result.
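Under that assumption, the “fusing” could be sketched roughly like this. The module name `fix_encoding` and the function `fix/1` are made up, and the input is assumed to be a list of integers in which the broken part consists of UTF-8 bytes stored as separate elements. Instead of doing the bit arithmetic from the table by hand, this version collects each run of elements in the range 128..255 and lets `unicode:characters_to_list/2` decode it; if the run is not valid UTF-8, it is left untouched.

```erlang
-module(fix_encoding).
-export([fix/1]).

%% Hypothetical fixer: walks the list and re-decodes runs of "byte-like" elements.
fix([]) ->
    [];
fix([C | _] = Str) when C >= 128, C =< 255 ->
    %% Collect the longest prefix of elements that could be non-ASCII UTF-8 bytes.
    {Run, Rest} = lists:splitwith(fun(X) -> X >= 128 andalso X =< 255 end, Str),
    decode_run(Run) ++ fix(Rest);
fix([C | Rest]) ->
    %% ASCII and real code points above 255 are passed through unchanged.
    [C | fix(Rest)].

%% Try to read the run as UTF-8 bytes; keep it as-is when that fails.
decode_run(Run) ->
    case unicode:characters_to_list(list_to_binary(Run), utf8) of
        Decoded when is_list(Decoded) -> Decoded;
        _ErrorOrIncomplete -> Run
    end.
```

With this, `fix_encoding:fix("some error: " ++ [208,180,208,176])` gives `"some error: " ++ [1076,1072]`, while a plain ASCII string such as `"some error: jajajaja"` comes back unchanged. One caveat: a legitimate run of Latin-1 characters that happens to form valid UTF-8 would be fused as well, so this is only safe if such text is not expected in the output.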
