which returns something like “some error: msg1”. And sometimes it will return something like “some error: дададададада” instead of “some error: дадададада”. That is, the output contains a sub-string with invalid encoding, which is a bug.
For now, to make it simple, I need to create a function that’ll fix it:
fix_encoding_for_my_format_error(input3)
which will be fed the output of the 1st function.
However, I haven’t been able to do it. I’ve been trying multiple combinations of
unicode:characters_to_binary(…)
unicode:characters_to_list(…)
binary_to_list(…)
list_to_binary(…)
To no avail.
How to do it? Reliably convert a string that may, or may not, contain a sub-string in the wrong encoding into a properly-encoded string.
First of all, you have to find a reliable way to distinguish the “broken” and the “correct” encoding from each other without having a human look at it.
Once you know that you have to fix a substring, you have to extract it and hand it over to a converter; iconv is quite popular for that. I do not know whether a native implementation exists in Erlang or Elixir, though NIFs exist, and it seems the iconv package on Hex should work with Erlang as well as Elixir.
Then assemble the string back together.
Be aware that there are a lot of encodings out there that you cannot tell apart from each other. A byte is just a byte, and there is usually no metadata available when working with those bytes.
If your expected encoding is UTF-8, then you can at least assume that any invalid UTF-8 byte sequence is very likely in some other encoding that you want to transform. Which one that is? I hope you know…
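A minimal sketch of that first check in Erlang, assuming the data arrives as a binary. The module and function names (`utf8_check`, `split_valid/1`) are made up for illustration; only `unicode:characters_to_binary/3` comes from OTP. Note that this only catches byte sequences that are not valid UTF-8. Text that was encoded twice is still valid UTF-8 and would pass this check.

```erlang
-module(utf8_check).
-export([split_valid/1]).

%% Returns {ok, Bin} when Bin is entirely valid UTF-8. Otherwise returns
%% {invalid, ValidPrefix, Rest}, where Rest starts at the first byte that
%% could not be decoded as UTF-8.
split_valid(Bin) when is_binary(Bin) ->
    case unicode:characters_to_binary(Bin, utf8, utf8) of
        Valid when is_binary(Valid) -> {ok, Valid};
        {error, Prefix, Rest}       -> {invalid, Prefix, Rest};
        {incomplete, Prefix, Rest}  -> {invalid, Prefix, Rest}
    end.
```

The `Rest` part is what you would then hand to a converter, once you know which encoding it is actually in.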
The alternating pattern in both the “wrong” and the “correct” string makes me assume that the source uses some strict 8-bit encoding where UTF-8 is actually expected.
This kind of garbling is commonly seen when the wrong ISO-8859 variant is used for German texts, or when UTF-8 is interpreted as ISO-8859-1/15 or vice versa.
Though here we don’t have occasional umlauts as in most European languages, but a completely different alphabet (Cyrillic).
Also, the 1st function always takes valid UTF-8 input. However, it may produce a sub-string with incorrect encoding whenever it’s fed input containing something like “дададада”.
The task for now is to create a function that fixes the output.
Most likely the issue is in the 1st black-box function. However, for now I need to fix the output rather than the original function itself. Hence, black-box.
Do you get the result back as binary string or as a char list?
And what does the “internal” representation actually look like?
Is it [208, 180, 208, 176] or <<208, 180, 208, 176>>?
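To illustrate why this matters, here is a small sketch (the module name `repr_demo` is made up) showing how the same four bytes come out depending on whether they are treated as raw bytes or decoded as UTF-8:

```erlang
-module(repr_demo).
-export([demo/0]).

demo() ->
    Bin = <<208, 180, 208, 176>>,             % UTF-8 bytes of "да"
    Bytes = binary_to_list(Bin),              % one list element per byte
    Chars = unicode:characters_to_list(Bin),  % one list element per code point
    {Bytes, Chars}.
```

`demo()` returns `{[208,180,208,176], [1076,1072]}`. The first list holds bytes, the second holds the code points U+0434 and U+0430. Mixing up the two is an easy way to end up with a garbled string.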
This difference is important.
Also, do you need the “fixed” string as a char list or as a binary?
And last but not least… The abstractness of the problem sounds pretty much like it was a homework assignment. Can you please verify whether this is or is not a homework or other kind of assignment?
In the past we have seen waves of “homework related questions” that generally came with low effort problems and even lower effort solutions (or less), asking for a ready made answer to their problem.
We generally try to avoid “doing the homework” and instead prefer to help people solve it on their own.
So it’s just a matter of fact that “homework assignments” have a bitter taste for a lot of us due to what has happened in the past.
And if this is indeed a homework task, I might take a different approach didactically than if it wasn’t.
In which case, you probably just need to iterate and “fuse” individual “characters” with an ordinal above 127 with the next element, according to the rules of UTF-8.
Take a brief look at this table I have taken from wikipedia (link below):
| First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|
| U+0000 | U+007F | 0xxxxxxx | | | |
| U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
| U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| U+010000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
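The “fusing” idea could be sketched roughly like this, assuming the broken result is a char list in which some UTF-8 sequences show up as separate byte-valued elements. The module name `fuse_utf8` and the function `fix/1` are hypothetical. It cannot tell a genuine Latin-1 character pair that happens to look like a UTF-8 sequence from a broken one, so treat it as a heuristic.

```erlang
-module(fuse_utf8).
-export([fix/1]).

%% Walks a list of integers. Whenever an element looks like a UTF-8 lead
%% byte and is followed by the right number of continuation bytes
%% (16#80..16#BF), the group is decoded into one code point.
%% Everything else is passed through unchanged.
fix([B1 | Rest]) when B1 >= 16#C2, B1 =< 16#F4 ->
    case take_cont(cont_count(B1), Rest, []) of
        {ok, Conts, Tail} ->
            case unicode:characters_to_list(list_to_binary([B1 | Conts]), utf8) of
                [CodePoint] -> [CodePoint | fix(Tail)];
                _NotValid   -> [B1 | fix(Rest)]
            end;
        error ->
            [B1 | fix(Rest)]
    end;
fix([C | Rest]) -> [C | fix(Rest)];
fix([]) -> [].

%% Number of continuation bytes implied by the lead byte (see the table).
cont_count(B) when B < 16#E0 -> 1;
cont_count(B) when B < 16#F0 -> 2;
cont_count(_) -> 3.

take_cont(0, Rest, Acc) -> {ok, lists:reverse(Acc), Rest};
take_cont(N, [C | Rest], Acc) when C >= 16#80, C =< 16#BF ->
    take_cont(N - 1, Rest, [C | Acc]);
take_cont(_, _, _) -> error.
```

For example, `fuse_utf8:fix([115, 208, 180, 208, 176])` gives `[115, 1076, 1072]`, while an already correct list such as `[1076, 1072]` is returned as is.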
My assumption about the black box is that it incorrectly converts between binary and list strings and back at least once, and that this output is the result.
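This is only a guess at what the black box does, but one such round trip can be reproduced and reversed like this. The module `double_enc` and both function names are made up. `repair/1` works on a whole binary, so a string where only a part is garbled would have to be split up first.

```erlang
-module(double_enc).
-export([break/1, repair/1]).

%% Reproduces the suspected bug: the UTF-8 bytes are turned into a list,
%% and each byte is then encoded again as if it were a code point.
break(Utf8Bin) when is_binary(Utf8Bin) ->
    unicode:characters_to_binary(binary_to_list(Utf8Bin)).

%% Tries to undo one round of double encoding. If the decoded code points
%% all fit into a byte and those bytes form valid UTF-8, that is taken as
%% the repaired string. Otherwise the input is returned unchanged.
repair(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        Chars when is_list(Chars) ->
            case lists:all(fun(C) -> C =< 255 end, Chars) of
                true ->
                    Candidate = list_to_binary(Chars),
                    case unicode:characters_to_binary(Candidate, utf8, utf8) of
                        Candidate -> Candidate;
                        _Invalid  -> Bin
                    end;
                false ->
                    Bin
            end;
        _DecodeError ->
            Bin
    end.
```

With this, `double_enc:break(<<208,180,208,176>>)` yields `<<195,144,194,180,195,144,194,176>>`, and `double_enc:repair/1` turns that back into `<<208,180,208,176>>`. Plain ASCII input passes through both functions unchanged.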