Elixir: converting UTF-8 encoded strings back to their original values in production

Due to some requirements, my colleague used to_string on numbers and arrays (basically any value) to store them in a map like this: %{value: value |> to_string}. But now these values are not readable, e.g. I see {value: ଷ}, {value: ϊ}. I need to convert them back to the original values for readability and to perform certain JSON operations. I need {value: [324]} etc. back, not special characters. How can I do this? I have been trying different approaches for 4-5 hours. Mostly the integer array values were converted to special symbols, e.g. [314] |> to_string gives "ĺ".

Any help/workaround is really appreciated. Thanks

I can do ?map[:value], but it does not work for values that are arrays, strings, or true/false. It just breaks.

iex(1)> "ĺ" |> String.to_charlist()
[314]

I think this should do the trick.

Thank you so much. It's very close.
How can I distinguish between the following?

iex(93)> i = "ĺ" |> String.to_charlist()
[314]
iex(94)> is_list i
true
iex(95)> a = "AKA" |> String.to_charlist()
'AKA'
iex(96)> is_list a
true
iex(97)> k = "ଷ" |> String.to_charlist()
[2871]
iex(98)> is_list k
true

"AKA" is a string, but it becomes a list after to_charlist is applied. I don't want to convert it to an array, as it's not supposed to be converted. I only want to filter and update the values in the database that were corrupted, e.g. value

They are not corrupted: to_string on a charlist returns the string that the charlist represents, so that is what they are. Because strings and charlists are convertible to each other, it is impossible to know what a value originally was before to_string was applied. to_string is an inherently lossy conversion (it discards the original type information and turns everything into a plain binary).
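To make the lossiness concrete, here is a small sketch (my own illustration, not code from the original poster's project) showing that different terms collapse onto the same binary once to_string is applied. Each IO.inspect call should print true.

```elixir
# to_string/1 maps different terms onto the same binary, so the type is lost.
from_charlist = to_string([65, 75, 65])
from_binary = to_string("AKA")
IO.inspect(from_charlist == from_binary)

# Integers and booleans collide with ordinary strings as well.
IO.inspect(to_string(324) == to_string("324"))
IO.inspect(to_string(true) == to_string("true"))

# A single-integer list becomes the character with that codepoint.
IO.inspect(to_string([314]) == <<314::utf8>>)
```

Given only the stored binary, there is no way to tell which of the colliding inputs produced it.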

If your normal strings are constrained to, say, the 32-127 range, then you can test whether any of the char values fall outside of it: if so, treat the value as a charlist (a list of integers), otherwise leave it as a binary.
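A minimal sketch of that heuristic follows. The module name Recover and the function maybe_charlist/1 are hypothetical, and the printable range 32..126 is an assumption about what your "normal" strings contain. It also assumes the stored binaries are valid UTF-8, since String.to_charlist/1 raises otherwise.

```elixir
defmodule Recover do
  # Hypothetical helper: guesses whether a stored binary was originally
  # a list of integers. This is a heuristic, not an exact inverse of to_string/1.
  def maybe_charlist(value) when is_binary(value) do
    codepoints = String.to_charlist(value)

    if Enum.all?(codepoints, fn cp -> cp in 32..126 end) do
      # Every codepoint is printable ASCII: assume it was a normal string.
      value
    else
      # At least one codepoint is outside the range: assume an integer list.
      codepoints
    end
  end
end

IO.inspect(Recover.maybe_charlist("AKA"))
IO.inspect(Recover.maybe_charlist(<<314::utf8>>))
```

Note the limits: a list such as [65] was stored as "A" and will be classified as a string, and a number such as 324 was stored as "324" and also stays a string. Only values containing codepoints outside the chosen range can be recovered this way.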

Remember, a list whose elements are all integers is a charlist. It can be treated as a string in many places, and it is treated as one when it is converted to a binary. If someone wants to store arbitrary data in a string/binary field, they should encode it somehow, such as via :erlang.term_to_binary/1 and :erlang.binary_to_term/2, or via Jason for JSON encoding. Just converting anything to a string is inaccurately reversible at best, and not reversible at all for the great majority of data types.
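As a sketch of the encoding approach for new data, the round trip below uses :erlang.term_to_binary/1 and :erlang.binary_to_term/2. The Base64 step is my own assumption, added in case the column only accepts text, because the output of term_to_binary is not valid UTF-8 in general. The :safe option stops decoding from creating new atoms.

```elixir
# Encode any term into a text-safe string before storing it.
encode = fn term ->
  term |> :erlang.term_to_binary() |> Base.encode64()
end

# Decode the stored string back into the original term.
decode = fn stored ->
  stored |> Base.decode64!() |> :erlang.binary_to_term([:safe])
end

values = [[314], "AKA", 324, true]
restored = Enum.map(values, fn v -> v |> encode.() |> decode.() end)

# Types survive the round trip, unlike with to_string/1.
IO.inspect(restored == values)
```

This does not repair the values already written with to_string; it only avoids the problem going forward.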

"AKA" is a string, but it becomes a list after to_charlist is applied. I don't want to convert it to an array, as it's not supposed to be converted. I only want to filter and update the values in the database that were corrupted, e.g. value

They aren’t being converted to an array but rather to a character list, which is just a list made up exclusively of integers. You can see more information in the REPL:

iex(1)> i 'AKA'
Term
  'AKA'
Data type
  List
Description
  This is a list of integers that is printed as a sequence of characters
  delimited by single quotes because all the integers in it represent valid
  ASCII characters. Conventionally, such lists of integers are referred to
  as "charlists" (more precisely, a charlist is a list of Unicode codepoints,
  and ASCII is a subset of Unicode).
Raw representation
  [65, 75, 65]
Reference modules
  List
Implemented protocols
  IEx.Info, List.Chars, Inspect, Collectable, String.Chars, Enumerable

Note the Raw representation section: a charlist string is a list of integers, just like a binary string is an array of 8-bit integers:

iex(3)> i "AKA"
Term
  "AKA"
Data type
  BitString
Byte size
  3
Description
  This is a string: a UTF-8 encoded binary. It's printed surrounded by
  "double quotes" because all UTF-8 encoded codepoints in it are printable.
Raw representation
  <<65, 75, 65>>
Reference modules
  String, :binary
Implemented protocols
  IEx.Info, List.Chars, Inspect, Collectable, String.Chars

Note its Raw representation as well. Same numbers: one is encoded as a list, the other as a binary array.
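One caveat worth checking yourself: the numbers match only for ASCII text. A charlist holds Unicode codepoints, while a binary holds UTF-8 bytes, so for a character like "ĺ" the two differ (bytes 196 and 186 versus codepoint 314). A short sketch, with variable names of my own choosing:

```elixir
ascii = "AKA"
bytes = :binary.bin_to_list(ascii)
codepoints = String.to_charlist(ascii)

# For ASCII, the bytes and the codepoints are the same integers.
IO.inspect(bytes == codepoints)

# The string "ĺ", written via its codepoint to avoid source-encoding issues.
non_ascii = <<314::utf8>>

# Outside ASCII, one codepoint is spread over several bytes.
IO.inspect(:binary.bin_to_list(non_ascii), label: "bytes")
IO.inspect(String.to_charlist(non_ascii), label: "codepoints")
```

This is why String.to_charlist/1, which decodes UTF-8, gives back [314] rather than the raw bytes.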
