Elixir, unicode and odbc - what is the encoding value used by odbc?

blackrez · December 16, 2022, 11:35am

Hello,

I’m new to Elixir and I’m trying to build a data application using odbc.
The connection works great:

iex(19)> r = DuckDBex.ODBC.query(pid, q)
%{conn: #PID<0.159.0>}
{:selected, ['ville', 'type_carburant', 'cp', 'p'],
 [
   {[80, 195, 169, 108, 105, 115, 115, 97, 110, 110, 101], 'E10', '13330',
    1.535}
 ]}
iex(20)> to_string [80, 195, 169, 108, 105, 115, 115, 97, 110, 110, 101]
"PÃ©lissanne"
iex(21)> # instead of Pélissanne
iex(22)> to_charlist 'Pélissanne'
[80, 233, 108, 105, 115, 115, 97, 110, 110, 101]

The string returned by odbc doesn’t look like UTF-8 (but in the database it is UTF-8).

How can I solve this issue, and what encoding does odbc use?

Thanks.

Nicd · December 16, 2022, 11:44am

These look like UTF-8 bytes returned as a list of integers. When the list is interpreted as a charlist, each integer is treated as a Unicode codepoint, which produces that malformed text. The proper way is to treat each integer as one byte (an 8-bit unsigned integer) of a UTF-8 encoded string.
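To make the difference concrete, here is a minimal sketch using only the two integers that stand for the "é". The variable names are just for illustration; `List.to_string/1` and `:erlang.list_to_binary/1` are standard functions.

```elixir
# The two integers odbc returned for the "é" in the town name.
bytes = [195, 169]

# Read as codepoints (what to_string/1 does with a charlist):
# 195 is "Ã" (U+00C3) and 169 is "©" (U+00A9).
as_codepoints = List.to_string(bytes)

# Read as raw bytes: the pair is the UTF-8 encoding of the single character "é".
as_bytes = :erlang.list_to_binary(bytes)

IO.inspect(as_codepoints)            # "Ã©"
IO.inspect(as_bytes)                 # "é"
IO.inspect(String.valid?(as_bytes))  # true
```

The data is the same in both cases; only the interpretation of the integers changes.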

Here are a couple of ways to transform it properly:

iex(2)> for byte <- [80, 195, 169, 108, 105, 115, 115, 97, 110, 110, 101], into: <<>>, do: <<byte>>
"Pélissanne"

and

iex(8)> :unicode.characters_to_binary([80, 195, 169, 108, 105, 115, 115, 97, 110, 110, 101], :utf8, :latin1)
"Pélissanne"

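Personally I’d probably use the former, because the order of the InEncoding/OutEncoding parameters in the latter looks backwards (to my intuition, at least). It does work, though: the integers in the list are read as codepoints, and the :latin1 output writes each codepoint below 256 as a single byte. The comprehension is probably the slower of the two, if performance is critical here.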
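If every string column needs the same treatment, the conversion can be applied to the whole result at once. Below is a rough sketch, assuming results have the `{:selected, columns, rows}` shape shown in the question, with rows as tuples. `OdbcDecode` is a hypothetical helper name, not part of odbc or DuckDBex, and it uses `:erlang.list_to_binary/1` as a third way to turn a byte list into a binary.

```elixir
defmodule OdbcDecode do
  # Hypothetical helper, not part of odbc or DuckDBex.

  # Turn every byte list in a {:selected, columns, rows} result
  # into a binary, leaving all other values untouched.
  def decode({:selected, columns, rows}) do
    {:selected, Enum.map(columns, &to_binary/1), Enum.map(rows, &decode_row/1)}
  end

  defp decode_row(row) when is_tuple(row) do
    row
    |> Tuple.to_list()
    |> Enum.map(&to_binary/1)
    |> List.to_tuple()
  end

  # Assumption: any list value is a list of bytes (0..255).
  # :erlang.list_to_binary/1 raises ArgumentError otherwise.
  defp to_binary(value) when is_list(value), do: :erlang.list_to_binary(value)
  defp to_binary(value), do: value
end

# The result from the question, written with the ~c sigil for charlists.
result =
  {:selected, [~c"ville", ~c"type_carburant", ~c"cp", ~c"p"],
   [{[80, 195, 169, 108, 105, 115, 115, 97, 110, 110, 101], ~c"E10", ~c"13330", 1.535}]}

IO.inspect(OdbcDecode.decode(result))
# {:selected, ["ville", "type_carburant", "cp", "p"],
#  [{"Pélissanne", "E10", "13330", 1.535}]}
```

It may also be worth checking whether the wrapper lets you pass connection options through: Erlang’s odbc application documents a `binary_strings` option for `odbc:connect/2` that returns strings as binaries, but I don’t know whether DuckDBex exposes it.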