Ruby and Elixir decode64 return a string of different size

denys · March 6, 2020, 1:54pm

Hi everyone,
I have a Base64-encoded string:
“kdscdL/7BlFir6RpPm41X3ynKp20eDl0nmIkS7bQjLgtqlUyhV/zQREjakFsUdnL3Lc9sqhVsR7czVLl4tYQNRZRkzcyKyv3c0g8OVgfoopwYYwc027enDroSlduCRy15QEllvmb6kItbzgS93DLQ81xBqvVtFc8rJLf2/8Ij6bvPf4XaMI2CMgyxmzvw/AKalJKYVFjCcm2rkZ4Hh2JyoyJwhdQ+Oec65axuugA6Kmo1PYNOUR12Ha2Y3938CFHael3mC6YRvZr”

Ruby’s Base64.decode64 returns a string of size 201 (encoded.size = 201)
Elixir’s Base.decode64! returns a string of size 191 (String.length(encoded) = 191)

Is Elixir losing something?

lucaong · March 6, 2020, 2:00pm

That’s because String.length/1 does not give the size in bytes, but the number of graphemes. If you call byte_size(Base.decode64!(encoded)) you get the expected value of 201.
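To make the difference concrete, here is a minimal sketch with a much shorter input than the one in the question. The Base64 value "w6k=" is my own illustrative choice, not taken from the original post; it decodes to the two bytes 0xC3 0xA9, which happen to be the UTF-8 encoding of "é".

```elixir
# "w6k=" is the Base64 encoding of the two bytes 0xC3 0xA9.
# Together these bytes form the UTF-8 encoding of "é" (U+00E9).
decoded = Base.decode64!("w6k=")

# byte_size/1 counts raw bytes, which is what Ruby's size reports
# for the binary string returned by Base64.decode64.
IO.inspect(byte_size(decoded))     # 2

# String.length/1 counts graphemes, so the two bytes collapse into one.
IO.inspect(String.length(decoded)) # 1
```

The same effect probably explains the 201 vs. 191 gap: wherever neighbouring bytes in the decoded data happen to form a multi-byte UTF-8 sequence, String.length/1 counts them as fewer units than byte_size/1 does.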

denys · March 6, 2020, 2:10pm

Luca, thank you

Sanjibukai · March 6, 2020, 2:30pm

Just checked in IEx:

iex> h String.length
...
Returns the number of Unicode graphemes in a UTF-8 string.
...

So TIL to always look at the docs, even for obviously named functions, to be sure that they do what ~~they claim~~ we might think they do…

lucaong · March 6, 2020, 2:34pm

Good point about reading docs. As for doing what they claim, to be fair, the String module contains functions for dealing with UTF-8 encoded strings, as the name suggests, not with arbitrary binaries.

As recent posts have shown, bitstrings vs. binaries vs. strings can be a tricky topic that deserves attention. But I don’t think this is a case of surprising behavior of the String module.
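For anyone unsure about the three terms, a small sketch may help. The values below are made up for illustration; the guards and String.valid?/1 are part of Elixir's standard library.

```elixir
# A bitstring whose size is not a multiple of 8 bits is not a binary.
three_bits = <<5::size(3)>>
IO.inspect(is_bitstring(three_bits)) # true
IO.inspect(is_binary(three_bits))    # false

# A binary is a bitstring made of whole bytes. It need not be valid UTF-8:
# the byte 0xFF can never appear in UTF-8 encoded text.
raw = <<0xFF, 0xFE, 0x00>>
IO.inspect(is_binary(raw))     # true
IO.inspect(String.valid?(raw)) # false

# A string is a binary that is also valid UTF-8.
text = "h\u00E9llo"
IO.inspect(is_binary(text))     # true
IO.inspect(String.valid?(text)) # true
```

Decoded Base64 data falls into the middle category in general, so checking String.valid?/1 before using String functions on it is a reasonable habit.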

Sanjibukai · March 6, 2020, 2:40pm

I just noticed that my wording is confusing… I wanted to point out that (at least for me) well-named functions, precisely because they are well named, can lead us to skip checking the docs… I’d never have thought to check the docs for String.length… Which TIL to do anyway.

And for sure, the docs and the behavior are both right…

lucaong · March 6, 2020, 2:47pm

Yes, and don’t get me wrong, I definitely fell into this trap too at some point. Unicode and encoding are difficult topics, but also important in a world where English is not the only language.

dimitarvp · March 6, 2020, 3:01pm

This is more about wrong assumptions than library docs IMO.

People are used to strings being byte/ASCII arrays, while several languages, Elixir and Rust included, tell you up front that their strings are UTF-8 encoded text (graphemes and codepoints can be extracted separately as well, which is very useful).
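As a rough illustration of graphemes vs. codepoints vs. bytes, here is a sketch using an "e" followed by a combining acute accent (U+0301). The sample string is my own choice for the example.

```elixir
# "e" followed by the combining acute accent U+0301: renders as one "é".
accented = "e\u0301"

# Two codepoints: the base letter and the combining mark.
IO.inspect(length(String.codepoints(accented))) # 2

# One grapheme: what a reader perceives as a single character.
IO.inspect(length(String.graphemes(accented)))  # 1
IO.inspect(String.length(accented))             # 1

# Three bytes: 1 for "e" plus 2 for U+0301 in UTF-8.
IO.inspect(byte_size(accented))                 # 3
```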