Ruby and Elixir decode64 return strings of different sizes

Hi everyone,
I have a Base64-encoded string:
“kdscdL/7BlFir6RpPm41X3ynKp20eDl0nmIkS7bQjLgtqlUyhV/zQREjakFsUdnL3Lc9sqhVsR7czVLl4tYQNRZRkzcyKyv3c0g8OVgfoopwYYwc027enDroSlduCRy15QEllvmb6kItbzgS93DLQ81xBqvVtFc8rJLf2/8Ij6bvPf4XaMI2CMgyxmzvw/AKalJKYVFjCcm2rkZ4Hh2JyoyJwhdQ+Oec65axuugA6Kmo1PYNOUR12Ha2Y3938CFHael3mC6YRvZr”

Ruby’s Base64.decode64 returns a string of size 201 (decoded.size = 201)
Elixir’s Base.decode64! returns a string of size 191 (String.length(decoded) = 191)

Is Elixir losing something?

That’s because String.length/1 does not give the size in bytes, but the number of graphemes. If you call byte_size(Base.decode64!(encoded)) you get the expected value of 201.
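
A minimal sketch of the difference, using a tiny input chosen for illustration instead of the string from the question: "w6k=" is the Base64 encoding of the two bytes 0xC3 0xA9, which happen to form the UTF-8 encoding of "é". The variable names are only for this example.

```elixir
# "w6k=" decodes to the two bytes <<0xC3, 0xA9>>, i.e. the UTF-8 encoding of "é".
decoded = Base.decode64!("w6k=")

# byte_size/1 counts raw bytes, which is what Ruby's size reports for decoded data.
byte_count = byte_size(decoded)

# String.length/1 counts graphemes, so the two bytes collapse into one "character".
grapheme_count = String.length(decoded)

IO.inspect({byte_count, grapheme_count})
# prints {2, 1}
```

For decoded binary data, byte_size/1 is most likely the number you want, since the bytes are not meant to be read as text.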

Luca, thank you

Just checked in IEx:

iex> h String.length
...
Returns the number of Unicode graphemes in a UTF-8 string.
...

So TIL to always look at the docs, even for obviously named functions, to be sure that they do what they claim (or what we might think they do)…

Good point about reading the docs. As for functions doing what they claim: to be fair, the String module, as its name suggests, contains functions for dealing with UTF-8 encoded strings, not with arbitrary binaries.

As recent posts have shown, bitstrings vs. binaries vs. strings can be a tricky topic that deserves attention. But I don’t think this is a case of surprising behavior of the String module :slightly_smiling_face:
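
One way to make the distinction explicit, sketched under the assumption that you receive a binary and don’t know whether it is text: String.valid?/1 tells you whether the bytes are valid UTF-8. The BinaryInfo module and its size/1 function are hypothetical names made up for this example, not part of any library.

```elixir
defmodule BinaryInfo do
  # Hypothetical helper for illustration only.
  # Reports the grapheme count for valid UTF-8 and the byte count otherwise.
  def size(data) when is_binary(data) do
    if String.valid?(data) do
      {:text, String.length(data)}
    else
      {:raw, byte_size(data)}
    end
  end
end

# "h\u00E9llo" is "héllo": 5 graphemes stored in 6 bytes.
IO.inspect(BinaryInfo.size("h\u00E9llo"))
# prints {:text, 5}

# 0xFF can never appear in valid UTF-8, so this is treated as raw bytes.
IO.inspect(BinaryInfo.size(<<0xFF, 0xFE>>))
# prints {:raw, 2}
```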

I just noticed that my wording was confusing… I wanted to point out that (at least for me) well-named functions, precisely because they are, well, well named (no pun intended), can tempt you into not checking the docs… I’d never have thought to check the docs for String.length… which, TIL, I should do anyway.

And for sure, the docs and the behavior are correct…

Yes, and don’t get me wrong, I definitely fell into this trap too at some point. Unicode and encoding are difficult topics, but also important in a world where English is not the only language :slightly_smiling_face:

This is more about wrong assumptions than library docs IMO.

People are used to strings being byte/ASCII arrays, while several languages, Elixir and Rust included, tell you up front that their strings are UTF-8 encoded text (graphemes and codepoints can be extracted separately as well, which is very useful).
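
To show how the three views can differ, here is a small Elixir sketch. The input is my own example: the letter "e" followed by the combining acute accent U+0301, which renders as a single "é".

```elixir
# "e" plus U+0301 (combining acute accent) is displayed as one character.
text = "e\u0301"

# 1 byte for "e" and 2 bytes for U+0301 in UTF-8.
bytes = byte_size(text)

# Two Unicode codepoints: the base letter and the combining mark.
codepoints = length(String.codepoints(text))

# One grapheme: what a reader perceives as a single character.
graphemes = length(String.graphemes(text))

IO.inspect({bytes, codepoints, graphemes})
# prints {3, 2, 1}
```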
