Hi everyone
I have an encoded string
“kdscdL/7BlFir6RpPm41X3ynKp20eDl0nmIkS7bQjLgtqlUyhV/zQREjakFsUdnL3Lc9sqhVsR7czVLl4tYQNRZRkzcyKyv3c0g8OVgfoopwYYwc027enDroSlduCRy15QEllvmb6kItbzgS93DLQ81xBqvVtFc8rJLf2/8Ij6bvPf4XaMI2CMgyxmzvw/AKalJKYVFjCcm2rkZ4Hh2JyoyJwhdQ+Oec65axuugA6Kmo1PYNOUR12Ha2Y3938CFHael3mC6YRvZr”
Ruby’s Base64.decode64 returns a string of size 201 (decoded.size = 201)
Elixir’s Base.decode64! returns a string of size 191 (String.length(decoded) = 191)
That’s because String.length/1 does not give the size in bytes, but the number of graphemes. If you call byte_size(Base.decode64!(encoded)) you get the expected value of 201.
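A minimal sketch of the difference, using a hand-built two-byte binary instead of the original encoded string (any binary containing multibyte UTF-8 sequences behaves the same way):

```elixir
# These two bytes happen to be the UTF-8 encoding of "é":
bin = <<0xC3, 0xA9>>

byte_size(bin)      # size in bytes: 2
String.length(bin)  # number of graphemes: 1
```

So whenever the decoded bytes happen to form valid multibyte UTF-8 sequences, String.length/1 reports fewer “characters” than there are bytes.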
Good point about reading the docs, but as for them doing what they claim: to be fair, the String module contains functions for dealing with UTF-8 encoded strings, as the name suggests, not with arbitrary binaries.
As recent posts have shown, bitstrings vs. binaries vs. strings can be a tricky topic that deserves attention. But I don’t think this is a case of surprising behavior in the String module.
I just noticed that my wording was confusing… I wanted to point out that (at least for me) well-named functions, precisely because they are, well, well-named (no pun intended), can lead you to skip checking the docs… I’d never have thought to check the docs for String.length… which, TIL, I should do anyway.
And for sure, the docs and the behavior are the right ones…
Yes, and don’t get me wrong, I definitely fell into this trap too at some point. Unicode and encoding are difficult topics, but they’re also important in a world where English is not the only language.
This is more about wrong assumptions than library docs IMO.
People are used to strings being byte/ASCII arrays, while several languages, Elixir and Rust included, make it clear up front that their strings are UTF-8 encoded text (graphemes and codepoints can be extracted separately as well, which is very useful).
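For the record, a quick sketch of those three levels in Elixir, using a combining accent (U+0301) so that all three counts come out different:

```elixir
# "é" written as a plain "e" followed by the combining acute accent:
s = "e\u0301"

String.length(s)              # graphemes: 1
length(String.codepoints(s))  # codepoints: 2
byte_size(s)                  # UTF-8 bytes: 3
```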