Why do Base.encode16 & Base.decode16 require padding strings with leading 0's to force even character counts?

mikejm · November 8, 2024, 1:21am

I have observed something peculiar with the Elixir Base module. Typically if you have an integer like:

int64 = 586233761406910463

You may wish to convert this to a hexadecimal string like: 822b87fffffffff

For example, you can do this here:

https://www.binaryhexconverter.com/hex-to-decimal-converter

https://www.binaryhexconverter.com/decimal-to-hex-converter

This site will provide the conventional conversion with these two results. However, Elixir’s Base.encode16 and Base.decode16 do not provide this conventional function.

Unlike encode32/decode32 and encode64/decode64, the base 16 functions seem uniquely strange in that they pad the resulting string with 0’s, or require the input string to be padded with 0’s, so that it has an even number of characters.

For example, if you go:

bytes = <<586233761406910463::64>>
# <<8, 34, 184, 127, 255, 255, 255, 255>>
Base.encode16(bytes, padding: false, case: :lower)
# "0822b87fffffffff"

This returns a 0 padded string. If you try to run this in reverse, the 0 pad is also required in reverse:

Base.decode16("822b87fffffffff", padding: false, case: :lower)
# :error
Base.decode16("0822b87fffffffff", padding: false, case: :lower)
# {:ok, <<8, 34, 184, 127, 255, 255, 255, 255>>}

I have casually checked, and it seems Base.decode16 requires any input string to have an even number of characters. You must pad any odd-length string with a 0 in front for it to decode. Similarly, Base.encode16 always returns an even-length string, with a leading 0 wherever the conventional conversion would have an odd length.

This is not typical or expected behavior from anything I have encountered before. One can strip the leading 0 when it exists and add leading 0 when string length is odd, but it is peculiar to me. This also does not occur in base32 or base64 processing of this module so it is not expected or consistent.

I cannot find any GitHub for this module to post a bug question/report. It seems perhaps to be part of the Elixir core.

encode32 / decode32 / encode64 / decode64 all have no problem making or receiving odd string lengths as expected.

Why are encode16 and decode16 enforcing even string lengths? If this is a bug, is it anything that can be fixed?

Is this intentional? Thanks for any thoughts.

benwilson512 · November 8, 2024, 2:49am

This isn’t doing what you think it’s doing. You can see this at the start by just looking at the binary you make:

iex(1)> <<586233761406910463::64>>
<<8, 34, 184, 127, 255, 255, 255, 255>>

See the 255 in there from the start? That’s what’s turning into ff, it isn’t being added by Base.encode16 at all.

NobbZ · November 8, 2024, 8:50am

The Base module is not about converting numbers between different bases. You have Integer.parse/2 and Integer.to_string/2 for that.

The Base module provides functionality for encoding arbitrary binary into their baseX encoding according to RFC 4648.

This RFC explicitly requires some padding.

The best-known use cases for this are attachments in emails, binary data in JSON and YAML, and basic authentication in HTTP.
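To illustrate the distinction drawn above between converting a number and encoding a binary, here is a minimal sketch using only `Integer.to_string/2`, `Integer.parse/2` and `Base.encode16/1`. The sample values are chosen for illustration.

```elixir
# Number -> digits: works on the integer's value, so there is no leading zero.
Integer.to_string(8, 16)    # => "8"
Integer.to_string(255, 16)  # => "FF"

# Binary -> text: works on bytes and always emits two characters per byte.
Base.encode16(<<8>>)        # => "08"
Base.encode16(<<255>>)      # => "FF"

# Here the input is the three-byte text "255", not the number 255.
Base.encode16("255")        # => "323535"

# Integer.parse/2 reads hex digits back and returns the unparsed remainder.
Integer.parse("822b87fffffffff", 16)  # => {586233761406910463, ""}
```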

edit: add note about usecases

mikejm · November 11, 2024, 3:43am

It’s not the ff that is the “problem.” The “correct” value is:

822b87fffffffff

This is reporting

0822b87fffffffff

I.e., the same thing but with a zero at the start.

I am not sure, @NobbZ, why Base.encode32 and Base.encode64 don’t do the same, then. No extra zeros or ‘a’s are added at the beginning with them.

Either way, if it’s intentional that’s fine. The inconsistency made me think it is an error. No worries.

kip · November 11, 2024, 4:15am

Using your example, simplified, we can see:

iex> Base.encode16 <<8::8>>
"08"

And the same is true for any 8-bit value. So here one byte is converted to 2 encoding characters.

The RFC says:

The encoding process represents 8-bit groups (octets) of input bits
as output strings of 2 encoded characters. Proceeding from left to
right, an 8-bit input is taken from the input data. These 8 bits are
then treated as 2 concatenated 4-bit groups, each of which is
translated into a single character in the base 16 alphabet.

Which does seem to suggest that this is correct. Each octet (8 bits) is converted into two base16 character indices.
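The process quoted from the RFC can be sketched by hand. This is an illustration only, not how the `Base` module is implemented; `NibbleSketch` is a hypothetical module name.

```elixir
defmodule NibbleSketch do
  # The base 16 alphabet from RFC 4648, indexed by nibble value 0..15.
  @alphabet "0123456789ABCDEF"

  # Walk the binary 4 bits at a time, left to right, and emit one character
  # per 4-bit group. Every byte therefore produces exactly two characters.
  def encode16(bytes) when is_binary(bytes) do
    for <<nibble::4 <- bytes>>, into: "" do
      binary_part(@alphabet, nibble, 1)
    end
  end
end

NibbleSketch.encode16(<<8>>)  # => "08" (high nibble 0, low nibble 8)
NibbleSketch.encode16("f")    # => "66"
```

The leading 0 in "08" is the high nibble of the byte itself, so removing it would drop four bits of the input.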

Further, the RFC says:

Unlike base 32 and base 64, no special padding is necessary since a
full code word is always available.

Which, given that each octet is represented in the output stream as 2 encoded characters, makes sense.
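For comparison, a short sketch of where padding does show up in the other encodings. The inputs are taken from the RFC 4648 test vectors; the `padding: false` option applies to `Base.encode32/2` and `Base.encode64/2`.

```elixir
# One byte is 8 bits. Base 16 consumes 4 bits per character, so a byte
# always fills exactly two characters and nothing is left over.
Base.encode16("f")  # => "66"

# Base 32 (5 bits per character) and base 64 (6 bits per character) do not
# divide 8 evenly, so a lone byte is padded at the END with "=".
Base.encode32("f")  # => "MY======"
Base.encode64("f")  # => "Zg=="

# The padding can be switched off for these two.
Base.encode32("f", padding: false)  # => "MY"
Base.encode64("f", padding: false)  # => "Zg"

# Inputs that fill whole groups (5 bytes for base 32, 3 bytes for base 64)
# need no padding at all.
Base.encode32("fooba")  # => "MZXW6YTB"
Base.encode64("foo")    # => "Zm9v"
```

In all three encodings the padding, when there is any, is appended at the end. The leading 0 seen in base 16 output is part of the first byte's value, not padding.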

I ran all the example assertions in the spec for Base16, all of which passed. You’ll note that "f" is encoded as two characters "66" and all the encoded strings are of an even number of characters.

iex> Base.encode16("") == ""
true
iex> Base.encode16("f") == "66"
true
iex> Base.encode16("fo") == "666F"
true
iex> Base.encode16("foo") == "666F6F"
true
iex> Base.encode16("foob") == "666F6F62"
true
iex> Base.encode16("fooba") == "666F6F6261"
true
iex> Base.encode16("foobar") == "666F6F626172"
true

Therefore I think that Elixir is spec compliant. Perhaps other libraries imply a zero-prefix on encoded strings that are not of an even number of characters?

kip · November 11, 2024, 4:19am

Now back to your original question, and the comment that @benwilson512 already made. Base.encode is not the same as base conversion. For your specific example, the following would be appropriate and it returns the answer you are expecting:

iex> int64 = 586233761406910463
586233761406910463
iex> hex = Integer.to_string(int64, 16)
"822B87FFFFFFFFF"
iex> String.to_integer(hex, 16)
586233761406910463

mikejm · November 15, 2024, 3:26am

Just for the sake of clarity, I actually think these are the same thing.

I.e., Base.encode is the same as base conversion.

For example, here is the code I wrote for conversion of int 64 to base 16 strings and back:

# INT U64 TO BASE 16
defmodule My.IntToBase16 do # Hexadecimal (also known as base-16 or simply hex)

    def convert(int_64) do
        bytes = <<int_64::64>>
        # encode16/2 only takes the :case option and always emits two characters per byte
        string = Base.encode16(bytes, case: :lower)
        # Drop the leading zeros, but keep a single "0" when the input is 0
        case String.trim_leading(string, "0") do
            "" -> "0"
            trimmed -> trimmed
        end
    end

    # Function to strip leading zero bytes from a binary (not used by convert/1)
    def strip_leading_zeros(<<0, rest::binary>>) do
        strip_leading_zeros(rest)  # Recurse with the remaining binary
    end

    # Base case: return the binary when no leading zeros are left
    def strip_leading_zeros(binary) do
        binary
    end
end

And:

# BASE 16 STRING TO INT U64
defmodule My.Base16ToInt do
    def convert(hex_string) do
        # Drop a single optional "0x" prefix; decode16 returns :error if it is left in
        hex_string = String.replace_prefix(hex_string, "0x", "")
        hex_string = if rem(String.length(hex_string), 2) == 0 do
            hex_string
        else
            "0" <> hex_string # for issue requiring leading 0 # https://elixirforum.com/t/why-do-base-encode16-base-decode16-require-padding-strings-with-leading-0s-to-force-even-character-counts-bug/67334
        end

        # decode16/2 only takes the :case option; :lower rejects uppercase digits
        case Base.decode16(hex_string, case: :lower) do
            {:ok, bytes} ->
                :binary.decode_unsigned(bytes, :big) # :big means the bytes are in big-endian order
            :error ->
                nil
        end
    end
end

So far as I can tell these work identically to Integer.to_string(int64, 16) and String.to_integer(hex, 16) because that is what those functions are likely doing under the hood.

Is this wrong somehow?

The only caveat is I’m only looking at positive integers so I perhaps have not handled negatives? Not sure how those work as I’m not needing them.

But it appears the only difference between formal Base16 specifications and “Base 16 string” is that with the “string” we accept these can have odd character counts while the true specifications require even counts and leading 0’s.
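On the question of negative integers raised above, a minimal sketch of where the two approaches diverge. The sample values are chosen for illustration.

```elixir
# Integer.to_string/2 writes a sign in front of the digits.
Integer.to_string(-255, 16)  # => "-FF"

# A 64-bit segment stores the two's-complement bit pattern instead,
# so -1 becomes eight 0xFF bytes.
bytes = <<-1::64>>
Base.encode16(bytes)  # => "FFFFFFFFFFFFFFFF"

# Decoding those bytes as unsigned does not give -1 back.
:binary.decode_unsigned(bytes)  # => 18446744073709551615

# Matching with the signed modifier does recover it.
<<signed::signed-size(64)>> = bytes
signed  # => -1

# Values that do not fit in 64 bits are truncated silently: this is 2^64 + 1.
<<18_446_744_073_709_551_617::64>>  # => <<0, 0, 0, 0, 0, 0, 0, 1>>
```

So the `<<int::64>>` route matches `Integer.to_string/2` only for integers in the range 0 to 2^64 - 1.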

kip · November 15, 2024, 4:08am

I think there are two differences I see:

Base.encode16/1 will encode arbitrary binaries, whereas base conversion is focused on numbers. For example:

iex(3)> Base.encode16 "Thanks 😊"
"5468616E6B7320F09F988A"

Encoding per the RFC always produces an even number of output characters, whereas base conversion does not.

So perhaps it is better to say that base conversion (for numbers) is a subset of base16 encoding?
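One way to picture that relationship is a sketch that builds numeric hex conversion on top of `Base.encode16/1` and `Base.decode16/2`. `HexSketch` and its function names are hypothetical, and it only covers non-negative integers.

```elixir
defmodule HexSketch do
  # Integer -> hex digits, going through the byte-oriented encoder.
  def to_hex(n) when is_integer(n) and n >= 0 do
    n
    |> :binary.encode_unsigned()       # minimal big-endian bytes, e.g. 8 -> <<8>>
    |> Base.encode16()                 # two characters per byte, e.g. "08"
    |> String.replace_prefix("0", "")  # drop the first nibble when it is zero
  end

  # Hex digits -> integer. Odd-length input gets the zero nibble put back
  # so that the digits fill whole bytes.
  def from_hex(hex) when is_binary(hex) and hex != "" do
    padded = if rem(byte_size(hex), 2) == 1, do: "0" <> hex, else: hex

    case Base.decode16(padded, case: :mixed) do
      {:ok, bytes} -> {:ok, :binary.decode_unsigned(bytes)}
      :error -> :error
    end
  end
end

HexSketch.to_hex(586233761406910463)    # => "822B87FFFFFFFFF"
HexSketch.from_hex("822b87fffffffff")   # => {:ok, 586233761406910463}
```

Because `:binary.encode_unsigned/1` returns the fewest bytes that hold the number, at most one leading zero character can appear, and removing it gives the same digits as `Integer.to_string(n, 16)`.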