Parsing Git Packfile variant length headers

I’m implementing a pure Elixir Git server (supporting SSH and HTTP) for fun an profit.

One of the tedious tasks i’m currently working on is the parsing of Git packfiles.

The pack-format documentation states following:

  packed object header:
      1-byte size extension bit (MSB)
           type (next 3 bit)
           size0 (lower 4-bit)
      n-byte sizeN (as long as MSB is set, each 7-bit)
   		   size0..sizeN form 4+7+7+..+7 bit integer, size0
    	   is the least significant part, and sizeN is the
    	   most significant part.

With Elixir we have the possibility to pattern-match on bitstrings which is great. In order to find the variant size of an object, i came with following implementation:

  # MSB is not set, size is contained in four bits.
  defp unpack_obj_head(<<0::1, type::3, size::4, rest::binary>>) do
    {type, size, rest}
  end

  # MSB is set, read next byte
  defp unpack_obj_head(<<1::1, type::3, obj_num::bitstring-4, rest::binary>>) do
    {size, rest} = unpack_obj_size(rest, obj_num)
    {type, size, rest}
  end

  # MSB is not set, calculate size based on acc_num and obj_num
  defp unpack_obj_size(<<0::1, obj_num::bitstring-7, rest::binary>>, acc_num) do
    with acc <- <<acc_num::bitstring, obj_num::bitstring>>,
         len <- bit_size(acc),
       <<num::integer-size(len)>> <- acc, do: {num, rest}
  end

  # MSB is set, read next byte
  defp unpack_obj_size(<<1::1, obj_num::bitstring-7, rest::binary>>, acc_num) do
    unpack_obj_size(rest, <<acc_num::bitstring, obj_num::bitstring>>)
  end

But I’m missing something here, the result size does not match the size object body…
Clearly, I’m doing something wrong here:

with acc <- <<acc_num::bitstring, obj_num::bitstring>>,
     len <- bit_size(acc),
   # next line is not correct, we need to shift things here
   <<num::integer-size(len)>> <- acc, do: {num, rest}

I came across a great article, git clone in Haskell from the bottom up. It states:

The overall length is then: size0 + size1 + … + sizeN.
After shifting each part 0, 4 + (n-1) * 7 to the left. size0 is the least, sizeN the most significant part.

Is there a simple way to achieve this with Elixir?

3 Likes

With the help of @nox on the #elixir IRC channel, I finally got it working properly.

The solution was to

  1. left-shift each length part with 4+7*i.
  2. add the result to the accumulator.
  defp unpack_obj_head(<<0::1, type::3, size::4, rest::binary>>) do
    {type, size, rest}
  end

  defp unpack_obj_head(<<1::1, type::3, num::4, rest::binary>>) do
    {size, rest} = unpack_obj_size(rest, num, 0)
    {type, size, rest}
  end

  defp unpack_obj_size(<<0::1, num::7, rest::binary>>, acc, i) do
    {acc + (num <<< 4+7*i), rest}
  end

  defp unpack_obj_size(<<1::1, num::7, rest::binary>>, acc, i) do
    unpack_obj_size(rest, acc + (num <<< (4+7*i)), i+1)
  end
3 Likes

If someone has some Haskell knowledges, I’m stuck with the next parsing problem: Extract the offset and size length in bits for delta copy commands.

Git deltas are a chain of copy/insert commands to apply to an existing Git object. Git uses a variant length to store offset and size.

Basically, my Elixir implementation so far looks like this:

  # MSB is not set, this is an INSERT command.
  defp unpack_obj_delta_hunk(<<0::1, size::7, data::binary-size(size), rest::binary>>, cmds) do
    IO.puts "insert #{inspect data}"
    unpack_obj_delta_hunk(rest, [{:insert, data}|cmds])
  end

  # MSB is set, this is a COPY command.
  defp unpack_obj_delta_hunk(<<1::1, _::1, 0::1, 1::1, offset::4, size::8, rest::binary>>, cmds) do
    IO.puts "copy from byte #{offset} to #{size}"
    unpack_obj_delta_hunk(rest, [{:copy, {offset, size}}|cmds])
  end

In order to parse correctly a COPY command, i need to parse offset and size length in bits.
This information is stored in the first byte:

Bit 1: MSB, used to check if the command is COPY or INSERT.
Bit 2: Is always 0
Bit 3-4: Big question mark, this seems to tell how long (in bits) offset and size are.
Bit 4-8: Offset (or not, depending on bits 3-4).

I do have an Haskell implementation code but I do not fully understand it:

I someone can help me on this, I would be more than happy =)

1 Like