Why does a bitstring use more words than a binary?

denvaar · August 30, 2021, 2:56pm

I’ve been reading this article: How Elixir Lays Out Your Data in Memory. From what I understand, :erts_debug.size/1 can be used to see how many “words” of memory a given term takes up. For example:

iex(3)> :erts_debug.size(:a)
0
iex(4)> :erts_debug.size("abc")
6
iex(5)> :erts_debug.size([1])
2
iex(6)> :erts_debug.size([1, 2])
4
iex(7)> :erts_debug.size([])
0
iex(8)> :erts_debug.size(1)
0
iex(9)> :erts_debug.size(1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000)
12
iex(10)> :erts_debug.size(%{"a" => 1})
12

But what I can’t figure out is why a single bit uses 11 words, while a byte uses only 6:

iex(1)> :erts_debug.size(<<1::1>>)
11
iex(2)> :erts_debug.size(<<97>>)
6

Could someone explain why this is so?

hauleth · August 30, 2021, 7:11pm

I haven’t (yet) searched in code, but I see 2 possible explanations:

1. there was an "abc" string somewhere in the shared heap
2. small string optimisation

The latter is quite an interesting beast. It works due to the fact that on a 64-bit machine (most of those in use nowadays), when you store a pointer, some of the top bits will never be set. This means that if those bits are set, the value is 100% not a pointer to valid memory. That means we can do a hack like storing a tag in a few of the top bits and the data in the rest. For example, if a binary is shorter than 7 bytes, it can be stored directly inside the 64-bit pointer. I assume that if such an optimisation is used, it may be applied only to “full” binaries and not to bitstrings.
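To make the idea concrete, here is a minimal sketch of that kind of tagging in C. It only illustrates the general technique and is not a claim about what the BEAM does: the 16-bit tag, the marker bit, and the names `pack_inline`, `is_inline`, `inline_len` and `unpack_inline` are all made up for this example, and it assumes valid user-space pointers never have bit 63 set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, not taken from the BEAM: the top 16 bits of a
 * 64-bit word hold a tag, the low 48 bits hold up to 6 bytes of data. */
#define INLINE_TAG UINT64_C(0x8000) /* marker bit: "not a real pointer" */
#define INLINE_MAX 6                /* bytes that fit in 48 bits */

/* Pack up to INLINE_MAX bytes into one tagged word. */
static uint64_t pack_inline(const unsigned char *bytes, size_t len)
{
    assert(len <= INLINE_MAX);
    uint64_t word = (INLINE_TAG | (uint64_t)len) << 48;
    for (size_t i = 0; i < len; i++) {
        word |= (uint64_t)bytes[i] << (8 * i);
    }
    return word;
}

/* A word is an inline string if the marker bit (bit 63) is set. */
static int is_inline(uint64_t word)
{
    return (word >> 63) != 0;
}

/* The length lives in the low three bits of the 16-bit tag. */
static size_t inline_len(uint64_t word)
{
    return (size_t)((word >> 48) & 0x7);
}

/* Copy the inline bytes back out and return how many there were. */
static size_t unpack_inline(uint64_t word, unsigned char *out)
{
    size_t len = inline_len(word);
    for (size_t i = 0; i < len; i++) {
        out[i] = (unsigned char)(word >> (8 * i));
    }
    return len;
}
```

With this layout, `pack_inline((const unsigned char *)"abc", 3)` yields `0x8003000000636261`, while an ordinary user-space address such as `0x00007ffdeadbeef0` is not recognised as inline. Nothing in it can express a length in bits, which is why I would guess such a scheme, if used, covers only whole-byte binaries.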

dom · August 31, 2021, 1:54pm

I think the struct that represents a binary on the heap can only have a size in bytes, whereas sub-binary references can also have bit size/offset, so <<1::1>> is stored as a binary + a sub-binary. See the branch here:

github.com

erlang/otp/blob/OTP-24.0.5/erts/emulator/beam/emu/bs_instrs.tab#L562

    erts_bin_offset = 0;
    erts_writable_bin = 0;
    hb = (ErlHeapBin *) HTOP;
    HTOP += heap_bin_size(num_bytes);
    hb->thing_word = header_heap_bin(num_bytes);
    hb->size = num_bytes;
    erts_current_bin = (byte *) hb->data;
    new_binary = make_binary(hb);

 do_bits_sub_bin:
    if (num_bits & 7) {
        ErlSubBin* sb;

        sb = (ErlSubBin *) HTOP;
        HTOP += ERL_SUB_BIN_SIZE;
        sb->thing_word = HEADER_SUB_BIN;
        sb->size = num_bytes - 1;
        sb->bitsize = num_bits & 7;
        sb->offs = 0;
        sb->bitoffs = 0;
        sb->is_writable = 0;

And the struct declarations:

github.com

erlang/otp/blob/OTP-24.0.5/erts/emulator/beam/erl_binary.h#L174

    #define ERL_ONHEAP_BIN_LIMIT 64

    #define ERL_SUB_BIN_SIZE (sizeof(ErlSubBin)/sizeof(Eterm))
    #define HEADER_SUB_BIN _make_header(ERL_SUB_BIN_SIZE-2,_TAG_HEADER_SUB_BIN)

    /*
     * This structure represents a HEAP_BINARY.
     */

    typedef struct erl_heap_bin {
        Eterm thing_word;    /* Subtag HEAP_BINARY_SUBTAG. */
        Uint size;           /* Binary size in bytes. */
        Eterm data[1];       /* The data in the binary. */
    } ErlHeapBin;

    #define heap_bin_size(num_bytes) \
      (sizeof(ErlHeapBin)/sizeof(Eterm) - 1 + \
       ((num_bytes)+sizeof(Eterm)-1)/sizeof(Eterm))

    #define header_heap_bin(num_bytes) \

github.com

erlang/otp/blob/OTP-24.0.5/erts/emulator/beam/erl_bits.h#L31

    #ifndef __ERL_BITS_H__
    #define __ERL_BITS_H__

    /*
     * This structure represents a SUB_BINARY.
     *
     * Note: The last field (orig) is not counted in arityval in the header.
     * This simplifies garbage collection.
     */

    typedef struct erl_sub_bin {
        Eterm thing_word;    /* Subtag SUB_BINARY_SUBTAG. */
        Uint size;           /* Binary size in bytes. */
        Uint offs;           /* Offset into original binary. */
        byte bitsize;
        byte bitoffs;
        byte is_writable;    /* The underlying binary is writable */
        Eterm orig;          /* Original binary (REFC or HEAP binary). */
    } ErlSubBin;
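As a rough check of the arithmetic, the two structs quoted above can be compiled standalone. This is only a sketch: `Eterm` and `Uint` are stood in for by `uint64_t` (an assumption that matches a typical 64-bit build), and `bitstring_words` is a helper name I made up to mirror the quoted branch, not an OTP function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the emulator's word types; assumed to be 64-bit here. */
typedef uint64_t Eterm;
typedef uint64_t Uint;
typedef unsigned char byte;

#define ERL_ONHEAP_BIN_LIMIT 64

/* Field layout copied from the excerpts above. */
typedef struct erl_heap_bin {
    Eterm thing_word;
    Uint size;
    Eterm data[1];
} ErlHeapBin;

typedef struct erl_sub_bin {
    Eterm thing_word;
    Uint size;
    Uint offs;
    byte bitsize;
    byte bitoffs;
    byte is_writable;
    Eterm orig;
} ErlSubBin;

#define ERL_SUB_BIN_SIZE (sizeof(ErlSubBin) / sizeof(Eterm))
#define heap_bin_size(num_bytes) \
    (sizeof(ErlHeapBin) / sizeof(Eterm) - 1 + \
     ((num_bytes) + sizeof(Eterm) - 1) / sizeof(Eterm))

/* Hypothetical helper: words used when a bitstring is built as in the
 * quoted branch, i.e. a heap binary rounded up to whole bytes plus a
 * sub-binary when num_bits is not a multiple of 8. */
static size_t bitstring_words(size_t num_bits)
{
    size_t num_bytes = (num_bits + 7) / 8;
    assert(num_bytes <= ERL_ONHEAP_BIN_LIMIT);
    size_t words = heap_bin_size(num_bytes);
    if (num_bits & 7) {
        words += ERL_SUB_BIN_SIZE;
    }
    return words;
}
```

Under those assumptions `ERL_SUB_BIN_SIZE` comes out as 5 words, `bitstring_words(8)` as 3 and `bitstring_words(1)` as 8. The absolute numbers differ from the 6 and 11 reported in iex, presumably because the binary underneath is represented differently there (that part is a guess), but the gap between a byte and a single bit is the same 5 words: 11 - 6 = 5.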

denvaar · August 31, 2021, 3:12pm

This is really interesting. I guess the lesson here is to remember that binaries/bitstrings are first and foremost a way to work with bytes. Sure, you can operate on bits too, but that doesn’t make the memory allocated any smaller (as we can see, it does the opposite). Is this a correct way to think about it?

dimitarvp · August 31, 2021, 7:47pm

Not precisely. The extra overhead for working with bitstrings is fixed and never goes beyond what you saw. It’s just that there’s no way to store only 3 bits somewhere and have that take a byte at most. You still need some management metadata for such a data structure, but that metadata is very small and is quickly amortized once you manage even several bytes’ worth of bit data.
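A small model of that amortization, as a sketch only: the constants below are read off the structs quoted earlier in the thread under the assumption of 8-byte words (2 header words for a heap binary, 5 words for a sub-binary), and the function names are invented for illustration. It covers only binaries small enough to live on the heap (64 bytes, per `ERL_ONHEAP_BIN_LIMIT`).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed 64-bit constants derived from the quoted structs: 8-byte
 * words, 2 header words per heap binary, 5 words per sub-binary. */
enum { WORD_BYTES = 8, HEAP_BIN_HEADER_WORDS = 2, SUB_BIN_WORDS = 5 };

/* Words of payload needed to hold num_bits, rounded up to whole words. */
static size_t payload_words(size_t num_bits)
{
    size_t num_bytes = (num_bits + 7) / 8;
    return (num_bytes + WORD_BYTES - 1) / WORD_BYTES;
}

/* Metadata words: the same for every bitstring whose length is not a
 * multiple of 8 bits, however long it is. */
static size_t metadata_words(size_t num_bits)
{
    size_t words = HEAP_BIN_HEADER_WORDS;
    if (num_bits % 8 != 0) {
        words += SUB_BIN_WORDS;
    }
    return words;
}

/* Share of the total footprint spent on metadata, in whole percent. */
static unsigned metadata_percent(size_t num_bits)
{
    assert(num_bits > 0 && num_bits <= 64 * 8);
    size_t meta = metadata_words(num_bits);
    return (unsigned)(100 * meta / (meta + payload_words(num_bits)));
}
```

In this model a 1-bit and a 509-bit bitstring both carry 7 words of metadata. For 1 bit that is 87% of the footprint; for 511 bits it is already down to 46%, and it keeps shrinking as the payload grows.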