Int_set - A time- and memory-efficient data structure for positive integers

I’ve written a data structure that might be helpful for people working with collections of integers.

It stores integers in a binary via their index, so zero is stored as <<1::1>>, one is stored as <<0::1, 1::1>>, etc. This means that set operations like union and intersection use bitwise operations and are blazing fast. I benchmarked it against MapSet and found that unions are six times faster, and intersections are sixty-four times faster!! Other operations like put and membership checking are a little slower, but if you’re doing set operations on collections of numbers, I think you’ll find a lot of memory and time savings using this library.

This library also supports converting to and from raw binaries, so it can be serialized quite easily. Serialization also benefits from the memory savings, since each integer only takes up a single bit.

The drawback with this library is that memory size scales with the largest integer in the collection. I plan on researching how binaries are handled by the VM to see what optimizations I can do to save on some memory in sparse sets. In most cases, however, as long as your inputs are properly bounded, you’re saving on memory and winning some really fast set operations.

You can use IntSet as a drop-in replacement for MapSet. Check out the README for examples: GitHub - Cantido/int_set: A time- and memory-efficient data structure for positive integers.

10 Likes