Exor_filter, an xor_filter NIF. 'Faster and Smaller Than Bloom and Cuckoo Filters'

May I present exor_filter! You can find the library on my github. It is a nif wrapper for the xor_filter. A xor_filter is like a bloom filter, but is faster, has a smaller memory footprint, and has a smaller false-positive rate than bloom filters!

There are a good amount of use cases, but I was planning on using it for a bad-word filter for the commenting system on my personal blog. I think this one will cover l33t speak better than a bloom filter, because it is less memory constrained.

An issue that I believe exists is loading a large volume of values on initialization. If this is done other on startup, it could delay the VM due to it not being a dirty nif. However, it wouldnā€™t make sense to make it dirty due to the low access time of the filter. Its a bit of a catch 22. Using the xor8_buffered_initialize function could help, as the original repo says the buffered initialization is faster.

More information could be found in the linked original repository.

5 Likes

This is excellent. I think Iā€™m going to use your code as a template for understanding erlang resources and to make sure I can have good support for them in my zigler library.

2 Likes

Awesome! This is my first nif so it may not be 100% correct, I have yet to profile the memory for leaks. I found the official documentation a bit confusing, so I worked off the jiffy library as an example on how to do it.

I have pushed an update to his library. It now includes 2 default hashing mechanisms, as well as an easier to use interface. Also, some issues with the C nif code have been resolved. Here is an example usage of the library (Iā€™d add it to the first post but it seems that I canā€™t edit it anymore?)

Filter = exor_filter:xor8(["test1", "test2", "test3"]),
true   = exor_filter:xor8_contain(Filter, "test1"),
false  = exor_filter:xor8_contain(Filter, "test6"),
ok     = exor_filter:xor8_free(Filter).

Passing a list of any erlang term works now, instead of having to pass pre-hashed data. But the option to pass pre-hashed data or a special hashing function still is an option.

Next up is benchee to compare against ETS and a bloom filter nif!

Also, it seems that 10M records is quite slow to initialize, this might need to be split into two nifs. A dirty nif that does the loading if the list size is over 100K (?), then the second regular nif for just for access. The pass filter references would need to be passed between themā€¦

Also, it seems there is no mechanism to send a fun to a nif and call it (besides coordination between a nif and genserver). Is this possible? Itā€™d reduce the memory footprint for the custom hash functions as I could call the function as the filter is being assigned in the C code. A copy of the original list needs to be made at the current timeā€¦

As I recall a dirty set or not can be per call?

Calling back into the VM is not efficient, message passing is expected to be used. Could have the custom function be Lua via luajit or support wasm or so?

1 Like

I actually just pushed an update that chooses a dirty version of the nif if the passed list is over 10K records. So it was simpler than I thought :smiley: Just exported two versions of the nif function.

2 Likes

Iā€™ve updated this module! It can be found here: https://github.com/mpope9/exor_filter/releases/tag/v0.3.0

Much has changed! Here is a short list:

  1. Error when duplicate keys are present in the hash function and pre-hashed options, there is no longer spinning (hanging forever on initialization, which was a bit of an issue)!
  2. A ā€˜rawā€™ interface to avoid this, use with caution. It is faster, because it doesnā€™t check for duplicates but doesnā€™t protect from the infinite initialization.
  3. Convenience modules xor8 and xor16 .
  4. Dialyzer fixes
  5. Moar tests
  6. Documentation updates.
  7. Simplified C and erlang code.
  8. Removal of fast_hash option, as duplication checking would have voided the usefulness.

With the convenience modules added, usage is now simplified:

Filter = xor8:new(["cat", "dog", "mouse"]),
true   = xor8:contain(Filter, "cat"),
false  = xor8:contain(Filter, "goose"),
ok     = xor8:free(Filter).
2 Likes

perhaps someone can check me if Iā€™m wrong, but your filter selector https://github.com/mpope9/exor_filter/blob/v0.3.0/src/exor_filter.erl#L296 is O(N) in list length, due to the length/1 call, which you may or may not want. I mean obviously you have to eventually traverse the entire list to unmarshal it, but you will be going over potentially super-long lists more than once.

2 Likes

So it is! https://erlang.org/doc/efficiency_guide/commoncaveats.html#length-1 I always assumed the function was constant timeā€¦I will look into alternatives, but it could be the fastest way anyways, due to it being implemented in C code.

in theory since itā€™s boxed anyways, thereā€™s no real reason why erlang couldnā€™t have tagged their list indices :man_shrugging:. You might be able to record that length once outside and pass it in? But this might be premature optimization. I imagine the primary use case will be to prepopulate it once, when itā€™s okay to be slow (like on boot, or like daily updates)

1 Like

I think its because the space cost. It would increase the size of each cons-cell by ~30%, but only as long as lists remain ā€œshortā€ (shorter than an unsigned pointer sized integer could represent), then all of a sudden we have big ints in the conscell, and those get expensive quicklyā€¦

As soon as this limit has been passed, you wouldnā€™t even be able anymore to cons in constant time, as increasing a big int is not O(1) anymore!

2 Likes

thatā€™s a good point! But I donā€™t think you can realistically cons a list that is bigger than 2^64, since even at one byte per thatā€™s probably about 10-100x more data than the NSA has in utah.

1 Like

There are still systems in use that have only 32 bit pointer size.

2 Likes

Actually just thought of a workaround, is_greater_than_10_thousand, which would return true at 10001 iterations. Will work that next. There is also an improvement to the underlying library which will increase initialization speed in some cases. Will bundle these together when Iā€™m free.

New release, v0.4.1! Can be found here:

A few things that this addresses:

  • Ran cppcheck on underlying xorfilter library and on nif code, fixed minor errors.
  • Performance increases, as the underlying library now handles dupes. Can be removed from the erlang code.
  • Removed call to legnth/1 and added a function that has max 10001 iterations for checking list length. Better for large lists.
3 Likes

New release, v0.5.2! It can be found here:

A few new features have been added, included serialization of the filter to binary. This is useful for sending across networks or storing in databases. Also fixed a compiler error on Erlang 23. If below Erlang 23 is needed, version v.0.5.1 is recommended.

Also got around to benchmarking the library! Here is the benchmark suite. It compares results to several bloom filter libraries, one that is also a nif. The results look great, so if youā€™re using a bloom filter you should consider switching!

3 Likes

New release, v0.6.0! You can find it here: https://github.com/mpope9/exor_filter/releases/tag/v0.6.0

This version adds an incremental initialization api. The previous api required you to pass the entire list. Now, the list can be assembled in the nif over time, which should be more memory efficient. Here is an example:

Filter0 = xor8:new_empty(2),           %% new_empty/0 defaults to 64 elements.  Either function
                                       %% will dynamically allocate more space as 
                                       %% needed while elements are added.
Filter1 = xor8:add(Filter0, [1, 2]),
Filter2 = xor8:add(Filter1, [3, 4]),   %% More space allocated here.
Filter3 = xor8:finalize(Filter3),      %% finalize/1 MUST be called to actually intialize the filter.
true    = xor8:contain(Filter3, 1),
false   = xor8:contain(Filter3, 6).

Deduplication of input data for the incremental initialization api isnā€™t supported yet, nor is a dirty nif api for large input initialization. finalize will fail if dups are present. The next release will add support for those.

1 Like

Version v0.7.0 has just been released! This version adds deduplication of input to the incremental initialization API. This means there is a stable API to load data over time without having to do any tracking of data outside of the filter. Example:

Filter0 = xor8:new_empty(),
Filter1 = xor8:add(Filter0, ["test1", "test2", "test3", "test4"]),
Filter2 = xor8:add(Filter1, ["test3", "test4"]), 
Filter3 = xor8:finalize(Filter3),
true    = xor8:contain(Filter3, "test1"),
false   = xor8:contain(Filter3, "test6").

It can be found here: https://github.com/mpope9/exor_filter/releases/tag/v0.7.0

Under the hood, it uses an ewah compressed bitmap to prevent duplicate elements.

Also available on hex.pm now :slight_smile: https://hex.pm/packages/exor_filter