Similarity - A Library for easy cosine similarity calculation

Sum:
I have used in production the folllowing formula for cosine similarity calculation:
(list_a and list_b are both the same length)
cosine_similarity(list_a, list_b) * :math.sqrt(length(list_a))

This is good because it ignores the scale of the attributes but the :math.sqrt(...) part takes into account the number of common attributes. (first you want to extract ordered common attributes from pairs of elements)

This worked out so great actually I was surprised how good it was.

The library: https://github.com/preciz/similarity

If you have >10k lists with length > 200 and you need to calculate this frequently between all then this might not cut it. (I would probably just check the library how to do it and use the db to calculate it).

If you don’t know what is this for then here is an
Example pseudo code use case:

people
|> map(pictures)
|> map(image_labels)
|> calculate_similarities
|> save_to_db
|> power_suggestions_based_on_similarities

Thanks for checking out! (I wrote this today quickly, any corrections welcome.)

4 Likes

Nice API :+1:

One suggestion is to set the source_ref option in the mix docs config to the git tag for the release. It ensures the links from the docs to the source code always gets to the correct place. docs

1 Like

Thank you, done.

1 Like

Nice! For those larger lists (>10k) you could add an option to use Matrex for the dot products / cosines. I’m doing some weightings that way and it seems fast. :man_shrugging:

1 Like

That is a nice suggestion. I didn’t think about that actually. I also used Matrex locally to speed up some genetic programming experiment in January. I like that lib, was fast.

However that needs an install of OS packages if I include it as dependency if I’m correct?

I would like to keep the main Similarity module as clean as it is now, for others to learn how this all works, just so somebody can try out this use case, evaluate his own business benefits, then figure out how to scale it on his own.

I had to calculate with ~5 million records in prod, and if I would have to redo it, I would move it to the db for sure. Why load all that data in memory?

What you think?

Also you can use for accelerating a flow lib or stream. For example:

  def dot_product(list_a, list_b) when length(list_a) == length(list_b) do
    [list_a, list_b]
    |> Stream.zip()
    |> Stream.map(fn {x, y} -> x * y end)
    |> Enum.sum()
  end

  def magnitude(list) do
    list
    |> dot_product(list)
    |> :math.sqrt()
  end

and compare via Benchee which solution is better…

You can pass a configuration option to compile without BLAS and using C fallbacks.

It makes sense to keep it simple if you’re going for a example code. Unfortunately making the code use either probably would complicate it a lot.

I had to calculate with ~5 million records in prod, and if I would have to redo it, I would move it to the db for sure. Why load all that data in memory?

Oh true, guess that depends on whether you can peg the DB with that much load. With Elixir you could use rate limiters and spin it up on a temporary server.

1 Like

There were 2 simhash libraries on hex.pm so far.

The “spirit_fingers” library which say it was implemented cause Elixir implementation was slow. This one fails to compile for me with a Rust NIF compile error.

The “simhash” package (probably author of spirit_fingers pointed to this one when saying it’s slow) I did submit a few PR to make it faster, I have included benchmarks etc. but so far these were ignored.
https://github.com/UniversalAvenue/simhash-ex/pulls

Based on the simhash library I have implemented a 2x faster version (in Elixir but uses the same NIF for hashing as “simhash”).

You can use it like:

iex(1)> Similarity.simhash("looking for work", "in elixir", ngram_size: 3)     
0.515625

Performance comparison:

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s

Benchmarking simhash-ex...
Benchmarking similarity.simhash...

Name                         ips        average  deviation         median         99th %
similarity.simhash        3.67 K      272.69 μs     ±6.50%      267.84 μs      353.05 μs
simhash-ex                1.75 K      572.14 μs    ±12.31%      552.22 μs      781.02 μs

Comparison:
similarity.simhash        3.67 K
simhash-ex                1.75 K - 2.10x slower +299.46 μs

Hope you like it. I’m open for any feedback. Thanks.

2 Likes

Spirit Fingers 0.4.1 released yesterday upgrading to elixir 1.14, rustler 0.27 and Rust 2021 stable 1.67.

Benchmarking against simhash-ex and similarity the underlying NIFs are showing the kind of performance numbers I needed and again need now I am back looking at this. I also need to efficiently handle large binaries too, processing them rapidly.

Comparison:
holsee/spirit_fingers           860.25 K
UniversalAvenue/simhash-ex        2.15 K - 399.92x slower +463.72 μs
preciz/similarity                 1.83 K - 470.50x slower +545.78 μs

I did notice some excessive memory consumption when passing > 1MB binaries to be compared which sadly ruled our similarity :frowning: and I didn’t run against simhash-ex as I experienced crashes against master.

Similarity is an awesome library @preciz very impressive set of APIs available, I will 100% keep it in mind if my use cases extend to cosine similarity calculations.

The same author as the simhash-ex wrote the rust lib I am leveraging which does all the heavy lifting. I merely put the thinnest layer on top and provided API which allows me to compute hashes out of band of the similarity analysis as it suited my app to do so, however in the benchmarks I used similar API for comparison.

If you see anything awry with the benchmarks or if there are versions of either lib available not on hex or master branches, be glad to update. I am looking to flame graph the rust code to understand its memory allocations further but seemingly it is benefiting from the direct native memory access and allocations.

Bechee:

====================================
=== Using binary size: 115 bytes ===
====================================

Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 16 GB
Elixir 1.14.3
Erlang 25.0.4

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 42 s

Benchmarking UniversalAvenue/simhash-ex ...
Benchmarking holsee/spirit_fingers ...
Benchmarking preciz/similarity ...

Name                                 ips        average  deviation         median         99th %
holsee/spirit_fingers           860.25 K        1.16 μs   ±221.80%        1.13 μs        1.42 μs
UniversalAvenue/simhash-ex        2.15 K      464.88 μs    ±13.30%      454.50 μs      577.06 μs
preciz/similarity                 1.83 K      546.94 μs    ±56.76%      512.75 μs      835.84 μs

Comparison:
holsee/spirit_fingers           860.25 K
UniversalAvenue/simhash-ex        2.15 K - 399.92x slower +463.72 μs
preciz/similarity                 1.83 K - 470.50x slower +545.78 μs

Memory usage statistics:

Name                          Memory usage
holsee/spirit_fingers            0.0391 KB
UniversalAvenue/simhash-ex       542.38 KB - 13884.80x memory usage +542.34 KB
preciz/similarity                588.42 KB - 15063.60x memory usage +588.38 KB

**All measurements for memory usage were the same**

========================================
=== Using binary size: 1150000 bytes ===
========================================

Note: excluding UniversalAvenue/simhash-ex as it errors.

Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 16 GB
Elixir 1.14.3
Erlang 25.0.4

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 28 s

Benchmarking holsee/spirit_fingers ...
Benchmarking preciz/similarity ...

Name                            ips        average  deviation         median         99th %
holsee/spirit_fingers         73.70       0.0136 s     ±3.00%       0.0136 s       0.0144 s
preciz/similarity            0.0805        12.41 s     ±0.00%        12.41 s        12.41 s

Comparison:
holsee/spirit_fingers         73.70
preciz/similarity            0.0805 - 914.98x slower +12.40 s

Memory usage statistics:

Name                     Memory usage
holsee/spirit_fingers      0.00000 GB
preciz/similarity             5.65 GB - 151767819.00x memory usage +5.65 GB

**All measurements for memory usage were the same**

=================================================
=== Spirit Fingers using various binary sizes ===
=================================================

Note: excluding preciz/similarity as memory usage was extreme and results didn't show within 10 mins
Note: excluding UniversalAvenue/simhash-ex as it errors.

Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 16 GB
Elixir 1.14.3
Erlang 25.0.4

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 42 s

Benchmarking holsee/spirit_fingers (1.15 MB) ...
Benchmarking holsee/spirit_fingers (11.5 MB) ...
Benchmarking holsee/spirit_fingers (115 bytes) ...

Name                                        ips        average  deviation         median         99th %
holsee/spirit_fingers (115 bytes)     849385.51     0.00118 ms    ±76.96%     0.00117 ms     0.00133 ms
holsee/spirit_fingers (1.15 MB)           73.25       13.65 ms     ±3.17%       13.77 ms       14.42 ms
holsee/spirit_fingers (11.5 MB)            7.31      136.83 ms     ±2.27%      137.00 ms      143.26 ms

Comparison:
holsee/spirit_fingers (115 bytes)     849385.51
holsee/spirit_fingers (1.15 MB)           73.25 - 11595.12x slower +13.65 ms
holsee/spirit_fingers (11.5 MB)            7.31 - 116218.34x slower +136.83 ms

Memory usage statistics:

Name                                 Memory usage
holsee/spirit_fingers (115 bytes)            40 B
holsee/spirit_fingers (1.15 MB)              40 B - 1.00x memory usage +0 B
holsee/spirit_fingers (11.5 MB)              40 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**
------------------------------------------------------

P.S. apolgises for the necrobump. :zombie:

2 Likes