Similarity - A Library for easy cosine similarity calculation

Summary:
I have used the following formula in production for cosine similarity calculation
(list_a and list_b are both the same length):

  cosine_similarity(list_a, list_b) * :math.sqrt(length(list_a))

This is good because it ignores the scale of the attributes, while the :math.sqrt(...) part takes the number of common attributes into account. (First you want to extract the ordered common attributes from each pair of elements.)
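For illustration, here is a minimal sketch of the idea in plain Elixir. The module and function names are mine, not necessarily the library's API:

  defmodule WeightedCosine do
    # Plain cosine similarity of two equal-length lists of numbers.
    def cosine(list_a, list_b) when length(list_a) == length(list_b) do
      dot_product(list_a, list_b) / (magnitude(list_a) * magnitude(list_b))
    end

    # The formula from above: cosine weighted by the square root of the
    # number of common attributes, so pairs sharing more attributes rank higher.
    def weighted_cosine(list_a, list_b) do
      cosine(list_a, list_b) * :math.sqrt(length(list_a))
    end

    defp dot_product(list_a, list_b) do
      Enum.zip(list_a, list_b)
      |> Enum.reduce(0, fn {x, y}, acc -> acc + x * y end)
    end

    defp magnitude(list), do: :math.sqrt(dot_product(list, list))
  end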

This worked out so well, actually, that I was surprised by how good it was.

The library: https://github.com/preciz/similarity

If you have more than 10k lists with length > 200 and you need to calculate this frequently between all of them, then this might not cut it. (In that case I would probably check the library for how it's done and use the db to do the calculation.)

If you don’t know what this is for, here is an example pseudocode use case:

people
|> map(pictures)
|> map(image_labels)
|> calculate_similarities
|> save_to_db
|> power_suggestions_based_on_similarities
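To make the calculate_similarities step concrete, here is a hedged sketch using the WeightedCosine module sketched above. The %{id: ..., labels: ...} shape is hypothetical, and the extraction of ordered common attributes is skipped for brevity:

  # people: [%{id: 1, labels: [1.0, 0.0, 2.0]}, ...] (hypothetical shape;
  # the label lists are assumed to be already aligned per pair)
  def calculate_similarities(people) do
    for {a, i} <- Enum.with_index(people),
        {b, j} <- Enum.with_index(people),
        i < j do
      {a.id, b.id, WeightedCosine.weighted_cosine(a.labels, b.labels)}
    end
  end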

Thanks for checking it out! (I wrote this quickly today, so any corrections are welcome.)


Nice API :+1:

One suggestion is to set the source_ref option in the mix docs config to the git tag for the release. It ensures the links from the docs to the source code always get to the correct place (see the ExDoc docs).
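For example (the "v0.1.0" tag here is hypothetical; point it at whatever tag you cut the release from):

  # in mix.exs
  def project do
    [
      # ...
      docs: [
        source_ref: "v0.1.0"
      ]
    ]
  end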


Thank you, done.


Nice! For those larger lists (>10k) you could add an option to use Matrex for the dot products / cosines. I’m doing some weightings that way and it seems fast. :man_shrugging:
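For reference, a rough sketch of what a Matrex-based cosine could look like, assuming 1×n row vectors built with Matrex.new/1. Matrex.multiply/2 is element-wise, so summing its result gives a dot product:

  # a = Matrex.new([[1.0, 2.0, 3.0]]); b built the same way
  def matrex_cosine(a, b) do
    dot = a |> Matrex.multiply(b) |> Matrex.sum()
    mag_a = :math.sqrt(a |> Matrex.multiply(a) |> Matrex.sum())
    mag_b = :math.sqrt(b |> Matrex.multiply(b) |> Matrex.sum())
    dot / (mag_a * mag_b)
  end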


That is a nice suggestion, I didn't actually think about that. I also used Matrex locally in January to speed up a genetic programming experiment. I like that lib, it was fast.

However, that needs OS packages to be installed if I include it as a dependency, if I'm correct?

I would like to keep the main Similarity module as clean as it is now, so others can learn how this all works: someone can try out this use case, evaluate the business benefits for themselves, and then figure out how to scale it on their own.

I had to calculate with ~5 million records in prod, and if I had to redo it, I would move it to the db for sure. Why load all that data into memory?
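For example, a hedged sketch of pushing the cosine into Postgres through Ecto (MyApp.Repo and the calling convention are assumptions; multi-argument unnest pairs the two arrays up row by row):

  # list_a and list_b are equal-length lists of floats.
  def db_cosine(list_a, list_b) do
    %{rows: [[cos]]} =
      MyApp.Repo.query!(
        """
        SELECT sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))
        FROM unnest($1::float8[], $2::float8[]) AS t(a, b)
        """,
        [list_a, list_b]
      )

    cos
  end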

What do you think?

Also, for acceleration you can use a flow lib or streams. For example:

  # Lazily pairs up corresponding elements and sums their products.
  def dot_product(list_a, list_b) when length(list_a) == length(list_b) do
    [list_a, list_b]
    |> Stream.zip()
    |> Stream.map(fn {x, y} -> x * y end)
    |> Enum.sum()
  end

  # Euclidean norm: the square root of a list's dot product with itself.
  def magnitude(list) do
    list
    |> dot_product(list)
    |> :math.sqrt()
  end

and compare via Benchee which solution is faster…
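A minimal sketch of such a comparison, with dot_product/2 as defined above and an arbitrary input size:

  list = Enum.map(1..10_000, fn _ -> :rand.uniform() end)

  Benchee.run(%{
    "eager Enum.zip" => fn ->
      Enum.zip(list, list)
      |> Enum.map(fn {x, y} -> x * y end)
      |> Enum.sum()
    end,
    "lazy Stream.zip" => fn -> dot_product(list, list) end
  })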

You can pass a configuration option to compile it without BLAS, using the C fallbacks instead.

It makes sense to keep it simple if you're going for example code. Unfortunately, making the code use either backend would probably complicate it a lot.

I had to calculate with ~5 million records in prod, and if I had to redo it, I would move it to the db for sure. Why load all that data into memory?

Oh true, I guess that depends on whether you can peg the DB with that much load. With Elixir you could use rate limiting and spin it up on a temporary server.
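A hedged sketch of bounding that load with just the standard library; max_concurrency caps how many calculations (or DB queries, e.g. the db_cosine/2 sketch above) run at once:

  # pairs: a stream of {list_a, list_b} tuples (hypothetical)
  pairs
  |> Task.async_stream(fn {a, b} -> db_cosine(a, b) end,
    max_concurrency: 8,
    timeout: :timer.seconds(30)
  )
  |> Enum.map(fn {:ok, cos} -> cos end)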