Efficient way to perform vector dot product

OK, it makes more sense now.

First, it seems that you might profit from running comparisons between different documents concurrently. You could use Task.async_stream for this, which would ensure that you’re not running too many things at once. This should give you a speed up of x, where x is close to the number of CPU cores.
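
For illustration, the fan-out could look something like the sketch below, where document_ids, query_vector and compare/2 (the function scoring the query vector against one stored document) are placeholders:

document_ids
|> Task.async_stream(
  fn document_id -> {document_id, compare(query_vector, document_id)} end,
  max_concurrency: System.schedulers_online(),
  ordered: false,
  timeout: :infinity
)
|> Enum.map(fn {:ok, result} -> result end)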

This means that you’d need to store the documents in ETS tables. However, it doesn’t mean you need 1M ETS tables. In a simple solution, you could have just one table, where the keys would be e.g. {document_id, term}, and the values would be frequencies. Since a single table could get very big, you may want to resort to sharding, for example by using 1000 tables for 1M documents.
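
Purely as an illustration, the shard for a given document could be picked by hashing its ID (table names are placeholders; the function lives inside whichever module owns the tables):

# 1000 shard tables :vectors_0 .. :vectors_999, each holding
# {{document_id, term}, frequency} rows
defp shard_table(document_id) do
  :"vectors_#{:erlang.phash2(document_id, 1000)}"
end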

If you take the ETS approach, when you’re looking up a term, make sure to return only frequency, and not the key. Also, if you’re using it with multiple processes, turn on read_concurrency. You’ll also probably need write_concurrency, but it’s best if you determine this experimentally.
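
For example, a shard table could be created and queried along these lines (names are illustrative); :ets.lookup_element/3 returns just the stored value rather than the whole {key, value} tuple:

:ets.new(:vectors_0, [:set, :named_table, :public, read_concurrency: true])

# insert one {{document_id, term}, frequency} row
:ets.insert(:vectors_0, {{"doc_uuid1", "foo"}, 0.123})

# returns only 0.123 (the element in position 2);
# raises ArgumentError if the key is not present
frequency = :ets.lookup_element(:vectors_0, {"doc_uuid1", "foo"}, 2)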

When it comes to recursion, I mentioned it in the context of a single comparison between two documents. In your case, each step of the loop is simple and short (a lookup followed by a product), while the loop itself takes a lot of steps. In such scenarios, it can happen that the overhead of Enum becomes visible. So instead, you can try to hand-roll the loop with recursion. That might shave off a bit of time, though I wouldn’t hold my breath :slight_smile: The fact remains that you need to do 1M lookups (if I understand correctly, a vector can have 1M entries, right?), so this is probably where you’re spending most of your time.
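
A hand-rolled version of that loop could look like the following sketch, inside whichever module does the comparison; lookup_frequency/2 is a placeholder for the map or ETS lookup, returning 0.0 when the term is absent:

defp dot_product(terms, document_id), do: dot_product(terms, document_id, 0.0)

defp dot_product([], _document_id, acc), do: acc

defp dot_product([{term, weight} | rest], document_id, acc) do
  # one lookup and one multiply-add per step, no intermediate lists
  dot_product(rest, document_id, acc + weight * lookup_frequency(document_id, term))
end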

Finally, it’s worth considering whether Elixir is really the right tool for the job here. If I got it right, you have 1M documents, and each new document will produce a 1M-entry map which needs to be compared against the existing documents. This means that if the comparison between the new vector and one document takes 1ms, the total time in a sequential version would be 1000 seconds (~17 minutes). If you take a concurrent approach, you can reduce this to a few minutes on a solid machine.

However, I’m highly skeptical that you can bring a single comparison down to 1ms. Something in the area of a few tens, or maybe even a few hundred, milliseconds seems more realistic. This is just a gut feeling, I don’t have hard numbers, so you obviously need to experiment. But if I’m right, the total time to process a single document would skyrocket to a few hours, which I presume is not acceptable.

So this is the point where you may want to consider implementing this entire logic in something else. For example, you could have a Rust program which keeps these documents in memory and performs the comparisons. You could still use Elixir as the tool which manages the entire system. So for example, Elixir could queue pending documents, sending them one by one to Rust in a demand-driven fashion. Elixir could also be the place where you implement the web server (assuming you have such needs).

So in summary, I’d advise the following course of action:

  1. Focus on a single comparison between a large vector and a large document. Try to optimize this as much as you can. Try with maps, and try with ETS tables, to see which version works faster and what the differences are (see the sketch after this list).
  2. Once you have the numbers, you can estimate if Elixir is even a good fit.
  3. If yes, then try to work on making it concurrent.
  4. If not, then try to do it in a faster language. C, C++, or Rust would be my choices.
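
For the map-based variant of step 1, a single comparison between two sparse term => weight maps could be as small as this sketch (iterating over the smaller map keeps the number of lookups down):

def sparse_dot_product(vector_a, vector_b) do
  # iterate over the smaller map, looking terms up in the larger one
  {small, large} =
    if map_size(vector_a) <= map_size(vector_b),
      do: {vector_a, vector_b},
      else: {vector_b, vector_a}

  Enum.reduce(small, 0.0, fn {term, weight}, acc ->
    acc + weight * Map.get(large, term, 0.0)
  end)
end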

Hope this helps!

Thanks again for the helpful suggestions. Please allow me to make some clarifications, though, as I kind of skimped on a few requirements. My very first original design was actually ETS-based, with the following format (each row is, well, a row in the ETS table):

{"doc_uuid1", %{"foo" => 0.123, "bar" => 0.777, ...}}
{"doc_uuid2", %{"foz" => 0.223, "zet" => 0.001, ...}}
...
{"doc_uuid1000000", %{"foo" => 0.93, "zet" => 0.1, ...}}

where the key of each ETS row is the ID of my document and the value is just a plain Map. On average the Map only contains about 200-400 KV pairs, since I store it in a “sparse” way: if “doc_uuid1” doesn’t contain the word “zet”, that key won’t even exist in the map. Now my use case is actually this: a new document is ingested, say {doc_id, query_vector} = {"doc_uuid1000001", %{"foo" => 0.01, "zot" => 0.1}} (a relatively short one for the sake of the argument), and I would like to know which of the 1M stored documents have a high dot product with it; then I will append this new document to the ETS table as well. But as you well know,

fun = :ets.fun2ms(fn {doc_uuid, vector} when dot_product(query_vector, vector) > 0.5 -> doc_uuid end)
:ets.select(:stored_vectors, fun)

won’t work. But reading your suggestion, and please correct me if I am wrong, I could potentially store my data in ETS like this instead:

{{"doc_uuid1", "foo"}, 0.123}
{{"doc_uuid1", "bar"}, 0.777}
{{"doc_uuid2", "foz"}, 0.223}
...

And perhaps I can use fun2ms to retrieve only the matching entries, like:

query_keys = Map.keys(query_vector)
fun = :ets.fun2ms(fn {{doc_uuid, key}, frequency} when key in query_keys -> {{doc_uuid, key}, frequency} end)
:ets.select(:stored_vectors, fun)

Then perhaps I can do an Enum.group_by on the results to compute the final answer I want, which is the IDs of the documents whose dot product is > 0.5.
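
Sketched out (and untested), that aggregation could look something like this, reusing fun and query_vector from above:

:ets.select(:stored_vectors, fun)
|> Enum.group_by(fn {{doc_uuid, _term}, _frequency} -> doc_uuid end)
|> Enum.map(fn {doc_uuid, rows} ->
  dot_product =
    Enum.reduce(rows, 0.0, fn {{_doc_uuid, term}, frequency}, acc ->
      acc + frequency * Map.fetch!(query_vector, term)
    end)

  {doc_uuid, dot_product}
end)
|> Enum.filter(fn {_doc_uuid, dot_product} -> dot_product > 0.5 end)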

P.S. Off topic: your book is the first book our company bought when we switched our main stack to Elixir/Erlang, and we have all learnt a lot from it.

Yes, that’s what you want to do. Storing the data as {name, map} doesn’t give you the benefits of ETS, since every lookup copies the whole map into the calling process.

Does the detection of vectors whose dot product is > k need to be exact, or would it be OK to get an approximate answer? If an approximation is acceptable, another possibility is to first select a smaller set of candidate vectors that are more likely to satisfy the condition, and then perform the exact dot product only on those candidates. This kind of technique is quite common in recommender systems, and it usually relies on some kind of locality-sensitive hashing (LSH).

An LSH is a kind of hashing function with the property that items that are “close to each other” (according to a specific definition of distance) produce similar hashes.

For similar situations (lots of vectors, with the goal of finding the nearest neighbors of a given one) I have used LSH schemes quite successfully in the past. In my case, though, the metric was the cosine distance: depending on your application, that specific LSH scheme might or might not be a good fit for dot products.
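
To give an idea of the random-hyperplane scheme for cosine distance, here is a minimal, purely illustrative sketch over sparse term => weight maps; the number of planes and the use of :erlang.phash2/2 to simulate random hyperplanes are arbitrary choices:

defmodule LSH do
  @planes 64

  # The signature is a list of @planes bits. Each "hyperplane" is simulated by
  # hashing {term, plane} to a pseudo-random +1/-1 component, so two similar
  # vectors tend to fall on the same side of most planes.
  def signature(vector) when is_map(vector) do
    for plane <- 0..(@planes - 1) do
      projection =
        Enum.reduce(vector, 0.0, fn {term, weight}, acc ->
          sign = if :erlang.phash2({term, plane}, 2) == 0, do: 1.0, else: -1.0
          acc + sign * weight
        end)

      if projection >= 0.0, do: 1, else: 0
    end
  end

  # Vectors whose signatures differ in few positions are good candidates
  # for the exact dot product check.
  def hamming_distance(signature_a, signature_b) do
    signature_a
    |> Enum.zip(signature_b)
    |> Enum.count(fn {a, b} -> a != b end)
  end
end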

Finally, if your vectors are sparse, but you can batch-process them in advance, you might benefit from some dimensionality reduction technique, to transform them into lower-dimensional, less sparse vectors that still retain the original relative distances as much as possible.

I hope this helps, and sorry for the noise in case this doesn’t apply to your case :slight_smile:

That sounds a bit more promising, though doing a million lookups will still require some time.

You could store {{doc_id, term}, frequency} tuples into ETS, and then invoke :ets.match(:stored_vectors, {{doc_id, term}, :"$1"}) to get the frequency.

One question: how frequent are these terms, i.e. how many documents will on average contain the same term?

Happy to hear that!

Well, even after all the stop-word removal and lemmatization, I still find that usually ~85% of documents contain at least one overlapping term with the query document. As in, if I have 1M documents already stored and a new query document is ingested, about 800K documents would have at least one overlapping term. I will need to optimise my algorithm if I were to use :ets, because if I just retrieve the frequency using :ets.match I would lose the term. Imagine the following scenario:

ETS
{{"uuid1", "a"}, 0.005},
{{"uuid1", "b"}, 0.800}

New Document Terms
%{"a" => 0.5, "c" => 0.1}

If I just extract the frequency from ETS, I wouldn’t know whether to multiply it by 0.5 or 0.1, unless I changed my algorithm to handle one term at a time, probably something like:

Enum.map(new_document_terms, fn {term, weight} -> {term, weight, :ets.match(:stored_vectors, {{:"$1", term}, :"$2"})} end)

But yeah, thanks for the suggestions. I will definitely try it out and see how it performs against my other solution, based mostly on PostgreSQL/MADlib/PL/python3u.

Could you analyze the overlap distribution across terms? If most of the terms don’t overlap much, then I think you might profit from having the mapping organized as term -> list({document_id, frequency}). This could be easily achieved with a bag ETS table (duplicate_bag might give you faster insertion times).

The reasoning here is that instead of querying 1M other documents, you’d only make about 300 lookups (one per term in the input document). If most terms are present in a relatively small fraction of the documents, this might work much faster. So basically, something like:

new_document_terms
# for the terms of the new document, collect {document_id, frequency_product} pairs
|> Stream.flat_map(fn {term, frequency} ->
  Enum.map(
    frequencies(term), # this function returns list({document_id, frequency})
    fn {document_id, existing_frequency} -> {document_id, frequency * existing_frequency} end
  )
end)
# aggregate by document_id and compute the similarity
|> Enum.reduce(
  %{},
  fn {document_id, frequency_product}, similarities ->
    Map.update(similarities, document_id, frequency_product, &(&1 + frequency_product))
  end
)
# take only similar documents
|> Enum.filter(fn {_document_id, similarity} -> similarity > 0.5 end)
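
For completeness, here is one way the frequencies/1 helper used above could be implemented; the :term_index table name and the {term, document_id, frequency} row shape are assumptions:

# assumes a bag/duplicate_bag table :term_index holding {term, document_id, frequency}
# rows, i.e. the term -> list({document_id, frequency}) mapping described above
defp frequencies(term) do
  :term_index
  |> :ets.lookup(term)
  |> Enum.map(fn {^term, document_id, frequency} -> {document_id, frequency} end)
end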

Well, the thing is that some common words do overlap with a huge number of other documents in the ETS table. However:

This design might be just what I need! I will run more tests to confirm it. AFAIK I will only be doing ~300 ETS lookups (each of which is O(1)), and I just need to gauge the Enum.reduce/filter part.

I faced exactly this issue with bag when I had 1M documents, each with about 200 “rows” to be inserted into ETS (a 200M-row table). Even using a public table with write_concurrency: true/false, and performing the insertion both serially and from multiple processes, it was excruciatingly slow, but duplicate_bag saved the day. Thanks for the heads-up.
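
A duplicate_bag table along these lines (names and options purely illustrative) avoids the per-insert duplicate check that makes bag slow:

:ets.new(:term_index, [
  :duplicate_bag,
  :named_table,
  :public,
  write_concurrency: true,
  read_concurrency: true
])

# one row per {term, document_id, frequency}; duplicates are not checked on insert
:ets.insert(:term_index, {"foo", "doc_uuid1", 0.123})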

I wouldn’t open the champagne quite yet :slight_smile:
If there are a lot of terms which are present in many documents, then this approach might end up being slow, maybe even slower than your original take. But anyway, give it a try, and let us know how it went.

Ahh, you’re right: the bottleneck is now at the Map.update part, due to too many documents having overlapping terms.

EDIT: I have a crazy idea: instead of returning a list of {document_id, frequency * existing_frequency} to the next Enum.reduce, which is kind of the bottleneck (probably due to immutability), I would push the results of frequency * existing_frequency into a “temporary” ETS table or some other storage and perform the aggregation there.

Yeah. This is why I asked about term distribution. If you don’t have a lot of highly overlapping terms, then the algo might work just fine.

This would work if the sum was an integer. Then you could use update_counter to quickly aggregate.
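
For the integer case, the per-product aggregation would then be a single atomic call (a sketch; :similarities is a hypothetical set table):

# atomically adds product to the counter in position 2, first inserting
# {document_id, 0} if the key doesn't exist yet
:ets.update_counter(:similarities, document_id, {2, product}, {document_id, 0})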

However, this isn’t the case here, but I don’t think all is lost. You could use Flow to leverage parallelism: fetch each term in a separate process, and then aggregate each document in a separate process. A rough sketch of the shape I have in mind is below.

But either way, I think that this approach of grouping per term will only work if most of the terms are fairly scarce.
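
Roughly, and only as a sketch (assuming the frequencies/1 helper from the earlier snippet), such a Flow pipeline could look like this:

# requires the :flow package
new_document_terms
|> Flow.from_enumerable()
# fetch candidate {document_id, frequency_product} pairs, one term at a time
|> Flow.flat_map(fn {term, frequency} ->
  for {document_id, existing_frequency} <- frequencies(term),
      do: {document_id, frequency * existing_frequency}
end)
# route all products for the same document to the same stage
|> Flow.partition(key: {:elem, 0})
|> Flow.reduce(fn -> %{} end, fn {document_id, product}, similarities ->
  Map.update(similarities, document_id, product, &(&1 + product))
end)
|> Enum.to_list()
|> Enum.filter(fn {_document_id, similarity} -> similarity > 0.5 end)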

I noticed it was mentioned above that lookups in ETS are O(1), whereas for Map.get I believe they are O(log n), as per the documentation.

Why is there this disparity, and in that case, why doesn’t Map use ETS behind the scenes?

Because ETS is a mutable store, whereas a map is a “single” value you can update and pass around without having to fear that another process changes your map.
