OK, it makes more sense now.
First, it seems that you might profit from running comparisons between different documents concurrently. You could use `Task.async_stream` for this, which would ensure that you're not running too many things at once. This should give you a speedup of roughly x, where x is close to the number of CPU cores.
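A minimal sketch of what that could look like. The shape of `documents` and the `compare` function are assumptions for illustration, not your actual data model:

```elixir
defmodule Similarity do
  # Compare the new vector against every stored document concurrently.
  # max_concurrency is capped at the number of schedulers (~ CPU cores),
  # so we never run more comparisons than the machine can execute in parallel.
  def rank(vector, documents, compare) do
    documents
    |> Task.async_stream(
      fn doc -> {doc.id, compare.(vector, doc)} end,
      max_concurrency: System.schedulers_online(),
      ordered: false
    )
    |> Enum.map(fn {:ok, result} -> result end)
  end
end
```

`ordered: false` lets fast comparisons finish ahead of slow ones, which helps throughput when you don't care about result order.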
This means that you'd need to store the documents in ETS tables. However, it doesn't mean you need 1M ETS tables. In a simple solution, you could have just one table, where keys would be e.g. `{table_id, term}`, and the values would be frequencies. Since a single table could be very big, you may want to resort to sharding, for example by using 1000 tables for 1M documents.
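A sketch of that sharded layout, assuming documents are identified by integer ids. The module and table names are illustrative:

```elixir
defmodule FreqStore do
  @shards 1000

  # One ETS table per shard; a document's shard is derived from its id.
  def create_tables do
    for i <- 0..(@shards - 1) do
      :ets.new(:"freq_#{i}", [:named_table, :public, read_concurrency: true])
    end
  end

  defp shard(doc_id), do: :"freq_#{rem(doc_id, @shards)}"

  # Keys are {doc_id, term}, values are frequencies.
  def put(doc_id, term, freq) do
    :ets.insert(shard(doc_id), {{doc_id, term}, freq})
  end

  def frequency(doc_id, term) do
    case :ets.lookup(shard(doc_id), {doc_id, term}) do
      [{_key, freq}] -> freq
      [] -> 0
    end
  end
end
```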
If you take the ETS approach, when you're looking up a term, make sure to return only the frequency, and not the key. Also, if you're using the table from multiple processes, turn on `read_concurrency`. You'll also probably need `write_concurrency`, but it's best to determine that experimentally.
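For example, `:ets.lookup_element/3` returns just the requested tuple position, so the `{doc_id, term}` key isn't copied out of the table on every one of the ~1M lookups:

```elixir
# read_concurrency optimizes the table for frequent concurrent reads.
table = :ets.new(:freqs_demo, [:set, :public, read_concurrency: true])
:ets.insert(table, {{1, "cat"}, 42})

# Fetch only position 2 of the stored tuple (the frequency).
:ets.lookup_element(table, {1, "cat"}, 2)
# => 42
```

Note that `lookup_element` raises if the key is absent, so it fits best when you know the term exists, or when you wrap it in a rescue/default.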
When it comes to recursion, I mentioned it in the context of a single comparison between two documents. In your case, each step of the loop is simple and short (a lookup followed by a product), while the loop itself takes a lot of steps. In such scenarios, the overhead of Enum can become visible. So instead, you can try to hand-roll the loop using recursion. That might shave off a bit of time, though I wouldn't hold my breath. The fact remains that you need to do 1M lookups (if I understand correctly, a vector can have 1M entries, right?), so this is probably where you're spending most of your time.
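To make the comparison concrete, here are both variants of that inner loop side by side, assuming the vector is a list of `{term, weight}` pairs and frequencies live in a map. Whether the recursive one is actually faster is something you'd need to benchmark:

```elixir
defmodule Dot do
  # Enum-based version of the lookup-and-product loop.
  def enum_dot(vector, freqs) do
    Enum.reduce(vector, 0, fn {term, weight}, acc ->
      acc + weight * Map.get(freqs, term, 0)
    end)
  end

  # Hand-rolled recursive version: same work, without the Enum layer.
  def rec_dot(vector, freqs), do: rec_dot(vector, freqs, 0)

  defp rec_dot([], _freqs, acc), do: acc

  defp rec_dot([{term, weight} | rest], freqs, acc) do
    rec_dot(rest, freqs, acc + weight * Map.get(freqs, term, 0))
  end
end
```

The recursive version is tail-recursive, so it runs in constant stack space regardless of vector length.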
Finally, it's worth considering whether Elixir is really a good tool for the job here. If I got it right, you have 1M documents, and each new document will produce a 1M-entry map which needs to be compared against the existing documents. This means that if a comparison between a vector and one document takes 1ms, the total time in a sequential version would be 1000 sec (~17 minutes). If you take a concurrent approach, you can reduce this to a few minutes on a solid machine.
However, I'm highly skeptical that you can bring down a single comparison to 1ms. Something in the area of a few tens, or maybe even a few hundreds, of milliseconds seems more realistic. This is just a gut feeling; I don't have hard numbers, so you obviously need to experiment. But if I'm right, the total time to process a single document would skyrocket to a few hours, which I presume is not acceptable.
So this is the point where you may want to consider implementing this entire logic in something else. For example, you could have a Rust program which keeps these documents in memory and performs the comparisons. You could still use Elixir as the tool which manages the entire system. For example, Elixir could queue pending documents, sending them one by one to Rust in a demand-driven fashion. Elixir could also be the place where you implement the web server (assuming you have such needs).
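A rough sketch of that demand-driven handoff, using a `GenServer` that owns a port to the external worker. The `worker_bin` path and the line-based protocol (one document in, one result line back) are assumptions for illustration, not a real Rust program:

```elixir
defmodule DocQueue do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  def push(doc), do: GenServer.cast(__MODULE__, {:push, doc})
  def results, do: GenServer.call(__MODULE__, :results)

  @impl true
  def init(opts) do
    # The external worker reads one document per line and writes one
    # result line back when it's done with it.
    port = Port.open({:spawn_executable, opts[:worker_bin]}, [:binary, {:line, 1_000_000}])
    {:ok, %{port: port, queue: :queue.new(), busy?: false, results: []}}
  end

  @impl true
  def handle_cast({:push, doc}, state) do
    {:noreply, maybe_send(%{state | queue: :queue.in(doc, state.queue)})}
  end

  @impl true
  def handle_call(:results, _from, state), do: {:reply, state.results, state}

  @impl true
  def handle_info({port, {:data, {:eol, result}}}, %{port: port} = state) do
    # The worker finished one document; record the result, send the next.
    {:noreply, maybe_send(%{state | busy?: false, results: [result | state.results]})}
  end

  # Send the next queued document, but only if the worker is idle:
  # this is what makes the flow demand-driven.
  defp maybe_send(%{busy?: false} = state) do
    case :queue.out(state.queue) do
      {{:value, doc}, rest} ->
        Port.command(state.port, [doc, ?\n])
        %{state | queue: rest, busy?: true}

      {:empty, _} ->
        state
    end
  end

  defp maybe_send(state), do: state
end
```

You can try the plumbing with `/bin/cat` standing in for the worker, since it simply echoes each line back.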
So in summary, I'd advise the following course of action:
- Focus on a single comparison between a large vector and a large document. Try to optimize this as much as you can. Try with maps, and try with ETS tables, to see which version works faster and what the differences are.
- Once you have the numbers, you can estimate if Elixir is even a good fit.
- If yes, then try to work on making it concurrent.
- If not, then try to do it in a faster language. C, C++, or Rust would be my choices.
Hope this helps!