An implementation of the Apriori algorithm and making it fast

Hi.
I want to implement the Apriori and Eclat algorithms so that they work on a large item set. I wrote the Apriori code and it works nicely with a tiny test transaction list, but with the main transaction list, which has 25341 rows, it blocks and then crashes because it uses more data than my RAM can hold!
For concurrency, I tried to use the actor model and spawn_link(), but I think that is the wrong approach :thinking:.
So what can I do to optimize this code?
My code repository
Apriori’s implementation
Transactions list
I can’t find good resources for this problem, so I’m posting this topic.
I’m new to the Elixir community and want to learn this amazing language, but I don’t know Erlang or OTP at all yet, so if you can, please help me with good details. Thanks.

1 Like

One big change from other languages is that lists are immutable linked lists, so some operations become very expensive and take time (and sometimes space!) proportional to the length of the list.

Some examples:

a = [1]
b = [2 | a]

Prepending to the beginning of a list is fast - it allocates a single new cons cell. Takes constant time.

a = [1]
b = a ++ [2]

Concatenating two lists is more expensive - all the cells in the left-hand argument to ++ need to be reallocated, taking time and space proportional to the size of that list.

a = [1,2,3]
b = hd(a)

This is efficient, since the head is the first cons cell. Takes constant time. Note that tl(a) is also efficient, since those are the SAME cons cells as in a.

a = [1,2,3]
b = Enum.at(a, -1)

In order to access “the end” of a list, all the cons cells need to be walked. Enum.at will take time proportional to the size of the list.

You’ll need to modify your algorithm to avoid the inefficient kinds of operations; for instance, popping from the end might instead be represented with an amortized-constant-time functional queue like Erlang’s :queue (docs)
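For reference, here’s a minimal sketch (not from your repo; the item strings are just examples) of what :queue usage looks like:

q = :queue.new()
q = :queue.in("BITCOIN", q)          # enqueue at the rear: amortized O(1)
q = :queue.in("POLKADOT", q)
{{:value, item}, q} = :queue.out(q)  # dequeue from the front: amortized O(1)
# item == "BITCOIN"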

Also note (at least according to Wikipedia) the time complexity of this algorithm is exponential in the number of unique values in the input…


Re: the usage of spawn_link - consider using Task instead, it’s specifically tuned to the “take input, process, send message and then shut down” pattern you’ve implemented. Task.async_stream may prove useful as well.
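A hedged sketch of that pattern (here candidates, transactions, and count_support/2 are placeholder names, not from your repo): support counting for each candidate itemset could be farmed out like this:

candidates
|> Task.async_stream(
  # each task takes one candidate, does the work, sends back a message, and exits
  fn candidate -> {candidate, count_support(candidate, transactions)} end,
  max_concurrency: System.schedulers_online(),
  ordered: false
)
|> Enum.map(fn {:ok, result} -> result end)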

3 Likes

I tried to optimize my code according to @al2o3cr’s advice and now Apriori and Eclat work, but I want to run these algorithms with a minimum support near 0.001! If I set that value for the minimum support, Apriori works but locks up for a long time (which I can live with), while Eclat wants a huge amount of RAM and gets killed by the Erlang VM. What should I do?
I removed all the expensive list operations and made the algorithm as linear as possible, but I still see that simple Elixir code is not efficient for this job. Why? Please help me resolve this problem.

I didn’t read too deeply so you may already know these but some general tips:

Recursion is often the best tool for writing complicated loops, but to allow the compiler to do some optimizations, make sure your functions are tail-recursive (I think yours are, but it’s good to double-check).
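For example, a generic sketch (not from your repo) of a tail-recursive sum, where the accumulator lets the recursive call be the very last thing the function does:

defmodule MySum do
  # Tail-recursive: the recursive call is in tail position,
  # so the BEAM can reuse the stack frame instead of growing the stack.
  def sum(list), do: sum(list, 0)
  defp sum([], acc), do: acc
  defp sum([h | t], acc), do: sum(t, acc + h)
end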

When you pipe a Stream into an Enum function like Enum.reduce, it loses its laziness and gets run immediately at that point. So make sure you don’t mix them by mistake.
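A small generic illustration of the difference (not from your code):

# Fully lazy until the final Enum.reduce/3 runs the pipeline once:
1..1_000_000
|> Stream.map(fn x -> x * 2 end)
|> Stream.filter(fn x -> rem(x, 3) == 0 end)
|> Enum.reduce(0, fn x, acc -> x + acc end)

# Mixed by mistake: Enum.map/2 builds the whole intermediate list in memory
# before the rest of the pipeline even starts.
1..1_000_000
|> Enum.map(fn x -> x * 2 end)
|> Stream.filter(fn x -> rem(x, 3) == 0 end)
|> Enum.reduce(0, fn x, acc -> x + acc end)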

Look through what data structures and tools are provided with OTP. Some may have better performance or memory footprint characteristics for the particular job you want to do. Look at :ets for example.
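A hedged sketch of what storing the transactions in :ets could look like (the table name and items are just examples):

# The table lives outside any single process heap and can be read
# concurrently by many processes without copying it into each message.
table = :ets.new(:transactions, [:set, :public, read_concurrency: true])
:ets.insert(table, {1, ["BITCOIN", "ETHEREUM"]})
:ets.insert(table, {2, ["POLKADOT", "BITCOIN"]})
[{_id, items}] = :ets.lookup(table, 2)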

Spawning processes with Flow is not cheap. Make sure doing that actually improves performance on a case-by-case basis, and check that you pass only the data that is needed for each chunk of work. Sending big messages to many processes is slow and can result in big memory usage.

Some micro-optimizations can definitely be applied to reduce Elixir protocol overhead, e.g. use Erlang’s :lists module instead of going through Enum… But it may not be worth much, it depends.
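For instance, these two compute the same thing, but the second skips the Enumerable protocol dispatch because it works on plain lists directly:

Enum.reduce([1, 2, 3], 0, fn x, acc -> x + acc end)
:lists.foldl(fn x, acc -> x + acc end, 0, [1, 2, 3])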

Benchmark small chunks of code with the Benchee library to see what is actually faster.
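A minimal sketch of such a benchmark (the jobs here are just examples):

# Benchee runs each job repeatedly and reports iterations per second.
Benchee.run(%{
  "Enum.reduce"  => fn -> Enum.reduce(1..10_000, 0, fn x, acc -> x + acc end) end,
  ":lists.foldl" => fn -> :lists.foldl(fn x, acc -> x + acc end, 0, Enum.to_list(1..10_000)) end
})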

Have fun, this is cool stuff!

2 Likes

Consider that Apriori has a space complexity of O(2^n), so, no matter which language you use, if you have a large dataset, you keep all your data in RAM, and use a very low minimum support, you will encounter memory utilization problems.
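To put that in perspective: with only 100 distinct items there are already 2^100 ≈ 1.3 × 10^30 possible itemsets in the worst case, so the minimum-support threshold is what keeps the candidate space manageable.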

1 Like

Also, one tip: I see that your items in the itemsets are strings like "POLKADOT". It might make sense to convert them to numeric IDs or atoms (especially if your program is short-lived, so you won’t have issues with “leaking atoms”). That way the same item isn’t stored in memory over and over; it’s stored once and referenced.
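A hedged sketch of the numeric-ID idea (the item names are just examples):

# Build a map from item name to a small integer ID,
# then rewrite each transaction in terms of those IDs.
items = ["POLKADOT", "BITCOIN", "ETHEREUM"]
id_by_item = items |> Enum.with_index() |> Map.new()
# => %{"BITCOIN" => 1, "ETHEREUM" => 2, "POLKADOT" => 0}

transaction = ["BITCOIN", "POLKADOT"]
Enum.map(transaction, &Map.fetch!(id_by_item, &1))
# => [1, 0]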

1 Like

Thank you. Is there any way to use Flow with a lower cost? Task.async_stream was expensive too!
I need concurrency to increase speed. Any good ideas?

I know these algorithms are not good in terms of space or runtime, but I have seen projects in other languages that implement these algorithms and they perform well. That is my problem.

Thanks. I was also worried about the “leaking atoms” problem, which is why I didn’t use atoms, but later I realized I was wrong, and it’s exactly as you said.

It’s actually easy to make things slower by trying to parallelize them. Modern processors are usually really fast but are bottlenecked by memory access and bad data layouts. In Elixir most of this is abstracted away, but in general you want to make sure that the overhead of parallelizing is lower than what you gain from running on multiple cores/machines at the same time (I hope that makes sense). So you need to find the balance between how much work you parallelize and how many workers you create. The gains may be bigger on bigger data sets.

3 Likes

In addition to what @michallepicki noted, when parallelizing you should avoid sending huge data structures as messages between processes, as that involves a lot of copying. Often it is a good idea to implement the algorithm without parallelization first, use it as a baseline benchmark, and then add parallelization as an optimization.
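One hedged way to combine that with the earlier :ets suggestion (candidates is a placeholder, and the support count shown is just an illustrative approach, not your repo’s code): keep the transactions in a shared table and send only the small candidate itemsets to the workers.

table = :ets.new(:transactions, [:set, :public, read_concurrency: true])
# ... fill the table once from the input file ...

candidates
|> Task.async_stream(fn candidate ->
  # Each worker reads the shared table itself; only `candidate`
  # and the resulting count travel as messages between processes.
  support =
    :ets.foldl(
      fn {_id, items}, acc ->
        if Enum.all?(candidate, fn item -> item in items end), do: acc + 1, else: acc
      end,
      0,
      table
    )

  {candidate, support}
end)
|> Enum.map(fn {:ok, result} -> result end)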

4 Likes