Efficient comparision of large data structures

Say I have a very large data structure who’s size is in the 10s of MB. This structure is arbitrarily complex, consisting of many tuples and proplists and maps, but in a roughly tree-like structure. What is the most efficient way to compare two of these structures? Was curious before I started running some benchmarks. Is pattern matching breadth first, or depth first, or does it involve a bit to bit comparison of the terms? Would using zlib on a a term_to_binary then comparing the compressed binaries make sense?

I’m having trouble finding documentation on the generally most efficient way of doing this.

What kind of comparison are you trying to do? A simple equality check? A diff?

3 Likes

Yup, a check if the two structures are the same or different. Not checking for the same references, but a deep comparison.

Are you finding that == is not working? == does a structural equality check and should be very efficient.

This would also work.

def same(same, same), do: true
def same(same, not_same), do: false

Depends on how the structures are loaded and whether you need to know which parts are different.

== is working forsure. Was just curious on if == (or pattern matching, pretty sure they compile down to similar things) was definitively the most efficient way.

1 Like

Fair, but in this case we want to check for equality across all parts as all are subject to change in this scenario. I believe a better way to rephrase the question would be: is == / pattern matching the most efficient way to ‘fail fast’ if the structures are unequal.

I’m pretty sure that pattern matching is the most efficient way to check for equality.

There is no difference between == and pattern matching.

3 Likes

They compile down to the same things more or less.

I meant “as opposed to comparing hashes or using other tricks”. As far as matching vs equality operators, I believe there shouldn’t be any difference, although to nitpick I vaguely remember pattern matching using the strict operator (=== or =:= in Erlang terms).

3 Likes

They aren’t.

1 === 1.0
#=> false
1 == 1.0
#=> true
1 = 1.0
# BOOOM!
1 Like

I meant “in matter of performance”.

4 Likes

To follow up on this, I finally gave this a benchmark. When using a record that is tree-like in structure you can easily beat == in speed using ‘early termination’. By this I mean instead of performing an entire tree scan, one can return false at the first node in the tree to change. It would appear that == does not break evaluation early while comparing the underlying data structures.

For example, I was doing a comparison between to records that are much more wide than deep. Benchee reports it takes 1.29ms for == and 0.00033 ms for early termination. However, the cost is dear if the change is deep in the tree. It jumps to 2.42 ms for unequal nodes further from the head child. So == is good for the general case, however if you know that you’re most likely going to be changing a structure early and you have to make many comparisons on the tree then it could be worth it to test some kind of traversal yourself.

I put this off for so long because I thought == would be most efficient!

1 Like

The equality operator terminates as early as it can. Any speed up you got was most likely because you checked the term in an order that happened to be more efficient (equality check uses DFS IIRC).

6 Likes

Can you post your benchmark code somewhere?

Interesting, this method was using a DFS style traversal.

The order can still differ significantly, feel free to post your benchmark here so we can check if it’s just that. It would be nice if we could speed things up :slight_smile:

2 Likes

Somewhat of a follow up: Was there ever a discussion/proposal for storing the value hash on Erlang’s native, immutable data structures to speed up equality comparisons?

Clojure uses this technique; an old article goes into some of the thought and experimentation behind the hashing technique they chose.