Efficient comparision of large data structures

mpope · November 9, 2020, 7:45pm

Say I have a very large data structure who’s size is in the 10s of MB. This structure is arbitrarily complex, consisting of many tuples and proplists and maps, but in a roughly tree-like structure. What is the most efficient way to compare two of these structures? Was curious before I started running some benchmarks. Is pattern matching breadth first, or depth first, or does it involve a bit to bit comparison of the terms? Would using zlib on a a term_to_binary then comparing the compressed binaries make sense?

I’m having trouble finding documentation on the generally most efficient way of doing this.

benwilson512 · November 9, 2020, 8:27pm

What kind of comparison are you trying to do? A simple equality check? A diff?

mpope · November 9, 2020, 8:33pm

Yup, a check if the two structures are the same or different. Not checking for the same references, but a deep comparison.

benwilson512 · November 9, 2020, 8:34pm

Are you finding that == is not working? == does a structural equality check and should be very efficient.

Schultzer · November 9, 2020, 8:34pm

This would also work.

def same(same, same), do: true
def same(same, not_same), do: false

hauleth · November 9, 2020, 8:37pm

Depends on how the structures are loaded and whether you need to know which parts are different.

mpope · November 9, 2020, 8:37pm

== is working forsure. Was just curious on if == (or pattern matching, pretty sure they compile down to similar things) was definitively the most efficient way.

mpope · November 9, 2020, 8:43pm

Fair, but in this case we want to check for equality across all parts as all are subject to change in this scenario. I believe a better way to rephrase the question would be: is == / pattern matching the most efficient way to ‘fail fast’ if the structures are unequal.

aenglisc · November 9, 2020, 8:54pm

I’m pretty sure that pattern matching is the most efficient way to check for equality.

hauleth · November 9, 2020, 8:55pm

There is no difference between == and pattern matching.

mpope · November 9, 2020, 9:00pm

They compile down to the same things more or less.

aenglisc · November 9, 2020, 9:13pm

I meant “as opposed to comparing hashes or using other tricks”. As far as matching vs equality operators, I believe there shouldn’t be any difference, although to nitpick I vaguely remember pattern matching using the strict operator (=== or =:= in Erlang terms).

NobbZ · November 9, 2020, 10:00pm

They aren’t.

1 === 1.0
#=> false
1 == 1.0
#=> true
1 = 1.0
# BOOOM!

hauleth · November 10, 2020, 1:22am

I meant “in matter of performance”.

mpope · August 4, 2023, 4:00am

To follow up on this, I finally gave this a benchmark. When using a record that is tree-like in structure you can easily beat == in speed using ‘early termination’. By this I mean instead of performing an entire tree scan, one can return false at the first node in the tree to change. It would appear that == does not break evaluation early while comparing the underlying data structures.

For example, I was doing a comparison between to records that are much more wide than deep. Benchee reports it takes 1.29ms for == and 0.00033 ms for early termination. However, the cost is dear if the change is deep in the tree. It jumps to 2.42 ms for unequal nodes further from the head child. So == is good for the general case, however if you know that you’re most likely going to be changing a structure early and you have to make many comparisons on the tree then it could be worth it to test some kind of traversal yourself.

I put this off for so long because I thought == would be most efficient!

jhogberg · August 4, 2023, 5:12am

The equality operator terminates as early as it can. Any speed up you got was most likely because you checked the term in an order that happened to be more efficient (equality check uses DFS IIRC).

benwilson512 · August 4, 2023, 10:34am

Can you post your benchmark code somewhere?

mpope · August 4, 2023, 3:30pm

Interesting, this method was using a DFS style traversal.

jhogberg · August 4, 2023, 4:55pm

The order can still differ significantly, feel free to post your benchmark here so we can check if it’s just that. It would be nice if we could speed things up

zachallaun · August 6, 2023, 12:52pm

Somewhat of a follow up: Was there ever a discussion/proposal for storing the value hash on Erlang’s native, immutable data structures to speed up equality comparisons?

Clojure uses this technique; an old article goes into some of the thought and experimentation behind the hashing technique they chose.