Say I have a very large data structure who’s size is in the 10s of MB. This structure is arbitrarily complex, consisting of many tuples and proplists and maps, but in a roughly tree-like structure. What is the most efficient way to compare two of these structures? Was curious before I started running some benchmarks. Is pattern matching breadth first, or depth first, or does it involve a bit to bit comparison of the terms? Would using zlib on a a term_to_binary then comparing the compressed binaries make sense?
I’m having trouble finding documentation on the generally most efficient way of doing this.
== is working forsure. Was just curious on if == (or pattern matching, pretty sure they compile down to similar things) was definitively the most efficient way.
Fair, but in this case we want to check for equality across all parts as all are subject to change in this scenario. I believe a better way to rephrase the question would be: is == / pattern matching the most efficient way to ‘fail fast’ if the structures are unequal.
I meant “as opposed to comparing hashes or using other tricks”. As far as matching vs equality operators, I believe there shouldn’t be any difference, although to nitpick I vaguely remember pattern matching using the strict operator (=== or =:= in Erlang terms).
To follow up on this, I finally gave this a benchmark. When using a record that is tree-like in structure you can easily beat == in speed using ‘early termination’. By this I mean instead of performing an entire tree scan, one can return false at the first node in the tree to change. It would appear that == does not break evaluation early while comparing the underlying data structures.
For example, I was doing a comparison between to records that are much more wide than deep. Benchee reports it takes 1.29ms for == and 0.00033 ms for early termination. However, the cost is dear if the change is deep in the tree. It jumps to 2.42 ms for unequal nodes further from the head child. So == is good for the general case, however if you know that you’re most likely going to be changing a structure early and you have to make many comparisons on the tree then it could be worth it to test some kind of traversal yourself.
I put this off for so long because I thought == would be most efficient!
The equality operator terminates as early as it can. Any speed up you got was most likely because you checked the term in an order that happened to be more efficient (equality check uses DFS IIRC).
The order can still differ significantly, feel free to post your benchmark here so we can check if it’s just that. It would be nice if we could speed things up
Somewhat of a follow up: Was there ever a discussion/proposal for storing the value hash on Erlang’s native, immutable data structures to speed up equality comparisons?
Clojure uses this technique; an old article goes into some of the thought and experimentation behind the hashing technique they chose.