Implement custom 'compare' for Ecto models used in MapSet.difference()

jjabba · May 22, 2019, 5:38pm

I’m working with Ecto modules and want to use MapSet and in particular MapSet.difference(s2, s1) to find which members of prime number collectors cards first edition I’m missing.
Let me exemplify:

my_cards = MapSet.new([{1, 2},{1, 3}, {1, 7}, {1, 11}])
full_collection = MapSet.new([{1, 2},{1, 3},{1, 5},{1, 7}, {1, 11}, {1, 13}])
missing_cards = MapSet.difference(full_collection, my_cards)
# MapSet<[{1, 5}, {1, 13}]>

Now I persist the full collection using an Ecto schema.
When trying to do the same math using instances of this model, it will fail, as the schema will have other members such as primary key {id}, timestamps and other.

How Can I augment either MapSet or the schema so that &MapSet.difference/2 is still useful using %PrimeNumberCollectorsCard{edition: 1, prime: 2} instances???

OvermindDL1 · May 22, 2019, 5:57pm

Are you wanting to treat your %PrimeNumberCollectorsCard{} type ‘as’ things in the set? Or are you wanting to extract the numbers and put those in the set? Or are you wanting the set to be inside of %PrimeNumberCollectorsCard{}'s?

Qqwy · May 22, 2019, 6:02pm

Interesting question!

Only indirectly, by transforming your %PrimeNumberCollectorsCard{}-struct into something that has an identical structure for equivalent datatypes.

Let me first show you a simple ‘solution’ that directly uses MapSet.difference, but still has a problem:

def prime_card_identity(card = %PrimeNumberCollectorsCard{}) do
  {card.edition, card.prime}
end

def missing_cards(your_collection = %MapSet{}, full_collection = %MapSet{}) do
  your_collection_identities = MapSet.new(your_collection, &prime_card_identity/1)
  full_collection_identities = MapSet.new(full_collection, &prime_card_identity/1)
  MapSet.difference(your_collection_identities, full_collection_identities)
end

This works mostly, but the return result of missing_cards/2 will be a MapSet containing identity-tuples like{1, 2}. You could write a function to recover the original %PrimeNumberCollectorsCard{}-struct, by constructing one (but then you do not have the fields that are missing from the identity representation), or by looking it up again in the database (which is slow since we basically discard results from an earlier query, and now do at least one (but possibly N) new queries).

Instead of using MapSet.difference, we might be able to use a Map directly: In a map, the keys have to be unique, but under each key we can track any value we like (whose uniqueness is not enforced by the map). This would make our code look as follows:

def missing_cards(your_collection = %MapSet{}, full_collection = %MapSet{}) do
  your_collection_keys = Enum.map(your_collection, &prime_card_identity)
  
  full_collection
  |> Map.new(&({prime_card_identity(&1), &1}))
  |> Map.drop(your_collection_keys)
  |> Map.values
  |> MapSet.new
end

Interestingly, MapSet.difference/2 is implemented much in the same way under the hood.

jjabba · May 22, 2019, 8:29pm

@OvermindDL1 My idea was to piggyback on mapset to sort out which members of s1 and s2 are the “same” but as per my own definition of what is ‘same’.

Blockquote
Are you wanting to treat your %PrimeNumberCollectorsCard{} type ‘as’ things in the set?

Yes

Blockquote
Or are you wanting to extract the numbers and put those in the set?

No, the whole point is to use the set operations in MapSet without having to transform and then resurrect the instances in the result from and to %PrimeNumberCollectorsCard{}

Or are you wanting the set to be inside of %PrimeNumberCollectorsCard{} 's?

No, I don’t see any point of that.

@Qqwy

Perhaps my brain is still too OOP wired, but I was expecting to find some way to provide my custom comparer to the ‘MapSet.difference(s1,s2,[comparer: fn(s, t) → …]’ or perhaps a way to augment the %PrimeNumberCollectorsCard{} object with a new way to compare itself to another instance. A way which would then be automatically honored by the &MapSet.difference/2
I could certainly take the route of re-implementing the MapSet.difference but the whole idea with my approach was to avoid re-inventing the wheel

OvermindDL1 · May 22, 2019, 8:42pm

MapSet is an old setup, it doesn’t have comparator function support or anything.

If you are just wanting the difference information between two of the sets then you could use something like map_diff or a variety of other libraries out there for that, or @Qqwy’s code is a good start for it.

I wouldn’t recommend MapSet in that case, it’s designed to be a very low level construct.

Qqwy · May 23, 2019, 10:27am

No worries: This has nothing to do with OOP vs functional programming.
Instead, MapSet was built in the way it is for (a) simplicity and (b) low memory (both RAM and when serialized on disk) usage. The trade-off has been made that elements in a MapSet are always expected to be unique according to their structural equality, see Erlang term comparisons’ “exactly equal to”.

When adding a separate ‘element’ vs ‘element identity’ representation, you’ll have to either:

keep track of them separately in the datastructure, thus using more memory.
recalculate the identity all the time when manipulating the set, thus being significantly slower (with operations going from amortized constant time or logarithmic time to linear time).

And with the alternate approach of using a ‘comparer’ function and storing the elements internally in a binary tree, we still reduce amortized constant running time to logarithmic running time, as well as needing n * log n amount of space for the tree.

Neither of these approaches are bad, but they make different trade-offs than the ones made for MapSet.
I definitely do think that there is a place for sets that allow either custom identities or a custom comparer function. Hopefully, this will be filled by a library at some point, because re-inventing this wheel for every project definitely is not fun and not productive.

If you are wondering by the way, the difference between these two approaches (a comparer function vs custom identities) is similar to the difference between Enum.sort/2 and Enum.sort_by/2.