Hey everyone, I made an algorithm that takes a list of strings with their corresponding trigrams and groups them by similarity. It returns the same similarity score as the PostgreSQL pg_trgm extension.
For my inputs, I have something like this:
values = [
  %{name: "Eduardo", trigram: MapSet.new(["  e", " ed", "ard", "do ", "dua", "edu", "rdo", "uar"])},
  %{name: "Jeferson", trigram: MapSet.new(["  j", " je", "efe", "ers", "fer", "jef", "on ", "rso", "son"])},
  %{name: "Eduardo B.", trigram: MapSet.new(["  b", "  e", " b ", " ed", "ard", "do ", "dua", "edu", "rdo", "uar"])},
  %{name: "Jefferson", trigram: MapSet.new(["  j", " je", "eff", "ers", "fer", "ffe", "jef", "on ", "rso", "son"])},
  %{name: "Maria", trigram: MapSet.new(["  m", " ma", "ari", "ia ", "mar", "ria"])}
]
The trigrams are generated directly by PostgreSQL using the show_trgm function (e.g. select show_trgm('Eduardo');).
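If it ever helps to skip the database round trip, the trigrams can also be generated locally. This is just a sketch of pg_trgm's documented behavior (lowercase, split on non-alphanumerics, pad each word with two leading spaces and one trailing space, slide a 3-character window); the Trigram module name is mine, not part of the code above:

```elixir
defmodule Trigram do
  # Approximate pg_trgm's show_trgm/1: lowercase the input, split it into
  # words on non-alphanumeric runs, pad each word with "  " in front and
  # " " behind, then collect every 3-character window into a MapSet.
  def extract(string) do
    string
    |> String.downcase()
    |> String.split(~r/[^[:alnum:]]+/u, trim: true)
    |> Enum.flat_map(fn word ->
      ("  " <> word <> " ")
      |> String.graphemes()
      |> Enum.chunk_every(3, 1, :discard)
      |> Enum.map(&Enum.join/1)
    end)
    |> MapSet.new()
  end
end
```

With this, Trigram.extract("Eduardo") yields the same eight trigrams as the "Eduardo" entry above, and a two-word input like "Eduardo B." gets the extra "  b" / " b " trigrams from the second word.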
defmodule GroupBySimilarity do
  def group(values, threshold \\ 0.9)
  def group([], _threshold), do: []

  def group([value | rest], threshold) do
    %{trigram: trigram} = value

    {similar, rest} =
      Enum.split_with(rest, fn %{trigram: rhs_trigram} ->
        similarity(trigram, rhs_trigram) >= threshold
      end)

    [[value | similar] | group(rest, threshold)]
  end

  # Jaccard similarity: |A ∩ B| / |A ∪ B|, the same score pg_trgm returns.
  defp similarity(lhs, rhs) do
    intersection_count = lhs |> MapSet.intersection(rhs) |> MapSet.size()
    union_count = lhs |> MapSet.union(rhs) |> MapSet.size()
    intersection_count / union_count
  end
end
As you can see, the algorithm is pretty simple: it splits the values list based on the similarity/2 function and then repeats the process with the remaining values until the list is empty.
So, for the example values above, the result would be:
iex(194)> GroupBySimilarity.group(values, 0.7)
[
[
%{
name: "Eduardo",
trigram: MapSet.new(["  e", " ed", "ard", "do ", "dua", "edu", "rdo",
  "uar"])
},
%{
name: "Eduardo B.",
trigram: MapSet.new(["  b", "  e", " b ", " ed", "ard", "do ", "dua",
  "edu", "rdo", "uar"])
}
],
[
%{
name: "Jeferson",
trigram: MapSet.new(["  j", " je", "efe", "ers", "fer", "jef", "on ",
  "rso", "son"])
},
%{
name: "Jefferson",
trigram: MapSet.new(["  j", " je", "eff", "ers", "fer", "ffe", "jef",
  "on ", "rso", "son"])
}
],
[
%{
name: "Maria",
trigram: MapSet.new(["  m", " ma", "ari", "ia ", "mar", "ria"])
}
]
]
This works great, but the problem is that the algorithm is very slow for bigger inputs: for 10_000 values, it takes around 279 seconds. Since each value is compared against every remaining value, the number of similarity checks grows quadratically with the input size.
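For reference, a minimal sketch of how a timing like that can be reproduced, assuming 10_000 random 8-letter names (the real dataset and exact trigram generation will of course differ):

```elixir
# Hypothetical benchmark input: 10_000 random lowercase 8-letter names.
# The anonymous trigrams fn pads like pg_trgm does for a single word.
trigrams = fn name ->
  ("  " <> String.downcase(name) <> " ")
  |> String.graphemes()
  |> Enum.chunk_every(3, 1, :discard)
  |> Enum.map(&Enum.join/1)
  |> MapSet.new()
end

values =
  for _ <- 1..10_000 do
    name = for _ <- 1..8, into: "", do: <<Enum.random(?a..?z)>>
    %{name: name, trigram: trigrams.(name)}
  end

# :timer.tc/1 returns {elapsed_microseconds, result}.
{micros, _groups} = :timer.tc(fn -> GroupBySimilarity.group(values, 0.7) end)
IO.puts("grouped 10_000 values in #{micros / 1_000_000}s")
</imports>
```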
I would love any suggestions on how to improve its performance and lower its complexity.