Suggestions to improve my group by string similarity algorithm performance

Hey everyone, I made an algorithm that takes a list of strings with their corresponding trigrams and groups them by similarity. It computes the same similarity score as the PostgreSQL pg_trgm extension.

For my inputs, I have something like this:

values = [
  %{name: "Eduardo", trigram: MapSet.new(["  e"," ed","ard","do ","dua","edu","rdo","uar"])},
  %{name: "Jeferson", trigram: MapSet.new(["  j"," je","efe","ers","fer","jef","on ","rso","son"])},
  %{name: "Eduardo B.", trigram: MapSet.new(["  b","  e"," b "," ed","ard","do ","dua","edu","rdo","uar"])},
  %{name: "Jefferson", trigram: MapSet.new(["  j"," je","eff","ers","fer","ffe","jef","on ","rso","son"])},
  %{name: "Maria", trigram: MapSet.new(["  m"," ma","ari","ia ","mar","ria"])}
]

The trigrams are generated directly by PostgreSQL using the show_trgm function (e.g. select show_trgm('Eduardo');).
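In case it helps anyone reproduce the inputs without a round-trip to the database, here is a rough sketch of pg_trgm-compatible trigram extraction in Elixir. This is my assumption of what pg_trgm does (lowercase the string, split on non-alphanumerics, pad each word with two leading spaces and one trailing space, then take 3-character windows); double-check it against show_trgm for your own data:

```elixir
defmodule Trigrams do
  # Sketch of pg_trgm-style trigram extraction (assumed behavior,
  # verify against PostgreSQL's show_trgm for edge cases).
  def from_string(string) do
    string
    |> String.downcase()
    |> String.split(~r/[^[:alnum:]]+/u, trim: true)
    |> Enum.flat_map(&word_trigrams/1)
    |> MapSet.new()
  end

  defp word_trigrams(word) do
    # pg_trgm pads each word with two spaces in front and one behind.
    padded = "  " <> word <> " "

    padded
    |> String.graphemes()
    |> Enum.chunk_every(3, 1, :discard)
    |> Enum.map(&Enum.join/1)
  end
end
```

For the example inputs above, Trigrams.from_string("Eduardo") produces the same set as show_trgm('Eduardo').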

defmodule GroupBySimilarity do
  def group(values, threshold \\ 0.9)
  
  def group([], _threshold), do: []

  def group([value | rest], threshold) do
    %{trigram: trigram} = value

    {similar, rest} =
      Enum.split_with(rest, fn %{trigram: rhs_trigram} -> similarity(trigram, rhs_trigram) >= threshold end)

    [[value] ++ similar] ++ group(rest, threshold)
  end

  defp similarity(lhs, rhs) do
    intersection_count = lhs |> MapSet.intersection(rhs) |> MapSet.size()
    union_count = lhs |> MapSet.union(rhs) |> MapSet.size()

    intersection_count / union_count
  end
end

As you can see, the algorithm is pretty simple: it splits the values list based on the similarity/2 function and then repeats the process on the remaining values until the list is empty.

So, for the example values above, the result would be:

iex(194)> GroupBySimilarity.group(values, 0.7)
[
  [
    %{
      name: "Eduardo",
      trigram: MapSet.new(["  e", " ed", "ard", "do ", "dua", "edu", "rdo",
       "uar"])
    },
    %{
      name: "Eduardo B.",
      trigram: MapSet.new(["  b", "  e", " b ", " ed", "ard", "do ", "dua",
       "edu", "rdo", "uar"])
    }
  ],
  [
    %{
      name: "Jeferson",
      trigram: MapSet.new(["  j", " je", "efe", "ers", "fer", "jef", "on ",
       "rso", "son"])
    },
    %{
      name: "Jefferson",
      trigram: MapSet.new(["  j", " je", "eff", "ers", "fer", "ffe", "jef",
       "on ", "rso", "son"])
    }
  ],
  [
    %{
      name: "Maria",
      trigram: MapSet.new(["  m", " ma", "ari", "ia ", "mar", "ria"])
    }
  ]
]
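To make the grouping concrete, here is the first group's score worked out by hand (the same Jaccard computation similarity/2 does, just inlined):

```elixir
# Trigrams of "Eduardo" and "Eduardo B." from the example input.
lhs = MapSet.new(["  e", " ed", "ard", "do ", "dua", "edu", "rdo", "uar"])
rhs = MapSet.new(["  b", "  e", " b ", " ed", "ard", "do ", "dua", "edu", "rdo", "uar"])

intersection_count = lhs |> MapSet.intersection(rhs) |> MapSet.size()  # 8 shared trigrams
union_count = lhs |> MapSet.union(rhs) |> MapSet.size()                # 10 distinct trigrams

intersection_count / union_count
# => 0.8, which clears the 0.7 threshold
```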

This works great, but the algorithm gets very slow as the input grows: for 10_000 values it takes around 279 seconds.

I would love any suggestions on how to improve its performance and lower its complexity.


Hi @sezaru

the lowest-hanging fruit I see is to follow some recommendations from the Erlang Efficiency Guide # List Handling:

  1. prepend to the list with | instead of “concatenating” with ++, turning

    [[value] ++ similar] ++ group(rest, threshold)
    

    into

    [[value | similar] | group(rest, threshold)]
    
  2. tail-call optimization might also show some improvement. However, as mentioned in the section Recursive List Functions

    measure before rewriting your code.


Now the bigger question is whether it’s possible to reduce the complexity of the algorithm, because as far as I can see it’s currently O(N^2) (if I understood it correctly)…

Maybe we could come up with an algorithm that builds the result in at least O(N log N) or ideally O(N) (I’m not sure it’s possible, I didn’t give it too much thought yet).
But usually for this type of problem we want a map where we store some intermediate info while iterating through the list, and then look it up to see whether we can reuse earlier results.
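
To sketch that idea: one common trick for trigram similarity is an inverted index from each trigram to the values containing it. Two sets with a nonzero Jaccard score must share at least one trigram, so similarity/2 would only need to run against candidates pulled from the index instead of every remaining value. A rough sketch (module and function names are made up for illustration, not from the original code):

```elixir
defmodule TrigramIndex do
  # Build an inverted index: trigram => list of value indices containing it.
  def build(values) do
    for {%{trigram: trigram}, id} <- Enum.with_index(values),
        t <- trigram,
        reduce: %{} do
      index -> Map.update(index, t, [id], &[id | &1])
    end
  end

  # Candidate value indices that share at least one trigram with the query set.
  def candidates(index, trigram) do
    trigram
    |> Enum.flat_map(&Map.get(index, &1, []))
    |> Enum.uniq()
  end
end
```

The worst case is still quadratic (imagine all names being near-identical), but for realistic data each value only gets compared against the handful of values it actually overlaps with, which is usually a big win over comparing against everything.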
