Creating a "fuzzy zip" method: advice on implementation

BenSchZA · April 6, 2020, 7:54pm

I’m working on a problem where I’m parsing a list of column headers from a Google spreadsheet, trying to fuzzy match these headers to the corresponding Ecto schema field names. A column header like “Select DHR Revision Number” would ideally be simplified to “dhr_revision_number” as an example.

Here’s the method, naming it fuzzy_zip, that takes a header name list and a list of Ecto schema fields, and uses a modified String.jaro_distance method to match the two columns. I still need to add a condition that makes sure an entry isn’t repeated.

  def averaged_jaro_distance(a, b, seperators \\ {" ", " "}) when is_bitstring(a) and is_bitstring(b) do
    {a_seperator, b_seperator} = seperators
    a_split = String.split(a, a_seperator)
    b_split = String.split(b, b_seperator)

    sum =
      a_split
      |> Enum.map(fn x ->
        Enum.reduce(b_split, 0, fn y, acc -> acc + String.jaro_distance(x, y) end) /
          length(b_split)
      end)
      |> Enum.sum()

    sum / length(a_split)
  end

  def sort_zip(a, b, mapper, sorter \\ &<=/2) when is_list(a) and is_list(b) and length(a) == length(b) do
    a
    |> Enum.map(fn x ->
      {x, Enum.sort_by(b, &mapper.(&1, x), sorter) |> List.first()}
    end)
  end

  def fuzzy_zip(a, b, seperators \\ {" ", " "}) when is_list(a) and is_list(b) and length(a) == length(b) do
    sort_zip(a, b, &(averaged_jaro_distance(&1, &2, seperators)), &>=/2)
  end

And here are the doctests to explain why I modified jaro_distance - string order affects result.

      iex> String.jaro_distance("apple cider", "cider")
      0.43030303030303035

      iex> String.jaro_distance("apple cider", "apple")
      0.8181818181818182

      iex> averaged_jaro_distance("apple cider", "cider")
      0.7333333333333333

My modified jaro_distance method still results in an incorrect zip, although better than the original method.

fuzzy_zip(
             [
               "Timestamp",
               "Confirm Start Time",
               "Manufacturer",
               "Select WI Revision Number",
               "Select DHR Revision Number",
               "Select Item Number"
             ],
             [
               "dhr_revision_number",
               "start_time",
               "manufacturer",
               "wi_revision_number",
               "item_number",
               "timestamp"
             ],
             {" ", "_"}
           )

  [
    {"Timestamp", "timestamp"},
    {"Confirm Start Time", "start_time"},
    {"Manufacturer", "manufacturer"},
    {"Select WI Revision Number", "wi_revision_number"},
    {"Select DHR Revision Number", "wi_revision_number"},
    {"Select Item Number", "item_number"}
  ]

Any suggestions to improve this algorithm? I can’t change the sheet headers, might end up having additional entries added without warning (these will be filtered out) or having slight variations, and so ideally would like a robust fuzzy search method as attempted. Also thought it’s an interesting problem to solve

al2o3cr · April 7, 2020, 1:42pm

Are there any common patterns in the column names? For instance, your example seems like it would benefit from pre-trimming words like “Confirm” and “Select” from the beginnings of headers.

lucaong · April 7, 2020, 2:55pm

Implementing efficient fuzzy match is not easy, but also a very interesting endeavor There are many possibilities, depending on your constraints and on how you define the problem. I had some fun with it while implementing MiniSearch, a compact client-side full-text search engine written in JavaScript.

If you feel like getting dragged in the rabbit hole of information retrieval, here’s a great publicly available resource to get started with it.

That said, in your case, the issue is probably that you should first normalize the terms, at least downcasing them, otherwise you will get wrong results when you calculate the Jaro distance:

String.jaro_distance("ABC", "abcd")
#=> 0.0

String.jaro_distance("abc", "abcd")
#=> 0.9166666666666666

If you downcase the spreadsheet headers, you get the correct result with your current method:

fuzzy_zip(
  Enum.map([
    "Timestamp",
    "Confirm Start Time",
    "Manufacturer",
    "Select WI Revision Number",
    "Select DHR Revision Number",
    "Select Item Number"
  ], &(String.downcase(&1))),
  [
    "dhr_revision_number",
    "start_time",
    "manufacturer",
    "wi_revision_number",
    "item_number",
    "timestamp"
  ],
  {" ", "_"}
)
#=> [
#   {"timestamp", "timestamp"},
#   {"confirm start time", "start_time"},
#   {"manufacturer", "manufacturer"},
#   {"select wi revision number", "wi_revision_number"},
#   {"select dhr revision number", "dhr_revision_number"},
#   {"select item number", "item_number"}
# ]

There would be more efficient and precise ways to perform this match, but if this is good enough for your use case I would stay with it for now.

BenSchZA · April 7, 2020, 4:36pm

@al2o3cr and @lucaong - both of those suggestions will definitely help! I think implementing the lower-casing normalization is easier, so will go with that for now and compare results.

Thanks for the other resources @lucaong. It is an interesting fun problem to solve