I’m working on a problem where I’m parsing a list of column headers from a Google spreadsheet, trying to fuzzy match these headers to the corresponding Ecto schema field names. A column header like “Select DHR Revision Number” would ideally be simplified to “dhr_revision_number” as an example.
Here’s the method, naming it fuzzy_zip
, that takes a header name list and a list of Ecto schema fields, and uses a modified String.jaro_distance
method to match the two columns. I still need to add a condition that makes sure an entry isn’t repeated.
def averaged_jaro_distance(a, b, seperators \\ {" ", " "}) when is_bitstring(a) and is_bitstring(b) do
{a_seperator, b_seperator} = seperators
a_split = String.split(a, a_seperator)
b_split = String.split(b, b_seperator)
sum =
a_split
|> Enum.map(fn x ->
Enum.reduce(b_split, 0, fn y, acc -> acc + String.jaro_distance(x, y) end) /
length(b_split)
end)
|> Enum.sum()
sum / length(a_split)
end
def sort_zip(a, b, mapper, sorter \\ &<=/2) when is_list(a) and is_list(b) and length(a) == length(b) do
a
|> Enum.map(fn x ->
{x, Enum.sort_by(b, &mapper.(&1, x), sorter) |> List.first()}
end)
end
def fuzzy_zip(a, b, seperators \\ {" ", " "}) when is_list(a) and is_list(b) and length(a) == length(b) do
sort_zip(a, b, &(averaged_jaro_distance(&1, &2, seperators)), &>=/2)
end
And here are the doctests to explain why I modified jaro_distance
- string order affects result.
iex> String.jaro_distance("apple cider", "cider")
0.43030303030303035
iex> String.jaro_distance("apple cider", "apple")
0.8181818181818182
iex> averaged_jaro_distance("apple cider", "cider")
0.7333333333333333
My modified jaro_distance
method still results in an incorrect zip, although better than the original method.
fuzzy_zip(
[
"Timestamp",
"Confirm Start Time",
"Manufacturer",
"Select WI Revision Number",
"Select DHR Revision Number",
"Select Item Number"
],
[
"dhr_revision_number",
"start_time",
"manufacturer",
"wi_revision_number",
"item_number",
"timestamp"
],
{" ", "_"}
)
[
{"Timestamp", "timestamp"},
{"Confirm Start Time", "start_time"},
{"Manufacturer", "manufacturer"},
{"Select WI Revision Number", "wi_revision_number"},
{"Select DHR Revision Number", "wi_revision_number"},
{"Select Item Number", "item_number"}
]
Any suggestions to improve this algorithm? I can’t change the sheet headers, might end up having additional entries added without warning (these will be filtered out) or having slight variations, and so ideally would like a robust fuzzy search method as attempted. Also thought it’s an interesting problem to solve