Any plans for other diff algorithms like String.myers_difference method, but for other data types?

Hey, I read a &String.myers_difference/2 code and modified it, so it’s possible to use List (also char list) and Keyword types instead of String type.

Here is an example module:

defmodule Example do
  def myers_difference(list1, list2) do
    len1 = Enum.count(list1)
    len2 = Enum.count(list2)
    path = {0, 0, list1, list2, []}
    find_script(0, len1 + len2, [path])
  end

  defp find_script(envelope, max, _paths) when envelope > max do
    nil
  end

  defp find_script(envelope, max, paths) do
    case each_diagonal(-envelope, envelope, paths, []) do
      {:done, edits} -> compact_reverse(edits, [], false)
      {:next, paths} -> find_script(envelope + 1, max, paths)
    end
  end

  defp compact_reverse([], acc, _), do: acc

  defp compact_reverse([{kind, el} | rest], [{kind, next_elem} | acc], false) do
    compact_reverse(rest, [{kind, [el, next_elem]} | acc], true)
  end

  defp compact_reverse([{kind, el} | rest], [{kind, next_elements} | acc], true) do
    compact_reverse(rest, [{kind, [el | next_elements]} | acc], true)
  end

  defp compact_reverse([elem | rest], acc, _) do
    compact_reverse(rest, [elem | acc], false)
  end

  defp each_diagonal(diag, limit, _paths, next_paths) when diag > limit do
    {:next, Enum.reverse(next_paths)}
  end

  defp each_diagonal(diag, limit, paths, next_paths) do
    {path, rest} = proceed_path(diag, limit, paths)
    with {:cont, path} <- follow_snake(path) do
      each_diagonal(diag + 2, limit, rest, [path | next_paths])
    end
  end

  defp proceed_path(0, 0, [path]), do: {path, []}

  defp proceed_path(diag, limit, [path | _] = paths) when diag == -limit do
    {move_down(path), paths}
  end

  defp proceed_path(diag, limit, [path]) when diag == limit do
    {move_right(path), []}
  end

  defp proceed_path(_diag, _limit, [path1, path2 | rest]) do
    if elem(path1, 1) > elem(path2, 1) do
      {move_right(path1), [path2 | rest]}
    else
      {move_down(path2), [path2 | rest]}
    end
  end

  defp move_right({x, y, list1, [elem | rest], edits}) do
    {x + 1, y, list1, rest, [{:ins, elem} | edits]}
  end

  defp move_right({x, y, list1, [], edits}) do
    {x + 1, y, list1, [], edits}
  end

  defp move_down({x, y, [elem | rest], list2, edits}) do
    {x, y + 1, rest, list2, [{:del, elem} | edits]}
  end

  defp move_down({x, y, [], list2, edits}) do
    {x, y + 1, [], list2, edits}
  end

  defp follow_snake({x, y, [elem | rest1], [elem | rest2], edits}) do
    follow_snake({x + 1, y + 1, rest1, rest2, [{:eq, elem} | edits]})
  end

  defp follow_snake({_x, _y, [], [], edits}) do
    {:done, edits}
  end

  defp follow_snake(path) do
    {:cont, path}
  end
end

As you can see I modified only some things:

  1. variable names - now we don’'t have a string and chars (graphemes) - instead we have list and elements
  2. compact_reverse method has one more (boolean) parameter and case - it’s for create/join a list of changes (no conflict with nested list)
  3. myers_difference main method is modified to support list (len1 and len2 variables)

Here is the link for source of original implementation.

If someone is interested in Map implementation I can try modify the code.

What do you think about my changes?

4 Likes

I’d think a myers difference of various base types would be quite fascinating. A good diff algorithm is useful in a lot of ways. Should make an elixir library? There is a diff package on hex but it only works on list of base values (like integers, graphemes, etc…), and a more generic one (also protocol driven for custom user structs and such perhaps?) would be quite useful.

2 Likes

I know a diff hex package.
I’m working on a differ package (compare, patch and revert) for all types that can be saved in Ecto. Just see that implementation for String had 99% of implementation for Lists and post my code + question. It’s not so more important for me to add this into elixir code. Just modify it and see if someone interested. Got first like, so I see it’s not a bad idea :smile:. Maybe in this month I post my differ package, but there are a lot of things I want to implement to it. I have done it for String, Atom (also boolean), Integer and Float types. Now I’m working on List and Map (also structs).

2 Likes

We did discuss at least Enum.myers_difference but I don’t remember where we settled on that. I will ping @lexmag and follow up on that.

3 Likes

The problem with Enum.myers_difference/2 is that it won’t work with every enumerable.
Algorithm works on sequences but enumerables are collections (and only sometimes sequences, maps, for example, have no order).

We could provide List.myers_difference/2 and implement String.myers_difference/2 in terms of it. I think this would be generic enough and people can use it on their own, for example, to diff tuples.

2 Likes

What about turn Map into Keyword?
With my code in iex:

old = Enum.into(%{a: 5}, [])
new = Enum.into(%{a: 5, b: 10}, [])
result = Example.myers_difference(old, new)
assert result == [eq: {:a, 5}, ins: {:b, 10}]

Turning map into keyword won’t help as we get keyword with non-deterministic order.
Maps with less than 33 elements have order but it is an implementation detail and this fact should be ignored.

3 Likes

33 elements - ok now I see “special case” - I tested it for small maps
How about this:

old_map = %{...}
new_map = %{...}
old_data = old_map |> Enum.into([]) |> Enum.sort
new_data = new_map |> Enum.into([]) |> Enum.sort
# create diff from old_data and new_data here ...

I see that big maps lost order, but after turning them into Keyword and then sort them it already works. We have order and we can compare two maps (as sorted keywords).
Do we need to have ordered maps after patching and reverting when we will have already done creating diff of ordered keywords?
What do you think about it?

EDIT
In my differ library (under development) I have not problems with compare, patch and revert big maps, but I think ordered diff is much more cleaner (what differ already do), so I added &Enum.sort/1 in Map support.
Do you see any other special case?

Keywords ordering will improve edit script, however, it will be far from optimal and non-representative. Myers’ algorithm is designed to work on sequences, but maps have different semantics. It would be wasteful application of algorithm with inaccurate result while there are much faster ways to find edit script for maps, for example.

2 Likes

I agree with your arguments.
As I wrote I’m working on differ library. If you know that better implementation for Map will be added in next versions I can stop work on this (I mean Map support) and wait for new release(s) of Elixir.
Note: I started wrote about myers algorithm, because don’t know any Elixir implementation for other data types.
I’m looking for all diff algorithms for data that could be saved in Ecto.
Can you list all Elixir (also in future releases) algorithms you plan to support?

If I good understand:

  1. List and String could have implemented myers_difference own method (TODO implementation for List).
  2. Atom and binary could use an String implementation.
  3. Atom (boolean), Decimal, Float and Integer do not need any advanced diff algorithm.

What about Map? Any public API or plans for this?
What about UUID, Date, DateTime and Time? For data and time infos I want to get difference in milliseconds, but I don’t find any API for it.

EDIT: Topic updated.

1 Like

Can you list all Elixir (also in future releases) algorithms you plan to support?

Currently we don’t have plans for adding more diff algorithms.
List.myers_difference/2 was added today in master branch.

What about Map? Any public API or plans for this?

I recommend checking ExUnit.Diff module to see what we do for maps and other data types. In particular, for maps we use O(N) algorithm there.

4 Likes

In other thread @josevalim wrote that is not good idea to use non-public API:

You are not supposed to be calling it and there is no guarantee the module nor the function will exist in future Elixir versions. :slight_smile:

I think copy it to my project is not good idea too, because I need to compare changes each release.
Anyway thanks for your respond. I will use temporary Map as sorted Keyword until somebody add ExUnit.Diff module to public API.
I will try to implement diff for other data types.

See simple Map.meyers_difference here: https://gist.github.com/boydm/ab7aec16d54709862ec3491c2b01ddda

@boydm: thanks, but keep in mind what was already said: