List.myers_difference is very slow and memory hungry

smetana · April 16, 2018, 5:51pm

Hello,

Is this algorithm inherently slow and memory inefficient or is there something wrong with the implementation? Running it on 5000 elements takes 9 seconds and 500 Megabytes

iex(1)> memory = fn() -> :erlang.memory()[:processes] / 1024 / 1024 |> trunc() end
iex(2)> memory.()
4

iex(3)> a = ""
iex(4)> b = String.duplicate("a", 5000)
iex(5)> memory.()                      
5

iex(6)> c = String.myers_difference(a, b)
iex(7)> memory.()                        
577

iex(8) (:timer.tc(fn() -> String.myers_difference(a, b) end) |> elem(0)) / 1_000_000
9.227623

Thanks,
Serge

mgwidmann · April 16, 2018, 6:08pm

Looking at the implementation, it seems it just transforms both strings into lists of graphemes and performs List.myers_difference/2 and then maps it to the results. The grapheme transformation seems quick when I try it so it cannot be that.

smetana · April 16, 2018, 6:15pm

Yes. The problem is in List.myers_difference

iex(1)> memory = fn() -> :erlang.memory()[:processes] / 1024 / 1024 |> trunc() end
iex(2)> a = String.duplicate("a", 5000) |> String.graphemes()
iex(3)> b = String.duplicate("b", 5000) |> String.graphemes()
iex(4)> memory.()
6

iex(5)> List.myers_difference(a, b)
iex(6)> memory.()
1716

1.5 Gb for two lists of 5000 elements

mgwidmann · April 20, 2018, 8:41pm

Perhaps you should try with elixir HEAD, seems José Valim has made some changes:

Qqwy · April 21, 2018, 8:40am

The thing is, what Myer’s difference builds under the hood (if I remember correctly from the university) is a two-dimensional matrix: each combination of items in the two sequences is a cell. This means that it has an O(N*M) space complexity, which might explain the 1.5 GB of RAM that is suddenly in use.

EDIT: No, that’s not correct. You implicitly iterate over such a matrix, but you only perform a saddleback-search (So you won’t generate the whole matrix, luckily). Still, turning 5000-element strings into a list will at least double their size (because all of the intermediate pointers).

josevalim · April 21, 2018, 9:02am

There is also an issue: https://github.com/elixir-lang/elixir/issues/7559

TL;DR - the algorithm uses more memory depending on how different the words are. If they are completely different, then it will use a lot. A better algorithm is also available but we did not implement. Pull requests are welcome.

Architect · April 26, 2018, 2:24pm

@josevalim This has been bugging me for awhile. I think the name for this function is not correct. The function returns what is called the Levenshtein Distance. Myers Difference is an algorithm for calculating the Levenshtein Distance.

josevalim · April 26, 2018, 2:37pm

Not quite. Levenshtein Distance tracks inserts, deletes and substitutions while Myers solves LCS (longest common subsequence) which tracks only inserts and deletes. This means that “hello” and “hallo” has Levenshtein Distance of 1 since the edit script has one entry (a substitution) but it would be considered to have distance of 2 for Myers since the edit script has two entries (one insert and one deletion).