(Programming challenge #1)
I’m often asked if I can think of any good problems to solve - well yes, here’s a problem I’ve been working on for years. It’s really difficult (and BTW it formed the subject of Chapter 27, “Sherlock’s Last Cast”, the final chapter in my Erlang book).
Here’s the problem.
Given a large set of paragraphs find the most similar paragraphs in the set.
Why is this an interesting problem?
Finding similarities between things is an essential step in reducing
entropy - much of science happens because somebody notices that two
apparently different things have certain similarities.
Use cases:
- I have an idea - I write it down in a single paragraph that
describes the idea. I then want to know “has anybody else had the same
or a very similar idea?” - i.e. search all paragraphs and find the most
similar paragraph to what I have just written.
Comment: This is a very useful idea - often I have what I think is a
new idea, I write it down, then search a bit, only to find I had a
similar idea 20 years ago …
- Given a very large document (say 5,000 to 10,000 paragraphs) can I reduce
the size of the document by finding very similar paragraphs?
Given two files A and B, can I find a partition such that A = A1 ++ C ++
A2 and B = B1 ++ C ++ B2 (++ means append)? In that case I can create
a file C and replace the common parts of A and B by links, or
transclusions, to C (see the sketch after this list).
- Given a very large document can I find two paragraphs A and B that
are so similar that one of them can be removed.
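To make the A = A1 ++ C ++ A2 idea concrete, here is a minimal Python sketch (my own illustration, handling only a single shared block) that uses difflib from the standard library to find the longest common block C of two strings:

    import difflib

    def split_on_common_block(a, b):
        # find the longest block C shared by a and b, plus the surrounding
        # parts, so that a == a1 + c + a2 and b == b1 + c + b2
        m = difflib.SequenceMatcher(None, a, b)
        hit = m.find_longest_match(0, len(a), 0, len(b))
        c = a[hit.a:hit.a + hit.size]
        a1, a2 = a[:hit.a], a[hit.a + hit.size:]
        b1, b2 = b[:hit.b], b[hit.b + hit.size:]
        return a1, c, a2, b1, b2

A real tool would apply this recursively to A1/B1 and A2/B2 to pull out every sufficiently large common block.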
Now to the problems:
First I’ll define an input format.
The input is a JSON file like this:
[{"tag":"1"}, {"value":"the content of paragraph 1"},
 {"tag":"2"}, {"value":"paragraph 2"},
 ...]
Tags identify the paragraphs, values are the paragraph data. I’ll also
set a limit on paragraph sizes (say no longer than 1 KB).
The program

    $find_similarities <input> <output>

finds the most similar paragraphs in <input>, writing the results to <output>.
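A skeleton of this program in Python - a sketch under my own assumptions: the loader pairs up the alternating tag/value objects and enforces the 1 KB limit, and the output format (a JSON list of the top-scoring pairs) is invented for illustration:

    import json, sys

    def load_paragraphs(path, limit=1024):
        # pair up the alternating {"tag": ...} / {"value": ...} objects
        items = json.load(open(path, encoding="utf-8"))
        pairs = [(t["tag"], v["value"]) for t, v in zip(items[0::2], items[1::2])]
        for _, value in pairs:
            assert len(value.encode("utf-8")) <= limit, "paragraph too long"
        return pairs

    if __name__ == "__main__":
        # usage: find_similarities <input> <output>
        paragraphs = load_paragraphs(sys.argv[1])
        results = []   # fill with (score, tag_a, tag_b) - see Algorithms below
        json.dump(sorted(results, reverse=True)[:10],
                  open(sys.argv[2], "w"), indent=1)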
We also need programs to create reasonable input data - so, for example, a
program that takes a typical document and turns it into an appropriate input
file is needed.
For starters we might use the Elixir documentation or the set of RFCs.
It would be fantastic to reduce the entropy of the set of all RFCs -
I suspect their authors cut and paste heavily when writing them. Many RFCs
reproduce the text from earlier RFCs, so a similarity detector could
hopefully be used to reduce the total amount of text.
Algorithms
I have played with some similarity detectors for years now; the problem is
that they are very slow.
Essentially we do this (here made concrete in Python):

    import itertools

    scores = []
    # compare every unordered pair of (tag, text) paragraphs
    for (ta, a), (tb, b) in itertools.combinations(paragraphs, 2):
        scores.append((similarity(a, b), ta, tb))

    # then sort and display the first few entries
    scores.sort(reverse=True)
    print(scores[:10])

(similarity() is whichever measure we pick below.)
This is essentially an O(N^2) algorithm, and it is painfully slow for large
data sets: N paragraphs mean N(N-1)/2 comparisons, so N = 10,000 already
means about 50 million similarity computations.
As for the similarity measures: the best measure I have found so far is very
good for medium-length paragraphs, but it is very slow.
Keyword extraction using TF-IDF and cosine similarity is good
(this was the method I described in my Erlang book) - a lot faster, but
less accurate.
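For concreteness, here is a minimal self-contained TF-IDF/cosine sketch in Python (my own illustration - not the code from the book - using naive whitespace tokenisation):

    import math
    from collections import Counter

    def tf_idf_vectors(texts):
        # one sparse {term: weight} vector per text, weight = tf * log(N/df)
        docs = [Counter(t.lower().split()) for t in texts]
        df = Counter()
        for doc in docs:
            df.update(doc.keys())          # document frequency of each term
        n = len(docs)
        return [{t: tf * math.log(n / df[t]) for t, tf in doc.items()}
                for doc in docs]

    def cosine(u, v):
        # cosine of the angle between two sparse vectors
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = math.sqrt(sum(w * w for w in u.values())) \
             * math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0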
The Jaccard index might be OK - I haven’t tried this.
Or Levenshtein distance - I haven’t tried this either.
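The Jaccard index at least is trivial to try - here on word sets, though character n-grams (shingles) are a common variant (my sketch, untested on this problem):

    def jaccard(a, b):
        # |intersection| / |union| of the two paragraphs' word sets
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0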
If we want to detect similarities caused by cut-and-pasted text (i.e.
exact copies) then the rsync algorithm is great - it just uses a
rolling hash. This technique is used in
plagiarism detection and is very fast.
See also this paper on the rsync algorithm:
https://www.andrew.cmu.edu/course/15-749/READINGS/required/cas/tridgell96.pdf
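To show the flavour of the rolling-hash approach, here is a toy Rabin-Karp style sketch in Python (my own illustration - not the actual rsync checksum): hash every fixed-size window of one text in O(1) per step, then look up the windows of another text against it.

    BASE, MOD, W = 257, (1 << 61) - 1, 32    # hash base, modulus, window size

    def window_hashes(text):
        # rolling hash -> start positions of every W-character window of text
        h, positions = 0, {}
        top = pow(BASE, W - 1, MOD)          # weight of the outgoing character
        for i, ch in enumerate(text):
            if i >= W:
                h = (h - ord(text[i - W]) * top) % MOD
            h = (h * BASE + ord(ch)) % MOD
            if i >= W - 1:
                positions.setdefault(h, []).append(i - W + 1)
        return positions

    def shared_windows(a, b):
        # (i, j) pairs where a[i:i+W] == b[j:j+W]; hash hits are verified
        in_a = window_hashes(a)
        return [(i, j)
                for h, starts in window_hashes(b).items()
                for j in starts
                for i in in_a.get(h, [])
                if a[i:i + W] == b[j:j + W]]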
As a first step to solving this it would be good for somebody to write
code to generate test input data sets, with appropriate file names -
say <corpus>_500.json might be a file containing 500 paragraphs from the
chosen corpus.
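A minimal generator in Python, assuming plain-text source files with paragraphs separated by blank lines (the splitting rule and the truncation to 1 KB are my assumptions):

    import json, sys

    def make_input(text, limit=1024):
        # split on blank lines, tag the paragraphs, emit the alternating format
        paras = [p.strip() for p in text.split("\n\n") if p.strip()]
        out = []
        for n, p in enumerate(paras, 1):
            out.append({"tag": str(n)})
            out.append({"value": p[:limit]})
        return out

    if __name__ == "__main__":
        # usage: python make_input.py corpus.txt > corpus_N.json
        text = open(sys.argv[1], encoding="utf-8").read()
        print(json.dumps(make_input(text), indent=1))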
Feel free to solve this in any other language or cross-post to any
other lists - I’m primarily interested in the solution and algorithms,
not the language used to solve the problem.
Any takers?
Any ideas for good similarity algorithms?