(Programming challenge #1)
I’m often asked if I can think of any good problems to solve - well yes, here’s a problem I’ve been working on for years. It’s really difficult (and BTW it formed the subject of Chapter 27, “Sherlock’s Last Cast”, the final chapter in my Erlang book).
Here’s the problem.
Given a large set of paragraphs find the most similar paragraphs in the set.
Why is this an interesting problem?
Finding similarities between things is an essential step in reducing
entropy - much of science happens because somebody notices that two
apparently different things have certain similarities.
Use cases:
- I have an idea - I write it down in a single paragraph that
describes the idea. I then want to know “has anybody else had the same
or a very similar idea?” - i.e. search all paragraphs and find the most
similar paragraph to what I have just written.
Comment: This is a very useful idea - often I have what I think is a
new idea, I write it down, then search a bit, only to find I had a
similar idea 20 years ago …
- Given a very large document (say 5,000 to 10,000 paragraphs) can I reduce
the size of the document by finding very similar paragraphs?
Given two files A and B, can I find a partition such that A = A1 ++ C ++
A2 and B = B1 ++ C ++ B2 (++ means append)? In that case I can create
a file C and replace the common parts of A and B by links, or
transclusions, to C (see the sketch after this list).
- Given a very large document can I find two paragraphs A and B that
are so similar that one of them can be removed.
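To make the A = A1 ++ C ++ A2 idea concrete, here is a minimal Python sketch (my own illustration, handling only a single shared block) that uses difflib from the standard library to find the longest common block C of two strings:

    import difflib

    def split_on_common_block(a, b):
        # find the longest block C shared by a and b, plus the surrounding
        # parts, so that a == a1 + c + a2 and b == b1 + c + b2
        m = difflib.SequenceMatcher(None, a, b)
        hit = m.find_longest_match(0, len(a), 0, len(b))
        c = a[hit.a:hit.a + hit.size]
        a1, a2 = a[:hit.a], a[hit.a + hit.size:]
        b1, b2 = b[:hit.b], b[hit.b + hit.size:]
        return a1, c, a2, b1, b2

A real tool would apply this recursively to A1/B1 and A2/B2 to pull out every sufficiently large common block.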
Now to the problems:
First I’ll define an input format.
The input is a JSON file like this:
[{"tag":"1"}, {"value":"the content of paragraph 1"},
 {"tag":"2"}, {"value":"paragraph 2"},
 ...]
Tags identify the paragraphs, values are the paragraph data. I’ll also
set a limit on paragraph sizes (say no longer than 1 KB).
The program

    $find_similarities <input> <output>

finds the most similar paragraphs in <input>, writing the results to <output>.
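A skeleton of this program in Python - a sketch under my own assumptions: the loader pairs up the alternating tag/value objects and enforces the 1 KB limit, and the output format (a JSON list of the top-scoring pairs) is invented for illustration:

    import json, sys

    def load_paragraphs(path, limit=1024):
        # pair up the alternating {"tag": ...} / {"value": ...} objects
        items = json.load(open(path, encoding="utf-8"))
        pairs = [(t["tag"], v["value"]) for t, v in zip(items[0::2], items[1::2])]
        for _, value in pairs:
            assert len(value.encode("utf-8")) <= limit, "paragraph too long"
        return pairs

    if __name__ == "__main__":
        # usage: find_similarities <input> <output>
        paragraphs = load_paragraphs(sys.argv[1])
        results = []   # fill with (score, tag_a, tag_b) - see Algorithms below
        json.dump(sorted(results, reverse=True)[:10],
                  open(sys.argv[2], "w"), indent=1)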
We also need programs to create reasonable input data - so, for example, a
program that takes a typical document and turns it into an appropriate input
file is needed.
For starters we might use the Elixir documentation or the set of RFCs.
It would be fantastic to reduce the entropy of the set of all RFCs -
I suspect their authors cut and paste heavily when writing them. Many RFCs
reproduce the text from earlier RFCs, so a similarity detector could
hopefully be used to reduce the total amount of text.
Algorithms
I have played with some similarity detectors for years now; the problem is
that they are very slow.
Essentially we do this (here made concrete in Python):

    import itertools

    scores = []
    # compare every unordered pair of (tag, text) paragraphs
    for (ta, a), (tb, b) in itertools.combinations(paragraphs, 2):
        scores.append((similarity(a, b), ta, tb))

    # then sort and display the first few entries
    scores.sort(reverse=True)
    print(scores[:10])

(similarity() is whichever measure we pick below.)
This is essentially an O(N^2) algorithm, and it is painfully slow for large
data sets: N paragraphs mean N(N-1)/2 comparisons, so N = 10,000 already
means about 50 million similarity computations.
As for the similarity measures: the best measure I have found so far is very
good for medium-length paragraphs, but it is very slow.
Keyword extraction using TF-IDF and cosine similarity is good
(this was the method I described in my Erlang book) - a lot faster, but
less accurate.
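For concreteness, here is a minimal self-contained TF-IDF/cosine sketch in Python (my own illustration - not the code from the book - using naive whitespace tokenisation):

    import math
    from collections import Counter

    def tf_idf_vectors(texts):
        # one sparse {term: weight} vector per text, weight = tf * log(N/df)
        docs = [Counter(t.lower().split()) for t in texts]
        df = Counter()
        for doc in docs:
            df.update(doc.keys())          # document frequency of each term
        n = len(docs)
        return [{t: tf * math.log(n / df[t]) for t, tf in doc.items()}
                for doc in docs]

    def cosine(u, v):
        # cosine of the angle between two sparse vectors
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = math.sqrt(sum(w * w for w in u.values())) \
             * math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0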
The Jaccard index might be OK - I haven’t tried this.
Or Levenshtein distance - I haven’t tried this either.
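The Jaccard index at least is trivial to try - here on word sets, though character n-grams (shingles) are a common variant (my sketch, untested on this problem):

    def jaccard(a, b):
        # |intersection| / |union| of the two paragraphs' word sets
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0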
If we want to detect similarities caused by cut-and-pasted text (i.e.
exact copies) then the rsync algorithm is great - it just uses a
rolling hash. This technique is used in
plagiarism detection and is very fast.
See also this paper on the rsync algorithm:
https://www.andrew.cmu.edu/course/15-749/READINGS/required/cas/tridgell96.pdf
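To show the flavour of the rolling-hash approach, here is a toy Rabin-Karp style sketch in Python (my own illustration - not the actual rsync checksum): hash every fixed-size window of one text in O(1) per step, then look up the windows of another text against it.

    BASE, MOD, W = 257, (1 << 61) - 1, 32    # hash base, modulus, window size

    def window_hashes(text):
        # rolling hash -> start positions of every W-character window of text
        h, positions = 0, {}
        top = pow(BASE, W - 1, MOD)          # weight of the outgoing character
        for i, ch in enumerate(text):
            if i >= W:
                h = (h - ord(text[i - W]) * top) % MOD
            h = (h * BASE + ord(ch)) % MOD
            if i >= W - 1:
                positions.setdefault(h, []).append(i - W + 1)
        return positions

    def shared_windows(a, b):
        # (i, j) pairs where a[i:i+W] == b[j:j+W]; hash hits are verified
        in_a = window_hashes(a)
        return [(i, j)
                for h, starts in window_hashes(b).items()
                for j in starts
                for i in in_a.get(h, [])
                if a[i:i + W] == b[j:j + W]]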
As a first step to solving this it would be good for somebody to write
code to generate test input data sets, with appropriate file names -
say <corpus>_500.json might be a file containing 500 paragraphs from the
chosen corpus.
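A minimal generator in Python, assuming plain-text source files with paragraphs separated by blank lines (the splitting rule and the truncation to 1 KB are my assumptions):

    import json, sys

    def make_input(text, limit=1024):
        # split on blank lines, tag the paragraphs, emit the alternating format
        paras = [p.strip() for p in text.split("\n\n") if p.strip()]
        out = []
        for n, p in enumerate(paras, 1):
            out.append({"tag": str(n)})
            out.append({"value": p[:limit]})
        return out

    if __name__ == "__main__":
        # usage: python make_input.py corpus.txt > corpus_N.json
        text = open(sys.argv[1], encoding="utf-8").read()
        print(json.dumps(make_input(text), indent=1))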
Feel free to solve this in any other language or cross-post to any
other lists - I’m primarily interested in the solution and algorithms,
not the language used to solve the problem.
Any takers?
Any ideas for good similarity algorithms?