[linux-elitists] finding similarities in text

mbp@sourcefrog.net mbp@sourcefrog.net
Fri Jun 20 16:40:31 PDT 2003


The 'shred and compare MD5sums' technique described in the Register is
a simple form of what are called 'shingle algorithms'.  ('Simple'
rocks, especially when you can do it in shell script.)  The 'shingle'
name comes from the way the chunks slightly overlap.
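
For illustration, here is a minimal sketch of the idea in Python rather
than shell.  It is not the Register's script: the word-level chunks, the
chunk size k=4, the set-overlap score, and the function names `shingles`
and `resemblance` are all assumptions made for this example.

```python
import hashlib


def shingles(text, k=4):
    """Return the set of MD5 digests of every run of k consecutive words.

    Consecutive runs overlap by k-1 words, like shingles on a roof.
    (k=4 and word-level chunking are assumptions for this sketch.)
    """
    words = text.split()
    if not words:
        return set()
    if len(words) < k:
        # Text shorter than one shingle: hash it as a single chunk.
        return {hashlib.md5(" ".join(words).encode("utf-8")).hexdigest()}
    return {
        hashlib.md5(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(len(words) - k + 1)
    }


def resemblance(a, b, k=4):
    """Fraction of shingle digests shared by two texts (Jaccard index)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)
```

Two texts that share long runs of words score close to 1.0; unrelated
texts score close to 0.0.  Where to draw the 'suspicious' line is left
to the reader.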

This is the sort of thing that Google and other search engines use to
find duplicate or similar web pages, and that plagiarism-detection
software uses to catch naughty schoolchildren.

I understand this is a pretty active research topic, related to delta
compression and network deltas like rsync:

  http://www.google.com/search?q=shingle%20algorithms

--
Martin


