[linux-elitists] Any SCO source licensees present (or known?)
Wed Jun 18 12:01:13 PDT 2003
A friend and I have been discussing doing what Egan just did. My friend
even did some coding. In the discussion, we got an idea that may
increase the number of hits. Run the code through a code beautifier
before you shred it. Evidently that is what the BSD folk did when they
were trying to prove the differences between BSD and SysV. The work was
done by Keith Bostic and evidently the original code is in the CVS
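
To make the beautifier idea concrete, here is a minimal sketch of normalizing source before shredding. It is not a real beautifier and not what the BSD folk used: it only collapses whitespace and drops blank lines, and the name normalize_lines is made up for illustration. A real run would more likely use an actual pretty-printer.

```python
import re

def normalize_lines(text):
    """Crude stand-in for a code beautifier (an assumption, not a real tool):
    collapse runs of whitespace and drop blank lines so that layout-only
    differences no longer change the text that gets hashed."""
    out = []
    for line in text.splitlines():
        squeezed = re.sub(r"\s+", " ", line).strip()
        if squeezed:
            out.append(squeezed)
    return out

# Two fragments that differ only in spacing and blank lines.
a = "int  x =\t1;\n\n    return x;\n"
b = "int x = 1;\nreturn x;\n"
print(normalize_lines(a) == normalize_lines(b))
```

Something this simple will not catch reflowed statements or moved braces, which is where a proper beautifier would earn its keep.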
On Wed, 2003-06-18 at 08:23, Karsten M. Self wrote:
> The Inquirer posts the following algorithm for doing an MD5-based file
> comparison of Linux and SCO kernel sources:
> Shutting down SCO's FUD machine
> By Egan Orion: Wednesday 18 June 2003, 10:58
> "Yesterday I realized how trivial it was to find matching code within
> two source trees.
> "While working on this stuff, I realized that [the] SCO lawsuit is
> indeed pure FUD, and they will keep it like that till the end. So it
> seems like the best thing for the linux community now would be to find
> the matching code ourselves and figure out where it came from. SCO
> help is not needed. Otherwise Linux is so to speak a sitting duck. If
> Linux community knows what is very similar and why, that would fully
> protect Linux in press and leave IBM to annihilate SCO."
> I don't know how "fully" this might be effective, because certain
> press elements are practically extensions of the Vole's propaganda
> office. It does sound interesting enough to look into closely,
> though. Our unnamed correspondent continues:
> "Since I do not have access to System V code, I took Linux 2.4.20
> and BSD-lite 4.4. I'll give the technical details later, but here
> are the findings:
> "[Linux versus] 4.4BSD-Lite
> "  lines      Linux           BSD
>    200- 260  ...amd7930.c    ...bsd_audio.c
>    398- 519  ...slhc.c       ...slcompress.c
>    739- 766  ...balloc.c     ...ffs_alloc.c
>   2267-2299  ...bonding.c    ...inet_addr.c
> [Note: We truncated the full paths for formatting purposes, but the
> original email is available containing all paths and other details.]
> "On the left is the file in the Linux tree, on the right is the
> file in the 4.4BSD tree. Also the range of matching lines in Linux
> is given on the left. It is unlikely that I missed any other large
> matching fragments.
> "Now, it seems to be quite likely that the matching Linux-System
> V code shown to the "experts" by SCO came from one of these
> files. And all because this is the original BSD code, which got
> copied everywhere."
> As our reader intimates, he's found a clever way to compare Unix
> source code without viewing the code directly or violating copyrights.
> We will let him explain in further detail how it's possible to
> do this:
> "Here is the procedure for finding the matching code....
> "1. Each file within each source tree is "shredded" into 5-line
> pieces (1-5, 2-6, 3-7, etc.). An MD5 sum is computed for each block of
> lines. The output is 3 columns: MD5sum, source file, 1st line in the block.
> "At this stage, 4.4BSD had [a] ~40Mb file, linux ~160Mb. Potentially,
> one could shred into smaller or larger pieces, however, with pieces
> too small there'll be a lot of noise, with pieces too large some
> matches won't be seen. 5 liners seem to be a good compromise.
> "2. Within each source tree the "shredded" file is sorted by
> MD5sum, and duplicate entries within the same tree are removed
> completely (these are either trivial 5-line sequences or licensing
> disclaimers). Unix sort here takes a couple of minutes on a 600Mhz P3.
> "3. A column indicating the origin of the file is inserted into the
> file (0 - BSD, 1 - linux). Both Linux and BSD "shredded" files are
> merged such that MD5sums stay sorted.
> "4. At this point a given MD5sum will occur either once or twice,
> i.e., in both source trees. Here remove all the single lines, and
> have the 5 liners left that are matching.
> "5. Count for each file in Linux tree the number of matches with
> the BSD tree using the file generated at step 4. Sort this list,
> and the largest counts will occur for the files with the largest
> number of matching lines. The range can be extracted from the file
> from step 4, since at step 1 we kept the address of the 1st line in
> the block. That is how the info above was generated.
> "The beauty of this scheme is that anybody with System V code can
> inform the Linux community about what is identical without revealing
> any System V code. And this might actually be legal, since I do
> not think that there are clauses in the contracts NOT to shred the
> code and compare it with other code. Also, it is quite easy to stay
> anonymous since the person who does the analysis need not reveal
> him/herself in any way."
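
The five steps quoted above can be sketched roughly as follows. This is an illustration under stated assumptions, not the correspondent's actual scripts: he describes flat files and Unix sort, while this version keeps everything in memory. Each tree is assumed to be a dict mapping a file name to its list of lines, and all function names here are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict

WINDOW = 5  # block size from the description; other sizes are possible

def shred(name, lines, window=WINDOW):
    """Step 1: yield (md5, file, first line number) for each window of lines."""
    for i in range(len(lines) - window + 1):
        block = "\n".join(lines[i:i + window]).encode("utf-8")
        yield hashlib.md5(block).hexdigest(), name, i + 1

def unique_blocks(tree):
    """Step 2: hash a whole tree and drop every digest that occurs more
    than once inside that tree (trivial blocks, license boilerplate)."""
    seen = defaultdict(list)
    for name, lines in tree.items():
        for digest, fname, start in shred(name, lines):
            seen[digest].append((fname, start))
    return {d: locs[0] for d, locs in seen.items() if len(locs) == 1}

def matches(tree_a, tree_b):
    """Steps 3-4: keep only the digests that appear in both trees."""
    a, b = unique_blocks(tree_a), unique_blocks(tree_b)
    return sorted((a[d], b[d]) for d in a.keys() & b.keys())

def count_by_file(pairs):
    """Step 5: number of matching blocks per file of the first tree."""
    return Counter(loc_a[0] for loc_a, _ in pairs).most_common()

def line_ranges(pairs, fname, window=WINDOW):
    """Merge the matching windows of one file into (first, last) ranges."""
    starts = sorted(a[1] for a, _ in pairs if a[0] == fname)
    ranges = []
    for s in starts:
        if ranges and s <= ranges[-1][1] + 1:
            ranges[-1][1] = max(ranges[-1][1], s + window - 1)
        else:
            ranges.append([s, s + window - 1])
    return [tuple(r) for r in ranges]

# Toy trees: six shared lines, offset by one line in the second tree.
linux = {"a.c": ["l1", "l2", "l3", "l4", "l5", "l6", "x"]}
bsd = {"b.c": ["y", "l1", "l2", "l3", "l4", "l5", "l6"]}
pairs = matches(linux, bsd)
print(count_by_file(pairs))
print(line_ranges(pairs, "a.c"))
```

Only digests, file names and line numbers leave the process, which is the property the correspondent is relying on. For full kernel trees the in-memory dicts would get large, which is presumably why the original used sorted flat files instead.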