[linux-elitists] Any SCO source licensees present (or known?)
Karsten M. Self
Wed Jun 18 08:23:25 PDT 2003
The Inquirer posts the following algorithm for doing an MD5-based file
comparison of Linux and SCO kernel sources:
Shutting down SCO's FUD machine
By Egan Orion: Wednesday 18 June 2003, 10:58
"Yesterday I realized how trivial it was to find matching code within
two source trees.
"While working on this stuff, I realized that [the] SCO lawsuit is
indeed pure FUD, and they will keep it like that till the end. So it
seems like the best thing for the linux community now would be to find
the matching code ourselves and figure out where it came from. SCO
help is not needed. Otherwise Linux is so to speak a sitting duck. If
Linux community knows what is very similar and why, that would fully
protect Linux in press and leave IBM to annihilate SCO."
I don't know how "fully" this might be effective, because certain
press elements are practically extensions of the Vole's propaganda
office. It does sound interesting enough to look into closely,
though. Our unnamed correspondent continues:
"Since I do not have access to System V code, I took Linux 2.4.20
and BSD-lite 4.4. I'll give the technical details later, but here
are the findings:
"[Linux versus] 4.4BSD-Lite
"    lines    Linux           BSD
   200- 260  ...amd7930.c    ...bsd_audio.c
   398- 519  ...slhc.c       ...slcompress.c
   739- 766  ...balloc.c     ...ffs_alloc.c
  2267-2299  ...bonding.c    ...inet_addr.c
[Note: We truncated the full paths for formatting purposes, but the
original email is available containing all paths and other details.]
"On the left is the file in the Linux tree, on the right is the
file in the 4.4BSD tree. Also the range of matching lines in Linux
is given on the left. It is unlikely that I missed any other large
"Now, it seems to be quite likely that the matching Linux-System
V code shown to the "experts" by SCO came from one of these
files. And all because this is the original BSD code, which got
As our reader intimates, he's found a clever way to compare Unix
source code without viewing the code directly or violating copyrights
We will let him explain in further detail how it's possible to
"Here is the procedure for finding the matching code....
"1. Each file within each source tree is "shredded" into 5-line pieces
(1-5, 2-6, 3-7, etc.). An MD5 sum is computed for each block of lines.
The output is 3 columns: MD5sum, source file, 1st line in the block.
"At this stage, 4.4BSD had [a] ~40Mb file, linux ~160Mb. Potentially,
one could shred into smaller or larger pieces, however, with pieces
too small there'll be a lot of noise, with pieces too large some
matches won't be seen. 5 liners seem to be a good compromise.
"2. Within each source tree the "shredded" file is sorted by
MD5sum, and duplicate entries within the same tree are removed
completely (these are either trivial 5-line sequences or licensing
disclaimers). Unix sort here takes a couple of minutes on a 600MHz P3.
"3. A column indicating the origin of the file is inserted into the
file (0 - BSD, 1 - linux). Both Linux and BSD "shredded" files are
merged such that MD5sums stay sorted.
"4. At this point a given MD5sum will occur either once or twice
(twice meaning it appears in both source trees). Here remove all the
single lines, and have the 5 liners left that are matching.
"5. Count for each file in Linux tree the number of matches with
the BSD tree using the file generated at step 4. Sort this list,
and the largest counts will occur for the files with the largest
number of matching lines. The range can be extracted from the file
from step 4, since at step 1 we kept the address of the 1st line in
the block. That is how the info above was generated.
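
As an illustration only (this is not the correspondent's actual script, which the article does not show), the five steps above can be sketched in Python. The function names and the in-memory `tree` dictionary (file path mapped to a list of lines) are assumptions made for the example, and a dictionary intersection stands in for the sort-and-merge of steps 3 and 4. The reported range simply spans the first to the last matching block in a file, so it may cover non-matching lines if the matches are not contiguous.

```python
import hashlib
from collections import Counter, defaultdict

WINDOW = 5  # block size suggested in the article


def shred(tree, window=WINDOW):
    """Step 1: yield (md5, path, first_line) for every `window`-line block.

    `tree` is a hypothetical in-memory form: {path: [line, line, ...]}.
    """
    for path, lines in tree.items():
        for start in range(len(lines) - window + 1):
            block = "\n".join(lines[start:start + window])
            digest = hashlib.md5(block.encode("utf-8")).hexdigest()
            yield digest, path, start + 1  # 1-based line number


def unique_blocks(tree, window=WINDOW):
    """Step 2: drop every digest that occurs more than once within a tree."""
    records = list(shred(tree, window))
    counts = Counter(digest for digest, _, _ in records)
    return {d: (path, line) for d, path, line in records if counts[d] == 1}


def matching_blocks(tree_a, tree_b, window=WINDOW):
    """Steps 3-4: keep only the digests that appear in both trees."""
    a = unique_blocks(tree_a, window)
    b = unique_blocks(tree_b, window)
    return [(a[d], b[d]) for d in a.keys() & b.keys()]


def rank_files(matches, window=WINDOW):
    """Step 5: count matches per file pair, largest count first.

    Returns (count, first_line, last_line, path_a, path_b) tuples, where the
    line numbers refer to the file in tree A.
    """
    starts = defaultdict(list)
    for (path_a, line_a), (path_b, _) in matches:
        starts[(path_a, path_b)].append(line_a)
    ranked = sorted(starts.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [
        (len(s), min(s), max(s) + window - 1, pa, pb)
        for (pa, pb), s in ranked
    ]
```

Calling `rank_files(matching_blocks(linux_tree, bsd_tree))` on two such dictionaries gives one row per pair of files, in the same spirit as the table above. Because each block is hashed exactly as written, blocks that differ only in whitespace or comments will not match in this sketch.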
"The beauty of this scheme is that anybody with System V code can
inform the Linux community about what is identical without revealing
any System V code. And this might actually be legal, since I do
not think that there are clauses in the contracts NOT to shred the
code and compare it with other code. Also, it is quite easy to stay
anonymous since the person who does the analysis need not reveal
him/herself in any way."
Karsten M. Self <firstname.lastname@example.org> http://kmself.home.netcom.com/
What Part of "Gestalt" don't you understand?
"Life," said Marvin, "don't talk to me about life."