[p2p-hackers] Generalizing BitTorrent..

Gregory P. Smith greg at electricrain.com
Sat Jan 15 08:02:54 UTC 2005


> 
>     The basic idea is best described with a real-world example.  There are a
> number of "Full-MAME" torrents, one for each version of MAME (for those who
> don't know, MAME is an arcade emulator and the torrents contain the ROMs for
> the arcade games).  As MAME is updated frequently, there is a string of
> these torrents on the net.  Each one is very large (10 GB), and contains 95%
> of the EXACT SAME data from the previous torrent.  The other 5% is a small
> amount of changes, and some new content.
> 
>     If a user is running the v0.7 torrent and has become a Seed, he serves
> ONLY the v0.7 peers.  When v0.8 is released, his Seed status is essentially
> useless to the peers in the v0.8 crowd, even though he has > 90% of the same
> data.  And again, when v0.9 is released, the same problem.  It seems like
> there should be an extension to the protocol to allow for this type of
> 'shared data' among torrents.

The flaw in this logic is that to aggregate common data across
different instances of content in a system, you need to be able to
locate and identify the common data portions.  In a typical tarball of
a new version of something where only 5% of the files have been
updated, -most- of the hashes of the fixed-size pieces are likely to
change; certainly -way- more than 5% anyway.  Why?  Because the common
data has shifted around or, in the case of compressed streams of data
(.tar.bz2), the entire stream will be different.  To get any benefit
from this the content would need to be extremely carefully packaged:
no zips, no tars, no compression, etc.  That alone could destroy the
benefit.
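
To make the shifting problem concrete, here is a rough sketch (not
BitTorrent's actual code).  The piece size, the random test data and
the helper names are all made up for illustration; the only thing
borrowed from the protocol is the idea of SHA-1 hashes over fixed-size
pieces.

```python
import hashlib
import random

PIECE_SIZE = 1024  # hypothetical; real torrents use much larger pieces


def piece_hashes(data, piece_size=PIECE_SIZE):
    """SHA-1 of each fixed-size piece, in order."""
    return [hashlib.sha1(data[i:i + piece_size]).digest()
            for i in range(0, len(data), piece_size)]


def shared_fraction(old, new):
    """Fraction of the new version's pieces that the old version already has."""
    old_set = set(piece_hashes(old))
    new_hashes = piece_hashes(new)
    return sum(h in old_set for h in new_hashes) / len(new_hashes)


rng = random.Random(0)
old = bytes(rng.getrandbits(8) for _ in range(64 * PIECE_SIZE))

# Case 1: the last 5% is overwritten in place, so piece alignment survives.
cut = len(old) * 95 // 100
overwritten = old[:cut] + bytes(rng.getrandbits(8)
                                for _ in range(len(old) - cut))

# Case 2: ten bytes are inserted near the front, so everything after shifts.
shifted = old[:100] + b"0123456789" + old[100:]

print("in-place change:", shared_fraction(old, overwritten))
print("small insertion:", shared_fraction(old, shifted))
```

The in-place case keeps most pieces reusable, while the ten-byte
insertion leaves essentially none, even though well over 99% of the
bytes are identical.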

Others have already mentioned an alternate general solution that
applies to -any- distribution method (the Linux kernel is distributed
this way): updates that share data should be published as binary diffs
against the previous version.  Downloading n+1 becomes a recursive
"download n and the n->n+1 diff" operation.
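
As a sketch of that recursion: the toy diff format below (copy ranges
from the old version plus literal new data) and the names make_diff,
apply_diff and fetch are my own inventions for illustration, not any
real patch format.  It leans on difflib from the Python standard
library, which is far too slow for multi-gigabyte data but shows the
shape of it.

```python
from difflib import SequenceMatcher


def make_diff(old, new):
    """Toy binary diff: a list of copy-from-old and literal-data ops."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse bytes from old
        elif j2 > j1:                           # "replace" or "insert"
            ops.append(("data", new[j1:j2]))    # ship the new bytes
        # "delete" needs no op: the old bytes are simply not copied
    return ops


def apply_diff(old, ops):
    """Rebuild the new version from the old version plus a diff."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def fetch(version, bases, diffs):
    """Get version n: use a full base if one exists, else n-1 plus a diff."""
    if version in bases:
        return bases[version]
    return apply_diff(fetch(version - 1, bases, diffs), diffs[version])


v1 = b"romA romB romC romD"
v2 = b"romA romB2 romC romD"
v3 = b"romA romB2 romC romD romE"
bases = {1: v1}
diffs = {2: make_diff(v1, v2), 3: make_diff(v2, v3)}
assert fetch(3, bases, diffs) == v3
```

A seed of v1 stays useful here: anyone who wants v3 still needs v1
plus the two small diffs.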

What you really want is for peers to integrate the diff knowledge so
that it doesn't need to be done manually, and so that the system
automatically decides when the size of base+sum(diffs against base)
warrants just issuing a new base for future diffs to start from.
(FWIW, some version control systems make many of the same decisions
about how they store versions of data internally.)
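
One way that decision could look, purely as an assumption on my part:
compare what a newcomer must download via the chain against a fresh
full copy, and rebase once the chain costs more than some slack
factor.  The function name and the 1.5 default are hypothetical.

```python
def should_issue_new_base(base_size, diff_sizes, new_full_size, slack=1.5):
    """True when base + all diffs costs more than slack * a fresh full copy."""
    chain_cost = base_size + sum(diff_sizes)
    return chain_cost > slack * new_full_size


# Sizes in arbitrary units: two small diffs are still worth chaining...
print(should_issue_new_base(100, [5, 5], 105))
# ...but three large ones make a new base the cheaper download.
print(should_issue_new_base(100, [30, 30, 30], 110))
```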

-greg



