distributed document popularity metrics using amortizable hashcash (Re: [p2p-hackers] BitTorrent measurements / fully decentralized systems)

Adam Back adam at cypherspace.org
Thu Dec 16 01:56:04 UTC 2004


On Wed, Dec 15, 2004 at 03:33:54PM -0800, David Barrett wrote:
> Are there any good examples of content-playback applications
> integrating with content-distribution systems to obtain implicit
> quality measures?
>
> I could imagine a decentralized, open protocol where every time I
> listen to a song (for example) WinAmp spits out a digitally signed
> "I played this song to completion, and here's its CRC" into the
> ether.  The sum total of these certificates is the implicit quality
> of the track.

I proposed a way to do this without a reputation system: amortizable
hashcash.  http://www.hashcash.org/papers/amortizable.pdf

It is designed specifically not to need one.

There are no certificates and each user essentially gets to vote on
the document popularity (through implicit or explicit action) in a
distributed fashion.  To vote, the user expends CPU time to create a
hashcash stamp tied to the document id.  (Eg a lower priority
background thread during playback for implicit voting; a background
thread run for some duration after the user marks the content as
quality for explicit voting.)
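
As a rough sketch (my illustration, not a spec -- the hash choice,
nonce size and time budget are placeholders):

    import hashlib, os, time

    def leading_zero_bits(digest: bytes) -> int:
        # stamp value = number of leading zero bits in the hash
        n = int.from_bytes(digest, "big")
        return 8 * len(digest) - n.bit_length()

    def mint_stamp(doc_id: str, seconds: float = 1.0) -> tuple[bytes, int]:
        """Spend a CPU budget brute-forcing nonces; keep the best stamp
        (most leading zero bits) tied to this document id."""
        best_nonce, best_bits = b"", 0
        deadline = time.monotonic() + seconds
        while time.monotonic() < deadline:
            nonce = os.urandom(8)
            bits = leading_zero_bits(hashlib.sha1(doc_id.encode() + nonce).digest())
            if bits > best_bits:
                best_nonce, best_bits = nonce, bits
        return best_nonce, best_bits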

Stamps have a special property: you can add them, and
probabilistically the resulting stamp is equal to the sum of the
input stamps.  (Adding the same stamp repeatedly has no effect.)  (I
call this probabilistic addition "amortization".)  The size of the
stamp does not grow.

Documents are requested from peers, and with the request the
requester sends his view of the document's popularity stamp.  The
server adds the requester's stamp to its own stamp, and sends the new
stamp in the response.
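
As a toy illustration of that exchange (names and message shapes are
mine; stamps here are reduced to their integer bit values and "added"
with the worst-case max rule described below):

    def add_stamps(x_bits: int, y_bits: int) -> int:
        # worst-case-error addition: keep the larger stamp (see below)
        return max(x_bits, y_bits)

    def serve_request(store: dict, doc_id: str, requester_stamp: int):
        """Handle a download request that piggybacks the requester's
        view of the document's popularity stamp."""
        doc_bytes, local_stamp = store[doc_id]
        merged = add_stamps(local_stamp, requester_stamp)
        store[doc_id] = (doc_bytes, merged)  # server keeps the new sum
        return doc_bytes, merged             # requester learns it from the reply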

So each peer sharing a document sums the stamps it sees, and each
peer that downloads a document passes on the stamps it sees from the
other peers it is swarming from.  So (for swarming) no additional
connections are needed for popularity metrics.  The only comms
overhead is the size of the stamp, which is relatively compact.


The exact size of the stamp depends on how you tune it -- you can
reduce the error introduced by the randomness by increasing the size.
The smallest stamp is going to be around 16 bytes, and ones with a
reasonably low error rate maybe 100 bytes.

What can cheaters do?  There are two types of cheating:

1. jammers trying to mark junk/mislabeled documents as high quality

   - We don't try to do anything about the jammer.  Best case the p2p
     net has significantly more CPU resources than the jammer.  Worst
     case the jammer is the RIAA or MPAA and they blow a bundle on
     server racks.  But generally the Kazaa net would be a pretty
     decent super-computer: 1 million online peers is a lot of compute
     if you're paying for it.

2. non-participants who want to download and view the file, but for
   some reason do not want to contribute to its popularity

   - this probably isn't a significant concern.  It can be combated to
     a limited extent as described in the paper; see "fair amortizable
     cost-function".

The paper, in hindsight, does not really explain the above in much
detail, though that was the purpose of it.  (The paper is mostly
crypto math showing how and why it works.)

But the actual stuff is pretty simple overall.  There is source code
for basic hashcash (non-amortizable) here:

	http://www.hashcash.org/

Amortizable hashcash (with worst-case error rate) "adds" hashcash
stamps x and y by retaining the larger: x + y = max(x,y).  (The
addition is roughly logarithmic, but you can adjust for that to get a
hits estimate.)
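
In code that's just (the 2**bits inversion is a back-of-envelope
estimator for a best-of-N geometric draw, not the exact correction
from the paper):

    def add_max(x_bits: int, y_bits: int) -> int:
        """x + y = max(x, y): idempotent, commutative, associative, so
        stamps can be summed in any order along any gossip path."""
        return max(x_bits, y_bits)

    def estimate_hits(max_bits: int) -> float:
        """After roughly N hash attempts the best stamp seen has about
        log2(N) leading zero bits, so invert: N ~= 2**max_bits."""
        return 2.0 ** max_bits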

For a lower error rate with, say, 16 stamps, you keep the largest 16
stamps you've seen: say X = (x1,...,x16) and Y = (y1,...,y16), then
X + Y = top16(sort(X,Y)) (ie take the biggest 16 of the combined
list).
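
A sketch of the lower-error variant; the estimator is a standard
k-minimum-values style order statistic and is my assumption (coarse,
because it only sees whole bits), not a formula from the paper:

    import heapq

    K = 16  # more retained stamps = lower error, bigger stamp

    Stamp = tuple[int, bytes]  # (leading zero bits, nonce), as minted above

    def add_topk(X: list[Stamp], Y: list[Stamp], k: int = K) -> list[Stamp]:
        """X + Y = top16(sort(X,Y)): keep the k largest distinct stamps
        from the combined list.  Re-adding a stamp you already hold is
        a no-op, so nothing gets double counted."""
        return heapq.nlargest(k, set(X) | set(Y))

    def estimate_hits_topk(sketch: list[Stamp], k: int = K) -> float:
        """Treat a stamp with b zero bits as a uniform draw u ~ 2**-b;
        with b_k the smallest retained value, total attempts is roughly
        (k - 1) * 2**b_k.  (Assumes a full sketch of k stamps.)"""
        b_k = min(bits for bits, _ in sketch)
        return (k - 1) * 2.0 ** b_k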


Higher level things:

A. you can have negative votes also, just define other properties --
rather than a quality vote, a negative vote saying a document is line
noise or mis-labelled.

B. you can vote on other things, eg peer aggregate contribution (in
terms of MB shared), performance, reliability, uptime

C. you could weight document popularity voting by the contribution
level of the voter (the RIAA has to share lots of good stuff to be
able to jam other stuff:-).

D. other than contribution, have a pseudonym that has a cost: say
that each user who joins the p2p net has to create a relatively
expensive stamp on their pseudonym.  Now when they publish stamps,
their stamps are ignored unless accompanied by a pseudonym stamp (a
sketch of that check follows item E below).

E. pseudonyms could be meta-voted on.  If a jammer creates a
pseudonym and starts to vote heavily (with rented super-computer time,
trying to compete with Kazaa users), his pseudonym gets negatively
voted on and he has to discard the pseudonym and create another one.
(Of course the jammer can go ahead and falsely vote against everyone
else's pseudonyms, but it's 1 against many, and they can all vote
against the jammer.)
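
A minimal sketch of the check from item D, assuming the pseudonym
stamp is just a nonce whose hash over the pseudonym clears a fixed
difficulty bar (the 24-bit figure is an arbitrary placeholder):

    import hashlib

    PSEUDONYM_BITS = 24  # assumed entry cost; tune to make pseudonyms expensive

    def valid_pseudonym_stamp(pseudonym: bytes, nonce: bytes) -> bool:
        """The pseudonym itself must carry an expensive stamp."""
        digest = hashlib.sha1(pseudonym + nonce).digest()
        n = int.from_bytes(digest, "big")
        return 8 * len(digest) - n.bit_length() >= PSEUDONYM_BITS

    def accept_vote(vote_stamp, pseudonym: bytes, nonce: bytes):
        """Ignore popularity stamps not accompanied by a pseudonym stamp."""
        return vote_stamp if valid_pseudonym_stamp(pseudonym, nonce) else None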


What about the CPU waste?

So one might think this leads to wasted CPU resources, but the
scarce resources a p2p net is trying to conserve are bandwidth and
human satisfaction (users getting what they expect).

I also proposed that in p2p systems that do caching (and this works
for pre-fetching too) the cache replacement policy be biased to favor
popular content.
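
For example (illustrative only -- assumes each cached document carries
a popularity estimate derived from its summed stamps, as above):

    def evict_least_popular(cache: dict, popularity: dict) -> str:
        """When the cache is full, evict the document with the lowest
        estimated popularity; everything else (recency, size) is left
        out of this sketch."""
        victim = min(cache, key=lambda doc_id: popularity.get(doc_id, 0.0))
        del cache[victim]
        return victim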

Adam

> Naturally there are the standard reputation system problems, but
> assuming a decent reputation system existed already, this would be one
> way to leverage it for implicit quality measures.
> 
> Is there anything like this already working or in the works?


