[p2p-hackers] How to solve the "Britney problem"?

Paul Campbell paul at ref.nmedia.net
Tue Dec 28 17:46:29 UTC 2004

There's a simple solution, in my opinion, to the "Britney Spears" problem
in the DHT model (randomized searches don't suffer from this problem, as a
consequence of their structural flaw: they handle common files well but
handle rare files poorly). Let's first recast your question into an even
worse situation.

Let's say that we directly implement an inverted index (keyword-->file) on
the DHT. Even with common-word elimination, this certainly produces a
severely non-linear index space. But it is an obvious and desirable thing
to do, since it also solves a major dilemma for both DHTs and randomized
P2P networks: how to quickly search for files by keyword.
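To make the inverted-index idea concrete, here's a minimal sketch of
mapping a file to one DHT key per keyword, with common-word elimination.
The function name, stopword list, and use of SHA-1 are my own choices for
illustration, not from any particular system:

```python
import hashlib

STOPWORDS = frozenset({"the", "a", "an", "of", "and"})

def keyword_keys(filename: str) -> dict[str, str]:
    """Map a file name to one DHT key per keyword: a toy inverted
    index (keyword -> file) with common-word elimination."""
    words = {w for w in filename.lower().replace(".", " ").split()
             if w not in STOPWORDS}
    return {w: hashlib.sha1(w.encode()).hexdigest() for w in words}

keys = keyword_keys("The Best of Britney.mp3")
```

Every keyword now hashes to its own point in the keyspace, which is
exactly why a popular keyword ("britney") overloads one small region.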

So far, most proponents of DHTs believe that hashing is the answer. While
hashing MASKS the problem, it doesn't actually solve it.

All solutions that I've seen appear to be pretty simple. The idea is to
spread the location of a particular key out over a wider arc. Conceptually,
one can spread it over multiple points (such as equidistant points around
the ring) or over multiple hierarchical rings (like the distributed sloppy
hash tables...DSHTs, although there it was done for delay purposes). But
the concept is essentially the same.
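The equidistant-points variant is easy to sketch. Assuming a Chord-style
160-bit ring (the function name and parameters are invented for the
example):

```python
import hashlib

RING_BITS = 160                 # SHA-1 keyspace, as in Chord-style DHTs
RING_SIZE = 2 ** RING_BITS

def replica_points(key: str, k: int = 4) -> list[int]:
    """Spread one key over k equidistant points around the ring.

    The base point is the key's hash; the other k-1 points are offset
    by equal fractions of the keyspace, widening a hot key's single
    point into a wider arc of responsible nodes."""
    base = int.from_bytes(hashlib.sha1(key.encode()).digest(), "big")
    step = RING_SIZE // k
    return [(base + i * step) % RING_SIZE for i in range(k)]

points = replica_points("britney spears", k=4)
```

A lookup can then contact whichever replica point is closest in routing
terms, so the load for a popular key is split k ways.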

The routing algorithm remains valid, since it incrementally searches for
the destination. It just remains to phrase a search as incrementally
refining an "area" to search in. As the search request passes through
candidate nodes, the search area gradually narrows.

Storage is slightly more complicated. It is necessary to store a key/value
pair in the appropriate AREA of the DHT: the storage targets the area and
not a point. The above search mechanism (which has to be used to find the
appropriate destination for a key/value pair anyway) also, paradoxically,
nails down the appropriate area for storage.
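One way to picture the narrowing is by hash prefixes: each extra bit of
the key's prefix halves the area of the ring under consideration, and
both a search and a store walk the same narrowing sequence. A toy sketch
on a 16-bit ring (all names are mine, for illustration only):

```python
def area_for_prefix(prefix_bits: str, ring_bits: int = 16) -> tuple[int, int]:
    """Return the [lo, hi) interval of a toy ring covered by a key's
    hash prefix. Extending the prefix by one bit halves the area, so
    forwarding a search (or a store) just means refining the prefix:
    the area narrows at every hop."""
    width = 1 << (ring_bits - len(prefix_bits))
    lo = int(prefix_bits, 2) * width if prefix_bits else 0
    return lo, lo + width

# Each hop refines the prefix and the area shrinks, nested in the last:
#   "1"   -> (32768, 65536)
#   "10"  -> (32768, 49152)
#   "101" -> (40960, 49152)
```

Since a store request follows the same narrowing path, it naturally ends
up at the area responsible for the key, not at a single point.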

thecircle (from your terminology, it appears that's what you've been
looking at) attaches a random bit string to the end of the hash to
accomplish this and then spreads a file over a small area using that random
hash (although a lot of this isn't actually implemented when you look at
the code). The various "tree" designs out there just use the hash keys
directly and split storage as necessary. I've seen a few designs that attempt
to simply delete common files from the DHT and index those using a random
P2P structure (thus taking advantage of both designs).

What I've noticed in general is this. In order for the routing properties
of a DHT to be preserved, it is necessary to maintain the node link
structure. But that's not to say that you have to maintain the same
structure at the key/value pair level of the system...that is, a direct
one-to-one mapping is NOT necessary. The "skip graph" structures
specifically acknowledge this. Simply substitute "key/value pair" for
"DNS name" and you convert those structures into the idea I'm referring to.

In a similar way, one could simply store "first key/last key" values in
addition to "node keys" and decouple the data structure from the DHT link
structure. This is similar to index tabs in a rolodex...no matter how many
cards are in the rolodex and no matter what the ordering, you maintain labels
and you can approximately locate a card using the same search mechanism all
the time.
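The rolodex analogy can be sketched directly: each node stores a sorted
run of keys but advertises only its first and last key, and lookup means
finding the node whose label interval covers the target. The class and
function names here are invented for the example:

```python
class RangeNode:
    """A node that stores a sorted run of keys and advertises only its
    first and last key, like the label tab on a rolodex divider."""

    def __init__(self, keys):
        self.keys = sorted(keys)

    @property
    def label(self):
        return (self.keys[0], self.keys[-1])

    def covers(self, key):
        first, last = self.label
        return first <= key <= last

def find_node(nodes, key):
    """Locate the node whose [first, last] label covers the key.
    Only the labels are consulted, never the cards themselves."""
    for node in nodes:
        if node.covers(key):
            return node
    return None
```

No matter how many key/value pairs a node holds, the search consults only
the "first key/last key" labels, so the data structure stays decoupled
from the link structure.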

As to the basic problem of "too many keys with the same value", this is
also pretty easy to deal with. Extend the key/value pairs so that for
certain operations (searching), you use just the key as before, but
optionally treat the keys as "key+value" meta-keys. For instance, when
storing a new key/value pair, use a "search-store" function that treats
the key as "key"+"value" and the value as just "value". Then no matter how
many identical keys there are, there is still a valid ordering, and the
number of keys with the same value is no longer a limiting factor. Of
course, if your system doesn't require a total ordering on all keys and
values (the node balancing algorithms more or less do require this, at
least right now), then you can dispense with the "key+value" idea and
simply ignore this issue.
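A minimal sketch of the "key+value" meta-key idea (function names are my
own; any real DHT would do this inside its store path):

```python
def store_key(key: str, value: str) -> tuple[str, str]:
    """For store operations, sort by the pair itself: 'key'+'value'
    gives a total order even when many entries share one key."""
    return (key, value)

entries = [("britney", "file-b"), ("britney", "file-a"), ("zappa", "file-z")]
store = sorted(entries, key=lambda kv: store_key(*kv))

def lookup(store, key):
    """Searches still use the key alone; duplicates sit in one
    contiguous run of the ordering."""
    return [v for k, v in store if k == key]
```

Storing keeps duplicates distinct and totally ordered, while searching by
key alone works exactly as before.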

The one remaining problem is node balancing. The mechanisms that have been
published seem to be pretty uniform in design. Nodes that undertake
rebalancing can do so using one of two mechanisms: either an empty node
(which may have to be created by dumping its current payload) is moved into
position on the ring, or else incremental data movement is undertaken. Since
the former incurs a much higher communication cost, the latter is given
preference as much as possible. However, it has been shown that both
mechanisms are necessary for load balance to remain bounded.
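The cheaper, incremental mechanism is simple to sketch: a heavily loaded
node shifts keys across its boundary to a lightly loaded neighbour until
the two loads are within one item. (The function name and the lists-as-
nodes representation are invented for the example.)

```python
def rebalance_pair(heavy, light):
    """Incremental rebalancing between adjacent nodes on the ring:
    shift keys from the heavy node's boundary to its lighter
    successor, one at a time, until the loads are within one item.
    Only boundary keys move, so communication cost stays low."""
    while len(heavy) > len(light) + 1:
        light.insert(0, heavy.pop())   # move the boundary key over
```

The other mechanism (repositioning an emptied node) transfers a whole
payload at once, which is why it's reserved for imbalances this local
shuffling can't fix.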
