[p2p-hackers] Paradigma Question: DHT's or Small World?

Serguei Osokine Serguei.Osokine at efi.com
Thu Feb 3 18:12:31 UTC 2005


On Thursday, February 03, 2005 Bryan Turner wrote:
> Given my arguments above, the total work performed by the "system"
> to achieve a query is roughly equivalent between the two models.

	Uh, looks to me that given your arguments above the models are
logically equivalent, which says nothing about whether the work is 
the same or not.

	In fact, I can easily imagine the situations where the load
would be orders of magnitude different for Gnutella and DHTs.

	Best wishes -
	S.Osokine.
	3 Feb 2005.

-----Original Message-----
From: p2p-hackers-bounces at zgp.org [mailto:p2p-hackers-bounces at zgp.org]On
Behalf Of Bryan Turner
Sent: Thursday, February 03, 2005 8:35 AM
To: 'Peer-to-peer development.'
Subject: RE: [p2p-hackers] Paradigma Question: DHT's or Small World? 


Regarding Small World vs DHT;

	Pedantically, there is no difference.. you can map a DHT to Small
World by viewing the domain of the DHT (it's keyspace) to be the semantic
information sought by the peers.  Thus, peers which seek nearby points in
the keyspace are linked by Small World links, while those which seek distant
points are only occasionally referenced.

	The difference that is being argued is what the USER is interested
in versus what the PEER is interested in.  In a DHT, the peer is required to
be interested in keys which  conform to some dynamic metric based on the
specific model of DHT being used, while there is no model for the user's
interests.

	I'm not arguing for or against Small World - simply that the models
are equally expressive and thus equally capable of implementing each other's
features.  Just something to keep in mind.

And to keep things on track:

	Gwendal, I like your explanation of a user's semantic profile, it's
very crisp and approachable.  It's been difficult to explain to colleagues
in the past, next time I'll use your words.  ;)

	In the following by "Gnutella", I mean "Gnutella-like systems".
Please do not be offended by my mis-representation of the specific features
supported by Gnutella.

	I see publishing between the two mediums in a different light.
While it seems simpler to publish under Gnutella, there are tradeoffs that
you haven't pointed out.  For instance, single-word queries and exact-file
searches are significantly more difficult under the Gnutella model exactly
because your query must reach all your 'friends' - and return from all of
them!  In effect you get worst-case performance for every query.

	DHTs achieve best-case performance for this type of query, but are
burdened by a more complex publishing process.

	I would also like to argue that full-text indexing on all documents
is equally difficult for *both* models.  My reasoning follows from the
processing requirements (in any model) to index/query a full document:
	1. Process a document to produce an index.
	2. Store the index for future retrieval.
	3. Provide query capability to a client.
	4. Discover relevant indexes to a query.
	5. Search the indexes for query terms.
	6. Return results

	It should be clear that #1, #3, and #6 are essentially the same
between the two models, as some entity must perform the same amount of work
for these steps regardless of how it is handled "under the covers".

	#2 differs only in the location where the index is stored - locally
or distributed.  And in the amount of work done (Gnutella;less, DHT;more).

	#4 differs again in the location and work, but here I argue the
amount of work has reversed from #2.  Gnutella requires *many* peers to
perform complex queries against their complex indexes, which constitutes a
great deal of work.  OtoH, a DHT implicitly knows which peers to address,
and which queries to perform (in fact, the very act of addressing a peer is
effectively performing the query).

	#5 again differs, although I argue that the total amount of work
performed is essentially the same.

	Given my arguments above, the total work performed by the "system"
to achieve a query is roughly equivalent between the two models.  There
isn't any one area in which one of the systems is burdened by an order of
magnitude over the other.

--Bryan
bryan.turner at pobox.com

-----Original Message-----
From: p2p-hackers-bounces at zgp.org [mailto:p2p-hackers-bounces at zgp.org] On
Behalf Of SIMON Gwendal RD-MAPS-ISS
Sent: Thursday, February 03, 2005 10:49 AM
To: Peer-to-peer development.
Subject: TR: [p2p-hackers] Paradigma Question: DHT's or Small World? 

Hi,

	Here are two assumptions that advocate for small-world.

	The first one, related to the human language, has been partially
established by several studies [1,2] since the pioneering work of [3]. The
graph of word interactions is constructed by linking two words when they
co-occur in a sentence (a fortiori in a file). The study of the properties
of these graphs shows they exhibit the small world effect and a scale-free
distribution of degrees.

	The second assumption follows the observations you cite and some
others [4,5,6]. The data-sharing graph is constructed by linking two users
when they share a same file. Observations on several real traces show that
this graph exhibits also the small-world effect and the scale-free
distribution of degrees.

	Besides, it is known that the lexicon of an human contains few
thousands of words. This lexicon and the words contained in the documents
which have been produced and dowloaded by an user define her "semantic
profile". Through the preceeding assumptions, we naturally infer that the
graph generated by linking users when their semantic profile overlap is also
small-world and scale-free.

	That is, if we consider that users emit requests on keywords chosen
within their profile, we can expect that almost *all* files of interest for
an user are stored by a small set of "friends". Moreover, these "friends"
are already known by the user thanks to previous successfull queries.

	Therefore, it is possible to limit the search to a subspace of the
information space without preventing the quality of responses. On the
contrary, it is probable that these responses are more relevant for the
requester point of view. For instance, a fan of "Fiona Apple" will discover
mp3 of Fiona Apple and not informations on Apple Inc. or webpages for "apple
pie" cooking. Or, an European querying informations on "football" will not
receive pages on NFL.

	By the way, another related concern is the publication of a file.
	In a gnutella-like systems, peers just have to put their files in
their "shared directory" in order to make them available by any node in the
system.
	On the contrary, the task of publication in a DHT-based overlay
requires to reach as many peers as the number of words describing the
published document. Indeed, the published file has to be known by the peers
that are responsible of all the *relevant* words of the document. This is
clearly an issue for keyword-based search in DHTs. If you want to design a
search engine indexing *all* words in the document, this task becomes
unrealistic.

--------------------
Gwendal Simon
France Telecom R&D
http://solipsis.netofpeers.net


[1] D. Watts. Six Degrees.
[2] A. Barabasi. Linked: the New Science of Networks.
[3] R. Ferrer i Canco and R. Sole. The Small World of Human Language.
[4] J. Keller, D. Stern and F. Dang Ngoc. MAAY: A Self-Adaptive Peer Network
for Efficient Document Search.
[5] V. Cholvi, P. Felber, and E.W. Biersack. Efficient Search in
Unstructured Peer-to-Peer Networks.
[6] Adriana Iamnitchi, Matei Ripeanu and Ian Foster, Small-World
File-Sharing Communities.

_______________________________________________
p2p-hackers mailing list
p2p-hackers at zgp.org
http://zgp.org/mailman/listinfo/p2p-hackers
_______________________________________________
Here is a web page listing P2P Conferences:
http://www.neurogrid.net/twiki/bin/view/Main/PeerToPeerConferences



More information about the P2p-hackers mailing list