Dijjer and Freenet (RE: [p2p-hackers] clustering)
Serguei Osokine
osokin at osokin.com
Sun Mar 12 03:08:38 UTC 2006
On Tuesday, March 07, 2006 Ian Clarke wrote:
> I think Chapters 1 and 2 of Oskar Sandberg's recent thesis may be
> extremely relevant:
>
> http://www.math.chalmers.se/~ossa/lic.pdf
>
> It outlines an algorithm to choose which nodes in a network should
> have edges between them such that "greedy" routing (routing to the
> peer closest to what you are looking for) has small world O( log^2
> (N) ) path lengths.
Ian, Oskar, that was wonderful. I rarely see anything that has
that much aesthetic value, but the analyitical proof that the graph
rewiring caused by the random search actions actually makes it a
small-world graph with log^2(N) path length, was certainly worth the
one-year wait since last April, which is when Ian has preannounced
this paper.
I have to apologize in advance for my questions - some of them
might be downright stupid (I won't even pretend that I followed the
logic of all your proofs), and in any case there are many of them.
Perhaps, too many. Feel free to answer any subset that does not seem
stupid to you (and as a special case, of course, just ignore all of
them :-)
But I waited for this paper for a year, and as you might imagine,
I had time to ponder quite a few issues... So here we go:
1. What is the difference between Freenet[1] and Sandberg/Dijjer[2]
models that makes Dijjer easier to analyze? This is mentioned a few
times in [2] and in other Freenet- and Dijjer- related places, but
I did not notice the explicit explanation of the differences
anywhere. Was it the directed links in the graph in [2]? Something
else? I honestly tried to figure this out by myself, but couldn't;
sorry... Maybe if I'd spend more time looking at both [1] and [2],
I'd figure this out, but since you guys are already here, I thought
that maybe you could answer that?
2. Whatever this difference is, how relevant do you think are the
Sandberg/Dijjer results[2] for the original Freenet paper[1]?
For example:
a. section 5.2 of [1] says about Fig. 3: "We can see that the
pathlength scales approximately logarithmically, with a change
of slope near 40,000 nodes."
Given what you guys know now, do you think that Fig. 3 shows
log^2(N) instead, or Freenet simply has different scaling
properties due to its different algorithms?
b. Fig. 5 shows the link number distribution that awfully
resembles the power-law one, whereas Oskar mentioned in
another mail that Kleinberg is constant out-degree and
Poisson in-degree. Is my assumption that Sandberg/Dijjer
model is supposed to mimic the Kleinberg distibution
incorrect and in fact both Sandberg/Dijjer and Freenet
are power-law? Or is just one of them power law due to
the different algorithms used? Or maybe Fig. 5 in [1] is
actually showing a part of the Poisson distribution that
can be easily mistaken for the power law?
c. Section 5.4 of [1] says: "In a small-world network, the
majority of nodes have only relatively few, local, connections
to other nodes, while a small number of nodes have large,
wide-ranging sets of connections. Small-world networks permit
efficient short paths between arbitrary points because of the
shortcuts provided by the well-connected nodes..."
This does not sound like anything I've been able to see in the
Oskar's thesis[2] - it has left me with a distinct impression
that the connectivity in the small world (at least in the one
analyzed there) was provided by the exactly right (Kleinberg's)
proportion of long-distance links, which Sandberg/Dijjer
rewiring model was designed to achieve, and not by the means
of any special nodes with "wide-ranging sets of connections".
Is it because Freenet and Dijjer used different algorithms, or
the original assumption about the reasons for the Freenet
connectivity was wrong?
Or is it simply that I'm misreading the Freenet paper[1] and
it did not mean to imply that small world properties of Freenet
are due to the small subset of well connected nodes? Maybe my
understanding of Sandberg/Dijjer[2] is wrong?
3. Why would one want to use the system with O(log^2(N)) path length
when pretty much every DHT gives a path lenght of O(log(N))? At
today's P2P nets scale (millions of nodes) the difference between
log and log^2 is non-trivial. I understand that creating the small-
world network by a simple and natural evolutionary process is very
elegant, simple, and attractive, but is this elegance worth the
increased path length? Why not simply use a DHT?
4. Speaking of DHTs: Oskar recently said something curious here that
I did not understand: "Actually, while the frequency of Chord links
falls exponentially with the "level" (not sure what the Chord term
is) the length of such links increases exponentially as well, so in
fact the frequency of links with certain lengths do fall
harmonically. One could see Chord as some sort of "mean field"
version of the same dynamics as Kleinberg's model."
What exactly does this mean? DHTs (including Chord) have log(N)
diameter, and Kleinberg has log^2(N), right? This looks fairly
different to me, and I totally missed the meaning of the "mean
field", and why Chord and Kleinberg would have the same dynamics;
I did not even understand what is this dynamics that you're talking
about. Oskar, could you please elaborate on this?
5. Throughout this mail I've been using the expresson Sandberg/Dijjer
to denote both what is described in [2] and Dijjer, as if it is the
same system; is this a right thing to do, or there are significant
differences between these two? What about these directed links, for
example? Does Dijjer use them? Does this matter?
6. Both the simulation in Freenet paper[1] and Sandberg/Dijjer paper[2]
use the random requests to rewire the networks. In reality, there
will be all kinds of hotspots and uneven distributions of requests
due to the different popularity of content. Do you guys think it
would affect the results of analysis performed in [2], and if yes,
how? For example, if one particular file is extremely popular, one
could imagine that its storage node[s] will have many more links
than the average, and the search for the other (rarely requested)
data items might become ineffective, because these path lengths will
become very high.
Do you think it is a valid concern? Are there any results that would
show that this is not an issue?
7. While Sandberg [2] assumes that the requests are passing between
the random points, both Freenet and Dijjer seem to allow the length
of the path to be shortened when the requested content is already
cached on the intermediate node. If this is really so, what is the
effect that will have on Oskar's analysis results? Fewer links will
be rewired, so this should have some effect on the graph dynamics,
right? Any idea what it would be?
8. The answer to '7' above might also depend on the cache size and
replacement policies on the nodes, and on the content popularity
distribution (which I already mentioned in '6' above). All these
factors are hard to predict for the real system in advance, and I'm
wondering what is the feeling that you guys have about the overall
stability of the rewiring algorithm?
I mean, the original Kleinberg d(x, y)^(-1) approach is extremely
sensitive to the value of this "-1". Any other value, and you either
do not have enough long distance links, or you do not have enough
short-range links once you arrive into the general vicinity of your
destination. In any case, your path length becomes polynomial instead
of polylogarithmic once you have the smallest deviation from this
"-1" number.
And this makes me a bit uneasy; I do understand that the catastrophic
failure is unlikely, and the detrimental effect of all these changes
(if any) is likely to be tolerable, but still... Look at it this
way: Chapter 1 of Oskar's thesis is dedicated to proving that
"...family given by (1.1) allows for polylogarithmic routing at,
and only at, one value of the [alpha]...", then Chapter 2 proves
that this is exactly what happens with random-request rewiring, and
then Dijjer's practical implementation and deployment arrive and
throw these random requests out of the window with their caching,
different content popularity, and whatnot. So given the strict
requirements that were proven to be vital in Chapter 1, I cannot
help wondering: is Dijjer still polylogarithmic, or not?
I'd feel much better about all this if there would be some results
showing that the system is stable in the presence of such changes
(in a sense that reasonably small algorithm input changes should not
cause any catastrophically large changes in the algorithm results).
Do you have any results that would point in that direction? Would
you say that the data displayed on Figures 2.4-2.6 already shows
that your results are not exactly Kleinberg, but this does not
cause any catastrophic consequences, or you also have some other
sources of optimism?
9. And finally, if I'm not mistaken, Dijjer description seems to imply
that the links are rewired after every search, whereas Oskar suggests
that this should be happen only with proability p, and in section 2.3
says: "...there are simple heuristic arguments for why p should
reasonably be on the order of one over the expected length of the
greedy walks". Any particular reason for that Dijjer behaviour, or
I've missed something and Dijjer does, in fact, rewire with this
probability p? (Would be interesting to hear these simple heuristic
arguments too, but that's another story...)
[1] Freenet: A Distributed Anonymous Information Storage and Retrieval
System. Ian Clarke1, Oskar Sandberg2, Brandon Wiley3, and Theodore
W. Hong.
http://www.cl.cam.ac.uk/~twh25/academic/papers/icsi-revised.pdf
[2] Searching in a Small World. Oskar Sandberg.
http://www.math.chalmers.se/~ossa/lic.pdf
Best wishes -
S.Osokine.
11 Mar 2006.
-----Original Message-----
From: p2p-hackers-bounces at zgp.org [mailto:p2p-hackers-bounces at zgp.org]On
Behalf Of Ian Clarke
Sent: Tuesday, March 07, 2006 5:03 PM
To: Peer-to-peer development.
Subject: Re: [p2p-hackers] clustering
I think Chapters 1 and 2 of Oskar Sandberg's recent thesis may be
extremely relevant:
http://www.math.chalmers.se/~ossa/lic.pdf
It outlines an algorithm to choose which nodes in a network should
have edges between them such that "greedy" routing (routing to the
peer closest to what you are looking for) has small world O( log^2
(N) ) path lengths.
The algorithm is very simple, and pleasingly "natural". Basically
you use greedy routing to find a path to your intended destination
node, then you add a link from each node along this path to the
destination. The number of outbound edges from each node can't
obviously increase indefinitely, so they are deleted according to a
least recently used scheme to make room for new edges.
This is based on Freenet's approach to edge selection. The paper
describes both experimental results that confirm that this algorithm
does lead to a small world topology, and also delves into the
mathematics behind why this might be the case.
It is interesting not just because of the simplicity of the
algorithm, but because it is extremely amenable to the construction
and maintainence of decentralized P2P networks, and because it could
tell us something about why human relationships tend to form small
world networks.
This algorithm is used by the Dijjer (http://dijjer.org/) P2P network.
Ian.
On 7 Mar 2006, at 02:46, Jeff Rose wrote:
> Anyone out there doing work in clustering? I'd be interested in
> hearing what people are up to, or if you have pointers to good
> papers that would be great too. At this point I'm interested in
> any angle, geographic distance based, semantic distance based,
> hierarchical, non-hierarchical etc... My goal is to work towards a
> very amorphous clustering scheme that allows any kind of object to
> be located close to relatives in the network as long as they share
> a common distance function.
_______________________________________________
p2p-hackers mailing list
p2p-hackers at zgp.org
http://zgp.org/mailman/listinfo/p2p-hackers
_______________________________________________
Here is a web page listing P2P Conferences:
http://www.neurogrid.net/twiki/bin/view/Main/PeerToPeerConferences
More information about the P2p-hackers
mailing list