Dijjer and Freenet (RE: [p2p-hackers] clustering)
Serguei Osokine
osokin at osokin.com
Sun Mar 12 22:12:15 UTC 2006
Oskar, thank you for a quick and thorough reply!
> In fact, I haven't even managed to show that the class of
> distributions for which I have a bound - those which are solutions
> to the system implied by equations (2.3) and (2.4) in the text - is
> nonempty. Until this is done the theorem as it stands is interesting,
> but from a strict mathematical viewpoint quite meaningless. (Which is
> why I have held off on publication).
Damn... I knew that I should've spent more time following the
chain of proofs... I thought that you did prove your rewiring method
to converge to the balanced distribution in terms of Theorem 2.2 with
some margin of error due to the dynamic nature of the rewiring, and
that the differences between solid and dotted lines in Figs. 2.4-2.6
were basically the result of this dynamic error, especially since the
solid line on Fig. 2.6 seemed to have the clearly visible simulation
artifacts... :-)
> Also, the log^2 is with constant routing table size, if you scale up
> the routing table with the size of the network, you can roughly divide
> the order by the routing table size (most log n DHTs scale the routing
> table with log n as well).
> ...
> Again, the routing tables grow in Chord. If each node had one "chord"
> link (every log n th node having a chord of the same length), then
> Chord would be roughly log^2 as well.
Ah, great points. So let me get this straight - what you're
saying is that Chord and other similar DHTs (Kademlia, etc) all have
basically the Kleinberg distribution, correct? They have same average
number of shortcuts covering every shortcut length interval with ~2^k
order of magnitude size and ~2^k order of magnitude average shortcut
length, so the probability of having the shortcut with length 2^k -
P(2^k) is inversely proportional to the interval size 2^k, so what we
have is P[2^k] ~ 1/(2^k), or P[x] ~ 1/x.
So the apparent log(N) Chord path length is due only to log(N)
links from every node, and if say, Dijjer would have log(N) links per
node, its path length would also scale as long(N).
Very interesting - if I really got all this correctly. It seems
obvious in retrospect, but I, for one, never thought about it this way.
In fact, I could not fully understand your point before doing some back
of the envelope (well, back of your thesis, actually :-) calculations
while writing this mail. Judging from the previous discussion, I won't
be the only one for whom this chain of reasoning will be a surprise -
thank you for taking time to lay it out in the open!
By the way, as a practical matter, does Dijjer grow its routing
tables with network size? Shouldn't be a problem to scale these tables
as log(N) and have the scalability similar to Chord and other DHTs. If
you do, you might want to mention it somewhere - I'm fairly sure that
I'm not the single person dense enough to miss the link between log(N)
path length and log(N) links per node and to be left with an impression
that Chord scales better than Dijjer as a result.
> It won't matter so much if the routing table is large (remember the
> mathematical model has one long link) though I would recommend that
> Dijjer updated less often.
If the Dijjer routing table indeed has log(N) size and you replace
only one entry in it, doesn't it already mean that you are updating with
the probability of "...order of one over the expected length of the
greedy walks", which length also becomes log(N) in this case? Or your
recommendation is based on some other considerations?
Best wishes -
S.Osokine.
12 Mar 2006.
-----Original Message-----
From: Oskar Sandberg [mailto:ossa at math.chalmers.se]
Sent: Sunday, March 12, 2006 8:18 AM
To: osokin at osokin.com; Peer-to-peer development.
Subject: Re: Dijjer and Freenet (RE: [p2p-hackers] clustering)
Serguei Osokine wrote:
> Ian, Oskar, that was wonderful. I rarely see anything that has
> that much aesthetic value, but the analyitical proof that the graph
> rewiring caused by the random search actions actually makes it a
> small-world graph with log^2(N) path length, was certainly worth the
> one-year wait since last April, which is when Ian has preannounced
> this paper.
It should be noted that the analytic proof in fact does not established
this. I did come up with the proof that appears in the thesis last
April, and I have since (with little success) attempted to generalize it
so it holds for the rewiring algorithm. The problem is that the proof,
like the Kleinberg proofs, assumes that edges are chosen independently
at each point, but in fact using our algorithm they will depend on one
another (in a good way, but that is hard to prove).
In fact, I haven't even managed to show that the class of distributions
for which I have a bound - those which are solutions to the system
implied by equations (2.3) and (2.4) in the text - is nonempty. Until
this is done the theorem as it stands is interesting, but from a strict
mathematical viewpoint quite meaningless. (Which is why I have held off
on publication).
<>
> 1. What is the difference between Freenet[1] and Sandberg/Dijjer[2]
> models that makes Dijjer easier to analyze? This is mentioned a few
> times in [2] and in other Freenet- and Dijjer- related places, but
> I did not notice the explicit explanation of the differences
> anywhere. Was it the directed links in the graph in [2]? Something
> else? I honestly tried to figure this out by myself, but couldn't;
> sorry... Maybe if I'd spend more time looking at both [1] and [2],
> I'd figure this out, but since you guys are already here, I thought
> that maybe you could answer that?
What makes Freenet (by which I mean the old Freenet routing, not the one
they are using now) much more difficult to analyze is that the data
rather than the nodes which have addresses. Data bounces around the
network dynamically, and so the markov chain implied becomes a lot more
complicated. For one thing, Freenet suffers from load balancing issues,
while the point to point routing cannot have such issues.
> 2. Whatever this difference is, how relevant do you think are the
> Sandberg/Dijjer results[2] for the original Freenet paper[1]?
>
> For example:
>
> a. section 5.2 of [1] says about Fig. 3: "We can see that the
> pathlength scales approximately logarithmically, with a change
> of slope near 40,000 nodes."
>
> Given what you guys know now, do you think that Fig. 3 shows
> log^2(N) instead, or Freenet simply has different scaling
> properties due to its different algorithms?
Paths in Freenet are affected by other things than just point to point
routing. Caching for instance will play a big role. I also don't
remember exactly how we were scaling the routing table size in those
simulations (the log^2 bound is for a constant routing table size).
I'm guessing that a lot of what was seen in those simulation were
artifacts of other factors.
> b. Fig. 5 shows the link number distribution that awfully
> resembles the power-law one, whereas Oskar mentioned in
> another mail that Kleinberg is constant out-degree and
> Poisson in-degree. Is my assumption that Sandberg/Dijjer
> model is supposed to mimic the Kleinberg distibution
> incorrect and in fact both Sandberg/Dijjer and Freenet
> are power-law? Or is just one of them power law due to
> the different algorithms used? Or maybe Fig. 5 in [1] is
> actually showing a part of the Poisson distribution that
> can be easily mistaken for the power law?
The model presented in the paper has fixed out-degree and (roughly)
poisson in-degree like Kleinberg's. Freenet has a fixed out-degree but
possibly a power-law in-degree, which is the cause of the aforementioned
load balancing issues (I don't have any good evidence for this, just the
feeling that their is some sort of preferential attachment going on).
> c. Section 5.4 of [1] says: "In a small-world network, the
> majority of nodes have only relatively few, local, connections
> to other nodes, while a small number of nodes have large,
> wide-ranging sets of connections. Small-world networks permit
> efficient short paths between arbitrary points because of the
> shortcuts provided by the well-connected nodes..."
>
> This does not sound like anything I've been able to see in the
> Oskar's thesis[2] - it has left me with a distinct impression
> that the connectivity in the small world (at least in the one
> analyzed there) was provided by the exactly right (Kleinberg's)
> proportion of long-distance links, which Sandberg/Dijjer
> rewiring model was designed to achieve, and not by the means
> of any special nodes with "wide-ranging sets of connections".
> Is it because Freenet and Dijjer used different algorithms, or
> the original assumption about the reasons for the Freenet
> connectivity was wrong?
The original paper is based on less sophisticated small-world models
than the later stuff. I think most of the small-world references in
there are hand-waving really.
If you place an arbitrary point somewhere where you say that connections
shorter than that are local and others are long distance, then
Kleinberg's model (and ours) will reflect the somewhat fuzzy statement
you quote above.
> 3. Why would one want to use the system with O(log^2(N)) path length
> when pretty much every DHT gives a path lenght of O(log(N))? At
> today's P2P nets scale (millions of nodes) the difference between
> log and log^2 is non-trivial. I understand that creating the small-
> world network by a simple and natural evolutionary process is very
> elegant, simple, and attractive, but is this elegance worth the
> increased path length? Why not simply use a DHT?
In real life, the order is much less interesting then the actual route
length. One shouldn't stair blindly at the order when the constants
could be dominating in real world situations. Also, the log^2 is with
constant routing table size, if you scale up the routing table with the
size of the network, you can roughly divide the order by the routing
table size (most log n DHTs scale the routing table with log n as well).
> 4. Speaking of DHTs: Oskar recently said something curious here that
> I did not understand: "Actually, while the frequency of Chord links
> falls exponentially with the "level" (not sure what the Chord term
> is) the length of such links increases exponentially as well, so in
> fact the frequency of links with certain lengths do fall
> harmonically. One could see Chord as some sort of "mean field"
> version of the same dynamics as Kleinberg's model."
>
> What exactly does this mean? DHTs (including Chord) have log(N)
> diameter, and Kleinberg has log^2(N), right? This looks fairly
> different to me, and I totally missed the meaning of the "mean
> field", and why Chord and Kleinberg would have the same dynamics;
> I did not even understand what is this dynamics that you're talking
> about. Oskar, could you please elaborate on this?
Again, the routing tables grow in Chord. If each node had one "chord"
link (every log n th node having a chord of the same length), then Chord
would be roughly log^2 as well.
<>
> 6. Both the simulation in Freenet paper[1] and Sandberg/Dijjer paper[2]
> use the random requests to rewire the networks. In reality, there
> will be all kinds of hotspots and uneven distributions of requests
> due to the different popularity of content. Do you guys think it
> would affect the results of analysis performed in [2], and if yes,
> how? For example, if one particular file is extremely popular, one
> could imagine that its storage node[s] will have many more links
> than the average, and the search for the other (rarely requested)
> data items might become ineffective, because these path lengths will
> become very high.
I would imagine that the large number of files stored per node, and the
random distribution implied by using a secure hash would help even out
distribution. I'm not sure what an uneven distribution does under our
algorithm, but intuitively it should handle it just fine (it would mean
more links to the destination of popular queries, but that is a good
idea since such links are popular...)
> Do you think it is a valid concern? Are there any results that would
> show that this is not an issue?
Yes, and no (besides standard concentration results showing that it
should even out if there are sufficiently many documents per node).
> 7. While Sandberg [2] assumes that the requests are passing between
> the random points, both Freenet and Dijjer seem to allow the length
> of the path to be shortened when the requested content is already
> cached on the intermediate node. If this is really so, what is the
> effect that will have on Oskar's analysis results? Fewer links will
> be rewired, so this should have some effect on the graph dynamics,
> right? Any idea what it would be?
This is part of the answer to your first question about why these
situations are harder to analyze.
> 8. The answer to '7' above might also depend on the cache size and
> replacement policies on the nodes, and on the content popularity
> distribution (which I already mentioned in '6' above). All these
> factors are hard to predict for the real system in advance, and I'm
> wondering what is the feeling that you guys have about the overall
> stability of the rewiring algorithm?
No idea.
> I mean, the original Kleinberg d(x, y)^(-1) approach is extremely
> sensitive to the value of this "-1". Any other value, and you either
> do not have enough long distance links, or you do not have enough
> short-range links once you arrive into the general vicinity of your
> destination. In any case, your path length becomes polynomial instead
> of polylogarithmic once you have the smallest deviation from this
> "-1" number.
It isn't actually that sensitive. It is sensitive asymptotically, but
for any given network size varying the exponent a little will make
rather little difference. In fact, one can calculate numerically (I know
of no proof) that there are better exponents than -1 for any given n
(though the best exponent approaches 1 as the network grows towards
infinity).
> And this makes me a bit uneasy; I do understand that the catastrophic
> failure is unlikely, and the detrimental effect of all these changes
> (if any) is likely to be tolerable, but still... Look at it this
> way: Chapter 1 of Oskar's thesis is dedicated to proving that
> "...family given by (1.1) allows for polylogarithmic routing at,
> and only at, one value of the [alpha]...", then Chapter 2 proves
> that this is exactly what happens with random-request rewiring, and
> then Dijjer's practical implementation and deployment arrive and
> throw these random requests out of the window with their caching,
> different content popularity, and whatnot. So given the strict
> requirements that were proven to be vital in Chapter 1, I cannot
> help wondering: is Dijjer still polylogarithmic, or not?
It is certainly not proved.
> I'd feel much better about all this if there would be some results
> showing that the system is stable in the presence of such changes
> (in a sense that reasonably small algorithm input changes should not
> cause any catastrophically large changes in the algorithm results).
> Do you have any results that would point in that direction? Would
> you say that the data displayed on Figures 2.4-2.6 already shows
> that your results are not exactly Kleinberg, but this does not
> cause any catastrophic consequences, or you also have some other
> sources of optimism?
I would feel better about it too if more were proven. I am optimistic
that things actually do work, but I am not very optimistic about how
much I will ever be able to show (and it isn't just me, though a better
mathematician may of course have gotten further, I have put these
problems in front of a many good researchers who seem to agree they are
difficult problems).
> 9. And finally, if I'm not mistaken, Dijjer description seems to imply
> that the links are rewired after every search, whereas Oskar suggests
> that this should be happen only with proability p, and in section 2.3
> says: "...there are simple heuristic arguments for why p should
> reasonably be on the order of one over the expected length of the
> greedy walks". Any particular reason for that Dijjer behaviour, or
> I've missed something and Dijjer does, in fact, rewire with this
> probability p? (Would be interesting to hear these simple heuristic
> arguments too, but that's another story...)
It won't matter so much if the routing table is large (remember the
mathematical model has one long link) though I would recommend that
Dijjer updated less often.
// oskar
More information about the P2p-hackers
mailing list