Dijjer and Freenet (RE: [p2p-hackers] clustering)

Oskar Sandberg ossa at math.chalmers.se
Sun Mar 12 16:18:28 UTC 2006

Serguei Osokine wrote:
> 	Ian, Oskar, that was wonderful. I rarely see anything that has
> that much aesthetic value, but the analyitical proof that the graph
> rewiring caused by the random search actions actually makes it a 
> small-world graph with log^2(N) path length, was certainly worth the
> one-year wait since last April, which is when Ian has preannounced 
> this paper.

It should be noted that the analytic proof in fact does not established 
this. I did come up with the proof that appears in the thesis last 
April, and I have since (with little success) attempted to generalize it 
so it holds for the rewiring algorithm. The problem is that the proof, 
like the Kleinberg proofs, assumes that edges are chosen independently 
at each point, but in fact using our algorithm they will depend on one 
another (in a good way, but that is hard to prove).

In fact, I haven't even managed to show that the class of distributions 
for which I have a bound - those which are solutions to the system 
implied by equations (2.3) and (2.4) in the text - is nonempty. Until 
this is done the theorem as it stands is interesting, but from a strict 
mathematical viewpoint quite meaningless. (Which is why I have held off 
on publication).

> 1. What is the difference between Freenet[1] and Sandberg/Dijjer[2]
>    models that makes Dijjer easier to analyze? This is mentioned a few
>    times in [2] and in other Freenet- and Dijjer- related places, but
>    I did not notice the explicit explanation of the differences
>    anywhere. Was it the directed links in the graph in [2]? Something
>    else? I honestly tried to figure this out by myself, but couldn't; 
>    sorry... Maybe if I'd spend more time looking at both [1] and [2],
>    I'd figure this out, but since you guys are already here, I thought
>    that maybe you could answer that? 

What makes Freenet (by which I mean the old Freenet routing, not the one 
they are using now) much more difficult to analyze is that the data 
rather than the nodes which have addresses. Data bounces around the 
network dynamically, and so the markov chain implied becomes a lot more 
complicated. For one thing, Freenet suffers from load balancing issues, 
while the point to point routing cannot have such issues.

> 2. Whatever this difference is, how relevant do you think are the
>    Sandberg/Dijjer results[2] for the original Freenet paper[1]?
>    For example:
> 	a. section 5.2 of [1] says about Fig. 3: "We can see that the
> 	   pathlength scales approximately logarithmically, with a change
> 	   of slope near 40,000 nodes."
> 	   Given what you guys know now, do you think that Fig. 3 shows
> 	   log^2(N) instead, or Freenet simply has different scaling
> 	   properties due to its different algorithms?

Paths in Freenet are affected by other things than just point to point 
routing. Caching for instance will play a big role. I also don't 
remember exactly how we were scaling the routing table size in those 
simulations (the log^2 bound is for a constant routing table size).

I'm guessing that a lot of what was seen in those simulation were 
artifacts of other factors.

> 	b. Fig. 5 shows the link number distribution that awfully
> 	   resembles the power-law one, whereas Oskar mentioned in
> 	   another mail that Kleinberg is constant out-degree and 
> 	   Poisson in-degree. Is my assumption that Sandberg/Dijjer 
> 	   model is supposed to mimic the Kleinberg distibution
> 	   incorrect and in fact both Sandberg/Dijjer and Freenet
> 	   are power-law? Or is just one of them power law due to
> 	   the different algorithms used? Or maybe Fig. 5 in [1] is
> 	   actually showing a part of the Poisson distribution that 
> 	   can be easily mistaken for the power law?

The model presented in the paper has fixed out-degree and (roughly) 
poisson in-degree like Kleinberg's. Freenet has a fixed out-degree but 
possibly a power-law in-degree, which is the cause of the aforementioned 
load balancing issues (I don't have any good evidence for this, just the 
feeling that their is some sort of preferential attachment going on).

> 	c. Section 5.4 of [1] says: "In a small-world network, the
> 	   majority of nodes have only relatively few, local, connections
> 	   to other nodes, while a small number of nodes have large, 
> 	   wide-ranging sets of connections. Small-world networks permit
> 	   efficient short paths between arbitrary points because of the
> 	   shortcuts provided by the well-connected nodes..."
> 	   This does not sound like anything I've been able to see in the
> 	   Oskar's thesis[2] - it has left me with a distinct impression
> 	   that the connectivity in the small world (at least in the one
> 	   analyzed there) was provided by the exactly right (Kleinberg's)
> 	   proportion of long-distance links, which Sandberg/Dijjer
> 	   rewiring model was designed to achieve, and not by the means 
> 	   of any special nodes with "wide-ranging sets of connections".
> 	   Is it because Freenet and Dijjer used different algorithms, or
> 	   the original assumption about the reasons for the Freenet
> 	   connectivity was wrong? 

The original paper is based on less sophisticated small-world models 
than the later stuff. I think most of the small-world references in 
there are hand-waving really.

If you place an arbitrary point somewhere where you say that connections 
shorter than that are local and others are long distance, then 
Kleinberg's model (and ours) will reflect the somewhat fuzzy statement 
you quote above.

> 3. Why would one want to use the system with O(log^2(N)) path length
>    when pretty much every DHT gives a path lenght of O(log(N))? At
>    today's P2P nets scale (millions of nodes) the difference between
>    log and log^2 is non-trivial. I understand that creating the small-
>    world network by a simple and natural evolutionary process is very
>    elegant, simple, and attractive, but is this elegance worth the 
>    increased path length? Why not simply use a DHT?

In real life, the order is much less interesting then the actual route 
length. One shouldn't stair blindly at the order when the constants 
could be dominating in real world situations. Also, the log^2 is with 
constant routing table size, if you scale up the routing table with the 
size of the network, you can roughly divide the order by the routing 
table size (most log n DHTs scale the routing table with log n as well).

> 4. Speaking of DHTs: Oskar recently said something curious here that 
>    I did not understand: "Actually, while the frequency of Chord links
>    falls exponentially with the "level" (not sure what the Chord term
>    is) the length of such links increases exponentially as well, so in
>    fact the frequency of links with certain lengths do fall
>    harmonically. One could see Chord as some sort of "mean field"
>    version of the same dynamics as Kleinberg's model."
>    What exactly does this mean? DHTs (including Chord) have log(N)
>    diameter, and Kleinberg has log^2(N), right? This looks fairly
>    different to me, and I totally missed the meaning of the "mean
>    field", and why Chord and Kleinberg would have the same dynamics;
>    I did not even understand what is this dynamics that you're talking
>    about. Oskar, could you please elaborate on this?

Again, the routing tables grow in Chord. If each node had one "chord" 
link (every log n th node having a chord of the same length), then Chord 
would be roughly log^2 as well.

> 6. Both the simulation in Freenet paper[1] and Sandberg/Dijjer paper[2]
>    use the random requests to rewire the networks. In reality, there
>    will be all kinds of hotspots and uneven distributions of requests
>    due to the different popularity of content. Do you guys think it
>    would affect the results of analysis performed in [2], and if yes,
>    how? For example, if one particular file is extremely popular, one
>    could imagine that its storage node[s] will have many more links 
>    than the average, and the search for the other (rarely requested)
>    data items might become ineffective, because these path lengths will
>    become very high. 

I would imagine that the large number of files stored per node, and the 
random distribution implied by using a secure hash would help even out 
distribution. I'm not sure what an uneven distribution does under our 
algorithm, but intuitively it should handle it just fine (it would mean 
more links to the destination of popular queries, but that is a good 
idea since such links are popular...)

>    Do you think it is a valid concern? Are there any results that would
>    show that this is not an issue?

Yes, and no (besides standard concentration results showing that it 
should even out if there are sufficiently many documents per node).

> 7. While Sandberg [2] assumes that the requests are passing between 
>    the random points, both Freenet and Dijjer seem to allow the length
>    of the path to be shortened when the requested content is already
>    cached on the intermediate node. If this is really so, what is the
>    effect that will have on Oskar's analysis results? Fewer links will
>    be rewired, so this should have some effect on the graph dynamics,
>    right? Any idea what it would be?

This is part of the answer to your first question about why these 
situations are harder to analyze.

> 8. The answer to '7' above might also depend on the cache size and 
>    replacement policies on the nodes, and on the content popularity
>    distribution (which I already mentioned in '6' above). All these
>    factors are hard to predict for the real system in advance, and I'm
>    wondering what is the feeling that you guys have about the overall
>    stability of the rewiring algorithm?

No idea.

>    I mean, the original Kleinberg d(x, y)^(-1) approach is extremely
>    sensitive to the value of this "-1". Any other value, and you either
>    do not have enough long distance links, or you do not have enough
>    short-range links once you arrive into the general vicinity of your
>    destination. In any case, your path length becomes polynomial instead
>    of polylogarithmic once you have the smallest deviation from this
>    "-1" number.

It isn't actually that sensitive. It is sensitive asymptotically, but 
for any given network size varying the exponent a little will make 
rather little difference. In fact, one can calculate numerically (I know 
of no proof) that there are better exponents than -1 for any given n 
(though the best exponent approaches 1 as the network grows towards 

>    And this makes me a bit uneasy; I do understand that the catastrophic
>    failure is unlikely, and the detrimental effect of all these changes
>    (if any) is likely to be tolerable, but still... Look at it this
>    way: Chapter 1 of Oskar's thesis is dedicated to proving that 
>    "...family given by (1.1) allows for polylogarithmic routing at, 
>    and only at, one value of the [alpha]...", then Chapter 2 proves
>    that this is exactly what happens with random-request rewiring, and
>    then Dijjer's practical implementation and deployment arrive and
>    throw these random requests out of the window with their caching,
>    different content popularity, and whatnot. So given the strict
>    requirements that were proven to be vital in Chapter 1, I cannot 
>    help wondering: is Dijjer still polylogarithmic, or not?

It is certainly not proved.

>    I'd feel much better about all this if there would be some results
>    showing that the system is stable in the presence of such changes
>    (in a sense that reasonably small algorithm input changes should not
>    cause any catastrophically large changes in the algorithm results).
>    Do you have any results that would point in that direction? Would
>    you say that the data displayed on Figures 2.4-2.6 already shows 
>    that your results are not exactly Kleinberg, but this does not 
>    cause any catastrophic consequences, or you also have some other
>    sources of optimism?

I would feel better about it too if more were proven. I am optimistic 
that things actually do work, but I am not very optimistic about how 
much I will ever be able to show (and it isn't just me, though a better 
mathematician may of course have gotten further, I have put these 
problems in front of a many good researchers who seem to agree they are 
difficult problems).

> 9. And finally, if I'm not mistaken, Dijjer description seems to imply
>    that the links are rewired after every search, whereas Oskar suggests
>    that this should be happen only with proability p, and in section 2.3
>    says: "...there are simple heuristic arguments for why p should
>    reasonably be on the order of one over the expected length of the 
>    greedy walks". Any particular reason for that Dijjer behaviour, or
>    I've missed something and Dijjer does, in fact, rewire with this
>    probability p? (Would be interesting to hear these simple heuristic
>    arguments too, but that's another story...)

It won't matter so much if the routing table is large (remember the 
mathematical model has one long link) though I would recommend that 
Dijjer updated less often.

// oskar

More information about the P2p-hackers mailing list