[p2p-hackers] Practical clustering

David Barrett dbarrett at quinthar.com
Thu Mar 9 21:27:15 UTC 2006

So I’ve enjoyed the discussion on the mathematical foundation and taxonomy
of networks vis-à-vis clustering, but what techniques have you found useful
on a more nuts-and-bolts level?  It seems to me the clustering problem can
be broken down into (roughly):


1) Measuring the “distance” between any two nodes: How have you defined
‘distance’ in this context: physical distance, latency, network hops,
throughput, etc?  And for your selection, what techniques have you used to
measure it (IP geolocation, synthetic coordinate systems, embedded
traceroute, network monitoring, etc)?


2) Measuring the “strength” of any given node: How have you defined
‘strength’ in this context: uptime, CPU, memory, upload capacity, download
capacity, etc?  How have you measured these?


3) Join algorithm: When a new node comes into the network, how do you
determine its placement?  Is it a top-down, recursive approach, or a
bottom-up, emergent approach?


4) Optimization algorithm: As new nodes come and go (or even as a
“background process”) how do you optimize the network by promoting and
demoting nodes, and on what basis?


5) Repair algorithm: Given that any node can disappear at any time without
warning, how do you mitigate the effect of this and clean up afterwards?


And not all clustering need be in a network sense.  With iGlance, for
example, I have a “video buddy list” feature where you have a super
low-bandwidth video stream to all your peers.  My initial thought for this
was to have each client build up a “pose library” in order to optimize for
the massive similarity between frames in a video stream of you sitting in
front of your computer.  For example, there’s the “looking at screen” pose
and “talking on phone” pose, and “away from computer” pose, and so on.  My
goal was to accumulate this pose library over time, and then just send out
indices into this library.


I’d classify the clustering algorithm I used for my pose library generation
as follows:


1) Distance: Use an image differencing function that roughly equated to
“number of pixels changed” (it was more sophisticated than this, but not by


2) Strength: If a new image coming in is within some tolerance of an
existing image, just increment the “strength” of the old image and discard
the new.  Thus we’d gradually identify which images were most common, and
call those the strongest poses.


3) Join algorithm: Trickle a new image down from top to bottom.  At each
node, if it’s within tolerance, discard and increment its strength.  If
outside tolerance, compare to children and pick whichever is within second
tolerance.  If none are, call it a new child.


4) Optimization algorithm: Set a fixed limit on the number of nodes allowed
in the tree (based on how much memory I was willing to allocate to the
problem) and – before a node creates a new leaf – discard the weakest child
branch if we’re at our limit.  I think there was also a fan-out limit and
when overrun, I’d measure the distances between all children, find the pair
that was closest, and then demote the weaker under the stronger.


5) Repair algorithm: Not applicable to this problem because images didn’t
just disappear without warning.


Anyway, so that’s a clustering approach I used, and it worked surprisingly
well.  The big problems were it had too many “magic numbers” (merge
threshold, split threshold, total node numbers, etc) and it was hard
choosing the right values.  That and at the end of the day, the number of
poses you have even staring at your computer is actually quite high, and
simply sending a 1FPS video feed creates a better experience, is easier to
implement, and is sufficiently low bandwidth for my needs.  I’m sure with
enough tweaking it could be made to work and support a huge number of peers,
but meh.  Work for a future day.  Assuming I even kept the source, when it’d
be just like me to throw it away.


So, that’s my real-world clustering story.  What story do you have, and what
lessons did you learn?





-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://zgp.org/pipermail/p2p-hackers/attachments/20060309/7f9a4a7c/attachment.html

More information about the P2p-hackers mailing list