[p2p-hackers] Practical clustering
David Barrett
dbarrett at quinthar.com
Thu Mar 9 21:27:15 UTC 2006
So Ive enjoyed the discussion on the mathematical foundation and taxonomy
of networks vis-à-vis clustering, but what techniques have you found useful
on a more nuts-and-bolts level? It seems to me the clustering problem can
be broken down into (roughly):
1) Measuring the distance between any two nodes: How have you defined
distance in this context: physical distance, latency, network hops,
throughput, etc? And for your selection, what techniques have you used to
measure it (IP geolocation, synthetic coordinate systems, embedded
traceroute, network monitoring, etc)?
2) Measuring the strength of any given node: How have you defined
strength in this context: uptime, CPU, memory, upload capacity, download
capacity, etc? How have you measured these?
3) Join algorithm: When a new node comes into the network, how do you
determine its placement? Is it a top-down, recursive approach, or a
bottom-up, emergent approach?
4) Optimization algorithm: As new nodes come and go (or even as a
background process) how do you optimize the network by promoting and
demoting nodes, and on what basis?
5) Repair algorithm: Given that any node can disappear at any time without
warning, how do you mitigate the effect of this and clean up afterwards?
And not all clustering need be in a network sense. With iGlance, for
example, I have a video buddy list feature where you have a super
low-bandwidth video stream to all your peers. My initial thought for this
was to have each client build up a pose library in order to optimize for
the massive similarity between frames in a video stream of you sitting in
front of your computer. For example, theres the looking at screen pose
and talking on phone pose, and away from computer pose, and so on. My
goal was to accumulate this pose library over time, and then just send out
indices into this library.
Id classify the clustering algorithm I used for my pose library generation
as follows:
1) Distance: Use an image differencing function that roughly equated to
number of pixels changed (it was more sophisticated than this, but not by
much)
2) Strength: If a new image coming in is within some tolerance of an
existing image, just increment the strength of the old image and discard
the new. Thus wed gradually identify which images were most common, and
call those the strongest poses.
3) Join algorithm: Trickle a new image down from top to bottom. At each
node, if its within tolerance, discard and increment its strength. If
outside tolerance, compare to children and pick whichever is within second
tolerance. If none are, call it a new child.
4) Optimization algorithm: Set a fixed limit on the number of nodes allowed
in the tree (based on how much memory I was willing to allocate to the
problem) and before a node creates a new leaf discard the weakest child
branch if were at our limit. I think there was also a fan-out limit and
when overrun, Id measure the distances between all children, find the pair
that was closest, and then demote the weaker under the stronger.
5) Repair algorithm: Not applicable to this problem because images didnt
just disappear without warning.
Anyway, so thats a clustering approach I used, and it worked surprisingly
well. The big problems were it had too many magic numbers (merge
threshold, split threshold, total node numbers, etc) and it was hard
choosing the right values. That and at the end of the day, the number of
poses you have even staring at your computer is actually quite high, and
simply sending a 1FPS video feed creates a better experience, is easier to
implement, and is sufficiently low bandwidth for my needs. Im sure with
enough tweaking it could be made to work and support a huge number of peers,
but meh. Work for a future day. Assuming I even kept the source, when itd
be just like me to throw it away.
So, thats my real-world clustering story. What story do you have, and what
lessons did you learn?
-david
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://zgp.org/pipermail/p2p-hackers/attachments/20060309/7f9a4a7c/attachment.html
More information about the P2p-hackers
mailing list