Sat Dec 9 22:11:42 UTC 2006
in tiny parts or big parts, as a separate transaction or
alongside your content downloads. If anyone gives you bogus
info, it's instantly recognizable as being inconsistent
with the root value.
The THEX interchange format is just one way to fill in that tree data.
> They appear to propose that the recipient would download log(n) chunks
> from a tree server (by asking for offsets and lengths in the tree
> file, presumably using HTTP keep-alive and offset, length features of
> HTTP). However this has significant request overhead when the file
> hashes are 16-20bytes (the request would be larger than the hash), it
> also involves at least two connections: one for authentication info,
> another for downloading chunks.
I wouldn't recommend that at all. The THEX data format is really best
for grabbing a whole subset of the internal tree values, from the
top/root on down, in one gulp. Yes, that top-down format includes
redundant info, and you could grab the data from lots of different
people, but it's so small compared to the content you're getting,
why not just get the whole thing from any one arbitrary peer who has
it handy? (If you were going to try any random access, I think you'd
aim for one whole generation of the tree at your desired resolution,
because you can calculate everything else from that.)
For example, the full tree to verify a 1GB file at a resolution of
64KB chunks is only (1G/64K)*2*24(Tiger)=768K, or less than 1/10th
of 1% of the total data being verified. So I'd say, just get it from
anyone who offers it, verify it's consistent with the desired root,
and keep it around -- nothing fancy.
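That overhead estimate is easy to check with a few lines of arithmetic (assuming, as above, 24-byte Tiger digests and that a binary tree over n leaves has just under 2n nodes):

```python
# Size of the full hash tree for a 1GB file at 64KB resolution,
# assuming 24-byte Tiger digests as in the estimate above.
file_size = 1 << 30                 # 1 GB
segment_size = 64 << 10             # 64 KB leaf segments
leaves = file_size // segment_size  # 16384 leaf hashes
nodes = 2 * leaves                  # a binary tree has < 2n nodes total
tree_bytes = nodes * 24             # 24 bytes per Tiger digest

print(tree_bytes)                   # 786432 bytes, i.e. 768K
print(tree_bytes / file_size)       # ~0.0007, well under 1/10th of 1%
```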
Now, you could imagine a system which includes a custom minimal
proof that "paints in" any transferred data segment with the
neighboring internal tree nodes, so that every range-get includes
its own standalone verification-up-to-root. I think that'd be
neat, but it'd be tougher to code, and introduces its own
redundancies, unless you make the protocol for requesting partial
trees even more intricate. Still, it'd be an approach compatible
with the same tree hash calculation method.
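A minimal sketch of that kind of per-segment proof, using a plain binary tree with SHA-1 standing in for THEX's Tiger-based node hashing (the function names and the omission of THEX's leaf/internal prefix bytes are simplifications of mine, not the THEX spec):

```python
import hashlib

def h(data):
    # Node hash; plain SHA-1 stands in here for THEX's Tiger-based
    # construction (which also prefixes leaf/internal marker bytes).
    return hashlib.sha1(data).digest()

def build_tree(leaf_hashes):
    # All generations, bottom-up: levels[0] is the leaf generation,
    # levels[-1] holds the single root. An odd node is promoted unchanged.
    levels = [leaf_hashes]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) if i + 1 < len(prev)
                       else prev[i]
                       for i in range(0, len(prev), 2)])
    return levels

def proof_path(levels, index):
    # The sibling hashes that "paint in" the rest of the tree around
    # one leaf, from the bottom generation up toward the root.
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):
            path.append((sibling & 1, level[sibling]))  # (sibling-is-right?, hash)
        index //= 2
    return path

def verify(leaf_hash, path, root):
    # Recompute upward; the segment is good iff we land on the root.
    current = leaf_hash
    for sibling_is_right, sibling in path:
        current = h(current + sibling) if sibling_is_right else h(sibling + current)
    return current == root
```

With something like this, every range-get could carry roughly log2(n) sibling hashes and be checked against the root on arrival.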
> They also mention their format is suitable for serial download (as
> well as the random access download I described in the above
> paragraph). Here I presume (though it is not stated) that the user
> would be expected to download either the entire set of leaf nodes (1/2
> the full tree size), or some subset of the leaf-nodes plus enough
> other nodes to verify that the leaf-nodes were correct. (To avoid
> being jammed during download from the tree server.) Again none of
> this is explicitly stated but would be minimally necessary to avoid jamming.
Yes, you'd get the whole tree or a whole generation at your desired resolution.
> A simpler and more efficient approach is as follows (I presume a
> 128bit (16byte) output hash function such as MD5, or truncated SHA1; I
> also presume each node has the whole file):
> if the file is <= 1KB download the file and compare to master hash.
> If the file is > 1KB and <= 64KB download, hash separately each of the
> 1KB chunk of the file; call the concatenation of those hashes the 2nd
> level hash set, and call the hash of the 2nd level hash set the master
> hash. To download first download the 2nd level hash set (a 1KB file)
> and check that it hashes to the master hash. Then download each 1KB
> chunk of the file (in random order from multiple servers) and check
> each 1KB chunk matches the corresponding chunk in the 2nd level hash set.
> If the file is > 64KB and <= 4MB, hash separately each of the 1KB
> chunks of the file; call the concatenation of those hashes the 2nd
> level hash set. The 2nd level hash set will be up to 64KB in size.
> Hash separately each of the up to 64 1KB chunks of the 2nd level hash
> set; call the concatenation of those hashes the 3rd level hash set.
> Call the hash of the 3rd level hash set the master hash. Download and
> verification is an obvious extension of the 2 level case.
> Repeat for as many levels as necessary to match the file size.
> Bandwidth efficiency is optimal, there is a single compact file
> authenticator (the master hash: the hash of the 2nd level hash set),
> and immediate authentication is provided on each 1KB file chunk.
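For concreteness, the quoted construction can be sketched as follows (MD5 and 1KB chunks as the poster assumes; the function names are mine):

```python
import hashlib

CHUNK = 1024  # 1KB chunks, per the quoted proposal

def hash_set(data):
    # Concatenate the 16-byte MD5 of each 1KB chunk.
    return b"".join(hashlib.md5(data[i:i + CHUNK]).digest()
                    for i in range(0, len(data), CHUNK))

def master_hash(data):
    # Collapse the file into successive hash sets until one fits in a
    # single chunk, then hash that; a file <= 1KB is hashed directly.
    level = data
    while len(level) > CHUNK:
        level = hash_set(level)
    return hashlib.md5(level).digest()
```

Each level's hash set is itself downloadable in 1KB chunks that verify against the level above it.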
On what criteria is this more simple or efficient?
It appears to be essentially the same as the THEX approach, but
instead of concatenating intermediate values with their neighbors
in pairs (a simple binary tree) you're doing it in groups of up to 64.
You still need the exact same amount of "bottom generation"
data to do complete verification at any given resolution.
And if you ever wanted to do minimally compact proofs, they're
larger with your tree's larger node degree. To verify that a
single 1KB segment fits inside a 1MB file with a desired root,
using the THEX calculation technique, takes 10 intermediate hash
values, say 200 bytes using SHA1 as the internal hash.
To verify a single 1KB segment using your scheme, you'd need
the individual hashes of its 63 neighbors, and then the
individual hashes of the 31 next-level neighbors: at least 94
intermediate hash values, say 1.8K using SHA1.
So there are things the binary tree construction can do that
the degree-64 tree construction cannot, but there's nothing the
degree-64 tree can do that the binary tree can't. (Remember
that every 6th generation of the binary tree is almost exactly
analogous to the degree-64 tree: it includes segment summary
values covering the exact same amount of source data.)
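The size comparison can be made concrete with a rough count of sibling hashes per proof. This is a sketch under simplified assumptions (a full complement of siblings at each generation); exact figures depend on how the intermediate hash sets are chunked, so they may differ somewhat from the numbers above:

```python
import math

def binary_proof_hashes(n_leaves):
    # Binary tree: one sibling hash per generation.
    return math.ceil(math.log2(n_leaves))

def wide_proof_hashes(n_leaves, degree):
    # Degree-d tree: up to (d - 1) sibling hashes per generation,
    # across far fewer generations.
    total, nodes = 0, n_leaves
    while nodes > 1:
        total += min(degree, nodes) - 1
        nodes = math.ceil(nodes / degree)
    return total

# 1MB file in 1KB segments -> 1024 leaves
print(binary_proof_hashes(1024))    # 10 hashes (~200 bytes of SHA1)
print(wide_proof_hashes(1024, 64))  # dozens of hashes, several times larger
```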
> To avoid the slow-start problem (can't download and verify from
> multiple servers until the 2nd level hash set has been downloaded),
> the 2nd level hash set chunk could have download started from multiple
> servers (to discover the fastest one), and/or speculative download of
> 3rd level chunks or content chunks could be started and verification
> deferred until the 2nd level hash set chunk is complete.
Given that the verification tree info is such a tiny proportion of the
total data to be transferred, and that many (if not all) peers
will be able to provide the tree info on demand for any file
they're also providing, I can't imagine slow-start being much of a
problem in practice.
- Gordon @ Bitzi
Gordon Mohr <gojomo@bitzi.com>, Bitzi CTO
At Bitzi, people cooperate to identify, rate, describe and discover files of every kind.
Bitzi knows bits -- because you teach it! http://bitzi.com