[p2p-hackers] human-oriented base-32 encoding
Zooko
zooko at zooko.com
Fri Nov 1 06:24:01 UTC 2002
I've written a rationale for my base-32 encoding scheme. The "DESIGN"
document is visible via viewcvs here:
http://cvs.sf.net/cgi-bin/viewcvs.cgi/libbase32/libbase32/DESIGN?rev=HEAD&content-type=text/vnd.viewcvs-markup
In order to minimize the chance that you ignore this message, I will now include
the contents of the file in this message. The version number of the DESIGN file
is v0.9 so that I can incorporate feedback from p2p-hackers (and any other
sources) before naming it "v1.0".
------- begin appended file "DESIGN"
Zooko O'Whielacronx
November 2002
human-oriented base-32 encoding
INTRO
The base-32 encoding implemented in this library differs from that described in
draft-josefsson-base-encoding-04.txt [1], and as a result is incompatible with
that encoding. This document describes why we made that choice.
This encoding is implemented in a project named libbase32 [2].
This is version 0.9 of this document. The latest version should always be
available at:
http://cvs.sf.net/cgi-bin/viewcvs.cgi/libbase32/libbase32/DESIGN?rev=HEAD
RATIONALE
Basically, the rationale for base-32 encoding in [1] is as written therein: "The
Base 32 encoding is designed to represent arbitrary sequences of octets in a
form that needs to be case insensitive but need not be humanly readable.".
The rationale for libbase32 is different -- it is to represent arbitrary
sequences of octets in a form that is as convenient as possible for human users
to manipulate. In particular, libbase32 was created in order to serve the Mnet
project [3], where 40-octet cryptographic values are encoded into URIs for
humans to manipulate. Anticipated uses of these URIs include cut-and-paste,
text editing (e.g. in HTML files), manual transcription via a keyboard, manual
transcription via pen-and-paper, vocal transcription over phone or radio, etc.
The desiderata for such an encoding are:
* minimizing transcription errors -- e.g. the well-known problem of confusing
`0' with `O'
* encoding into other structures -- e.g. search engines, structured or marked-
up text, file systems, command shells
* brevity -- Shorter URLs are better than longer ones.
* ergonomics -- Human users (especially non-technical ones) should find the
URIs as easy and pleasant as possible. The uglier the URI looks, the worse.
DESIGN
Base
The first decision we made was to use base-32 instead of base-64. An earlier
version of this project used base-64, but a discussion on the p2p-hackers
mailing list [4] convinced us that the added length of base-32 encoding was
worth the added clarity provided by: case-insensitivity, the absence of non-
alphanumeric characters, and the ability to omit a few of the most troublesome
alphanumeric characters.
In particular, we realized that it would probably be faster and more comfortable
to vocally transcribe a base-32 encoded 40-octet string (64 characters, case-
insensitive, no non-alphanumeric characters) than a base-64 encoded one
(54 characters, case-sensitive, plus two non-alphanumeric characters).
Alphabet
There are 26 alphabet characters and 10 digits, for a total of 36 characters
available. We need only 32 characters for our base-32 alphabet, so we can
choose four characters to exclude. This is where we part company with
traditional base-32 encodings. For example [1] eliminates `0', `1', `8', and
`9'. This choice eliminates two characters that are unambiguous (`8' and `9')
while retaining others that are potentially confusing. Others have suggested
eliminating `0', `1', `O', and `L', which is likewise suboptimal.
Our choice of confusing characters to eliminate is: `0', `l', `v', and `2'. Our
reasoning is that `0' is potentially mistaken for `o', that `l' is potentially
mistaken for `1' or `i', that `v' is potentially mistaken for `u' or `r'
(especially in handwriting) and that `2' is potentially mistaken for `z'
(especially in handwriting).
Note that we choose to focus on typed and written transcription errors instead
of vocal, since humans already have a well-established system of disambiguating
spoken alphanumerics (such as the United States military's "Alpha Bravo Charlie
Delta" and telephone operators' "Is that 'd' as in 'dog'?").
Sub-Octet Data
Suppose you have 10 bits of data to transmit, and the recipient (the decoder) is
expecting 10 bits of data. All previous base-32 encoding schemes assume that
the binary data to be encoded is in 8-bit octets, so you would have to pad the
data out to 2 octets and encode it in base-32, resulting in a string
4-characters long. The decoder will decode that into 2 octets (16 bits) and
then ignore the least significant 6 bits.
In the base-32 encoding described here, if the encoder and decoder both know the
exact length of the data in bits (modulo 40), then they can use this shared
information to optimize the size of the transmitted (encoded) string. In the
example that you have 10 bits of data to transmit, libbase32 allows you to
transmit the optimal encoded string: two characters.
If the length in bits is always a multiple of 8, or if both sides are not sure
of the length in bits modulo 40, or if this encoding is being used in a way that
optimizing one or two characters out of the encoded string isn't worth the
potential confusion, you can always use this encoding the same way you would use
other encodings -- with an "input is in 8-bit octets" assumption.
Padding
Honestly, I don't understand why all the base-32 and base-64 encodings require
trailing padding. Maybe I'm missing something, and when I publish this document
people will point it out, and then I'll hastily erase this paragraph.
[1] http://www.ietf.org/internet-drafts/draft-josefsson-base-encoding-04.txt
[2] http://sf.net/projects/libbase32
[3] http://mnet.sf.net/
[4] http://zgp.org/pipermail/p2p-hackers/2001-October/
More information about the P2p-hackers
mailing list