[p2p-hackers] human-oriented base-32 encoding

Zooko zooko at zooko.com
Fri Nov 1 06:24:01 UTC 2002

I've written a rationale for my base-32 encoding scheme.  The "DESIGN" 
document is visible via viewcvs here:


In order to minimize the chance that you ignore this message, I will now include 
the contents of the file in this message.  The version number of the DESIGN file 
is v0.9 so that I can incorporate feedback from p2p-hackers (and any other 
sources) before naming it "v1.0".

------- begin appended file "DESIGN"
                                                             Zooko O'Whielacronx
                                                                   November 2002

                    human-oriented base-32 encoding


The base-32 encoding implemented in this library differs from that described in 
draft-josefsson-base-encoding-04.txt [1], and as a result is incompatible with 
that encoding.  This document describes why we made that choice.

This encoding is implemented in a project named libbase32 [2].

This is version 0.9 of this document.  The latest version should always be 
available at:



Basically, the rationale for base-32 encoding in [1] is as written therein: "The 
Base 32 encoding is designed to represent arbitrary sequences of octets in a 
form that needs to be case insensitive but need not be humanly readable.".

The rationale for libbase32 is different -- it is to represent arbitrary 
sequences of octets in a form that is as convenient as possible for human users 
to manipulate.  In particular, libbase32 was created in order to serve the Mnet 
project [3], where 40-octet cryptographic values are encoded into URIs for 
humans to manipulate.  Anticipated uses of these URIs include cut-and-paste, 
text editing (e.g. in HTML files), manual transcription via a keyboard, manual 
transcription via pen-and-paper, vocal transcription over phone or radio, etc.

The desiderata for such an encoding are:

 * minimizing transcription errors -- e.g. the well-known problem of confusing 
   `0' with `O'
 * encoding into other structures -- e.g. search engines, structured or marked-
   up text, file systems, command shells
 * brevity -- Shorter URLs are better than longer ones.
 * ergonomics -- Human users (especially non-technical ones) should find the 
   URIs as easy and pleasant as possible.  The uglier the URI looks, the worse.



The first decision we made was to use base-32 instead of base-64.  An earlier 
version of this project used base-64, but a discussion on the p2p-hackers 
mailing list [4] convinced us that the added length of base-32 encoding was 
worth the added clarity provided by: case-insensitivity, the absence of non- 
alphanumeric characters, and the ability to omit a few of the most troublesome 
alphanumeric characters.

In particular, we realized that it would probably be faster and more comfortable 
to vocally transcribe a base-32 encoded 40-octet string (64 characters, case- 
insensitive, no non-alphanumeric characters) than a base-64 encoded one 
(54 characters, case-sensitive, plus two non-alphanumeric characters).


There are 26 alphabet characters and 10 digits, for a total of 36 characters 
available.  We need only 32 characters for our base-32 alphabet, so we can 
choose four characters to exclude.  This is where we part company with 
traditional base-32 encodings.  For example [1] eliminates `0', `1', `8', and 
`9'.  This choice eliminates two characters that are unambiguous (`8' and `9') 
while retaining others that are potentially confusing.  Others have suggested 
eliminating `0', `1', `O', and `L', which is likewise suboptimal.

Our choice of confusing characters to eliminate is: `0', `l', `v', and `2'.  Our 
reasoning is that `0' is potentially mistaken for `o', that `l' is potentially 
mistaken for `1' or `i', that `v' is potentially mistaken for `u' or `r' 
(especially in handwriting) and that `2' is potentially mistaken for `z' 
(especially in handwriting).

Note that we choose to focus on typed and written transcription errors instead 
of vocal, since humans already have a well-established system of disambiguating 
spoken alphanumerics (such as the United States military's "Alpha Bravo Charlie 
Delta" and telephone operators' "Is that 'd' as in 'dog'?").

Sub-Octet Data

Suppose you have 10 bits of data to transmit, and the recipient (the decoder) is 
expecting 10 bits of data.  All previous base-32 encoding schemes assume that 
the binary data to be encoded is in 8-bit octets, so you would have to pad the 
data out to 2 octets and encode it in base-32, resulting in a string 
4-characters long.  The decoder will decode that into 2 octets (16 bits) and 
then ignore the least significant 6 bits.

In the base-32 encoding described here, if the encoder and decoder both know the 
exact length of the data in bits (modulo 40), then they can use this shared 
information to optimize the size of the transmitted (encoded) string.  In the 
example that you have 10 bits of data to transmit, libbase32 allows you to 
transmit the optimal encoded string: two characters.

If the length in bits is always a multiple of 8, or if both sides are not sure 
of the length in bits modulo 40, or if this encoding is being used in a way that 
optimizing one or two characters out of the encoded string isn't worth the 
potential confusion, you can always use this encoding the same way you would use 
other encodings -- with an "input is in 8-bit octets" assumption.


Honestly, I don't understand why all the base-32 and base-64 encodings require 
trailing padding.  Maybe I'm missing something, and when I publish this document 
people will point it out, and then I'll hastily erase this paragraph.

[1] http://www.ietf.org/internet-drafts/draft-josefsson-base-encoding-04.txt
[2] http://sf.net/projects/libbase32
[3] http://mnet.sf.net/
[4] http://zgp.org/pipermail/p2p-hackers/2001-October/

More information about the P2p-hackers mailing list