[linux-elitists] 1 Gb e-mail mailboxes?

Rick Bradley roundeye@roundeye.net
Mon Apr 5 17:29:17 PDT 2004


* wuonm (wuonm@wanadoo.es) [040405 11:18]:
> for all e-mail admins out there...how can Google offer such a huge
> inbox size and still expect to earn money?  I suppose they can afford
> the expenses, but what kind of space-saving strategies do they use?
> Gzipped messages?  Bzipped mailboxes?  Duplicate attachment detection?

There's a big difference between offering a huge inbox and having people
fill that huge inbox.  

I've not heard mention of an "Import" facility as of yet, but even if
one is announced (setting aside the somewhat cumbersome "import via
email", which would likely lose much metadata, threading, etc.), users
would have to push data in at web-posting speed, which for most users
implies uploading at 56-256Kbps.  That means the vast majority of users
will be starting out with empty inboxes.
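
As a sanity check (my arithmetic, not anything Google has said), here's
what pushing a pre-existing archive through that pipe looks like:

    # Back-of-the-envelope: time to push a 1GB archive upstream at
    # typical 2004 rates, assuming perfectly sustained throughput.
    for kbps in (56, 256):
        hours = (8 * 10**9) / (kbps * 1000) / 3600.0
        print("%3d Kbps: %5.1f hours" % (kbps, hours))

That works out to roughly 40 hours at 56K and about 9 hours at 256K --
call it a day and a half of saturated dialup uplink.  Nobody is doing
that.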

Given 1GB of headroom, how much space will the average gmail user
consume in a year?  I'm probably a moderately heavy email user
(increasingly so over the past few years): I save everything (both
inbound and outbound -- including spam) and have ~3.2GB of email dating
back to some time in 1997.  Average email folder size is right at 1800
bytes, and it looks like there might be a power law (as I would expect;
see below) on my folder distribution, with a 1GB folder sitting at the
top.
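
For the curious, a quick sketch of how I eyeball that distribution --
this assumes one mbox file per folder under ~/Mail, which is my layout
and not necessarily yours:

    import os

    # Rank mail folders by size; a roughly straight line on log-log
    # axes (rank vs. size) suggests a power law.
    root = os.path.expanduser("~/Mail")
    sizes = sorted((os.path.getsize(os.path.join(d, f))
                    for d, _, files in os.walk(root) for f in files),
                   reverse=True)
    for rank, size in enumerate(sizes[:20], 1):
        print("%2d  %12d bytes" % (rank, size))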

I couldn't use gmail myself, though, due to my usage patterns -- heavy
dependence on procmail, fetchmail, mutt+gpg, wildcard addresses over
multiple domains, etc. -- but primarily because every web interface
I've tried has proved too slow or cumbersome for daily use.

Reasoning from this experience, I posit that there's a relationship
between volume of email sent/received and the utility of a service like
gmail -- a relationship which limits the size of the mail archive
creatable by real gmail users over a few-years time window.   This
presumes users like me who don't send lots and lots of big attachments,
but see below for more on this.  (Anyone know if there's a limit on
message size inbound/outbound via gmail?)

The guys at Google probably have a good feel for how large the average
mail archive is likely to be over the next few years, presumably with
stddevs and similar measures (equivalently, how many users are likely to
reach n% of capacity within time period t [0]).  They also certainly
know that in a few years the cost of storage will have shrunk yet again,
so that exponential archive growth can be tolerated for a fixed cost --
and if the program is running a few years hence then it's presumably
paying for itself, so any unforeseen scaling problems must have been
solved...
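
Footnote [0] points at the flavor of computation I mean; as a sketch,
here's the classic Erlang B recurrence relating offered traffic to
provisioned capacity (the toy numbers are mine):

    def erlang_b(offered_erlangs, lines):
        # Blocking probability via the standard Erlang B recurrence:
        # B(E, 0) = 1;  B(E, n) = E*B(E, n-1) / (n + E*B(E, n-1)).
        b = 1.0
        for n in range(1, lines + 1):
            b = (offered_erlangs * b) / (n + offered_erlangs * b)
        return b

    # E.g., 100 Erlangs of offered load on 110 "lines":
    print("%.4f" % erlang_b(100.0, 110))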

I guess the question is (to help determine how fast non-spam
communication is likely to grow, hence how big inbox size is likely to
get) what's the average connection count for a node in a scale-free
network?  This is, of course, straightforward to compute.  Why: a gmail
user can only generate O(t) mails over a time period t, while the user's
inbound mail over the same period (spam aside, and looking at the
aggregate) is determined by the number of links in their community
graph.  I feel confident in saying that for any non-trivially
sized subset of users their community graph is going to be a scale-free
network.  Therefore, apart from spam, the size of a user's archive is
proportional to the size of their social network.  In the aggregate this
means that the average archive size is proportional to the average link
count of a node in a large scale-free network.
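
Here's a minimal sketch of the "straightforward to compute" part, using
the Barabasi-Albert preferential-attachment model as a stand-in for the
community graph (my choice of model and parameters, nothing measured):

    import networkx as nx

    # Mean degree of a scale-free graph; for the Barabasi-Albert model
    # with m edges per new node it converges to ~2m regardless of size.
    g = nx.barabasi_albert_graph(n=100000, m=3)
    print(2.0 * g.number_of_edges() / g.number_of_nodes())

The punchline is that the mean stays small and finite even as the
network grows, which is exactly what bounds aggregate archive growth.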

A service like Orkut helps to measure how many friends a near-Google
person is likely to have (and these are likely to be the first adopters
of gmail, unless they are like me and hence unlikely to contribute
significantly to the size of the archive), which tells something about
how much mail they're likely to introduce into gmail.  It also tells a
lot about how adoption of a service with established competitors (orkut
vs.  friendster et al, as compared with gmail vs. hotmail, et al) is
likely to grow, starting from the Google PR nexus.  That is, Orkut, as a
service offering no compelling features over other social networking
services, is precisely the type of offering you want to launch to be
able to study the rate of brand-switching to your future services
launched in a similar environment.  For anyone who has wondered why
Google is interested in Orkut, that's a plausible answer.

Finally, as alluded to by other posters, having access to a large amount
of inbound (as well as outbound) email makes spam much easier to
identify -- hey, 400,000 emails with the same basic text came in today;
smells like spam to me -- making it easy to flag.  That's a service to
the user in itself, but since spam is not something I'd imagine most
webmail users want to keep around (even with 1GB available), I feel
certain gmail will eventually include a service that dumps flagged spam
after a short period of time.  This takes care of the
harder-to-quantify component of the typical user's inbound email burden.
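
The "400,000 emails with the same basic text" flag is easy to sketch:
fingerprint a normalized body and count collisions (the normalization
and threshold here are my guesses, not anything Google has described):

    import hashlib
    from collections import Counter

    def fingerprint(body):
        # Collapse whitespace and case so trivially mutated copies collide.
        canon = " ".join(body.lower().split())
        return hashlib.sha1(canon.encode("utf-8")).hexdigest()

    def bulk_suspects(bodies, threshold=1000):
        # Fingerprints seen at least `threshold` times smell like spam.
        counts = Counter(fingerprint(b) for b in bodies)
        return {fp for fp, n in counts.items() if n >= threshold}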

The remaining unknown is how many users tend to deal in large
attachments.  Presumably there are three factors Google can use to
mitigate storage risk here:  known distributions of mail message sizes
from other sources, the general acceptance of maximum message size
limits, and the psychology of "big file" mailers.  If I mail out 100
10MB TIFF files (e.g.) in a month (or receive them inbound) -- that's
roughly 1GB, the entire quota -- two things are probably true: I'm not
going to trust my big data to a webmail service, and I'm going to be
watching my quota to make sure I've got space left.  It's also probably
worth it for me to invest in a service which offers more storage -- up
until this point I would've had to anyway.  The uptake rate among these
users is also likely to be low, since they are probably already working
with a provider who meets their needs.

In addition to the benefits mentioned already by others, the ability to
distribute text classification and analysis over millions of
bring-your-own-content users is reason enough for Google to offer up The
Big Hard Drive for public use (with email being the easiest way to get
people to bring in content and classify it in front of you).
Context-aware information management and searching are the Next Big
Thing, and Google can head off Microsoft Research at the pass if they
can get hold of a good batch of data.

Anyway, I believe the short answer would've read "they don't expect
the average user to come even remotely close to 1GB of archives in
anything like the near future" -- and it's not even what I'd consider a
gamble.

[0] This smells a lot like the Erlang computations used to model
    telephone line utilization.  E.g., see:
    http://www.diagnosticstrategies.com/traffic_modeling.htm

Rick
-- 
 http://www.rickbradley.com    MUPRN: 57
                       |  the official close.
   random email haiku  |  Separate instruments in
                       |  prism, in data phase.


