[linux-elitists] (tmda) Re: Constraining Bogus challenges.
Karsten M. Self
Wed Sep 24 04:17:33 PDT 2003
on Tue, Sep 23, 2003 at 09:28:35PM -0700, Aaron Lehmann (email@example.com) wrote:
> On Tue, Sep 23, 2003 at 10:21:04PM -0400, Andrew wrote:
> > Offhand, why don't you use spamc/spamd?
> Contrary to what people like Karsten Self would have you believe ;),
> SpamAssassin is not the be-all/end-all solution that I wish it was.
> SpamAssassin (spamd) uses something like 20MB of memory. No other
> daemon I use, not even Squid, needs that much to run. And memory is
> cheap - but SpamAssassin is SLOW. It shouldn't take so long to test an
> email against patterns, and it's even less acceptable for it to use so
> much memory for a rather trivial task.
Aaron, how *DARE* you mention *relevant*, *technical*, *legitimate*
objections to SpamAssassin. Don't you realize there are vast tracts of
my personal life, parentage, hygiene, social skills, sartorial
predilections, small animal relationships, and flatulent emissions
which you have failed to address? Really, we expect far better of you.
- SpamAssassin can be a bit piggish on memory.
- Yes, filtering (particularly with remote tests enabled) can be quite
slow. Even in "fast" modes, it's not perky.
- Several of the Bayesian classifiers are far more efficient, if
posted test results are credible.
On the other hand, in an age of cheap hardware:
- It does a good job of spam classification.
- It's network-capable. Meaning that a cluster of systems could
provide SA classification services on a round-robin basis for a
high-load environment, answering _most_ of the performance
complaints. Specifically, you'd have to balance development cost
against HW cost to justify an extensive optimisation rewrite.
- It uses a broad range of metrics. Unlike virtually every other spam
classifier or rejection system I'm aware of, SA covers: RBLs,
Razor/known spam, content, context, message metadata, sender,
recipient, associated history. It's highly adaptable. These to me
are signs of software which _is_ fundamentally pointed in the right
direction.
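To make the round-robin idea concrete, here is a minimal sketch in Python of a dispatcher that hands each incoming message to the next spamd host in turn. The host names and the RoundRobinPool class are hypothetical, invented for illustration; only the notion of rotating across several SA boxes comes from the point above. It does no health checking or failover, which a real deployment would presumably want.

```python
from itertools import cycle


class RoundRobinPool:
    """Hand out backends in strict rotation (hypothetical helper)."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("need at least one backend")
        self._backends = backends
        # cycle() repeats the list endlessly, in order.
        self._rotation = cycle(backends)

    def pick(self):
        """Return the next backend in the rotation."""
        return next(self._rotation)


# Hypothetical spamd hosts; 783 is spamd's usual default port.
SPAMD_HOSTS = [
    ("sa1.example.net", 783),
    ("sa2.example.net", 783),
    ("sa3.example.net", 783),
]

pool = RoundRobinPool(SPAMD_HOSTS)

# Each message would be scanned by whichever host pick() returns.
first_host, first_port = pool.pick()
```

Adding capacity is then a matter of appending another host to the list, which is the hardware-versus-rewrite trade-off in miniature.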
In practice, I've seen it run acceptably for a small ISP with daily mail
loads of ~40k items, lagging slightly at mid-day, on modest server
hardware: single-processor PII-233 to PIII-500 machines. My
back-of-the-envelope analysis for servicing AOL's 2.5 billion messages
comes in with a per-user cost of $0.02 over the system lifetime (call it
three years). I fully admit the values given are very rough, but improved
values should be able to dial this in closer:
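For the shape of that kind of estimate, here is a rough Python sketch of the arithmetic. It does not attempt to reproduce the $0.02 figure. The function names are hypothetical, and the per-server throughput, server price, and user count are inputs you would have to supply yourself. The one worked number treats the 2.5 billion messages as a daily load and reuses the ~40k items/day small-ISP box as the unit of capacity, both of which are assumptions.

```python
import math


def servers_needed(messages_per_day, per_server_per_day):
    """Servers required to keep up with the daily load, rounded up."""
    return math.ceil(messages_per_day / per_server_per_day)


def per_user_cost(messages_per_day, per_server_per_day, server_cost, users):
    """Hardware cost per user over the system lifetime.

    server_cost is the lifetime cost of one server; all inputs are
    placeholders to be replaced with measured values.
    """
    total = servers_needed(messages_per_day, per_server_per_day) * server_cost
    return total / users


# Assumption: 2.5 billion messages/day, 40k messages/day per modest box.
boxes = servers_needed(2_500_000_000, 40_000)  # 62500
```

Faster hardware raises per_server_per_day and shrinks the box count proportionally, which is why better input values move the answer so much.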
> Bayesian filtering? Bogofilter does it in the fraction of the time.
> The possibilities for SpamAssassin's temporal and spatial lameness are
> that SpamAssassin's code is utter shit and that Perl is much slower
> than C. I suspect the problems are caused by a mix of both. When
> software requires serious users (especially ISPs) to devote many more
> resources than truly necessary to it, it is not good software.
> SpamAssassin may work fine for the class of problems it was originally
> intended to solve. Yet once one has to run spamassassin as a daemon to
> make it somewhat more bearable, it's clear that it's poorly designed.
Have you looked at it enough to work out specific places where
refactoring could be applied incrementally? (I'm omitting your DFA/NFA
discussion, mostly as I'm wholly unqualified to discuss it.) Or do you
think just fixing the RE engine would provide benefits?
Karsten M. Self <firstname.lastname@example.org> http://kmself.home.netcom.com/
What Part of "Gestalt" don't you understand?
Defeat EU Software Patents! http://swpat.ffii.org/