[linux-elitists] (tmda) Re: Constraining Bogus challenges.

Joey Hess joey@kitenet.net
Wed Sep 24 07:42:16 PDT 2003


Aaron Lehmann wrote:
> I would appreciate a good spam filter with a similar design writen in
> a compiled language

Perl is a compiled language, you know. Anyway:

> and using efficient algorithms. Here's one example
> of an easy improvement that could be made to SpamAssassin if it was
> not confined to the facilities of perl: DFA regexp matching.

Yes, spamassassin has some truely bad regexps. For example:

lang pl body PL_ARTYKUL_USTAWY /Art.{0,50}25.{0,50}ust.{0,50}2.{0,50}pkt.{0,50}2/i

Welcome to backtracks-ville. The /i also makes this much more expensive
than it needs to be.

body TRACKER_ID         /^[a-z0-9]{6,24}[-_a-z0-9]{12,36}[a-z0-9]{6,24}\s*\z/is

Again with the /i, you could probably speed up SA's body checks by a factor
of 2 or more by lower casing the whole body before feeding it to all
/i rules.

rawbody ASCII_FORM_ENTRY        /[^<][A-Za-z][A-Za-z]+.{1,15}?[\x09\x20]*_{30,}/

This one is actually not bad, good use of character classes. Problem is
that the first half matches nearly every single line of the body of
every email that passes through SA. It's too early in the morning for me
to write this cleanly in one regexp, but it can be sped up quite a bit
just by moving the important part of the test to the front:

/[\x09\x20]*_{30,}/ && /[^<][A-Za-z][A-Za-z]+.{1,15}?[\x09\x20]*_{30,}/

joey@dragon:~>perl -e 'use Benchmark qw(:all); $a="foofofofoffd nf d fnd fnd fndfnbfdnbfdnbfdnbfdnbfdn ndf bndfbndfbnfdbndfbndfb n dfnbdfnfdb"; timethese(1000000 => { old => sub {return 1 if $a=~/[^<][A-Za-z][A-Za-z]+.{1,15}?[\x09\x20]*_{30,}/}, new => sub { return 1 if $a=~/[\x09\x20]*_{30,}/ && $a=~/[^<][A-Za-z][A-Za-z]+.{1,15}?[\x09\x20]*_{30,}/  }})'
Benchmark: timing 1000000 iterations of new, old...
       new:  1 wallclock secs ( 0.79 usr +  0.00 sys =  0.79 CPU) @ 1265822.78/s (n=1000000)
       old:  0 wallclock secs ( 0.71 usr +  0.00 sys =  0.71 CPU) @ 1408450.70/s (n=1000000)

You cite DFA's and a possible 10x speedup. Equivilant speedups can be
gained just by writing good NFA regexps, and NFAs can do more powerful
things. A heavily optimized NFA engine is often faster than naive DFA
implementations. If I were really worried about the perfomance of
spamassassin (and I'm not, I don't even use spamc, and my $300 server
hums along with ~10 thousand mails a day and negligible load from SA), I
would do two things before I even dared think of mentioning a rewrite:

1. Modify it to collect timing data for each rule, and run this for long
   enough to pinpoint the most expensive rules. There could easily be
   three or four rules that swamp the rest of the runtime for most inputs.

2. Identify the top 5 contributors of SA rules, and ship them personal
   copies of _MRE_. And some beer, to help them through chapter 2.

I notice that there is not one occurrance of /optimi[sz]e|speed ?up/i in
the whole spamassassin changelog. So there's probably a lot of
low-hanging fruit in the 2x-5x range.

-- 
see shy jo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : http://allium.zgp.org/pipermail/linux-elitists/attachments/20030924/b065f990/attachment.pgp 


More information about the linux-elitists mailing list