[linux-elitists] (tmda) Re: Constraining Bogus challenges.

Karsten M. Self kmself@ix.netcom.com
Fri Oct 3 18:24:43 PDT 2003


on Fri, Oct 03, 2003 at 01:30:46PM -0700, Matt Beland (matt@rearviewmirror.org) wrote:
> On Friday 03 October 2003 01:14 pm, Aaron Lehmann wrote:
> > By the way, what does it take for the stock configuration to mark
> > something as spam?
> >
> > I get some very nasty recurring spams which get these results:
> >
> > X-Spam-Status: No, hits=4.7 required=5.0
> >         tests=BAYES_99,DATE_IN_FUTURE_03_06,HTML_50_60,MIME_HTML_ONLY,
> >               MISSING_MIMEOLE,MISSING_OUTLOOK_NAME
> >
> > Something tells me the Bayes weights are way too low by default.
> > bogofilter would have thrown that very quickly. However, I never get
> > false positives and don't want to start now. Has anyone found a good
> > balance?
> 
> I'd suggest that if you've got a well populated Bayes database (>5k
> messages as Spam and Ham) and you're not seeing Bayes hits on legit
> messages, it'd be safe to bump up the scores - just be gentle with it.
> The above example would have been marked as spam with only .3 more
> points, so just give BAYES_99 an additional .5 or so and see how it
> works.

Echoes on that.  Specifically, I've made the following changes in my
~/.spamassassin/user_prefs file:

    score MICROSOFT_EXECUTABLE 4
    score BAYES_90 4.027
    score BAYES_99 5.200

What I've done is apply the network+bayes scores to my system, even
though I have disabled network (RBL) lookups.

The default scores are given in /usr/share/spamassassin/50_scores.cf

If you look at 'perldoc Mail::SpamAssassin::Conf', you'll find that
there are four scores listed for tests, depending on inclusion or
exclusion of bayes and network tests.
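
As a rough illustration of how the four columns map to a configuration,
here is a minimal Python sketch.  The column order follows the
description in the Conf documentation (neither, network only, bayes
only, both).  The rule name and values are made up for the example, and
pick_score() is a hypothetical helper, not part of SpamAssassin.

```python
# Sketch: choose which of a rule's four scores applies to a setup.
# Column order, per Mail::SpamAssassin::Conf:
#   0: no bayes, no network    1: no bayes, network
#   2: bayes, no network       3: bayes and network


def score_set_index(use_bayes: bool, use_network: bool) -> int:
    """Return the index (0-3) of the score column for this setup."""
    return (2 if use_bayes else 0) + (1 if use_network else 0)


def pick_score(score_line: str, use_bayes: bool, use_network: bool) -> float:
    """Parse a 'score NAME n [n n n]' line and return the active score."""
    parts = score_line.split()
    if len(parts) < 3 or parts[0] != "score":
        raise ValueError(f"not a score line: {score_line!r}")
    values = [float(v) for v in parts[2:]]
    if len(values) == 1:
        # A single value applies to every score set.
        return values[0]
    if len(values) != 4:
        raise ValueError("expected one or four score values")
    return values[score_set_index(use_bayes, use_network)]


# Hypothetical rule and values, for illustration only.
EXAMPLE_LINE = "score EXAMPLE_RULE 1.0 2.0 3.0 4.0"

if __name__ == "__main__":
    # Bayes on, network lookups off: the third column is used.
    print(pick_score(EXAMPLE_LINE, use_bayes=True, use_network=False))
```

With bayes enabled and network tests disabled, the third column would
normally apply; overriding a score in user_prefs with a single value
makes it apply regardless of which set is active.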

> Don't forget to use "sa-learn --spam" on the false negatives, although
> in this case that wouldn't have helped - Bayes was already triggering
> on the message.  Still, can't hurt, might help. I whipped up a simple
> script that sits on the webmail interface to the main SA-using system
> I have; users can submit messages to sa-learn as either Spam or Ham
> just by clicking the appropriate button. (I'd share it, but it's both
> embarrassingly simple and hard-coded for the inherited custom webmail
> interface.)

My own solution is to create mailboxes for spam/ham learning, and run
training scripts on them periodically via a cronjob.
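
A minimal sketch of what such a training script might look like, in
Python.  The mailbox paths are assumptions for illustration, as are the
helper names; the only external command used is sa-learn with its
--spam, --ham and --mbox options.  This is not the script I actually
run, just the general shape of one.

```python
import subprocess
from pathlib import Path

# Hypothetical mbox locations; adjust to wherever the training
# mailboxes actually live.
SPAM_MBOX = Path.home() / "Mail" / "learn-spam"
HAM_MBOX = Path.home() / "Mail" / "learn-ham"


def learn_command(kind: str, mbox: Path) -> list[str]:
    """Build the sa-learn command line for one mbox file."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", f"--{kind}", "--mbox", str(mbox)]


def train(run=subprocess.run) -> None:
    """Feed each non-empty training mailbox to sa-learn."""
    for kind, mbox in (("spam", SPAM_MBOX), ("ham", HAM_MBOX)):
        if mbox.exists() and mbox.stat().st_size > 0:
            run(learn_command(kind, mbox), check=True)


if __name__ == "__main__":
    train()
```

Saved under some name such as ~/bin/sa-train.py (again hypothetical), it
could be run from a crontab entry at whatever interval suits the volume
of mail.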


Incidentally, in the continuing discussion of efficacy and specificity
of spam filtering methods, you'll find the following section at the top
of /usr/share/spamassassin/50_scores.cf:

    # Set 0
    # SUMMARY for threshold 5.0:
    # Correctly non-spam: 130571  56.17%  (99.84% of non-spam corpus)
    # Correctly spam:      94371  40.59%  (92.80% of spam corpus)
    # False positives:       207  0.09%  (0.16% of nonspam,  18703 weighted)
    # False negatives:      7326  3.15%  (7.20% of spam,  23325 weighted)
    # Average score for spam:  17.2    nonspam: -1.3
    # Average for false-pos:   5.8  false-neg: 3.2
    # TOTAL:              232475  100.00%
    # Set 0 Validation
    # SUMMARY for threshold 5.0:
    # Correctly non-spam:  14511  56.18%  (99.87% of non-spam corpus)
    # Correctly spam:       9921  38.41%  (87.80% of spam corpus)
    # False positives:        19  0.07%  (0.13% of nonspam,    954 weighted)
    # False negatives:      1378  5.34%  (12.20% of spam,   4489 weighted)
    # TCR: 7.670740  SpamRecall: 87.804%  SpamPrec: 99.809%  FP: 0.07%  FN: 5.34%


There are a total of four corpuses listed.  All report 93.9% - 96.01%
accuracy at reporting spam, with a false-positive rate of 0.08% to
0.22%.
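
The percentages in these summaries follow directly from the four raw
counts.  As a sketch, the Python below recomputes the figures on the
"Set 0 Validation" TCR line from the counts quoted above.  The function
and key names are my own, and the definitions (recall = spam caught /
all spam, precision = spam caught / everything flagged) are the usual
ones, assumed rather than taken from the SpamAssassin sources.

```python
def summarize(ham_ok: int, spam_ok: int, false_pos: int, false_neg: int) -> dict:
    """Derive percentage metrics from the four raw message counts."""
    total = ham_ok + spam_ok + false_pos + false_neg
    return {
        # Share of all spam that was correctly flagged.
        "spam_recall": 100.0 * spam_ok / (spam_ok + false_neg),
        # Share of flagged messages that really were spam.
        "spam_precision": 100.0 * spam_ok / (spam_ok + false_pos),
        # Errors as a share of all messages tested.
        "fp_pct": 100.0 * false_pos / total,
        "fn_pct": 100.0 * false_neg / total,
    }


# Counts from the "Set 0 Validation" summary quoted above.
stats = summarize(ham_ok=14511, spam_ok=9921, false_pos=19, false_neg=1378)

if __name__ == "__main__":
    for name, value in stats.items():
        print(f"{name}: {value:.3f}%")
```

Feeding in counts from your own hand-sorted mail gives directly
comparable numbers for your own system.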


While the values may not match those attained in independent tests, they
are:

  - Published.
  - Reproducible.  Running the same tests on the same corpus should
    result in the same accuracy.
  - Tuneable.  If you feel that the corpus(es) used don't reflect your
    own ham/spam patterns, you can use your own body of mail to
    determine scores more appropriate for your own system.

Contrast this with the following exchange with the developer of TMDA:

  KM Self:     What is the supporting basis for this statement?  [TMDA
               efficacy over content-based filters]

  JR Mastaler: My personal experience.

  KM Self:     Please quantify your personal experience.

  JR Mastaler: I'd prefer not to.


Mastaler isn't just some C-R user.  He's the lead developer of one of
the most widely used, and best regarded, free software C-R system
implementations.  Yet he point-blank refuses to quantify the performance
or benefits of his tool.

I'm underwhelmed.


Peace.

-- 
Karsten M. Self <kmself@ix.netcom.com>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
    Ford had another Pan Galactic Gargle Blaster, the drink which has
    been described as the alcoholic equivalent of a mugging - expensive
    and bad for the head.
    -- HHGTG