[linux-elitists] (tmda) Re: Constraining Bogus challenges.
Karsten M. Self
Fri Oct 3 18:24:43 PDT 2003
on Fri, Oct 03, 2003 at 01:30:46PM -0700, Matt Beland (firstname.lastname@example.org) wrote:
> On Friday 03 October 2003 01:14 pm, Aaron Lehmann wrote:
> > By the way, what does it take for the stock configuration to mark
> > something as spam?
> > I get some very nasty recurring spams which get these results:
> > X-Spam-Status: No, hits=4.7 required=5.0
> > tests=BAYES_99,DATE_IN_FUTURE_03_06,HTML_50_60,MIME_HTML_ONLY,
> > MISSING_MIMEOLE,MISSING_OUTLOOK_NAME
> > Something tells me the Bayes weights are way too low by default.
> > bogofilter would have thrown that very quickly. However, I never get
> > false positives and don't want to start now. Has anyone found a good
> > balance?
> I'd suggest that if you've got a well populated Bayes database (>5k
> messages as Spam and Ham) and you're not seeing Bayes hits on legit
> messages, it'd be safe to bump up the scores - just be gentle with it.
> The above example would have been marked as spam with only .3 more
> points, so just give BAYES_99 an additional .5 or so and see how it
Echoes on that. Specifically, I've made the following changes in my
score MICROSOFT_EXECUTABLE 4
score BAYES_90 4.027
score BAYES_99 5.200
What I've done is apply the network+bayes scores to my system though I
have disabled network (RBL) lookups.
The default scores are given in /usr/share/spamassassin/50_scores.cf.
If you look at 'perldoc Mail::SpamAssassin::Conf', you'll find that
there are four scores listed for tests, depending on inclusion or
exclusion of bayes and network tests.
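To make the four-score layout concrete, here is a minimal Python sketch of how one value gets picked from a score line. It assumes the column order as I read it in the Conf documentation (neither, network only, bayes only, both); the function name and the sample numbers are invented for illustration and are not taken from 50_scores.cf.

```python
def pick_score(scores, use_bayes, use_net):
    """Pick the active score from the values on a 'score' line.

    Assumed column order (my reading of Mail::SpamAssassin::Conf):
      0: bayes off, network off
      1: bayes off, network on
      2: bayes on,  network off
      3: bayes on,  network on
    A single value applies to every configuration.
    """
    if len(scores) == 1:
        return scores[0]
    if len(scores) != 4:
        raise ValueError("expected one or four scores")
    index = (2 if use_bayes else 0) + (1 if use_net else 0)
    return scores[index]


# Hypothetical four-score entry, purely for illustration.
example = [1.0, 2.0, 3.0, 4.0]

# With bayes on and network lookups off, the third column is used.
active = pick_score(example, use_bayes=True, use_net=False)
```

Copying the fourth-column value into a single-value score line, as done above, amounts to using the bayes+network weight regardless of whether network tests actually run.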
> Don't forget to use "sa-learn --spam" on the false negatives, although
> in this case that wouldn't have helped - Bayes was already triggering
> on the message. Still, can't hurt, might help. I whipped up a simple
> script that sits on the webmail interface to the main SA-using system
> I have; users can submit messages to sa-learn as either Spam or Ham
> just by clicking the appropriate button. (I'd share it, but it's both
> embarrassingly simple and hard-coded for the inherited custom webmail
My own solution is to create mailboxes for spam/ham learning, and run
training scripts on them periodically via a cronjob.
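A rough sketch of what such a training script might look like in Python, meant to be run from cron. The mailbox paths are hypothetical placeholders, and the sketch assumes the learning mailboxes are in mbox format so that sa-learn's --mbox option applies.

```python
import subprocess

# Hypothetical mbox paths; adjust to wherever the training mail is filed.
TRAINING_BOXES = {
    "--spam": "/home/user/Mail/learn-spam",
    "--ham": "/home/user/Mail/learn-ham",
}


def build_commands(boxes=TRAINING_BOXES):
    """Return one sa-learn argument list per training mailbox."""
    return [["sa-learn", flag, "--mbox", path] for flag, path in boxes.items()]


def train(boxes=TRAINING_BOXES):
    """Feed each mailbox to sa-learn, stopping on the first failure."""
    for argv in build_commands(boxes):
        subprocess.run(argv, check=True)


# Call train() from the script that cron invokes.
```

Keeping the command construction separate from the execution makes it easy to check what would be run before pointing the script at real mail.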
Incidentally, in the continuing discussion of efficacy and specificity
of spam filtering methods, you'll find the following section at the top
# Set 0
# SUMMARY for threshold 5.0:
# Correctly non-spam: 130571 56.17% (99.84% of non-spam corpus)
# Correctly spam: 94371 40.59% (92.80% of spam corpus)
# False positives: 207 0.09% (0.16% of nonspam, 18703 weighted)
# False negatives: 7326 3.15% (7.20% of spam, 23325 weighted)
# Average score for spam: 17.2 nonspam: -1.3
# Average for false-pos: 5.8 false-neg: 3.2
# TOTAL: 232475 100.00%
# Set 0 Validation
# SUMMARY for threshold 5.0:
# Correctly non-spam: 14511 56.18% (99.87% of non-spam corpus)
# Correctly spam: 9921 38.41% (87.80% of spam corpus)
# False positives: 19 0.07% (0.13% of nonspam, 954 weighted)
# False negatives: 1378 5.34% (12.20% of spam, 4489 weighted)
# TCR: 7.670740 SpamRecall: 87.804% SpamPrec: 99.809% FP: 0.07% FN: 5.34%
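The SpamRecall and SpamPrec figures on that last line can be rederived from the counts above it. A minimal Python sketch, assuming the usual definitions (recall = caught spam / all spam, precision = caught spam / everything flagged as spam); the function names are my own, and TCR is left out because its weighting isn't spelled out here.

```python
def spam_recall(caught, missed):
    """Fraction of the spam corpus that was correctly flagged."""
    return caught / (caught + missed)


def spam_precision(caught, false_pos):
    """Fraction of flagged messages that really were spam."""
    return caught / (caught + false_pos)


# Counts from the "Set 0 Validation" summary.
caught, missed, false_pos = 9921, 1378, 19

recall = spam_recall(caught, missed)
precision = spam_precision(caught, false_pos)
print(f"SpamRecall: {recall:.3%}  SpamPrec: {precision:.3%}")
```

The same two functions can be pointed at counts from your own mail corpus to compare against the published figures.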
There are a total of four corpuses listed. All report 93.9% - 96.01%
accuracy at reporting spam, with a false-positive rate of 0.08% to
While the values may not match those attained in independent tests, they are:
- Reproducible. Running the same tests on the same corpus should
result in the same accuracy.
- Tunable. If you feel that the corpus(es) used don't reflect your
own ham/spam patterns, you can use your own body of mail to
determine scores more appropriate for your own system.
Contrast this with the following exchange from the developer of TMDA:
KM Self: What is the supporting basis for this statement? [TMDA
efficacy over content-based filters]
JR Mastaler: My personal experience.
KM Self: Please quantify your personal experience.
JR Mastaler: I'd prefer not to.
Mastaler isn't just some C-R user. He's the lead developer of one of
the most widely used, and best regarded, free software C-R system
implementations. Yet he point-blank refuses to quantify the performance or
benefits of his tool.
Karsten M. Self <email@example.com> http://kmself.home.netcom.com/
What Part of "Gestalt" don't you understand?
Ford had another Pan Galactic Gargle Blaster, the drink which has
been described as the alcoholic equivalent of a mugging - expensive
and bad for the head.