[linux-elitists] (tmda) Re: Constraining Bogus challenges.
Karsten M. Self
Tue Sep 23 16:19:54 PDT 2003
on Tue, Sep 23, 2003 at 11:17:34AM -0700, Larry M. Augustin (email@example.com) wrote:
> Karsten M. Self wrote:
> > on Mon, Sep 22, 2003 at 03:22:50PM -0600, Jason R. Mastaler
> > (firstname.lastname@example.org) wrote:
> > > "Karsten M. Self" <email@example.com> writes:
> > >
> > > > Bollux. There are existing content/context based filters which
> > > > discriminate between spam and non spam with better than 98%
> > > > accuracy, and less than 0.02% false positive rates.
> This is a great example of lying with statistics.
I'm requesting an apology, Larry.
Those are statistics attained by myself, and reported in multiple
independent comparisons of filtering systems.
The statistics refer to the following configuration:
- Whitelist pass for known addresses.
  - SpamAssassin filtering on any addresses not explicitly whitelisted.
  - A last-ditch rule for Asian character-set mail not otherwise filtered.
This isn't _just_ SA (though it likely could be -- SA now offers both
automated and explicit whitelisting). It is however a reasonably
trivial filtering configuration. And it provides the results reported.
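The three stages above can be sketched as a single classification
function. This is a hypothetical illustration, not my actual
configuration: the addresses, the charset list, and the 5.0 cutoff
(SpamAssassin's conventional default) are all stand-ins.

```python
# Sketch of the three-stage setup: whitelist pass, SpamAssassin
# score, then a last-ditch Asian-charset rule. All names and
# thresholds are illustrative.

WHITELIST = {"friend@example.org", "list@example.com"}
SA_SPAM_THRESHOLD = 5.0          # SpamAssassin's conventional cutoff
ASIAN_CHARSETS = {"gb2312", "big5", "euc-kr", "iso-2022-jp"}

def classify(sender, sa_score, charsets):
    """Return 'ham', 'spam', or 'unclassified' for one message."""
    if sender in WHITELIST:
        return "ham"             # known correspondents bypass filtering
    if sa_score >= SA_SPAM_THRESHOLD:
        return "spam"            # content/context rules fired
    if charsets & ASIAN_CHARSETS:
        return "spam"            # last-ditch charset rule
    return "unclassified"        # left for manual review

print(classify("friend@example.org", 9.1, set()))    # -> ham
print(classify("stranger@a.com", 7.3, set()))        # -> spam
print(classify("stranger@a.com", 2.0, {"gb2312"}))   # -> spam
print(classify("recruiter@b.com", 1.2, set()))       # -> unclassified
```

The "unclassified" bucket is exactly the 6-12 messages a day described
below: mail that neither the whitelist nor any spam rule claims.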
There _is_ a fair amount of unclassified mail: judged neither spam nor
ham (hrm...looks like we need to update the lexicon). This is typically
a small number of messages a day -- 6-12 is typical. They include mail
from new correspondents -- usually recruiters, occasionally out-of-
the-blue responses to list posts or online comments I've made -- and
occasional spam that wasn't trapped by other rules, which is then moved
to my "spam-learn" folder, which a cronjob processes every fifteen
minutes.
Oh, and since moving the Asian charset rule to after SA, it hasn't
captured any mail.
The statistics were based on:
- Count of messages in all "normal delivery" folders (inbox, admin
mail, mailing lists)
- Count of messages in my spam archive. Yes, I archived all my spam.
- Count of messages in my false-positives folder. Yes, I copied
almost all false-positives to this, however trivial.
- Count of messages in my false-negative folder. Ditto.
While I don't have access to this corpus now (former job), I believe the
stats were based on about nine months' experience.
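The folder counts above reduce to rates by simple arithmetic. Here is
a sketch with made-up counts (only the formulas reflect the method
described; the numbers are chosen to land near the 98% / 0.02% figures
quoted earlier):

```python
# Illustrative arithmetic for the folder-count method. The counts
# are invented; the formulas are the point.

ham_total  = 30_000   # "normal delivery" folders (inbox, admin, lists)
spam_total = 12_000   # spam archive
false_pos  = 6        # false-positives folder: ham wrongly filtered
false_neg  = 200      # false-negatives folder: spam that slipped through

spam_catch_rate = 1 - false_neg / spam_total
fp_rate = false_pos / ham_total

print(f"spam catch rate:     {spam_catch_rate:.2%}")   # -> 98.33%
print(f"false-positive rate: {fp_rate:.3%}")           # -> 0.020%
```

The important point is that the false-positive rate is measured
against legitimate mail, not against the whole corpus, which is why it
can sit two orders of magnitude below the miss rate.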
> I've done an extensive cross-product survey of content/context
> filtering, and on average the numbers are no where near that.
Well...that's what I call exposition by handwave. Got details on that
survey?
> I don't doubt that there exist people on this mailing list who have
> carefully tuned setups fitting their individual tastes that are able
> to achieve those rates. However, for the non-technical user the
> tweaking necessary to achieve that level of accuracy is not an option.
> For the typical user of anti-spam systems based on content/context
> filtering, accuracy is more like 75%.
I've posted my raw SA scores, updated yesterday. And yes, the raw
filter rate is lower than the ultimate achieved rate -- I get about 87%
spam filtering with SA alone -- 2179 of 12048 messages are identified by
me as spam, but scored less than 5.
A plot of these data is posted here:
There are of course caveats:
  - I ran 817 of these messages through sa-learn, as spam, and rescored
them. Many now registered as spam. Since I only started explicitly
training SA's Bayesian classifier lately, I'd likely have seen
better performance previously. Why only 817 of the 2179? Dunno,
just realized that's all I'd run.
- Many of these were classified by other means -- specifically, my
Asian charset filters picked up a lot of the messages.
- I have a pretty broad definition of what is "spam", including some
mails not normally picked up by SA: irrelevant vacation messages,
    virus alerts, misdirected bounces, and the like. These comprise a
fair portion of this corpus.
- For much of this period, I explicitly bypassed SA filtering on mail
    from myself, or from commonly used administrative accounts such as
    root.
- Whitelist rules mean that only a very small fraction of this
actually hit my inbox. Some of the mail reached mailing lists, etc.
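The sa-learn retraining mentioned in the first caveat is the same flow
as the cron-processed "spam-learn" folder described earlier. A
hypothetical version of that cron step (the folder path is an
assumption; `sa-learn --spam` is SpamAssassin's real interface for
feeding spam into its Bayes database):

```python
# Hypothetical cron job: feed accumulated misses in a spam-learn
# folder to SpamAssassin's Bayesian learner, then empty the folder.
# The path is illustrative.

import subprocess
from pathlib import Path

SPAM_LEARN = Path.home() / "Mail" / "spam-learn"   # assumed location

def train_and_clear(folder: Path) -> int:
    """Train sa-learn on every message in folder; return count trained."""
    messages = [p for p in folder.glob("*") if p.is_file()]
    if not messages:
        return 0
    # sa-learn --spam marks messages as spam in the Bayes database.
    subprocess.run(["sa-learn", "--spam", *map(str, messages)], check=True)
    for p in messages:
        p.unlink()                 # already learned; clear the folder
    return len(messages)
```

Run from cron every fifteen minutes, this keeps the Bayesian
classifier current without any per-message manual effort beyond the
initial drag into the folder.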
> Just because a technology proof of concept can be done where accuracy
> is 98%, doesn't mean that the rest of the world can achieve 98%
> accuracy.
In my experience, and the experience of others, high levels of efficacy
_are_ attainable, with some, but minor, modification or training of
rules, in real-world situations. If the typical user can't reach an
effective rate of 90-95%, with a false-positive rate below 0.1%, I would
be quite surprised. With very little effort, most can achieve far
better.
Specifically, I recommend:
- Virus rejection at SMTP. I'm starting to feel this should be
a required service option of ISPs, configured by default.
- Spam classification via content/context and Bayesian classifiers.
  - User-maintained whitelisting (possibly incorporated into the
    filtering system itself).
With this combination, the rates stated should be attainable.
> To claim that it does, is just as disingenuous as you feel are the
> claims of C-R advocates.
I take extreme exception to this statement.
I'm not going to recap the comments and citations I made in my response
to George's TMDA post. Suffice to say: C-R users lie, evade, bash,
belittle, ignore costs, and generally behave with a level of
intellectual dishonesty unheard of outside of Microsoft and the venture
capital industry.
Again: I've stated my own and others' real-world, published,
experiences. In what way is stating the truth disingenuous?
Karsten M. Self <firstname.lastname@example.org> http://kmself.home.netcom.com/
What Part of "Gestalt" don't you understand?
Defeat EU Software Patents! http://swpat.ffii.org/