[linux-elitists] Spam, aggregators, ASN, reporting tools, and stats

Karsten M. Self kmself@ix.netcom.com
Mon Jan 19 06:34:50 PST 2004


I've spent a few days developing a spam reporting tool.  It's a script I
run over my filtered spam[1]; it pokes and prods a bit, and tries to
work out who to bitch to through rDNS, WHOIS queries, and a few
DNS-based queries.  For shits'n'grins, I also tossed in a few tests of
other lists.


First, some really useful tools I've turned up:

  - jwhois:  caching whois client.  This generates a system-wide cache
    of WHOIS queries.  Very useful when you're running batches of one or
    more score at a time....  Or a few thousand ;-)  
    
    Proper care and feeding of the jwhois.conf file is necessary, and
    suggests some changes for the app, as the various regional
    registries are legion (132 listed in the conf file), and a
    comprehensive CIDR listing for same can run to a thousand lines or
    more.  These are updated frequently (as often as daily), and should
    probably be split out from the conf file itself and treated as
    variable data.

  - aggregate:  a CIDR domains-specified-in-various-formats calculator.
    Nicest thing is that you can pipe output to it and it will come up
    with the CIDR spec (say, grep and sed applied to jwhois output).
    There are rumors that scripted submissions to the rfc-ignorant
    ipwhois database are performed by some in the dark of night....[2]

  - A DNS-based ASN lookup.  That is, from an IP address, you can get
    the associated Autonomous System Number, network start, and CIDR
    size.  E.g.:

        250.202.144.198.asn.routeviews.org text "7961" "198.144.192.0" "19"

    ...which would correspond to our good host here.

    Further querying the ASN itself (more below) gives other goodness:

        [whois.radb.net]
        aut-num:            AS7961
        descr:              Raw Bandwidth Communications, Inc.
        admin-c:            MSD21-ARIN
        tech-c:             MSD21-ARIN
        notify:             mdurkin@rawbandwidth.com
        mnt-by:             MAINT-AS7961
        changed:            mdurkin@rawbandwidth.com 20021021
        source:             VERIO

    ...well and good.

    Astute readers will have noted that the relevant lookup is reversed
    IP at:

        asn.routeviews.org

    And you'll want a TXT query (host -t txt ...).

    The reason this is such a big deal is that ordinarily the
    information is only accessible by command-line access to mainline
    routers and other ugliness, or so I'm told.  Combined with a local
    caching DNS, and some common spam characteristics[3]


    For more on the general topic:

    http://darkwing.uoregon.edu/~joe/one-pager-asn.pdf [4]
    http://www.routeviews.org/
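
A minimal sketch of the lookup described above, in shell:  reverse the
octets, append the zone, and pull ASN, network start, and prefix length
out of the TXT answer.  The function names are made up for illustration;
only the asn.routeviews.org zone and the three-quoted-field answer come
from the example above, and the parser assumes host still prints the
answer in that form.

```shell
#!/bin/sh
# Hypothetical helper names; the zone and the answer layout are taken
# from the example in the text.

# reverse_ip A.B.C.D  ->  D.C.B.A
reverse_ip() {
    echo "$1" | awk -F. '{ print $4 "." $3 "." $2 "." $1 }'
}

# asn_query_name IP  ->  name to look up with "host -t txt"
asn_query_name() {
    echo "$(reverse_ip "$1").asn.routeviews.org"
}

# parse_asn_txt LINE  ->  "ASN NETSTART/PREFIX"
# Splits one line of host output on double quotes and keeps the three
# quoted fields; prints nothing if the line has fewer than three.
parse_asn_txt() {
    echo "$1" | awk -F'"' 'NF >= 6 { print $2, $4 "/" $6 }'
}

# asn_lookup IP  ->  runs the query (needs network access and host(1))
asn_lookup() {
    parse_asn_txt "$(host -t txt "$(asn_query_name "$1")" | head -n 1)"
}
```

Fed the sample answer shown earlier, parse_asn_txt prints
"7961 198.144.192.0/19".  asn_lookup is the only function here that
touches the network.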


The reason ASNs are so useful is that there are so few of them[5].
65,536 (do the math) possible, of which somewhat fewer have actually
been assigned (I don't have the absolute number).  This is considerably
fewer, however, than the 4 and some billion IPv4 addresses available.

In my occasionally alluded to other life, I work with various sorts of
data, generally providing both drill-down (specific case, often outlier
or exception), and summary data.  The problem with data in general being
that there are often too many damned individuals, and except in very few
cases, meaningful results based on unaggregated data are difficult.

Well, except sometimes.

Fully 20% of my Nigeria Advance Fee / 419 spam mail originates from one
address:

    193.252.22.158 (smtp1.freeserve.com)

...Spamhaus and SpamCop both show it spewing since summer 2003.


Healthcare data, by way of diversion, is largely based on two sets of
codes:  CPT-4 (procedure) and ICD-9 (now ICD-10) diagnostic codes.  There
are literally thousands of values for each, and except for the very
largest healthcare systems, the values are too specific to be of much
practical use.  One consequence is that most definitive large-scale
medical studies comprise the elderly, the poor, and the HMO'd --
Medicare, Medicaid, and Kaiser are the three most studied databases.
Common strategies for the rest of the world involve various aggregating
techniques:  trimming digits (3 or 4 of the ICD-9), or running the
well-known "DRG Grouper" (blecherous piece of software), which
aggregates diagnoses into standardized categories.

Similarly, demographic and marketing folks have ZIP, ZIP+4, census block
and MSA data.  Just because sometimes you want a number, not a (wo)man.

The alternatives to ASN aggregation have some serious problems:

  - rDNS isn't available for about 40% of my spam.

  - Lopping off octets from an IP is at best crude.  Sure, you can get
    yourself to a /24 or /16, but that may span many organizations
    bearing little relation to one another.

  - WHOIS is slow.  Even caching only helps so much as you're querying
    individual IPs[6], and the data transfer (relative to a typical DNS
    query) is large.  More significantly, WHOIS data formats are highly
    variable, and the contents are often suspect[7].  Then there are the
    WHOIS servers which rate-limit queries, effectively restricting
    access to the data (LACNIC are particularly bad).


By contrast, ASNs are easy to get (thanks to routeviews.org), uniform,
standardized, and define a scope of organizational control.  That is:
the organization to which the ASN is assigned has control over that
block, and the enforcement (or lack) of good Netizen policies will be
widely reflected on that block.



Before I continue with the ASN results, I thought I'd mention a few
other results of interest.

First, the SpamCop and Spamhaus RBL lookups are highly effective at
tagging spam.  Rates are 80% and 53% respectively -- I suspect that
SpamCop is more aggressive, and Spamhaus more definitive.  What I
haven't done is tested them on non-spam corpuses.
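
For anyone who hasn't poked at a DNS blocklist directly, here is a rough
sketch of how such a lookup is usually built:  same reversed-octet trick
as the ASN query, with the list's zone appended, and a 127.x.x.x answer
meaning "listed".  The function names are mine, and the zone is left as
a parameter because the names each list publishes under are not given
above (bl.spamcop.net and sbl.spamhaus.org are the ones I'd expect, but
treat that as an assumption).

```shell
#!/bin/sh
# dnsbl_name IP ZONE  ->  name whose A record signals a listing
dnsbl_name() {
    echo "$1" | awk -F. -v zone="$2" \
        '{ print $4 "." $3 "." $2 "." $1 "." zone }'
}

# is_listed IP ZONE  ->  exit status 0 if the zone answers 127.x.x.x
# (needs network access and host(1); ZONE is supplied by the caller)
is_listed() {
    host -t a "$(dnsbl_name "$1" "$2")" 2>/dev/null |
        grep -q 'has address 127\.'
}
```

Something like "is_listed 193.252.22.158 bl.spamcop.net && echo Flagged"
would be the per-message test, run once per list.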

The other major finding is that relays are largely a non-issue -- or
they're closed quickly enough that a retrospective analysis of a
two-month archive doesn't find many.

More revealingly:  the "RFC-I" lookups -- queries for domains which
don't provide useful contact information, as required by Internet
standards ("RFCs") -- are very frequently tripped by spam:  50% each for
"Abuse" and "Postmaster" (whether or not the domain has or receives mail
to postmaster@<domain>, etc.).  The other three tests were recently
added and/or had coding errors preventing successful lookups.  For 343
spams:

    274 SpamCop:                Flagged
     69 SpamCop:                not found

    183 Spamhaus:               Flagged
    160 Spamhaus:               not found

      3 Relays ORDB:            Flagged
    340 Relays ORDB:            not found

    169 RFC-I Abuse:            Flagged
    174 RFC-I Abuse:            not found

    171 RFC-I Postmaster:       Flagged
    172 RFC-I Postmaster:       not found

    343 RFC-I BogusMX:          not found
    343 Relays VISI:            not found
    343 RFC-ignorant-whois:     not found (Code error: 53/343: 15%)

The power comes from combining tests though:  311 of the spams failed at
least one of these lookups (90%); meaning I'd want to at the very
least take a closer look at the mail involved.

       Flags     Spams
       -----     -----
        0           33  (so 1/10 of spam doesn't trip anything)
        1          109
        2           53
        3          111
        4            1

If you want to focus your local CPU filtering on just this mail, you're
in pretty good shape.
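
The flags-per-spam tally above is easy to reproduce.  A sketch, assuming
the per-test results have been dumped one per line as
"MSGID TEST RESULT" (a made-up intermediate format, not what my report
script actually writes):

```shell
#!/bin/sh
# flag_histogram: reads "MSGID TEST RESULT" lines on stdin (hypothetical
# format) and prints "FLAGS SPAMS" pairs, fewest flags first.
flag_histogram() {
    awk '
        { seen[$1] = 1 }                    # remember every message
        $3 == "Flagged" { flags[$1]++ }     # count the tests it tripped
        END {
            for (m in seen) hist[flags[m] + 0]++
            for (n in hist) print n, hist[n]
        }' | sort -n
}
```

Messages that trip nothing land in the "0" row, which is the pile worth
spending local CPU on.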

St Sauver claims that open proxies (not to be confused with open relays)
are increasingly significant with spam, though my own data don't bear
this out.  Yet.



ASNs are damned useful in profiling where spam comes from.

Results here are based on a long-established (1996), well-publicized,
single email address.  My ISP's spam filtering is disabled (they offer
emergency filtering during "spam storms" -- a barn door closed after the
Swen horse fled...).  The data here were collected between 11 Nov
2003 and (for most of this analysis) 5 Jan 2004, comprising 3,822
spams.  I'm *not* including the several thousand Swen I've received over
the same period.


You've probably had a sense that Korea, China, and a handful of US
broadband providers are the major spam sources.  Not to disappoint, in
first place, with 9.4% of all the spam I see:

    AS4766 -  61.72.0.0/13: KORnet Powered BY Korea Telecom
    Spam:  9.37%    Spams: 358

That's nearly ten percent of all spam from a single ASN.


This trend continues:   The top six ASNs account for 26% of spam, the
top 28 for 50%, and the top 107 for 75% of all spam.  That's of the tens
of thousands of ASNs in existence.

Here you go, the top ten (first place repeated from above):


    Rank   Cum %  % tot   ASN  CIDR             Name
    ----  ------  -----  ----  ---------------  --------------------------------
       1   9.37%  9.37%  4766  61.72.0.0/13     KORnet Powered BY Korea Telecom
       2  13.24%  3.87%  4134  218.66.0.0/16    China Telecom
       3  16.80%  3.56%  7132  67.112.0.0/12    SBC Internet Services
       4  20.28%  3.48%  6478  24.8.128.0/18    AT&T WorldNet Services
       5  23.34%  3.06%  4813  218.18.128.0/18  China Telecom GUANGDONG PROVINCE
       6  26.09%  2.75%  9318  211.58.0.0/16    APNIC ASN block
       7  27.92%  1.83%  9277  211.244.0.0/16   THRUNET
       8  29.67%  1.75%  3462  61.228.0.0/16    Chunghwa Telecom Co., Ltd.
       9  31.40%  1.73%  7018  12.0.0.0/8       AT&T WorldNet Services
      10  32.89%  1.49%  7843  67.23.16.0/20    Adelphia Communications

For those who want to compare results on their own systems, I'm
attaching an awk script that will generate similar results.  I suspect
they'll be pretty close -- if spam is as ubiquitous as it seems, most of
the sampling problem is taken care of.  Different users or sites may
vary somewhat in volume, but they're mostly seeing the same spam -- it's
a matter of degree, not selection.
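
The attached script only prints per-ASN counts, so the rank and
cumulative-percentage columns need one more pass.  A sketch of that
pass, assuming input lines in the "COUNT ASN - CIDR NAME" layout the
script prints (rank_asns is a name I've made up here):

```shell
#!/bin/sh
# rank_asns: reads "COUNT ASN - CIDR NAME..." lines on stdin and prints
# rank, cumulative %, % of total, and ASN, biggest source first.
# Lines that don't start with a number (the progress dots, blanks)
# are skipped.
rank_asns() {
    sort -rn | awk '
        $1 ~ /^[0-9]+$/ { n++; count[n] = $1; asn[n] = $2; total += $1 }
        END {
            for (i = 1; i <= n; i++) {
                cum += count[i]
                printf "%2d  %6.2f%%  %5.2f%%  %s\n",
                    i, 100 * cum / total, 100 * count[i] / total, asn[i]
            }
        }'
}
```

Usage would be along the lines of piping the awk script's output
straight into rank_asns and taking the first ten lines with head.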



The spam report generator itself is a bash script (unfortunately for me,
I can almost always do what I want in bash....), which probably _should_
be cleaned up and rewritten in Perl or Python to avoid use of tainted
input.  And the annoying habit of hanging during the occasional jwhois
query....

I've got sample runlogs (summary of activity) and a report at:

    http://linuxmafia.com/~karsten/spam-reports.20040117-16%3A01%3A56
    http://linuxmafia.com/~karsten/spam-report.txt

It incorporates several sanity checks, including an excludes list of
mailing lists and of undeliverable addresses.  It tries default
deliveries (abuse/p'master) if it thinks they'll work.  Bounces are
automatically culled by another script, and I manually report to
RFC-Ignorant (http://www.rfc-ignorant.org/) as well.  

The generated report has dynamic elements.  In particular if I post to a
secondary recipient (not p/a or a direct WHOIS lookup), the WHOIS result
is included along with a LART explaining why.  Results of various RBL
queries are also included, with webpage links if possible.

If anyone can see obvious bozo errors in the output as presented, LMK.
I'm still working out how best to interpret query output.  I'd also like
to work in an additional recipients feature.  kornet.net and a few other
recipients appear to be black holes.  Might be useful to tickle the
Korean telecoms ministries a few times a day.



And some final food for thought.  Spam trends show a volume doubling
time of 6-12 months.  I saw 3,100 spams in December, 2003.  That's 6k
come July, 12k by next January.  At about 9k per message, we're going to
see the death of unfiltered dialup mail (or any other low-bandwidth
periodic connection) in a matter of 12-24 months.   Broadband and DSL
only buy you so much time.  The email system is under serious stress,
and current trends don't look good.  There are two possibilities:  it
can adapt, or it will break.
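
The arithmetic behind those figures, as a sketch.  It assumes clean
exponential growth, which is a simplification, and the 6k/12k numbers
above correspond to the 6-month end of the doubling range.  The function
name is made up.

```shell
#!/bin/sh
# project_spam START MONTHS DOUBLING  ->  messages per month after
# MONTHS, given START per month now and a doubling time of DOUBLING
# months.  Assumes pure exponential growth.
project_spam() {
    awk -v n0="$1" -v t="$2" -v d="$3" \
        'BEGIN { printf "%.0f\n", n0 * 2 ^ (t / d) }'
}
```

"project_spam 3100 6 6" gives 6200 and "project_spam 3100 12 6" gives
12400; at about 9k per message that last figure is roughly 111,600k a
month.  With a 12-month doubling time the same volumes arrive a year
later.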

Meantime, I see some ASNs whose mail doesn't have any business being
received by anyone.



--------------------
Notes:

1.  I don't control my MTA, and my ISP's filters suck.  So I filter what
    I get through fetchmail, and being a retentive analyst type, archive
    the crap.

2.  Anyone with recommendations for searching through WHOIS records
    looking for, say, a particular email address, please speak up.  I've
    been running serial searches using "seq" and ranges above and below
    a known target.

3.  Spammers lie.  They also are overwhelmingly few in number and points
    of origin.  So a little data goes a long way -- or at least a little
    with respect to the size of the 'Net at large.

4.  Joe St. Sauver, "ASNs (Autonomous System Numbers)".  Joe is director
    of user services and network applications at the University of
    Oregon Computing Center, and his website contains a number of
    references both technical and less so.  His comments regarding spam
    strike me as exceptionally sane, though his focus on sources to the
    exclusion of content isn't one I fully agree with (despite the
    content of this mail...hrm, judge that comment by its source...).
    http://www.uoregon.edu/~joe/
    http://darkwing.uoregon.edu/~joe/spamwar/winning-the-war-on-spam.pdf

5.  Forgive me for stating the obvious, but everything I know on the
    topic I've pretty much absorbed in the past 48 hours.  Definitive
    source is RFC1930:
    http://www.faqs.org/rfcs/rfc1930.html

6.  One suggestion for jwhois is that ip-based queries be indexed by
    CIDR, eliminating the need for repeat queries within the same IP
    block.

7.  One response to a spam report was from a large US ISP who had quit
    use of the IP block in question years ago.
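
Regarding note 6:  the test a CIDR-indexed cache would need is just "does
this IP fall inside a block I've already queried?"  A sketch of that
test, with made-up function names; this is not how jwhois works today,
only the check the suggestion implies.

```shell
#!/bin/sh
# ip_to_int A.B.C.D  ->  the address as a 32-bit integer
ip_to_int() {
    echo "$1" | awk -F. \
        '{ printf "%.0f\n", (($1 * 256 + $2) * 256 + $3) * 256 + $4 }'
}

# ip_in_cidr IP NET/PREFIX  ->  exit status 0 if IP is inside the block
ip_in_cidr() {
    awk -v ip="$(ip_to_int "$1")" \
        -v net="$(ip_to_int "${2%/*}")" \
        -v len="${2#*/}" \
        'BEGIN {
            size = 2 ^ (32 - len)       # addresses in a block this size
            # Same block if both addresses fall in the same size-aligned
            # slot of the address space.
            exit ((int(ip / size) == int(net / size)) ? 0 : 1)
        }'
}
```

For example, "ip_in_cidr 198.144.202.250 198.144.192.0/19" succeeds, so
a cached answer for that /19 could be reused without a second query.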

-- 
Karsten M. Self <kmself@ix.netcom.com>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
   Is GNU/Linux the future?  Hell, it's the present:
     http://www.dwheeler.com/oss_fs_why.html
-------------- next part --------------
#!/usr/bin/awk -f
#
# Karsten M. Self
# (c) 2004, All Rights Reserved.
# Licensed under GNU GPL v.2 or at your option any later version.
# NO WARRANTY

# Read in a sorted, ranked list of IPs.  Hrm...   We don't even have to
# do that.  Just IPs, one per line; duplicates are counted.
# You'll probably want to sort the output.

{ 
    ips[$1]++ 
}

END {

    # Get ASN for each IP.
    # Generate stats for ASNs.


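    i=0
    for ( ip in ips ) {
	i++
	split(ip, ipa, "\\.")
	reverseip = ipa[4] "." ipa[3] "." ipa[2] "." ipa[1]
	# printf( "IP:  %15s  - Reverse IP:  %15s\n", ip, reverseip )

	# The TXT answer carries three quoted fields:  ASN, network
	# start, and prefix length.
	cmd = "host -R 4 -t txt " reverseip ".asn.routeviews.org"
	$0 = ""
	cmd | getline; close( cmd )
	# print( "ASN: " $0 )

	# No " text " in the answer means the lookup failed.
	if ( ! sub("^.* text ", "") ) { $0 = "unknown" }
	gsub("\"", "")
	asn = $1
	ASN[ip] = asn; NETSTART[asn] = $2; NETPREFIX[asn] = $3
	# Weight by how often the IP appeared, so that the totals count
	# spams rather than distinct addresses.
	ASNs[asn] += ips[ip]
	CIDR[asn] = NETSTART[asn] "/" NETPREFIX[asn]

	# Ask WHOIS for the description once per ASN, not once per IP.
	if ( asn != "unknown" && ! (asn in org) ) {
	    org[asn] = ""
	    cmd = "jwhois AS" asn
	    # Test for > 0 so that a getline error (-1) ends the loop.
	    while (( cmd | getline ) > 0 ) {
		if ( $0 ~ "^descr:" ) {
		    sub("^descr: *", ""); org[asn] = $0
		    break
		}
	    }
	    close(cmd)
	}

	# printf( "%4s  %-14s  ASN: %-5s  (%4s)  -  %s\n",
	#    i, ip, ASN[ip], ASNs[ASN[ip]], CIDR[ASN[ip]] )
	printf(".")

	}

    print ""

    for ( asn in ASNs ) {
	printf( "%3s  %5s - %-19s  %s\n",
	    ASNs[asn], asn, CIDR[asn], org[asn] )
    }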


}