[linux-elitists] web server software for tarpitting?

Gerald Oskoboiny gerald@impressive.net
Thu Feb 14 19:14:37 PST 2008


* Karsten M. Self <karsten@linuxmafia.com> [2008-02-12 13:21-0800]
>on Tue, Feb 12, 2008 at 10:33:17AM -0800, Gerald Oskoboiny (gerald@impressive.net) wrote:

>> I'd expect most medium/large sites have some kind of defensive
>> measures in place to deal with abuse. Google and Wikipedia block all
>> access from generic user-agents like Java/x and Python-urllib/x.
>
>Just out of curiosity, what's the user-agent distribution on this
>traffic?
>
>  - Are we dealing with poorly-written end-user software (browsers and
>    the like)?  Are these typically proprietary or Free Software?  Are
>    you getting hammered on account of legacy proprietary software,
>    non-standards-compliant FS tools?

It can be hard to tell which UAs are responsible because most of
the User-Agent headers are the defaults sent by whatever HTTP
libraries people are using, so there are lots of generic ones
like MSIE and Java/1.x.

The top UAs based on about 24 hours of traffic to one of our
mirrors recently:

     3660301 MSIE 7.0
     3474785 MSIE 6.0
     1815157 Java
     1599825 -
      315645 MSIE 5.01
       77656 Python-urllib
       54740 Mozilla
       45962 Novasoft NetPlayer
       22780 libosp 1.5
       18225 MSIE 5.0

(counting hits for .dtd/ent/mod/xsd files)

The MSIE hits seem to be from MSIE-the-http-stack, not
MSIE-the-browser. I haven't seen any evidence of IE downloading
this stuff on its own; there's usually some broken toolbar or
other app involved.
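
A tally like the one above could be produced along these lines.
This is only a sketch, not the script that generated the numbers:
it assumes Apache combined log format, and the regexes, the
grouping into UA "families" and the function names are all
illustrative.

```python
import re
from collections import Counter

# Assumed input: Apache "combined" log format, i.e.
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Only count requests for DTD/entity/module/schema files.
SCHEMA_RE = re.compile(r'\.(dtd|ent|mod|xsd)(\?|$)')

# Hypothetical grouping rules; a real report may group differently.
FAMILY_RES = [
    re.compile(r'MSIE \d+\.\d+'),
    re.compile(r'Java'),
    re.compile(r'Python-urllib'),
]


def ua_family(ua):
    """Collapse a raw User-Agent string into a coarse family name."""
    for pattern in FAMILY_RES:
        match = pattern.search(ua)
        if match:
            return match.group(0)
    # Fall back to the product token before the first slash;
    # an empty header is reported as "-".
    return ua.split('/')[0] or '-'


def count_user_agents(lines):
    """Count schema-file hits per UA family over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if match and SCHEMA_RE.search(match.group('path')):
            counts[ua_family(match.group('ua'))] += 1
    return counts
```

Calling count_user_agents(open("access.log")).most_common(10)
would then give a top-ten list, where access.log is a placeholder
file name.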

A few more numbers: (again from 24 hours on one mirror)

398132 unique IP addrs requested at least one of these files,
  1510 IP addrs made more than 1000 such requests,
   256 IP addrs made more than 5000 such requests.

Half the requests were made by the top 1174 IP addresses.
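
The per-address figures above are simple to derive once the
client address of each matching request has been extracted. A
minimal sketch, assuming an iterable of IP address strings as
input; the function name and the returned dictionary keys are
made up for illustration:

```python
from collections import Counter


def ip_stats(ips, thresholds=(1000, 5000)):
    """Summarise how requests are distributed over client addresses.

    `ips` holds one address string per request.
    """
    per_ip = Counter(ips)
    total = sum(per_ip.values())

    # Number of addresses that made more than `t` requests.
    over = {t: sum(1 for n in per_ip.values() if n > t) for t in thresholds}

    # Smallest number of top addresses that together account for
    # at least half of all requests.
    running = 0
    top_for_half = 0
    for _, n in per_ip.most_common():
        if running * 2 >= total:
            break
        running += n
        top_for_half += 1

    return {"unique": len(per_ip), "over": over, "top_for_half": top_for_half}
```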

>  - Or are you getting slammed by spiders, custom agents, and possibly
>    more nefarious tools (including possibly viruses)?  A mix of a
>    robots.txt and agent-related IP blocking might come into play.

I think we are also getting a fair number of hits for our home
page from viruses that use 'connect to w3.org:80' as a connectivity
check, but we haven't confirmed that; it's just a theory based on
the fact that most of them have null user-agents and don't bother
to follow the redirect from there to www.w3.org.
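
One way to test that theory would be to look for clients that
match both symptoms. The sketch below is hypothetical: it assumes
log records have already been parsed into (ip, host, path,
user_agent) tuples, which requires a log format that records the
Host header, and the function name is invented.

```python
def suspected_connectivity_checkers(records):
    """Return addresses that hit the bare w3.org home page with a null
    User-Agent and never requested anything from www.w3.org.

    `records` is an iterable of (ip, host, path, user_agent) tuples.
    """
    bare_hits = set()  # null-UA requests for http://w3.org/
    followed = set()   # clients that did reach www.w3.org
    for ip, host, path, user_agent in records:
        if host == "www.w3.org":
            followed.add(ip)
        elif host == "w3.org" and path == "/" and user_agent in ("", "-"):
            bare_hits.add(ip)
    return bare_hits - followed
```

This only narrows down candidates; a client behind a shared
address, or one that follows the redirect in a later log window,
would be misclassified.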

>... and of course, there's always a salacious interest in seeing names
>named.

Sorry, posting IP addresses would violate our privacy policy :)

>Also:  What regions / ASNs / CIDRs seem to be the most prevalent among
>abuse sources, does the distribution differ markedly from putatively
>legitimate traffic, and are any specific sources markedly more abusive
>than the rest?

Good questions; I'll have to look into that. Thanks for the ideas
and for the link to http://www.routeviews.org

>I don't see an easy solution -- you've sort of made your bed here with
>the current DTD reference schema, and we can assume that technology
>*will* be poorly implemented.  Might make sense to put DTDs into another
>space other than http.

We're pretty firmly committed to http URIs, for reasons
documented here:

    http://www.w3.org/TR/webarch/#uri-benefits
    http://www.w3.org/TR/webarch/#namespace-document

The fact that these URIs are easily dereferenceable is a feature,
not a bug.

>DNS, for example, *is* naturally a cache-and-
>forward system, and it's been abused for all sorts of stuff (see Dan
>Kaminsky's shenanigans).  While it's not a perfect solution, it's got
>several elements which could be made useful and might want to be
>considered for future DTD definitions and distribution of same.

HTTP caching seems like an ideal solution to me, and it's a big
peeve of mine that currently deployed software doesn't make
better use of it, so I'm keen to get that fixed.
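
To illustrate what "making better use of it" could look like on
the client side, here is a minimal sketch of a fetcher that reuses
a response while it is fresh and revalidates with If-None-Match
afterwards. It is not any particular library's API: the class
name, the injected fetch(url, headers) callable returning
(status, headers, body), and the plain-dict headers with exact
names "Cache-Control" and "ETag" are assumptions made for the
example. Only max-age is honoured; Expires and the other
Cache-Control directives are ignored.

```python
import re
import time


class CachingFetcher:
    """Tiny in-memory HTTP cache in front of a user-supplied fetch function.

    `fetch(url, request_headers)` must return (status, headers, body),
    with `headers` a plain dict (an assumption of this sketch).
    """

    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch
        self._clock = clock
        self._cache = {}  # url -> (expires_at, etag, body)

    @staticmethod
    def _max_age(headers):
        # Freshness lifetime in seconds; 0 if no max-age directive is present.
        match = re.search(r'max-age=(\d+)', headers.get("Cache-Control", ""))
        return int(match.group(1)) if match else 0

    def get(self, url):
        now = self._clock()
        entry = self._cache.get(url)
        if entry and now < entry[0]:
            return entry[2]  # still fresh: no network traffic at all

        # Stale or missing: revalidate if we have a validator.
        request_headers = {}
        if entry and entry[1]:
            request_headers["If-None-Match"] = entry[1]
        status, headers, body = self._fetch(url, request_headers)

        if status == 304 and entry:
            # Not modified: keep the stored body, refresh the lifetime.
            body, etag = entry[2], headers.get("ETag", entry[1])
        elif status == 200:
            etag = headers.get("ETag")
        else:
            return body  # errors are passed through uncached in this sketch

        self._cache[url] = (now + self._max_age(headers), etag, body)
        return body
```

With something like this in front of an XML parser's entity
loader, a DTD would be fetched once per freshness lifetime rather
than once per document parsed.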

-- 
Gerald Oskoboiny <gerald@impressive.net>
http://impressive.net/people/gerald/


