[linux-elitists] web server software for tarpitting?

Karsten M. Self karsten@linuxmafia.com
Fri Feb 15 11:50:06 PST 2008

on Thu, Feb 14, 2008 at 07:14:37PM -0800, Gerald Oskoboiny (gerald@impressive.net) wrote:
> * Karsten M. Self <karsten@linuxmafia.com> [2008-02-12 13:21-0800]
> >on Tue, Feb 12, 2008 at 10:33:17AM -0800, Gerald Oskoboiny (gerald@impressive.net) wrote:
> >> I'd expect most medium/large sites have some kind of defensive
> >> measures in place to deal with abuse. Google and Wikipedia block all
> >> access from generic user-agents like Java/x and Python-urllib/x.
> >
> >Just out of curiosity, what's the user-agent distribution on this
> >traffic?
> >
> >  - Are we dealing with poorly-written end-user software (browsers and
> >    the like)?  Are these typically proprietary or Free Software?  Are
> >    you getting hammered on account of legacy proprietary software,
> >    non-standards-compliant FS tools?
> It can be hard to tell which UAs are responsible because most of
> the User-Agent headers are the defaults sent by whatever HTTP
> libraries people are using, so there are lots of generic ones
> like MSIE and Java/1.x.
> The top UAs based on about 24 hours of traffic to one of our
> mirrors recently:
>      3660301 MSIE 7.0
>      3474785 MSIE 6.0
>      1815157 Java
>      1599825 -
>       315645 MSIE 5.01
>        77656 Python-urllib
>        54740 Mozilla
>        45962 Novasoft NetPlayer
>        22780 libosp 1.5
>        18225 MSIE 5.0
> (counting hits for .dtd/ent/mod/xsd files)
> The MSIE hits seem to be from MSIE-the-http-stack, not
> MSIE-the-browser. I haven't seen any evidence of IE downloading
> this stuff on its own, there's usually some broken toolbar or
> other app involved.

Possibly.  It sounds, though, as if talking to the dev folks at MSFT
and on the Java side might address a fair portion of your problem ...
> A few more numbers: (again from 24 hours on one mirror)
> 398132 unique IP addrs requested at least one of these files,
>   1510 IP addrs made more than 1000 such requests,
>    256 IP addrs made more than 5000 such requests.
> Half the requests were made by the top 1174 IP addresses.

This suggests that rate-limiting or blocking based on IP would provide a
significant benefit.  My understanding is that a lot of modern routing
equipment provides these capabilities, or you could code something up on
a generic Linux or *BSD box.  Updating the blocklist periodically (say,
daily, give or take a factor of two) would be a good first-cut solution.
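The blocklist generation itself is trivial on a generic Linux box.  A
rough sketch in Python (the 5000-hit threshold and the sample data are
illustrative, and I'm assuming the client IP is the first field, as in
Apache common/combined log format):

```python
from collections import Counter

def build_blocklist(log_lines, threshold=5000):
    """Return IPs whose request count meets the threshold.

    Assumes the client IP is the first whitespace-separated field,
    as in Apache common/combined log format.  The threshold is an
    illustrative cutoff, not a recommendation."""
    hits = Counter(line.split(None, 1)[0] for line in log_lines if line.strip())
    return sorted(ip for ip, n in hits.items() if n >= threshold)

# Illustrative use on synthetic log lines:
sample = ['10.0.0.1 - - "GET /TR/xhtml1/DTD/xhtml1-strict.dtd HTTP/1.0" 200'] * 3
print(build_blocklist(sample, threshold=2))  # ['10.0.0.1']
```

Feed the output to iptables, a router ACL, or a webserver deny list,
regenerated from the previous day's logs by cron.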

It would be interesting to further see how tightly (or loosely)
distributed those IPs (and hits) are in CIDR and/or ASN space.
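Short of a full ASN lookup against routeviews data, rolling the per-IP
hit counts up to /24s gives a quick feel for the clustering.  A sketch
(the /24 rollup is a crude proxy for real CIDR allocations, and the
function name and sample numbers are mine):

```python
from collections import Counter

def rollup_by_slash24(ip_counts):
    """Collapse per-IP hit counts into per-/24 totals.

    ip_counts: mapping of dotted-quad IPv4 address -> hit count.
    A crude proxy for CIDR clustering; a real analysis would match
    against announced prefixes from routeviews or ASN data."""
    nets = Counter()
    for ip, n in ip_counts.items():
        net = ".".join(ip.split(".")[:3]) + ".0/24"
        nets[net] += n
    return nets.most_common()

print(rollup_by_slash24({"192.0.2.1": 500, "192.0.2.9": 300,
                         "198.51.100.7": 100}))
# [('192.0.2.0/24', 800), ('198.51.100.0/24', 100)]
```

If a handful of netblocks dominate, working with those ISPs (or just
null-routing them) becomes a realistic option.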
> >  - Or are you getting slammed by spiders, custom agents, and possibly
> >    more nefarious tools (including possibly viruses)?  A mix of a
> >    robots.txt and agent-related IP blocking might come into play.
> I think we are also getting a fair number of hits for our home
> page from viruses that use 'connect to w3.org:80' as a connectivity
> check, but haven't confirmed that for sure; that's just a theory
> based on the fact that most of them have null user-agents and
> don't bother to follow the redirect from there to www.w3.org.

Interesting.  I usually use www.google.com as my connectivity test
these days.  Not that I'm writing many viruses... ;-)
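Such a check is a one-liner, which is exactly why it ends up pointed at
whatever big-name host the author thinks of first.  A sketch (host,
port, and timeout defaults are mine, purely for illustration):

```python
import socket

def is_online(host="www.google.com", port=80, timeout=3):
    """Crude connectivity check: can we open a TCP connection at all?

    This is the sort of probe that hammers whichever well-known host
    it targets, without ever issuing a proper HTTP request."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Note it never sends a request or follows a redirect, consistent with
the null-user-agent, no-redirect pattern described above.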
> >... and of course, there's always a salacious interest in seeing names
> >named.
> sorry, posting IP addresses would violate our privacy policy :)

Pressing the point:  to what level of aggregation does that policy
apply?  I see:

    We make no effort to identify public users of our site. No
    identifying data is disclosed to any third party for any purpose.
    Data that we collect is used only for server administration and Web
    protocol research.

at:  http://www.w3.org/Consortium/Legal/privacy-statement-20000612

A lawyer-weasel might argue that "Data that we collect is used only for
server administration and Web protocol research" provides a possible
out.

I further see:

    As is typical, we log http requests to our server. This means that
    we know the originating IP address of a user agent requesting a
    URL. We also know the Referer and User-Agent information
    accompanying an HTTP request. We do not log the specific identity
    of visitors. We occasionally analyze the log files to determine
    which files are most requested and the previous site or user agent
    which prompted the request. Our logging is passive; we do not use
    technologies such as cookies to maintain any information on ...
... which suggests that aggregated data not identifying any individual
_user_ in the personal or organizational sense, but which could be used
to track abuse through ISPs, would be valid.  If not for public release,
then certainly for working with other parties toward the ends of
improved server administration.

> >Also:  What regions / ASNs / CIDRs seem to be the most prevalent among
> >abuse sources, does the distribution differ markedly from putatively
> >legitimate traffic, and are any specific sources markedly more abusive
> >than the rest?
> Good questions, I'll have to look into that. Thanks for the ideas
> and for the link to http://www.routeviews.org

My pleasure.  I'm quite fond of that resource, really.
>> I don't see an easy solution -- you've sort of made your bed here with
>> the current DTD reference schema, and we can assume that technology
>> *will* be poorly implemented.  Might make sense to put DTDs into another
>> space other than http.
> We're pretty firmly committed to http URIs, for reasons
> documented here:
>     http://www.w3.org/TR/webarch/#uri-benefits
>     http://www.w3.org/TR/webarch/#namespace-document
> The fact that these URIs are easily dereferenceable is a feature,
> not a bug.

Use of URIs doesn't mandate use of http/ftp or other specific protocols,
though the W3C would likely need a widely accepted protocol and naming
scheme to identify these resources.  Referencing Wikipedia (usual caveats
noted), URIs can include such identifiers as ISBNs.  If it seems
prudent to define a URI scheme appropriate for reference works which
could be subject to frequent queries from billions to trillions of
clients, then doing so with an eye to:

  - Availability
  - Load management
  - Caching
  - Integrity
  - Cost effectiveness
  - Usability

... might make sense.  Something piggybacking existing widely-used
protocols, likely incorporating features of DNS, HTTP, and PKI, would
be a good starting point for a solution.
>> DNS, for example, *is* naturally a cache-and- forward system, and
>> it's been abused for all sorts of stuff (see Dan Kaminsky's
>> shenanigans).  While it's not a perfect solution, it's got several
>> elements which could be made useful and might want to be considered
>> for future DTD definitions and distribution of same.
> HTTP caching seems like an ideal solution to me, and it's a big
> peeve of mine that currently deployed software doesn't make
> better use of it, so I'm keen to get that fixed.

Yeah, I could see some wins here, and I suspect that a mix of aggressive
caching, multi-site hosting, and filtering is going to be your near-term
fix.
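For what it's worth, the client-side half of the caching fix isn't
hard: a well-behaved fetcher keeps a local copy of the DTD and
revalidates with a conditional GET rather than re-downloading every
time.  A minimal sketch (function names and the cache-path convention
are mine; a production client would also honor Cache-Control lifetimes
rather than revalidating on every use):

```python
import email.utils
import os
import urllib.error
import urllib.request

def ims_header(mtime):
    """Format a filesystem mtime as an HTTP-date for If-Modified-Since."""
    return email.utils.formatdate(mtime, usegmt=True)

def fetch_cached(url, cache_path):
    """Fetch url with HTTP revalidation; serve the cache on 304.

    Sends If-Modified-Since based on the cached file's mtime, so an
    unchanged DTD costs the server one cheap 304 instead of a full
    transfer."""
    req = urllib.request.Request(url)
    if os.path.exists(cache_path):
        req.add_header("If-Modified-Since",
                       ims_header(os.path.getmtime(cache_path)))
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
        with open(cache_path, "wb") as f:
            f.write(body)
        return body
    except urllib.error.HTTPError as e:
        if e.code == 304:  # not modified; the cached copy is current
            with open(cache_path, "rb") as f:
                return f.read()
        raise
```

The missing piece is getting the deployed XML toolchains to do
anything like this by default.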

Karsten M. Self <karsten@linuxmafia.com>        http://linuxmafia.com/~karsten
    Ceterum censeo, Caldera delenda est.