[linux-elitists] web server software for tarpitting?

Greg Folkert greg@gregfolkert.net
Tue Feb 12 12:07:16 PST 2008

On Tue, 2008-02-12 at 10:33 -0800, Gerald Oskoboiny wrote:
> * Evan Prodromou <evan@prodromou.name> [2008-02-12 12:37-0500]
> >On Sun, 2008-02-10 at 23:06 -0800, Gerald Oskoboiny wrote:
> >> The other day we posted an article [1] about excessive traffic
> >> for DTD files on www.w3.org: up to 130 million requests/day, with
> >> some IP addresses re-requesting the same files thousands of times
> >> per day. (up to 300k times/day, rarely)
> >>
> >> The article goes into more details for those interested, but the
> >> solution I'm thinking will work best (suggested by Don Marti
> >> among others) is to tarpit the offenders.
> >
> >...and not punish everybody else, right?
> Right, just punish those who are abusive.
> >>      W3C's current traffic is something like:
> >>
> >>        - 66% DTD/schema files (.dtd/ent/mod/xsd)
> >>        - 25% valid HTML/CSS/WAI icons
> >>        - 9% other
> >
> >It sounds like W3C has been having a problem satisfying its promises,
> >then. When you publicize an URL, like a DTD or schema, you're giving
> >some tacit permission to use that URL.
> Yes, but a single IP address re-fetching the same URL thousands
> or hundreds of thousands of times a day seems excessive.
> >It seems to me the way to solve your problem is to:
> >
> >     1. Clarify and publicize best practises for using W3C resources
> >        into a server use policy. How often is it OK to hit a W3C-hosted
> >        DTD? Once a day? Once an hour? Once a minute?
> Yeah, we'll have to figure something out there.
> >     2. For absolutely terrible bad-behavers, block them by IP number --
> >        or return a brief-as-possible HTTP 403 response with a link to
> >        your server use policy. It sounds like a quick way to cut down
> >        on your traffic and save some headaches.
> We have been doing this since May 2006 to no effect.
> Every 10 minutes, a cron job wakes up and scans the logs for the
> previous 10 minutes, and any IPs who requested the same resource
> more than 500 times in 10 minutes, or who made more than 6000
> requests in 10 minutes get blocked from the entire site for the
> next 24 hours with custom responses depending on the abuse:
> http://www.w3.org/Help/abuse/re-reqs
> http://www.w3.org/Help/abuse/fast-reqs
> But that hasn't accomplished much and we're still getting
> hammered so we're looking at tarpitting.

I would suggest using a reverse proxy with caching turned on, AND also
a simple rewrite-map script in that proxy that checks the number and
frequency of requests from each client and, once a client goes over the
limit, redirects it to your policy page.

> >     3. Build a content-distribution network (CDN) to free up your
> >        servers for the important stuff. You could either pony up the
> >        cash for a commercial CDN, or you could use W3C's goodwill in
> >        the Web community to put together a free and informal system of
> >        mirrors.
> We do have an automatic mirroring system and it's easy to add
> more mirrors, but it seems silly to scale up to handle traffic
> that doesn't have much business being there in the first place
> (in my opinion. Others on staff think we should just serve all
> these requests as quickly as we can.)
> >The whole tarpit thing sounds too smart by half. I think a more direct
> >approach is more ethical, and also sets a good example for other Web
> >publishers.
> I'd expect most medium/large sites have some kind of defensive
> measures in place to deal with abuse. Google and Wikipedia block
> all access from generic user-agents like Java/x and
> Python-urllib/x.

Here is an example of such a script, written in Perl. It uses a
separately compiled Perl only because I wanted to keep the mod_perl
stuff apart from the system Perl. I am forced to use CentOS at work, and
updates have broken things as far as content serving goes.

Call it /usr/local/bin/chkip_abuse.pl:

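        use strict;
        use lib '/usr/local/perl5.8.8/lib';
        use Cache::FileCache;
        # turn off output buffering; Apache waits for each answer line
        $| = 1;
        # requests allowed per IP before it gets flagged
        my $allowed = 30;
        # counters live under /tmp/abuse and expire after 1800 seconds
        my $cache = Cache::FileCache->new(
                                { 'cache_root' => '/tmp',
                                  'namespace' => 'abuse',
                                  'default_expires_in' => 1800 });
        # one IP per line in on STDIN, one answer (1 = abusive, 0 = ok) out
        while (<STDIN>) {
           chomp;
           my $ip = $_;
           # count this request; an unseen or expired IP starts from 0
           my $attempts = ( $cache->get( $ip ) || 0 ) + 1;
           $cache->set($ip, $attempts);
           if ( $attempts > $allowed ) {
              print 1;
           } else {
              print 0;
           }
           print "\n";
        }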

Here is the snippet for the reverse proxy's httpd.conf:

	RewriteMap chkip_abuse prg:/usr/local/bin/chkip_abuse.pl
	RewriteCond %{REQUEST_URI} ^/some-dtd-url
	RewriteCond ${chkip_abuse:%{REMOTE_ADDR}} =1
	RewriteRule . - [F]
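
A prg: map is just a program that reads one key per line on stdin and
writes one answer per line on stdout, so it can be exercised by hand
before Apache is involved. The helper below is a rough sketch for that;
the name probe_map and its argument order are made up for illustration.
It feeds the same IP to a map program a given number of times and prints
the answers on one line.

```shell
#!/bin/sh
# Hypothetical helper: send IP to a RewriteMap-style program COUNT times
# and print its one-character answers on a single line.
# usage: probe_map IP COUNT PROGRAM [ARGS...]
probe_map() {
    ip=$1; count=$2; shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s\n' "$ip"
        i=$((i + 1))
    done | "$@" | tr -d '\n'
    echo
}
```

For example, "probe_map 192.0.2.1 35 /usr/local/bin/chkip_abuse.pl"
should, if the script counts per request with a limit of 30, print a run
of 0s that turns into 1s near the end. Note that this really does bump
the counter for that IP in the cache, so use an address you do not care
about.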

Now, given that this is a quick hack, this may or may not work perfectly
and DOES NOT do any cleanup in /tmp. I use something similar to this for
distributed denial of service attacks on Apache servers and it
effectively QUELLS them nicely (especially used in a proxy setup!).
Though I use 300 seconds and 5 requests in that period for the URIs in

My cleanup script is a cronjob, run as needed, usually every couple of


        #!/bin/sh
        # stop if the cache directory is missing, rather than clean elsewhere
        cd /tmp/abuse || exit 1
        # "count" as the first argument means report only, remove nothing
        if [ "$1" = 'count' ]
        then debug='yes'
        fi
        if [ "$debug" = 'yes' ]
        then
           echo "Files to remove:"
           find . -type f -atime +2 -print | wc -l
           echo "Empty dirs to remove:"
           find . -type d -empty -print | wc -l
        else
           # files last accessed more than 2 days ago
           find . -type f -atime +2 -print | xargs -r rm
           # repeated so parents emptied by the previous pass go too
           find . -type d -empty -print | xargs -r rmdir
           find . -type d -empty -print | xargs -r rmdir
           find . -type d -empty -print | xargs -r rmdir
        fi

This purges all files older than 2 days, plus any empty directories. So,
yeah, go ahead and complain this ain't clean or what have you... it works
and has quelled up to 10M hits a day to a Missionary Ministry that I
host, quite effectively.

Now.. I just KNOW someone out there is going to complain. But this could
really be useful for this purpose.
PGP key 1024D/B524687C 2003-08-05
Fingerprint: E1D3 E3D7 5850 957E FED0  2B3A ED66 6971 B524 687C
Alternate Fingerprint: 09F9 1102 9D74  E35B D841 56C5 6356 88C0
Alternate Fingerprint: 455F E104 22CA  29C4 933F 9505 2B79 2AB2