[linux-elitists] web server software for tarpitting?

Gerald Oskoboiny <gerald@impressive.net>
Tue Feb 12 10:33:17 PST 2008


* Evan Prodromou <evan@prodromou.name> [2008-02-12 12:37-0500]
>On Sun, 2008-02-10 at 23:06 -0800, Gerald Oskoboiny wrote:
>> The other day we posted an article [1] about excessive traffic
>> for DTD files on www.w3.org: up to 130 million requests/day, with
>> some IP addresses re-requesting the same files thousands of times
>> per day. (up to 300k times/day, rarely)
>>
>> The article goes into more details for those interested, but the
>> solution I'm thinking will work best (suggested by Don Marti
>> among others) is to tarpit the offenders.
>
>...and not punish everybody else, right?

Right, just punish those who are abusive.

>>      W3C's current traffic is something like:
>>
>>        - 66% DTD/schema files (.dtd/ent/mod/xsd)
>>        - 25% valid HTML/CSS/WAI icons
>>        - 9% other
>
>It sounds like W3C has been having a problem satisfying its promises,
>then. When you publicize an URL, like a DTD or schema, you're giving
>some tacit permission to use that URL.

Yes, but a single IP address re-fetching the same URL thousands
or hundreds of thousands of times a day seems excessive.

>It seems to me the way to solve your problem is to:
>
>     1. Clarify and publicize best practises for using W3C resources
>        into a server use policy. How often is it OK to hit a W3C-hosted
>        DTD? Once a day? Once an hour? Once a minute?

Yeah, we'll have to figure something out there.

>     2. For absolutely terrible bad-behavers, block them by IP number --
>        or return a brief-as-possible HTTP 403 response with a link to
>        your server use policy. It sounds like a quick way to cut down
>        on your traffic and save some headaches.

We have been doing this since May 2006 to no effect.

Every 10 minutes, a cron job wakes up and scans the logs for the
previous 10 minutes. Any IP that requested the same resource more
than 500 times in that window, or that made more than 6000
requests in total, gets blocked from the entire site for the next
24 hours, with a custom response depending on the kind of abuse:

http://www.w3.org/Help/abuse/re-reqs
http://www.w3.org/Help/abuse/fast-reqs
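
For anyone curious, here is a rough sketch in Python of the kind
of threshold scan that job does. The log format, field positions,
and output here are assumptions for illustration, not our real
script; the actual blocking step (feeding the offending IPs back
into the web server config) is omitted.

    # Rough sketch of the 10-minute threshold scan described above.
    # Assumes combined-format access log lines for the last 10
    # minutes arriving on stdin.
    import sys
    from collections import defaultdict

    PER_URL_LIMIT = 500     # same resource more than 500 times / 10 min
    TOTAL_LIMIT = 6000      # more than 6000 requests overall / 10 min

    per_url = defaultdict(int)   # (ip, url) -> request count
    per_ip = defaultdict(int)    # ip -> total request count

    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 7:
            continue
        ip, url = fields[0], fields[6]
        per_url[(ip, url)] += 1
        per_ip[ip] += 1

    re_reqs = set(ip for (ip, url), n in per_url.items()
                  if n > PER_URL_LIMIT)
    fast_reqs = set(ip for ip, n in per_ip.items() if n > TOTAL_LIMIT)

    for ip in sorted(re_reqs):
        print("re-reqs %s" % ip)     # gets the /Help/abuse/re-reqs page
    for ip in sorted(fast_reqs - re_reqs):
        print("fast-reqs %s" % ip)   # gets the /Help/abuse/fast-reqs page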

But that hasn't accomplished much and we're still getting
hammered, so we're looking at tarpitting.
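
The idea being: rather than refusing the connection outright, keep
it open and dribble the response out very slowly, so the abusive
client ties up its own resources instead of ours. Something along
these lines, though this is only a sketch, not a deployed design;
a real tarpit would want non-blocking I/O or kernel-level tricks
rather than a thread per connection:

    # Sketch of a tarpit: trickle a tiny response out one byte per
    # second to blocked IPs so their connections stay stuck on us.
    import socket
    import threading
    import time

    BLOCKED = set(["192.0.2.1"])   # placeholder; fed from the log scan
    REPLY = ("HTTP/1.1 503 Service Unavailable\r\n"
             "Retry-After: 86400\r\n"
             "Content-Length: 0\r\n\r\n")

    def handle(conn, addr):
        conn.recv(4096)            # read (and discard) the request
        try:
            if addr[0] in BLOCKED:
                for ch in REPLY:   # one byte per second
                    conn.sendall(ch.encode("ascii"))
                    time.sleep(1)
            else:
                conn.sendall(REPLY.encode("ascii"))
        finally:
            conn.close()

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("", 8080))
    listener.listen(64)
    while True:
        conn, addr = listener.accept()
        threading.Thread(target=handle, args=(conn, addr)).start()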

>     3. Build a content-distribution network (CDN) to free up your
>        servers for the important stuff. You could either pony up the
>        cash for a commercial CDN, or you could use W3C's goodwill in
>        the Web community to put together a free and informal system of
>        mirrors.

We do have an automatic mirroring system and it's easy to add
more mirrors, but it seems silly to scale up to handle traffic
that doesn't have much business being there in the first place.
(That's my opinion; others on staff think we should just serve
all these requests as quickly as we can.)

>The whole tarpit thing sounds too smart by half. I think a more direct
>approach is more ethical, and also sets a good example for other Web
>publishers.

I'd expect most medium/large sites to have some kind of defensive
measures in place to deal with abuse. Google and Wikipedia block
all access from generic user-agents like Java/x and
Python-urllib/x.
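
That kind of filtering is just a pattern match on the User-Agent
header before serving anything. Roughly like this, where the
patterns are examples of generic default agents, not what Google
or Wikipedia actually match against:

    # Tiny sketch of user-agent filtering; the list is illustrative.
    import re

    GENERIC_AGENTS = re.compile(r"^(Java/|Python-urllib/|libwww-perl/)")

    def allowed(user_agent):
        # False means: return a brief 403 pointing at the use policy.
        return not GENERIC_AGENTS.match(user_agent or "")

    # allowed("Java/1.6.0_03")                          -> False
    # allowed("Mozilla/5.0 (X11; Linux) Firefox/2.0.0") -> True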

-- 
Gerald Oskoboiny <gerald@impressive.net>
http://impressive.net/people/gerald/


