Don Marti

Thu 24 Dec 2009 10:06:50 AM PST

Webspam feeds?

So here's the basic idea. You have a web site with a bunch of user-generated links that may or may not be spam. These links can come from anywhere: comments, wiki edits, trackbacks, referers. Meanwhile, on some other site, a page on your site could be the target of a spam link. (For example, somebody makes an account, and puts a bunch of crap in his or her profile page.)

So you want to find the spammy links on your site, and you want the rest of the webmasters in the world to clean up the spammy pages on their sites.

So what do you do? Well, when you clean up, you post the URIs that you don't like to a link reputation clearinghouse (LRC). (You can also report the good URIs that appear on your site as good, which helps the LRC decide that you're a legit user and helps keep those URIs from showing up as spam.) You might report to more than one LRC, since they all accept basically the same HTTP POSTs. All of this is easy to automate as part of the moderation process in your CMS, if you want.
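To make the reporting step concrete, here's a rough sketch of what a CMS hook might build. Nothing here is a real LRC API: the endpoint URLs and the `bad` and `good` form field names are made up for illustration. The only thing taken from the idea above is that every LRC accepts roughly the same HTTP POST. The code builds the requests but doesn't send them.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_report(lrc_endpoint, bad_uris, good_uris=()):
    """Build (but do not send) one form-encoded POST for one LRC.

    The field names "bad" and "good" are hypothetical, not a real protocol.
    """
    fields = [("bad", uri) for uri in bad_uris]
    fields += [("good", uri) for uri in good_uris]
    body = urlencode(fields).encode("ascii")
    return Request(
        lrc_endpoint,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def build_reports(lrc_endpoints, bad_uris, good_uris=()):
    """Same report for several LRCs, since they take the same kind of POST."""
    return [build_report(e, bad_uris, good_uris) for e in lrc_endpoints]


# Hypothetical endpoints; pass each Request to urllib.request.urlopen to send.
reports = build_reports(
    ["https://lrc-one.example/report", "https://lrc-two.example/report"],
    bad_uris=["http://spam.example/pills"],
    good_uris=["http://example.org/docs"],
)
```

A moderation hook could call something like `build_reports` whenever a moderator deletes a comment or approves a link, and send the results in the background.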

The LRC does some digestion. (Naturally, spammers are going to try to clobber it with bogus reports, and naturally, some links are going to be reported the wrong way by mistake.) Each one does its own digestion and reputation magick internally.
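Every LRC would have its own secret sauce here, so take this as one toy possibility and not a description of how any real service works. It weights each report by how much the LRC trusts the reporter, so a pile of reports from unknown accounts can't easily outvote a couple of established sites. The `trust` table, the `default_trust` value, and the scoring formula are all assumptions.

```python
from collections import defaultdict


def digest(reports, trust, default_trust=0.1):
    """Combine reports into one score per URI.

    reports: iterable of (reporter, uri, verdict), verdict "good" or "bad".
    trust:   dict mapping reporter -> weight (hypothetical reputation value).
    Returns {uri: score} with score in [-1, 1]; positive leans spam.
    """
    weighted = defaultdict(float)
    total = defaultdict(float)
    for reporter, uri, verdict in reports:
        if verdict not in ("good", "bad"):
            raise ValueError(f"unknown verdict: {verdict!r}")
        weight = trust.get(reporter, default_trust)
        weighted[uri] += weight if verdict == "bad" else -weight
        total[uri] += weight
    return {uri: weighted[uri] / total[uri] for uri in total if total[uri] > 0}


scores = digest(
    [
        ("site-a", "http://spam.example/pills", "bad"),
        ("site-b", "http://spam.example/pills", "bad"),
        ("new-account", "http://spam.example/pills", "good"),  # bogus report
    ],
    trust={"site-a": 1.0, "site-b": 1.0},
)
```

With these numbers the unknown reporter's "good" vote barely moves the score, which stays strongly on the spam side.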

Then the LRC generates RSS feeds by domain. You subscribe to one or more feeds from one or more services, and when you see a possibly bad page on your domain, you check it out. You can pick and choose among LRCs, since some will end up doing better digestion than others. Since LRC feeds can be roughly standardized, you can automate this step as well. (New user, nobody answers his or her forum posts, and the profile page shows up in an LRC feed? Put on probation until a moderator can approve.)
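Here's a sketch of the subscriber side, assuming the feed is ordinary RSS 2.0 with one `<item>` per reported page and the page's URL in `<link>`. That layout is a guess at what a "roughly standardized" feed might look like, and the sample feed, `flagged_links`, and `users_on_probation` are invented for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Made-up sample of what an LRC feed for one domain might contain.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>Reported links for example.org</title>
<item><link>http://example.org/user/1234</link></item>
<item><link>http://other.example/page</link></item>
</channel></rss>"""


def flagged_links(rss_text, domain):
    """Return the item links in the feed that point at `domain` or a subdomain."""
    root = ET.fromstring(rss_text)
    links = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        host = urlsplit(link).hostname or ""
        if host == domain or host.endswith("." + domain):
            links.append(link)
    return links


def users_on_probation(links, profile_url_to_user):
    """Map flagged profile URLs back to user names (lookup table is hypothetical)."""
    return {profile_url_to_user[l] for l in links if l in profile_url_to_user}


flagged = flagged_links(SAMPLE_FEED, "example.org")
held = users_on_probation(flagged, {"http://example.org/user/1234": "newuser1234"})
```

In a real setup you'd fetch the feed text from each LRC you subscribe to and hand `held` to whatever your CMS uses for a moderation queue.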

Big web sites that host a lot of user-generated content might want to run their own LRCs. Another logical place to put one is at a site that does link sharing or URL shortening. Individual webmasters might subscribe to just one LRC, and LRCs might subscribe to each other.

So here's a simple, easy-to-use one: Aloodo. Right now it's seeded with good and bad links from this site, along with a few other public sources. There's also a simple way to query the good and bad lists, so, for example, you can check out a new user's profile page and forum postings before deciding whether to make them public. If you have a webspam problem, let's talk about how this could be useful to you, either as a customized subscription or as an in-house install.
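The query interface isn't spelled out here, so this sketch skips the network entirely and treats the good and bad lists as plain Python sets you've already fetched somehow. `review_post`, the three outcomes, and the URL regex are all assumptions, just to show the shape of a pre-publication check.

```python
import re

# Rough URL matcher; good enough for a sketch, not a full URI parser.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def review_post(text, bad_uris, good_uris):
    """Decide what to do with a new profile page or forum post.

    Returns "hold" if any link is on the bad list, "publish" if every link
    is on the good list (or there are no links), otherwise "review".
    """
    links = [m.rstrip(".,;:!?)") for m in URL_RE.findall(text)]
    if any(link in bad_uris for link in links):
        return "hold"
    if all(link in good_uris for link in links):
        return "publish"
    return "review"


bad = {"http://spam.example/pills"}
good = {"http://example.org/docs"}

decision = review_post("Cheap stuff at http://spam.example/pills.", bad, good)
```

Anything that comes back as "review" is a link nobody has an opinion on yet, which is probably the right time to ask a human.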