Don Marti

Thu 16 Jul 2009 12:06:35 PM PDT

Protecting newsrooms from Web parasites

Greg Sterling covers the latest in the newspapers-versus-web beef. "The Huffington Post’s use of an Associated Press version of SI’s report was initially tops on Google, which meant that it, and not SI.com, tended to be the place readers clicking through to get the gist of the breaking scandal would land."

That's bad. Whatever you say about the copyright status of facts, it's a bad outcome when a real news operation does the work of reporting a story, then either a meta site or a site that re-reports the story gets the web traffic. As long as Google News treats the original story and the echo the same, the incentives are for the whole news business to revolve around not who can get the story, but who can game the search engines better.

The problem with Google News isn't that it crawls the news, it's that it crawls original stories, re-reports, and parasitic excerpts the same way. Search engines aren't smart enough to tell which of several textually similar stories is the original. A search engine could go with the first crawled, but that's too likely to reward the site with the best web tweaker, not the people who got the story, especially if the original site wants to send a notification to subscribers first.

The situation is bad enough to make some people argue for ripping up antitrust and copyright law. But that's the kind of measure that will end up rewarding not newsrooms, but the kind of sharp lawyers and lobbyists who pay attention to antitrust and copyright law. So let's learn to use the web as the web. The web is not just Free Paperboy, Spawn of Santa Claus. It's a "web" because it has connected nodes.

Of course, news is a flow of stories, not a web of hypertext. But there's one vital exception. There has always been a citation norm in journalism. If another paper gets the story first, and you read it before you write your story, you have to write something like, "as reported in the Dubuque Star."

What we need is to keep that norm, and apply it to the web in the most natural way. On the web, citation should include the web's form of citation, a link. Either use an "a" tag with an appropriate "rel", or use a "link" tag. Something like this:

<a href="http://example.com/2009/xyzzy"
rel="original-report">Example News reported,</a>

Give the news crawlers a hint, and give the other publication's staff their due.
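
To make the crawler side concrete, here is a minimal sketch of how a news crawler might read that hint. It uses Python's standard html.parser module; the class name OriginalReportFinder and the function find_original_reports are made up for illustration, and nothing here describes how any real search engine works.

```python
from html.parser import HTMLParser


class OriginalReportFinder(HTMLParser):
    """Collect href values of <a> and <link> tags marked rel="original-report"."""

    def __init__(self):
        super().__init__()
        self.originals = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # rel holds space-separated tokens, so split before comparing.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "original-report" in rel_tokens and attr.get("href"):
            self.originals.append(attr["href"])


def find_original_reports(html_text):
    """Return the URLs that a page cites as the original report."""
    finder = OriginalReportFinder()
    finder.feed(html_text)
    return finder.originals


# The example link from the text, embedded in a re-reported story.
page = '''<p><a href="http://example.com/2009/xyzzy"
rel="original-report">Example News reported,</a> and so on.</p>'''
print(find_original_reports(page))  # ['http://example.com/2009/xyzzy']
```

A crawler that finds such a link could rank the cited URL ahead of the page citing it, which is the whole point of the hint.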

That's a good start, but that depends on the re-reporting site to acknowledge its source. Bona fide reporters and editors would do this, because they'll want to be cited in return, but what about meta sites that never break new stories? To handle those, the origin sites can start adding the corresponding HTML "rev" attribute. When someone re-reports or summarizes your story, link to that page with rev="original-report". Let the crawlers know that the page of Google-bait summary and comments is about your story. Watch the referer log, and when you see a parasite picking up your story, add the tag. Pretty straightforward to (mostly) automate.
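
The "mostly automate" part might look something like the sketch below. It assumes the web server writes Apache "combined" format logs, and the host name example.com, the sample log lines, and the function names external_referers and rev_links are all hypothetical. It only collects off-site referers per story and drafts the links; deciding which referers are really re-reports is still a job for a person.

```python
import re
from collections import defaultdict
from html import escape
from urllib.parse import urlparse

# Assumes Apache "combined" log format; adjust the pattern for other servers.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referer>[^"]*)"'
)

OWN_HOST = "example.com"  # hypothetical origin site


def external_referers(log_lines, own_host=OWN_HOST):
    """Map each requested story path to the set of off-site pages linking to it."""
    found = defaultdict(set)
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        referer = match.group("referer")
        host = urlparse(referer).hostname
        # Skip empty referers ("-") and links from our own site.
        if host is None or host == own_host or host.endswith("." + own_host):
            continue
        found[match.group("path")].add(referer)
    return found


def rev_links(referers):
    """Draft one rev="original-report" link per referring page, for human review."""
    return [
        '<a href="{0}" rev="original-report">{1}</a>'.format(
            escape(url, quote=True), escape(urlparse(url).hostname)
        )
        for url in sorted(referers)
    ]


# Made-up sample log lines.
sample_log = [
    '10.0.0.1 - - [16/Jul/2009:12:06:35 -0700] "GET /2009/xyzzy HTTP/1.1" 200 5120 "http://meta.example.net/summary" "Mozilla/5.0"',
    '10.0.0.2 - - [16/Jul/2009:12:07:01 -0700] "GET /2009/xyzzy HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '10.0.0.3 - - [16/Jul/2009:12:07:09 -0700] "GET /2009/xyzzy HTTP/1.1" 200 5120 "http://example.com/index" "Mozilla/5.0"',
]

for path, referers in external_referers(sample_log).items():
    print(path)
    for link in rev_links(referers):
        print("  " + link)
```

The drafted links would go on the original story page once an editor confirms that the referring page is a re-report or summary rather than, say, an ordinary blog link.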

News organizations can't roll back the progress of tools for finding and recommending news. If, somehow, publishers could shut down Google News, recommendations would continue at some other site, or even with a desktop application. What news organizations can do is work with users and with crawlers, against parasitic meta sites.