Saturday, September 17, 2005

Who is stealing GruntDoc's blog posts?

Another medblogging milestone...but this time, it's not good.
GruntDoc reports "Content Theft":
Scanning my Technorati watchlist (vanity: it tells bloggers who is linking to them) today I noticed quite a lot of links from a site called "Physician-Desk-Reference", which is apparently not associated with the actual PDR that's used as a source of last resort when looking up medications.

Looking at the site it occurred to me that I'd seen these posts before, ALL of them, as I'd written them. This site is reposting my posts with about a 5 day delay, then linking to me as "more" at the end of the entry. I have no idea why anyone would do this. The contact info on the front page is blank, so I cannot ask whoever set this up. (I didn't and this isn't an inside job if you're wondering).
What is going on? Marketing Sherpa describes two types of content thieves: (a) admiring fans of your blog, who lift entire posts because they like them, and (b) admiring fans of Google AdSense revenue.
The second group of thieves are profit-driven...They publish as many blogs as possible populated with lifted content, and sit back to collect commission checks from Google on ad clicks. Some have created automated programs that suck up content from around the Web and post it without need for a human editor.

Worried publishers are forming task forces now to begin to address this threat. Ideas include limiting bots' site access and requiring registration. In the end, more walls go up around the Web and an atmosphere of distrust reigns. Too bad....
How can you avoid content theft? Ann's Sherpablog has suggestions: add a formal copyright line and "Terms & Conditions" to your blog; shorten your RSS feeds, releasing excerpts instead of the full text; and embed an "invisible" copyright line in your posts. Furthermore,
(T)ell Google in writing if someone steals your copyrighted materials.

As I noted last week, one reason some people steal others' content is because they want to get Google AdSense revenue with content-rich pages without the effort of actually creating content.

To that end, many sites I've seen appear to be using automated bots to scrape content from other sites, and then post hundreds, even thousands of pages online with AdSense listings. I'm not going to accuse any sites in particular here; suffice it to say it's a quickly increasing problem and loads of folks in the online publishing community have been noticing it.

Here's what Barry Schnitt in Google's PR department said in response to my query about this problem:

"Copyright violations are against our policies. We ask that the owner of the copyrighted material comply with the Digital Millennium Copyright Act (the text of which can be found at the U.S. Copyright Office website: http://lcWeb.loc.gov/copyright/) and other applicable intellectual property laws. In this case, this means that if we receive proper notice of infringement, we will forward that notice to the responsible web site publisher. To file a notice of infringement with us, you must provide a written communication."

My take on this? It's not awfully reassuring. Google seems to want to put the policing ball in the copyright owner's corner despite the fact that few of these stolen content sites would exist if it were not for AdSense revenues.

Plus, he didn't comment at all on my second question, which was in essence, what about policing those sites -- known in the industry as "Google Spam" -- that post such short snippets of scraped content that they don't actually break copyright law. They dance around the law and usually present no real value to the visitor.

Again, these sites are a burgeoning cottage industry that appears to be wholly funded by AdSense revenue potential...
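Two of the Sherpablog suggestions mentioned earlier, excerpt-only feeds and an embedded copyright line, can be illustrated with a short sketch. This is only one possible reading of those suggestions, not anything Sherpablog or Google published: the function name `make_feed_excerpt`, its parameters, the 50-word default, and the use of a hidden paragraph for the "invisible" line are all assumptions made for illustration.

```python
import html


def make_feed_excerpt(post_text, post_url, author, year, max_words=50):
    """Build a feed entry body: a short excerpt plus a hidden copyright line.

    Hypothetical helper; the names and the hidden-paragraph approach are
    illustrative assumptions, not part of any blogging platform's API.
    """
    words = post_text.split()
    excerpt = " ".join(words[:max_words])
    # Mark the cut only when the post was actually shortened.
    if len(words) > max_words:
        excerpt += "..."

    notice = f"Copyright {year} {author}. Originally published at {post_url}"

    # Escape both pieces so the post text cannot inject markup into the feed.
    return (
        f"<p>{html.escape(excerpt)}</p>\n"
        f'<p style="display:none">{html.escape(notice)}</p>'
    )


if __name__ == "__main__":
    # Placeholder values; substitute a real post, URL, and author name.
    print(make_feed_excerpt(
        "An example post with more words than the feed will carry.",
        "http://example.com/post",
        "Example Author",
        2005,
        max_words=5,
    ))
```

A scraper that republishes the feed entry verbatim would carry the hidden notice along with the excerpt, which may make copies easier to find by searching for the notice text. A scraper that strips markup would defeat it, so this is a deterrent at best.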
Update: an article about spam blogs.

2 Comments:

Anonymous Jonathan said...

I don't think that it's fair to put the burden for this new cottage industry on AdSense. The pay-per-click model of advertising has been around almost as long as the Internet. There's nothing new here, except that there is now a single source most people go to, and the technology now allows for automatic plagiarism.

It's much easier for copyright holders to search for their works than it is for Google to search all of their sites against the entire Web. Even with Google's search capabilities, that's a tremendous burden.

Besides, we have to stand up for our own rights and that's something I try to teach on my site. A little bit of plagiarism "self defense" goes a long way.

We can't rely on anyone to protect us, not on the Internet especially. There are effective and simple things that Webmasters can do to protect their works and deal with incidents of plagiarism. It's just a matter of taking the time to learn how to do/use them.

But that's just my opinion, I could be wrong.

7:37 AM  
Anonymous Anonymous said...

Thanks for the mention. As I just changed my blogging software, and failed to keep the original post links, here's the working link to the blog post above:
http://www.gruntdoc.com/2005/09/physiciandeskreference_weirdne.php

GruntDoc

1:43 PM  
