2 July 2008
Using Google for duplicate content detection
A month or so ago I was looking at a camping equipment website called outdoorpros.com. I love this site and would recommend it to anyone. Being an SEO, however, I couldn’t help but notice that they were using some suspicious-looking paginated links on their category pages, so after getting all excited about my new camping stove I decided to take a quick look in their Google site index to see how search engines might be indexing the site.
This post covers some basic tips on “site diagnostics”, specifically duplicate content detection using Google search. These are checks that every SEO should do as part of investigating potential issues that may negatively impact search engine positioning.
Here’s the approach I always follow, using outdoorpros.com as an example site:
1) Use your common sense
Let’s start by running a site:www.outdoorpros.com query in Google search.

As you can see from the screen grab, Google is reporting 72,100 indexed pages. Is that too many? If so you may have some kind of duplicate content issue.
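If you want to put a rough number on “too many”, here is a minimal sketch in Python. It simply divides the indexed page count by the number of pages you would expect the site to have. The function names and the threshold of 2.0 are my own assumptions for illustration, not a rule from Google; the example figures are the 100,000 indexed pages and 5,000 products used in the recap later in this post.

```python
def index_bloat_ratio(indexed_pages: int, expected_pages: int) -> float:
    """Return how many indexed URLs exist per page you expect the site to have."""
    if expected_pages <= 0:
        raise ValueError("expected_pages must be a positive number")
    return indexed_pages / expected_pages


def looks_bloated(indexed_pages: int, expected_pages: int,
                  threshold: float = 2.0) -> bool:
    """Flag the index as suspicious when the ratio exceeds the threshold.

    The default threshold is an arbitrary, hypothetical value: pick one that
    makes sense for the site you are looking at.
    """
    return index_bloat_ratio(indexed_pages, expected_pages) > threshold


if __name__ == "__main__":
    # 100,000 indexed pages against 5,000 products
    print(index_bloat_ratio(100_000, 5_000))
    print(looks_bloated(100_000, 5_000))
```

A high ratio doesn’t prove anything on its own, since category, brand and content pages all add to the total. It is only a prompt to go and look more closely.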
2) Skip around the index and see if you spot something weird
Ok, not terribly technical advice, but it doesn’t have to be.

Click to around page 10 and take a quick look at the indexed URLs. If you don’t see anything weird, skip ahead another 10 pages. Go as far to the back of the index as you possibly can, because that’s where the good bad stuff usually hides. You’re looking out for malformed URLs, query strings (like ?=sessionid or ?first_page etc) or many repeated results with the same title / description.
In the case of our friends at outdoorpros.com you can see straight away that something doesn’t look right.
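If you have copied a batch of results into a list, the same eyeballing can be roughly sketched in Python. This is only an illustration: the set of suspect parameter names and both function names are assumptions of mine, and the input is assumed to be (url, title) pairs that you have collected by hand.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

# Hypothetical list of parameter names worth a second look; extend it per site.
SUSPECT_PARAMS = {"sessionid", "first_page"}


def flag_url(url: str) -> list[str]:
    """Return the reasons (if any) why an indexed URL deserves a closer look."""
    reasons = []
    query = urlsplit(url).query
    if query:
        reasons.append("query string")
        # keep_blank_values=True keeps bare parameters such as "?first_page"
        names = {name.lower() for name, _ in parse_qsl(query, keep_blank_values=True)}
        if names & SUSPECT_PARAMS:
            reasons.append("suspect parameter")
    return reasons


def repeated_titles(results: list[tuple[str, str]]) -> dict[str, int]:
    """Count titles that appear on more than one (url, title) result."""
    counts = Counter(title for _, title in results)
    return {title: n for title, n in counts.items() if n > 1}
```

Anything `flag_url` returns a reason for, and any title `repeated_titles` reports, is a candidate for the closer checks in the following steps.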
That set of results tells me a lot about this site, and I’ve only been looking at it for 30 seconds. We’ve identified some query strings in the index. They might be causing duplicate content. How do we confirm that though?
3) Assessing if there really is a problem on individual page types
Take one of the query strings we saw in the index. Let’s use:
?attribute_value_string
Is that indexed string causing a problem? Let’s see. The url was:
http://www.outdoorpros.com/Brands/Kershaw/96?attribute_value_string%7CColor=Pink
It looks like a brand / category page for Kershaw Knives. Checking if that page is indexed with and without the query string is the first step. Here’s the cached page with a query and without. Whoops. There are at least two copies of this page in the index.
But those pages have different content? Well, yes, in that the products the page links to are different, but the brand category page is the same every time. Each copy of the page has the same meta title and description - it’s duplicating! It may be why Outdoorpros don’t rank organically for “Kershaw” or “Kershaw knives”.
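The “with and without the query string” comparison can be sketched in Python too. This is a minimal example, assuming you already have (url, title) pairs from the index; the function names are hypothetical. It strips the query string from each URL and groups the URLs that end up with the same base address and the same title.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def duplicate_groups(pages: list[tuple[str, str]]) -> dict[tuple[str, str], list[str]]:
    """Group indexed URLs by (base URL, title).

    Only groups with more than one URL are returned, i.e. likely duplicates
    of the same underlying page.
    """
    groups = defaultdict(list)
    for url, title in pages:
        groups[(strip_query(url), title)].append(url)
    return {key: urls for key, urls in groups.items() if len(urls) > 1}
```

For the Kershaw example, `strip_query` applied to the URL above gives http://www.outdoorpros.com/Brands/Kershaw/96, so the version with the query string and the version without it land in the same group whenever their titles match.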
4) Deciding how many URLs you have in the index are duplicated
That’s quite easy. To get a feel for the number of URLs that are duplicating, just do a query like
site:www.outdoorpros.com inurl:attribute_value_string
This site looks to have at least 13,000 URLs that contain the query string. Drill down a little by picking a few different titles from indexed pages such as:
site:www.outdoorpros.com intitle:"Buck Knives - OutdoorPros.com"
There are 65 pages with that exact <title>. Doh!
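If you find yourself typing a lot of these queries, a small helper can build them for you. Here is a rough sketch in Python; `site_query` and `google_search_url` are names I have made up for illustration, and the code only builds the query text and a search URL for you to open in a browser. It does not fetch or parse any results.

```python
from typing import Optional
from urllib.parse import urlencode


def site_query(host: str, inurl: Optional[str] = None,
               intitle: Optional[str] = None) -> str:
    """Build a site: query, optionally narrowed with inurl: and/or intitle:."""
    parts = [f"site:{host}"]
    if inurl:
        parts.append(f"inurl:{inurl}")
    if intitle:
        # Straight double quotes ask for the exact title phrase.
        parts.append(f'intitle:"{intitle}"')
    return " ".join(parts)


def google_search_url(query: str) -> str:
    """Return a Google search URL for the query, ready to paste into a browser."""
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    query = site_query("www.outdoorpros.com", inurl="attribute_value_string")
    print(query)
    print(google_search_url(query))
```

Note the straight quotes around the title: curly quotes pasted from a blog post may not be treated as an exact-phrase match.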
5) How do I fix this?!
Ok, first of all let me recap on what we’ve done so far. We’ve used a basic site: command and taken a common sense snapshot of how many pages there are in the index. When you’re an e-commerce site with 100,000 indexed pages and only 5,000 products, you might need to think about it.
Next, we drilled down by just checking Google’s index in random positions to see if there was anything that didn’t look right. Something was definitely wrong. By carrying out a query that told us how many instances of the query string were present, we had a total number of indexed pages using that string. Finally, we picked a specific page <title> and found 65 instances of the same page.
There is a solution, and sadly just nofollowing paginated links won’t work. The damage has been done - you have some indexed URLs and some housekeeping to do.
I’ll offer some quick advice in this post and cover fixing duplicate content issues in more detail in my next post. Add my RSS feed to get that post when it’s done. In the meantime, my best advice to outdoorpros.com is that they need to create a list of all of the query strings that describe paginated pages and set up a rule to noindex,follow anything above the value of the first page.
Here’s my example:
Let’s look at their pants page.
It’s a perfectly good pants page and I’ll hear no sniggering at the back of the class please..
The main url to this page is:
http://www.outdoorpros.com/Cat/Pants/5/List
Check out the paginated navigational links. Each one of them produces a different url that looks like this:
http://www.outdoorpros.com/Cat/Pants/5/List?first_answer=13
The fix? A simple noindex,follow should be added in the page head whenever that query string is generated.
<html>
<head>
<title>...</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
</head>
This way, the many versions of the same page will be crawled but not indexed. All links on the page will be followed so the products will still be added to Google’s index. You’ve identified the canonical version of your pants page and Google will be grateful. Job done.
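The rule itself could look something like this on the server side. It’s a minimal Python sketch, not how outdoorpros.com actually builds its pages: the function name is hypothetical, and it assumes the first page of a category never carries the first_answer parameter, so any URL that does carry it is treated as a paginated copy.

```python
from urllib.parse import parse_qsl, urlsplit

# Parameter taken from the pants page example; add the other pagination
# parameters from your own list of query strings.
PAGINATION_PARAMS = {"first_answer"}


def robots_meta_for(url: str) -> str:
    """Return the robots meta tag to print in <head> for this URL.

    Paginated variants get noindex, follow; anything else gets no tag,
    which leaves the page indexable as normal.
    """
    query = urlsplit(url).query
    names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
    if names & PAGINATION_PARAMS:
        return '<meta name="robots" content="noindex, follow">'
    return ""
```

For http://www.outdoorpros.com/Cat/Pants/5/List?first_answer=13 this returns the noindex, follow tag, while the main pants URL gets an empty string and stays in the index.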
14 Comments currently posted.
Dr. Pete says:
richardbaxterseo says:
Definitely, Pete - thanks for dropping by!
Tom says:
Nice post! Solid tips
Only thing is I think your code at the end is wrong - should say noindex, follow not noindex, nofollow
More of this kind of thing please!
richardbaxterseo says:
Nicely spotted! Fixed - Cheers Tom, hope you’re well!!
BottomTurn says:
very good post Richard.
If I may ask: couldn’t you use the Google webmaster tools to get information like duplicate titles directly? It works faster for you and scans the whole site.
richardbaxterseo says:
Hi there BottomTurn, you’re right, you could use webmaster tools to get duplicate titles. Webmaster tools is definitely one source of information, though you won’t get the details that you need to perform a complete diagnosis. WMT is a very important step, and I’d put that part of the diagnostic under “Use your common sense” - good call.
Sonali Sengupta says:
thanks for sharing the handy tips.
Anu says:
Great Post!!!
Very Handy Trick.
Google is the best SEO tool, isn’t it? No need to spend tons on that affiliate crap, right :).
Jerry Okorie says:
Very helpful, and like you said, it might be difficult for a site with over 100,000-odd pages. SEO is more about thinking like a tester but with a good understanding of search engines - search for what doesn’t work in a site and you will find it, then apply your knowledge to the tools available and you’ll find a solution.
homeyessite says:
no watch ocean minor stay busy ocean stone free man ocean
richardbaxterseo says:
Oh, that’s great, homeyessite - compelling and rich commentary. Quite relevant too.
Busby says:
Simple tools, extreme result.
Thanks guys
Encontre Conteúdo Duplicado Com o Google | Mestre SEO says:
[...] fonte: seogadget [...]
Lodewijk says:
Interesting post, didn’t know it was that easy and fast to find some malfunctions on the site.



Nice review of how a couple of basic tools can turn into advanced techniques. Just like good games, good SEO tools take moments to learn and a lifetime to master.