Search engine optimisation is a tricky field at the best of times, let alone when the discussion turns to how different elements affect crawlability and PageRank flow. A search engine spider has a simple directive: if it encounters a link, it follows it, indexes the content on that page, and continues on its merry way.
As a webmaster, you need to decide whether there are any sections of your site that you do not want search engines to access. If you are like me, there is a pretty good chance you do not want to let the search engines into every nook and cranny, as doing so simply wastes resources, bandwidth and, in some cases, the link equity you have worked so hard for.
Why would I want to block search engines from my site?
This is a very good question and something that those new to search engine optimisation sometimes struggle with. If you have frequented any blogs or forums that discuss SEO, you will no doubt have encountered debate about topics such as link building, keyphrase analysis, internal linking, PageRank sculpting and duplicate content issues.
Put briefly, if your online shop has a large product range and is going to compete in any kind of competitive niche, you will more than likely have to control search engine access to tackle a few of these issues. Chief among them: directing your link equity around the site so that all your product pages get indexed, and dealing with the often unavoidable duplicate content created by most e-commerce software.
Tackling Duplicate Content Issues
Duplicate content issues arise when a search engine has indexed more than one page that it deems insufficiently different from, or an exact duplicate of, another page. Duplicate content can occur in a number of ways, whether through content theft or inadvertently through the general functionality of your e-commerce software.
Through no fault of their own, most shopping cart applications present numerous duplicate content issues to search engines. For example, if every product page on your website creates an “add this product to your shopping cart” link, each of those links can give the search engine a technically different URL for what is, in terms of content at least, essentially the same page. All of these iterations of the shopping cart page are extremely important for the general customer functionality of your site, but to a search engine they are totally useless and a waste of crawl resources.
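To illustrate, suppose your cart page lives at a hypothetical /cart.php and every “add to cart” link appends the product ID. The spider could then discover URLs such as:

yourdomain.com/cart.php?add=101
yourdomain.com/cart.php?add=102
yourdomain.com/cart.php?add=103

To a shopper these all lead to the same cart; to a search engine they are three distinct URLs serving near-identical content.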
One of the ways we can generally tackle this problem is to give the search engine a set of instructions telling it not to include the shopping cart page, in any of its forms, within its search index. These instructions apply only to the search engine spider and do not disrupt the general functionality of your website.
Using the robots.txt file
The robots.txt file is a plain text document that you place at the root level of your domain name, e.g. yourdomain.com/robots.txt. Each time a search engine crawler visits your site it will fetch this document and treat the information within it as a set of instructions on which parts of your website it may crawl.
If, for example, you decide that you do not want any search engine to visit your shopping cart page, you would use the following format (assuming, purely for illustration, that your cart lives at /cart/):
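User-agent: *
Disallow: /cart/

The asterisk in the User-agent line means the rule applies to every spider that obeys the robots protocol, and the Disallow line names the path to keep them out of. The /cart/ path is only a placeholder; substitute whatever path your shopping cart software actually uses, such as /basket/ or /cart.php.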
Using the meta robots tag
Another way of giving a search engine spider instructions is at the individual page level, using the meta robots tag. The meta robots tag can be placed in the <head> area of any page on your website and looks like this:
<meta name="robots" content="index">
<meta name="robots" content="noindex">
You can also give the search engine spider further instruction using the meta robots tag by telling it whether or not to follow the links on that particular page:
<meta name="robots" content="noindex, follow">
<meta name="robots" content="noindex, nofollow">
An in-depth tutorial on using robots.txt can be found at SEO Book!
Should I use this strategy on my shopping cart?
If you have more than a handful of products on your website, there is a strong chance that you will at some stage need to deploy tactics to handle duplicate content and/or PageRank flow. The decision on how to tackle duplicate content issues will generally come down to more than one factor, one of which is how PageRank will continue to flow through your website after you make any changes.
A basic understanding of how PageRank flows through your website is probably one of the more important concepts to grasp, particularly on large, complex websites such as an online shop. For the purposes of this article, PageRank is essentially the term used to describe Google’s link analysis algorithm, which assigns a numerical value to a web page in order to compare its importance against all the other pages within the dataset.
Put simply, each link that a web page receives, whether it is from an external site or internally from links within your own site, acts as a vote of confidence in the quality of content provided on the page that has been linked to. Not only does PageRank look at the number of links that a page receives, it also looks at how “important” the page that has provided the link is. A page viewed higher in importance passes more credibility to the page it is linking to, ultimately helping it also become more “important”.
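For the mathematically curious, the originally published PageRank formula captures this idea exactly. Here d is a damping factor (commonly quoted as 0.85), T1 … Tn are the pages linking to page A, and C(T) is the number of outbound links on page T:

PR(A) = (1 - d) + d * ( PR(T1)/C(T1) + … + PR(Tn)/C(Tn) )

In plain terms, every page linking to A passes on a share of its own PageRank, and that share is split between all of the outbound links on the linking page, which is why a link from a page with few outbound links is worth more than one from a page with hundreds.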
If you would like to learn more about how PageRank flows and passes link equity from page to page I highly recommend reading this article by Si Fishkin, who does a far better job explaining it than I ever could.
Which method should I use on my e-commerce store?
Ultimately, the method you choose will be determined by your shopping cart software’s functionality, the areas of your site you wish to block and your confidence in making such changes. If you have a choice and complete control over disallowing access to individual pages, my preference is always the meta robots tag. I’ll explain why!
Using the robots.txt file to block access can produce a couple of less than ideal results when disallowing search engines at the page level. That said, if you are looking to block an entire directory, such as your images folder, it is the perfect tool to use, as the quick example below shows, but the finer points are a story for another day.
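As a sketch, a robots.txt blocking a hypothetical /images/ directory would look like this:

User-agent: *
Disallow: /images/

This keeps every compliant spider out of /images/ and everything beneath it; swap in whatever directory name your site actually uses.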
If you block a URL using the robots.txt file, you stop the search engine spider in its tracks: it will never visit the page you have disallowed. This prevents the spider from visiting the page, but it does not prevent Google from passing on any of your hard-earned PageRank to it.
The reason this becomes a problem is that while the spider has been blocked and the page will not end up in the search engine index, your hard-earned PageRank continues to flow into an area that is now completely invisible. Any links located on the page you have blocked with robots.txt pass on none of that PageRank, because the spider has been told not to visit the page and is totally unaware that those links exist. Some people refer to this as PageRank leakage.
Alternatively, if you decide to use the meta robots tag, the search engine spider must first access the page before it discovers that you do not wish it to be included in the index. Ideally, in this situation you would opt for “noindex” combined with the “follow” directive, so that the links embedded on the page continue to pass PageRank (i.e. <meta name="robots" content="noindex, follow">).
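As a quick reference, the tag sits in the <head> of the cart page itself. A minimal sketch (your cart template’s markup will obviously differ):

<head>
<title>Your Shopping Cart</title>
<meta name="robots" content="noindex, follow">
</head>

With this in place the spider can still crawl the page and follow its links, passing PageRank through to the pages they point at, while the cart page itself stays out of the index.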
This is a fairly technical article that probably raises plenty of questions; I know it made my brain hurt writing it. If you would like further explanation of any of the information above, please leave a comment below. I would be only too happy to try and answer it for you.
If you decide to use any of the tactics described above on your website, please remember that you do so at your own risk. They will have an impact on the way search engines interact with your website and, if implemented incorrectly, could result in your website being removed entirely from any search engine that obeys the robots protocol (and yes, that does include Google).