Playing hide and seek with robots

I'm going to talk about the most common methods used to control search engine robots on the Internet today. Note that these techniques are used not only to hide links; some of them can be applied in much more sophisticated ways.

Robots meta tag

All of the most popular search engine robots (Google, Yahoo, MSN/Live, Ask/Teoma) support the robots meta tag, which controls whether and how your page appears in search results. A typical robots tag looks like this:

<meta name="robots" content="noindex, nofollow" />

Here the name="robots" attribute means that the tag applies to the robots of all search engines. You can specify an exact robot name if you want different rules for different search engines, e.g.:

<meta name="googlebot" content="noindex, nofollow" />
<meta name="slurp" content="noindex" />
<meta name="teoma" content="nofollow" />

Here the first line refers to Google (its robot is called "googlebot"), the second is for Yahoo (robot "slurp"), and the third is for Ask (its robot is identified by "teoma"). For the record, the MSN bot is called "msnbot". For a complete list of robots you can visit the robots database.

The second attribute, content, tells robots how to behave on your page. Multiple content values must be separated with commas. Spaces and capitalization are ignored, so you can write these rules in lowercase, uppercase, or any mix of the two.
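To make the parsing rules concrete, here is a minimal Python sketch of how a robot might read these tags. It is only an illustration, not how any real search engine works. The names RobotsMetaParser and directives_for are made up for this example, and combining the generic "robots" tag with the robot-specific one is my assumption.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the content values of named meta tags, keyed by lowercase name."""

    def __init__(self):
        super().__init__()
        self.rules = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").strip().lower()
        content = attrs.get("content") or ""
        if not name:
            return
        # Split on commas, drop surrounding spaces, ignore capitalization.
        values = {v.strip().lower() for v in content.split(",") if v.strip()}
        self.rules.setdefault(name, set()).update(values)


def directives_for(html, robot):
    """Return the directives that apply to `robot`.

    Assumption: a robot obeys both the generic "robots" tag and the tag
    that carries its own name, so the two sets are merged.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    generic = parser.rules.get("robots", set())
    specific = parser.rules.get(robot.lower(), set())
    return generic | specific
```

Calling directives_for(page_html, "googlebot") on a page that contains content="NOINDEX , NoFollow" gives the same set as one that contains content="noindex,nofollow".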

The most common content values for the robots meta tag are these:

robots.txt - site wide robot rules in one file

Robots.txt is a file that should be placed in the top-level directory of your site. The rules defined in that file are referred to as the Robots Exclusion Protocol or the robots.txt protocol. This protocol is a convention for controlling robot behavior on your site (its initial purpose was only to prevent some or all robots from spidering particular parts of a web site).

An example robots.txt file looks like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
Allow: /private/some-public-docs/

The first line, "User-agent: *", indicates that the rules below it apply to all robots. If you want to address a particular robot, use its name instead of the asterisk, e.g.:

User-agent: googlebot

You can use as many User-agent lines as you want in robots.txt. The rules below a User-agent line apply to that particular robot until the next User-agent line is met.

User-agent: googlebot
Disallow: /images/
# ... 

User-agent: msnbot
Disallow: /downloads/

Robots use the closest matching User-agent group to process rules. For example, if you specify some rules for User-agent: * and then more rules for User-agent: googlebot, Googlebot will apply only its own rules and ignore the * rules when spidering.
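Python's standard library ships a robots.txt parser, urllib.robotparser, which behaves the same way on this point. The short sketch below uses made-up paths (/tmp/ and /images/) and a made-up helper, may_fetch, to show a robot with its own group ignoring the * group.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules: one group for everybody, one group for googlebot only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /tmp/

User-agent: googlebot
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(robot, path):
    """Ask the parser whether `robot` is allowed to spider `path`."""
    return parser.can_fetch(robot, path)
```

With these rules, may_fetch("googlebot", "/tmp/page.html") is true even though the * group disallows /tmp/, while msnbot, which has no group of its own, falls back to the * rules.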

The disallow rules tell robots which directories or files should not be spidered, and the allow rules specify which directories should be spidered. Again, robots use the closest match to determine whether a directory should be spidered or not. In any case, if no specific rule is defined in robots.txt, search engine robots will spider the directory or file.
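Here is a small Python sketch of that decision, using the rules from the example file earlier in this section. I read "closest match" as "the longest matching path prefix wins", which is an assumption; parsers differ on this point. The function name is_allowed and the rule format are made up for illustration.

```python
def is_allowed(path, rules):
    """Decide whether `path` may be spidered.

    `rules` is a list of ("allow" | "disallow", path_prefix) pairs taken
    from one User-agent group. The longest matching prefix wins; if no
    rule matches, the path is allowed.
    """
    best_len = -1
    allowed = True  # default when no rule matches
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty prefix matches nothing
        if len(prefix) > best_len:
            best_len = len(prefix)
            allowed = (kind == "allow")
    return allowed


# The rules from the example robots.txt above.
RULES = [
    ("disallow", "/cgi-bin/"),
    ("disallow", "/images/"),
    ("disallow", "/tmp/"),
    ("disallow", "/private/"),
    ("allow", "/private/some-public-docs/"),
]
```

Under this reading, /private/some-public-docs/report.html is spidered because the allow rule is the longer match, /private/secret.html is not, and /index.html is spidered because no rule mentions it.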

I won't go further into robots.txt specifics (there are some more tricks, but this covers the basics, and that should be enough). For further reading I recommend Aaron Wall's robots.txt article. You should also check his robots.txt generator and robots.txt analysis tool.

Back to the redirection thing. A typical redirection hiding scenario is the following:

1. Server administrator creates robots.txt file with content similar to this:

User-agent: *
Disallow: /redirect.php
# ... there might be more rules

2. A link on a particular page is written like this:

<a href="redirect.php?link_id=1004">Anchor text goes here</a>

As you can see, the redirect.php script receives the link id as a parameter. Of course, the parameter can be named any other way. The only requirement is that redirect.php can find the original link from the parameter given.

As an alternative you can pass the exact URL of the destination page as the parameter, but this is not recommended. There is speculation in the webmaster world that search engine robots can parse and extract such URLs from the HTML href attribute.

3. redirect.php looks up the requested link and redirects the user elsewhere:

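<?php
// Fetch the URL from a database, a text file, or any other store,
// using the link_id parameter.
// Log the user's click (for statistical purposes).
header("Location: http://www.someothersite.com"); // the actual redirect
exit; // stop here so nothing else is sent after the redirect header
?>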

There are other ways to redirect the user besides PHP's header function. For example, the redirect script could output a plain HTML redirect:

<html>
<head>
<meta http-equiv="refresh" content="0;url=http://www.someothersite.com"/>
</head>
<body>
If the page doesn't redirect -
please click <a href="http://www.someothersite.com">here</a>.
</body>
</html>
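If the redirect script builds this page on the fly, the destination URL has to be escaped before it goes into the HTML. A rough Python sketch of that step follows; the function name meta_refresh_page is made up, and the markup simply mirrors the page above.

```python
from html import escape


def meta_refresh_page(url, delay=0):
    """Build an HTML page that redirects to `url` via meta refresh."""
    # Escape &, <, > and quotes so the URL is safe inside attribute values.
    safe_url = escape(url, quote=True)
    return (
        "<html>\n<head>\n"
        f'<meta http-equiv="refresh" content="{int(delay)};url={safe_url}"/>\n'
        "</head>\n<body>\n"
        "If the page doesn't redirect -\n"
        f'please click <a href="{safe_url}">here</a>.\n'
        "</body>\n</html>\n"
    )
```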

However, it makes almost no difference which technology is used. The main task of hiding the link is done by robots.txt. There is also no point in choosing a particular server response code (e.g., a "301 redirect" or a "302 redirect"), since robots won't see it.

Please note that this technique can be applied in any server-side scripting language, not only PHP.
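For instance, the lookup-and-redirect step might look roughly like this in Python. This is a sketch under my own assumptions: the LINKS table, the function name resolve_redirect, and the 404 answer for unknown ids are all invented for illustration. The (status, headers) pair is shaped so that it could be handed to a WSGI start_response call.

```python
from urllib.parse import parse_qs

# Hypothetical link table; a real script would read a database or a file.
LINKS = {"1004": "http://www.someothersite.com"}


def resolve_redirect(query_string, links=LINKS):
    """Map a query string such as "link_id=1004" to (status, headers)."""
    params = parse_qs(query_string)
    link_id = params.get("link_id", [""])[0]
    target = links.get(link_id)
    if target is None:
        # Unknown or missing id: no destination to redirect to.
        return "404 Not Found", [("Content-Type", "text/plain")]
    # Click logging (for statistical purposes) would go here.
    return "302 Found", [("Location", target)]
```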

Advice for link builders: this technique is very popular among shady (and not so shady) webmasters, so check robots.txt to make sure that your link is not hidden. By the way, some webmasters use this redirect technique without hiding the redirect script via robots.txt; in that case your link is counted.
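This check can be automated. The Python sketch below uses the standard urllib.robotparser module; the function name link_is_hidden, the example.com addresses, and the default robot name are my own choices for illustration. It only makes sense when the link points to the same site that the robots.txt belongs to.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def link_is_hidden(robots_txt, page_url, href, robot="googlebot"):
    """Return True if robots.txt stops `robot` from fetching the link target.

    `robots_txt` is the text of the site's robots.txt, `page_url` is the
    page that carries the link, and `href` is the link as written there.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Resolve relative links such as "redirect.php?link_id=1004".
    absolute = urljoin(page_url, href)
    return not parser.can_fetch(robot, absolute)
```

For a robots.txt that contains "Disallow: /redirect.php" under "User-agent: *", link_is_hidden(robots_txt, "http://www.example.com/links.html", "redirect.php?link_id=1004") reports the link as hidden.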

Bonus tip: when using meta refresh, the referrer information is also hidden (in Firefox and Internet Explorer). In that case the destination server has no way to see where a user came from.

This article is a part of SEO design patterns - hiding links.

Footnotes

  1. There is some disagreement in the webmaster community regarding the meta nofollow value and the rel nofollow attribute. An alternate theory is that when the nofollow attribute is used on a particular link, search engines still spider the linked page, and the attribute simply tells them to assign little to no value to that link.