PDQ Library: Robots File

Using a Robots file in your Web site

It's easy to specify which parts of your web site should stay private and out of search results. The solution is a simple file called robots.txt, a de facto standard (the Robots Exclusion Protocol) honoured by search services. This file lets you control how search engines access the site at several levels - the entire site, individual directories, pages of a specific type, and even individual pages. You create it with a plain text editor and store it as "robots.txt" in your web server root (alongside your home page).
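
For example, if your site lives at www.example.com (a placeholder domain used here for illustration), crawlers will look for the file at this address:

http://www.example.com/robots.txt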

Search engines like Google index information from a huge number of web pages. To accomplish this, a fleet of computers running Google's crawler, known as the Googlebot, continually crawls the web and builds an index of all the information it finds. That index is what lets Google show a list of pages matching a user's query.

Usually, you want the Googlebot to access your site so your web pages can be found by people searching on Google. However, you may have pages you don't want in Google's index - for example, a directory of site statistics, or information that requires payment to access. You can exclude such pages by creating a text file called robots.txt containing a list of pages and directories that search engines should not access.

Examples of robots.txt

Here is a simple example of a robots.txt file that tells all search bots to stay out of the folders named images, stats, and logs (and everything inside them), and also to skip any image file with the extension ".jpg" found anywhere on your site.

User-agent: *
Disallow: /images/
Disallow: /stats/
Disallow: /logs
Disallow: /*.jpg$

The User-agent line specifies that the section that follows is a set of instructions for all bots; all the major search engines honour the instructions you put in robots.txt. Each Disallow line tells crawlers not to access files under the named path, so the contents of those directories will not show up in Google search results. (The "*" and "$" wildcard patterns are extensions to the original standard, but the major engines, including Google, support them.) You can specify different rules for different search engines - to create a section just for the Googlebot, start it with a line containing "User-agent: Googlebot".
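
As a rough sketch, a robots.txt with a section just for the Googlebot might look like this (the directory names are placeholders for illustration):

User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /stats/

A robot obeys the section that most specifically matches its name, so in this sketch the Googlebot would follow only the first group, while all other bots would follow the second.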

META Tags

In addition to the robots.txt file, you can use META tags for extra control over individual pages on your site. META tags must be added to the <HEAD> section of a web page. A META tag with name="robots" controls how all search robots should index the page; name="Googlebot" instructs only the Googlebot. Together, robots.txt and META tags give you complete flexibility to specify complex access policies.

The first two examples below prevent indexing of the page and also stop robots from following any links on it; "none" means "noindex,nofollow". Examples 3 and 4 allow indexing and following links, which is the default behaviour; "all" is the same as "index,follow". Examples 5 and 6 show the other valid combinations.

<meta name="robots" content="noindex,nofollow">
<meta name="robots" content="none">
<meta name="robots" content="index,follow">
<meta name="robots" content="all">  -- same as previous
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">

In general, META tags control indexing of individual pages, while robots.txt is the better tool for blocking folders or the entire site.

Search "Cache" Links

Google normally takes a "snapshot" of each page it crawls and archives it. This "cached" version allows a page to be retrieved even if the original is temporarily unavailable. To prevent search engines from showing a "Cached" link for your pages, place the following tag in the <HEAD> section of each page:

<meta name="robots" content="noarchive">

Remove your Site from Search Engines

To remove your site from search engines and prevent all robots from crawling it in the future, place the following robots.txt file in your server root:

User-agent: *
Disallow: /
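
Conversely, a robots.txt that allows every robot to crawl the whole site looks like this (an empty Disallow line means nothing is blocked):

User-agent: *
Disallow: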
