robots.txt
The robots.txt file is a widely used Web convention that declares what search engines should not index on a Web site. This is an "old" technique, but it is still helpful. By using this file, you can select which files not to index, which keeps private files from being displayed in search engine results. The file is flexible: you can declare several rules in the same file, so that different bots get different behavior.
The robots.txt file was created around 1994 by the members of the Robots mailing list. For many years it was a de facto convention with no formal standard; it was later published as RFC 9309. It is important to remember that robots.txt should not be used to flag what should be indexed, but to indicate what should not be indexed. You need robots.txt, for example, on an intranet with WWW access that holds sensitive company information. Restricted areas and personal documents that are hosted on your server in a specific directory for backup reasons are other possible assets that you may want to prevent from getting indexed.
If you want a search engine to index your whole site, do not use robots.txt.
When creating a robots.txt file, keep the following considerations in mind:
- The robots.txt file is a text file that must be created by using plain ASCII text and should be saved by using the "txt" file extension.
- This file should be in the root directory of your Web site. It is the first file that a spider requests when it visits a Web site.
- The file name should be written in lowercase, and the file should have public read access. If your Web site root directory is your NSF file, you can upload your robots.txt to your NSF database as a file resource. You can also create a page named robots.txt, insert its content on the design body, and change its content type to text. For further information about this topic, refer to Page design elements.
- Since Web crawlers treat each subdomain as a completely different Web site, keep a separate robots.txt file at the root of every subdomain. Crawlers request robots.txt only from the root of a host, so a copy placed in a subdirectory is not read. For a site or sensitive data hosted in a subdirectory, declare the paths in the root file, or consider using the robots meta tag approach, described later in this section.
Important
If your Web site root directory is not your NSF database, you must upload the robots.txt file to your Lotus Domino Web server root directory. For further information about this topic, refer to topology.
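The placement rules above can be sketched in a few lines of Python. This is a minimal illustration, assuming the common crawler behavior of requesting /robots.txt from the root of each host; the function name and the example.com URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL a crawler would request for page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port), replace the path
    # with /robots.txt, and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Pages deep inside a database or directory still map to the root file.
print(robots_txt_url("http://www.example.com/sales/catalog.nsf/products?OpenView"))
# A different subdomain is a different host, so it has its own file.
print(robots_txt_url("http://intranet.example.com/hr/index.html"))
```

Both calls return a URL at the root of the respective host, which is why the file must be reachable there.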
There are basically two rules to declare in this file: User-agent and Disallow. The User-agent rule names the agent to which the rules that follow apply. An agent in this context is a search engine spider, like Googlebot from Google.
If you want a rule to apply to all agents (and not only the Google robot), use an asterisk as the User-agent value. The Disallow rule then declares what those agents should not index.
To block the whole site, use the root directory slash (/), as in the following example:
Disallow: /
To block a specific directory, enter the directory path, as in the following example:
Disallow: /private_directory/
To block a specific file, enter the file path, as in the following example:
Disallow: /private_file.html
You can use as many Disallow rules as you want. Start each rule on a new line in your file.
Important
Remember that URLs are case sensitive. Therefore, a page called Coffee.htm cannot be declared as coffee.htm.
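To see how a parser reads these Disallow rules, the following sketch uses the urllib.robotparser module from the Python standard library. It is only an illustration: real crawlers may differ in details, and the agent name ExampleBot, the sample rules, and the helper function are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Sample rules: one directory, one file, and one mixed-case file name.
RULES = """\
User-agent: *
Disallow: /private_directory/
Disallow: /private_file.html
Disallow: /Coffee.htm
"""


def is_allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Parse robots_txt and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


for path in (
    "/private_directory/report.html",  # inside the blocked directory
    "/private_file.html",              # the blocked file
    "/Coffee.htm",                     # matches the rule exactly
    "/coffee.htm",                     # different case, so no rule matches
    "/index.html",                     # not mentioned, so allowed
):
    print(path, is_allowed(RULES, "ExampleBot", path))
```

The first three paths are reported as not allowed and the last two as allowed, which shows both the prefix matching of directory rules and the case sensitivity of paths.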
robots.txt examples
The following example prohibits any robot from indexing the whole site:
User-agent: *
Disallow: /
An asterisk indicates that all robots should follow the rule. A practical use is to prevent folders that contain private information from being indexed. The code in the following example prevents four directories from being indexed.
User-agent: *
Disallow: /cgi-bin/ #scripts and programs
Disallow: /login/
Disallow: /tmp/ #testing area
Disallow: /private/ #corporate files
The number sign (#) is used for comments. You can use it to explain the reason for excluding a file or directory, without affecting how the rule works.
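The claim that comments do not change the rules can be checked with a short sketch. It again relies on Python's urllib.robotparser as a stand-in for a crawler; the agent name ExampleBot, the helper functions, and the test paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

WITH_COMMENTS = """\
User-agent: *
Disallow: /cgi-bin/ #scripts and programs
Disallow: /login/
Disallow: /tmp/ #testing area
Disallow: /private/ #corporate files
"""


def strip_comments(robots_txt: str) -> str:
    """Remove everything from the number sign to the end of each line."""
    return "\n".join(
        line.split("#", 1)[0].rstrip() for line in robots_txt.splitlines()
    )


def blocked_paths(robots_txt: str, agent: str, paths: list) -> list:
    """Return the subset of paths that agent may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]


PATHS = ["/cgi-bin/search.pl", "/login/", "/tmp/draft.html",
         "/private/plan.doc", "/index.html"]

# The commented and uncommented versions block the same paths.
print(blocked_paths(WITH_COMMENTS, "ExampleBot", PATHS))
print(blocked_paths(strip_comments(WITH_COMMENTS), "ExampleBot", PATHS))
```

Both calls list the same four blocked paths and leave /index.html out.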
If you do not have a robots.txt file, the robot indexes your site normally. It is the same as having the following robots.txt file:
User-agent: *
Disallow:
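As a quick check of this equivalence, the sketch below compares a file with an empty Disallow value, a file with no rules at all, and a file that blocks the whole site. It uses Python's urllib.robotparser for illustration; the constant names, the helper, and the agent name ExampleBot are assumptions.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty value: nothing is blocked
NO_RULES = ""                               # a file with no rules at all
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # root slash: everything is blocked


def may_fetch(robots_txt: str, path: str, agent: str = "ExampleBot") -> bool:
    """Parse robots_txt and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


for name, rules in (("ALLOW_ALL", ALLOW_ALL),
                    ("NO_RULES", NO_RULES),
                    ("BLOCK_ALL", BLOCK_ALL)):
    print(name, may_fetch(rules, "/any/page.html"))
```

The first two cases allow the page and only the last one blocks it.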
The following example is more complex. In the first two lines, we declare that /private/ should not be indexed by any robot. This rule is meant for all spiders, although a robot that has its own User-agent group typically follows only that group. Then, on lines 3 to 5, we define that the Google robot, Googlebot, should not index the /cgi-bin/ and /corporate/hr/ directories. Then, on lines 6 and 7, we define that the Yahoo robot, Slurp, should not index /corporate/accountancy/. Finally, on lines 8 and 9, we define that the MSN® robot, msnbot, should not index the /msoffice_docs/ directory.
User-agent: *
Disallow: /private/
User-agent: Googlebot # Google (line 3)
Disallow: /cgi-bin/
Disallow: /corporate/hr/
User-agent: Slurp # Yahoo (line 6)
Disallow: /corporate/accountancy/
User-agent: msnbot # MSN (line 8)
Disallow: /msoffice_docs/
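The per-agent groups in this example can be explored with Python's urllib.robotparser. This is a sketch rather than a model of any particular search engine: the parser applies the first group that names the robot and falls back to the asterisk group otherwise, and real crawlers may resolve groups differently. The agent name OtherBot and the test paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /private/
User-agent: Googlebot # Google
Disallow: /cgi-bin/
Disallow: /corporate/hr/
User-agent: Slurp # Yahoo
Disallow: /corporate/accountancy/
User-agent: msnbot # MSN
Disallow: /msoffice_docs/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

checks = [
    ("Googlebot", "/cgi-bin/search.pl"),
    ("Googlebot", "/corporate/hr/staff.html"),
    ("Slurp", "/corporate/accountancy/q1.xls"),
    ("msnbot", "/msoffice_docs/memo.doc"),
    ("OtherBot", "/private/notes.html"),   # falls back to the * group
    ("OtherBot", "/cgi-bin/search.pl"),    # not blocked for other robots
    ("Googlebot", "/private/notes.html"),  # Googlebot uses only its own group
]
for agent, path in checks:
    print(agent, path, parser.can_fetch(agent, path))
```

Under this parser the last check is allowed, because the Googlebot group does not list /private/. If a directory must be closed to a robot that has its own group, repeating the Disallow rule inside that group is the safer choice.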
Tip
The robots.txt file does not affect the search results returned by Domino on the Web. If you want to control which pages are returned by a search, review your search query and view selections, and implement security by using access control lists (ACLs) and Readers fields. For further information about how to implement security on Domino applications, refer to security considerations.
If you do not have access to the robots.txt file, you can use another approach to prevent a page from getting indexed. There is an HTML meta tag, called robots, that tells spiders how to treat a page. Its content property takes a pair of values, formed by combining these options: index or noindex, and follow or nofollow. Index and follow are the implicit defaults for this tag. The index option allows a page to be indexed, and follow allows the robot to follow its links. The noindex option prevents the page from being indexed, which means that it does not appear in the search results. The nofollow option prohibits the robot from following the links on this page. If no other pages link to the same targets as the links on this page, nofollow can have the same effect on those targets as a noindex on each of them. However, since anyone on any Web page can deep-link to those targets, this can fail. In the following example, a robot indexes the page and follows all the links on the page:
<meta name="robots" content="index,follow" />
In the following example, a robot indexes the page, but treats it as a "dead end" and does not follow any of the links on it:
<meta name="robots" content="index,nofollow" />
In the following example, a robot skips over the page, without indexing its content, but continues indexing all the other pages to which this page links:
<meta name="robots" content="noindex,follow" />
In the following example, an ethical robot neither indexes the page nor follows any of its links. It treats this page as nonexistent in its index.
<meta name="robots" content="noindex,nofollow" />
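The four combinations above can be read back from a page with a short sketch that uses html.parser from the Python standard library. It shows how a well-behaved robot might interpret the tag, not how any specific search engine does it; the class and function names are hypothetical, and a missing tag is treated as index,follow, as described earlier.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html: str) -> tuple:
    """Return (may_index, may_follow) for a page, defaulting to True."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow


page = '<html><head><meta name="robots" content="noindex,follow" /></head></html>'
print(robots_policy(page))  # (False, True): skip the page, follow its links
```

A page without the tag yields (True, True), which matches the implicit defaults.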
This approach has the limitation that it can be implemented only in hypertext files, which prevents its use for such files as PDFs or DOCs. Thus, robots.txt has a broader scope than this approach, but both have their importance.
For further information about how to insert meta tags on your pages, refer to Common design properties on Web applications.
Important
Although the most reliable search engine robots respect the Web site indexing rules defined in the robots.txt file, do not expose sensitive data to the World Wide Web. "Thief spiders" can ignore these rules and crawl your Web site to search for sensitive data. To avoid this, you must implement effective security by using such techniques as firewalls, file access control, and ACLs. For further information about how to implement security on Domino applications, refer to security considerations.