The robots.txt file is a Web standard file used across the World Wide Web to declare what search engines should not index on a Web site. This is an "old" technique, but it is still helpful. By using this file, you can select which files not to index, preventing private files from appearing in search engine results. The file is flexible and allows you to implement several rules in the same file to ensure distinct behavior for different bots.
The robots.txt file was created around 1994 by the members of the Robots mailing list. There is no formal standard or RFC for it. It is important to remember that robots.txt should not be used to flag what should be indexed, but to indicate what should not be indexed. You need robots.txt, for example, on an intranet with WWW access that holds sensitive company information. Restricted areas, personal documents, and files hosted on your server in a specific directory for backup reasons are all assets that you may want to prevent from being indexed.
If you want a search engine to index your whole site, do not use robots.txt.
When creating a robots.txt file, keep the following considerations in mind:
- The robots.txt file is a text file that must be created as plain ASCII text and should be saved with the "txt" file extension.
- This file should be in the root directory of your Web site. This is the first file that a spider visits on a Web site.
- The file name should be lowercase, and the file must have public read access. If your Web site root directory is your NSF database, you can upload robots.txt to the NSF database as a file resource. You can also create a page named robots.txt, insert its content in the design body, and change its content type to text. For further information about this topic, refer to Page design elements.
- Because Web crawlers treat subdomains as completely different Web sites, keep a copy of the robots.txt file on every subdomain that hosts a separate site or sensitive data. If your only option is to create a page, use the approach described previously. You may also consider using the robots meta tag approach, described later in this section.
If your Web site root directory is not your NSF database, you must upload the robots.txt file to your Lotus Domino Web server root directory. For further information about this topic, refer to topology.
There are basically two rules to declare in this file: User-Agent and Disallow. The User-Agent rule is used to address a specific agent. A User-Agent in this context is a search engine spider, such as the Googlebot from Google.
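Assuming standard robots.txt syntax, the two rules pair up as follows; the directory name here is illustrative:

```
User-agent: Googlebot
Disallow: /cgi-bin/
```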
If you want a rule to apply to all agents (and not only the Google robot), use an asterisk (*) as the User-Agent value.
To block the whole site, use the root directory slash (/), as in the following example:
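Assuming standard robots.txt syntax, a whole-site block looks like this:

```
User-agent: *
Disallow: /
```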
To block a specific directory, enter the directory path, as in the following example:
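Assuming standard robots.txt syntax, and using an illustrative directory name:

```
User-agent: *
Disallow: /directory/
```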
To block a specific file, enter the file path, as in the following example:
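Assuming standard robots.txt syntax, and using an illustrative file path:

```
User-agent: *
Disallow: /directory/file.htm
```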
You can use as many Disallow rules as you want; start each rule on a new line in your file.
Remember that URLs are case sensitive. Therefore, a page called Coffee.htm cannot be declared as coffee.htm.
The following example prohibits any robot from indexing the whole site:
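A blanket block of this kind, in standard robots.txt syntax:

```
User-agent: *
Disallow: /
```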
An asterisk indicates everything, that is, all robots should follow that rule. A practical use is preventing folders on your site that contain private information from being indexed. The code in the following example prevents three directories from being indexed.
User-agent: *
Disallow: /cgi-bin/ # scripts and programs
Disallow: /tmp/ #testing area
Disallow: /private/ #corporate files
The number sign (#) is used for comments. You can use this sign to explain the reason for excluding a file, without affecting how the rules are applied.
If you do not have a robots.txt file, search engines index your site normally. It is the same as having the following robots.txt file:
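In standard robots.txt syntax, an empty Disallow value matches nothing, so this file permits everything:

```
User-agent: *
Disallow:
```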
The following is a more complex example. In the first two lines, we declare that /directory/ should not be indexed by any robot; this rule is followed by all spiders. On lines 3 to 5, we define that the Google robot, Googlebot, should not index the /cgi-bin/ and /corporate/hr/ directories. In lines 6 and 7, we define that the Yahoo robot, Slurp, should not index /corporate/accountancy/. Finally, on lines 8 and 9, we define that the MSN® robot, msnbot, should not index the /msoffice_docs/ directory.
User-agent: *
Disallow: /directory/
User-agent: Googlebot # Google
Disallow: /cgi-bin/
Disallow: /corporate/hr/
User-agent: Slurp # Yahoo
Disallow: /corporate/accountancy/
User-agent: msnbot # MSN
Disallow: /msoffice_docs/
The robots.txt file does not affect the search results returned by Domino on the Web. If you want to control which pages are returned by a search, review your search query and view selections, and implement security by using access control lists (ACLs) and Readers fields. For further information about how to implement security in Domino applications, refer to security considerations.
If you do not have access to the robots.txt file, you can use another approach to prevent a page from being indexed: an HTML meta tag called robots, which tells spiders how to treat the page. The tag's content attribute takes a pair of values, formed by combining these options: index, follow, noindex, and nofollow. Index and follow are the implicit defaults for this tag. The index option allows a page to be indexed, and follow allows the links on the page to be followed for indexing. The noindex option prevents the page from being indexed, which means the page is not put in search results. The nofollow value prohibits following the links on the page. If no other pages point to the same pages as the links on this page, nofollow can have the same effect on those pages as a noindex on them. However, because any Web page can deep-link to those pages, this cannot be relied on. In the following example, a robot indexes the page and follows all the links on the page:
<meta name="robots" content="index,follow" />
In the following example, a robot indexes the page, but treats it as a "dead end" and does not follow any of the links on it:
<meta name="robots" content="index,nofollow" />
In the following example, a robot skips over the page, without indexing its content, but continues indexing all the other pages to which this page links:
<meta name="robots" content="noindex,follow" />
In the following example, an ethical robot neither indexes the page nor follows any of its links. It treats the page as nonexistent in its index:
<meta name="robots" content="noindex,nofollow" />
This approach has the drawback that it can be implemented only in hypertext (HTML) files, which prevents its use for files such as PDFs or DOCs. Thus, robots.txt has a broader reach than this approach, but both have their place.
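For such non-HTML files, the robots.txt approach still applies; for example, assuming an illustrative file path:

```
User-agent: *
Disallow: /docs/report.pdf
```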
For further information about how to insert meta tags on your pages, refer to Common design properties on Web applications.
Although the most reputable search engine robots respect the Web site indexing rules defined in the robots.txt file, do not expose sensitive data to the World Wide Web. "Thief spiders" can crawl your Web site searching for sensitive data. To guard against this, you must implement proper security by using techniques such as firewalls, file access control, and ACLs. For further information about how to implement security in Domino applications, refer to security considerations.