Quick Tip #5: How to write Robots.txt


When hosting your own site, you may need to control the behavior of bots: programs that crawl your website, collect information about its structure and content, and feed it to a search engine.

That is what robots.txt is for: a file that tells bots how to behave on your site. Configuring it is pretty straightforward.

Create a plain text file named robots.txt (in Nano, Vim, Mousepad, Notepad or any other editor), type in the directives and upload it to the root folder of your website, so that it is reachable at /robots.txt.

Directives

User-agent: – name of the bot the following rules apply to (* matches any bot).
Allow: – path that may be crawled. By default, every path is allowed.
Disallow: – path that must not be crawled.

For example, this can be defined as:

User-agent: *
Disallow: /admin

This tells all bots not to crawl, or index, your administrative area.

Or like this:

User-agent: Googlebot
Disallow: /mail

This denies ONLY Googlebot from crawling your /mail directory; all other bots are still allowed to do so.
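If you want to check how rules like the two above are interpreted, one option is the urllib.robotparser module from Python's standard library. The sketch below is only an illustration: the helper name is_allowed and the bot name SomeOtherBot are made up for this example, and real crawlers may match rules somewhat differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# The two rule sets shown above.
ALL_BOTS = """\
User-agent: *
Disallow: /admin
"""

GOOGLE_ONLY = """\
User-agent: Googlebot
Disallow: /mail
"""

if __name__ == "__main__":
    # The wildcard rule applies to any bot, including a hypothetical one.
    print(is_allowed(ALL_BOTS, "SomeOtherBot", "/admin/login"))
    # The Googlebot rule applies to Googlebot only.
    print(is_allowed(GOOGLE_ONLY, "Googlebot", "/mail/inbox"))
    print(is_allowed(GOOGLE_ONLY, "SomeOtherBot", "/mail/inbox"))
```

The first two calls report that the path is denied, the third that it is allowed, because no rule in that file names SomeOtherBot.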

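NOTE: The original robots.txt convention defines only the “User-agent” and “Disallow” directives. Other directives, such as “Allow”, depend on crawler support.

You can define multiple rules for each bot, and a separate group of rules per bot. For example: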

User-agent: Googlebot
Allow: /indexable-content
Disallow: /admin

User-agent: BaiduSpider
Disallow: /indexable-content
Disallow: /mail
Disallow: /admin
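To see what these two groups add up to, here is a small sketch that prints an allowed/denied table for both bots. It again relies on Python's urllib.robotparser; the function name access_table and the sample paths are assumptions made for the example, not part of any standard.

```python
from urllib.robotparser import RobotFileParser

# The per-bot groups from the example above.
RULES = """\
User-agent: Googlebot
Allow: /indexable-content
Disallow: /admin

User-agent: BaiduSpider
Disallow: /indexable-content
Disallow: /mail
Disallow: /admin
"""


def access_table(robots_txt, agents, paths):
    """Map each (agent, path) pair to True (allowed) or False (denied)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {(a, p): parser.can_fetch(a, p) for a in agents for p in paths}


if __name__ == "__main__":
    table = access_table(
        RULES,
        agents=["Googlebot", "BaiduSpider"],
        paths=["/indexable-content/post", "/mail/inbox", "/admin/login"],
    )
    for (agent, path), allowed in table.items():
        print(f"{agent:12} {path:26} {'allowed' if allowed else 'denied'}")
```

Note that /mail stays open to Googlebot here, since its group has no rule for that path, while BaiduSpider is denied on all three paths.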

You can also ask all bots not to spider your site at all by writing in robots.txt:

User-agent: *
Disallow: /

Or, define the crawl rate using the “Crawl-delay” directive:

User-agent: bingbot
Crawl-delay: 5

This asks bingbot to limit its request rate to about 1 per 5 seconds. Not every crawler supports this directive; Googlebot, for example, ignores it.
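From the crawler's side, honoring a crawl delay could look roughly like the sketch below. It reads the value with crawl_delay() from Python's urllib.robotparser (available since Python 3.6). The bot name ExampleBot, the helpers delay_for and polite_fetch, and the 1 second fallback are all assumptions made for this illustration.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt addressed to a hypothetical bot.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 5
"""


def delay_for(robots_txt: str, user_agent: str, default: float = 1.0) -> float:
    """Return the Crawl-delay declared for user_agent, or default if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else default


def polite_fetch(paths, delay, fetch, sleep=time.sleep):
    """Call fetch(path) for each path, pausing delay seconds between requests."""
    results = []
    for i, path in enumerate(paths):
        if i > 0:
            sleep(delay)  # wait before every request except the first
        results.append(fetch(path))
    return results


if __name__ == "__main__":
    delay = delay_for(ROBOTS_TXT, "ExampleBot")
    # fetch is a stand-in here; a real crawler would issue an HTTP request.
    polite_fetch(["/", "/about"], delay, fetch=lambda p: print("fetching", p))
```

The sleep function is passed in as a parameter so the pacing logic can be tried out without actually waiting.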

NOTE: Some crawlers ignore robots.txt completely, so it cannot be considered absolute protection against content and structure indexing, nor an effective technique for limiting crawler request rates.

There is also an unofficial but widely supported “Sitemap” directive, which lets you point crawlers to your sitemap from robots.txt. For example:

Sitemap: http://www.your-site.com/sitemap2.xml
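A crawler can pick this directive up while parsing the file. As a minimal sketch, assuming Python 3.8 or later (where urllib.robotparser gained site_maps()), the hypothetical helper sitemap_urls below returns every sitemap listed in a robots.txt text:

```python
from urllib.robotparser import RobotFileParser

# Sample file combining a rule group with the Sitemap line from above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin

Sitemap: http://www.your-site.com/sitemap2.xml
"""


def sitemap_urls(robots_txt: str) -> list:
    """Return the sitemap URLs listed in robots_txt, or an empty list."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when the file has no Sitemap lines.
    return parser.site_maps() or []


if __name__ == "__main__":
    for url in sitemap_urls(ROBOTS_TXT):
        print(url)
```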

NOTE: robots.txt cannot be used to limit access to a site by putting a browser's user agent string in the “User-agent” directive, because browsers ignore robots.txt.

Common Bots List

Googlebot – Google's crawler
Baiduspider – Baidu's crawler (Chinese search engine)
MSNBOT – Bing's crawler (older name; the current one is bingbot)
Gigabot – Gigablast's crawler
Yahoo! Slurp – Yahoo's crawler

A full list of crawlers can be found at UserAgentString.com.
