When hosting your own site, you may need to configure the behavior of bots: programs that crawl your website, collect information about its structure and content, and feed it to the corresponding search engine.
For that purpose, robots.txt was created: a file that defines how bots should behave on your site. Its configuration is pretty straightforward.
Create a text file named robots.txt (in Nano, Vim, Mousepad, or Notepad), type in the directives, and upload it to the root folder of your website.
User-agent: – name of the bot the rules apply to.
Allow: – path the bot is allowed to crawl. By default, any path (written as ‘*’) is allowed.
Disallow: – path the bot is denied from crawling.
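As a rough illustration of how these directives fit together, the following Python sketch builds one group of rules as text and works out the root-level URL where the file would be served. The helper names render_group and robots_url are hypothetical, chosen only for this example, and the site address is a placeholder.

```python
from urllib.parse import urlsplit, urlunsplit


def render_group(agent, allow=(), disallow=()):
    """Build one robots.txt group: a User-agent line followed by its rules."""
    lines = [f"User-agent: {agent}"]
    lines += [f"Allow: {path}" for path in allow]
    lines += [f"Disallow: {path}" for path in disallow]
    return "\n".join(lines) + "\n"


def robots_url(page_url):
    """Return the root-level robots.txt URL for the site serving page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Placeholder site and path, used only for illustration.
text = render_group("*", disallow=["/admin"])
print(text, end="")
print(robots_url("http://www.your-site.com/blog/post?id=1"))
```

The second helper reflects the point that the file belongs in the root folder: whatever page a bot starts from, it looks for robots.txt at the top of the site.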
For example, this can be defined as:
User-agent: *
Disallow: /admin
This makes sure that no bot accesses or indexes your administrative area.
Or like this:
User-agent: Googlebot
Disallow: /mail
This denies ONLY Googlebot from crawling your /mail directory, while all other bots remain allowed to do so.
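One way to check how such rules are read is to feed them to a parser. The sketch below uses Python's standard urllib.robotparser module on the two examples above. The helper load_rules and the test URLs are assumptions made for this illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def load_rules(text):
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


SITE = "http://www.your-site.com"  # placeholder address

# First example: every bot is kept out of /admin.
all_bots = load_rules("User-agent: *\nDisallow: /admin\n")
# Second example: only Googlebot is kept out of /mail.
google_only = load_rules("User-agent: Googlebot\nDisallow: /mail\n")

print(all_bots.can_fetch("Googlebot", SITE + "/admin/login"))
print(all_bots.can_fetch("Googlebot", SITE + "/blog"))
print(google_only.can_fetch("Googlebot", SITE + "/mail/inbox"))
print(google_only.can_fetch("BaiduSpider", SITE + "/mail/inbox"))
```

With this parser, a bot that matches no group and finds no ‘*’ group is treated as allowed, which is why the rule for Googlebot does not affect BaiduSpider.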
NOTE: By default, most crawlers only support the “User-agent” and “Disallow” directives.
You can define multiple rules for each bot. For example:
User-agent: Googlebot
Allow: /indexable-content
Disallow: /admin

User-agent: BaiduSpider
Disallow: /indexable-content
Disallow: /mail
Disallow: /admin
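To see how several groups are kept apart, the rules above can be run through Python's standard urllib.robotparser. This is only a sketch: the page paths under each directory and the third bot name are made up for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /indexable-content
Disallow: /admin

User-agent: BaiduSpider
Disallow: /indexable-content
Disallow: /mail
Disallow: /admin
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical pages, one per directory mentioned in the rules.
PATHS = ["/indexable-content/page", "/mail/inbox", "/admin/login"]


def allowed_paths(agent):
    """Return the subset of PATHS that `agent` may crawl under these rules."""
    return [path for path in PATHS if parser.can_fetch(agent, path)]


for bot in ("Googlebot", "BaiduSpider", "MSNBOT"):
    print(bot, allowed_paths(bot))
```

Each bot is judged only by its own group. A bot with no group of its own, such as MSNBOT here, is not restricted at all because the file has no ‘*’ group.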
You can also prevent all bots from spidering your site by writing in robots.txt:
User-agent: *
Disallow: /
Or define the crawl rate using the “Crawl-delay” directive:
User-agent: Googlebot
Crawl-delay: 5
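For crawlers that honor the directive, this limits the request rate to one request every 5 seconds. Note that Googlebot itself ignores Crawl-delay, so the example above only illustrates the syntax.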
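From the crawler's side, reading this value could look like the sketch below, which uses the crawl_delay method of Python's standard urllib.robotparser (available since Python 3.6). The helper wait_seconds and its default of 0 are assumptions for this example, not part of any crawler's actual behavior.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def wait_seconds(agent, default=0):
    """Seconds a crawler named `agent` should pause between requests.

    Falls back to `default` when the file sets no Crawl-delay for the agent.
    """
    delay = parser.crawl_delay(agent)
    return default if delay is None else delay


print(wait_seconds("Googlebot"))
print(wait_seconds("MSNBOT"))
```

A well-behaved crawler would sleep for that many seconds between requests. Nothing in the file can force it to.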
NOTE: Some crawlers ignore robots.txt completely, so it cannot be considered absolute protection against indexing of your content and structure, nor an effective way to limit crawl rate.
There is also an unofficial “Sitemap” directive that lets you declare a sitemap in robots.txt. For example:
Sitemap: http://www.your-site.com/sitemap2.xml
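A tool that reads robots.txt can pick this line up alongside the crawl rules. The sketch below assumes Python 3.8 or newer, where the standard urllib.robotparser module provides site_maps. The surrounding rules are simply reused from the earlier examples.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Sitemap: http://www.your-site.com/sitemap2.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if the file has none.
sitemaps = parser.site_maps()
print(sitemaps)
```

In this parser, Sitemap lines are collected for the whole file rather than for a single User-agent group.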
NOTE: robots.txt cannot be used to restrict access to a site by putting a browser's user-agent string in the “User-agent” directive, because browsers ignore robots.txt.
Googlebot – Google's crawler.
Baiduspider – Baidu's crawler (Chinese search engine).
MSNBOT – Bing's crawler.
Gigabot – Gigablast's crawler.
Yahoo! Slurp – Yahoo's crawler.
A full list of crawlers can be found at UserAgentString.com.