What Is Robots.txt?
The robots.txt file serves as a guide for bots, scripts, and crawlers (such as search engines), telling them which parts of your website they may crawl (visit and read). It adheres to the Robots Exclusion Standard and consists of rules that either permit or block crawlers from accessing particular parts (pages, files, and folders) of your website.
If no rules are specified, all files on your website are allowed for crawling.
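For reference, a minimal robots.txt that explicitly allows everything looks like this (an empty Disallow value means nothing is blocked):
User-agent: *
Disallow: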
Placement of Robots.txt
Your robots.txt file should be located in the root directory of your domain.
For instance, if your domain is onescales.com, your robots.txt file should be accessible at https://onescales.com/robots.txt
The robots.txt file's influence is limited to the domain, subdomain, and protocol (http / https) where it resides.
How to Write Robots.txt: Basic Guidelines
- Rules: These define what a crawler can or can't do on your website.
- Groups: A robots.txt file is divided into groups, each containing a set of rules (one or more).
- User Agent: Every group starts with a User-agent line, specifying the name of the crawler that the group's rules apply to.
For Example: If you want to block Bing from crawling your site, you would specify:
User-agent: Bingbot
Disallow: /
Explanation: The above blocks Bing's crawler (Bingbot) from crawling the entire website, using a single group.
Important Notes:
- Crawlers interpret these groups from top to bottom.
- The first group that matches a user agent is applied.
- All paths not disallowed are considered allowed.
- Rules are case-sensitive, and the order matters (see the example after this list).
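As an example of how groups are matched (the user agents and paths here are only illustrative):
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/
Googlebot matches the first group and is only blocked from /private/, while every other crawler falls through to the * group and is only blocked from /tmp/.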
Directives in Robots.txt
Here are the valid directives you can use in a robots.txt file.
User-Agent
Specifies the name of the web crawler the group's rules apply to.
Example:
User-agent: Googlebot
Allow
Permits crawlers to access specific parts of your site.
Example:
Allow: /public/
Disallow
Prevents crawlers from accessing certain parts of your site.
Example:
Disallow: /private/
Crawl-Delay
Sets a delay (in seconds) between crawled pages so that the crawler doesn't overload your website by crawling too quickly.
Example:
Crawl-delay: 10
Sitemap
Indicates the location of your website's sitemap (a list of all your pages).
Example:
Sitemap: https://onescales.com/sitemap.xml
Comments
If a line begins with a hash symbol (#), the line is treated as a comment and is not used as a rule at all. Comments are mostly for making notes to remind you of a specific point.
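For example (the rule shown is only illustrative):
# Block the internal search results page
User-agent: *
Disallow: /search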
Wildcard Usage
You can use the wildcard character (*) in a path to match any sequence of characters.
For example - Disallow: /*/collections/
Will block bots from crawling /files/collections/, /myimages/collections/, and any other path where some text appears in place of the asterisk (*).
/image* - Equivalent to /image
In this case, /image* and /image yield the same results; the trailing wildcard (*) is ignored. This directive will match any URL that starts with the string /image.
Matches:
- /image : Matches because it starts with /image.
- /image.html : Matches because it starts with /image.
- /image/page.html : Matches because it starts with /image.
- /imagefiles : Matches because it starts with /image.
- /imagefiles/pages.html : Matches because it starts with /image.
- /image.php?id=anything : Matches because it starts with /image.
Doesn't Match:
- /Image.asp : Doesn't match because matching is case-sensitive and the rule expects a lowercase 'i'.
- /myimage : Doesn't match because the URL path doesn't start with /image.
- /?id=image : Doesn't match because it starts with /?, not /image.
- /folder/image : Doesn't match because the URL path doesn't start with /image.
This rule blocks (or allows, depending on whether you're using Disallow or Allow) access to all URLs that start with /image, but not those that contain "image" elsewhere in the URL or those that use a different case, such as /Image.asp.
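Used inside a group, the pattern might look like this (the user agent is illustrative):
User-agent: *
Disallow: /image*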
/*.php$ - Matches Any Path That Ends with .php
The directive /*.php$ will match any URL path that ends exactly with the .php extension. The $ symbol specifies that the URL must end with .php.
Matches:
- /filename.php : Matches because it ends with .php.
- /folder/filename.php : Matches because it also ends with .php.
Doesn't Match:
- /filename.php?parameters : Doesn't match because it doesn't end with .php due to the query parameters.
- /filename.php/ : Doesn't match because the trailing slash means it doesn't end with .php.
- /filename.php5 : Doesn't match because it ends with .php5 and not .php.
- /windows.PHP : Doesn't match due to case sensitivity; the rule expects a lowercase .php.
This rule will block (or allow, if using Allow) access to all URLs that end exactly with .php, but won't apply to URLs that contain additional characters after .php or use a different case like /page.PHP.
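In a robots.txt file, such a rule might appear like this (again, only illustrative):
User-agent: *
Disallow: /*.php$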
Other Options to Robots.txt
robots.txt is great for general rules, but sometimes you need to set these rules in your site's code or server configuration instead. You can also use meta tags and HTTP headers.
Note that NOT all bots will follow these non-robots.txt directives, but they are good to know, since some platforms don't allow you to edit your robots.txt file or require you to set these rules in the code of specific pages.
Google, for example, does follow the directives below.
Meta Tags
Place the following code in the <head>
section of your HTML to prevent a page from being indexed:
<meta name="robots" content="noindex">
- What it does: Tells search engines not to index the page.
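You can also combine values in the same tag, for example to block both indexing and link-following (whether you want both depends on your goal):
<meta name="robots" content="noindex, nofollow">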
HTTP Headers
For non-HTML content, use the following HTTP header to achieve the same effect:
X-Robots-Tag: noindex
- What it does: Prevents the content from being indexed by search engines.
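For illustration, a response serving a PDF might carry the header like this (the status line and content type are just an example):
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex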
Limitations
- Only "good bots" honor robots.txt. Good bots such as legit companies include Google and Bing. Some bots, scrapers and malicious bots do not follow your guidelines and you will have to consider blocking them via code, firewall, captcha and/or other methods.
- Some bots follow instructions differently or don't follow all rules the same.
- Using robots.txt is not a foolproof way to prevent search engines from visiting your site or from indexing your site.
Real Example
Let's take the onescales.com robots.txt file and review what's inside.
The robots.txt file for onescales.com specifies various rules for web crawlers, including what they can and cannot access. Here's a quick rundown in 10 bullet points:
- General Rules: The User-agent: * section sets rules for all web crawlers. Pages like /admin, /cart, /orders, etc., are disallowed for crawling.
- Adsbot-Google: Specific rules are set for Google's Adsbot, disallowing it from crawling pages like /checkouts/, /checkout, /carts, etc.
- Nutch: All crawling is disallowed for the Nutch web crawler (User-agent: Nutch).
- AhrefsBot and AhrefsSiteAudit: These have specific delays (Crawl-delay: 10) and similar disallowed paths to the general rules.
- MJ12bot: It has a specified crawl delay of 10 seconds (Crawl-Delay: 10).
- Pinterest: It has a specified crawl delay of 1 second (Crawl-delay: 1).
- Sorting Filters: Crawling of URLs with sorting parameters in /collections/ is disallowed.
- Query Parameters: Crawling of URLs containing specific query parameters like oseid= and preview_theme_id is blocked.
- Sitemap: Specifies the location of the sitemap as https://onescales.com/sitemap.xml.
- Platform: The comment at the top indicates that onescales.com uses Shopify as its ecommerce platform.
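Pulling those points together, a condensed sketch of the file's structure might look like the following. This is abridged from the summary above and not the verbatim file; the exact paths and ordering on the live site may differ:
# onescales.com runs on Shopify (noted in a comment at the top of the real file)
User-agent: *
Disallow: /admin
Disallow: /cart
Disallow: /orders

User-agent: adsbot-google
Disallow: /checkouts/
Disallow: /checkout
Disallow: /carts

User-agent: Nutch
Disallow: /

User-agent: AhrefsBot
Crawl-delay: 10
Disallow: /admin

User-agent: MJ12bot
Crawl-Delay: 10

User-agent: Pinterest
Crawl-delay: 1

Sitemap: https://onescales.com/sitemap.xml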
Two Methods for Blocking
There are two main methods for blocking bots:
- Specify every bot you want to block, along with the rule for each (a blocklist).
- Specify the bots you want to allow and block all the rest (an allowlist), as shown in the sketch after this list.
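For illustration, here is a minimal sketch of each approach (the bot names and rules are examples only, not a recommendation):
Method 1 - block specific bots:
User-agent: Nutch
Disallow: /

User-agent: MJ12bot
Disallow: /

Method 2 - allow specific bots and block the rest:
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /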
Testing Tool
Use a testing tool to verify your robots.txt rules against example URLs. After inputting the rules and the URL you want to check, the tool will indicate whether the URL is allowed or disallowed. Results will be color-coded (green for allowed and red for disallowed) and the tool will highlight the specific rule that applies to the URL you're examining.
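For example, given these hypothetical rules:
User-agent: *
Disallow: /private/
a URL such as https://onescales.com/private/report.html would come back as disallowed (red), while https://onescales.com/about would come back as allowed (green).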
Allowed Example
Disallowed Example
Helpful Links
- Robots.txt Standard - The original specification behind the written guidelines
- Google Robots.txt
- Bots Blocked Research - Data on which bots the top websites online choose to block.