Introduction to Robots.txt

What Is Robots.txt?

The robots.txt file serves as a guide for bots and crawlers (such as search engines), specifying which parts of your website may be crawled (visited and read). It adheres to the Robots Exclusion Standard and consists of rules that either permit or block crawlers from accessing particular parts (pages, files, and folders) of your website.

If no rules are specified, crawlers are allowed to crawl all files on your website.
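For example, a minimal robots.txt might look like this (the /private/ path is just an illustration):

```txt
# Apply to all crawlers
User-agent: *
# Block the /private/ folder
Disallow: /private/
# Everything else remains crawlable
```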



Placement of Robots.txt

Your robots.txt file should be located in the root directory of your domain.

For instance, if your domain is onescales.com, your robots.txt file should be accessible at https://onescales.com/robots.txt

The robots.txt file's influence is limited to the domain, subdomain, and protocol (http / https) where it resides.

How to Write Robots.txt: Basic Guidelines

  1. Rules: These define what a crawler can or can't do on your website.
  2. Groups: A robots.txt file is divided into groups, each containing a set of rules (one or more).
  3. User Agent: Every group starts with a User-agent line, specifying the name of the crawler that the group's rules apply to.

    For Example: If you want to block Bing from crawling your site, you would specify:

    User-agent: Bingbot

    Disallow: /

     

Explanation: The above is one group that blocks the Bing bot (Bingbot) from crawling the entire website.

     

    Important Notes:

    • Crawlers interpret these groups from top to bottom.
    • The first group that matches a user agent is applied.
    • All paths not disallowed are considered allowed.
    • Rules are case-sensitive, and their order matters.
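You can check how a group like the Bingbot example behaves using Python's built-in urllib.robotparser (a quick sketch; the example.com URLs are placeholders):

```python
from urllib import robotparser

# Parse the example rules from a list of lines,
# just as a crawler would read them from robots.txt
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: Bingbot",
    "Disallow: /",
])

# Bingbot matches the group, so it is blocked from the entire site
print(rp.can_fetch("Bingbot", "https://example.com/any-page"))    # False

# Other crawlers match no group, so they default to allowed
print(rp.can_fetch("Googlebot", "https://example.com/any-page"))  # True
```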

    Directives in Robots.txt

Here are the valid directives you can use in a robots.txt file.

     

    User-Agent

Specifies the name of the web crawler the group's rules apply to.

    Example:

    User-agent: Googlebot

    Allow

    Permits crawlers to access specific parts of your site.

    Example:

    Allow: /public/

    Disallow

    Prevents crawlers from accessing certain parts of your site.

    Example:

    Disallow: /private/

    Crawl-Delay

Sets a delay (in seconds) between crawled pages so that the crawler doesn't overload your website and crawls slowly.

    Example:

    Crawl-delay: 10
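Python's built-in urllib.robotparser can read this directive back, which is handy for checking your file (a sketch; the AhrefsBot group here is illustrative):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: AhrefsBot",
    "Crawl-delay: 10",
    "Disallow: /admin",
])

# The delay applies only to the crawler named in the group
print(rp.crawl_delay("AhrefsBot"))   # 10
print(rp.crawl_delay("Googlebot"))   # None (no matching group)
```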

    Sitemap

Indicates the location of your website's sitemap (a list of all your pages).

    Example:

    Sitemap: https://onescales.com/sitemap.xml

    Comments

If a line begins with a hash (#), it is a comment and is not used as a rule at all. Comments are mostly for making notes to remind you of a specific point.
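Putting the directives together, a complete robots.txt might look like this (the paths and the AhrefsBot group are illustrative):

```txt
# Rules for all crawlers
User-agent: *
Disallow: /private/
Allow: /public/

# Slow down one specific crawler
User-agent: AhrefsBot
Crawl-delay: 10

# Sitemap location (applies to the whole file)
Sitemap: https://onescales.com/sitemap.xml
```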

    Wildcard Usage

The asterisk (*) wildcard matches any sequence of characters, so one rule can cover many paths.

For example - Disallow: /*/collections/

Will block bots from crawling /files/collections/, /myimages/collections/, and any other path where some text appears in place of the asterisk (*).

    /image* - Equivalent to /image

    In this case, /image* and /image yield the same results; the trailing wildcard * is ignored. This directive will match any URL that starts with the string /image.

    Matches:

    • /image: Matches because it starts with /image.
    • /image.html: Matches because it starts with /image.
    • /image/page.html: Matches because it starts with /image.
    • /imagefiles: Matches because it starts with /image.
    • /imagefiles/pages.html: Matches because it starts with /image.
    • /image.php?id=anything: Matches because it starts with /image.

    Doesn't Match:

    • /Image.asp: Doesn't match because it is case-sensitive and expects a lowercase 'i'.
    • /myimage: Doesn't match because the URL path doesn't start with /image.
    • /?id=image: Doesn't match because it starts with /?, not /image.
    • /folder/image: Doesn't match because the URL path doesn't start with /image.

    This rule blocks (or allows, depending on whether you're using Disallow or Allow) access to all URLs that start with /image, but not those that contain "image" elsewhere in the URL or those that use a different case like /Image.asp.

    /*.php$ - Matches Any Path That Ends with .php

    The directive /*.php$ will match any URL path that ends exactly with the .php extension. The $ symbol specifies that the URL should end with .php.

    Matches:

    • /filename.php: Matches because it ends with .php.
    • /folder/filename.php: Matches because it also ends with .php.

    Doesn't Match:

    • /filename.php?parameters: Doesn't match because it doesn't end with .php due to the query parameters.
    • /filename.php/: Doesn't match because the trailing slash means it doesn't end with .php.
    • /filename.php5: Doesn't match because it ends with .php5 and not .php.
    • /windows.PHP: Doesn't match due to case sensitivity; it expects a lowercase .php.

    This rule will block (or allow, if using Allow) access to all URLs that end exactly with .php, but won't apply to URLs that contain additional characters after .php or use a different case like /page.PHP.
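Python's built-in robotparser does not implement the * and $ extensions, but the matching semantics described above can be sketched with a small helper function (an illustration of the rules, not an official implementation):

```python
import re

def robots_pattern_matches(pattern: str, path: str) -> bool:
    """Check a URL path against a robots.txt pattern with * and $ support."""
    # A trailing '$' anchors the pattern to the end of the path
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; everything else is literal
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    regex = "^" + regex + ("$" if anchored else "")
    return re.match(regex, path) is not None

# /image* behaves like /image: a simple, case-sensitive prefix match
print(robots_pattern_matches("/image*", "/image/page.html"))      # True
print(robots_pattern_matches("/image*", "/myimage"))              # False

# /*.php$ matches only paths that end exactly with .php
print(robots_pattern_matches("/*.php$", "/folder/filename.php"))  # True
print(robots_pattern_matches("/*.php$", "/filename.php5"))        # False
```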

     

Alternatives to Robots.txt

robots.txt is great for general rules, but sometimes you need to specify crawling rules in your site's code or server configuration instead. You can also use meta tags and HTTP headers.

Note that NOT all bots follow these non-robots.txt guidelines, but they are good to know: some platforms don't allow you to edit your robots.txt, or you may need to set these rules in a specific page's code.

For example - Google does follow the methods below.

    Meta Tags

    Place the following code in the <head> section of your HTML to prevent a page from being indexed:

    <meta name="robots" content="noindex">
    • What it does: Tells search engines not to index the page.

    HTTP Headers

    For non-HTML content, use the following HTTP header to achieve the same effect:

    X-Robots-Tag: noindex
    • What it does: Prevents the content from being indexed by search engines.
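As one way to send this header, on an nginx server you could add it in your site configuration (a sketch; adjust the location block to your own setup):

```nginx
# Tell crawlers not to index any PDF files served from this site
location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex";
}
```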

    Limitations

1. Only "good bots" honor robots.txt. Good bots are run by legitimate companies such as Google and Bing. Some bots, scrapers, and malicious bots do not follow your guidelines, and you will have to consider blocking them via code, a firewall, a captcha, and/or other methods.
2. Some bots interpret the instructions differently or don't follow all of the rules in the same way.
3. Using robots.txt is not a foolproof way to prevent search engines from visiting your site or from indexing it.

    Real Example

Let's take onescales.com's robots.txt and review what's inside.

The robots.txt file for onescales.com specifies various rules for web crawlers, including what they can and cannot access. Here's a quick rundown in 10 points:

    1. General Rules: The User-agent: * section sets rules for all web crawlers. Pages like /admin, /cart, /orders, etc., are disallowed for crawling.

    2. Adsbot-Google: Specific rules are set for Google's Adsbot, disallowing it from crawling pages like /checkouts/, /checkout, /carts, etc.

    3. Nutch: All crawling is disallowed for the Nutch web crawler (User-agent: Nutch).

    4. AhrefsBot and AhrefsSiteAudit: These have specific delays (Crawl-delay: 10) and similar disallowed paths to the general rules.

    5. MJ12bot: It has a specified crawl delay of 10 seconds (Crawl-Delay: 10).

    6. Pinterest: It has a specified crawl delay of 1 second (Crawl-delay: 1).

    7. Sorting Filters: Disallowed crawling of URLs with sorting parameters in /collections/.

    8. Query Parameters: Blocks crawling of URLs containing specific query parameters like oseid=, preview_theme_id, etc.

    9. Sitemap: Specifies the location of the sitemap as https://onescales.com/sitemap.xml.

    10. Platform: The comment at the top indicates that onescales.com uses Shopify as their ecommerce platform.

     

    Two Methods for Blocking

There are two main methods for blocking bots:

1. Specify the entire list of bots you want to block, with a rule for each (a blocklist).
2. Specify the bots you want to allow and block all the rest (an allowlist).
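For instance, allowing only the bots you want and blocking the rest might look like this (assuming, as an illustration, that you only want Google crawling your site):

```txt
# Allow only Googlebot
User-agent: Googlebot
Allow: /

# Block every other crawler
User-agent: *
Disallow: /
```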

      Testing Tool

      1. Google Robots.txt Testing Tool

      Use a testing tool to verify your robots.txt rules against example URLs. After inputting the rules and the URL you want to check, the tool will indicate whether the URL is allowed or disallowed. Results will be color-coded—green for allowed and red for disallowed—and the tool will highlight the specific rule that applies to the URL you're examining.

(Screenshots: an allowed example shown in green, and a disallowed example shown in red.)

      Helpful Links

1. Robots.txt Standard - The original specification the guidelines are written against
2. Google Robots.txt
3. Bots Blocked Research - Data on which bots the top websites online block


