Introduction to Robots.txt

What Is Robots.txt?

The robots.txt file serves as a guide for bots and crawlers (such as search engines), specifying which parts of your website may be crawled (visited and read). It adheres to the Robots Exclusion Standard and consists of rules that either permit or block crawlers from accessing particular parts (pages, files, and folders) of your website.

If no rules are specified, crawlers are allowed to crawl all files on your website.
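For example, a minimal robots.txt might look like this (the /private/ path is just an illustration):

```txt
# Apply to all crawlers
User-agent: *
# Block the /private/ folder
Disallow: /private/
# Everything else remains crawlable
```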



Placement of Robots.txt

Your robots.txt file should be located in the root directory of your domain.

For instance, if your domain is onescales.com, your robots.txt file should be accessible at https://onescales.com/robots.txt

The robots.txt file's influence is limited to the domain, subdomain, and protocol (http / https) where it resides.

How to Write Robots.txt: Basic Guidelines

  1. Rules: These define what a crawler can or can't do on your website.
  2. Groups: A robots.txt file is divided into groups, each containing a set of rules (one or more).
  3. User Agent: Every group starts with a User-agent line, specifying the name of the crawler that the group's rules apply to.

    For Example: If you want to block Bing from crawling your site, you would specify:

    User-agent: Bingbot

    Disallow: /

     

Explanation: The above is one group that blocks the Bing bot (Bingbot) from crawling the entire website.

     

    Important Notes:

    • Crawlers interpret these groups from top to bottom.
    • The first group that matches a user agent is applied.
    • All paths not disallowed are considered allowed.
    • Rules are case-sensitive, and their order matters.
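You can check how a group like the Bingbot example behaves using Python's built-in urllib.robotparser (a quick sketch; the example.com URLs are placeholders):

```python
from urllib import robotparser

# Parse the example rules from a list of lines,
# just as a crawler would read them from robots.txt
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: Bingbot",
    "Disallow: /",
])

# Bingbot matches the group, so it is blocked from the entire site
print(rp.can_fetch("Bingbot", "https://example.com/any-page"))    # False

# Other crawlers match no group, so they default to allowed
print(rp.can_fetch("Googlebot", "https://example.com/any-page"))  # True
```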

    Directives in Robots.txt

Here are the valid directives you can use in a robots.txt file.

     

    User-Agent

Specifies the name of the web crawler the group's rules apply to.

    Example:

    User-agent: Googlebot

    Allow

    Permits crawlers to access specific parts of your site.

    Example:

    Allow: /public/

    Disallow

    Prevents crawlers from accessing certain parts of your site.

    Example:

    Disallow: /private/

    Crawl-Delay

Sets a delay (in seconds) between crawled pages so that the crawler doesn't overload your website and crawls slowly.

    Example:

    Crawl-delay: 10
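Python's built-in urllib.robotparser can read this directive back, which is handy for checking your file (a sketch; the AhrefsBot group here is illustrative):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: AhrefsBot",
    "Crawl-delay: 10",
    "Disallow: /admin",
])

# The delay applies only to the crawler named in the group
print(rp.crawl_delay("AhrefsBot"))   # 10
print(rp.crawl_delay("Googlebot"))   # None (no matching group)
```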

    Sitemap

Indicates the location of your website's sitemap (a list of all your pages).

    Example:

    Sitemap: https://onescales.com/sitemap.xml

    Comments

If a line begins with a hash (#), it is a comment and is not used as a rule at all. Comments are mostly for making notes to remind you of a specific point.
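Putting the directives together, a complete robots.txt might look like this (the paths and the AhrefsBot group are illustrative):

```txt
# Rules for all crawlers
User-agent: *
Disallow: /private/
Allow: /public/

# Slow down one specific crawler
User-agent: AhrefsBot
Crawl-delay: 10

# Sitemap location (applies to the whole file)
Sitemap: https://onescales.com/sitemap.xml
```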

    Wildcard Usage

The asterisk (*) wildcard matches any sequence of characters, so one rule can cover many paths.

For example - Disallow: /*/collections/

Will block bots from crawling /files/collections/, /myimages/collections/, and any other path where some text appears in place of the asterisk (*).

    /image* - Equivalent to /image

    In this case, /image* and /image yield the same results; the trailing wildcard * is ignored. This directive will match any URL that starts with the string /image.

    Matches:

    • /image: Matches because it starts with /image.
    • /image.html: Matches because it starts with /image.
    • /image/page.html: Matches because it starts with /image.
    • /imagefiles: Matches because it starts with /image.
    • /imagefiles/pages.html: Matches because it starts with /image.
    • /image.php?id=anything: Matches because it starts with /image.

    Doesn't Match:

    • /Image.asp: Doesn't match because it is case-sensitive and expects a lowercase 'i'.
    • /myimage: Doesn't match because the URL path doesn't start with /image.
    • /?id=image: Doesn't match because it starts with /?, not /image.
    • /folder/image: Doesn't match because the URL path doesn't start with /image.

    This rule blocks (or allows, depending on whether you're using Disallow or Allow) access to all URLs that start with /image, but not those that contain "image" elsewhere in the URL or those that use a different case like /Image.asp.

    /*.php$ - Matches Any Path That Ends with .php

    The directive /*.php$ will match any URL path that ends exactly with the .php extension. The $ symbol specifies that the URL should end with .php.

    Matches:

    • /filename.php: Matches because it ends with .php.
    • /folder/filename.php: Matches because it also ends with .php.

    Doesn't Match:

    • /filename.php?parameters: Doesn't match because it doesn't end with .php due to the query parameters.
    • /filename.php/: Doesn't match because the trailing slash means it doesn't end with .php.
    • /filename.php5: Doesn't match because it ends with .php5 and not .php.
    • /windows.PHP: Doesn't match due to case sensitivity; it expects a lowercase .php.

    This rule will block (or allow, if using Allow) access to all URLs that end exactly with .php, but won't apply to URLs that contain additional characters after .php or use a different case like /page.PHP.
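Python's built-in robotparser does not implement the * and $ extensions, but the matching semantics described above can be sketched with a small helper function (an illustration of the rules, not an official implementation):

```python
import re

def robots_pattern_matches(pattern: str, path: str) -> bool:
    """Check a URL path against a robots.txt pattern with * and $ support."""
    # A trailing '$' anchors the pattern to the end of the path
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; everything else is literal
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    regex = "^" + regex + ("$" if anchored else "")
    return re.match(regex, path) is not None

# /image* behaves like /image: a simple, case-sensitive prefix match
print(robots_pattern_matches("/image*", "/image/page.html"))      # True
print(robots_pattern_matches("/image*", "/myimage"))              # False

# /*.php$ matches only paths that end exactly with .php
print(robots_pattern_matches("/*.php$", "/folder/filename.php"))  # True
print(robots_pattern_matches("/*.php$", "/filename.php5"))        # False
```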

     

Alternatives to Robots.txt

robots.txt is great for general rules, but sometimes you need to specify crawling rules in your site's code or server configuration instead. You can also use meta tags and HTTP headers.

Note that NOT all bots follow these non-robots.txt guidelines, but they are good to know: some platforms don't allow you to edit your robots.txt, or you may need to set these rules in a specific page's code.

For example - Google does follow the methods below.

    Meta Tags

    Place the following code in the <head> section of your HTML to prevent a page from being indexed:

    <meta name="robots" content="noindex">
    • What it does: Tells search engines not to index the page.

    HTTP Headers

    For non-HTML content, use the following HTTP header to achieve the same effect:

    X-Robots-Tag: noindex
    • What it does: Prevents the content from being indexed by search engines.
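As one way to send this header, on an nginx server you could add it in your site configuration (a sketch; adjust the location block to your own setup):

```nginx
# Tell crawlers not to index any PDF files served from this site
location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex";
}
```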

    Limitations

1. Only "good bots" honor robots.txt. Good bots are run by legitimate companies such as Google and Bing. Some bots, scrapers, and malicious bots do not follow your guidelines, and you will have to consider blocking them via code, a firewall, a captcha, and/or other methods.
2. Some bots interpret the instructions differently or don't follow all of the rules in the same way.
3. Using robots.txt is not a foolproof way to prevent search engines from visiting your site or from indexing it.

    Real Example

Let's take onescales.com's robots.txt and review what's inside.

The robots.txt file for onescales.com specifies various rules for web crawlers, including what they can and cannot access. Here's a quick rundown in 10 points:

    1. General Rules: The User-agent: * section sets rules for all web crawlers. Pages like /admin, /cart, /orders, etc., are disallowed for crawling.

    2. Adsbot-Google: Specific rules are set for Google's Adsbot, disallowing it from crawling pages like /checkouts/, /checkout, /carts, etc.

    3. Nutch: All crawling is disallowed for the Nutch web crawler (User-agent: Nutch).

    4. AhrefsBot and AhrefsSiteAudit: These have specific delays (Crawl-delay: 10) and similar disallowed paths to the general rules.

    5. MJ12bot: It has a specified crawl delay of 10 seconds (Crawl-Delay: 10).

    6. Pinterest: It has a specified crawl delay of 1 second (Crawl-delay: 1).

    7. Sorting Filters: Disallowed crawling of URLs with sorting parameters in /collections/.

    8. Query Parameters: Blocks crawling of URLs containing specific query parameters like oseid=, preview_theme_id, etc.

    9. Sitemap: Specifies the location of the sitemap as https://onescales.com/sitemap.xml.

    10. Platform: The comment at the top indicates that onescales.com uses Shopify as their ecommerce platform.

     

    Two Methods for Blocking

There are two main methods for blocking bots:

1. Specify the entire list of bots you want to block, with a rule for each (a blocklist).
2. Specify the bots you want to allow and block all the rest (an allowlist).
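For instance, allowing only the bots you want and blocking the rest might look like this (assuming, as an illustration, that you only want Google crawling your site):

```txt
# Allow only Googlebot
User-agent: Googlebot
Allow: /

# Block every other crawler
User-agent: *
Disallow: /
```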

      Testing Tool

      1. Google Robots.txt Testing Tool

      Use a testing tool to verify your robots.txt rules against example URLs. After inputting the rules and the URL you want to check, the tool will indicate whether the URL is allowed or disallowed. Results will be color-coded—green for allowed and red for disallowed—and the tool will highlight the specific rule that applies to the URL you're examining.

(Screenshots: an allowed example shown in green, and a disallowed example shown in red.)

      Helpful Links

1. Robots.txt Standard - The original specification the guidelines are written against
2. Google Robots.txt
3. Bots Blocked Research - Data on which bots the top websites online block


