Robots.txt Guide: The Hidden Ruleset Your Website Needs

Feb 18, 2025
by jessicadunbar

There is a whole range of website components you can work on to make a site helpful to users and strong in search. Keeping content up to date and paying attention to off-page and technical aspects all affect user experience and SEO.

One small element of your site influences an even more fundamental factor: how search engines crawl your site. That element is the robots.txt file. Let's find out what it is and how it can influence your website's performance.

What is a Robots.txt File?

One of the most important parts of your site, at least for SEO purposes, is a plain text document that weighs just a few kilobytes.

The file contains rules that tell different crawlers how to navigate your site. By default, crawlers follow links through your entire website. Robots.txt can block parts of the site, such as specific pages, folders, or file types, from being crawled (and, in most cases, indexed) by Google's crawlers and other search bots.

Here's what a robots.txt file looks like on one of the websites built with Concrete CMS:

User-agent: *

Sitemap: https://www.softcat.com/sitemaps/sitemap.xml

Disallow: /application/attributes
Disallow: /application/authentication
Disallow: /application/bootstrap
Disallow: /application/config
Disallow: /application/controllers
Disallow: /application/elements
Disallow: /application/helpers
Disallow: /application/jobs
Disallow: /application/languages
Disallow: /application/mail
Disallow: /application/models
Disallow: /application/page_types
Disallow: /application/single_pages
Disallow: /application/tools
Disallow: /application/views
Disallow: /ccm/system/captcha/picture

# System
Disallow: /LICENSE.TXT
Disallow: /phpcs.xml
Disallow: /phpstan.neon
Disallow: /composer.*

# Existing
Disallow: /*.php$
Disallow: /CVS
Disallow: /*.svn$
Disallow: /*.idea$
Disallow: /*.sql$
Disallow: /*.tgz$

This file blocks all crawlers from accessing certain paths in the /application/ folder and certain file types, such as PHP files. A robots.txt file isn't required; you can skip creating one and still have a well-performing website. But this rulebook for crawlers gives you more control over your site's crawlability and is usually easier to manage than other methods of giving instructions to search bots.

Importance of Robots.txt File

To understand why robots.txt is such an important file, you need to know what crawling and crawl budget are. Google and other search engines discover your pages with crawler bots that follow links. These crawlers start either from a sitemap you've submitted to Google yourself or from a link to your site on another website.

Every site has a crawl budget: the number of URLs Google's bots will crawl on your website within a given timeframe. It's not unlimited, even for huge websites with a good reputation. So if every page of your website is accessible to crawlers, you may find your crawl budget spent on unimportant (or even private) pages while the ones most important to your business are left uncrawled and, as a result, not indexed.

To avoid this, block unimportant parts of your site, along with private and admin areas, from being crawled. A robots.txt disallow directive asks crawlers not to visit those pages. However, robots.txt doesn't guarantee a page won't eventually appear in Google search results.

Google's crawlers can still discover a blocked page by following an external link pointing to it. The URL can then be added to the index and shown in Google Search.

If you want to make sure a page never shows up in public Google Search results, use the noindex value in a robots meta tag or in the X-Robots-Tag HTTP response header.
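
Both are a single line. The meta tag goes in the page's <head>, while the header is sent with the HTTP response (how you set it depends on your server):

<meta name="robots" content="noindex">

X-Robots-Tag: noindex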

Crawlers going through your site also take up server bandwidth, so blocking some pages from crawling can slightly decrease server load. This isn't the main reason to create a robots.txt file, but it's a nice side benefit.

How to Create a Robots.txt for Your Site

The robots.txt file is simple, but small details determine whether it works properly or accidentally blocks pages you want search engines to see. For instance, say you want to block the tags page from being crawled and phrase the rule as:

Disallow: /tags

You'll actually block every URL that starts with /tags, including:

/tags/gifts/product-page-that-needs-to-be-crawled
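
If the goal is to block only the /tags page itself, one option is to anchor the rule with the $ wildcard (covered later in this guide), which leaves everything under /tags/ crawlable:

User-agent: *
Disallow: /tags$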

To avoid mistakes like this, here's a short guide on how to use robots.txt the right way.

Understand & Use Robots.txt Syntax Correctly

The first step is writing the document correctly. Here are the most important things to know about robots.txt syntax.

Google supports the following four fields (also called rules and directives):

  • User-agent
  • Allow
  • Disallow
  • Sitemap

Google doesn’t support crawl-delay. You can include this field for Bing and Yahoo. To influence this parameter in Google, adjust the Crawl rate in your Google Search Console.
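
For the crawlers that do honor it, crawl-delay sets roughly how many seconds a bot should wait between requests. A minimal sketch aimed at Bing's crawler:

User-agent: Bingbot
Crawl-delay: 10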

User-agent

This field specifies which crawler bots should follow the rules below it. Here's how you would write it in a file:

User-agent: [agent]
Allow: [path]

Now, the agent you specified is allowed to crawl the specified path.

You can list multiple path directives under one user agent:

User-agent: [agent]
Disallow: [path-1]
Disallow: [path-2]
Allow: [path-3]

Now, the specified agent is forbidden from crawling paths one and two and allowed to crawl path three.

Multiple User Agents

Giving the same instructions to multiple user agents is also possible:

User-agent: [agent-1]
User-agent: [agent-2]
Disallow: [path-1]
Disallow: [path-2]
Allow: [path-3]

Common User Agents

Crawlers identify themselves with different user agent names. Some search engines use a single name, whereas Google has several:

  • Googlebot-Image
  • Googlebot-Mobile
  • Googlebot-News
  • Googlebot-Video
  • Storebot-Google
  • Mediapartners-Google
  • AdsBot-Google

This is useful if, for example, you don't need Google News traffic and want to save server resources and crawl budget. In that case, block your whole site for the Googlebot-News user agent and allow all others, as shown below.
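
A minimal version of that setup:

User-agent: Googlebot-News
Disallow: /

User-agent: *
Allow: /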

Blocking SEO & Other Bots

There are hundreds of web crawling tools, and you can block some bots besides search engines. For example, you can block SEO tool bots if you don’t want competitors to see your ranking keywords.

But remember that a robots.txt file is a set of suggestions, not commands. Well-behaved bots will consult and respect it, but many others won't, especially malicious ones.

If you want to block a third-party bot for good, use .htaccess.
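
On an Apache server, for example, a few mod_rewrite lines in .htaccess can return a 403 to any request from a matching user agent. This is a rough sketch; it assumes mod_rewrite is enabled, and BadBot is a placeholder name:

# Deny requests whose user agent contains "BadBot" (placeholder name)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F,L]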

Disallow

The disallow directive specifies a path that should be blocked from crawling. Content on the page won't be read and indexed, but if Google finds an external link pointing to that page, it might still index the URL itself.

To disallow multiple paths for one user agent, write each path in a new disallow field:

User-agent: *
Disallow: /admin/
Disallow: /confidential/

An empty disallow field allows all robots to crawl the entire site, and many default robots.txt files start out that way. In practice, files usually combine disallow rules with allow and sitemap directives, like this one from a site that uses Concrete:

User-Agent: *
Sitemap: https://www.annonayrhoneagglo.fr/sitemaps/index_default.xml
Sitemap: https://www.annonay.fr/sitemaps/index_ville_annonay.xml
Sitemap: https://www.villevocance.fr/sitemaps/index_villevocance.xml
Sitemap: https://www.saint-clair.fr/sitemaps/index_saint_clair.xml
Sitemap: https://www.serrieres.fr/sitemaps/index_serrieres.xml
Sitemap: https://www.ardoix.fr/sitemaps/index_ardoix.xml
Sitemap: https://www.boulieu.fr/sitemaps/index_boulieu_les_annonay.xml
Sitemap: https://www.talencieux.fr/sitemaps/index_talencieux.xml
Disallow: /application/attributes
Disallow: /application/authentication
Disallow: /application/bootstrap
Disallow: /application/config
Disallow: /application/controllers
Disallow: /application/elements
Disallow: /application/helpers
Disallow: /application/jobs
Disallow: /application/languages
Disallow: /application/mail
Disallow: /application/models
Disallow: /application/page_types
Disallow: /application/single_pages
Disallow: /application/tools
Disallow: /application/views
Disallow: /concrete
Disallow: /packages
Disallow: /tools
Disallow: /updates
Disallow: /login
Allow: */css/*
Allow: */js/*
Allow: */images/*
Allow: */fonts/*
Allow: /concrete/css/*
Allow: /concrete/js/*
Allow: /packages/*/*.js
Allow: /packages/*/*.css
Allow: /packages/*/fonts/*

Source: https://www.annonay.fr/robots.txt

Allow Directive

This directive specifies a path the user agents above it are allowed to crawl. By default, web crawlers go through every page that isn't disallowed, so there's usually no need to specify it.

The best use for this field is allowing the crawling of a page or folder within a disallowed folder. It will override the disallow directive regardless of where it’s placed within the group of rules for a user agent. Here’s what it can look like:

User-agent: *
Disallow: /forbidden-folder
Allow: /forbidden-folder/page.html

Comments in Robots.txt

If you want to explain what a directive does, add a comment by starting a line with the hash symbol (#). Crawlers will disregard the whole line, and it won't break the robots.txt file.
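
For example:

# Keep all crawlers out of the staging area (example path)
User-agent: *
Disallow: /staging/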

Path Rules in Robots.txt

There are only a few rules in the path syntax of robots.txt:

  • Path rules are case-sensitive.
  • / matches the root and everything under it. It's used to refer to the whole website.
  • /path matches every URL that starts with this expression, including /path itself, /path.html, /pathway, and /path/folder/subfolder/page.html.
  • /path/ matches everything inside that folder and its subfolders, such as /path/page.html, but not /path itself or /pathway (see the example after this list).
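
Written as two alternative rules (the paths are placeholders), the difference looks like this:

User-agent: *
# Blocks /path, /path.html, /pathway, and everything under /path/
Disallow: /path

User-agent: *
# Blocks only URLs inside the folder, such as /path/page.html, not /path itself
Disallow: /path/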

Using Wildcards

Google and other search engines recognize two wildcards:

  • * matches any sequence of characters, including none.
  • $ marks the end of the URL.

A * in the user-agent field means every bot should follow the rules below it; only user agents that have their own group elsewhere in the file will ignore it. In the example below, Googlebot-Image follows its own rule and is blocked from the whole site, while all other bots can crawl it freely:

User-agent: *
Allow: /

User-agent: Googlebot-Image
Disallow: /

The wildcards are also useful for blocking all instances of a file type from being crawled. This rule would make sure no user agent can crawl .gif files on your site:

User-agent: *
Disallow: *.gif$

Blocking Search Result Pages

Another use case is blocking all instances of a commonly repeated URL segment. For instance, you can block all internal search results pages with this rule:

User-agent: *
Disallow: *?s=*

You can also use the asterisk wildcard to block all URLs with query parameters. If you do, make sure the string in the disallow directive doesn't appear in regular URLs, as those would be blocked from crawling too.
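
A minimal rule that blocks every URL containing a query string looks like this:

User-agent: *
Disallow: /*?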

Implementing Best Practices and Other Rules

Now that you know the basic syntax of robots.txt, here are a few best practices that you should follow.

You can start or finish your robots.txt by referencing a sitemap. A sitemap is an XML document that points search engine crawlers to the important URLs on your site and can assign a priority to each. Here's an example of a sitemap index:

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>https://www.softcat.com/sitemaps/uk-sitemap.xml</loc>
        <lastmod>2024-06-26T21:49:59+01:00</lastmod>
    </sitemap>
    <sitemap>
        <loc>https://www.softcat.com/sitemaps/ie-sitemap.xml</loc>
        <lastmod>2024-06-26T21:49:59+01:00</lastmod>
    </sitemap>
</sitemapindex>

Include it as a full link in your robots.txt file:

Sitemap: https://www.yoursite.com/sitemap.xml

You can also submit the sitemap to Google Search Console directly; listing it in robots.txt additionally makes it discoverable to any crawler that reads the file.

Put each directive on its own line. Squeezing several rules onto one line makes the file harder for you to read and check for mistakes, and may cause crawlers to misinterpret it.

Web crawlers can find and group rules for the same user agent, even if they are scattered in your document. But it's better to keep all the rules for a user agent in one place. Splitting them up makes it harder to find and fix issues and may create conflicting rules.

A robots.txt file is valid for a single host and protocol. So, all of these URLs count as different sites for search engines, and a different robots.txt file applies to each:

  • https://yoursite.com/
  • http://yoursite.com/
  • https://www.yoursite.com/
  • http://www.yoursite.com/

This can lead to crawling and indexing problems, though most won't arise if you've set up proper 301 redirects from the alternate versions of your site to the canonical one.

For subdomains, you'll have to either upload a separate robots.txt file for each one or redirect their robots.txt URLs to the main file.

While you can block a file type from crawling, you shouldn't block CSS and JavaScript files. Doing so can prevent crawlers from rendering your pages correctly and understanding what they're about.

Also, don't use robots.txt as the only way to restrict access to private content like databases or customer profiles. Hide sensitive content behind a login wall.

The final piece of advice is to keep robots.txt simple. Only huge sites with hundreds of thousands of URLs need large robots.txt files.

Upload Your File

Once you've created the crawling rules for your site, save the text file and name it "robots.txt". The name has to be lowercase; otherwise, crawlers will ignore the file.

Upload it to the root directory of your site. The final URL should read:

https://yoursite.com/robots.txt

Visit this URL to confirm the file is uploaded correctly, and you’re all set. Google should discover the new file within 24 hours. You can request a recrawl in Google Search Console’s robots.txt report if you want it done faster.

Test Your Robots.txt

Test this file before Google has a chance to stop crawling an important page you’ve blocked by accident.

The first tool for testing is the robots.txt report in Google Search Console. It shows the robots.txt files Google last found for your property and flags problems with the latest version, such as rules being ignored.

After Google fetches the latest file, you can test pages with the GSC URL inspection tool. It will show if a URL can be indexed or if robots.txt is blocking it.

Google also provides a free, open-source robots.txt parser, but it's a tool that requires coding knowledge to use.

Summary

Robots.txt is a useful tool that can help you block entire sections of your website from being crawled and significantly decrease the likelihood of those URLs being indexed.

Use it to stop pages with query parameters from wasting your crawl budget or to keep certain file types from being crawled. Use correct syntax to make sure robots.txt does what it's supposed to, and test it with Google Search Console.

Don't forget that there are other tools for restricting access to parts of your site. Noindex meta tags are better for keeping individual pages out of the index, .htaccess rules are better for blocking malicious bots, and password protection is the right choice for securing private areas like an intranet.