To disallow multiple paths for one user agent, write each path in a new disallow field.
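For example, a minimal sketch with placeholder paths:
User-agent: *
Disallow: /private/
Disallow: /tmp/
Disallow: /checkout/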
An empty disallow field allows all robots to crawl the entire site, which is often the default in robots.txt files. Here's a real-world example with many disallow fields from a site that uses Concrete:
User-Agent: *
Sitemap: https://www.annonayrhoneagglo.fr/sitemaps/index_default.xml
Sitemap: https://www.annonay.fr/sitemaps/index_ville_annonay.xml
Sitemap: https://www.villevocance.fr/sitemaps/index_villevocance.xml
Sitemap: https://www.saint-clair.fr/sitemaps/index_saint_clair.xml
Sitemap: https://www.serrieres.fr/sitemaps/index_serrieres.xml
Sitemap: https://www.ardoix.fr/sitemaps/index_ardoix.xml
Sitemap: https://www.boulieu.fr/sitemaps/index_boulieu_les_annonay.xml
Sitemap: https://www.talencieux.fr/sitemaps/index_talencieux.xml
Disallow: /application/attributes
Disallow: /application/authentication
Disallow: /application/bootstrap
Disallow: /application/config
Disallow: /application/controllers
Disallow: /application/elements
Disallow: /application/helpers
Disallow: /application/jobs
Disallow: /application/languages
Disallow: /application/mail
Disallow: /application/models
Disallow: /application/page_types
Disallow: /application/single_pages
Disallow: /application/tools
Disallow: /application/views
Disallow: /concrete
Disallow: /packages
Disallow: /tools
Disallow: /updates
Disallow: /login
Allow: */css/*
Allow: */js/*
Allow: */images/*
Allow: */fonts/*
Allow: /concrete/css/*
Allow: /concrete/js/*
Allow: /packages/*/*.js
Allow: /packages/*/*.css
Allow: /packages/*/fonts/*
Source: https://www.annonay.fr/robots.txt
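By contrast, the simplest allow-everything file, with an empty disallow field, is just two lines:
User-agent: *
Disallow: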
Allow Directive
This directive specifies a path that the user agents named above are allowed to crawl. By default, web crawlers go through every page that isn't disallowed, so there's usually no need to spell this out.
The best use for this field is allowing a page or folder inside a disallowed folder to be crawled. It overrides the disallow directive regardless of where it's placed within the group of rules for a user agent. Here's what it can look like:
User-agent: *
Disallow: /forbidden-folder
Allow: /forbidden-folder/page.html
Comments in Robots.txt
If you want to explain what a directive should do, write a comment by starting a new line with the hash symbol (#). Crawlers will disregard the whole line, and it won't break the robots.txt file.
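For instance (the path here is just an illustration):
# Keep crawlers out of internal search results
User-agent: *
Disallow: /search/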
Path Rules in Robots.txt
There are only a few rules in the path syntax of robots.txt:
- Path rules are case-sensitive.
- / matches the root and everything in it. It's used to refer to the whole website.
- /path matches every path that starts with this expression. Example: /path/folder/subfolder/page.html
- /path/ only matches the contents of this folder. /path itself won't be covered, but /path/page.html would (see the example below).
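Here's what that difference looks like in practice (the folder names are placeholders):
User-agent: *
# Blocks /downloads, /downloads/, /downloads/file.zip, and even /downloads-old
Disallow: /downloads
# Blocks URLs inside the folder, such as /archive/2023/report.html, but not /archive itself
Disallow: /archive/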
Using Wildcards
Google and other search engines recognize two wildcards:
- * matches any sequence of characters, so it covers all instances of an expression.
- $ marks the end of the URL.
The asterisk wildcard in the user agent field means all bots should follow the rules below it. Only user agents that have their own group of rules will ignore it. Here, Googlebot-Image will follow its own rule instead of browsing the site freely.
User-agent: *
Allow: /
User-agent: Googlebot-Image
Disallow: /
The wildcards are also useful for blocking all instances of a file type from being crawled. This rule would make sure no user agent can crawl .gif files on your site:
User-agent: *
Disallow: *.gif$
Blocking Search Result Pages
Another use case is blocking all instances of a commonly repeated URL segment. For instance, you can block all search result pages with this rule:
User-agent: *
Disallow: *?s=*
You can use the asterisk wildcard to block all URLs with query parameters. If you do that, make sure the pattern you're including in the disallow directive can't appear in regular URLs, as those would also be blocked from crawling.
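For example, a rule like this would block any URL that contains a question mark, and with it any query parameters:
User-agent: *
Disallow: /*?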
Implementing Best Practices and Other Rules
Now that you know the basic syntax of robots.txt, here are a few best practices that you should follow.
You can start or finish your robots.txt by including a sitemap reference. A sitemap is a document that points search engine crawlers to the important URLs on your site and can give a priority to each. Here's an example of a sitemap index:
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.softcat.com/sitemaps/uk-sitemap.xml</loc>
    <lastmod>2024-06-26T21:49:59+01:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.softcat.com/sitemaps/ie-sitemap.xml</loc>
    <lastmod>2024-06-26T21:49:59+01:00</lastmod>
  </sitemap>
</sitemapindex>
Include it as a full link in your robots.txt file:
Sitemap: https://www.yoursite.com/sitemap.xml
You can submit your sitemap to Google Search Console directly, but adding it to robots.txt also helps other crawlers discover it and improves crawling quality.
Put each directive on its own line. If you squeeze several rules onto one line, bots may still be able to read your document, but it will be harder for you to understand it and find possible mistakes.
Web crawlers can find and group rules for the same user agent, even if they are scattered in your document. But it's better to keep all the rules for a user agent in one place. Splitting them up makes it harder to find and fix issues and may create conflicting rules.
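For example, keep each user agent's rules together in a single group (the paths are placeholders):
User-agent: Googlebot
Disallow: /staging/
Disallow: /drafts/

User-agent: *
Disallow: /staging/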
Robots.txt is valid for a single subdomain and protocol only. So, all of these URLs are different sites for search engines, and a different robots.txt file applies to each:
- https://yoursite.com/
- http://yoursite.com/
- https://www.yoursite.com/
- http://www.yoursite.com/
These scope rules can lead to problems with crawling and indexing. Most of them won't arise if you have properly configured 301 redirects from the alternative versions of your site to the canonical one.
For subdomains, you'll have to either upload separate robots.txt files or redirect requests for them to the main file.
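To illustrate (blog.yoursite.com stands in for any subdomain), each host serves and obeys its own file:
https://yoursite.com/robots.txt applies only to https://yoursite.com/
https://blog.yoursite.com/robots.txt applies only to https://blog.yoursite.com/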
While you can block a file type from crawling, you shouldn't block CSS and JavaScript files. Doing so can prevent crawlers from rendering your pages correctly and understanding what they're about.
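If your CMS keeps those assets inside a disallowed folder, add explicit allow rules for them, as in this sketch (the folder names are examples):
User-agent: *
Disallow: /assets/
Allow: /assets/css/
Allow: /assets/js/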
Also don’t use robots.txt as the only way to restrict access to private content like databases or customer profiles. Hide sensitive content behind a log-in wall.
The final piece of advice is to stay on the safe side and keep your robots.txt simple: only huge sites with hundreds of thousands of URLs need large robots.txt files.
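For most sites, a minimal file along these lines is enough (the sitemap URL is a placeholder):
User-agent: *
Disallow:

Sitemap: https://www.yoursite.com/sitemap.xml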
Upload Your File
Once you're done creating crawling rules for your site, save the text file and name it "robots.txt." The name has to be lowercase; otherwise, crawlers will ignore it.
Upload it to the root directory of your site. The final URL should read:
https://yoursite.com/robots.txt
Visit this URL to confirm the file is uploaded correctly, and you’re all set. Google should discover the new file within 24 hours. You can request a recrawl in Google Search Console’s robots.txt report if you want it done faster.
Test Your Robots.txt
Test this file before Google has a chance to stop crawling an important page you’ve blocked by accident.
The first tool for testing is the robots.txt report in Google Search Console. It shows the robots.txt files Google last found for your site and flags problems with the latest version, such as rules being ignored.
After Google fetches the latest file, you can test pages with the GSC URL inspection tool. It will show if a URL can be indexed or if robots.txt is blocking it.
Google also provides a free, open-source robots.txt parser, but it's a tool that requires a fair amount of coding knowledge to use.
Summary
Robots.txt is a useful tool that can help you block entire sections of your website from being crawled and significantly decrease the likelihood of those URLs being indexed.
Use it to stop pages with query parameters from wasting your crawl budget or to prevent certain file types from being crawled. Use correct syntax to make sure robots.txt is doing what it's supposed to, and test it with Google Search Console.
Don’t forget that there are other tools for preventing access to parts of your site. Noindex meta tags are better for blocking individual pages from indexing, .htaccess rules are better for blocking malicious bots, and protecting parts of a website like a private intranet with passwords is better for cybersecurity.