We have been hearing about the robots.txt file for many years. It is nothing but a short text file that tells Googlebot and other web crawlers which pages they are allowed to crawl on your website. Robots.txt plays a vital role in getting the most important pages crawled first and keeping bots away from the least important ones. You can also block bots from crawling a page entirely if needed.
Robots.txt is a file placed in the root directory of your website that you can use to control how search engines crawl and index your site. Using robots.txt, you can specify which pages you want to be crawled and indexed by search engines and which pages are off-limits.
Where can you find the robots.txt file?
It is not hard to find a robots.txt file: go to your homepage URL, add “/robots.txt” at the end, and press Enter. The link will look like this:
https://yourdomain.com/robots.txt
It is a public file, so you can view the robots.txt of any website in the same way. Every big website, such as Apple, Amazon and Twitter, uses a robots.txt file to control which of its pages crawlers can access.
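As a quick illustration, a typical robots.txt file might look something like the sketch below. The paths and the sitemap URL are hypothetical placeholders and will differ from site to site:
User-agent: *
Disallow: /cart/
Disallow: /search/
Sitemap: https://yourdomain.com/sitemap.xml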
Why is the Robots.txt file important for Web Pages?
The robots.txt file is important because the web pages of your site need to be crawled before they can be displayed in the search results. This file tells crawlers which pages they may access and interact with.
But sometimes we also have to keep bots away from certain pages, for instance login pages or pages with little or no content worth displaying. The robots.txt file lets us do exactly that: crawlers read it first and then only crawl the pages they are allowed to visit.
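For instance, a minimal rule that keeps all crawlers away from a login page could look like this (the /login/ path is only an illustration; substitute whatever path your site actually uses):
User-agent: *
Disallow: /login/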
Remember that we can prohibit crawlers from visiting certain pages, but robots.txt alone cannot stop those pages from being indexed. Search engines sometimes index pages that are excluded from crawling because external links on other websites point to them. Such pages can appear in the search results, but with little value, as the bots could not crawl any information from them. (To keep a page out of the index entirely, the usual approach is a noindex robots meta tag on the page itself.)
Using the robots.txt file also brings a few SEO benefits in certain situations:
- Avoid Duplicate Content
The robots.txt file can help you keep similar or duplicate content out of the pages indexed by search engines and control how search engines interact with your website. You can use robots.txt to manage content that doesn’t need to be crawled by search engines (such as PDFs). If a page serves identical content across multiple URLs (e.g., www vs non-www), it is worth adding a Disallow rule for the duplicate URLs in your robots.txt file.
You can tell Googlebot to ignore those pages by adding the following lines to your robots.txt file:
User-agent: Googlebot
Disallow: /duplicate-content/
Once you’ve added these lines to your robots.txt file, Google will no longer crawl the URLs under /duplicate-content/ on your website.
- Crawl Budget Optimization
The crawl budget is the number of pages Googlebot will crawl on your site within a certain period, and most websites have many unimportant pages that do not need to be crawled and indexed frequently.
This is where a robots.txt file is much needed: it tells search engines which pages are important to crawl and which to avoid, which increases crawling efficiency and makes better use of the crawl budget.
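As a sketch, a site could preserve its crawl budget by disallowing low-value sections such as internal search results or tag archives while pointing crawlers to the sitemap; the paths below are hypothetical examples:
User-agent: *
Disallow: /internal-search/
Disallow: /tag/
Sitemap: https://www.example.com/sitemap.xml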
- Prevents Server Overload and Crashes
Using robots.txt can help keep your server from being overloaded or crashing by controlling which parts of your site crawlers may request. You may wish to block access for crawlers that visit your site too frequently. In these instances, robots.txt can tell crawlers which parts of your site to visit and which to stay away from.
Blocking bots that cause site issues can help reduce server load and improve site performance. It is especially important if a bad bot is sending too many requests or a scraper is trying to copy all your website’s content.
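For example, if one particular crawler is hammering your server, you can single it out by its user agent and block it entirely. “ExampleScraperBot” below is a placeholder name, not a real bot, and keep in mind that only bots that choose to obey robots.txt will respect the rule:
User-agent: ExampleScraperBot
Disallow: /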
How does the Robots.txt file work?
The main purpose of a robots.txt file is to instruct search engine crawlers on how to find, access and index the content on a site. The file consists of two main elements: user-agents, which identify the crawlers a set of rules applies to, and directives, which tell those crawlers what they should and should not do.
User-agents: They identify which crawler a group of rules applies to, so you can direct specific crawlers away from certain pages.
Directives: They are the instructions that tell the matched user-agents what they should and should not do with certain pages.
- User-agent
Let’s see an example including both the elements:
User-agent: Bingbot
Disallow: /wp-admin/
A user agent is the name assigned to a specific crawler, and the directives that follow it tell that crawler how to treat your website. There are many user agents we can target: for Google’s crawler it is “Googlebot”, for Bing’s crawler it is “Bingbot”, and so on.
If you want to address all web crawlers with a given set of directives at once, use the “*” symbol. This wildcard applies the rules to every bot that obeys the robots.txt file. For example:
User-agent: *
Disallow: /wp-admin/
- Directives
The directives in robots.txt tell the crawlers which parts of your website they should and should not crawl. By default, crawlers will crawl every page they can find, so you must specify which pages or sections of your website should be left alone.
There are three main directives:
“Allow” – indicates that crawlers can access specific pages within an otherwise disallowed section of the site.
“Disallow” – stops crawlers from accessing certain pages or parts of your website. Add a “Disallow” directive to the robots.txt file followed by the URL path, with one path per Disallow line; you can list as many of these lines as you need.
“Sitemap” – if you have created an XML sitemap, robots.txt can point web crawlers to it so that they can find the pages you wish to have crawled and indexed.
For example:
User-agent: Bingbot
Disallow: /wp-admin/
Allow: /wp-admin/random-content.php
Sitemap: https://www.example.com/sitemap.xml
- Wildcard ( * )
The asterisk (*) is a wildcard that matches any sequence of characters, which lets you write pattern-based rules. This is especially useful for websites with automatically generated content, filtered product pages and the like.
For example, instead of disallowing every filtered product URL individually under the /products/ section:
User-agent: *
Disallow: /products/shirts?
Disallow: /products/t-shirts?
Disallow: /products/belts?
We can use “*” to disallow all at once:
User-agent: *
Disallow: /products/*?
- #
The # (hash) sign indicates a comment or annotation, not a directive.
For example:
# Crawlers should not visit our login page!
User-agent: *
Disallow: /wp-admin/
- $
The dollar sign ($) marks the end of a URL. With it, you can tell search engine crawlers that a rule applies only to URLs that end in a certain way.
For example, to tell the bots to ignore all URLs that end with “.jpg”:
User-agent: *
Disallow: /*.jpg$
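The asterisk and the dollar sign can also be combined. For example, the following sketch would keep crawlers away from every URL on the site that ends in “.pdf”:
User-agent: *
Disallow: /*.pdf$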
Conclusion
The robots.txt file can get complicated, so it is best to keep things simple. Here are a few tips for creating and updating your robots.txt file:
- Use a separate robots.txt file for each subdomain you own; each file lives at that subdomain’s root.
- Make sure the directives in your robots.txt file are grouped logically by the user agent they apply to. This will help you create a simple, well-organized file (see the sketch after this list).
- It is important to specify exact URL paths.
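To illustrate the grouping tip above, a small, well-organized robots.txt might look like the sketch below; the user agents are real crawler names, but the disallowed paths are hypothetical:
User-agent: Googlebot
Disallow: /wp-admin/
Disallow: /internal-search/

User-agent: Bingbot
Disallow: /wp-admin/

Sitemap: https://www.example.com/sitemap.xml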
For more information on the robots.txt file and how it works, contact Ahbiv Digital Agency, where our team is ready to help you understand this topic.