
Robots.txt

Have you ever wondered whether there is a way to tell search engine bots where they can and can’t go on your website?

It’s like hanging a “do not disturb” sign on your hotel room door so that housekeeping knows to skip your room during the daily cleaning.

Fortunately, that is possible, and one of the main ways to do it is with robots.txt.

In this lesson, you’re going to learn what robots.txt is, why you need to use it, and how to use it.


What is Robots.txt?

A robots.txt file, also known as the Robots Exclusion Protocol, is a text file that webmasters create to instruct search engines how to crawl their site.

The instructions specify which parts of the website search engine bots are allowed or disallowed to crawl.

From the earlier lesson, you learned that search engine bots discover and index pages by crawling them.

But before these bots crawl anything on a site, they open the domain’s robots.txt file first.

From there, they know exactly which parts of the website the webmaster has made available to crawl and which are off limits.
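
For example, a bot that wants to crawl example.com will request example.com/robots.txt before anything else. A minimal sketch of what it might find there (the /private/ path is just a placeholder) looks like this:

User-agent: *

Disallow: /private/

This tells every bot that it may crawl the whole site except anything under /private/.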


Why Should We Use Robots.txt?

You might be wondering why you would want to block some of your pages from being crawled by search engines.

The main reason to do that is to save crawl budget.

There are tons of Googlebot spiders on the web, but they’re not infinite.

With the overwhelming number of new pages appearing on the web every day, it is impossible for Googlebot to catch every single one.

This is why the spider bots set a budget, or limit, for each website they find on the web to save their valuable time.

By blocking the less important pages on your site, you save crawl budget and let the important pages get crawled by the search engine.


Robots.txt Syntax

A basic robots.txt file actually consists of just two lines:

User-agent:

Disallow:

So, if you have these two lines in your robots.txt file, you already have the foundation in place.

Now, let’s understand what these two lines mean:


1. User-Agent

The first line (User-agent) indicates which specific bot the rules are addressed to.

You can either have one block for all search engines by adding *:

User-agent: * 

or specific blocks for specific search engines by adding the bot name:

User-agent: Googlebot

As you know, there are tons of other search engines besides Google, and each of them has its own bot name. You can see the full list here.

The search engine spider will always follow the block that best matches its name.
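
For instance, in the sketch below (the /archive/ path is just a placeholder), Googlebot follows only the block addressed to it and can crawl everything except /archive/, while every other bot matches the * block and is kept off the entire site:

User-agent: Googlebot

Disallow: /archive/

User-agent: *

Disallow: /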

2. Disallow

The second line (Disallow) indicates the sections and pages on your website that you don’t want the bot to crawl.

You can exclude your entire website by adding a forward slash (/):

Disallow: /

or a specific section of your website by adding the directory after the forward slash. Below are two examples:

Disallow: /junk/ (prevents search engine bots from crawling the junk section of your site)

Disallow: /calendar/ (prevents search engine bots from crawling the calendar section of your site)

You can also be more specific, such as blocking a single web page on your site:

Disallow: /private_file.html

Lastly, you can also block a specific file type from being crawled on your website by adding /*.filetype$, where * is a wildcard that matches any sequence of characters and $ marks the end of the URL. Here are two examples:

Disallow: /*.gif$ (prevents search engine bots from crawling any GIF file on your site)

Disallow: /*.xls$ (prevents search engine bots from crawling any XLS file on your site)
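
Note that Disallow lines only take effect inside a block that starts with a User-agent line, so a complete sketch combining the examples above would look like this:

User-agent: *

Disallow: /junk/

Disallow: /calendar/

Disallow: /*.gif$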

3. Allow

The Allow line indicates the sections and pages on your website that you allow the bot to crawl.

This line is not commonly used in robots.txt, since anything that is not disallowed is automatically allowed to be crawled.

However, it can be useful if you want to re-include a single item from a section that you have excluded.

Here is an example:

User-agent: *

Disallow: /wp-admin/

Allow: /wp-admin/admin-ajax.php

In the above example, we want the search engine to still crawl the admin-ajax.php file while excluding everything else in its parent section, wp-admin.


Implementation

When combined, the User-agent, Disallow, and Allow (optional) lines form one block of rules.

However, keep in mind that each block should contain only one User-agent line, while it can contain multiple Disallow and Allow lines.

You can create as many blocks as you want by simply adding another set of rules after the previous one.

Here is an example:

User-agent: discobot

Disallow: /

User-agent: *

Disallow: /plugin/

Disallow: /dashboard/

In the above example, we block discobot from crawling every page on the website, while allowing other search engine bots to crawl everything except the plugin and dashboard sections.


Crawl Delay

Some webmasters add a crawl delay to their robots.txt, and it looks something like this:

User-agent: Slurp

Disallow: /plugin/

Crawl-delay: 4

So, what does this crawl delay represent?

Some crawlers, such as Bingbot (Bing), Slurp (Yahoo), and YandexBot (Yandex), are at times crawl-hungry and send a large number of requests that can slow down your server.

With the Crawl-delay directive, you can slow them down by telling them how many seconds to wait between requests.

However, some search engines, like Google, do not support Crawl-delay and will simply ignore this line.

So, if you find that a specific search engine isn’t sending you much traffic, you can use Crawl-delay to deprioritize it.


How to Implement Robots.txt on Your Website?

Fortunately, creating the robots.txt file itself is even less complicated than writing the syntax. Here are the steps:

1. Create Robots.txt File

The very first step is to create the robots.txt file using a plain text editor, such as Windows Notepad.

Create a new text file and name it “robots.txt”.

2. Fill the Robots.txt File With The Syntax

Now that you have a blank text file, it is time to fill it with the syntax you learned above.

Here, you need to identify the sections of your site that offer no value for the bot to crawl, so you can save your crawl budget.

As a first-timer, you might have no idea what to block on your site.

So here are some ideas for what you might consider blocking:

  • Tags: if your site has a lot of unique tag pages, consider blocking them from being crawled by search engine bots, since they can eat up your crawl budget.
  • Privacy and Terms of Service pages: these pages exist only for policy purposes, so there is no reason for search engine bots to waste time crawling them.
  • WP-Admin: if you’re using WordPress, excluding the wp-admin section is a good idea, since there is nothing useful there to crawl.
  • Login & account pages: if your website lets users log in, excluding the login and account pages is a good idea, since they provide no value when crawled.

Put together, it will look something like this:

User-agent: *

Disallow: /tag/

Disallow: /privacy.html

Disallow: /wp-admin/

Disallow: /wp-login.php

This is a good starting point if you have no idea what to block on your site. As time goes on, you can add more rules if you feel it’s necessary.

3. Upload the Robots.txt File

Once you have saved the robots.txt file on your computer, you’re ready to upload it to the root directory of your website; exactly how depends on your site and server architecture.

4. Test Your Robots.txt

You can see the robots.txt file of any website simply by adding “/robots.txt” after the domain name.

For example, you can view our robots.txt at www.searchenginementor.com/robots.txt.

You need to be careful when setting up robots.txt.

One small mistake and your entire site could end up blocked from crawling and eventually drop out of the search results.
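
For example, this two-line file, uploaded by accident, tells every bot to stay away from your entire site:

User-agent: *

Disallow: /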

Fortunately, Google provides a robots.txt testing tool to check your robots.txt.

robots.txt Tester by Google

Check whether any errors or warnings are pointed out at the bottom left.