Update or delete? Cleaning up old content on your site

Sometimes, content on your website becomes irrelevant or out of date, and you need to decide whether to update it or delete it. It’s part of your regular content maintenance activities. There are several ways to go about this, and this article helps you decide on the best solution for your old content!

Update old content that is still valid

Let’s start with an example: On our blog, we have an article on meta descriptions that needs constant updating to stay relevant. We have to make sure it keeps up with all the changes Google makes to the way it handles meta descriptions. Sometimes they can be a bit longer; sometimes they go back to the old length again.

Our post helps writers and editors to write meta descriptions, even though the advice changes over time. Although the article itself might be what we call cornerstone content, its content must be updated to keep up with the latest standards – constantly.

You can easily create new, valuable content from your old posts if you update them and make them current again: old wine in new bottles, as the saying goes. You could, for example, merge three old blog posts about the same subject into one new post, or simply replace outdated parts of a post with updated content.

Creating a clone of your post with Yoast Duplicate Post, so you have a draft to work on, makes this process a lot easier. The Rewrite & Republish feature even lets you clone your post, work on it and publish the updated version without too much hassle. It automatically merges your updated post into the original and cleans up the draft for you, so working on your post and publishing the changes becomes an easy, seamless process. Plus, you can schedule these updates if you want to publish your updated content at a later date!

Read more: Keep your content fresh and up to date »

Delete irrelevant posts or pages

It’s likely that you have old posts or pages on your site that you don’t need anymore. Think along the lines of a blog post about a product you stopped selling a while ago and have no intention of ever selling again, an announcement of an event that took place a long time ago or old pages with little or no content – so-called thin content pages.

These are just some examples, but I’m sure you know which posts and/or pages I’m talking about. This old content adds no value anymore, now or for the foreseeable future. In that case, you need to either tell Google to forget about these old posts or pages, or give the URL another purpose.

When I talk about deleting old content, I don’t mean just pressing “delete” and then forgetting about it. If you do that, the content might show up in Google for weeks after deletion. The URL might actually have some link value as well, which would be a shame to waste.

So, what should you do? Here are two options:

“301 Redirect” the old post to a related one

When a URL still holds value because, say, you have a number of quality links pointing to that page, you want to leverage that value by redirecting the URL to a related one. With a 301 redirect, you tell search engines and visitors that there’s a better or newer version of this content elsewhere on your site, and the redirect automatically sends both people and Google to that page.

Say you have an old post on a specific dog breed. You need to delete it, so the logical next step would be to redirect that post to a newer post about this dog breed. If you don’t have that post, choose a post about the closest breed possible. If that post isn’t available either, you could redirect it to the category page for these posts (e.g. “dog breeds”), and if that isn’t an option, redirect to the homepage. That last one might be about “pets”, for example. It’s a bit of a last resort, though; there are probably better options on your site.

Creating a 301 Redirect (for instance in WordPress) isn’t hard, but doing it with Yoast SEO Premium is easy as pie. If you don’t have it yet, find out about all the extras that are in Yoast SEO Premium here.
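
If you’d rather set the redirect up at the server level, that works too. As a minimal sketch, assuming an Apache server and made-up URLs, a 301 redirect in your .htaccess file could look something like this:

# Hypothetical example: permanently send an old post to a newer, related one
Redirect 301 /old-dog-breed-post/ https://www.example.com/dog-breeds/new-post/

Nginx and other servers have their own equivalents, and redirect plugins essentially do the same thing for you behind the scenes.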

Tell search engines the content is intentionally gone

If there isn’t a relevant page on your site you can redirect to, it’s wise to tell Google to forget about your old post entirely by serving a “410 Gone” status. This status code tells Google and visitors that the content didn’t just disappear: you deleted it for a reason.

When Google can’t find a post, the server will usually return a “404 Not Found” status to the search engine’s bot. You’ll also find a 404 crawl error in your Google Search Console for that page. Eventually, Google will work it out and the URL will gradually vanish from the search result pages. But this takes time.

The 410 is more powerful in the sense that it tells Google that the page is gone forever, never to return. You deleted it on purpose, period. Google will act on that faster than with a 404. Read up about the server status codes if this is all gibberish to you.
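
If your site runs on Apache, serving that status can be a one-line rule. The example below is just a sketch with a made-up URL; redirect plugins and other server setups have their own ways of returning a 410:

# Hypothetical example: mark an old announcement as intentionally gone (410)
Redirect gone /old-event-announcement/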

Keep reading: How to properly delete a page from your site »

Do you have old content to deal with?

Cleaning up old content should be part of your content maintenance routine to keep your website fit. If you don’t go through your old posts regularly, you’re bound to run into issues sooner or later. You might show incorrect information to visitors or hurt your own rankings by having too many pages about the same topic, increasing chances of keyword cannibalization, which is a lot of work to fix later on. Therefore, go through your old posts, and decide what to do: update, merge or delete.

If you find it hard to decide what to keep and what to delete and redirect, this step-by-step guide to fix keyword cannibalization might help you out!

Good luck cleaning up your site.

Read on: Should you keep old content? »

The ultimate guide to robots.txt

The robots.txt file is one of the main ways of telling a search engine where it can and can’t go on your website. All major search engines support the basic functionality it offers, but some of them respond to some extra rules which can be useful too. This guide covers all the ways to use robots.txt on your website.

Warning!

Any mistakes you make in your robots.txt can seriously harm your site, so make sure you read and understand the whole of this article before you dive in.

What is a robots.txt file?

Crawl directives

The robots.txt file is one of a number of crawl directives. We have guides on all of them and you’ll find them here.

A robots.txt file is a text file that is read by search engines (and other systems). Also called the “Robots Exclusion Protocol”, the robots.txt file is the result of a consensus among early search engine developers. It’s not an official standard set by any standards organization, although all major search engines adhere to it.

What does the robots.txt file do?

Caching

Search engines typically cache the contents of the robots.txt so that they don’t need to keep downloading it, but will usually refresh it several times a day. That means that changes to instructions are typically reflected fairly quickly.

Search engines discover and index the web by crawling pages. As they crawl, they discover and follow links. This takes them from site A to site B to site C, and so on. But before a search engine visits any page on a domain it hasn’t encountered before, it will open that domain’s robots.txt file. That lets them know which URLs on that site they’re allowed to visit (and which ones they’re not).

Where should I put my robots.txt file?

The robots.txt file should always be at the root of your domain. So if your domain is www.example.com, it should be found at https://www.example.com/robots.txt.

It’s also very important that your robots.txt file is actually called robots.txt. The name is case sensitive, so get that right or it just won’t work.

Pros and cons of using robots.txt

Pro: managing crawl budget

It’s generally understood that a search spider arrives at a website with a pre-determined “allowance” for how many pages it will crawl (or, how much resource/time it’ll spend, based on a site’s authority/size/reputation, and how efficiently the server responds). SEOs call this the crawl budget.

If you think that your website has problems with crawl budget, then blocking search engines from ‘wasting’ energy in unimportant parts of your site might mean that they focus instead on the sections which do matter.

It can sometimes be beneficial to block the search engines from crawling problematic sections of your site, especially on sites where a lot of SEO clean-up has to be done. Once you’ve tidied things up, you can let them back in.

A note on blocking query parameters

One situation where crawl budget is particularly important is when your site uses a lot of query string parameters to filter or sort lists. Let’s say you have 10 different query parameters, each with different values that can be used in any combination (like t-shirts in multiple colors and sizes). This leads to lots of possible valid URLs, all of which might get crawled. Blocking query parameters from being crawled will help make sure the search engine only spiders your site’s main URLs and won’t go into the enormous trap that you’d otherwise create.
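
As a rough sketch of what that can look like, the block below keeps compliant crawlers away from every URL that contains a query string. Treat it as an illustration rather than a drop-in rule; on many sites you’d only want to block specific parameters:

User-agent: *
# Block every URL with a query string (anything after a "?")
Disallow: /*?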

Con: not removing a page from search results

Even though you can use the robots.txt file to tell a spider where it can’t go on your site, you can’t use it to tell a search engine which URLs not to show in the search results – in other words, blocking a URL won’t stop it from being indexed. If the search engine finds enough links to that URL, it will still include it; it just won’t know what’s on that page, so the result will show little more than the bare URL.

If you want to reliably block a page from showing up in the search results, you need to use a meta robots noindex tag. That means that, in order to find the noindex tag, the search engine has to be able to access that page, so don’t block it with robots.txt.
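
For reference, that tag goes in the <head> of the page you want to keep out of the results and looks like this:

<meta name="robots" content="noindex,follow">

The follow value keeps the links on that page crawlable, while the page itself stays out of the index.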

Noindex directives

It used to be possible to add ‘noindex’ directives in your robots.txt, to remove URLs from Google’s search results, and to avoid these ‘fragments’ showing up. This is no longer supported (and technically, never was).

Con: not spreading link value

If a search engine can’t crawl a page, it can’t spread the link value across the links on that page. When a page is blocked with robots.txt, it’s a dead-end. Any link value which might have flowed to (and through) that page is lost.

Robots.txt syntax

WordPress robots.txt

We have an entire article on how best to set up your robots.txt for WordPress. Don’t forget that you can edit your site’s robots.txt file in the Yoast SEO Tools → File editor section.

A robots.txt file consists of one or more blocks of directives, each starting with a user-agent line. The “user-agent” is the name of the specific spider it addresses. You can either have one block for all search engines, using a wildcard for the user-agent, or specific blocks for specific search engines. A search engine spider will always pick the block that best matches its name.

These blocks look like this (don’t be scared, we’ll explain below):

User-agent: * 
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow: /not-for-bing/

Directives like Allow and Disallow aren’t case sensitive, so it’s up to you whether you write them lowercase or capitalize them. The values are case sensitive, however: /photo/ is not the same as /Photo/. We like to capitalize directives because it makes the file easier (for humans) to read.

The user-agent directive

The first bit of every block of directives is the user-agent, which identifies a specific spider. The user-agent field is matched against that specific spider’s (usually longer) user-agent, so for instance the most common spider from Google has the following user-agent:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 

So if you want to tell this spider what to do, a relatively simple User-agent: Googlebot line will do the trick.

Most search engines have multiple spiders. They will use a specific spider for their normal index, for their ad programs, for images, for videos, etc.

Search engines will always choose the most specific block of directives they can find. Say you have 3 sets of directives: one for *, one for Googlebot and one for Googlebot-News. If a bot comes by whose user-agent is Googlebot-Video, it would follow the Googlebot restrictions. A bot with the user-agent Googlebot-News would use the more specific Googlebot-News directives.
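
As a small sketch of that behavior, with made-up paths, the three blocks below give Googlebot-News its own rule while Googlebot-Video falls back to the general Googlebot block:

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /not-for-google/

User-agent: Googlebot-News
Disallow: /not-for-news/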

The most common user agents for search engine spiders

Here’s a list of the user-agents you can use in your robots.txt file to match the most commonly used search engines:

Search engine   Field            User-agent
Baidu           General          baiduspider
Baidu           Images           baiduspider-image
Baidu           Mobile           baiduspider-mobile
Baidu           News             baiduspider-news
Baidu           Video            baiduspider-video
Bing            General          bingbot
Bing            General          msnbot
Bing            Images & Video   msnbot-media
Bing            Ads              adidxbot
Google          General          Googlebot
Google          Images           Googlebot-Image
Google          Mobile           Googlebot-Mobile
Google          News             Googlebot-News
Google          Video            Googlebot-Video
Google          AdSense          Mediapartners-Google
Google          AdWords          AdsBot-Google
Yahoo!          General          slurp
Yandex          General          yandex

The disallow directive

The second line in any block of directives is the Disallow line. You can have one or more of these lines, specifying which parts of the site the specified spider can’t access. An empty Disallow line means you’re not disallowing anything, so basically it means that a spider can access all sections of your site.

The example below would block all search engines that “listen” to robots.txt from crawling your site.

User-agent: * 
Disallow: /

With only one character less, the example below would allow all search engines to crawl your entire site.

User-agent: * 
Disallow:

The example below would block Google from crawling the Photo directory on your site – and everything in it.

User-agent: googlebot 
Disallow: /Photo

This means all the subdirectories of the /Photo directory would also not be spidered. It would not block Google from crawling the /photo directory, as these lines are case sensitive.

This would also block Google from accessing URLs whose path begins with /Photo, such as /Photography/.
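
If that’s broader than you intended, adding a trailing slash narrows the rule to the directory and everything in it, so /Photography/ would no longer match. A minimal sketch:

User-agent: googlebot
Disallow: /Photo/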

How to use wildcards/regular expressions

“Officially”, the robots.txt standard doesn’t support regular expressions or wildcards; however, all major search engines do understand them. This means you can use lines like these to block groups of files:

Disallow: /*.php 
Disallow: /copyrighted-images/*.jpg

In the example above, * is expanded to whatever filename it matches. Note that the rest of the line is still case sensitive, so the second line above will not block a file called /copyrighted-images/example.JPG from being crawled.

Some search engines, like Google, allow for more complicated regular expressions, but be aware that some search engines might not understand this logic. The most useful feature this adds is the $, which indicates the end of a URL. In the following example you can see what this does:

Disallow: /*.php$

This means /index.php can’t be crawled, but /index.php?p=1 could be. Of course, this is only useful in very specific circumstances and also pretty dangerous: it’s easy to unblock things you didn’t actually want to unblock.

Non-standard robots.txt crawl directives

As well as the Disallow and User-agent directives, there are a couple of other crawl directives you can use. These directives are not supported by all search engine crawlers, so make sure you’re aware of their limitations.

The allow directive

While not in the original “specification”, there was talk very early on of an allow directive. Most search engines seem to understand it, and it allows for simple, and very readable directives like this:

Disallow: /wp-admin/ 
Allow: /wp-admin/admin-ajax.php

The only other way of achieving the same result without an allow directive would have been to specifically disallow every single file in the wp-admin folder.

The host directive

Supported by Yandex (and not by Google, despite what some posts say), this directive lets you decide whether you want the search engine to show example.com or www.example.com. Simply specifying it like this does the trick:

host: example.com

But because only Yandex supports the host directive, we wouldn’t advise you to rely on it, especially as it doesn’t allow you to define a scheme (http or https) either. A better solution that works for all search engines would be to 301 redirect the hostnames that you don’t want in the index to the version that you do want. In our case, we redirect www.yoast.com to yoast.com.
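
On Apache, for example, that hostname redirect could be handled with a couple of mod_rewrite lines in .htaccess. This is only a sketch using example.com; adjust it to the hostname and scheme you actually want to keep:

RewriteEngine On
# Permanently (301) redirect every www request to the bare domain
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ https://example.com/$1 [R=301,L]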

The crawl-delay directive

Bing and Yandex can sometimes be fairly crawl-hungry, but luckily they both respond to the crawl-delay directive, which slows them down. And while these search engines have slightly different ways of reading the directive, the end result is basically the same.

A line like the one below would instruct those search engines to change how frequently they’ll request pages on your site.

crawl-delay: 10

Different interpretations

Note that Bing interprets this as an instruction to wait 10 seconds after a crawl action, while Yandex interprets it as a directive to only access your site once in every 10 seconds. It’s a semantic difference, but still interesting to know.

Do take care when using the crawl-delay directive. By setting a crawl delay of 10 seconds, you’re only allowing these search engines to access 8,640 pages a day (86,400 seconds in a day, divided by 10). This might seem plenty for a small site, but on large sites it isn’t very many. On the other hand, if you get next to no traffic from these search engines, it’s a good way to save some bandwidth.

The sitemap directive for XML Sitemaps

Using the sitemap directive you can tell search engines – specifically Bing, Yandex and Google – where to find your XML sitemap. You can, of course, also submit your XML sitemaps to each search engine using their respective webmaster tools solutions, and we strongly recommend you do, because search engine webmaster tools programs will give you lots of valuable information about your site. If you don’t want to do that, adding a sitemap line to your robots.txt is a good quick alternative.

Sitemap: https://www.example.com/my-sitemap.xml

Validate your robots.txt

There are various tools out there that can help you validate your robots.txt, but when it comes to validating crawl directives, we always prefer to go to the source: Google has a robots.txt Tester in Google Search Console (under the ‘Old version’ menu), and we’d highly recommend using that.

Be sure to test your changes thoroughly before you put them live! You wouldn’t be the first to accidentally use robots.txt to block your entire site, and to slip into search engine oblivion!

See the code

In July 2019, Google announced that they were making their robots.txt parser open source. That means that, if you really want to get into the nuts and bolts, you can go and see how their code works (and even use it yourself, or propose modifications to it).

The post The ultimate guide to robots.txt appeared first on Yoast.

The robots.txt file is a file you can use to tell search engines where they can and cannot go on your site. Learn how to use it to your advantage!
The post The ultimate guide to robots.txt appeared first on Yoast.Read MoreCrawl directives, Technical SEOYoast

Categories
Uncategorized

The ultimate guide to robots.txt

The robots.txt file is one of the main ways of telling a search engine where it can and can’t go on your website. All major search engines support the basic functionality it offers, but some of them respond to some extra rules which can be useful too. This guide covers all the ways to use robots.txt on your website.

Warning!

Any mistakes you make in your robots.txt can seriously harm your site, so make sure you read and understand the whole of this article before you dive in.

What is a robots.txt file?

Crawl directives

The robots.txt file is one of a number of crawl directives. We have guides on all of them and you’ll find them here.

A robots.txt file is a text file which is read by search engine (and other systems). Also called the “Robots Exclusion Protocol”, the robots.txt file is the result of a consensus among early search engine developers. It’s not an official standard set by any standards organization; although all major search engines adhere to it.

What does the robots.txt file do?

Caching

Search engines typically cache the contents of the robots.txt so that they don’t need to keep downloading it, but will usually refresh it several times a day. That means that changes to instructions are typically reflected fairly quickly.

Search engines discover and index the web by crawling pages. As they crawl, they discover and follow links. This takes them from site A to site B to site C, and so on. But before a search engine visits any page on a domain it hasn’t encountered before, it will open that domain’s robots.txt file. That lets them know which URLs on that site they’re allowed to visit (and which ones they’re not).

Where should I put my robots.txt file?

The robots.txt file should always be at the root of your domain. So if your domain is www.example.com, it should be found at https://www.example.com/robots.txt.

It’s also very important that your robots.txt file is actually called robots.txt. The name is case sensitive, so get that right or it just won’t work.

Pros and cons of using robots.txt

Pro: managing crawl budget

It’s generally understood that a search spider arrives at a website with a pre-determined “allowance” for how many pages it will crawl (or, how much resource/time it’ll spend, based on a site’s authority/size/reputation, and how efficiently the server responds). SEOs call this the crawl budget.

If you think that your website has problems with crawl budget, then blocking search engines from ‘wasting’ energy in unimportant parts of your site might mean that they focus instead on the sections which do matter.

It can sometimes be beneficial to block the search engines from crawling problematic sections of your site, especially on sites where a lot of SEO clean-up has to be done. Once you’ve tidied things up, you can let them back in.

A note on blocking query parameters

One situation where crawl budget is particularly important is when your site uses a lot of query string parameters to filter or sort lists. Let’s say you have 10 different query parameters, each with different values that can be used in any combination (like t-shirts in multiple colors and sizes). This leads to lots of possible valid URLs, all of which might get crawled. Blocking query parameters from being crawled will help make sure the search engine only spiders your site’s main URLs and won’t go into the enormous trap that you’d otherwise create.

Con: not removing a page from search results

Even though you can use the robots.txt file to tell a spider where it can’t go on your site, you can’t use it tell a search engine which URLs not to show in the search results – in other words, blocking it won’t stop it from being indexed. If the search engine finds enough links to that URL, it will include it, it will just not know what’s on that page. So your result will look like this:

If you want to reliably block a page from showing up in the search results, you need to use a meta robots noindex tag. That means that, in order to find the noindex tag, the search engine has to be able to access that page, so don’t block it with robots.txt.

Noindex directives

It used to be possible to add ‘noindex’ directives in your robots.txt, to remove URLs from Google’s search results, and to avoid these ‘fragments’ showing up. This is no longer supported (and technically, never was).

Con: not spreading link value

If a search engine can’t crawl a page, it can’t spread the link value across the links on that page. When a page is blocked with robots.txt, it’s a dead-end. Any link value which might have flowed to (and through) that page is lost.

Robots.txt syntax

WordPress robots.txt

We have an entire article on how best to setup your robots.txt for WordPress. Don’t forget you can edit your site’s robots.txt file in the Yoast SEO Tools → File editor section.

A robots.txt file consists of one or more blocks of directives, each starting with a user-agent line. The “user-agent” is the name of the specific spider it addresses. You can either have one block for all search engines, using a wildcard for the user-agent, or specific blocks for specific search engines. A search engine spider will always pick the block that best matches its name.

These blocks look like this (don’t be scared, we’ll explain below):

User-agent: * 
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow: /not-for-bing/

Directives like Allow and Disallow should not be case sensitive, so it’s up to you whether you write them lowercase or capitalize them. The values are case sensitive however, /photo/ is not the same as /Photo/. We like to capitalize directives because it makes the file easier (for humans) to read.

The user-agent directive

The first bit of every block of directives is the user-agent, which identifies a specific spider. The user-agent field is matched against that specific spider’s (usually longer) user-agent, so for instance the most common spider from Google has the following user-agent:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 

So if you want to tell this spider what to do, a relatively simple User-agent: Googlebot line will do the trick.

Most search engines have multiple spiders. They will use a specific spider for their normal index, for their ad programs, for images, for videos, etc.

Search engines will always choose the most specific block of directives they can find. Say you have 3 sets of directives: one for *, one for Googlebot and one for Googlebot-News. If a bot comes by whose user-agent is Googlebot-Video, it would follow the Googlebot restrictions. A bot with the user-agent Googlebot-News would use the more specific Googlebot-News directives.

The most common user agents for search engine spiders

Here’s a list of the user-agents you can use in your robots.txt file to match the most commonly used search engines:

Search engine Field User-agent
Baidu General baiduspider
Baidu Images baiduspider-image
Baidu Mobile baiduspider-mobile
Baidu News baiduspider-news
Baidu Video baiduspider-video
Bing General bingbot
Bing General msnbot
Bing Images & Video msnbot-media
Bing Ads adidxbot
Google General Googlebot
Google Images Googlebot-Image
Google Mobile Googlebot-Mobile
Google News Googlebot-News
Google Video Googlebot-Video
Google AdSense Mediapartners-Google
Google AdWords AdsBot-Google
Yahoo! General slurp
Yandex General yandex

The disallow directive

The second line in any block of directives is the Disallow line. You can have one or more of these lines, specifying which parts of the site the specified spider can’t access. An empty Disallow line means you’re not disallowing anything, so basically it means that a spider can access all sections of your site.

The example below would block all search engines that “listen” to robots.txt from crawling your site.

User-agent: * 
Disallow: /

With only one character less, the example below would allow all search engines to crawl your entire site.

User-agent: * 
Disallow:

The example below would block Google from crawling the Photo directory on your site – and everything in it.

User-agent: googlebot 
Disallow: /Photo

This means all the subdirectories of the /Photo directory would also not be spidered. It would not block Google from crawling the /photo directory, as these lines are case sensitive.

This would also block Google from accessing URLs containing /Photo, such as /Photography/.

How to use wildcards/regular expressions

“Officially”, the robots.txt standard doesn’t support regular expressions or wildcards, however, all major search engines do understand it. This means you can use lines like this to block groups of files:

Disallow: /*.php 
Disallow: /copyrighted-images/*.jpg

In the example above, * is expanded to whatever filename it matches. Note that the rest of the line is still case sensitive, so the second line above will not block a file called /copyrighted-images/example.JPG from being crawled.

Some search engines, like Google, allow for more complicated regular expressions, but be aware that some search engines might not understand this logic. The most useful feature this adds is the $, which indicates the end of a URL. In the following example you can see what this does:

Disallow: /*.php$

This means /index.php can’t be indexed, but /index.php?p=1 could be. Of course, this is only useful in very specific circumstances and also pretty dangerous: it’s easy to unblock things you didn’t actually want to unblock.

Non-standard robots.txt crawl directives

As well as the Disallow and User-agent directives there are a couple of other crawl directives you can use. These directives are not supported by all search engine crawlers so make sure you’re aware of their limitations.

The allow directive

While not in the original “specification”, there was talk very early on of an allow directive. Most search engines seem to understand it, and it allows for simple, and very readable directives like this:

Disallow: /wp-admin/ 
Allow: /wp-admin/admin-ajax.php

The only other way of achieving the same result without an allow directive would have been to specifically disallow every single file in the wp-admin folder.

The host directive

Supported by Yandex (and not by Google, despite what some posts say), this directive lets you decide whether you want the search engine to show example.com or www.example.com. Simply specifying it like this does the trick:

host: example.com

But because only Yandex supports the host directive, we wouldn’t advise you to rely on it, especially as it doesn’t allow you to define a scheme (http or https) either. A better solution that works for all search engines would be to 301 redirect the hostnames that you don’t want in the index to the version that you do want. In our case, we redirect www.yoast.com to yoast.com.

The crawl-delay directive

Bing and Yandex can sometimes be fairly crawl-hungry, but luckily they all respond to the crawl-delay directive, which slows them down. And while these search engines have slightly different ways of reading the directive, the end result is basically the same.

A line like the one below would instruct those search engines to change how frequently they’ll request pages on your site.

crawl-delay: 10

Different interpretations

Note that Bing interprets this as an instruction to wait 10 seconds after a crawl action, while Yandex interprets it as a directive to only access your site once in every 10 seconds. It’s a semantic difference, but still interesting to know.

Do take care when using the crawl-delay directive. By setting a crawl delay of 10 seconds you’re only allowing these search engines to access 8,640 pages a day. This might seem plenty for a small site, but on large sites it isn’t very many. On the other hand, if you get next to no traffic from these search engines, it’s a good way to save some bandwidth.

The sitemap directive for XML Sitemaps

Using the sitemap directive you can tell search engines – specifically Bing, Yandex and Google – where to find your XML sitemap. You can, of course, also submit your XML sitemaps to each search engine using their respective webmaster tools solutions, and we strongly recommend you do, because search engine webmaster tools programs will give you lots of valuable information about your site. If you don’t want to do that, adding a sitemap line to your robots.txt is a good quick alternative.

Sitemap: https://www.example.com/my-sitemap.xml

Validate your robots.txt

There are various tools out there that can help you validate your robots.txt, but when it comes to validating crawl directives, we always prefer to go to the source. Google has a robots.txt testing tool in its Google Search Console (under the ‘Old version’ menu) and we’d highly recommend using that:

robots.txt Tester

Be sure to test your changes thoroughly before you put them live! You wouldn’t be the first to accidentally use robots.txt to block your entire site, and to slip into search engine oblivion!

See the code

In July 2019, Google announced that they were making their robots.txt parser open source. That means that, if you really want to get into the nuts and bolts, you can go and see how their code works (and, even use it yourself, or propose modifications to it).

The post The ultimate guide to robots.txt appeared first on Yoast.

The robots.txt file is a file you can use to tell search engines where they can and cannot go on your site. Learn how to use it to your advantage!
The post The ultimate guide to robots.txt appeared first on Yoast.Read MoreCrawl directives, Technical SEOYoast

Categories
Uncategorized

The ultimate guide to robots.txt

The robots.txt file is one of the main ways of telling a search engine where it can and can’t go on your website. All major search engines support the basic functionality it offers, but some of them respond to some extra rules which can be useful too. This guide covers all the ways to use robots.txt on your website.

Warning!

Any mistakes you make in your robots.txt can seriously harm your site, so make sure you read and understand the whole of this article before you dive in.

What is a robots.txt file?

Crawl directives

The robots.txt file is one of a number of crawl directives. We have guides on all of them and you’ll find them here.

A robots.txt file is a text file which is read by search engine (and other systems). Also called the “Robots Exclusion Protocol”, the robots.txt file is the result of a consensus among early search engine developers. It’s not an official standard set by any standards organization; although all major search engines adhere to it.

What does the robots.txt file do?

Caching

Search engines typically cache the contents of the robots.txt so that they don’t need to keep downloading it, but will usually refresh it several times a day. That means that changes to instructions are typically reflected fairly quickly.

Search engines discover and index the web by crawling pages. As they crawl, they discover and follow links. This takes them from site A to site B to site C, and so on. But before a search engine visits any page on a domain it hasn’t encountered before, it will open that domain’s robots.txt file. That lets them know which URLs on that site they’re allowed to visit (and which ones they’re not).

Where should I put my robots.txt file?

The robots.txt file should always be at the root of your domain. So if your domain is www.example.com, it should be found at https://www.example.com/robots.txt.

It’s also very important that your robots.txt file is actually called robots.txt. The name is case sensitive, so get that right or it just won’t work.

Pros and cons of using robots.txt

Pro: managing crawl budget

It’s generally understood that a search spider arrives at a website with a pre-determined “allowance” for how many pages it will crawl (or, how much resource/time it’ll spend, based on a site’s authority/size/reputation, and how efficiently the server responds). SEOs call this the crawl budget.

If you think that your website has problems with crawl budget, then blocking search engines from ‘wasting’ energy in unimportant parts of your site might mean that they focus instead on the sections which do matter.

It can sometimes be beneficial to block the search engines from crawling problematic sections of your site, especially on sites where a lot of SEO clean-up has to be done. Once you’ve tidied things up, you can let them back in.

A note on blocking query parameters

One situation where crawl budget is particularly important is when your site uses a lot of query string parameters to filter or sort lists. Let’s say you have 10 different query parameters, each with different values that can be used in any combination (like t-shirts in multiple colors and sizes). This leads to lots of possible valid URLs, all of which might get crawled. Blocking query parameters from being crawled will help make sure the search engine only spiders your site’s main URLs and won’t go into the enormous trap that you’d otherwise create.

Con: not removing a page from search results

Even though you can use the robots.txt file to tell a spider where it can’t go on your site, you can’t use it to tell a search engine which URLs not to show in the search results – in other words, blocking it won’t stop it from being indexed. If the search engine finds enough links to that URL, it will include it; it just won’t know what’s on that page, so the result shows up as a bare URL without a description.

If you want to reliably block a page from showing up in the search results, you need to use a meta robots noindex tag. That means that, in order to find the noindex tag, the search engine has to be able to access that page, so don’t block it with robots.txt.
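In case you’re wondering what that looks like in practice, it’s a single tag in the HTML head of the page you want to keep out of the results:

<meta name="robots" content="noindex">

If you run WordPress with Yoast SEO, you can also set this per post in the Advanced section of the Yoast SEO meta box, without touching the HTML yourself.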

Noindex directives

It used to be possible to add ‘noindex’ directives to your robots.txt to remove URLs from Google’s search results and prevent these bare result ‘fragments’ from showing up. This is no longer supported (and, technically, never was).

Con: not spreading link value

If a search engine can’t crawl a page, it can’t spread the link value across the links on that page. When a page is blocked with robots.txt, it’s a dead-end. Any link value which might have flowed to (and through) that page is lost.

Robots.txt syntax

WordPress robots.txt

We have an entire article on how best to set up your robots.txt for WordPress. Don’t forget you can edit your site’s robots.txt file in the Yoast SEO Tools → File editor section.

A robots.txt file consists of one or more blocks of directives, each starting with a user-agent line. The “user-agent” is the name of the specific spider it addresses. You can either have one block for all search engines, using a wildcard for the user-agent, or specific blocks for specific search engines. A search engine spider will always pick the block that best matches its name.

These blocks look like this (don’t be scared, we’ll explain below):

User-agent: * 
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow: /not-for-bing/

Directives like Allow and Disallow are not case sensitive, so it’s up to you whether you write them lowercase or capitalize them. The values, however, are case sensitive: /photo/ is not the same as /Photo/. We like to capitalize directives because it makes the file easier (for humans) to read.

The user-agent directive

The first bit of every block of directives is the user-agent, which identifies a specific spider. The user-agent field is matched against that specific spider’s (usually longer) user-agent, so for instance the most common spider from Google has the following user-agent:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 

So if you want to tell this spider what to do, a relatively simple User-agent: Googlebot line will do the trick.

Most search engines have multiple spiders. They will use a specific spider for their normal index, for their ad programs, for images, for videos, etc.

Search engines will always choose the most specific block of directives they can find. Say you have 3 sets of directives: one for *, one for Googlebot and one for Googlebot-News. If a bot comes by whose user-agent is Googlebot-Video, it would follow the Googlebot restrictions. A bot with the user-agent Googlebot-News would use the more specific Googlebot-News directives.
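To make that concrete, here’s a sketch of those three blocks (the paths are just examples). Googlebot-Video would follow the Googlebot block, Googlebot-News the last block, and every other spider the first block.

User-agent: *
Disallow: /search/

User-agent: Googlebot
Disallow: /not-for-google/

User-agent: Googlebot-News
Disallow: /not-for-news/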

The most common user agents for search engine spiders

Here’s a list of the user-agents you can use in your robots.txt file to match the most commonly used search engines:

Search engine | Field          | User-agent
Baidu         | General        | baiduspider
Baidu         | Images         | baiduspider-image
Baidu         | Mobile         | baiduspider-mobile
Baidu         | News           | baiduspider-news
Baidu         | Video          | baiduspider-video
Bing          | General        | bingbot
Bing          | General        | msnbot
Bing          | Images & Video | msnbot-media
Bing          | Ads            | adidxbot
Google        | General        | Googlebot
Google        | Images         | Googlebot-Image
Google        | Mobile         | Googlebot-Mobile
Google        | News           | Googlebot-News
Google        | Video          | Googlebot-Video
Google        | AdSense        | Mediapartners-Google
Google        | AdWords        | AdsBot-Google
Yahoo!        | General        | slurp
Yandex        | General        | yandex

The disallow directive

The second line in any block of directives is the Disallow line. You can have one or more of these lines, specifying which parts of the site the specified spider can’t access. An empty Disallow line means you’re not disallowing anything, so basically it means that a spider can access all sections of your site.

The example below would block all search engines that “listen” to robots.txt from crawling your site.

User-agent: * 
Disallow: /

With only one character less, the example below would allow all search engines to crawl your entire site.

User-agent: * 
Disallow:

The example below would block Google from crawling the Photo directory on your site – and everything in it.

User-agent: googlebot 
Disallow: /Photo

This means all the subdirectories of the /Photo directory would also not be spidered. It would not block Google from crawling the /photo directory, as these lines are case sensitive.

This would also block Google from accessing URLs whose paths start with /Photo, such as /Photography/.

How to use wildcards/regular expressions

“Officially”, the robots.txt standard doesn’t support regular expressions or wildcards; however, all major search engines do understand them. This means you can use lines like this to block groups of files:

Disallow: /*.php 
Disallow: /copyrighted-images/*.jpg

In the example above, * is expanded to whatever filename it matches. Note that the rest of the line is still case sensitive, so the second line above will not block a file called /copyrighted-images/example.JPG from being crawled.

Some search engines, like Google, allow for more complicated regular expressions, but be aware that some search engines might not understand this logic. The most useful feature this adds is the $, which indicates the end of a URL. In the following example you can see what this does:

Disallow: /*.php$

This means /index.php can’t be crawled, but /index.php?p=1 could be. Of course, this is only useful in very specific circumstances and also pretty dangerous: it’s easy to unblock things you didn’t actually want to unblock.

Non-standard robots.txt crawl directives

As well as the Disallow and User-agent directives there are a couple of other crawl directives you can use. These directives are not supported by all search engine crawlers so make sure you’re aware of their limitations.

The allow directive

While not in the original “specification”, there was talk very early on of an allow directive. Most search engines seem to understand it, and it allows for simple, and very readable directives like this:

Disallow: /wp-admin/ 
Allow: /wp-admin/admin-ajax.php

The only other way of achieving the same result without an allow directive would have been to specifically disallow every single file in the wp-admin folder.

The host directive

Supported by Yandex (and not by Google, despite what some posts say), this directive lets you decide whether you want the search engine to show example.com or www.example.com. Simply specifying it like this does the trick:

host: example.com

But because only Yandex supports the host directive, we wouldn’t advise you to rely on it, especially as it doesn’t allow you to define a scheme (http or https) either. A better solution that works for all search engines would be to 301 redirect the hostnames that you don’t want in the index to the version that you do want. In our case, we redirect www.yoast.com to yoast.com.
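How you set up that redirect depends on your server. On Apache with mod_rewrite enabled, a minimal sketch of a www-to-non-www redirect in .htaccess could look like the lines below (example.com is a placeholder; Nginx or your host’s control panel will have an equivalent setting):

RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ https://example.com/$1 [R=301,L]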

The crawl-delay directive

Bing and Yandex can sometimes be fairly crawl-hungry, but luckily they both respond to the crawl-delay directive, which slows them down. And while these search engines have slightly different ways of reading the directive, the end result is basically the same.

A line like the one below would instruct those search engines to change how frequently they’ll request pages on your site.

crawl-delay: 10

Different interpretations

Note that Bing interprets this as an instruction to wait 10 seconds after a crawl action, while Yandex interprets it as a directive to only access your site once in every 10 seconds. It’s a semantic difference, but still interesting to know.

Do take care when using the crawl-delay directive. By setting a crawl delay of 10 seconds you’re only allowing these search engines to access 8,640 pages a day (86,400 seconds in a day, divided by 10). This might seem plenty for a small site, but on large sites it isn’t very many. On the other hand, if you get next to no traffic from these search engines, it’s a good way to save some bandwidth.

The sitemap directive for XML Sitemaps

Using the sitemap directive you can tell search engines – specifically Bing, Yandex and Google – where to find your XML sitemap. You can, of course, also submit your XML sitemaps to each search engine using their respective webmaster tools solutions, and we strongly recommend you do, because search engine webmaster tools programs will give you lots of valuable information about your site. If you don’t want to do that, adding a sitemap line to your robots.txt is a good quick alternative.

Sitemap: https://www.example.com/my-sitemap.xml

Validate your robots.txt

There are various tools out there that can help you validate your robots.txt, but when it comes to validating crawl directives, we always prefer to go to the source. Google has a robots.txt testing tool in its Google Search Console (under the ‘Old version’ menu) and we’d highly recommend using that:

robots.txt Tester

Be sure to test your changes thoroughly before you put them live! You wouldn’t be the first to accidentally use robots.txt to block your entire site, and to slip into search engine oblivion!

See the code

In July 2019, Google announced that they were making their robots.txt parser open source. That means that, if you really want to get into the nuts and bolts, you can go and see how their code works (and even use it yourself, or propose modifications to it).
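If you’d rather check URLs against a robots.txt file programmatically without building Google’s C++ parser, Python’s standard library also ships a basic parser. Here’s a minimal sketch – the URL and user-agent are placeholders, and note that this parser doesn’t implement every wildcard rule the big search engines support:

from urllib import robotparser

# Fetch and parse a live robots.txt file
parser = robotparser.RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()

# Check whether a given user-agent may crawl a given URL
print(parser.can_fetch("Googlebot", "https://www.example.com/Photo/holiday.jpg"))
print(parser.can_fetch("Googlebot", "https://www.example.com/blog/"))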
