Sitemap, Robots and Search Engine Optimisation

Optimise your Sitemap and Robots.txt for improved SEO

Creating a sitemap and robots.txt file is a simple yet important part of optimising your website. As a business, you will want your site to be found in the search results, and creating these two files will help search engines identify which pages you want to be crawled and indexed.

Robots.txt files are normally used to ask search bots to avoid particular parts of your website that you don’t want indexed, such as admin and thank-you pages, while a sitemap gives search engines a blueprint of your site’s layout along with metadata such as page priority and how often pages are updated.

In this optimisation post, we’ll be covering the following:

Why should you create a sitemap?
Creating an XML sitemap
Sitemaps and thin pages
Noindex pages
What is Robots.txt?
How to check your Robots.txt file
How to create a Robots.txt file
How to add your sitemap to virtual Robots.txt (Yoast)

Having a well-optimised sitemap and robots.txt will help you with your rankings and conversions. Ready? Let’s dive in.

Why should you create a sitemap?

We all want our content to be found in the search engines, especially Google. Whilst you can leave this to chance, it’s much better to give them a map and say ‘here is the structure of my website’.

Search engines use what are known as crawlers, AKA bots, to index webpages. When you create an XML sitemap, it’s like giving them a book with an index/contents page: it tells them where things are. Without this page, they have to guess, and as crawlers have a crawl limit, you should take every opportunity to guide them.

However, a well-structured sitemap can do so much more than just letting them know what pages are on your site. Consider that you may update some pages more frequently than others and want that fresh content to be reindexed quickly. Letting search engines know this means you stand a higher chance of your already indexed pages being recrawled at a more frequent rate. For websites with a high number of pages, this can be the difference between having outdated content in the search results and your updated content.
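
To make this concrete, here’s roughly what a single entry in an XML sitemap looks like under the sitemaps.org protocol. The URL and dates below are placeholders, and in practice your plugin will generate these entries for you:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/blog/my-post/</loc>
    <lastmod>2020-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>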

There are other benefits to utilising a well-structured sitemap as well.

  • When you create new webpages, they’re likely to get indexed a lot faster.
  • You can specify the priority of your webpages. Whilst Google now ignores this piece of metadata, others do not.
  • Pages that are deep within the navigation are more likely to be found within the results.
  • You’re helping search crawlers be more efficient and not waste their crawl limit.
  • You can create multiple sitemaps to create a better structure.

Creating your XML sitemap

Whilst considered part of the technical side of search engine optimisation, the process is simple, so don’t worry. If you’re using the largest open-source CMS, WordPress, there are plugins that can take care of this for you. In fact, most content management systems will have a plugin to create dynamic sitemaps, which are more beneficial than static ones as they update themselves when you create, modify or delete webpages.

For WordPress users, you most likely have a plugin installed to take care of some of your other metadata, like titles and descriptions. The two biggest plugins on the market are AIO SEO and Yoast.

All in One SEO

For users of AIO SEO, there’s a free XML sitemap extension you can install which will dynamically create your sitemap. First, you’ll want to click on ‘Feature Manager’ within the left-hand side menu of your dashboard.

From here, simply click ‘Activate’ on the box titled ‘XML Sitemaps’. Now, click on ‘XML Sitemap’ further up the menu.

Here you’ll want to make some changes, the reasons for which will be explained as we go along.

  1. Enable sitemap indexes.

Sitemaps have a limitation in terms of size. They can’t contain more than 50,000 URLs or be larger than 50MB when uncompressed. If they are, you’ll want to create multiple sitemaps with a sitemap index linking to the individual sitemaps. Ideally, you should be doing this anyway as it creates a better structure for search crawlers.
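
For reference, a sitemap index is itself just a small XML file that points to your individual sitemaps. The URLs below are placeholders; your plugin will generate this file for you:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/post-sitemap.xml</loc>
    <lastmod>2020-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/page-sitemap.xml</loc>
  </sitemap>
</sitemapindex>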

  2. Post types

Here you’ll select what you want to appear within your sitemap. From an SEO point of view, you don’t want everything to be indexed, especially thin pages (more on that later). So, let’s start by selecting both ‘Posts’ and ‘Pages’. Now, every website is different, so mileage may vary here. You may have additional types you want to select, and the picture I have uploaded may not contain certain taxonomies that you see within your options. This is because some themes and plugins create additional ones. For example, if you’re a digital products supplier, you may have ‘Downloads’ or ‘Products’ that you also want to select.

Now, you’ll want to scroll down and enable the last 4 options.

  • You don’t want images to be included within your sitemap. They’ll still be indexed, don’t worry.
  • Creating a compressed sitemap means fewer resources are used by your server, which is especially useful for low-cost hosting plans.
  • Link from virtual robots.txt file. As it sounds, this means that your robots.txt file will contain a link to your sitemap/s.
  • Dynamically generate sitemap ensures that when you create, modify or delete pages, your sitemap is automatically updated.

For most users, everything else can be left with the default options selected.

Yoast

For WordPress users using Yoast, the process is similar. First, click on ‘Search Appearance’ which is located underneath ‘SEO’ within the left-hand side menu. Now at the top menu bar, select ‘Content Types’.

Like All in One SEO, we’re going to select what we do and don’t want to appear in our sitemap. When you expand one of your post types, you’ll be presented with a few options. Here you simply select whether you want that type to appear within your sitemap. You can also choose whether to show that content type in the search results, and whether you want the SEO meta box to appear when editing those pages. Make sure the meta box is set to show if you’ve chosen for that post type to appear within your sitemap.

Now select ‘Media’ in the top menu and make sure this is set to ‘Yes’. The reason for this is that pictures uploaded to WordPress get their own attachment URL, and we don’t want these indexed by Google; they’re classed as very thin pages because they contain no content other than the picture itself.

Next up is ‘Taxonomies’ within the top menu. Expand the boxes and select ‘No’ for every option. These are also classed as thin pages (honestly, thin pages and the reasons for excluding them are coming up soon).

The other options will vary dependent on the site in question.

Yoast doesn’t have an option to include the sitemap within the virtual robots.txt file

You may have already noticed that Yoast is lacking an option found in All in One SEO. For some reason, the developers of Yoast have decided that this is not an important or needed feature. Sadly, I disagree, as do many others within the SEO industry.

Best practice dictates that you not only submit your sitemap to the major search engines such as Google, but also link to it from your robots.txt file. Yoast used to have this option, but removed it in a plugin update. I’ll show you a simple way to get around this problem further on.

Excluding thin pages from your sitemap

Above, I mentioned excluding thin pages from your sitemap. There are several reasons for this, some of which are best left for another post, but I’ll quickly explain.

What are thin pages, AKA low-quality pages?

Thin pages are low-quality pages that don’t provide your users with valuable content. They could be pages with little information, such as WordPress attachment pages or thank-you pages. There are exceptions to the rule, such as contact pages: whilst they may be considered thin due to normally only containing a few lines of text, they’re important to your site structure.

One example would be my portfolio pages on this website. They show potential clients who I have worked for, but they don’t contain a lot of content. Therefore, they don’t need to be included within my sitemap.

Search engines evolve, and Google’s algorithm updates come thick and fast. Back in 2011, Google released the Panda update. This was just before the Penguin update. I’m not making these names up; I think they just like animals.

Now the Panda update was focused around quality control. Sites that were trying to manipulate the search results by creating tons of low-quality pages were penalised. So, if you find yourself losing your rankings one day and are not sure why, remember this post.

There are other reasons aside from search engines why you don’t want thin pages appearing within the results, and that’s your users. If a user is browsing and clicks on one of these links, they’re likely to pogo back to Google. Pogo-sticking is when a user clicks on one of your search results, finds that the page doesn’t answer their query, and immediately, without browsing other pages, goes back to the search engine. This behaviour shows up as a higher bounce rate within Google Analytics, and it does no good for you or your website users. An increased bounce rate suggests that users are not having a good browsing experience, and that can ultimately hurt your search result positions.

We, therefore, don’t want these low-quality pages appearing within the search results. Within both All in One SEO and Yoast, you can specify on individual pages that they should be excluded from the sitemap. You’ll need to judge for yourself what you do and don’t want to be included when creating new pages.

Bonus Tip

If you have a ‘thank you’ page, be sure to exclude it from your sitemap. Thank-you pages are normally used to track conversions, and having one appear within the search results is going to skew the statistics you are seeing.

Noindex pages that are excluded from your sitemap

A sitemap tells search engines what you want them to index. It doesn’t, however, stop them from indexing pages that are not included. So, if you don’t want a page to appear within the search results, you’ll want to set them to ‘noindex’.

Consistency is key here. Having a page set as noindex and then appearing within your sitemap is essentially telling search engines ‘here’s a page I want to appear within the search results, please crawl it’, followed by ‘hey, you’re not allowed to index this page’.

You can do this by inserting the following code within the <head></head> section of your webpage.

<meta name="robots" content="noindex">

For WordPress users, you can set this option on the page itself if using All in One SEO or Yoast.

Ok great, you now have a well-structured sitemap and are ready to submit it to the major search engines such as Google.

Robots.txt

Whilst sitemaps can have variations in their path, robots.txt is always found at yourdomain.com/robots.txt

This simple text file tells robots where they can and, more importantly, can’t crawl. Search engine crawlers will normally look at the robots.txt file first for instructions on how to proceed, which is why it always lives at the root of your domain.

Search crawlers have what’s known as a crawl budget per domain per visit. Letting them know where not to go means resources are not wasted and, for larger sites, more content gets indexed. Whilst this applies more to sites with more than a few thousand URLs, it’s an important practice to follow when it comes to search engine optimisation and a simple one to carry out.

Helping search engine crawlers spend their budget on your important pages rather than your admin pages can mean the difference between quick and slow indexation.

As Google puts it:

“You don’t want your server to be overwhelmed by Google’s crawler or to waste crawl budget crawling unimportant or similar pages on your site.”

Whilst a simple task to carry out, it’s surprising to see that it’s one that’s often overlooked. Even big names get it wrong sometimes. Look at www.disney.com/robots.txt and you’ll see it’s a blank file. Stranger still, they have one on the UK variant.

Check your Robots.txt file

Simply navigate to your website address and add /robots.txt on the end. You should see one of three things. A valid robots.txt file, a blank one like Disney, or a 404 page. If it’s the latter two, you’ll want to fix this. Thankfully this is easy to do.

If you have FTP access, you can log in and look in your root folder for robots.txt – if the file exists, it must be in this location. If you’re a WordPress user and don’t have one, you will still have a virtual robots.txt file, which you can confirm by following the instructions above on how to check your robots.txt file.

Creating a Robots.txt file

In either case, creating a new robots.txt file is simple to do. Open Notepad and save the file as robots.txt. Once you’ve finished creating it, simply upload it to the root folder of your server.

Within this notepad file, we’re going to enter some instructions for search crawlers to follow.

The first thing we’re going to do is add these lines.

User-agent: *

Disallow:

The above, whilst basic, is a valid robots.txt file. As it stands, all search bots are allowed to crawl your site in its entirety.
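
Note the empty Disallow line; it means nothing is blocked. The opposite extreme, asking all crawlers to stay away from the entire site, would look like this (don’t use it on a live site unless that’s genuinely what you want):

User-agent: *
Disallow: /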

Next, we’re going to add our sitemap. Sitemap addresses are not always the same, so you’ll need to locate yours first. Once you have done so, simply add it to the lines above and you should have something like this.

User-agent: *

Disallow:

Sitemap: https://www.omrishalom.com/sitemap_index.xml

If you were to look at my robots.txt file, you’ll notice it’s slightly different to the above.

User-agent: *

Disallow: /wp-admin/

Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.omrishalom.com/sitemap_index.xml

The reason for this is that I have blocked my admin pages. I don’t need crawlers wasting their budget trying to crawl pages that are not going to appear within the results.

If you’re a WordPress user, you can use the exact lines above, making sure to replace the last line with your own sitemap address or addresses. Be sure to use an absolute URL. You can also specify that particular crawlers are not allowed to look through your website at all. There’s a big list of crawlers, so that’s one for another post, but a quick illustration follows below.
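
As a purely illustrative sketch (ExampleBot and /thank-you/ are placeholders, not real rules you need), here’s how you might block one specific crawler entirely while giving everyone else the standard rules:

# Block one particular crawler completely
User-agent: ExampleBot
Disallow: /

# Everyone else: keep out of admin and the thank-you page
User-agent: *
Disallow: /wp-admin/
Disallow: /thank-you/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.example.com/sitemap_index.xml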

Adding your sitemap to Robots.txt in Yoast

For some reason, the developers of Yoast have decided it’s no longer important to offer this feature. Some will argue that a sitemap within your robots.txt file is not strictly needed, but there’s certainly no harm in including it, and some SEO consultants, including myself, would recommend it.

If you’re not comfortable editing or creating a robots.txt file, or you don’t have FTP access, which can be the case with some shared hosting plans, there is a simple solution.

Install and activate the ‘Code Snippets’ plugin from the WordPress plugin repository. After you have done so, select ‘Snippets’ from the left-hand menu. Now click on ‘Add New’ and give your snippet a name; it doesn’t matter what it is as it’s for your reference only.

Lastly, paste the following code into the ‘code’ box and select ‘save changes’.

/**
 * Add sitemap to virtual robots.txt file in WordPress
 * https://www.omrishalom.com/sitemaps-robots-txt-and-search-engine-optimisation/
 */
function osrobot( $output, $public ) {
    // Build an absolute sitemap URL from the site's home URL.
    $homeURL = get_home_url();
    $output .= "Sitemap: $homeURL/sitemap_index.xml\n";
    // Return the amended robots.txt content to WordPress.
    return $output;
}
add_filter( 'robots_txt', 'osrobot', 10, 2 );

Now if you browse to yourdomain.com/robots.txt you’ll see that your sitemap has been added in.
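
For example, assuming your site lives at https://www.example.com and WordPress is serving its default virtual robots.txt, the output should look something along the lines of:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.example.com/sitemap_index.xml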

Hopefully, this article has cleared up how to optimise your sitemap and robots file for SEO purposes and been of use. If it has, please click the share button below. Thank you.