Crawler

The robots.txt file for a WordPress Site and Other Considerations

A fresh WordPress (WP) install comes with an autogenerated robots.txt file, which is rather incomplete.

From experience, I find the following to be the most appropriate robots.txt file (at a minimum) for a WP site installed in the root folder:
User-agent: *
Disallow: /feed
Disallow: /*/feed
Disallow: /xmlrpc
Disallow: /wp-
Disallow: /?p=
Disallow: /*trackback
Allow: /wp-admin/admin-ajax.php
Allow: /wp-content/uploads/
Allow: /wp-content/plugins/
Allow: /wp-*/*.css
Allow: /wp-*/*.js
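
To sanity-check how these rules interact, here is a minimal Python sketch, not a definitive implementation. It assumes the precedence Google documents for its own crawlers: the longest matching pattern wins, Allow wins a tie, `*` is a wildcard and a trailing `$` anchors the end of the URL. The helper names (`parse_rules`, `pattern_to_regex`, `is_allowed`) and the sample paths are hypothetical, and other crawlers may resolve conflicts differently.

```python
import re

# The root-folder rule set from above (the User-agent line is omitted on purpose).
RULES = """\
Disallow: /feed
Disallow: /*/feed
Disallow: /xmlrpc
Disallow: /wp-
Disallow: /?p=
Disallow: /*trackback
Allow: /wp-admin/admin-ajax.php
Allow: /wp-content/uploads/
Allow: /wp-content/plugins/
Allow: /wp-*/*.css
Allow: /wp-*/*.js
"""


def parse_rules(text):
    """Return (is_allow, pattern) tuples for every Allow/Disallow line."""
    rules = []
    for line in text.splitlines():
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field in ("allow", "disallow") and value:
            rules.append((field == "allow", value))
    return rules


def pattern_to_regex(pattern):
    """Translate a robots.txt pattern ('*' wildcard, trailing '$') to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path, rules):
    """Assumed precedence: longest pattern wins, Allow wins a tie."""
    best = None  # (pattern length, is_allow)
    for allow, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), allow)
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for path in ["/wp-login.php", "/wp-admin/admin-ajax.php",
                 "/wp-content/themes/x/style.css", "/my-post/feed/",
                 "/?p=123", "/my-post/", "/feedback-form/"]:
        print("allowed" if is_allowed(path, rules) else "blocked", path)
```

One thing such a check makes visible: because robots.txt patterns are prefix matches, `Disallow: /feed` also blocks a hypothetical URL such as /feedback-form/. If your site has URLs like that, an extra Allow line for them may be needed.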
 
 
If the WP installation is in a subfolder (for example /blog/), then the robots.txt for the entire site has to include WP-specific directives in addition to those for the rest of the site.
 
Under the general user agent line (User-agent: *), add the following directives alongside those that apply to the rest of the site:
 
Disallow: /blog/feed
Disallow: /blog/*/feed
Disallow: /blog/xmlrpc
Disallow: /blog/wp-
Disallow: /blog/?p=
Disallow: /blog/*trackback
Allow: /blog/wp-admin/admin-ajax.php
Allow: /blog/wp-content/uploads/
Allow: /blog/wp-content/plugins/
Allow: /blog/wp-*/*.css
Allow: /blog/wp-*/*.js
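
Since the subfolder rules are just the root rules with the folder prepended, they can be generated rather than typed by hand. The following Python sketch shows one way to do it; the function name `prefix_rules` is a hypothetical helper, and /blog/ is only the example folder used above.

```python
def prefix_rules(rules_text, folder):
    """Prepend a subfolder to the path of every Allow/Disallow line.

    Lines that are not Allow/Disallow directives (e.g. User-agent) are
    returned unchanged.
    """
    folder = folder.strip("/")
    if not folder:
        return rules_text  # root install: nothing to prepend
    prefix = "/" + folder
    out = []
    for line in rules_text.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() in ("allow", "disallow") and value.strip():
            out.append(f"{field.strip()}: {prefix}{value.strip()}")
        else:
            out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    root_rules = "Disallow: /feed\nDisallow: /wp-\nAllow: /wp-*/*.css"
    print(prefix_rules(root_rules, "/blog/"))
```

The output still has to be merged by hand into the site-wide robots.txt, under the same User-agent line as the directives for the rest of the site.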
 
NB: URLs in other folders (e.g. images stored outside the WP folders) may need to be allowed as well; it depends on your installation.
What else needs to be allowed depends on the theme and plugins used. The best way to find out is to run a few pages through the Mobile Friendly tester ( https://search.google.com/test/mobile-friendly ), see which resources are listed as blocked by robots.txt, and add Allow directives for them, written as patterns wherever possible.
 
In addition, tag, category, archive and author pages require a robots “noindex” meta tag, because at best they are just lists of links to posts, and at worst they are duplicate content when all or part of each post is listed as well.
 
This is easily achieved with the All in One SEO Pack plugin (or similar, e.g. Yoast), configured to add the robots “noindex” meta tag to those types of URLs.
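
Once the plugin is configured, it is worth confirming that the tag actually appears in the page source. The following is a rough Python sketch for that check, using only the standard library; the class and function names (`RobotsMetaParser`, `has_noindex`) are hypothetical, and it only looks at `<meta name="robots">` tags, not at an X-Robots-Tag HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() == "robots":
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def has_noindex(html):
    """Return True if the HTML carries a robots meta tag with 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    sample = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(has_noindex(sample))
```

Feed it the saved HTML of a tag, category, archive or author page. Keep in mind that a crawler can only see the meta tag if robots.txt does not block that URL.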
 
Another set of pages that need to be noindexed are attachment pages. The All in One SEO Pack plugin already has an option for this.
Yoast, on the other hand, offers the option to 301 redirect the attachment page to the actual image itself.
Other than manually modifying the script that creates the attachment page (and keeping it up to date with every new version of WP or of your theme), the only stand-alone plugin I’ve found that noindexes attachment pages is https://wordpress.org/plugins/noindex-attachment-pages/ , but I don’t know whether it is still compatible with the latest version of WordPress, or whether it will stay that way. The best policy is still not to create attachment pages in the first place: be careful how you insert your images and watch what you link them to.
 
Configure the sitemap generator to exclude those types of URLs as well.
As a matter of fact, the Yoast plugin has (or used to have) a setting that responds with a 410 (Gone) for attachment pages and additionally creates an XML sitemap of those URLs, which is submitted to Google for crawling so that the pages eventually drop out of the index if they were ever indexed. Personally, I’d prefer a robots noindex meta tag.
 
 
Note: In rare cases, WP site webmasters actually provide good stand-alone content on each category page. In that case, don’t add a robots noindex meta tag to category pages.
 
Other things to watch out for when using WordPress:
  1. Don’t forget to change the Privacy setting (in Settings) to allow robots to crawl the site. Do this only once you have decided the site is in good enough shape to be published, or rather to be indexed by robots.
  2. Ensure the WP site is on the same canonical domain as the rest of the site (all www or all non-www). For a WP site this is managed from Settings.
  3. Ensure correct server response for a non-existent url – that’s a 404, with or without a custom error page.
  4. Decide on your permalink structure from the start, before you allow the site to be indexed. It’s much harder to change later, and doing so requires 301 redirects from the old URLs to the new ones.
  5. Either disable comments or moderate them seriously.
  6. Keep software updated to the latest version.
  7. Keep the number of both tags and categories small; don’t pepper a page or post with umpteen tags and category labels.
  8. Don’t use tag or category clouds – they do look spammy.
  9. Disable feed generation if you can: feeds are the main vehicle for getting your site content scraped by others.
  10. Modify the theme to get rid of the login url.
  11. Watch out for footer or other site-wide links in any theme, especially one obtained from a source other than WordPress. Ensure they all use rel="nofollow".
  12. Avoid having a blogroll or, if you do have one, use a plugin to add rel="nofollow" to the links (at least on all pages except, at most, the homepage). Of course, exercise common sense about which links you keep in a blogroll: they should not be part of a link exchange or paid links, and they should all be relevant to your site.
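
For the last two points, a quick way to audit a theme is to list the external links in a rendered page that lack rel="nofollow". The following Python sketch does this with the standard library only; the names (`ExternalLinkAudit`, `external_links_without_nofollow`) and the example hosts are hypothetical, and it treats any link to a different hostname as external, which may be too strict if your site spans several hostnames.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ExternalLinkAudit(HTMLParser):
    """Record hrefs of links to other hosts that lack rel="nofollow"."""

    def __init__(self, site_host):
        super().__init__()
        self.site_host = site_host.lower()
        self.missing_nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = {name: (value or "") for name, value in attrs}
        href = attr.get("href", "")
        try:
            host = (urlparse(href).hostname or "").lower()
        except ValueError:
            return  # malformed URL: skip rather than guess
        if not host or host == self.site_host:
            return  # relative, mailto: or same-host link
        if "nofollow" not in attr.get("rel", "").lower().split():
            self.missing_nofollow.append(href)


def external_links_without_nofollow(html, site_host):
    """Return the external hrefs in the HTML that have no rel="nofollow"."""
    audit = ExternalLinkAudit(site_host)
    audit.feed(html)
    return audit.missing_nofollow


if __name__ == "__main__":
    page = ('<a href="/about">About</a>'
            '<a href="https://theme-vendor.example/">Theme by X</a>'
            '<a rel="nofollow" href="https://other.example/">Friend</a>')
    for href in external_links_without_nofollow(page, "www.mysite.example"):
        print("missing nofollow:", href)
```

Run it against the saved HTML of the homepage and of one inner page, since footer and blogroll links are often output differently on each.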