Crawler

Website Check-list

So … What’s wrong with my site?

Check-list to go by:

  1. Are you verifying site ownership using the Google verification meta tag or the Google HTML verification file?
    • If using the file, it has to remain on the site because verification is repeated at random intervals.
    • If using the meta tag, does the homepage have a doctype (DTD)? You really should use one: http://alistapart.com/stories/doctype/ .
      • No doctype or a non-XHTML doctype: make sure the verification meta tag is closed with > and NOT with />. Otherwise the page is TOTALLY broken, not just a bit invalid. This actually applies to all meta tags and link tags, with HTML5 as the one exception.
      • An XHTML doctype: make sure the Google verification meta tag has all its attributes in lower case and is closed with /> . If you are using HTML5, don’t worry about how you close meta tags; it doesn’t matter 🙂
  2. Get the server response headers for the homepage: the response code should be 200, and for a missing page it should be 404. Avoid redirecting right away from the root to a lower directory. Sometimes it cannot be helped (some CMSs insist on getting installed in a folder), but it’s not a good start. You can take advantage of this by building a proper homepage at the root level as an intro to the site (with useful content, mind you), from which visitors can move on to the lower level where the CMS is located. Be creative and use all the real estate you get.
  3. Run Xenu Link Sleuth starting at the homepage and check the broken links and the number of true links found. If it never completes the crawl, that in itself points to a problem. Compare the list of good links found with your sitemap.
    • If Xenu finds any redirections, consider changing the website navigation to use the new page URLs instead of redirecting to them. These kinds of internal redirections slow down the crawl.
    • If you are using a CMS, pay particular attention to the pages found by Xenu. If you are using SEF (search-engine-friendly) URLs and find that not all the URLs reported follow the SEF structure, there is a potentially nasty problem if the links are duplicates. One or both will end up supplemental.
  4. Use the W3C validator to validate your homepage and other pages. A proper doctype and charset help. Broken code, especially at the block level, will prevent bots from crawling. You cannot know how broken is broken until you fix it. Full site validation: http://htmlhelp.com/tools/validator/ . Unfortunately this validator is rather obsolete and does not handle HTML5 properly.
  5. Pages not reachable through a normal crawl (i.e. one that does not rely on JavaScript or Flash navigation) will not be crawled or indexed even if present in the sitemap; they will remain orphans. Orphan pages quickly turn supplemental or, rather, fall to the bottom of the heap, since pages are no longer marked supplemental (as of August 2007).
  6. Check the situation in Google.com using the search terms site:example.com and site:www.example.com, and investigate supplemental pages. When you see an indication of similar pages, that is usually because they have the same or a very similar title or description tag as others already listed, and that should be fixed. It’s not a big problem unless those pages are also supplemental, which may point to broken navigation elsewhere.
    • Page titles, headers on the text, proper use and distribution of keywords and keyphrases, anchor text, alt text for images, increased use of css – all part of efforts of internal optimization of pages.
  7. Keep the robots.txt file up to date. If you disallow a URL in robots.txt, don’t bother adding a robots meta tag to it as well; it will not be seen. To prevent indexing of a URL, leave it crawlable (do NOT disallow it in robots.txt) and add a robots “NOINDEX” meta tag instead. Furthermore, any internal links to those pages should also use rel="nofollow".
  8. Easily fix canonical issues (www vs non-www) for sites on an Apache server:
    http://faqhowto.info/only-validation/301-redirect-https-to-http-or-viceversa-on-apache-server/
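
Item 7 of the check-list can be sketched in a few lines of Python. This is only a rough illustration using the standard library: the function names, the default user agent string and the example.com URLs are assumptions made for the example, not part of any official tool. It flags the case where a page carries a NOINDEX meta tag that a bot can never see because robots.txt disallows the URL.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects the content value of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.directives.append((attrs.get("content") or "").lower())


def has_noindex(page_html):
    """True if the page has a robots meta tag containing NOINDEX."""
    finder = RobotsMetaFinder()
    finder.feed(page_html)
    return any("noindex" in directive for directive in finder.directives)


def noindex_is_hidden(robots_txt, url, page_html, user_agent="Googlebot"):
    """True when the page says NOINDEX but robots.txt keeps the bot from fetching it."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return has_noindex(page_html) and not rules.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical robots.txt and page, for illustration only.
    robots = "User-agent: *\nDisallow: /private/\n"
    page = '<html><head><meta name="robots" content="NOINDEX"></head><body></body></html>'
    print(noindex_is_hidden(robots, "http://www.example.com/private/page.html", page))
```

If the function returns True for a URL, the fix suggested by the check-list is to remove the Disallow line for it and let the NOINDEX tag do the work.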

A bit more detail is also available in my 16-point website tune-up plan. The number 16 may change at any time 😉
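
The comparison between crawled links and the sitemap (items 3 and 5 above) is easy to script. The sketch below assumes you already have the list of good URLs from your crawler in some form and the sitemap as standard sitemaps.org XML; the function names and the example.com URLs are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Return the set of <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def compare_crawl_to_sitemap(crawled_urls, sitemap_xml):
    """Split the differences into likely orphans and pages missing from the sitemap."""
    listed = sitemap_urls(sitemap_xml)
    crawled = set(crawled_urls)
    return {
        # In the sitemap but never reached by following links.
        "orphans": sorted(listed - crawled),
        # Reached by the crawl but not listed (possibly non-SEF duplicates).
        "unlisted": sorted(crawled - listed),
    }


if __name__ == "__main__":
    sitemap = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>http://www.example.com/</loc></url>"
        "<url><loc>http://www.example.com/orphan.html</loc></url>"
        "</urlset>"
    )
    crawled = ["http://www.example.com/", "http://www.example.com/index.php?id=7"]
    print(compare_crawl_to_sitemap(crawled, sitemap))
```

Anything under "orphans" needs a real link from the navigation; anything under "unlisted" is worth a look in case it is a duplicate of a page that is already listed under its proper URL.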

Why does a broken meta tag break the head and the page?

The rules for closing tags are really very simple, if you stop to think for a bit.

There are 2 kinds of tags:

  1. Those that work in pairs, like <b>…</b>, <p>…</p>, <div …>…</div>, etc., and which act upon the content between the opening and closing tag. These are the SAME in HTML and XHTML. The extra requirement in XHTML is that if you opened it, you MUST close it eventually, whereas old versions of HTML let you leave some of them unclosed, such as <p> or <li>.
  2. Standalone tags that do not come in matched pairs of opening and closing tags, like <meta …>, <link …>, <br> and <hr>. ONLY in XHTML, they all have to be self-closed with />, so you get: <meta … />, <link … />, <br />, <hr />.
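
As a quick way to apply the rule for standalone tags, here is a rough Python sketch. It is not a validator: it uses simple regular expressions, only looks at meta, link, br and hr, and the function names and the placeholder verification value are assumptions made for the example. It reports tags self-closed with /> under an old HTML doctype (or no doctype), tags not self-closed under an XHTML doctype, and nothing at all for HTML5.

```python
import re

# Standalone tags covered by this sketch; group 2 captures an optional "/" before ">".
VOID_TAG = re.compile(r"<(meta|link|br|hr)\b[^>]*?(/?)>", re.IGNORECASE)
DOCTYPE = re.compile(r"<!DOCTYPE[^>]*>", re.IGNORECASE)
HTML5_DOCTYPE = re.compile(r"<!DOCTYPE\s+html\s*>", re.IGNORECASE)


def doctype_kind(html):
    """Classify the page as 'xhtml', 'html5' or plain 'html' (also used when there is no doctype)."""
    match = DOCTYPE.search(html)
    if match is None:
        return "html"
    declaration = match.group(0)
    if "xhtml" in declaration.lower():
        return "xhtml"
    if HTML5_DOCTYPE.fullmatch(declaration):
        return "html5"
    return "html"


def misclosed_tags(html):
    """Return the standalone tags whose closing style does not fit the doctype."""
    kind = doctype_kind(html)
    if kind == "html5":
        return []  # either closing style is accepted
    problems = []
    for match in VOID_TAG.finditer(html):
        self_closed = match.group(2) == "/"
        if (kind == "xhtml") != self_closed:
            problems.append(match.group(0))
    return problems


if __name__ == "__main__":
    page = (
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        '"http://www.w3.org/TR/html4/strict.dtd">\n'
        '<html><head><meta name="google-site-verification" content="PLACEHOLDER" />'
        "<title>t</title></head><body><br></body></html>"
    )
    print(misclosed_tags(page))
```

A real validator does far more than this, but the sketch is enough to catch a verification meta tag pasted in with the wrong closing style.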

In HTML, when a parser encounters a tag closed with />, the syntax does not call for self-closing of tags, so it assumes that this is a closing tag for whatever was opened before and still needed closing. The emphasis is on needed. You go back and see which tag had been opened and needed closing. If nothing else is broken before that point, then, in the head section, the last such tag is <head>, which would need closing eventually.

Well, the first instance of /> found (as in an incorrect meta tag or link tag) tells the parser: ok, the <head> has now been closed. What follows is not part of the head.

But when it gets to the </head> tag proper, it will determine that this in fact closes whatever else had been opened before and thus far left unclosed, and that is the <html> tag. So when it gets to </head>, the HTML parser has closed the entire page, and everything after that is no longer considered part of the page.

Why Bother Validating? Amazon and Google Aren’t Valid Anyway!

Well, how invalid is invalid? If the pages are not broken at the block level, that’s ok, at least from the point of view of search engine indexing. It’s not optimal and it’s not commendable, but those are big players. The rest of us are very small fry.
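
To give a feel for what "broken at the block level" means, here is a crude Python sketch that checks whether a handful of block-level tags are opened and closed in matching order. It is no substitute for the W3C validator: the set of tags checked and all the names are choices made for this illustration.

```python
from html.parser import HTMLParser

# A small, deliberately incomplete set of block-level tags that must be closed.
BLOCK_TAGS = {"div", "table", "ul", "ol", "form", "blockquote"}


class BlockBalanceChecker(HTMLParser):
    """Tracks open block-level tags on a stack and records mismatched closes."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in BLOCK_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.errors.append(f"unexpected </{tag}>")


def block_level_errors(html):
    """Return a list of block-level nesting problems found in the markup."""
    checker = BlockBalanceChecker()
    checker.feed(html)
    checker.close()
    # Whatever is still on the stack was never closed properly.
    return checker.errors + [f"unclosed <{tag}>" for tag in checker.stack]


if __name__ == "__main__":
    print(block_level_errors("<div><ul><li>a</li></ul></div>"))
    print(block_level_errors("<div><ul><li>a</li></div>"))
```

A page that passes this check can still be invalid in many smaller ways, which is the kind of "invalid" the big players get away with.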

Only by validating will we know that there are problems. Oh, sure, if you’re a real whiz at HTML and JavaScript code, maybe you produce perfect code. Uh-huh… and maybe I am the queen of Googledom. If you’re using any kind of WYSIWYG editor, whether on your PC or online (like the one I’m using here right now), there’s about a 99.9% chance that the code is invalid and at least somewhat broken.

Validation helps a page render consistently across browsers and means robots will not have trouble parsing it, which results in better and deeper indexing. Robots cannot index what they cannot read. And they cannot read a page if they cannot parse it properly.

What’s parsing? I couldn’t explain it better myself: http://www.answers.com/topic/parsing . Let’s just say proper parsing is crucial to indexing.

To recap: if the W3C validator gives your pages a clean bill of health, that eliminates one major obstacle to indexing.

What all this amounts to is improved crawlability.

Just remember: CRAWLABILITY is the number 1 technical requirement for a site to even start to be indexed. Watch Matt Cutts’s video, where he mentions CRAWLABILITY at around 1 min and again at around 3 min into the video.

Finally, to explain it all better, see this site: http://www.feedthebot.com/ .

Is this SEO?

No. This list is a basic set of prerequisites that lay the groundwork for SEO. This is structural debugging. You can’t hope to achieve much SEO-wise if the basics are not covered.