Cleaning up the website using Google Webmaster Tools
I recently found some time to sit down and go through Google Webmaster Tools with more than the usual passing glance, and I thought it might be interesting to share some of my experiences working through it in relation to my technology blog at www.jasonslater.co.uk.
The first place to start in Google Webmaster Tools is to ensure my site can be accessed by the Googlebot. This can be checked under the Dashboard, and fortunately it was accessing my site just fine.
The next thing the Dashboard revealed was Webmaster Tools reporting 1,217 URLs in its database for my blog, when I actually have 1,220 posts plus 11 pages currently live on the site. So where are the missing URLs?
Note: If we factored in categories and tags this figure would be higher; however, I tell the robots.txt file to exclude certain URL paths including categories, tags, archives, and so on (I will get back to this later). I would expect a minimum of 1,234 URLs to be displayed (homepage + 1,220 posts + 11 pages + Wiki homepage + Forum homepage).
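As a rough illustration, a robots.txt set up along those lines might look something like the sketch below. The entries shown are typical WordPress examples rather than a copy of my actual file, so treat the specific paths as assumptions:

User-agent: *
# Keep duplicate archive views out of the index
Disallow: /category/
Disallow: /tag/
Disallow: /author/
Disallow: /wp-admin/
# Date-based archives, one line per year (illustrative)
Disallow: /2009/

# Point crawlers at the sitemap
Sitemap: http://www.jasonslater.co.uk/sitemap.xml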
The other thing to notice from the Dashboard screen is that 40,016 URLs are being shown as unreachable. Hang on a minute: there are only 1,234 URLs, aren’t there?
The Dashboard shows a few other things including “Top Search Queries” and “Links to your site”. It would be handy to have a few extra stats on this page, but there is enough here to keep me busy for a while.
Clearly, I need to work through some of the tools available in Google Webmaster Tools.
The first place to head is the Sitemaps section to ensure all is in order. Sitemaps reports stats of Total URLs: 1,234 and Indexed URLs: 1,217. The sitemap was last updated only a day ago, so we can establish that, barring a few recent articles, the sitemap.xml file Google is reading is up to date.
It would be useful on this page to be able to drill down into the number of indexed URLs to identify which ones were not being included; for now, however, I will have to find another way.
Crawler access shows the location of the robots.txt file, which Google describes as follows: “If your site has content you don’t want to appear in search results, use a robots.txt file to specify how search engines should crawl your site’s content.” My robots.txt file is reporting a 200 (Success) message, which means it is being read just fine. A copy of the parse results is included in this section; it detects the location of the sitemap and shows the contents of my robots.txt file so I can double-check its entries. I have the usual Disallow entries you might expect to find on a WordPress-based blog, but there is nothing that should affect any of the 1,234 URLs submitted in the sitemap, so clearly the problem lies elsewhere.
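If you want to double-check the same thing outside Webmaster Tools, a few lines of Python along these lines (just a sketch, assuming the file lives at the usual /robots.txt location) will confirm the status code and show exactly what Google is reading:

import urllib.request

# Fetch robots.txt and report the HTTP status code plus its contents.
# Note: a non-200 response makes urlopen raise an HTTPError rather than return.
url = "http://www.jasonslater.co.uk/robots.txt"
with urllib.request.urlopen(url) as response:
    print(response.status)                    # expecting 200 (Success)
    print(response.read().decode("utf-8"))    # the rules Google parses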
A useful feature listed under Crawler access is “Remove URL”, which allows you to submit a reference to a URL, a whole directory, or even the entire site for removal, although I have had somewhat limited success with it so far. Google tells us there are three routes to getting a link removed (see “How can I ensure my content is eligible for removal from the Google index?”):
- Ensure the content is no longer live; requests for it must return an HTTP 404 or 410 status code
- Block content using the robots.txt file
- Block content using the meta noindex tag (shown below)
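The third route is just a tag in the page’s head section; a minimal example looks like this:

<meta name="robots" content="noindex">

Any page carrying that tag should be dropped from the index once it has been re-crawled.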
In practice, even though many of the pages submitted return a 404, requests for their removal still result in a stern “Denied” warning.
Even though I was using an XML Sitemap generator and had switched off tags and categories, there initially appeared to be some tag URLs listed in the sitemap, which was odd. It was only when drilling down into the content that I discovered some older slugs had the post ID or a single keyword instead of the post title. To drill down into the content I downloaded the sitemap.xml file and loaded it into Excel 2007, which allowed me to sort the contents and analyse them in a number of different ways.
First off I sorted it alphanumerically, which allowed me to separate post articles from other things, as my permalinks are set up to show year/month/day/post-title. Other pages would be listed separately so I could sort through them. Whilst looking at this I took the opportunity to exclude a number of additional pages, including the search page and archive page, to avoid duplicate content being submitted to Google. This brings the total number of URLs down to 1,231 (homepage + 1,220 posts + 10 static pages).
The next step in Excel was to filter out the www.jasonslater.co.uk domain so only the relative path names remained; this would make it easier to submit certain pages to Webmaster Tools later.
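If you would rather script this step than use a spreadsheet, a short Python sketch along these lines (assuming the standard sitemap namespace and a locally downloaded copy of the file) does the same sort-and-strip job:

import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Standard namespace used by most XML sitemap generators
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

tree = ET.parse("sitemap.xml")  # locally downloaded copy of the sitemap
# Pull every <loc>, strip the domain, and sort the relative paths
paths = sorted(
    urlparse(loc.text.strip()).path
    for loc in tree.getroot().findall("sm:url/sm:loc", NS)
)
for path in paths:
    print(path)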
Back on the Dashboard, Webmaster Tools reported a number of Crawl errors, which you can see in the following image.
The first errors showing up under Crawl errors were 76 “Not found” entries, which largely turned out to be me failing to put the http:// part in front of a number of external links. As a result, the Googlebot was prepending my website address to those links, believing them to be relative links. Fixing the links to include the http:// part and then submitting the false URLs for removal should fix the problem; fingers crossed they get accepted for removal (see the earlier section about URL removal).
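To make the cause a little clearer, here is the difference (using example.com as a stand-in for the real external sites):

<!-- Missing the http part: treated as a path relative to my own site, -->
<!-- so it resolves to www.jasonslater.co.uk/www.example.com/page and returns a 404 -->
<a href="www.example.com/page">an external article</a>

<!-- With the http part included, the link resolves correctly -->
<a href="http://www.example.com/page">an external article</a>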
Rather alarmingly, over 4,016 URLs were being described as unreachable in the Diagnostics section, which is somewhat bizarre given there are only 1,220 articles. However, the cause soon became clear when I remembered that some time ago I tried out a language translator plug-in which created cached versions of translated website pages; these appear to have made their way into the Google search index. I took the widget off because I am not entirely comfortable with full-on automatic language translation; the technology still has some way to go. I added these rogue links to my robots.txt file as Disallow: references. Once the sitemap is updated I will request removal from Google for the paths which no longer exist.
For the next step I took a look under the HTML Suggestions section of the Diagnostics tab which reported a few areas for investigation.
20 pages were showing up as having “Duplicate meta descriptions”. This could be bad, as it could lead to confusion for people visiting the site (multiple pages would appear to talk about the same things). As it turns out, these are all due to the web translation plug-in again, so as a result of the action taken in the previous section they should resolve themselves.
Two pages were being shown as having “Short meta descriptions”. A meta description, as described in Webmaster Tools, should “give users a clear idea of your site’s content and encourage users to click on your site in the search results pages.” As there were only two, I thought I would take a look at them. This is made somewhat easier by the All in One SEO plug-in, which means I only really have to change the page title and slug to be a little more descriptive of the article content.
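For reference, the tag in question sits in the page head; the plug-in generates it for you, but a hand-written example (the wording here is purely illustrative) looks like this:

<meta name="description" content="A walk-through of cleaning up a WordPress blog using Google Webmaster Tools, covering sitemaps, robots.txt and crawl errors.">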
45 pages are showing duplicate title tags. Some of these are due to the web translation plug-in, but many fall into two categories, pointing to old posts where:
- I had not titled the articles very well (using one or two words, which led to duplication)
- I had written an update to an earlier article but given it the same title; I should have simply updated the original article or re-titled the new one
Top Search Queries
The Top Search Queries information on the Dashboard is a useful planning tool. You can:
- Compare the queries against your own set of keywords for your site (also use the Keywords section of “Your site on the web” for this). For example, my site is a technology blog based in the UK, and as such some of my keywords are “UK technology blog”, so I would hope these might appear in the search queries. If they do not, it indicates it could be time to re-evaluate your keywords.
- See what people are searching for within the scope of your blog content. This can be used to ensure visitors to your site have their expectations met when they arrive; if the resulting content is not a good match for the search terms, it could indicate you need to spend time working on your content.
- Understand which articles are getting attention and which ones might need a little further work or investigation. For example, you might have a fantastic article addressing a particular problem but have missed important search terms which might help people find your content.
Links To Your Site
This section is one of the most useful as it shows which pages on your site are being referred to by others. It is worth spending some time here as it can give you some really useful information, including:
- Which sites are regularly referring back to your site
- Sites which may simply be copying your content (this is often highlighted if you include links to your own content within an article)
- Which areas of your site are getting the most focus – you might benefit from specialising in a particular area
- Which sites are linking to particular articles versus those which include a pointer to your home page (perhaps in a blogroll)
An upcoming subject in preparing your content for suitable inclusion by search engines is the appropriate and relevant use of anchor text. Anchor text is the bit of human-readable text that appears highlighted when a clickable link is placed. When writing an article you might be tempted, as I have been in the past, to simply put something like “click here” or “more” for this text; however, it is worth rephrasing a sentence to allow the anchor text to be much more descriptive.
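As a quick before-and-after (the link target and wording are made up for the example):

<!-- Vague: tells the reader and the search engine nothing about the destination -->
Having trouble with slow boot times? <a href="http://www.example.com/fix">Click here</a>.

<!-- Descriptive: the anchor text itself says what the link is about -->
Having trouble with slow boot times? Try this <a href="http://www.example.com/fix">guide to speeding up Windows start-up</a>.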
Looking at the “Anchor text” subsection under “Links to your site” in Google Webmaster Tools, you can see many of the anchor text entries Google has picked up. You can expand each entry to see what extra information might be available. Ideally this would show the source URL, but in the meantime you can search for the anchor text on your own site.
Looking at information from your blog “out of context” is actually quite an eye-opener. When I look through mine I have a number of entries where the anchor text says something like “fix” or “tip” or “link” without really alluding to what it fixes, what the tip is, or what it links to. There are 200 entries in my list, which I have downloaded and will work through slowly. What would be useful would be a WordPress plug-in that could list all your anchor text in a similar way and allow you to go straight to the edit screen for the article.
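In the absence of such a plug-in, a rough Python sketch like the one below gets part of the way there by pulling the anchor text out of a standard WordPress export file (Tools > Export). The filename and the simple regular expression are assumptions, and a proper plug-in would of course do this from inside the dashboard:

import re
import xml.etree.ElementTree as ET

# WordPress exports (WXR) store each post's HTML inside <content:encoded>
CONTENT = "{http://purl.org/rss/1.0/modules/content/}encoded"
ANCHOR = re.compile(r"<a\s[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)

tree = ET.parse("wordpress-export.xml")  # assumed filename for the export
for item in tree.getroot().iter("item"):
    title = item.findtext("title", default="(untitled)")
    body = item.findtext(CONTENT) or ""
    for text in ANCHOR.findall(body):
        # Strip any markup inside the anchor so only the visible text is shown
        print(f"{title}: {re.sub('<[^>]+>', '', text).strip()}")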
Whilst looking through the posts I noticed a whole bunch of older, non-topic-related content which I have decided to cull at a later time, which should make for a much leaner, more technology-focused blog. There is a school of thought that you shouldn’t remove old posts; however, these are off topic and don’t have any links back to them, so I think they qualify in this case.
When making changes using Google Webmaster Tools it may take some time for the changes to permeate through the systems, so be patient.