Why Googlebot Stopped Coming To My Site And How I Got It Back

OK, today was one of the most worrying days for me so far. This is how the day panned out.

This morning I was reviewing a website in the Google index as new pages were not being indexed. For the site in question new pages are generally crawled and indexed in about 10 minutes, sometimes less, sometimes more. The Google cache for the home page and main internal pages is also updated most days.

So today I noticed that the cache had not been updated for 3 days. This was unusual. So, what did I do? Rather than just say what I did, I will write it as a task list should you find yourself in this situation. (edit – this all went wrong, started talking about what happened again below, bear with me, I am not going to rewrite it properly…).

1. Check Google Webmaster Tools

First check Site Configuration > Crawler Access. Can Google access your robots.txt, or is it blank? (mine was blank).

If it is blank, test again; if it still fails, find out why. Go to Labs > Fetch as Googlebot. Attempt to fetch your homepage first. If you can fetch it, then Google can crawl you but has chosen not to, which is a different problem. In my case Google returned a 403 error: HTTP/1.1 403 Forbidden.

In SEO terms this is known as a fucking disaster. Why could Googlebot not access my site?

2. Check Your IP Block List

Now, I run the site in question on WordPress and have WP Firewall installed (created by SEO Egghead). It alerts me to possible threats, generally when someone tries to access a part of the site that they should not because they are sniffing around for vulnerabilities. When I get these reports I always check the IP address and then block it in .htaccess if it looks a bit suspect.

Although, it is possible that on the 2nd October I failed to check this one sufficiently:

WordPress Firewall has detected and blocked a potential attack!

Web Page: /news.php?page=../../../../../../../../../../../../../../../../../../../proc/self/environ
Warning: URL may contain dangerous content!
Offending IP: 66.249.66.137 [ Get IP location ]
Offending Parameter: page = ../../../../../../../../../../../../../../../../../../../proc/self/environ

Because on checking the IP address, 66.249.66.137 belongs to Google, Mountain View. Ooops. I blocked Google, what a twat.
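One way to check an IP before blocking it is a reverse DNS lookup followed by a forward lookup on the returned hostname, which is the check Google describes for verifying Googlebot. Here is a minimal Python sketch of that idea. The function name, the suffix list and the injectable lookup parameters are my own assumptions for illustration, not anything WP Firewall provides.

```python
import socket

# Assumption: hostnames of genuine Google crawlers end in one of these suffixes.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_google_crawler(ip,
                      reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves to a Google hostname
    and that hostname resolves back to the same ip."""
    try:
        # gethostbyaddr returns (hostname, aliases, addresses)
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no reverse DNS record, so it cannot be confirmed

    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False

    try:
        # gethostbyname_ex returns (hostname, aliases, addresses)
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False

    # The forward lookup must include the original IP, otherwise
    # anyone could fake the reverse record.
    return ip in addresses
```

Calling `is_google_crawler("66.249.66.137")` does live DNS lookups, so the answer depends on your resolver at the time you run it. If it comes back True, do not block the address.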

How did I find out? The long and hard way. I removed all the IP addresses that I had blocked in the last week and then added them back a few at a time and ran the Labs > Fetch as Googlebot again until I found out which IP was being blocked.
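Removing and re-adding blocks by hand works, but once you suspect a particular address it is quicker to search the block list for it. The sketch below assumes the blocks are written as Apache "deny from" lines in .htaccess (the exact directives are my assumption) and returns every entry that covers a given IP. The function name is hypothetical.

```python
import ipaddress


def blocked_by(ip, htaccess_text):
    """Return the 'deny from' entries in htaccess_text that cover ip."""
    target = ipaddress.ip_address(ip)
    hits = []
    for line in htaccess_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        words = line.split()
        if len(words) < 3 or [w.lower() for w in words[:2]] != ["deny", "from"]:
            continue  # not a deny rule
        for entry in words[2:]:
            try:
                # Accepts single addresses and CIDR ranges.
                network = ipaddress.ip_network(entry, strict=False)
            except ValueError:
                # Hostnames, "all" and partial IPs are ignored in this sketch.
                continue
            if target in network:
                hits.append(entry)
    return hits
```

For example, reading the file with `open(".htaccess").read()` and passing it in with the IP from the firewall report lists the rules to remove. Note that it skips partial-IP entries such as `66.249`, so an empty result is not proof that nothing is blocking the address.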

So, good news! I found the problem. Bad news, Google was dropping me from the index like I was a bit of shit on its shovel.

Google Dropping Pages from the Index

I checked several pages in the morning, and by the time I had discovered the problem many of these were out of the index already, or at least shoved into the “supplemental index”, meaning that they were there for a site:url search but did not appear for any keywords, not even unique quotes from the page.

OK, so Googlebot was allowed back into the site, but pages were still being dropped.

Back to Google Webmaster Tools

I double checked in Site Configuration > Crawler Access that Google could access the site. It could. Tick.

Then I checked the XML sitemaps and clicked Resubmit on all of them. For sitemaps I have:

  • /sitemap.xml
  • /feed
  • /comments/feed

The feeds were only added today (as suggested by Google in Webmaster Tools).

Send a Reinclusion Request

This may seem extreme, but Google reinclusion / reconsideration requests are not just for when your site is flagged as spam or as containing malware. They can also be used when you just want to tell Google that you think something has upset the index and you would like it reviewed. Google actually says a request is appropriate:

“If your site isn’t appearing in Google search results, or it’s performing more poorly than it once did.”

So a site dropping out as a result of you (me) blocking an IP address accidentally falls into that category.

Add Some New Content and Update Some Articles

The next thing I did was deactivate my WordPress ping controller. WordPress by default pings websites whenever you publish or edit an article. As I do a lot of editing, I had turned off pinging on edit using MaxBlogPress Ping Optimizer; deactivating it meant that edits would send pings again.

I edited about 5 articles, republishing them with today’s date so that they appear on the front page and at the top of the RSS feeds. They also appear as new in the XML sitemap, although the URLs all remain the same. This is not something you should really do all the time, but it is handy for sending a wake-up call to the bots.

Then I published a couple of new articles just to give the ping services and Google an extra kick.

Check Access Logs

Next I checked the access logs and thankfully I could see that Google did start crawling the site again. What you need to look for on an access log is something like (or identical to): Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

This is the Googlebot (according to a forum I found earlier today) that does a deeper crawl of the site. This is a good sign. But just checking the logs for “googlebot” should pick up any googlebot.
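To see at a glance whether Googlebot has gone quiet, you can count its hits per day. The following is a rough Python sketch. It assumes each log line carries a timestamp like [02/Oct/2010:10:00:00 +0000], as in the common Apache log formats; your host's format may differ, and the function name is hypothetical.

```python
import re
from collections import Counter

# Assumption: the date appears as [dd/Mon/yyyy:hh:mm:ss ...] in each line.
DATE = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}):")


def googlebot_hits_per_day(lines):
    """Count log lines that mention 'googlebot', grouped by date."""
    counts = Counter()
    for line in lines:
        # Case-insensitive match on the user agent text.
        if "googlebot" not in line.lower():
            continue
        match = DATE.search(line)
        counts[match.group(1) if match else "unknown"] += 1
    return counts
```

Pass it an open log file, for example `googlebot_hits_per_day(open("access.log"))`, where the path is a placeholder. A run of days with no entries is the warning sign. Since the match is on the user agent text only, it does not prove that the visitor really was Google.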

For example, Agent: portalmmm/2.0 N500iS(c20;TB) (compatible; Mediapartners-Google/2.1; +http://www.google.com/bot.html) is the Mediapartners bot, which as far as I know is the one AdSense uses to determine the content of your site for displaying contextual adverts (rather than the one AdWords uses for calculating quality score). Either way, it is good if it is looking. Note that both of these bots come from IP 66.249.66.137.

So in addition to blocking Google from crawling and indexing my site, I was blocking the AdSense bot from visiting, which probably also explains why my eCPM crashed yesterday. And that was before the major traffic crash.

Yesterday was the worst Monday I have experienced on this site (not *this* site, but the site I am talking about….) in a long time. I originally thought that I was still suffering as a result of the recent site restructure, which saw me move around over 1000 articles, many of them well over 1000 words long (the biggest having 41,000 words) and many ranked No. 1 in Google for their keywords.

Looking To The Future

The good news is that pages are being re-indexed now. This afternoon at the height of the disaster even the homepage was not showing for the domain name or brand name – instead content thieves and brand thieves were showing on top of the SERPs.

But the home page is back, the Google sitelinks are back (with new ones to boot) and some of the major keywords are back too.

Overall the site is still suffering. The combination of a major restructure and then blocking Google has caused untold damage, but hopefully time will heal.

So, What Actually Cured the Problem?

Whenever you are testing things, you should do one thing at a time. But I did not want to wait for this, so I just charged in, doing everything possible to resolve the problem. Any one of these may have been enough to fix the problem, or it may have been a combination:

  • Unblocking the IP address. This is the only sure thing that started getting the site back in the index.
  • Adding the feeds to the sitemaps
  • Resubmitting the sitemap
  • Testing the robots.txt
  • Republishing articles and pinging
  • Writing new content
  • Sending a reinclusion / reconsideration request

Lessons for the Day

  • Never block an IP address unless you are 100% sure it is not a good bot. Firewalls give false positives, although I still have no idea why Google requested that URL (it does not actually exist on the site).
  • If the shit hits the fan, take a deep breath and analyse the situation. Look at the logs. Know your logs. Until today I did not know how often Googlebot visits my site. Webmaster Tools is always a few days out of date, so it looks healthy as the latest report is from the 2nd. I will probably see the problem in Webmaster Tools tomorrow.
  • Keep a log of changes so that you can quickly reverse things. Since I started the site restructure about 6 weeks ago I have kept a log of all (almost all) changes. The spreadsheet is now about 500 rows long.
  • DO NOT PANIC.
