Telling the Google Bot no
On every web project I've worked on, one of the key/top/vital priorities was to make sure that Google could index every last word of the site, so that anyone searching for what we had would find it. My most recent project turned that on its head.
What if you don't want Google to index everything? What if you only want Google to index this, but not that?
The project where this came up was Tampa Bay Mug Shots, a site that displays the mug shots of people booked into county jails in three Florida counties as they come in. To make the site, we wrote bots that go and scrape the county jail websites every hour. The information we gather is already in the public domain and has been for years. Many counties have several years -- 14 years in Hillsborough County -- of mugs available.
Before we had written a line of code, we were concerned about the Google issue. We did not want the first result in Google for someone's name to be our site. The goal of our site was to show you who was passing through local jails at any given moment, not to become a permanent repository of criminal records. For many, many reasons, we can't do that. It's difficult to track an offender through the system even with reporters who know what to look for. To do it in an automated fashion with bots is impossible. And not only that, it's not the job of a newspaper company to become a repository of criminal histories. Nor should it be.
So from the start, the site was built with the First Result in Google problem top of mind. Here's what we did to stop the Google Bot:
1. Mugs expire after 60 days: The queries that bring up a mug are date-based. The query checks whether the booking date on the record is older than 60 days. If it is, it serves up a template that says the mug has been removed. If it's newer than 60 days, it serves up the mug. This serves two purposes. First, if someone links to a mug on their blog, and 100 days later someone finds that link and follows it, they'll find nothing. Second, since we can't know the resolution of any case, we just drop the mug, convicted or acquitted. One of the most frequent questions I get is about that issue: What if the charges are dropped? The answer is the same as if the person is convicted and punished to the fullest extent of the law: On our site, the mug is gone in 60 days. It's a matter of public record that someone was arrested and booked into jail, and transparency here is an important part of the justice system. But since we can't know the outcome of a case, we elected to just drop the mug after a set period.
2. The mug shot galleries are made with JavaScript, which the Google Bot ignores. If you ever want Google to just never find something, wrap it in JavaScript. This is a mistake a lot of sites make: building rich interfaces -- or whole applications -- out of JavaScript that Google almost totally ignores. In this case, we used it to our advantage. To Google, there is no link to an individual's page on those galleries for the bot to follow. No link, no indexing.
3. The sitemap we submitted to Google doesn't include individual mugs. The best way to get Google to index your site is to give the bot a road map in the form of a Google sitemap. Ours only includes our index pages -- the browse by height, weight, eye color, etc. pages. We simply didn't include a map to every individual booking photo.
4. The individual pages have a meta tag that says "don't index me." Even with our other efforts, Google could still find individual pages. Anytime anyone linked to a person from a blog, a Twitter stream or any other place the Google Bot goes, the bot could follow that link and index that individual's page. So, our last step was to attach a tag to those individual pages that says to Google "nothing to see here, move along."
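The 60-day check in step 1 boils down to a date comparison. Here is a minimal sketch of that idea in plain Python, not the site's actual Django query. The names is_expired, choose_template, mug_detail.html and mug_removed.html are made up for illustration, and treating a mug that is exactly 60 days old as still live is an assumption.

```python
from datetime import date, timedelta

# Retention window from step 1: mugs are shown for 60 days after booking.
MUG_LIFETIME = timedelta(days=60)


def is_expired(booking_date: date, today: date) -> bool:
    """Return True when the booking is more than 60 days old."""
    return today - booking_date > MUG_LIFETIME


def choose_template(booking_date: date, today: date) -> str:
    """Pick the template to render for a mug's detail page.

    Template names are hypothetical placeholders.
    """
    if is_expired(booking_date, today):
        return "mug_removed.html"  # page that says the mug has been removed
    return "mug_detail.html"       # normal page with the booking photo


if __name__ == "__main__":
    booked = date(2009, 1, 1)
    print(choose_template(booked, date(2009, 2, 1)))   # within the window
    print(choose_template(booked, date(2009, 4, 11)))  # 100 days after booking
```

Because the check runs on every request, an old link someone posted months ago lands on the "removed" page without anyone having to delete anything.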
For the tech heads, we used Django, and we can target the Google Bot on individual pages through a block tag in our templates. In the head tag of our base template, we put
{% block google %}{% endblock %}
and in the mug shot detail template, we put
<meta name="robots" content="noindex, nofollow, noarchive, nosnippet, noodp" />
into that block. We can do it anywhere now, should we find a need.
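If you want to confirm the tag actually lands on the rendered detail pages (and stays off the index pages), something like the following could work. It is a rough sketch using Python's standard html.parser module; RobotsMetaParser and robots_directives are hypothetical helper names, not part of our site's code.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags like <meta ... /> are also routed here by default.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of robots directives found in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    detail_page = (
        '<html><head><meta name="robots" '
        'content="noindex, nofollow, noarchive, nosnippet, noodp" />'
        "</head><body></body></html>"
    )
    print("noindex" in robots_directives(detail_page))
```

Feed it the HTML of a detail page and an index page; the first should report noindex and the second should come back empty.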
I doubt it will be a common case where you don't want Google to index something, but sometimes privacy trumps SEO. I hope this helps anyone else facing this trade-off.