Google Sitemaps with Ruby on Rails, Capistrano, and Cron

June 10, 2008

This is a slight modification of code originally written by Alastair Brunton. I recently implemented this for Jetrecord and since Alastair was so generous, I decided to share the love as well. I have changed Alastair’s code to generate a sitemap index file plus sitemap files for each model, all of them gzipped to save on bandwidth.

I have also added Capistrano code to copy sitemap files from the previous release to the current release so we don’t lose our sitemap files when we deploy a new release.

Remember, Google sitemaps are for publicly available URLs. They’re for pages that you want Google to find and index. If you don’t want Google to find your CIA Operatives records, don’t tell Google about it!

Let’s just go straight to the code. I am going from the top down in my application’s root directory.

app/models/your_model.rb

You must add this code to each model that you want to generate a sitemap for. Here is an example for Airports on Jetrecord.

# put this inside app/models/airport.rb
def self.get_paths
  path_ar = []
  self.find(:all).each do |model|
    path_ar << {:url => "/airports/#{model.to_param}", :last_mod => model.updated_at.strftime('%Y-%m-%d')}
  end
  path_ar
end
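If you want to see the shape of what get_paths returns without touching a database, here is a minimal plain-Ruby sketch. FakeAirport and paths_for are hypothetical stand-ins I made up for illustration (a real model would be an ActiveRecord class), and the sketch uses map instead of a manual accumulator.

```ruby
# Hypothetical stand-in for an ActiveRecord model row.
FakeAirport = Struct.new(:code, :updated_at) do
  # Rails calls to_param to build the URL segment for a record.
  def to_param
    code
  end
end

# Build the same array of hashes that get_paths returns:
# one :url and one :last_mod (YYYY-MM-DD) per record.
def paths_for(records, prefix)
  records.map do |record|
    { :url => "#{prefix}/#{record.to_param}",
      :last_mod => record.updated_at.strftime('%Y-%m-%d') }
  end
end

airports = [FakeAirport.new('KSEA', Time.utc(2008, 6, 1)),
            FakeAirport.new('KPDX', Time.utc(2008, 6, 9))]
paths = paths_for(airports, '/airports')
```

Each hash in paths has exactly the two keys the generator reads later: :url and :last_mod.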

config/sitemap/sitemap_tasks.rb

This is for Capistrano. You probably don’t have a config/sitemap directory. I created one and put my Capistrano sitemap task in it. This tells Capistrano, “After deploying my new release, copy the sitemap files from the previous release and store them in the same location in the current release.”

Capistrano::Configuration.instance(:must_exist).load do
  namespace :sitemap do
 
    desc "Copy the sitemap files after deploy"
    task :copy_sitemap, :roles => :app do
      puts "copying Rails sitemap files"
      sudo "cp #{previous_release}/public/sitemaps/* #{current_release}/public/sitemaps/"
    end
 
    after :deploy, 'sitemap:copy_sitemap'
  end
end

config/deploy.rb

This file usually contains your typical Capistrano recipes. All you have to do is require the sitemap_tasks file we created above.

# At the top of the file, after any other required files
require 'config/sitemap/sitemap_tasks'

lib/google_sitemap.rb

This is the meat of the whole thing. Kudos to Alastair for setting this up. I modified it to use a sitemap index with a separate sitemap for each model because Google allows a maximum of 50,000 URLs per sitemap file. I have 48,000 navigation fixes, 20,000 airports, and 3,000 navaids in Jetrecord, so I have to split my sitemap into several sitemaps.

I’m also gzipping the sitemap files because Google can read them and it saves bandwidth. Oh, and the URL to ping Google has changed, as has the XML namespace for their sitemap tags.

require 'net/http'
require 'uri'
require 'zlib' # ActiveSupport loads this for you in Rails 2.1, but earlier versions do not
 
# A class specific to the application which generates a google sitemap from the contents of the database.
# Author: Alastair Brunton
# Modified: Harry Love 2008-06-09
class GoogleSitemapGenerator
 
  def initialize(base_url, sources)
    @base_url = base_url
    @sources = sources
  end
 
  # 1. Iterate through each model's #get_paths method
  # 2. Create sitemap file for each model
  # 3. Create sitemap index file
  # 4. Ping Google
  def generate
    @sources.each do |source|
      # look up the model class by name and call its get_paths class method
      path_ar = eval("#{source}.get_paths")
      xml = generate_sitemap(path_ar)
      save_file(source, xml)
    end
    index = generate_sitemap_index(@sources)
    save_file('index', index)
    update_google
  end
 
  # Create a sitemap document for a model
  def generate_sitemap(path_ar)
    xml_str = ""
    xml = Builder::XmlMarkup.new(:target => xml_str)
    xml.instruct!
    xml.urlset(:xmlns => 'http://www.sitemaps.org/schemas/sitemap/0.9') {
      path_ar.each do |path|
        xml.url {
          xml.loc(@base_url + path[:url])
          xml.lastmod(path[:last_mod])
          xml.changefreq('weekly')
        }
      end
    }
    xml_str
  end
 
  # Create a sitemap index document
  def generate_sitemap_index(sitemaps)
    xml_str = ""
    xml = Builder::XmlMarkup.new(:target => xml_str)
    xml.instruct!
    xml.sitemapindex(:xmlns => 'http://www.sitemaps.org/schemas/sitemap/0.9') {
      sitemaps.each do |site|
        xml.sitemap {
          xml.loc(@base_url + "/sitemaps/sitemap_#{site}.xml.gz")
          xml.lastmod(Time.now.strftime('%Y-%m-%d'))
        }
      end
    }
    xml_str
  end
 
  # Save the xml file (gzipped) to disk
  def save_file(source, xml)
    # open in binary mode because gzip output is not text
    File.open(RAILS_ROOT + "/public/sitemaps/sitemap_#{source}.xml.gz", 'wb') do |f|
      gz = Zlib::GzipWriter.new(f)
      gz.write xml
      gz.close
    end
  end
 
  # Notify Google of the new sitemap index file
  def update_google
    sitemap_uri = @base_url + '/sitemaps/sitemap_index.xml.gz'
    escaped_sitemap_uri = URI.escape(sitemap_uri)
    Net::HTTP.get('www.google.com', '/webmasters/tools/ping?sitemap=' + escaped_sitemap_uri)
  end
end
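None of my models has more than 50,000 rows, so one file per model is enough for me. If one of yours is bigger, you would have to split its paths across several files. Here is a rough sketch of that idea, plus an in-memory gzip round trip to check that the compressed data reads back intact. split_paths and gzip are hypothetical helpers, not part of the class above, and the sketch uses only Ruby's standard library.

```ruby
require 'zlib'
require 'stringio'

MAX_URLS_PER_SITEMAP = 50_000 # the per-file limit mentioned above

# Split a list of paths into groups no larger than the limit.
def split_paths(paths, limit = MAX_URLS_PER_SITEMAP)
  paths.each_slice(limit).to_a
end

# Gzip a string in memory and return the compressed bytes.
def gzip(string)
  io = StringIO.new
  gz = Zlib::GzipWriter.new(io)
  gz.write(string)
  gz.close # closing the writer flushes the gzip trailer
  io.string
end

# 120,000 made-up paths end up in three groups.
paths = (1..120_000).map { |i| "/fixes/#{i}" }
chunks = split_paths(paths)

# Compress a tiny document and read it back to confirm it survives.
compressed = gzip('<urlset></urlset>')
restored = Zlib::GzipReader.new(StringIO.new(compressed)).read
```

Each group in chunks would become its own sitemap file, and each of those files would need its own entry in the sitemap index.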

lib/tasks/sitemap.rake

This is the rake task that we’ll call periodically from Cron to generate new sitemap files.

require 'google_sitemap'
namespace :google_sitemap do
  desc "Generate a Google sitemap from the models"
  task(:generate => :environment) do
    # Generate sitemaps for each of the models listed in the array
    sources = %w( Airport Navaid Fix AnotherModel AndAnotherModel EtCetera )
    sitemap = GoogleSitemapGenerator.new('http://yourdomain.com', sources)
    sitemap.generate
  end
end
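The sources array holds class names as plain strings, and the generator turns each one into a class with eval before calling get_paths on it. If eval makes you nervous, here is a minimal sketch of the same lookup using Object.const_get. The Airport and Navaid classes below are throwaway stand-ins with hard-coded paths, not real models.

```ruby
# Throwaway stand-ins; in the application these would be ActiveRecord models.
class Airport
  def self.get_paths
    [{ :url => '/airports/ksea', :last_mod => '2008-06-01' }]
  end
end

class Navaid
  def self.get_paths
    [{ :url => '/navaids/sea', :last_mod => '2008-06-02' }]
  end
end

sources = %w( Airport Navaid )

# Resolve each name to its class, then call the class method on it.
paths_by_source = sources.each_with_object({}) do |source, result|
  result[source] = Object.const_get(source).get_paths
end
```

Object.const_get raises a NameError for a name that doesn't exist, so a typo in the sources list fails loudly instead of being evaluated as arbitrary code.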

public/sitemaps

Assuming this directory doesn’t exist already, create it.

Depending on the stack you use to deploy your Rails app, you may also need to tell your web server not to proxy HTTP requests for this directory. For example, I’m proxying requests to Mongrel via Apache. So, in the Apache virtual host conf file for my app, I had to add a ProxyPass directive so Apache would serve the sitemap files itself instead of passing the requests to Mongrel.

# Right after the ProxyPass directives for images, stylesheets, and javascripts
ProxyPass /sitemaps !

Don’t forget to restart Apache after you save the new conf file!

Add a Cron Job

Lastly, you need to add a cron job to call the rake task so we can generate new sitemap files from time to time. The frequency is up to you and the requirements of your app.

Unfortunately, I’m not up to date on raw Cron commands. I use a GUI provided by my web host. But here’s the command I’m using on Solaris to call the rake task. You’ll have to edit this to suit the specifics of your application and server environment.

cd /var/www/apps/myapp/current && /opt/local/bin/rake RAILS_ENV=production google_sitemap:generate

Don’t forget to tell Rake to use the production environment. Another potential gotcha: you usually have to give cron the full path to rake. You can find out where it is on your server by logging in as the user you plan to use for the cron job (usually root) and running “which rake”. If that doesn’t bring it up, rake isn’t in your PATH. That’s okay. You’ll just have to do a little more digging to find out where rake is installed on your system.

If I’ve left out anything let me know. By the way, this would make a great plugin or gem, if only I knew how to make them.

8 Comments

  1. David Kennedy spoke thusly:

    I have one web application that is used to run 100+ domains. We’re adding content to some of these sites http://www.PublicWaterMan.com and http://www.PublicEvidenceMan.com and don’t seem to be getting indexed by Google. We get hits with their ‘bots’ as well as Yahoo’s and MSN’s each night (exception_notification emails) but we don’t seem to score visibility to the search engines.
    Anyway, I’m just now waking up to the fact that I’m blowing it without having any sitemap.xml and wondered if I could hire you to teach me to write a routine and the cron job on my server. I know this would probably be very boring for you and I would do my best to be a ‘quick study’ and value your time.
    My application is HUGE and has over 60 tables with each one being scoped by the website. I am grateful for any reply.
    David

  2. MJ spoke thusly:

    Thanks for the good writeup, I had to add a few things:

    * require ‘zlib’ to google_sitemap_generator.rb
    * in get_paths I changed model.updated_at to the name of the var that was passed in

  3. Harry Love spoke thusly:

    Thanks, MJ, and sorry about that. Rails 2.1 has zlib required as part of ActiveSupport so it wasn’t necessary for me.

    Thanks for the catch on get_paths. I have updated the code. I think I copied it midway through a change, which is why the variable names didn’t match up.

  4. Jason spoke thusly:

    Hmmm, I’m not understanding in the code above how the sources and base_url are getting populated, or even where the methods on your models are called from. Is there something that I’m missing?

  5. Harry spoke thusly:

    Jason, the first two are in the rake file. The model methods are called from the #generate method in google_sitemap.rb.

  6. links for 2008-07-26 | Libin Pan spoke thusly:

    [...] Google Sitemaps with Ruby on Rails, Capistrano, and Cron, or so says Harry Love (tags: google sitemap rails rake ruby rubyonrails) [...]

  7. snowmaninthesun spoke thusly:

    Hey, thanks for the write up, works great, though it took me a few minutes to figure out the ‘zlib’ problem with rails 2.0.2 . I have a massive amount of model generated pages, so this is perfect, but if i upload a sitemap without indexing my static pages (index.html.erb), will google ignore them, or is their bot smarter than that??

  8. Harry Love spoke thusly:

    As far as I know, Googlebot follows links. If your static pages are linked from the pages in the sitemap, Googlebot will find them. It probably just won’t assign them a priority or a schedule to index.

    The format for a sitemap file is pretty simple, though. You could create a static one yourself that lists all of your static pages, put it in the sitemaps directory, and then add a line to the #generate_sitemap_index method to write out a link to that file.

    I would copy the xml.sitemap block and put it above the sitemaps.each iterator. Change xml.loc to point to your static file.

    The Capistrano task does a bulk copy upon deploy which means your static file will get copied as well.
