Google Sitemaps with Ruby on Rails, Capistrano, and Cron
This is a slight modification of code originally written by Alastair Brunton. I recently implemented this for Jetrecord and since Alastair was so generous, I decided to share the love as well. I have changed Alastair’s code to generate a sitemap index file plus sitemap files for each model, all of them gzipped to save on bandwidth.
I have also added Capistrano code to copy sitemap files from the previous release to the current release so we don’t lose our sitemap files when we deploy a new release.
Remember, Google sitemaps are for publicly available URLs. They’re for pages that you want Google to find and index. If you don’t want Google to find your CIA Operatives records, don’t tell Google about it!
Let’s just go straight to the code. I am going from the top down in my application’s root directory.
app/models/your_model.rb
You must add this code to each model that you want to generate a sitemap for. Here is an example for Airports on Jetrecord.
    # put this inside app/models/airport.rb
    def self.get_paths
      path_ar = []
      self.find(:all).each do |model|
        path_ar << {:url => "/airports/#{model.to_param}", :last_mod => model.updated_at.strftime('%Y-%m-%d')}
      end
      path_ar
    end
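To make the return value concrete, here is a minimal standalone sketch of the same loop. FakeAirport and its two sample records are hypothetical stand-ins for an ActiveRecord model, so the snippet runs without Rails; in the real model, to_param and updated_at come from ActiveRecord.

```ruby
# Hypothetical stand-in for an ActiveRecord model (not part of the original code).
FakeAirport = Struct.new(:code, :updated_at) do
  def to_param
    code
  end
end

# Same logic as get_paths, but taking the records as an argument
# instead of loading them with find(:all).
def paths_for(records)
  records.map do |model|
    { :url => "/airports/#{model.to_param}",
      :last_mod => model.updated_at.strftime('%Y-%m-%d') }
  end
end

# Made-up sample data for illustration only.
records = [
  FakeAirport.new('KSEA', Time.utc(2008, 6, 1)),
  FakeAirport.new('KPDX', Time.utc(2008, 6, 9))
]

paths = paths_for(records)
paths.each { |p| puts "#{p[:url]} #{p[:last_mod]}" }
```

Each hash becomes one url entry in the sitemap: the :url value is appended to the base URL, and :last_mod is written out as the lastmod date.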
config/sitemap/sitemap_tasks.rb
This is for Capistrano. You probably don’t have a config/sitemap directory. I created one and put my Capistrano sitemap task in it. This tells Capistrano, “After deploying my new release, copy the sitemap files from the previous release and store them in the same location in the current release.”
    Capistrano::Configuration.instance(:must_exist).load do
      namespace :sitemap do
        desc "Copy the sitemap files after deploy"
        task :copy_sitemap, :roles => :app do
          puts "copying Rails sitemap files"
          sudo "cp #{previous_release}/public/sitemaps/* #{current_release}/public/sitemaps/"
        end
        after :deploy, 'sitemap:copy_sitemap'
      end
    end
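The task above runs a shell cp on the server. As a rough illustration of what that copy step amounts to, here is a plain-Ruby sketch that works on local directories. The copy_sitemaps helper and the temporary directory layout are assumptions made for the demo, not part of the Capistrano recipe. It also shows one guard the shell version lacks: when the previous release has no sitemaps directory (on a first deploy, say), the cp command may fail, whereas the sketch simply copies nothing.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: copy every file from the previous release's sitemap
# directory into the current release's. Returns the copied file names.
def copy_sitemaps(previous_release, current_release)
  source_dir = File.join(previous_release, 'public', 'sitemaps')
  target_dir = File.join(current_release, 'public', 'sitemaps')
  return [] unless File.directory?(source_dir) # nothing to copy yet

  FileUtils.mkdir_p(target_dir)
  files = Dir.glob(File.join(source_dir, '*')).select { |f| File.file?(f) }
  FileUtils.cp(files, target_dir) unless files.empty?
  files.map { |f| File.basename(f) }.sort
end

# Demo with throwaway directories standing in for two releases.
copied = nil
target_has_file = false
Dir.mktmpdir do |root|
  previous = File.join(root, 'releases', '1')
  current  = File.join(root, 'releases', '2')
  FileUtils.mkdir_p(File.join(previous, 'public', 'sitemaps'))
  FileUtils.mkdir_p(current)
  File.write(File.join(previous, 'public', 'sitemaps', 'sitemap_index.xml.gz'), 'placeholder')

  copied = copy_sitemaps(previous, current)
  target_has_file = File.exist?(File.join(current, 'public', 'sitemaps', 'sitemap_index.xml.gz'))
  puts copied.inspect
end
```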
config/deploy.rb
This file usually contains your typical Capistrano recipes. All you have to do is require the sitemap_tasks file we created above.
    # At the top of the file, after any other required files
    require 'config/sitemap/sitemap_tasks'
lib/google_sitemap.rb
This is the meat of the whole thing. Kudos to Alastair for setting this up. I modified it to use a sitemap index with one sitemap per model because Google allows at most 50,000 URLs per sitemap file. I have 48,000 navigation fixes, 20,000 airports, and 3,000 navaids in Jetrecord, about 71,000 URLs in all, so I have to split them across several sitemaps.
I’m also gzipping the sitemap files because Google can read them and it saves bandwidth. Oh, and the URL to ping Google has changed, as has the XML namespace for their sitemap tags.
    require 'net/http'
    require 'uri'
    require 'zlib'

    # A class specific to the application which generates a google sitemap from the contents of the database.
    # Author: Alastair Brunton
    # Modified: Harry Love 2008-06-09
    class GoogleSitemapGenerator

      def initialize(base_url, sources)
        @base_url = base_url
        @sources = sources
      end

      # 1. Iterate through each model's #get_paths method
      # 2. Create sitemap file for each model
      # 3. Create sitemap index file
      # 4. Ping Google
      def generate
        @sources.each do |source|
          # look up the model class by name and call the get_paths method on it.
          path_ar = source.constantize.get_paths
          xml = generate_sitemap(path_ar)
          save_file(source, xml)
        end
        index = generate_sitemap_index(@sources)
        save_file('index', index)
        update_google
      end

      # Create a sitemap document for a model
      def generate_sitemap(path_ar)
        xml_str = ""
        xml = Builder::XmlMarkup.new(:target => xml_str)
        xml.instruct!
        xml.urlset(:xmlns => 'http://www.sitemaps.org/schemas/sitemap/0.9') {
          path_ar.each do |path|
            xml.url {
              xml.loc(@base_url + path[:url])
              xml.lastmod(path[:last_mod])
              xml.changefreq('weekly')
            }
          end
        }
        xml_str
      end

      # Create a sitemap index document
      def generate_sitemap_index(sitemaps)
        xml_str = ""
        xml = Builder::XmlMarkup.new(:target => xml_str)
        xml.instruct!
        xml.sitemapindex(:xmlns => 'http://www.sitemaps.org/schemas/sitemap/0.9') {
          sitemaps.each do |site|
            xml.sitemap {
              xml.loc(@base_url + "/sitemaps/sitemap_#{site}.xml.gz")
              xml.lastmod(Time.now.strftime('%Y-%m-%d'))
            }
          end
        }
        xml_str
      end

      # Save the xml file (gzipped) to disk.
      # GzipWriter.open writes in binary mode and closes the file when the block ends.
      def save_file(source, xml)
        Zlib::GzipWriter.open(RAILS_ROOT + "/public/sitemaps/sitemap_#{source}.xml.gz") do |gz|
          gz.write xml
        end
      end

      # Notify Google of the new sitemap index file
      def update_google
        sitemap_uri = @base_url + '/sitemaps/sitemap_index.xml.gz'
        escaped_sitemap_uri = URI.escape(sitemap_uri)
        Net::HTTP.get('www.google.com', '/webmasters/tools/ping?sitemap=' + escaped_sitemap_uri)
      end
    end
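One caveat if you run this on a newer Ruby: URI.escape was deprecated for a long time and removed in Ruby 3.0, so update_google would raise NoMethodError there. A possible replacement, sketched below, builds the same ping path with URI.encode_www_form_component. The ping_path helper name is my own for illustration and not part of the original class.

```ruby
require 'uri'

# Hypothetical helper: build the path-and-query string for the ping request.
# encode_www_form_component also escapes ':' and '/', which is appropriate
# for a value placed inside a query string.
def ping_path(base_url)
  sitemap_uri = base_url + '/sitemaps/sitemap_index.xml.gz'
  '/webmasters/tools/ping?sitemap=' + URI.encode_www_form_component(sitemap_uri)
end

puts ping_path('http://yourdomain.com')
# prints /webmasters/tools/ping?sitemap=http%3A%2F%2Fyourdomain.com%2Fsitemaps%2Fsitemap_index.xml.gz
```

The result can be passed to Net::HTTP.get as the path argument, in place of the string concatenation in update_google.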
lib/tasks/sitemap.rake
This is the rake task that we’ll call periodically from Cron to generate new sitemap files.
    require 'google_sitemap'

    namespace :google_sitemap do
      desc "Generate a Google sitemap from the models"
      task(:generate => :environment) do
        # Generate sitemaps for each of the models listed in the array
        sources = %w( Airport Navaid Fix AnotherModel AnotherModel AndAnotherModel EtCetera )
        sitemap = GoogleSitemapGenerator.new('http://yourdomain.com', sources)
        sitemap.generate
      end
    end
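The generator writes one file per model, which works as long as no single model has more than 50,000 records. If one ever does, one possible approach is to slice its paths into batches and write a file per batch. The sketch below is only that, a sketch: MAX_URLS_PER_SITEMAP, sitemap_batches, and the "Fix_1", "Fix_2" naming scheme are my own assumptions, not something the original code does.

```ruby
MAX_URLS_PER_SITEMAP = 50_000 # per-file URL limit discussed earlier

# Hypothetical helper: split a model's paths into batches and pair each batch
# with a file label such as "Fix_1", "Fix_2". A model that fits in one file
# keeps its plain name.
def sitemap_batches(source, paths, limit = MAX_URLS_PER_SITEMAP)
  slices = paths.each_slice(limit).to_a
  return [[source, paths]] if slices.length <= 1
  slices.each_with_index.map { |slice, i| ["#{source}_#{i + 1}", slice] }
end

# Demo with made-up paths and a tiny limit so the output stays readable.
paths = (1..5).map { |n| { :url => "/fixes/#{n}", :last_mod => '2008-06-09' } }
batches = sitemap_batches('Fix', paths, 2)
batches.each { |label, slice| puts "#{label}: #{slice.length} urls" }
```

Each label could then be handed to save_file and listed in the sitemap index in place of the bare model name.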
public/sitemaps
Assuming this directory doesn’t exist already, create it.
Depending on the stack you’re using to deploy your Rails app, you may also need to tell your web server not to proxy HTTP requests for this directory. For example, I’m proxying requests to Mongrel via Apache, so in the Apache virtual host conf file for my app I had to add a ProxyPass exclusion (the trailing ! means “do not proxy this path”). That way Apache serves the sitemap files itself instead of passing the request on to Mongrel. The exclusion has to appear before the general ProxyPass rule that sends everything else to Mongrel.
    # Right after the ProxyPass directives for images, stylesheets, and javascripts
    ProxyPass /sitemaps !
Don’t forget to restart Apache after you save the new conf file!
Add a Cron Job
Lastly, you need to add a cron job to call the rake task so we can generate new sitemap files from time to time. The frequency is up to you and the requirements of your app.
Unfortunately, I’m not up to date on raw Cron commands. I use a GUI provided by my web host. But here’s the command I’m using on Solaris to call the rake task. You’ll have to edit this to suit the specifics of your application and server environment.
cd /var/www/apps/myapp/current && /opt/local/bin/rake RAILS_ENV=production google_sitemap:generate
Don’t forget to tell Rake to use the production environment. Another potential gotcha: you usually have to give cron the full path to rake. You can find out where it is on your server by logging in as the user you plan to use for the cron job (usually root) and doing “which rake”. If that doesn’t bring it up it means rake isn’t in your PATH. That’s okay. You’ll just have to do a little more digging to find out where rake is installed on your system.
If I’ve left out anything let me know. By the way, this would make a great plugin or gem, if only I knew how to make them.









