Google Sitemap Generator

February 26, 2008

Google sitemaps are a convenient way to tell Google which pages exist on a site. Clients often ask for one for SEO, or you may have a site with constantly changing content that you want Google to keep up with.

Whatever your reason for being interested in these little XML files, the following code generates a sitemap for a dynamic site in Ruby.

Firstly the class:

  require 'net/http'
  require 'uri'
  require 'builder' # already loaded under Rails; needed in plain Ruby

  # A class specific to the application which generates a google sitemap from
  # the contents of the database.
  # Author: Alastair Brunton
  class GoogleSitemapGenerator

    def initialize(base_url, sources)
      @base_url = base_url
      @sources = sources 
    end

    # The main generator method which in turn adds to the path_array from the different
    # sources.
    # Sources are: pages, events, properties
    def generate
      path_ar = Array.new
      @sources.each do |source|
        # Look up the class by name and call its get_paths class method.
        path_ar = path_ar + Object.const_get(source).get_paths
      end
      xml = generate_xml(path_ar)
      save_file(xml)
      update_google
    end

    # This creates the XML document.
    def generate_xml(path_ar)
      xml_str = ""
      xml = Builder::XmlMarkup.new(:target => xml_str)

      xml.instruct!
      xml.urlset(:xmlns => 'http://www.google.com/schemas/sitemap/0.84') {
        path_ar.each do |path|
          xml.url {
            xml.loc(@base_url + path[:url])
            xml.lastmod(path[:last_mod])
            xml.changefreq('weekly')
          }
        end
      }
      xml_str
    end

    # Saves the XML file to disk in public/, where it is served as /sitemap.xml.
    def save_file(xml)
      File.open(RAILS_ROOT + '/public/sitemap.xml', "w+") do |f|
        f.write(xml)
      end
    end

    # Notify Google of the new sitemap.
    def update_google
      sitemap_uri = @base_url + '/sitemap.xml'
      # CGI.escape percent-encodes the URL for use as a query value;
      # CGI is already loaded in a Rails environment.
      escaped_sitemap_uri = CGI.escape(sitemap_uri)
      Net::HTTP.get('www.google.com',
                    '/webmasters/sitemaps/ping?sitemap=' +
                    escaped_sitemap_uri)
    end


  end
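
To see roughly what update_google requests, here is a small sketch that only builds the ping path and never touches the network. The helper name ping_path and the example.com address are made up for illustration, and CGI.escape is used on the assumption that the sitemap address should be percent-encoded as a query value.

```ruby
require 'cgi'

# Hypothetical helper: builds the request path sent to www.google.com,
# without performing the HTTP request itself.
def ping_path(base_url)
  sitemap_uri = base_url + '/sitemap.xml'
  '/webmasters/sitemaps/ping?sitemap=' + CGI.escape(sitemap_uri)
end

puts ping_path('http://www.example.com')
# prints /webmasters/sitemaps/ping?sitemap=http%3A%2F%2Fwww.example.com%2Fsitemap.xml
```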

You will notice that an array of strings is passed when calling the generator. These are the names of classes which implement the get_paths method. An example get_paths class method is as follows:

  # for the google sitemap
  def self.get_paths
    Property.live_properties.map do |property|
      {:url => "/property/#{property.to_param}", :last_mod => property.updated_at.strftime('%Y-%m-%d')}
    end
  end

Basically, each source needs to return an array of hashes, each containing a :url (a path relative to the base URL) and a :last_mod (a date string in YYYY-MM-DD format).
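
The contract can be tried outside Rails with stand-in classes. In this sketch FakePage and FakeEvent are hypothetical sources invented for illustration (a real source would query the database), and collect_paths is a made-up helper that resolves each class name with Object.const_get and concatenates whatever get_paths returns.

```ruby
# Hypothetical stand-ins for ActiveRecord models; each exposes get_paths.
class FakePage
  def self.get_paths
    [{ :url => '/about', :last_mod => '2008-02-20' }]
  end
end

class FakeEvent
  def self.get_paths
    [{ :url => '/event/1-launch', :last_mod => '2008-02-25' }]
  end
end

# Hypothetical helper: resolve each class name and concatenate the arrays
# of hashes that the classes return.
def collect_paths(sources)
  sources.inject([]) do |paths, source|
    paths + Object.const_get(source).get_paths
  end
end

paths = collect_paths(['FakePage', 'FakeEvent'])
paths.each { |path| puts "#{path[:url]} (#{path[:last_mod]})" }
```

Any class that answers get_paths with hashes of this shape can be added to the sources list without touching the generator.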

This little beastie is best called from a cron job on the production server. An example rake task to do this is as follows:

  namespace :google_sitemap do
    desc "Generate a google sitemap from the site."
    task(:generate => :environment) do
      sources = ['Page', 'Event', 'Property']
      sitemap = GoogleSitemapGenerator.new('http://www.your_url.com', sources)
      sitemap.generate
    end
  end

Remember to pass RAILS_ENV when you call it from cron. This generator relies on Rails, but you could convert it to rely only on Ruby by modifying the rake task and changing the RAILS_ROOT reference in the save_file method. It can probably be made to work with Merb, but I am unsure how Merb and rake work together. I will hopefully get my hands dirty with Merb sometime soon.

cd /var/www/apps/site/current && /usr/bin/rake RAILS_ENV=production google_sitemap:generate
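
As a rough illustration of the plain-Ruby route mentioned above, the sketch below builds the same urlset structure with only the standard library and writes it to a directory passed in by the caller instead of RAILS_ROOT. The names build_sitemap_xml and write_sitemap are hypothetical, the XML is assembled by hand rather than with Builder, and the example.com URL is a placeholder.

```ruby
require 'cgi'
require 'tmpdir'

# Hypothetical helper: renders the array of {:url, :last_mod} hashes as a
# sitemap document, escaping text content by hand instead of using Builder.
def build_sitemap_xml(base_url, paths)
  entries = paths.map do |path|
    "  <url>\n" \
    "    <loc>#{CGI.escapeHTML(base_url + path[:url])}</loc>\n" \
    "    <lastmod>#{path[:last_mod]}</lastmod>\n" \
    "    <changefreq>weekly</changefreq>\n" \
    "  </url>\n"
  end
  %(<?xml version="1.0" encoding="UTF-8"?>\n) +
    %(<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">\n) +
    entries.join +
    "</urlset>\n"
end

# Hypothetical helper: takes the output directory as an argument, so nothing
# here depends on RAILS_ROOT.
def write_sitemap(dir, xml)
  file = File.join(dir, 'sitemap.xml')
  File.open(file, 'w') { |f| f.write(xml) }
  file
end

paths = [{ :url => '/property/1-flat?a=1&b=2', :last_mod => '2008-02-26' }]
xml = build_sitemap_xml('http://www.example.com', paths)
# Write to a throwaway directory and print the file back for inspection.
Dir.mktmpdir { |dir| puts File.read(write_sitemap(dir, xml)) }
```

Passing the directory in keeps the file-writing step independent of any framework, so the same two helpers could be called from a rake task or a standalone script.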