In the past, in order to gather server-side usage stats, we have had a procession of different methods to track and record this data outside the main app. The reason for this separation is to ensure that search results on TouchLocal were as quick as possible while still recording the details. The first attempt used Stomp, which was interesting, but we had problems with stability. After that, and for a long time, we had a backend Merb application that was sent the tracking information from the webserver and returned immediately. While this was quick, it had the downside of an HTTP request from the main site which would time out and grind the site to a halt if the backend processes were not responding for whatever reason.
As a result, we embarked on a third rewrite. This time we went back to basics and decided to use log4r to output a special log format that would be parsed and inserted into the tracking tables asynchronously. This implies a delay, but only in the vicinity of a few minutes. The main design problems were ensuring the stability of the processing platform and the ability to retry files that might have been missed or errored. Repeatability was achieved by using a combination of a hash of the tracking data + the timestamp + some random information, and a (large) table that tracks these timestamp-and-hash combinations (given that hashes can sometimes collide, it makes sense to add the timestamp as a factor).
The backend process to parse and record the information from this log file format was written in Ruby, of course. To achieve stability I decided to use the Ruby Daemons gem. This handles PID file management and lots of other neat things for writing a daemon, so the basics of long-running processes were not my ‘problem’ as such. So that these Ruby processes could scale up, I ensured as I was writing it that it would be aware of the potential error conditions of multiple processes, such as one process moving a file while another was looking for it (race conditions, etc.). While the Daemons gem uses fork() on *nix for the standard headless run mode, it also supports a non-background run mode which works on Windows. I also chose to use ActiveRecord, reusing the AR models from the Merb application.
One of my personal goals was to ensure the daemon was as self-contained as possible. For me, this meant that I wanted the Sysadmin to be able to check it out and run the start command and have it work on a standard base Ruby installation.
Here’s the initialisation code for the Daemon:
```ruby
################
### REQUIRES ###
################

# This loads gems in vendor/gems (abstracted out so it can be used in
# rake tasks that haven't loaded ENV yet)
require "lib/local_gem_loader"

require 'rubygems'
require 'daemons'
require 'activerecord'
require 'json/pure'
require 'erb'
require 'cgi'

# +require+ all the models
model_path = File.expand_path(File.join(File.dirname(__FILE__), 'models'))
$LOAD_PATH.unshift model_path
Dir.glob(File.join(model_path, '*.rb')) do |file|
  require file
end

# Log4r does not work because the Daemons gem closes all open file descriptors.
#require 'log4r'
#require 'config/logger'
#logger = ::DEFAULT_LOGGER

#################
### CONSTANTS ###
#################

SLEEP_TIME_SECONDS = 5
FILE_NAME = 'tracking_daemon.rb' # name to report as the process

#####################
### CONFIGURATION ###
#####################

database_configuration_file = File.expand_path(File.join(File.dirname(__FILE__), 'config', 'database.yml'))
database_configuration = YAML::load(ERB.new(IO.read(database_configuration_file)).result)

######################
### INITIALISATION ###
######################

# Parse out the RAILS_ENV=production setting
ARGV.each do |arg|
  if arg.include?('=')
    key, val = arg.split('=', 2)
    ENV[key] ||= val
  elsif database_configuration.keys.include?(arg)
    ENV['RAILS_ENV'] ||= arg
  end
end

RAILS_ENV = (ENV['RAILS_ENV'] || "development").dup

puts "Starting #{FILE_NAME} daemon in #{RAILS_ENV} mode"

ActiveRecord::Base.configurations = database_configuration
ActiveRecord::Base.establish_connection RAILS_ENV

root_files_path = File.expand_path(File.join(File.dirname(__FILE__), 'files'))
incoming_files_path = File.expand_path(File.join(root_files_path, 'incoming_files'))

options = {
  :multiple   => true,
  :ontop      => false,
  :backtrace  => true,
  :log_output => true,
  :monitor    => true
}

##############
### DAEMON ###
##############

Daemons.run_proc(FILE_NAME, options) do
  loop do
    # 1. Get the next file to process
    Dir.glob(File.join(incoming_files_path, '*')) do |incoming_file|
    end
    #...
    puts "Sleep #{SLEEP_TIME_SECONDS} sec" if RAILS_ENV == "development"
    sleep(SLEEP_TIME_SECONDS)
  end
end
```
The local_gem_loader is something of my own invention, based on the vendor/gems loader introduced in Rails 2. I wrote it initially so that our (then) Rails 1.2.6 app could have vendored gems. It was very useful here, letting this thing be checked out from SVN and started straight away. Here it is – it’s pretty simple really:
```ruby
# Load the gems in /vendor/gems
standard_dirs = ['rails', 'plugins']
gems = Dir[File.join(File.dirname(__FILE__), "vendor/*/**")]
if gems.any?
  gems.each do |dir|
    next if standard_dirs.include?(File.basename(dir))
    lib = File.join(dir, 'lib')
    $LOAD_PATH.unshift(lib) if File.directory?(lib)
    src = File.join(dir, 'src')
    $LOAD_PATH.unshift(src) if File.directory?(src)
  end
end
```
After requiring that loader, I was able to vendor all the gems I needed (activerecord, json-pure, and even daemons) in the vendor/gems directory I created. After that, the ./models directory is loaded with these lines:
```ruby
# +require+ all the models
model_path = File.expand_path(File.join(File.dirname(__FILE__), 'models'))
$LOAD_PATH.unshift model_path
Dir.glob(File.join(model_path, '*.rb')) do |file|
  require file
end
```
Also note that, by design, Daemons closes all open file descriptors as it starts. While I had read this in the documentation, I still tried to integrate Log4r, and spent a very confused hour wondering why all my log files were erroring on write… anyhoo…
After ensuring the models are in the load path and ready to go, ActiveRecord needs to be initialised. I added the ability to choose, at runtime, which database environment to write to, just as Rails does. This is achieved here:
```ruby
# First load the Rails config/database.yml
database_configuration_file = File.expand_path(File.join(File.dirname(__FILE__), 'config', 'database.yml'))
database_configuration = YAML::load(ERB.new(IO.read(database_configuration_file)).result)

# Parse out the RAILS_ENV=production setting, which can be either in the form
#   ruby tracking_daemon.rb start RAILS_ENV=production
# or
#   ruby tracking_daemon.rb start production
# Note that the environments allowed are validated against the ones available
# in the database.yml
ARGV.each do |arg|
  if arg.include?('=')
    key, val = arg.split('=', 2)
    ENV[key] ||= val
  elsif database_configuration.keys.include?(arg)
    ENV['RAILS_ENV'] ||= arg
  end
end

RAILS_ENV = (ENV['RAILS_ENV'] || "development").dup

puts "Starting #{FILE_NAME} daemon in #{RAILS_ENV} mode"

ActiveRecord::Base.configurations = database_configuration
ActiveRecord::Base.establish_connection RAILS_ENV
```
At the end of this we have loaded the database config and connected to the database. Happy days. The only thing left is to start the daemon, which I chose to do in an inline fashion. Note that Daemons allows you to have the process report whatever name you like in the process lists, but I went with the name of the file itself for clarity:
```ruby
root_files_path = File.expand_path(File.join(File.dirname(__FILE__), 'files'))
incoming_files_path = File.expand_path(File.join(root_files_path, 'incoming_files'))

options = {
  :multiple   => true,  # allow multiple concurrent instances of the same daemon
  :ontop      => false, # daemonise
  :backtrace  => true,  # show full failure info
  :log_output => true,
  :monitor    => true   # instantiate a monitor to restart as required
}

Daemons.run_proc(FILE_NAME, options) do
  loop do
    # 1. Get the next file to process
    Dir.glob(File.join(incoming_files_path, '*')) do |incoming_file|
    end
    #...
    puts "Sleep #{SLEEP_TIME_SECONDS} sec" if RAILS_ENV == "development"
    sleep(SLEEP_TIME_SECONDS)
  end
end
```
All in all it’s been a great success – what previously required a few backend servers running full whack to process all the incoming information now runs as 2 daemons on a single server in each datacentre. They hardly even show up in the top list. Excellent stuff.