Performance Benchmarking Ruby with MiniTest

A handy feature of MiniTest is its performance benchmarking assertions. Here's an example testing a couple of methods, one constant and one linear in time as a function of their input:

require 'rubygems'
require 'minitest/benchmark'
require 'minitest/autorun'

class Thing
  def constant_time_method(n)
    true # O(1)
  end
  
  def linear_time_method(n)
    n.times { |i| constant_time_method(i) } # O(n)
  end
end

class AwesomeTest < MiniTest::Unit::TestCase
  def setup
    @thing = Thing.new
  end

  # benchmark methods need a bench_ prefix for minitest to pick them up
  def bench_constant_time_method
    assert_performance_constant 0.99999 do |n|
      @thing.constant_time_method(n)
    end
  end

  def bench_linear_time_method
    assert_performance_linear 0.9999 do |n|
      @thing.linear_time_method(n)
    end
  end
end
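
By default the benchmark block is run over an exponential range of input sizes for n (1 up to 10,000, in powers of 10). If you want different sizes you can override bench_range per test case; here's a sketch that just restates the default:

class AwesomeTest < MiniTest::Unit::TestCase
  # override the input sizes used for n; this restates the default range,
  # i.e. [1, 10, 100, 1_000, 10_000]
  def self.bench_range
    bench_exp 1, 10_000
  end
end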

Whilst I wouldn't go nuts with this, it's a nice solution for when you have optimized some code and want to add a check against regressions.

Scope the source and docs for more details.

Backport 1.9.3 load fixes to 1.9.2 with RVM

$ wget http://redmine.ruby-lang.org/attachments/download/1958/ruby-1.9.2-p290-load-path-backport.diff
$ rvm install 1.9.2-p290 --patch ruby-1.9.2-p290-load-path-backport.diff

This cut the Rails 3 environment load time for a complex project from 31 seconds down to 15, and it's been running fine with no issues for two weeks.

Made of Code Theme for Xcode 4

A port of my TextMate theme, "Made of Code". Being a comic genius, I call it "Made of Xcode". It only works with Xcode 4.

To install, download it and copy it to ~/Library/Developer/Xcode/UserData/FontAndColorThemes. Restart Xcode and you can select it under Preferences > Fonts & Colors.

It’s the ephemeral but very real sense when you first make contact with the product that someone really truly understands you.

Awesome Rubygem for Gmail

A small selection of what you can do:

require "gmail"

Gmail.connect(username, password) do |gmail|
  gmail.logged_in?

  gmail.inbox.count
  gmail.inbox.count(:unread)
  gmail.inbox.count(:read)

  # mark each unread email as read and save its first attachment
  gmail.inbox.find(:unread) do |email|
    email.read!
    email.attachments[0].save_to_file("/path/to/location")
  end
end
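
If memory serves, it can also send mail. It's built on the Mail gem, so the usual Mail DSL applies inside deliver; something like this should work:

Gmail.connect(username, password) do |gmail|
  gmail.deliver do
    to "someone@example.com"
    subject "Hello from the gmail gem"
    text_part do
      body "Plain text body."
    end
  end
end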

(thx @bbergher)

Designing Beautiful Ruby APIs

Slides from a talk given by Wen-Tien Chang (@ihower) at Ruby Conf China 2010.

They're an amazingly rich source of information, especially the second half on Ruby's object model and meta-programming. If you use Ruby, you must read them!

ActiveSupport core extensions: Module

ActiveSupport provides some very handy additions to the Module class, including attr_accessor_with_default, attr_internal_accessor, included_in_classes and synchronize, amongst others.
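
For example, here's a quick sketch of the first two in action (Person is just a made-up class):

require 'active_support'

class Person
  attr_accessor_with_default :age, 25 # reader returns 25 until age is assigned
  attr_internal_accessor :name        # accessor backed by @_name rather than @name
end

person = Person.new
person.age                            # => 25
person.name = "Bob"
person.instance_variable_get(:@_name) # => "Bob"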

TextMate bundles: Rails, HAML, SASS + Shoulda

An awesome set of TextMate bundles for Ruby, Rails, HAML, SASS and Shoulda from phuibonhoa. The script backs up any existing bundles you've installed with the same names first, then installs all of the ones above.

VSS – a vector space search engine in Ruby

VSS is a vector space search engine with tf*idf weighting. Check out the source on GitHub, or gem install vss to get started. If you want to know how it works, read on:

1. Tokenize the query

First the query string is tokenized: punctuation is stripped, then the tokens are downcased and Porter stemmed. We also remove common stop words and make sure the tokens are unique:

require 'stemmer'
require 'active_support'

STOP_WORDS = %w[
  a b c d e f g h i j k l m n o p q r s t u v w x y z
  an and are as at be by for from has he in is it its
  of on that the to was were will with upon without among
]

def tokenize(string)
  stripped = string.to_s.gsub(/[^a-z0-9\-\s\']/i, "") # strip punctuation (keeps hyphens and apostrophes)
  tokens = stripped.split(/\s+/).reject(&:blank?).map(&:downcase).map(&:stem)
  tokens.reject { |t| STOP_WORDS.include?(t) }.uniq
end
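
For example:

tokenize("The Wire is the best!") # => ["wire", "best"]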

2. Find corpus vocabulary

Given a document collection (or corpus):

doc1 = "I'm not even going to mention any TV series."
doc2 = "The Wire is the best thing ever. Fact."
doc3 = "Some would argue that Lost got a bit too wierd after season 2."
doc4 = "Lost is surely not in the same league as The Wire."
@docs = [doc1, doc2, doc3, doc4]

We first tokenize everything, to find our @vocab:

@vocab = tokenize(@docs.join(" "))

3. Generate vector indexes for each token in the vocabulary

@vector_keyword_index = {}

@vocab.each_with_index do |keyword, offset|
  @vector_keyword_index[keyword] = offset
end
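
So for a vocab of ["wire", "best"], you'd end up with:

@vector_keyword_index # => { "wire" => 0, "best" => 1 }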

4. Perform the search, and return the ranked results

We can generate a vector for any document (or query):

require 'matrix'
  
def vector(doc)
  arr = Array.new(@vector_keyword_index.size, 0)

  tokens = tokenize(doc)
  tokens &= @vocab # ensure all tokens are in the vocab

  tokens.each do |token|
    tf = tokens.count(token)
    num_docs_with_token = @docs.count { |d| tokenize(d).include?(token) }
    idf = @docs.size / num_docs_with_token

    index = @vector_keyword_index[token]
    arr[index] = tf * idf
  end

  Vector.elements(arr) # build a Vector from the array
end

And compare two vectors via the cosine of the angle between them:

def cosine(vector1, vector2)
  dot_product = vector1.inner_product(vector2)
  dot_product / (vector1.r * vector2.r) # Vector#r is the Euclidean norm
end

# scales the cosine from [-1, 1] to a rank from 0 to 100
def cosine_rank(vector1, vector2)
  (cosine(vector1, vector2) + 1) / 2 * 100
end
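
A couple of sanity checks: identical vectors rank 100, and orthogonal vectors (no terms in common) rank 50:

v1 = Vector[1, 0]
v2 = Vector[0, 1]

cosine_rank(v1, v1) # => 100.0
cosine_rank(v1, v2) # => 50.0

That's why a document sharing no terms with the query scores exactly 50.0 below.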

So for our query, we just compute the cosine between each document vector and the query vector to get our rank. Then we annotate each document in our @docs collection with its rank and return the documents ordered by that value:

@query = "How can you compare The Wire with Lost?"
query_vector = vector(@query)

@docs.each do |doc|
  doc_vector = vector(doc)
  rank = cosine_rank(query_vector, doc_vector)
  doc.instance_eval %{def rank; #{rank}; end} # bit mental
end

@results = @docs.sort { |a,b| b.rank <=> a.rank } # highest to lowest

And here's what @results looks like:

>> @results.each { |doc| puts doc + " (#{doc.rank})" }
Lost is surely not in the same league as The Wire. (68.2574185835055)
The Wire is the best thing ever. Fact. (58.5749292571254)
Some would argue that Lost got a bit too weird after season 2. (55.5215763037423)
I'm not even going to mention any TV series. (50.0)

It works! I'm sure there are lots of optimizations you could make (e.g. by caching some of the tokenization steps), but for small document collections it's plenty fast enough.
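
For instance, a hypothetical tweak: memoize tokenize so each document is only tokenized once, rather than once per idf lookup:

def tokenize(string)
  @tokenize_cache ||= {}
  @tokenize_cache[string] ||= begin
    stripped = string.to_s.gsub(/[^a-z0-9\-\s\']/i, "") # strip punctuation
    tokens = stripped.split(/\s+/).reject(&:blank?).map(&:downcase).map(&:stem)
    tokens.reject { |t| STOP_WORDS.include?(t) }.uniq
  end
end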

Credits

Thanks to Joseph Wilk's excellent article on building a vector space search engine in Python.
