Performance Benchmarking Ruby with MiniTest
A handy feature of MiniTest is its performance benchmarking assertions. Here's an example that tests a couple of methods whose running time is constant and linear in the size of their input:
require 'rubygems'
require 'minitest/benchmark'
require 'minitest/autorun'
class Thing
  def constant_time_method(n)
    true # O(1)
  end
  def linear_time_method(n)
    n.times { |i| constant_time_method(i) } # O(n)
  end
end
class AwesomeTest < MiniTest::Unit::TestCase
  def setup
    @thing = Thing.new
  end
  def test_constant_time_method_performance
    assert_performance_constant 0.99999 do |n|
      @thing.constant_time_method(n)
    end
  end
  def test_linear_time_method_performance
    assert_performance_linear 0.9999 do |n|
      @thing.linear_time_method(n)
    end
  end
end
Whilst I wouldn't go nuts with this, it's a nice solution for when you have optimized some code and want to add a check against regressions.
Scope out the source and docs for more details.
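To give a feel for what the threshold means: as far as I can tell from the source, assert_performance_linear times the block over a range of n values, fits a straight line to the timings, and fails unless the fit's R² is at least the threshold you pass in. Below is a plain-Ruby sketch of that kind of fit. The helper name linear_fit_r2 and the timings are made up for illustration, and this is not MiniTest's own code.

```ruby
# Least-squares fit of ys = a + b * xs, returning [a, b, r_squared].
# Hypothetical helper for illustration; not part of MiniTest.
def linear_fit_r2(xs, ys)
  n      = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n

  # Slope and intercept from the usual closed-form solution.
  sxy = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  sxx = xs.sum { |x| (x - mean_x)**2 }
  b   = sxy / sxx
  a   = mean_y - b * mean_x

  # R^2 = 1 - (residual sum of squares / total sum of squares).
  ss_res = xs.zip(ys).sum { |x, y| (y - (a + b * x))**2 }
  ss_tot = ys.sum { |y| (y - mean_y)**2 }
  [a, b, 1 - ss_res / ss_tot]
end

# Made-up timings in seconds for n = 1, 10, 100, 1_000, 10_000
# (powers of ten, which I believe matches MiniTest's default range).
NS                = [1, 10, 100, 1_000, 10_000]
LINEAR_TIMINGS    = NS.map { |n| 0.002 * n + 0.001 }  # grows as n
QUADRATIC_TIMINGS = NS.map { |n| 0.000_001 * n * n }  # grows as n^2

R2_LINEAR    = linear_fit_r2(NS, LINEAR_TIMINGS).last
R2_QUADRATIC = linear_fit_r2(NS, QUADRATIC_TIMINGS).last
```

With these numbers the quadratic timings still score an R² of roughly 0.99, because the largest n dominates the fit, which suggests why the thresholds in the test above sit so close to 1.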
Backport 1.9.3 load fixes to 1.9.2 with RVM
$ wget http://redmine.ruby-lang.org/attachments/download/1958/ruby-1.9.2-p290-load-path-backport.diff
$ rvm install 1.9.2-p290 --patch ruby-1.9.2-p290-load-path-backport.diff
This cut the time to load the Rails 3 environment of a complex project from 31 seconds to 15, and it has been running with no issues for 2 weeks.
Made of Code Theme for Xcode 4
A port of my TextMate theme, "Made of Code". Being a comic genius, I call it "Made of Xcode". It only works with Xcode 4.
To install, download it and copy it to ~/Library/Developer/Xcode/UserData/FontAndColorThemes. Restart Xcode and you can select it (Preferences > Fonts & Colors).
It’s the ephemeral but very real sense when you first make contact with the product that someone really truly understands you.
Awesome Rubygem for Gmail
A small selection of what you can do:
require "gmail"
Gmail.connect(username, password) do |gmail|
  gmail.logged_in?
  gmail.inbox.count
  gmail.inbox.count(:unread)
  gmail.inbox.count(:read)
  gmail.inbox.find(:unread) do |email|
    email.read!
    email.attachments[0].save_to_file("/path/to/location")
  end
end
(thx @bbergher)
Designing Beautiful Ruby APIs
Slides from a talk given by Wen-Tien Chang (@ihower) at Ruby Conf China 2010.
They're an amazingly rich source of information, especially the second half on Ruby's object model and meta-programming. If you use Ruby, you must read them!
ActiveSupport core extensions: Module
ActiveSupport provides some very handy additions to the Module class, including attr_accessor_with_default, attr_internal_accessor, included_in_classes and synchronize, amongst others.
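To illustrate the idea behind an accessor with a default, here is a plain-Ruby sketch: a reader that falls back to a default value (or a block evaluated on the instance) until the writer has been called. The module DefaultedAccessor, the method attr_with_default and the Person class are hypothetical names for this example. It is not ActiveSupport's implementation, and the real method's behaviour may differ in details.

```ruby
# Hypothetical stand-in for the concept; not ActiveSupport code.
module DefaultedAccessor
  def attr_with_default(name, default = nil, &block)
    ivar = "@#{name}"

    # Reader: use the stored value if one was set, else the default.
    define_method(name) do
      if instance_variable_defined?(ivar)
        instance_variable_get(ivar)
      elsif block
        instance_exec(&block) # default computed in the instance's context
      else
        default
      end
    end

    # Writer: store the value, which overrides the default from then on.
    define_method("#{name}=") { |value| instance_variable_set(ivar, value) }
  end
end

class Person
  extend DefaultedAccessor
  attr_accessor :name
  attr_with_default :age, 25
  attr_with_default(:greeting) { "Hi, #{name}" }
end
```

A new Person reports an age of 25 until age= is called, and greeting is built from whatever name is at the time it is read.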
TextMate bundles: Rails, HAML, SASS + Shoulda
An awesome set of TextMate bundles for Ruby, Rails, HAML, SASS and Shoulda from phuibonhoa. This script backs up any existing bundles you’ve installed with the same name first, then installs all of the ones above.
VSS – a vector space search engine in Ruby
VSS is a vector space search engine with tf*idf weighting. Check out the source on GitHub, or gem install vss to get started. If you want to know how it works, read on:
1. Tokenize the query
First, our query string is stripped of punctuation and tokenized into a downcased, Porter-stemmed array. We also remove common stop words and make sure the tokens are unique:
require 'stemmer' # adds String#stem
require 'active_support'
require 'active_support/core_ext/object/blank' # blank? (ActiveSupport 3+ does not load core extensions by default)
STOP_WORDS = %w[
  a b c d e f g h i j k l m n o p q r s t u v w x y z
  an and are as at be by for from has he in is it its
  of on that the to was were will with upon without among
]
def tokenize(string)
  stripped = string.to_s.gsub(/[^a-z0-9\-\s\']/i, "") # remove punctuation
  tokens = stripped.split(/\s+/).reject(&:blank?).map(&:downcase).map(&:stem)
  tokens.reject { |t| STOP_WORDS.include?(t) }.uniq
end
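To see what the tokenizer produces without installing any gems, here is a dependency-free sketch of the same stages. It deliberately skips stemming (so "compare" is not reduced to its stem) and uses a tiny stop list. The names DEMO_STOP_WORDS and tokenize_without_stemming are made up for this example.

```ruby
# Small subset of the stop list, just for the demo.
DEMO_STOP_WORDS = %w[the with is in as to a that]

# Same stages as tokenize, minus the Porter stemming step.
def tokenize_without_stemming(string)
  stripped = string.to_s.gsub(/[^a-z0-9\-\s']/i, "") # drop punctuation
  tokens   = stripped.split(/\s+/).reject(&:empty?).map(&:downcase)
  tokens.reject { |t| DEMO_STOP_WORDS.include?(t) }.uniq
end

DEMO_TOKENS = tokenize_without_stemming("How can you compare The Wire with Lost?")
# DEMO_TOKENS is ["how", "can", "you", "compare", "wire", "lost"]
```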
2. Find corpus vocabulary
Given a document collection (or corpus):
doc1 = "I'm not even going to mention any TV series."
doc2 = "The Wire is the best thing ever. Fact."
doc3 = "Some would argue that Lost got a bit too wierd after season 2."
doc4 = "Lost is surely not in the same league as The Wire."
@docs = [doc1, doc2, doc3, doc4]
We first tokenize everything, to find our @vocab:
@vocab = tokenize(@docs.join(" "))
3. Generate vector indexes for each token in the vocabulary
@vector_keyword_index = begin
  index, offset = {}, 0
  @vocab.each do |keyword|
    index[keyword] = offset
    offset += 1
  end
  index
end
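The index is just a hash from each token to its position in the vocabulary. As a sketch, the same mapping can be built more compactly from [token, position] pairs. EXAMPLE_VOCAB is a made-up vocabulary for illustration.

```ruby
# Example vocabulary; any array of unique tokens would do.
EXAMPLE_VOCAB = %w[wire lost best thing]

# Long-hand version: walk the vocabulary and hand out offsets.
LONGHAND_INDEX = begin
  index, offset = {}, 0
  EXAMPLE_VOCAB.each do |keyword|
    index[keyword] = offset
    offset += 1
  end
  index
end

# Compact version: each_with_index yields [token, position] pairs,
# and to_h turns those pairs into a hash.
COMPACT_INDEX = EXAMPLE_VOCAB.each_with_index.to_h
# COMPACT_INDEX is {"wire"=>0, "lost"=>1, "best"=>2, "thing"=>3}
```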
4. Perform the search, and return the ranked results
We can generate a vector for any document (or query):
require 'matrix'
def vector(doc)
  arr = Array.new(@vector_keyword_index.size, 0)
  tokens = tokenize(doc)
  tokens &= @vocab # ensure all tokens are in vocab
  tokens.each do |token|
    tf = tokens.count(token) # always 1 here, because tokenize returns unique tokens
    num_docs_with_token = @docs.count { |d| tokenize(d).include?(token) }
    idf = @docs.size.to_f / num_docs_with_token # to_f avoids integer division
    index = @vector_keyword_index[token]
    arr[index] = tf * idf
  end
  Vector.elements(arr) # create a vector from arr
end
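The idf term above is the raw ratio of corpus size to the number of documents containing the token. A common variant of tf*idf takes the logarithm of that ratio, so very rare tokens dominate less. Here is a small sketch comparing the two for a 4-document corpus. The helpers raw_idf and log_idf are hypothetical names, and the log form is a swapped-in alternative, not what the code above does.

```ruby
# Raw ratio: corpus size divided by the number of docs containing the token.
def raw_idf(total_docs, docs_with_token)
  total_docs.to_f / docs_with_token
end

# Log-damped variant, commonly used in tf*idf weighting.
def log_idf(total_docs, docs_with_token)
  Math.log(total_docs.to_f / docs_with_token)
end

# Both weights for tokens found in 1, 2 and all 4 of the documents.
IDF_TABLE = [1, 2, 4].map do |df|
  { df: df, raw: raw_idf(4, df), log: log_idf(4, df).round(3) }
end
# A token present in every document (df = 4) gets raw idf 1.0 but log idf 0.0,
# so with the log form it would contribute nothing to the ranking.
```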
And compare 2 vectors:
def cosine(vector1, vector2)
  dot_product = vector1.inner_product(vector2)
  dot_product / (vector1.r * vector2.r) # r is the vector's magnitude
end
# maps the cosine (-1 to 1) onto a rank from 0 to 100
def cosine_rank(vector1, vector2)
  (cosine(vector1, vector2) + 1) / 2 * 100
end
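A quick worked example may help here. The sketch below repeats the two methods so it runs on its own, then ranks three hand-picked pairs of vectors. The constant names and the vectors are made up for illustration.

```ruby
require 'matrix'

def cosine(vector1, vector2)
  vector1.inner_product(vector2) / (vector1.r * vector2.r)
end

def cosine_rank(vector1, vector2)
  (cosine(vector1, vector2) + 1) / 2 * 100
end

SAME_DIRECTION = cosine_rank(Vector[2, 2], Vector[4, 4]) # parallel: close to 100
ORTHOGONAL     = cosine_rank(Vector[2, 0], Vector[0, 3]) # no shared terms: 50.0
PARTIAL        = cosine_rank(Vector[2, 2], Vector[2, 0]) # one shared term: about 85.4
```

Because the term weights are never negative, the cosine between two document vectors cannot drop below 0, so in practice the ranks fall between 50 and 100.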
So for our query, we just compare the cosine between each document vector and the query vector to get our rank. Then we annotate each document in our @docs collection with the rank and return the documents ordered by that value:
@query = "How can you compare The Wire with Lost?"
query_vector = vector(@query)
@docs.each do |doc|
  doc_vector = vector(doc)
  rank = cosine_rank(query_vector, doc_vector)
  doc.instance_eval %{def rank; #{rank}; end} # bit mental
end
@results = @docs.sort { |a,b| b.rank <=> a.rank } # highest to lowest
And here's what @results looks like:
>> @results.each { |doc| puts doc + " (#{doc.rank})" }
Lost is surely not in the same league as The Wire. (68.2574185835055)
The Wire is the best thing ever. Fact. (58.5749292571254)
Some would argue that Lost got a bit too wierd after season 2. (55.5215763037423)
I'm not even going to mention any TV series. (50.0)
It works! I'm sure there are lots of optimizations you could make (e.g. by caching some of the tokenization steps), but for small document collections it's plenty fast enough.
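Defining a rank method on each string does the job, but one alternative that avoids instance_eval is to keep each rank next to its document and sort the pairs. This is a sketch only: RANKED_DOCS is a made-up name, and the ranks are copied from the output above rather than computed here.

```ruby
# Ranks taken from the output above; in the real flow each value would
# come from cosine_rank(query_vector, vector(doc)).
RANKED_DOCS = {
  "I'm not even going to mention any TV series." => 50.0,
  "The Wire is the best thing ever. Fact." => 58.5749292571254,
  "Some would argue that Lost got a bit too wierd after season 2." => 55.5215763037423,
  "Lost is surely not in the same league as The Wire." => 68.2574185835055
}

# Sort [doc, rank] pairs from highest rank to lowest.
RESULTS = RANKED_DOCS.sort_by { |_doc, rank| -rank }

RESULTS.each { |doc, rank| puts "#{doc} (#{rank})" }
```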
Credits
Thanks to Joseph Wilk's excellent article on building a vector space search engine in Python.
