VSS – a vector space search engine in Ruby

VSS is a vector space search engine with tf*idf weighting. Checkout the source on Github, or gem install vss to get started. If you want to know how it works, read on:

1. Tokenize the query

First our query string is tokenized into an alpha-only, downcased, porter stemmed array. We also remove common stop words and make sure the tokens are unique:

require 'stemmer'
require 'active_support'

STOP_WORDS = %w[
  a b c d e f g h i j k l m n o p q r s t u v w x y z
  an and are as at be by for from has he in is it its
  of on that the to was were will with upon without among
]

def tokenize(string)
  stripped = string.to_s.gsub(/[^a-z0-9\-\s\']/i, "") # remove punctuation
  tokens = stripped.split(/\s+/).reject(&:blank?).map(&:downcase).map(&:stem)
  tokens.reject { |t| STOP_WORDS.include?(t) }.uniq
end

2. Find corpus vocabulary

Given a document collection (or corpus):

doc1 = "I'm not even going to mention any TV series."
doc2 = "The Wire is the best thing ever. Fact."
doc3 = "Some would argue that Lost got a bit too wierd after season 2."
doc4 = "Lost is surely not in the same league as The Wire."
@docs = [doc1, doc2, doc3, doc4]

We first tokenize everything, to find our @vocab:

@vocab = tokenize(@docs.join(" "))

3. Generate vector indexes for each token in the vocabulary

@vector_keyword_index = begin
  index, offset = {}, 0      

  @vocab.each do |keyword|
    index[keyword] = offset
    offset += 1
  end

  index
end

4. Perform the search, and return the ranked results

We can generate a vector for any document (or query):

require 'matrix'
  
def vector(doc)
  arr = Array.new(@vector_keyword_index.size, 0)

  tokens = tokenize(doc)
  tokens &= @vocab # ensure all tokens are in vocab

  tokens.each do |token|
    tf = tokens.count(token)
    num_docs_with_token = @docs.count { |d| tokenize(d).include?(token) }
    idf = @docs.size / num_docs_with_token

    index = @vector_keyword_index[token]
    arr[index] = tf * idf
  end

  return Vector.elements(arr) # create a vector from arr   
end

And compare 2 vectors:

def cosine(vector1, vector2)
  dot_product = vector1.inner_product(vector2)
  dot_product / (vector1.r * vector2.r)
end

# ranks from 0 to 100
def cosine_rank(vector1, vector2)
  (cosine(vector1, vector2) + 1) / 2 * 100
end

So for our query, we just compare the cosine between each document vector and the query vector to get our rank. Then we annotate each document in our @docs collection with the rank and return the documents ordered by that value:

@query = "How can you compare The Wire with Lost?"
query_vector = vector(@query)

@docs.each do |doc|
  doc_vector = vector(doc)
  rank = cosine_rank(query_vector, doc_vector)
  doc.instance_eval %{def rank; #{rank}; end} # bit mental
end

@results = @docs.sort { |a,b| b.rank <=> a.rank } # highest to lowest

And here's what @results looks like:

>> @results.each { |doc| puts doc + " (#{doc.rank})" }
Lost is surely not in the same league as The Wire. (68.2574185835055)
The Wire is the best thing ever. Fact. (58.5749292571254)
Some would argue that Lost got a bit too wierd after season 2. (55.5215763037423)
I'm not even going to mention any TV series. (50.0)

It works! I'm sure there are lots of optimizations you could make (e.g. by caching some of the tokenization steps), but for small document collections it's plenty fast enough.

Credits

Thanks to Joseph Wilk's excellent article on building a vector space search engine in Python.

A mix of design and code, by Mark Dodwell, web designer and Ruby on Rails developer.