VSS – a vector space search engine in Ruby
VSS is a vector space search engine with tf*idf weighting. Checkout the source on Github, or gem install vss to get started. If you want to know how it works, read on:
1. Tokenize the query
First our query string is tokenized into an alpha-only, downcased, porter stemmed array. We also remove common stop words and make sure the tokens are unique:
require 'stemmer'
require 'active_support'
STOP_WORDS = %w[
a b c d e f g h i j k l m n o p q r s t u v w x y z
an and are as at be by for from has he in is it its
of on that the to was were will with upon without among
]
def tokenize(string)
stripped = string.to_s.gsub(/[^a-z0-9\-\s\']/i, "") # remove punctuation
tokens = stripped.split(/\s+/).reject(&:blank?).map(&:downcase).map(&:stem)
tokens.reject { |t| STOP_WORDS.include?(t) }.uniq
end
2. Find corpus vocabulary
Given a document collection (or corpus):
doc1 = "I'm not even going to mention any TV series."
doc2 = "The Wire is the best thing ever. Fact."
doc3 = "Some would argue that Lost got a bit too wierd after season 2."
doc4 = "Lost is surely not in the same league as The Wire."
@docs = [doc1, doc2, doc3, doc4]
We first tokenize everything, to find our @vocab:
@vocab = tokenize(@docs.join(" "))
3. Generate vector indexes for each token in the vocabulary
@vector_keyword_index = begin
index, offset = {}, 0
@vocab.each do |keyword|
index[keyword] = offset
offset += 1
end
index
end
4. Perform the search, and return the ranked results
We can generate a vector for any document (or query):
require 'matrix'
def vector(doc)
arr = Array.new(@vector_keyword_index.size, 0)
tokens = tokenize(doc)
tokens &= @vocab # ensure all tokens are in vocab
tokens.each do |token|
tf = tokens.count(token)
num_docs_with_token = @docs.count { |d| tokenize(d).include?(token) }
idf = @docs.size / num_docs_with_token
index = @vector_keyword_index[token]
arr[index] = tf * idf
end
return Vector.elements(arr) # create a vector from arr
end
And compare 2 vectors:
def cosine(vector1, vector2)
dot_product = vector1.inner_product(vector2)
dot_product / (vector1.r * vector2.r)
end
# ranks from 0 to 100
def cosine_rank(vector1, vector2)
(cosine(vector1, vector2) + 1) / 2 * 100
end
So for our query, we just compare the cosine between each document vector and the query vector to get our rank. Then we annotate each document in our @docs collection with the rank and return the documents ordered by that value:
@query = "How can you compare The Wire with Lost?"
query_vector = vector(@query)
@docs.each do |doc|
doc_vector = vector(doc)
rank = cosine_rank(query_vector, doc_vector)
doc.instance_eval %{def rank; #{rank}; end} # bit mental
end
@results = @docs.sort { |a,b| b.rank <=> a.rank } # highest to lowest
And here's what @results looks like:
>> @results.each { |doc| puts doc + " (#{doc.rank})" }
Lost is surely not in the same league as The Wire. (68.2574185835055)
The Wire is the best thing ever. Fact. (58.5749292571254)
Some would argue that Lost got a bit too wierd after season 2. (55.5215763037423)
I'm not even going to mention any TV series. (50.0)
It works! I'm sure there are lots of optimizations you could make (e.g. by caching some of the tokenization steps), but for small document collections it's plenty fast enough.
Credits
Thanks to Joseph Wilk's excellent article on building a vector space search engine in Python.