Announcements

Simhash in Ruby

Last week we ended up talking about simhashing (hash functions which generate similar hashes for similar inputs) and I found this nice article with a step by step explanation of the algorithm, so I wrote a quick Ruby version (needs 1.9):

class String
 
  def shingles
    self.split(//).each_cons(2).inject([]) { |a, c| a << c.join }.uniq
  end
 
  def simhash
    v = Array.new(64) { 0 }
    hashes = shingles.map(&:hash)
    hashes.each do |hash|
      hash.to_s(2).split(//).each_with_index do |bit, i|
        bit.to_i & 1 == 1 ? v[i] += 1 : v[i] -= 1
      end
    end
    v.map{ |i| i >= 0 ? 1 : 0 }.join
  end
 
  def hamming_distance(other)
    other_sh = other.simhash
    self.simhash.split(//).each_with_index.inject(0) do |total, (bit, i)|
      total += 1 if bit != other_sh[i]
      total
    end
  end
 
end

Getting the “features” of a string:

>> "This is a test string".shingles 
#=> ["Th", "hi", "is", "s ", " i", " a", "a ", " t", "te", "es", ...]

Simhashing a string:

>> "This is a test string".simhash 
#=> "0100110001100001001100010000000010001001001000110110000010101100"

Calculating the Hamming distance between the symhashes of two strings (higher numbers mean less similar, with 64 being the highest possible score):

"This is a test string".hamming_distance("This is another test string") 
#=> 8
Previous Post Next Post

You Might Also Like

No Comments

Leave a Reply