Last week we ended up talking about simhashing (hash functions which generate similar hashes for similar inputs) and I found this nice article with a step by step explanation of the algorithm, so I wrote a quick Ruby version (needs 1.9):
class String
def shingles
self.split(//).each_cons(2).inject([]) { |a, c| a << c.join }.uniq
end
def simhash
v = Array.new(64) { 0 }
hashes = shingles.map(&:hash)
hashes.each do |hash|
hash.to_s(2).split(//).each_with_index do |bit, i|
bit.to_i & 1 == 1 ? v[i] += 1 : v[i] -= 1
end
end
v.map{ |i| i >= 0 ? 1 : 0 }.join
end
def hamming_distance(other)
other_sh = other.simhash
self.simhash.split(//).each_with_index.inject(0) do |total, (bit, i)|
total += 1 if bit != other_sh[i]
total
end
end
end
Getting the “features” of a string:
>> "This is a test string".shingles
#=> ["Th", "hi", "is", "s ", " i", " a", "a ", " t", "te", "es", ...]
Simhashing a string:
>> "This is a test string".simhash
#=> "0100110001100001001100010000000010001001001000110110000010101100"
Calculating the Hamming distance between the symhashes of two strings (higher numbers mean less similar, with 64 being the highest possible score):
"This is a test string".hamming_distance("This is another test string")
#=> 8