
Announcements

Simhash in Ruby

Last week we ended up talking about simhashing (hash functions that generate similar hashes for similar inputs), and I found this nice article with a step-by-step explanation of the algorithm, so I wrote a quick Ruby version (needs Ruby 1.9):

class String

  # "Features" of the string: the unique overlapping two-character shingles
  def shingles
    chars.each_cons(2).map(&:join).uniq
  end

  # Each shingle's hash votes +1/-1 per bit position; the sign of
  # each tally becomes one bit of the 64-bit fingerprint.
  def simhash
    v = Array.new(64) { 0 }
    shingles.map(&:hash).each do |hash|
      # mask to 64 bits and zero-pad, so bit positions line up even
      # when String#hash returns a negative or short value
      bits = (hash & (2**64 - 1)).to_s(2).rjust(64, '0')
      bits.each_char.with_index do |bit, i|
        bit == '1' ? v[i] += 1 : v[i] -= 1
      end
    end
    v.map { |tally| tally >= 0 ? 1 : 0 }.join
  end

  # Number of bit positions at which the two simhashes differ
  def hamming_distance(other)
    other_sh = other.simhash
    simhash.each_char.with_index.inject(0) do |total, (bit, i)|
      bit == other_sh[i] ? total : total + 1
    end
  end

end

Getting the “features” of a string:

>> "This is a test string".shingles 
#=> ["Th", "hi", "is", "s ", " i", " a", "a ", " t", "te", "es", ...]

Simhashing a string:

>> "This is a test string".simhash 
#=> "0100110001100001001100010000000010001001001000110110000010101100"

Calculating the Hamming distance between the simhashes of two strings (higher numbers mean less similar, with 64 being the highest possible distance):

"This is a test string".hamming_distance("This is another test string") 
#=> 8
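The bit-voting at the heart of simhash can also be looked at in isolation: every hash votes +1 or -1 per bit position, and the sign of each tally becomes one bit of the fingerprint. A toy sketch with 4-bit hashes (hypothetical values, standing in for real shingle hashes):

```ruby
# Hypothetical 4-bit hashes, standing in for real shingle hashes
hashes = [0b1011, 0b1001, 0b0011]
bits = 4
v = Array.new(bits, 0)
hashes.each do |h|
  # each hash votes +1/-1 per bit position (LSB first here)
  bits.times { |i| (h >> i) & 1 == 1 ? v[i] += 1 : v[i] -= 1 }
end
simhash = v.reverse.map { |tally| tally >= 0 ? 1 : 0 }.join
puts simhash  #=> "1011" -- the per-bit majority of the three hashes
```

Each output bit is simply the majority vote at that position, which is why two inputs sharing most of their shingles end up with fingerprints that differ in only a few bits.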
Announcements

GitHub issues as CSV with Ruby and HTTParty

Recently we told you about our move to Kanban. To fill the backlog, our project manager Nina needed a list of our current GitHub issues, ideally in spreadsheet format. A couple of lines of Ruby produce CSV output consisting of issue title, creation date, reporter and labels, which can then be redirected to a file and opened with any spreadsheet program.

require 'httparty'
 
class GitHubIssues
  include HTTParty
  base_uri 'http://github.com/api/v2/yaml'
 
  def self.show
    # username must be of the form '<username>/token:<token>'
    opts = {:basic_auth => {:username => ''}}
    self.get('/issues/list/<user>/<projects>/open', opts)["issues"].each do |issue|
      puts "#{issue["title"]};#{issue["created_at"]};#{issue["user"]};#{issue["labels"].join(',')}"
    end
  end
end
 
GitHubIssues.show

It’s quick and dirty (e.g. will die if there are no issues), but does exactly what we needed in just a couple of lines thanks to the awesome HTTParty.
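One caveat with the plain string interpolation: a title containing a semicolon or quotes would break the columns. Ruby's standard CSV library takes care of quoting; a minimal sketch, with hypothetical issue hashes in the same shape the API returns:

```ruby
require 'csv'

# Hypothetical issues in the shape the API returns
issues = [
  { "title" => 'Crash; when saving "draft"', "created_at" => "2011/02/01",
    "user" => "nina", "labels" => ["bug", "urgent"] }
]

output = CSV.generate(:col_sep => ";") do |csv|
  issues.each do |issue|
    csv << [issue["title"], issue["created_at"], issue["user"],
            issue["labels"].join(",")]
  end
end
puts output
```

CSV wraps the tricky title in quotes and doubles the embedded quote characters, so a spreadsheet keeps it in a single column.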

Code, Development, Tools

Generating XML diffs with awk and bash

Did you ever wonder where Tupalo.com gets all these spots from? Apart from the ones our awesome users add, we get them from our various partners, often in the form of XML. This generally works fine, but XML’s structured nature also means that you can’t just treat it like any old text file.

That’s something we recently had to work around when we wanted to generate a daily XML diff that only contains elements which changed since the previous day’s feed. Of course there are several open source tools for exactly this purpose (e.g. diffxml or xmldiff), but since we couldn’t get them to do what we wanted in a reasonable amount of time, we decided to roll our own.

The final solution is a 71-line bash script, which downloads a zip, extracts it, generates MD5 sums for every element and then creates a diff between this new file and the previous day’s list of MD5 sums. Once we know which elements have changed, we merge them into a new feed which then gets handed to our importer. The awesome xmlstarlet was a great help in this, as was battle-tested old awk.

Let’s look at an interesting snippet from the script:

xmlstarlet sel -I -t -m "//item" -v "./guid" -o "|" -c "." -n - |
  sed -e '...' |
  awk \
    'BEGIN {
      FS="|"
      RS="\n"
    }
    {
      id=$1
      command="printf \"%s\" \"" $2 "\" | md5sum | cut -d\" \" -f1"
      command | getline md5
      close(command)
      print id":"md5
    }' >> $MD5_DIR/vendor-md5-$TODAY

Here we use xmlstarlet to iterate over all the items in the feed (the XPath “//item”), print the value of the “guid” element (-v “./guid”), output a pipe character (-o “|”) and then copy the current element followed by a newline (-c “.” -n). This then gets piped through sed for some cleanup (omitted here for brevity’s sake) before awk takes the part after each “|”, generates an MD5 sum for it and finally produces a file that looks like this:

rKKTZ:4012fced7c4cd77da607d294fbb8b5b6
hC7Jr:39245a0f9a976e6d47c0e2d76abf6238
...
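The same guid-to-checksum mapping is easy to express in Ruby with the standard Digest library, which is a handy way to sanity-check the awk output (the item XML below is made up for illustration):

```ruby
require 'digest/md5'

# guid => serialized <item> element (hypothetical samples)
items = {
  "rKKTZ" => "<item><guid>rKKTZ</guid><title>Cafe Example</title></item>",
  "hC7Jr" => "<item><guid>hC7Jr</guid><title>Example Bar</title></item>"
}

# print one guid:md5 line per item, like the awk pipeline does
items.each do |guid, xml|
  puts "#{guid}:#{Digest::MD5.hexdigest(xml)}"
end
```

As long as the serialized form of an element is byte-for-byte stable, its MD5 sum only changes when the element's content changes, which is exactly what the diff step relies on.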

Now that we are able to create a daily list of MD5 sums, it’s easy to generate the diff feed:

if [ -e $MD5_DIR/vendor-md5-last ] ; then
  changed=`diff $MD5_DIR/vendor-md5-last $MD5_DIR/vendor-md5-$TODAY |
           grep "^>" |
           cut -d":" -f 1 |
           cut -b 1-2 --complement`

  for record in $changed ; do
    f=`fgrep -l "<guid>$record</guid>" $FILE_PATTERN`
    xmlstarlet sel -I -t -c "/rss/channel/item[guid='$record']" $f >> vendor-import-$TODAY.xml
  done
fi

Here we build a list of the IDs of the changed elements, over which we then iterate. In the loop we once again use xmlstarlet to extract the item with the matching guid from whichever feed file contains it.
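For comparison, the diff/grep/cut pipeline boils down to a simple set operation, sketched here in Ruby with made-up checksums:

```ruby
# Yesterday's and today's guid => md5 maps (hypothetical data)
yesterday = { "rKKTZ" => "4012fc", "hC7Jr" => "39245a" }
today     = { "rKKTZ" => "4012fc", "hC7Jr" => "deadbe", "x9Qp2" => "cafeba" }

# keep guids whose checksum changed or is new -- the same records
# the grep "^>" picks out of the diff output
changed = today.reject { |guid, md5| yesterday[guid] == md5 }.keys
puts changed.inspect  #=> ["hC7Jr", "x9Qp2"]
```

Unchanged records drop out, while modified and newly added ones survive, which matches diff's "only in the new file" lines.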

This is a good example of how familiar old Unix tools can be combined to create a fairly concise solution for a non-trivial problem.