Code, Development, Tupalo

Just FYI: we updated our spot ranking algorithm

Spots at Tupalo.com face tough competition. To get a better feeling for which spots are trending right now and which aren’t, we updated our spot ranking algorithm. We’d like to explain how this new algorithm works.

The obvious sorting order, by average five-star rating, falls short for several reasons. For instance, a five-star rating from last week should count for more than one from two years ago. Maybe that pizza isn’t as good as it used to be. Plus, Tupalo.com offers a couple of other ways to show your love for a spot (marking it as a favorite, saving it as a “want to go” spot, or checking in to it), and these should count for something ranking-wise, too.

Taking these thoughts into account, the brainy brains at Tupalo.com spent some time on bread and water in the Lab-dungeon and came up with the following solution:
Every signal (review rating, favorite, want to go, check-in) now contributes to an overall spot ranking score, and each signal is weighted with a decreasing time factor, so that newer signals score higher than older ones. This is a balancing act. Think of a spot that everyone raved about for some time in the past but that has received no new reviews since. Such spots shouldn’t stay at the top forever.

On the other hand, think of the flash in the pan that is all the rage for a while but never leaves the same kind of mark as a tried and true spot. We want to feature these spots and give them a chance to prove themselves, but not keep them around if they fail to deliver. If we ranked purely by average rating, new spots would never make it to the top. If we only considered recent reviews, we would mostly see trending spots and forget about the classics.
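As an illustration only, a time-decayed score could be sketched like this. The signal weights, the 90-day half-life, the exponential decay curve and the `spot_score` name are all assumptions made up for the example, not the actual values or formula used at Tupalo.com.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a time-decayed spot score.
# Input on stdin, one signal per line: "<type> <age_in_days> [<stars>]"
# Weights and half-life are illustrative assumptions.
spot_score() {
  awk -v half_life=90 '
    BEGIN {
      weight["favorite"]   = 3
      weight["want_to_go"] = 2
      weight["check_in"]   = 1
    }
    {
      # A review contributes its star rating, other signals a fixed weight.
      base = ($1 == "review") ? $3 : weight[$1]
      # Exponential decay: a signal loses half its value every half_life days.
      score += base * exp(-log(2) * $2 / half_life)
    }
    END { printf "%.2f\n", score }
  '
}

printf 'favorite 0\n'   | spot_score   # fresh favorite: 3.00
printf 'favorite 180\n' | spot_score   # same signal, two half-lives old: 0.75
```

With a curve like this, a classic keeps its place only as long as fresh signals keep arriving, while a newcomer can climb quickly on a handful of recent ones.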

Here at Tupalo.com, we are very excited to see our algorithm in place (e.g. on our category pages) and watch trending spots unfold.

Be sure to check it out and discover some new hot spots for your city!

Code, Development, Tools

Generating XML diffs with awk and bash

Did you ever wonder where Tupalo.com gets all these spots from? Apart from the ones our awesome users add, we get them from our various partners, often in the form of XML. This generally works fine, but XML’s structured nature also means that you can’t just treat it like any old text file.

That’s something we recently had to work around when we wanted to generate a daily XML diff that only contains elements which changed since the previous day’s feed. Of course there are several open source tools for exactly this purpose (e.g. diffxml or xmldiff), but since we couldn’t get them to do what we wanted in a reasonable amount of time, we decided to roll our own.

The final solution is a 71-line bash script, which downloads a zip, extracts it, generates MD5 sums for every element and then creates a diff between this new file and the previous list of MD5 sums. Once we know which elements have changed, we merge them into a new feed, which then gets handed to our importer. The awesome xmlstarlet was a great help in this, as was battle-tested old awk.

Let’s look at an interesting snippet from the script:

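xmlstarlet sel -I -t -m "//item" -v "./guid" -o "|" -c "." -n - |
  sed -e '...' |
  awk '
    BEGIN {
      FS = "|"    # field 1 is the guid, field 2 the item XML
      RS = "\n"   # records are newline-separated (the default)
    }
    {
      id = $1
      # Hash the item XML with md5sum and keep only the hex digest.
      # The XML is embedded in a double-quoted shell string, so it must not
      # contain characters the shell would interpret (", $, `, \).
      command = "printf \"%s\" \"" $2 "\" | md5sum | cut -d\" \" -f1"
      command | getline md5
      close(command)   # close the pipe so awk does not run out of open files
      print id ":" md5
    }' >> "$MD5_DIR/vendor-md5-$TODAY"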

Here we use xmlstarlet to iterate over all the items in the feed (the XPath "//item"), print the value of the "guid" element (-v "./guid"), output a pipe character (-o "|") and then copy the current element followed by a newline (-c "." -n). This then gets piped through sed for some cleaning up (which I omitted here for brevity’s sake) before awk takes the part after each "|", generates an MD5 sum and finally produces a file that looks like this:

rKKTZ:4012fced7c4cd77da607d294fbb8b5b6
hC7Jr:39245a0f9a976e6d47c0e2d76abf6238
...
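Because the awk step builds a shell command string around each item, the payload is subject to shell quoting rules. As a hedged alternative, the same id-to-checksum mapping can be sketched in plain bash, where the payload is passed as data rather than as part of a command. The `hash_items` name and the sample input are made up for illustration; the sketch assumes the same "id|payload" line format and GNU `md5sum`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: read "id|payload" lines on stdin, print "id:md5".
hash_items() {
  # IFS='|' splits at the first pipe only; the rest of the line,
  # including any further pipes, ends up in $payload.
  while IFS='|' read -r id payload; do
    md5=$(printf '%s' "$payload" | md5sum | cut -d' ' -f1)
    printf '%s:%s\n' "$id" "$md5"
  done
}

printf 'rKKTZ|<item><guid>rKKTZ</guid></item>\n' | hash_items
```

This spawns the same number of processes per item as the awk version, so it is no faster, but quotes and dollar signs in the XML no longer need to be cleaned up first.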

Now that we are able to create a daily list of MD5 sums, it’s easy to generate the diff feed:

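if [ -e "$MD5_DIR/vendor-md5-last" ] ; then
  # Lines prefixed with ">" exist only in today's list: new or changed items.
  changed=$(diff "$MD5_DIR/vendor-md5-last" "$MD5_DIR/vendor-md5-$TODAY" |
            grep "^>" |
            cut -d":" -f 1 |
            cut -b 1-2 --complement)   # strip the leading "> " (GNU cut)

  for record in $changed ; do
    # Find the feed file that contains this guid, then copy the item out of it.
    f=$(grep -F -l "<guid>$record</guid>" $FILE_PATTERN)
    xmlstarlet sel -I -t -c "/rss/channel/item[guid='$record']" $f >> "vendor-import-$TODAY.xml"
  done
fi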

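Here we build a list of the ids of the changed elements, which we then iterate over. Lines that diff prefixes with ">" appear only in today’s file, so the list covers both new and changed elements. In the loop we once again use xmlstarlet to extract the item with the matching guid from the feed file that contains it.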
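The id extraction is easy to try out in isolation. Here is a minimal sketch with two tiny checksum lists; the `changed_ids` name and the sample data are invented for the example, and `cut -c3-` is swapped in as a portable stand-in for the GNU-only `--complement` option.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print the ids that are new or changed in today's list.
# $1 = previous "id:md5" list, $2 = today's "id:md5" list
changed_ids() {
  diff "$1" "$2" |
    grep '^>' |      # keep lines that exist only in today's list
    cut -d':' -f1 |  # drop the checksum
    cut -c3-         # drop the leading "> "
}

demo_dir=$(mktemp -d)
printf 'a:1\nb:2\n'      > "$demo_dir/last"
printf 'a:1\nb:3\nc:4\n' > "$demo_dir/today"
changed_ids "$demo_dir/last" "$demo_dir/today"   # prints b and c, one per line
rm -r "$demo_dir"
```

Note that ids which disappeared from the feed show up in diff with a "<" prefix and are ignored here, so deletions would need separate handling.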

This is a good example of how familiar old Unix tools can be combined to create a fairly concise solution for a non-trivial problem.

Code, Development

Screenshots of Tupalo.com with a Kindle (3G, Wifi Edition)

OK, the title pretty much says it all. Did you know that the new Kindle 3G has an awesome HTML5 WebKit browser?

tupalo startpage kindle

See the skewed bars on the start page? That’s pure CSS3 -webkit-transform beauty.

a cool spot page on kindle 3g

http://tupalo.com/en/burlington-north-carolina/stavros-grill

tupalo.com kindle

http://tupalo.com/en/san-francisco-california/ikes-place

For those interested in the technical details of the Kindle browser, the actual user agent of the Kindle 3G is: Mozilla/5.0 (Linux; U; en-US) AppleWebKit/528.5+ (KHTML, like Gecko, Safari/528.5+) Version/4.0 Kindle/3.0 (screen 600×800; rotate).

useragentstring kindle

http://www.useragentstring.com/