andymatuschak.org: Square Signals

This article was published on Saturday, February 09th, 2008 at 12:03 pm.

B.B. King ringtonesCrosby, Stills, Nash and Young ringtones

Writing Code and Winning the NetFlix Prize

I whine a lot about my time at Caltech, but I’m actually taking a really neat course now which might as well be called “Win the NetFlix Prize—the Class.” If you’re unfamiliar with the prize, it’s $1,000,000 for improving upon NetFlix’s movie recommendation algorithm by 10%.

Not that we’re going to win or anything, but the performance and optimization implications of trying to solve this problem have been absolutely fascinating. A few details on the datasets that need to be analyzed:

  • 17,700 movies
  • 480,000 users
  • 100,000,000 ratings
  • 2 GB of data (700 MB stored optimally)
  • Density of the movie-user ratings matrix: 1.18%!
  • If computing the similarity between two movies takes 0.01 seconds, it takes 18 days to finish computing all pairs!

So that’s a little intimidating. There are a couple golden rules on performance that I’ve always been careful to follow:

  1. Code first; optimize later.

    Don’t write your code thinking about every little performance hit. Make it work. If you focus on making it fast in the first iteration, you’ll never finish the first iteration, and your code will be really ugly.

  2. Don’t optimize without knowing what’s slow.

    A lot of the time I think I know what’s slow, but half the time, I’m wrong. Shark knows what’s slow. Instruments knows what’s slow. Make sure you ask before diving in.

Golden Rules Aren’t Always Right

The really great thing about working on the NetFlix Prize is that basically all my software engineering assumptions and practices have had to go totally out the window.

The first observation, given these numbers, is that Ruby is not going to work even the tiniest bit. Which is really sad since it’d make life a lot easier. I’m using Python instead, with Numpy for computation. Most of the heavy lifting, then, is in C.

The fascinating thing about working on these algorithms is that I’m writing code that would normally work just fine: most of it’s O(n) or O(log n). But now I’m finding for the first time that the constant coefficient is actually mattering because it’s being executed a hundred million times. Every line of code has to be carefully scrutinized for speed considerations, and every loop is the potential for absolute disaster.

Last night I found myself writing a very simple regularization algorithm. The naïve implementation in Ruby was about eight lines. But the implementation I have now is still too slow by an order of magnitude and took more than four hours to write.

Moral of the story? If you’re a consumer software developer, you should probably be following the standard software engineering “golden rules” for optimization. But maybe all of us should work on a project at least once that gives us some appreciation for what Google has to go through to deliver us search results in a couple hundred milliseconds.

The Buzz {1 trackbacks/pingbacks}

  1. Pingback: the eXternal mind » links for 2008-03-06 on March 5, 2008

The Conversation {11 comments}

  1. David Smith 10 February, 08 @ 2:53 pm

    Optimization is delicious :) It’s one of the most fun things about working on WebKit, for me.

  2. Andrew Knott 11 February, 08 @ 1:39 pm

    maybe the 10% optimization in recommendations isn’t a software problem at all…

    maybe it’d be more efficient to get 10% better recommendations via social engineering. After all, better recommendations are subjective, they are about perception…

  3. Avocet 13 February, 08 @ 8:37 am

    Sounds like an interesting class.

  4. Tim Wee 14 February, 08 @ 3:48 pm

    Writing it in Python should be the same brevity/consciseness as Ruby, give or take.
    Maybe you are not as familiar with Python idioms? could you maybe post what you are doing in Ruby and the equivalent in Python? Or is it because of performance implications and having to stick with less dynamic Python code?

  5. Andy Matuschak 14 February, 08 @ 8:59 pm

    Yeah, it’s basically because if we use any loops or list comprehensions or anything that isn’t a numpy array operation, it’s way way way too slow.

  6. Mason Smith 15 February, 08 @ 9:44 am

    What kind of runtimes are you getting for your algorithm? We’re doing a factorization algo, and it takes about 4 hours on average.

  7. Andy Matuschak 17 February, 08 @ 4:55 pm

    Ours is taking more like 21 hours. :/

  8. Matt Patenaude 18 February, 08 @ 9:42 pm

    Damn. And I thought increasing speed in TuneConnect by switching from JSON to Plist was big. :P

  9. Dan Grover 17 March, 08 @ 3:11 pm

    What’s the title of the course?

  10. Andy Matuschak 17 March, 08 @ 6:54 pm

    It’s actually listed as: “CS/CNS/EE 156b: Learning Systems”, which isn’t terribly descriptive. 156a is a normal theoretical class, teaching various methods and important principles. 156b is then a project course wherein the students apply their knowledge, though the project changes yearly. This year’s just happened to be the Netflix prize.

  11. sandy 27 April, 08 @ 12:05 pm

    Hey Andy-

    Wondering if you are you still working on this? Sounds like ALOT of work and way over my head to understand, but fascinating just the same….!

Leave a Comment

Currently you have JavaScript disabled. In order to post comments, please make sure JavaScript and Cookies are enabled, and reload the page.

You can follow any responses to this entry via its RSS comments feed. You can also leave a trackback if the inclination is there.

If you're looking for something specific then give the search form below a try:

RSS Wordpress Grady (theme) Return to the Top ↑