I whine a lot about my time at Caltech, but I’m actually taking a really neat course now which might as well be called “Win the NetFlix Prize—the Class.” If you’re unfamiliar with the prize, it’s $1,000,000 for improving upon NetFlix’s movie recommendation algorithm by 10%.
Not that we’re going to win or anything, but the performance and optimization implications of trying to solve this problem have been absolutely fascinating. A few details on the datasets that need to be analyzed:
- 17,700 movies
- 480,000 users
- 100,000,000 ratings
- 2 GB of data (700 MB stored optimally)
- Density of the movie-user ratings matrix: 1.18%!
- If computing the similarity between two movies takes 0.01 seconds, it takes 18 days to finish computing all pairs!
So that’s a little intimidating. There are a couple golden rules on performance that I’ve always been careful to follow:
-
Code first; optimize later.
Don’t write your code thinking about every little performance hit. Make it work. If you focus on making it fast in the first iteration, you’ll never finish the first iteration, and your code will be really ugly.
-
Don’t optimize without knowing what’s slow.
A lot of the time I think I know what’s slow, but half the time, I’m wrong. Shark knows what’s slow. Instruments knows what’s slow. Make sure you ask before diving in.
Golden Rules Aren’t Always Right
The really great thing about working on the NetFlix Prize is that basically all my software engineering assumptions and practices have had to go totally out the window.
The first observation, given these numbers, is that Ruby is not going to work even the tiniest bit. Which is really sad since it’d make life a lot easier. I’m using Python instead, with Numpy for computation. Most of the heavy lifting, then, is in C.
The fascinating thing about working on these algorithms is that I’m writing code that would normally work just fine: most of it’s O(n) or O(log n). But now I’m finding for the first time that the constant coefficient is actually mattering because it’s being executed a hundred million times. Every line of code has to be carefully scrutinized for speed considerations, and every loop is the potential for absolute disaster.
Last night I found myself writing a very simple regularization algorithm. The naïve implementation in Ruby was about eight lines. But the implementation I have now is still too slow by an order of magnitude and took more than four hours to write.
Moral of the story? If you’re a consumer software developer, you should probably be following the standard software engineering “golden rules” for optimization. But maybe all of us should work on a project at least once that gives us some appreciation for what Google has to go through to deliver us search results in a couple hundred milliseconds.








