Stat 399: Movie Recommender Systems


Recommender systems, also known as Collaborative Filtering (CF) systems, are used to predict user preferences in commercial (and non-commercial) environments. They function by recording the behavior of many users, finding "neighbors" whose recorded behavior correlates well with a given user's, and using the neighbors' behavior to predict that user's preferences. The catch is that these systems deal with large numbers of people (> 100,000) with many different behavioral aspects (> 1000), few values per user (~ 1% per user), many recorded values (>> 1,000,000), and have to operate in real-time (< 0.05 seconds).


Broadly speaking, commercial CF systems are put to two types of use.

Predictive use

The system is used to select a few items from a collection. For example, a company with a web site may sell ad space to a number of other companies. For a given web page only one ad can be shown, and the company wants to select the ad that is most likely to result in the user clicking on that ad.

In this case the use of the CF system is invisible to the user. A choice of ad must be made, even if there is very little data to go on, so some prediction must be made in the small amount of time available. Absolute accuracy of prediction and high confidence are secondary concerns.

Prescriptive use

The system offers explicit advice to the user. For example, the NetFlix web site offers explicit predictions of how much a user will like a movie.

In this case accuracy and confidence are critical, as they are key to building and maintaining the user's trust in the system. Performance is, of course, still important, but if a high-confidence prediction can't be made quickly it's better to fail to predict than to generate a low-confidence one.


Recording user behavior

This is highly dependent on the deployment environment. A web site that is trying to predict ad clicks would unobtrusively record mouse clicks: N percent of the time user A visited page P1 they went to page P2.
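A minimal sketch of that kind of unobtrusive recording, in Python. The data structure and function names here are illustrative assumptions, not any particular site's implementation:

```python
from collections import defaultdict

# Tally of observed transitions: counts[(user, from_page)][to_page]
counts = defaultdict(lambda: defaultdict(int))

def record_click(user, from_page, to_page):
    """Record that `user` went from one page to another."""
    counts[(user, from_page)][to_page] += 1

def transition_rate(user, from_page, to_page):
    """Fraction of visits to `from_page` where `user` next went to `to_page`."""
    visits = counts[(user, from_page)]
    total = sum(visits.values())
    return visits[to_page] / total if total else 0.0

record_click("A", "P1", "P2")
record_click("A", "P1", "P2")
record_click("A", "P1", "P3")
# transition_rate("A", "P1", "P2") -> 2/3
```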

A prescriptive system like NetFlix instead asks the user to state their opinion explicitly. Typically a set of items is presented to the user for rating. The items to be shown to the user are selected to get the most information about the user as quickly as possible, while trying to build the user's confidence in the system. Items are selected based on their ratings variance and likelihood of being rated if presented.
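A sketch of that selection criterion, assuming the two signals are simply multiplied together. The combination rule and the helper below are illustrative assumptions, not the scoring any particular system uses:

```python
def selection_score(ratings, times_shown, times_rated):
    """Score an item for inclusion in the list of items to rate.

    High variance means ratings of this item discriminate between
    users; a high response rate means the item is likely to be rated
    if presented.  Multiplying the two is an assumed combination.
    """
    n = len(ratings)
    if n < 2 or times_shown == 0:
        return 0.0
    mean = sum(ratings) / n
    variance = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    response_rate = times_rated / times_shown
    return variance * response_rate
```

A polarizing movie rated by half the people it is shown to scores far higher than a blandly liked one everyone rates the same.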

Neighbor selection

Neighbor selection tries to satisfy two goals:

  1. Neighbors should be strongly correlated with the user,
  2. Neighbors should be able to predict behavior for many items not rated by the user.

The Pearson Correlation Coefficient is typically used to measure correlation between people.

                   sum(XY) - sum(X) sum(Y) / n
r = -------------------------------------------------------------------
    sqrt( ( sum(X^2) - (sum(X))^2 / n ) ( sum(Y^2) - (sum(Y))^2 / n ) )

where X and Y are the two users' ratings on the n items they have both rated.
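In code, with the sums restricted to co-rated items. This is a sketch: the `pearson` helper and the dict-of-ratings representation are assumptions, not a particular system's implementation:

```python
from math import sqrt

def pearson(ratings_a, ratings_b):
    """Pearson correlation between two users over co-rated items only.

    `ratings_a` and `ratings_b` map item -> rating; the sums run over
    the items both users have rated.
    """
    common = ratings_a.keys() & ratings_b.keys()
    n = len(common)
    if n < 2:
        return 0.0          # too few co-rated items to correlate
    xs = [ratings_a[i] for i in common]
    ys = [ratings_b[i] for i in common]
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    denom = sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))
    return (sxy - sx * sy / n) / denom if denom else 0.0
```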

Herlocker and Breese mention comparisons with other correlation measures (Spearman, an "entropy based uncertainty measure", vector similarity measure, and others) and found that Pearson worked the best in this environment.

However, the PCC as a measure of correlation does not reflect the "confidence" of that correlation. The problem is that the set of ratings is sparse, and two users probably have few co-rated items. Users who have only one or two rated items in common should not be treated as strongly correlated, even if their ratings on those items are identical. Herlocker applied a significance weight to the correlation, based on the number of co-rated items: this improved behavior considerably.
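A sketch of that weighting, assuming the linear devaluation with a cutoff of 50 co-rated items reported by Herlocker (the cutoff value is an assumption drawn from that paper and should be checked):

```python
def significance_weighted(correlation, n_corated, cutoff=50):
    """Devalue correlations that rest on few co-rated items.

    Scales the correlation linearly until the pair shares `cutoff`
    co-rated items, and leaves it untouched after that.  The cutoff
    of 50 is assumed, following Herlocker's significance weighting.
    """
    return correlation * min(n_corated, cutoff) / cutoff
```

Two users with a perfect correlation on 5 shared items end up weighted far below a pair with a moderate correlation on 100 shared items.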

The second goal can be addressed as a side-effect of dealing with a performance issue with neighbor selection. Searching for neighbors over the entire set of users can be computationally overwhelming. Instead, the search is performed over a smaller set of neighbor candidates. People are added to the set of neighbor candidates based (partly) on how many items they have rated. Building and maintaining the set of neighbor candidates can happen off-line.

Prediction generation

There are two types of prediction requests:

  1. Given an item (or a set of items), predict ratings for the user,
  2. Generate a list of items with high predicted ratings and high confidence.

Prediction requests are handled by computing the weighted average of the neighbors' ratings for an item. Confidence is computed based on the number and weights of the neighbors. Note that, if no neighbor has rated an item, no prediction can be made.
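A sketch of that weighted average. The data layout is assumed; a real system would also fold the total neighbor weight into a confidence score:

```python
def predict(item, neighbors):
    """Weighted average of neighbors' ratings for `item`.

    `neighbors` is a list of (weight, ratings) pairs, where `weight`
    is the (significance-weighted) correlation with the user and
    `ratings` maps item -> rating.  Returns the predicted rating, or
    None when no neighbor has rated the item.  The accumulated weight
    `den` could also serve as a crude confidence signal.
    """
    num = den = 0.0
    for weight, ratings in neighbors:
        if item in ratings:
            num += weight * ratings[item]
            den += abs(weight)
    if den == 0.0:
        return None          # no neighbor rated this item
    return num / den
```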

An example: NetFlix

At NetFlix customers can rate movies on a scale of 1 to 5 stars. Two additional "out-of-band" ratings values are allowed: Haven't Seen It and Not Interested.

People are required to rate a minimum of 20 movies before we attempt to generate predictions.

Predictions are shown throughout the site: anywhere a movie appears a prediction is shown (if the confidence is high enough). (This isn't true today, but we're working on it.)

Site metrics

Some real data from the Netflix web site.


Measuring success

Herlocker points out that not all errors in prediction are equal: for a site like NetFlix, a prediction of 5 stars for a 1-star movie is much worse than a prediction of 1 star for a 5-star movie. The simple Mean Absolute Error computation is inadequate.
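One illustrative way to make the error asymmetric. The penalty factor and the shape of the function are inventions for illustration, not a published metric:

```python
def asymmetric_error(predicted, actual, over_penalty=2.0):
    """Error measure that punishes over-prediction more than under-prediction.

    Recommending a bad movie (predicted high, actual low) costs the
    user time and trust; missing a good movie is cheaper.  The
    `over_penalty` factor of 2 is an arbitrary illustration.
    """
    diff = predicted - actual
    return over_penalty * diff if diff > 0 else -diff
```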

Generating the list of items to rate

Ideally this list should be computed on the fly, using the user's previous ratings to help select the next item to rate. That appears to be computationally intractable. Instead a single fixed list is used for all users, independent of their rating history.

Selecting neighbor candidates

Would it be beneficial to use clustering / centrotype methods to identify neighbor candidates?

Are people who rate many items representative of all users?

Neighbor selection

How about using genre information to select genre-specific neighbors? (Genres could be externally imposed, e.g. Horror or Mystery, or determined by finding strongly correlated clusters of movies.)

The significance weighting method used in Herlocker to deal with correlations based on small number of co-rated items is very ad hoc. What would be better?

How many neighbors should be selected for a user? Too few neighbors gives too little "coverage" (the ability to predict for a large number of items); too many neighbors can introduce noise.

Prediction generation

How should confidence in a prediction be computed?

Other sources of data

Other rating schemes, such as Would Not Like, Can't Remember, perhaps a scale of 1-5 stars for I Think I Would Like It This Much.

People pre-select movies: they only see what they think they would like. So co-rated items, even if the scores are very different, imply some correlation.

Implementation issues

How long does it take the system to start up? How long does it take to load in enough data to start functioning?

How easy / difficult is it to alter configuration parameters of the system? How long does it take to recompute cached information (like neighbors)?

How much data is required before a new user can be reliably predicted?


Stan Lanning,
Last modified: Wed Apr 05 12:52:32 PDT 2000