The system is used to select a few items from a collection. For example, a company with a web site may sell ad space to a number of other companies. For a given web page only one ad can be shown, and the company wants to select the ad that is most likely to result in the user clicking on that ad.
In this case the use of the CF system is invisible to the user. A choice of ad must be made, even if there is very little data to go on, so some prediction must be made in the small amount of time available. Absolute accuracy of prediction and high confidence are secondary concerns.
The system offers explicit advice to the user. For example, the NetFlix web site offers explicit predictions of how much a user will like a movie.
In this case accuracy and confidence are critical, as they are key to building and maintaining user trust in the system. Performance is, of course, still important, but if a confident prediction can't be made in time, it's better to fail to predict than to generate a low-confidence prediction.
This is highly dependent on the deployment environment. A web site that is trying to predict ad clicks would unobtrusively record mouse clicks: N percent of the time that user A visited page P1, they went on to page P2.
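As a concrete illustration, here is a minimal sketch of that kind of implicit data capture, using a hypothetical per-user transition counter (all names are made up for illustration, not taken from any real system):

```python
from collections import defaultdict

class TransitionLog:
    """Record page-to-page transitions per user and report the
    fraction of visits to one page that led to another."""

    def __init__(self):
        self.counts = defaultdict(int)   # (user, p1, p2) -> transitions
        self.visits = defaultdict(int)   # (user, p1) -> visits

    def record(self, user, from_page, to_page):
        self.visits[(user, from_page)] += 1
        self.counts[(user, from_page, to_page)] += 1

    def transition_rate(self, user, from_page, to_page):
        # "N percent of the time user A visited P1 they went to P2."
        v = self.visits[(user, from_page)]
        return self.counts[(user, from_page, to_page)] / v if v else 0.0
```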
A prescriptive system like NetFlix instead asks the user to state their opinion explicitly. Typically a set of items is presented to the user for rating. The items shown are selected to get the most information about the user as quickly as possible, while trying to build the user's confidence in the system. Items are selected based on their rating variance and likelihood of being rated if presented.
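A minimal sketch of that selection heuristic, assuming items are scored by the product of their rating variance and a crude popularity-based estimate of the chance they'll be rated if presented; the scoring function and its weights are illustrative assumptions, not the production algorithm:

```python
def seed_item_score(ratings, num_users):
    """Score an item for inclusion in the initial rating list.

    ratings: list of star ratings (1-5) the item has received so far.
    num_users: total number of users, used to estimate how likely
    a randomly chosen user is to have an opinion about the item.
    """
    n = len(ratings)
    if n < 2:
        return 0.0
    mean = sum(ratings) / n
    variance = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    # Popularity as a crude proxy for "likelihood of being rated
    # if presented" (an assumption, not the actual estimate).
    p_rated = n / num_users
    return variance * p_rated

# Items with high variance (people disagree about them) and a high
# chance of being rated are the most informative to ask about first.
```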
Neighbor selection tries to satisfy two goals: selecting the people whose ratings correlate most strongly with the target user's, and selecting people who have rated enough items that predictions can be made for a wide range of items.
The Pearson Correlation Coefficient is typically used to measure correlation between people.
$$
w_{a,u} = \frac{\sum_{i \in I} (r_{a,i} - \bar{r}_a)(r_{u,i} - \bar{r}_u)}
               {\sqrt{\sum_{i \in I} (r_{a,i} - \bar{r}_a)^2}\,
                \sqrt{\sum_{i \in I} (r_{u,i} - \bar{r}_u)^2}}
$$

where $I$ is the set of items rated by both users $a$ and $u$, $r_{u,i}$ is user $u$'s rating of item $i$, and $\bar{r}_u$ is user $u$'s mean rating over $I$.
Herlocker and Breese mention comparisons with other correlation measures (Spearman rank correlation, an "entropy-based uncertainty measure", vector similarity, and others) and found that Pearson worked best in this environment.
However, the PCC as a measure of correlation does not reflect the "confidence" of that correlation. The problem is that the set of ratings is sparse, so two users probably have few co-rated items. Users who have only one or two co-rated items should not be treated as strongly correlated, even if their ratings on those items are identical. Herlocker applied a significance weight to the correlation, based on the number of co-rated items; this improved behavior considerably.
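A sketch of the weighted correlation, assuming Herlocker's devaluation of correlations built on fewer than 50 co-rated items (the linear n/50 form below follows the paper, but treat the details as illustrative):

```python
import math

def pearson(a_ratings, u_ratings):
    """Pearson correlation over the items both users have rated.

    a_ratings, u_ratings: dicts mapping item id -> star rating.
    Returns (correlation, number of co-rated items).
    """
    common = set(a_ratings) & set(u_ratings)
    n = len(common)
    if n < 2:
        return 0.0, n
    mean_a = sum(a_ratings[i] for i in common) / n
    mean_u = sum(u_ratings[i] for i in common) / n
    num = sum((a_ratings[i] - mean_a) * (u_ratings[i] - mean_u)
              for i in common)
    den_a = math.sqrt(sum((a_ratings[i] - mean_a) ** 2 for i in common))
    den_u = math.sqrt(sum((u_ratings[i] - mean_u) ** 2 for i in common))
    if den_a == 0 or den_u == 0:
        return 0.0, n
    return num / (den_a * den_u), n

def significance_weighted_pearson(a_ratings, u_ratings, cutoff=50):
    """Devalue correlations that rest on few co-rated items."""
    corr, n = pearson(a_ratings, u_ratings)
    return corr * min(n, cutoff) / cutoff
```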
The second goal can be addressed as a side-effect of dealing with a performance issue with neighbor selection. Searching for neighbors over the entire set of users can be computationally overwhelming. Instead, the search is performed over a smaller set of neighbor candidates. People are added to the set of neighbor candidates based (partly) on how many items they have rated. Building and maintaining the set of neighbor candidates can happen off-line.
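As a sketch of that off-line step, assuming candidates are ranked purely by how many items they have rated (the "(partly)" above implies other criteria that aren't specified here):

```python
def build_candidate_set(ratings_by_user, max_candidates=10_000):
    """Pick the users most useful as potential neighbors.

    ratings_by_user: dict mapping user id -> dict of item ratings.
    Prolific raters are kept because they can contribute to
    predictions for many items.
    """
    return sorted(ratings_by_user,
                  key=lambda u: len(ratings_by_user[u]),
                  reverse=True)[:max_candidates]
```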
There are two types of prediction requests: predict the rating a user would give a specific item, and find the N items with the highest predicted ratings for a user.
Prediction requests are handled by computing the weighted average of the ratings for an item for a user's neighbors. Confidence is computed based on the number and weights of the neighbors. Note that, if no neighbor has rated an item, no prediction can be made.
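A minimal sketch of that computation, assuming the common deviation-from-mean form of the weighted average and a made-up confidence heuristic (how confidence should actually be computed is an open question; see below):

```python
def predict(user_mean, neighbor_info, item):
    """Weighted-average prediction for one (user, item) pair.

    user_mean: the target user's mean rating.
    neighbor_info: list of (weight, neighbor_mean, neighbor_ratings)
    tuples, where weight is the significance-weighted correlation.
    Returns (prediction, confidence), or (None, 0.0) if no neighbor
    has rated the item.
    """
    num = den = 0.0
    contributors = 0
    for weight, n_mean, n_ratings in neighbor_info:
        if item not in n_ratings:
            continue
        num += weight * (n_ratings[item] - n_mean)
        den += abs(weight)
        contributors += 1
    if contributors == 0 or den == 0:
        return None, 0.0
    # Crude confidence: grows with the number and strength of the
    # contributing neighbors (an illustrative assumption).
    confidence = min(1.0, den / contributors) * min(1.0, contributors / 10)
    return user_mean + num / den, confidence
```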
At NetFlix customers can rate movies on a scale of 1 to 5 stars. Two additional "out-of-band" ratings values are allowed: Haven't Seen It and Not Interested.
People are required to rate a minimum of 20 movies before we attempt to generate predictions.
Predictions are shown throughout the site: anywhere a movie appears a prediction is shown (if the confidence is high enough). (This isn't true today, but we're working on it.)
Herlocker points out that not all errors in prediction are equal: for a site like NetFlix, a prediction of 5 stars for a 1-star movie is much worse than a prediction of 1 star for a 5-star movie. The simple Mean Absolute Error computation is inadequate.
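One illustrative way to weight the two kinds of error differently (a sketch only, not a metric proposed by Herlocker; the penalty factor is an arbitrary choice):

```python
def asymmetric_error(predicted, actual, over_penalty=2.0):
    """Error that punishes overprediction (recommending a movie the
    user will hate) more than underprediction (hiding a movie the
    user would have liked)."""
    diff = predicted - actual
    return over_penalty * diff if diff > 0 else -diff
```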
Ideally this list should be computed on the fly, using the user's previous ratings to help select the next item to rate. That appears to be computationally intractable. Instead a single fixed list is used for all users, independent of their rating history.
Would it be beneficial to use clustering / centrotype methods to identify neighbor candidates?
Are people who rate many items representative of all users?
How about using genre information to select genre-specific neighbors? (Genres could be externally imposed, e.g. Horror or Mystery, or determined by finding strongly correlated clusters of movies.)
The significance weighting method used in Herlocker to deal with correlations based on a small number of co-rated items is very ad hoc. What would be better?
How many neighbors should be selected for a user? Too few neighbors gives too little "coverage" (the ability to predict for a large number of items); too many neighbors can introduce noise.
How should confidence in a prediction be computed?
Other rating schemes, such as Would Not Like and Can't Remember, or perhaps a 1-5 star scale for I Think I Would Like It This Much.
People pre-select movies: they only see what they think they would like. So co-rated items, even if the scores are very different, imply some correlation.
How long does it take the system to start up? How long does it take to load in enough data to start functioning?
How easy / difficult is it to alter configuration parameters of the system? How long does it take to recompute cached information (like neighbors)?
How much data is required before a new user can be reliably predicted?