Why Rating Systems Sometimes Work

Goodfilms is a Melbourne-based startup that aims to do a better job of recommending movies to you. Their system uses your social network, e.g., Facebook, to show you what your friends are watching, along with two attributes of films, which you rate on a 10-point scale (1 to 5 stars in half-star increments). It doesn’t appear that they include a personalized recommendation system based on collaborative filtering or similar.

In today’s Goodfilms blog post, Why Ratings Systems Don’t Work, the authors point to an XKCD cartoon identifying one of the many problems with collecting ratings from users.

The Goodfilms team says the problem with averaged rating values is that they attempt to distil an entire product down to a scalar value; that is, a number along a scale from 1 to some maximum imaginable goodness. They also suggest that histograms aren’t useful, asking how seeing the distribution of ratings for a film might possibly help you judge whether you’d like it.

Goodfilms demonstrates the point using three futuristic films, Blade Runner, Starship Troopers, and Fifth Element. The Goodfilms data shows bimodal distributions for all three films; for each film the least-chosen rating is 2, 3, or 4 stars, with both 1 star and 5 stars drawing more votes.

Goodfilms goes on to say that their system gives you better guidance. Their film-quality visualization – rather than a star bar chart and histogram – is a two-axis scatter plot of the two attributes you rate for films on their site: quality and rewatchability, that is, how much you’d like to watch that film again.

An astute engineer or economist might note that Goodfilms assumes quality and rewatchability to be independent variables, but they clearly are not. The relationship between the two attributes is complex and may vary greatly between film watchers. Regardless of the details of how those two variables interact, they are not independent; few viewers would rate something low in quality and high in rewatchability.
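One way to check the independence claim against real data would be to compute the correlation between the two attributes. The sketch below is illustrative only: the rating pairs are made up for the example (they are not Goodfilms data), and the helper name pearson is my own.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / (sx * sy)

# Hypothetical (quality, rewatchability) ratings in stars, one pair per film.
quality = [4.5, 2.0, 3.5, 5.0, 1.5, 4.0]
rewatch = [4.0, 1.5, 3.0, 4.5, 2.0, 3.0]

r = pearson(quality, rewatch)
print(f"correlation between quality and rewatchability: {r:.2f}")
```

A value near zero would be consistent with independence; a value near 1 would mean the second axis of the scatter plot adds little beyond the first. Pearson correlation only captures linear dependence, so a low value would not by itself prove independence.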

But even if these attributes were independent of each other, films have many other attributes that might be more telling – length, realism, character development, skin exposure, originality, clarity of intent, provocation, explosion count, and an endless list of others. Even if you included 100 such variables (and had a magic visualization tool for such data), you might not capture the sentiment of a crowd of viewers about the film, let alone be able to decide whether you would like it based on that data. Now if you had some deep knowledge of how you, as an individual, compare, in aesthetics, values and mental process, to your Facebook friends and to a larger population of viewers – then you’d really know something, but that kind of analysis is still some distance out.

Goodfilms is correct in concluding that rating systems have their perils; but their solution, while perhaps a step in the right direction, is naive. The problem with rating systems is not that they capture too few attributes of the rated product, nor is it how they present results. The problem lies in soft things. Rating systems tend to deal more with attributes of products than with attributes of raters of those products. Recommendation systems don’t account for social influence well at all. And there’s the matter of actual preferences versus stated preference; we sometimes lie about what we like, even to ourselves.

Social influence, as I’ve noted in past posts, is profound, yet its sources can be difficult to isolate. In rating systems, knowledge of how peers or a broader population have rated what you’re about to rate strongly influences the outcome of ratings. Experiments by Salganik and others on this (discussed in this post) are truly mind-boggling, showing that weak information about group sentiment not only exaggerates preferences but greatly destabilizes the system.

The Goodfilms data shows bimodal distributions for all three films. The 1-star and 5-star vote counts are each higher than the smallest of the 2-, 3-, and 4-star counts. Interestingly, this is much less true for Imdb’s data. So what’s the difference? Goodfilms’ rating counts for these movies range from about 900 to 1800. Imdb has hundreds of thousands of votes for these films.
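The loose bimodality test used here (both extremes outvote the least popular middle rating) is easy to state in code. This is a minimal sketch with invented vote counts, not the actual Goodfilms or Imdb numbers; the function name is my own.

```python
def ends_exceed_interior_min(counts):
    """Return True if both the lowest and highest rating levels have more
    votes than the least-chosen interior level.

    counts: vote counts ordered from the lowest rating to the highest.
    """
    if len(counts) < 3:
        raise ValueError("need at least three rating levels")
    interior_min = min(counts[1:-1])
    return counts[0] > interior_min and counts[-1] > interior_min

# Hypothetical vote counts for 1 through 5 stars.
polarized = [300, 120, 150, 400, 500]   # both extremes beat the 2-star count
single_peak = [20, 60, 200, 400, 300]   # 1 star is the least-chosen rating

print(ends_exceed_interior_min(polarized))
print(ends_exceed_interior_min(single_peak))
```

This is a weak criterion. A stricter test would require two separated local peaks, but the weak version is what the comparison between the two sites relies on.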

As described in a previous post (Wisdom and Madness of the Yelp Crowd), many ratings sites for various products have bimodal distributions when rating count is low, but more normally distributed votes as the count increases. It may be that the first people who rate feel the need to exaggerate their preferences to be heard. Any sentiment above the middle gets cast as 5 stars; anything else gets 1 star. As more votes are cast, one of these extremes becomes dominant and attracts voters. Now just one vote in a crowd, those who rate later aren’t compelled to be extreme, yet are influenced by their knowledge of how others voted. This still results in exaggeration of group preferences (data is left or right skewed) through the psychological pressure to conform, but eliminates the bimodal distribution seen in the early phase of rating for a given product. There is also a tendency at Imdb for a film to be rated higher when it’s new than a year later. Bias originating in suggestion from experts surely plays a role in this too; advertising works.

In the Imdb data, we see a tiny bit of bimodality. The number of “1” ratings is only slightly higher than the number of “2” ratings (1-10 scale). Based on Imdb data, all three movies are better than average – “average” being not 5.5 (halfway between 1 and 10) but either 6.2, the mean Imdb rating, or 6.4, if you prefer the median.

Imdb publishes the breakdown of ratings based on gender and age (Blade Runner, Starship Troopers, Fifth Element). Starship Troopers has considerably more variation between ratings of those under 18 and those over 30 than do the other two films. Blade Runner is liked more by older audiences than younger ones. That those two facts aren’t surprising suggests that we should be able to do better than recommending products based only on what our friends like (unless you will like something because your friends like it) or based on simple collaborative filtering algorithms (you’ll like it because others who like what you like liked it).

Imdb rating count vs. rating for 3 movies

So far, attempts to predict preferences across categories – furniture you’ll like based on your music preferences – have been rather disastrous. But movie rating systems actually do work. Yes, there are a few gray sheep, who lack preference similarity with the rest of the users, but compared to many things, movies are very predictable – if you adjust for rating bias. Without knowledge that Imdb ratings are biased toward good and toward new, you might think a film with an average rating of 6 is better than average, but it isn’t, according to the Imdb community. They rate high.
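The arithmetic behind that claim is just a change of reference point. The sketch below uses the figures quoted in this post (scale midpoint 5.5, mean Imdb rating 6.2); the function and constant names are my own.

```python
SCALE_MIDPOINT = (1 + 10) / 2   # 5.5, halfway along the 1-10 scale
SITE_MEAN = 6.2                 # mean Imdb rating as quoted in this post

def offsets(rating):
    """Return (offset from scale midpoint, offset from site mean)."""
    return rating - SCALE_MIDPOINT, rating - SITE_MEAN

vs_midpoint, vs_site_mean = offsets(6.0)
print(f"vs. scale midpoint: {vs_midpoint:+.1f}")
print(f"vs. site mean:      {vs_site_mean:+.1f}")
```

A 6 sits half a point above the midpoint of the scale but slightly below what the community actually hands out on average, which is the sense in which it is not better than average.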

Algorithms can handle that minor obstacle, even when the bias toward high varies between raters. With minor tweaks of textbook filtering algorithms, I’ve gotten movie predictions to be accurate within about half a star of actual. I tested this by using the MovieLens database and removing one rating from each user’s data and then making predictions for the missing movie for each user, then averaging the difference between predicted and actual values. Movie preferences are very predictable. You’re likely to give a film the same rating whether you saw it yesterday or today. And you’re likely to continue liking things liked by those whose taste was similar to yours in the past.
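For readers who want to see the shape of such a test, here is a minimal sketch. It is not the code behind the half-star figure: it uses a plain textbook user-based filter with mean-centered ratings (one common way to absorb per-rater bias), a tiny invented data set instead of MovieLens, and it holds out every rating in turn rather than one per user. All names in it are hypothetical.

```python
from math import sqrt

# Hypothetical ratings (user -> {movie: stars}); not MovieLens data.
RATINGS = {
    "ann": {"A": 5.0, "B": 4.0, "C": 1.0, "D": 4.5},
    "bob": {"A": 4.0, "B": 3.5, "C": 1.5, "D": 4.0},
    "cat": {"A": 2.0, "B": 2.5, "C": 5.0, "D": 1.5},
    "dan": {"A": 4.5, "B": 4.0, "C": 2.0, "D": 5.0},
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def similarity(a, b):
    """Correlation-style similarity over co-rated movies, with each
    user's ratings centered on that user's own mean."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    ma, mb = mean(a.values()), mean(b.values())
    num = sum((a[m] - ma) * (b[m] - mb) for m in common)
    den = (sqrt(sum((a[m] - ma) ** 2 for m in common))
           * sqrt(sum((b[m] - mb) ** 2 for m in common)))
    return num / den if den else 0.0

def predict(ratings, user, movie):
    """User's own mean plus the similarity-weighted average of other
    users' deviations from their own means for this movie."""
    own = ratings[user]
    base = mean(own.values())
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or movie not in theirs:
            continue
        w = similarity(own, theirs)
        num += w * (theirs[movie] - mean(theirs.values()))
        den += abs(w)
    return base + num / den if den else base

def leave_one_out_mae(ratings):
    """Hide each rating in turn, predict it, and average the absolute errors.
    Users with fewer than two ratings are skipped (nothing left to learn from)."""
    errors = []
    for user, own in ratings.items():
        if len(own) < 2:
            continue
        for movie, actual in own.items():
            held = {u: dict(r) for u, r in ratings.items()}
            del held[user][movie]
            errors.append(abs(predict(held, user, movie) - actual))
    return mean(errors)

mae = leave_one_out_mae(RATINGS)
print(f"mean absolute error: {mae:.2f} stars")
```

Centering each user on their own mean is what lets a habitual high rater and a habitual low rater count as similar when their likes and dislikes line up. Predictions are not clipped to the rating scale here; a real system would probably do that.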

Restaurants are slightly less predictable, but still pretty good. Yesterday the restaurant was empty and you went for an early dinner. Today, you might get seated next to a loud retirement party and get a bad waiter. Same food, but your experience would color your subjective evaluation of food quality and come out in your rating.

Predicting who you should date or whether you’d like an autumn vacation in Paris is going to require a much different approach. Predicting that you’d like Paris based on movie tastes is ludicrous. There’s no reason to expect that to work other than Silicon Valley’s exuberant AI hype. That sort of prediction capability is probably within reach. But it will require a combination of smart filtering techniques (imputation-boosting, dimensionality reduction, hybrid clustering), taxonomy-driven computation, and a whole lot more context.

Context? – you ask. How does my GPS position affect my dating preferences? Well that one should be obvious. On the dating survey, you said you love ballet, but you were in a bowling alley four nights last week. You might want to sign up for the mixed league bowling. But what about dining preferences? To really see where this is going you need to expand your definition of context (I’m guessing Robert Scoble and Shel Israel have such an expanded view of context based on the draft TOC for their upcoming Age of Context).

My expanded view of context for food recommendations would include location and whatever physical sensor info I can get, along with “soft” data like your stated preferences, your dining history and other previous activities, food restrictions, and your interactions with your social network. I might conclude that you like pork ribs, based on the fact that you checked in 30 times this year at a joint that serves little else. But you never go there for lunch with Bob, who seems to be a vegetarian based on his lunch check-ins. Bob isn’t with you today (based on both of your geo data), you haven’t been to Roy’s Ribs in two weeks, and it’s only a mile away. Further, I see that you’re trying to limit carbohydrates, so I’ll suggest you have the salad instead of fries with those ribs. That is, unless I know what you’ve eaten this week and see that you’re well below your expected carb intake, in which case I might recommend the baked potato since you’re also minding your sodium levels. And tomorrow you might want to try the Hủ Tiếu Mì at the Vietnamese place down the road because people who share your preferences and restrictions tend to like Vietnamese pork stew. Jill’s been there twice lately. She’s single, and in the bowling league, and she rates Blade Runner a 10.


This entry was posted on August 22, 2012, 11:04 pm and is filed under Multidisciplinarians.


The Multidisciplinarian
