STAR QUALITY: AGGREGATING REVIEWS TO RANK PRODUCTS AND MERCHANTS
Mary McGlohon (CMU/Google), Natalie Glance (Google), Zach Reiter (Google)
[Screenshots: reviews for a product and reviews for a merchant, sorted by average rating]
The problem
Given reviews aggregated from different sources (Amazon, Epinions, etc.): How to measure the “true quality” of a product or merchant? Can we do better than “average number of stars”? How can we tell if we’re doing better?
The challenges
Different sources have different review scales: 0-5 stars, 1-5 stars, 0-100%, A/B/C…
Different sources have different rating distributions: “rant sites” and “rave sites”
Reviews may be plagiarized or irrelevant. This happens a lot. [Danescu-Niculescu-Mizil+ 2009]
Outline
Analyze ratings aggregated from many review sites
Propose models to determine “true quality”
Build evaluation framework
Compare results
Data
Product Reviews: 8M ratings (560K products, 3.8M authors, 230 sources)
Merchant Reviews: 1.5M ratings (17K merchants, 1.1M authors, 19 sources)
Netflix Prize: 100M ratings (17K movies, 480K authors)
Obs 1: People like passing out 5’s
Product reviews: single-review authors disproportionately more so.
[Histogram: count of ratings vs. number of stars]
Obs 2: Authors/sources have biases
Ratings for the same product will differ widely.
Authors are consistent across products. (Like everything or hate everything.)
But… even different authors on the same site rate objects similarly! Not just “rant sites”: ReviewCentre.com avg 2.9*, Pricegrabber.com avg 4.5*.
Obs 3: The rated object matters
[Rating distributions for products, merchants, and movies]
Merchant reviews are more “binary”; Netflix (movie) ratings are more “normal”.
Obs 4: How much an object is rated matters
Netflix:
Movies with O(10^5) ratings (red) were 63% positive (4-5*)
Movies with O(10^3) ratings (yellow) were 42% positive (4-5*)
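A rough sketch of how this breakdown can be computed: bucket movies by the order of magnitude of their rating count and take the share of 4-5 star ratings in each bucket (the data layout and the 4-5 star cutoff for “positive” are assumptions, not taken from the slides).

```python
import math
from collections import defaultdict

def positive_share_by_volume(ratings_by_movie):
    """Share of 4-5 star ratings, grouped by order of magnitude of a
    movie's rating count.  `ratings_by_movie` maps movie id -> list of
    star ratings (assumed layout)."""
    pos = defaultdict(int)
    tot = defaultdict(int)
    for stars in ratings_by_movie.values():
        bucket = int(math.log10(max(len(stars), 1)))  # O(10^bucket) ratings
        pos[bucket] += sum(1 for s in stars if s >= 4)
        tot[bucket] += len(stars)
    return {b: pos[b] / tot[b] for b in sorted(tot)}
```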
Outline
Analyze ratings aggregated from many review sites
Propose models to rank “true quality”
Build evaluation framework
Compare results
Proposed Models
1. Mean rating for object (baseline): “On average, users gave it 3.8 stars”, q = 3.8
2. Median rating for object: “The middle rating was 4 stars”, q = 4
3. Lower bound on normal confidence interval: “95% sure that it’s at least 3.5*”, q = 3.5
4. Lower bound on binomial confidence interval: “95% sure that at least 60% of users will like it”, q = 0.6
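A minimal sketch of models 1-4 in Python. The slides do not say which interval constructions are used, so the one-sided z value and the Wilson score interval below are assumptions.

```python
import math
from statistics import mean, median

def normal_lower_bound(ratings, z=1.645):
    """Model 3: lower end of a one-sided 95% normal confidence interval
    on the mean rating (z = 1.645 is an assumed one-sided 95% level)."""
    n = len(ratings)
    m = mean(ratings)
    if n < 2:
        return m
    sd = math.sqrt(sum((r - m) ** 2 for r in ratings) / (n - 1))
    return m - z * sd / math.sqrt(n)

def binomial_lower_bound(ratings, positive=4, z=1.645):
    """Model 4: lower bound on the fraction of "positive" ratings
    (assumed here to mean 4 stars or more), via the Wilson score interval."""
    n = len(ratings)
    if n == 0:
        return 0.0
    p = sum(1 for r in ratings if r >= positive) / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom

ratings = [5, 5, 4, 3, 5, 2, 4]
print(mean(ratings), median(ratings),
      normal_lower_bound(ratings), binomial_lower_bound(ratings))
```

Note that the lower-bound models deliberately penalize objects with few ratings, since a small n widens the interval.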
Proposed Models
5. Average percentile of order statistic: “Most websites liked it better than other products”
6. Filtering anonymous reviews, then average: “Anonymous people are spammy”
7. Filtering non-prolific authors, then average: “N00bs are dumb”
8. Reweighting authors by “reliability”: “Account for author bias”
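Models 6-8 can be sketched as below; the anonymity test, the prolific-author threshold, and this particular “reliability” weighting are illustrative assumptions, not the paper’s exact choices.

```python
from collections import defaultdict

def filtered_average(reviews, min_author_reviews=2):
    """Models 6/7 sketch: drop anonymous or non-prolific authors, then average.
    `reviews` is a list of dicts with "author" and "rating" keys (assumed layout)."""
    counts = defaultdict(int)
    for r in reviews:
        counts[r["author"]] += 1
    kept = [r["rating"] for r in reviews
            if r["author"] is not None                      # model 6: not anonymous
            and counts[r["author"]] >= min_author_reviews]  # model 7: prolific enough
    return sum(kept) / len(kept) if kept else None

def reweighted_average(reviews, author_error):
    """Model 8 sketch, one reading of "reweighting authors by reliability":
    weight each rating by 1 / (1 + the author's typical absolute deviation
    from per-object means), supplied in `author_error`."""
    num = den = 0.0
    for r in reviews:
        w = 1.0 / (1.0 + author_error.get(r["author"], 1.0))
        num += w * r["rating"]
        den += w
    return num / den if den else None
```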
Outline
Analyze ratings aggregated from many review sites
Propose models to determine “true quality”
Build evaluation framework
Compare results
Evaluation Method
There’s no “ground truth” for quality.
Goal: see how reliably our ranking by estimated “true quality” q̂i agrees with user preferences.
Idea: Hold out a pair of ratings from the same author. Does our ranking of the two objects (based on other people’s ratings) agree with theirs?
Toy Example
Table columns: Object oi, Author aj, Author Rating r(oi,aj). The rating values appear only in the slide graphic.
  Robo-raptor          Alice
  Nerf MK40            Alice
  Splosions™ Chem Set  Alice
  Halo 2               Alice
  Robo-raptor          Bob
  Nerf MK40            Charlie
  Halo 2               Charlie
  Robo-raptor          Danielle
  Halo 2               Danielle
  Splosions™ Chem Set  Danielle
Step 1: For every “prolific” author, pick a pair to hold out for the test set.
“Prolific” here means n > 2, i.e. Alice and Danielle; choose 2 random reviews from each.
(Same table as above.)
Step 2: The rest are training data.
Training set (Object oi, Author aj, Author Rating r(oi,aj)):
  Splosions™ Chem Set  Alice
  Halo 2               Alice
  Robo-raptor          Bob
  Nerf MK40            Charlie
  Halo 2               Charlie
  Splosions™ Chem Set  Danielle
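A minimal sketch of Steps 1-2, holding out one random pair of reviews per “prolific” author (the toy example’s n > 2 threshold; the (object, author, rating) tuple layout is an assumption).

```python
import random
from collections import defaultdict

def split_train_test(reviews, min_reviews=3):
    """Steps 1-2: for each author with at least `min_reviews` reviews, hold out
    one random pair of their reviews for the test set; the rest is training data.
    `reviews` is a list of (object, author, rating) tuples (assumed layout)."""
    by_author = defaultdict(list)
    for rev in reviews:
        by_author[rev[1]].append(rev)
    train, test = [], []
    for revs in by_author.values():
        held = set(random.sample(range(len(revs)), 2)) if len(revs) >= min_reviews else set()
        for i, rev in enumerate(revs):
            (test if i in held else train).append(rev)
    return train, test
```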
Step 3: In the training data, calculate q̂i and rank objects accordingly. Here, q̂i is the average rating in the training data.
Our ranking (oi, q̂i):
  Nerf       5.0
  Halo       4.5
  Splosions  3.0
  Robo       3.0
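Step 3 with the baseline mean-rating model, as a sketch (same assumed tuple layout as above):

```python
from collections import defaultdict

def rank_by_training_mean(train):
    """Step 3: estimate q̂i as the average rating of each object in the
    training data, then rank objects by q̂i, highest first."""
    sums, counts = defaultdict(float), defaultdict(int)
    for obj, _author, rating in train:
        sums[obj] += rating
        counts[obj] += 1
    q_hat = {obj: sums[obj] / counts[obj] for obj in sums}
    return sorted(q_hat.items(), key=lambda kv: kv[1], reverse=True)
```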
Step 4: Compare our ranking with the ranking in each held-out pair in the test data.
Test set (Object oi, Author aj):
  Robo-raptor  Alice
  Nerf MK40    Alice
  Robo-raptor  Danielle
  Halo 2       Danielle
Our ranking: Nerf 5.0, Halo 4.5, Splosions 3.0, Robo 3.0.
MISCLASSIFICATION: in the test set, Alice says Robo-raptor outranks Nerf MK40, but our ranking claims Nerf outranks Robo.
CORRECT CLASSIFICATION: in the test set, Danielle says Halo 2 outranks Robo-raptor, and our ranking also claims Halo outranks Robo.
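Step 4 as a sketch: for each author’s held-out pair, check whether the model’s ordering of the two objects matches the author’s own ordering. The slides do not say how ties are scored, so skipping them here is an assumption.

```python
from collections import defaultdict

def pairwise_accuracy(test, q_hat):
    """Step 4: fraction of held-out pairs on which the ranking by q̂ agrees
    with the author's own ranking of the two objects."""
    pairs = defaultdict(list)
    for obj, author, rating in test:
        pairs[author].append((obj, rating))
    correct = total = 0
    for revs in pairs.values():
        if len(revs) != 2:
            continue
        (o1, r1), (o2, r2) = revs
        q1, q2 = q_hat.get(o1), q_hat.get(o2)
        if q1 is None or q2 is None or r1 == r2 or q1 == q2:
            continue  # skip unseen objects and ties (assumed handling)
        total += 1
        correct += (r1 > r2) == (q1 > q2)
    return correct / total if total else None
```

Chained with the earlier sketches: train, test = split_train_test(reviews); q_hat = dict(rank_by_training_mean(train)); accuracy = pairwise_accuracy(test, q_hat).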
[Bar chart: pairwise accuracy (%) of each model on Products, Merchants, and Netflix]
Nothing significantly outperformed average rating.
Potential improvements
Leveraging other review features
Careful selection of sources for reliability and bias
Observing longitudinal user behavior (timestamps)
Cleaning data for plagiarism and spam
Leveraging other data sources (Better Business Bureau, etc.)
Conclusion
User behavior in reviews follows interesting patterns.
Proposed a diverse set of ranking systems.
Devised an evaluation methodology.
Outperforming the average may be more nuanced than we thought.
Experiment
“Prolific” author = 100 or more reviews
One pair from each prolific author
Step 1: Calculate q̂i for each object and rank (computed on the full toy-example table above).
  oi         q̂i
  Halo       5.0
  Nerf       4.5
  Halo       3.5
  Robo       3.33
  Splosions  3.0
Step 1: For every “prolific” author, pick pairs to hold out for the test set. (Same table as above.)