MaxDiff Questions

In a nutshell: MaxDiff reveals how customers prioritize a large set of options without overwhelming them by asking them to evaluate everything at once. The index score estimates how likely an option is to be chosen as “best” when compared against a random selection of competing options. Scores are normalized so that the average option equals 100. An option with an index score of 200 is expected to be chosen roughly twice as often as an option with a score of 100. We compute it using a state-of-the-art Hierarchical Bayes analysis.

Imagine you have 20 product feature ideas but only have the capacity to build three of them this quarter. Your first thought may be to ask respondents to rate each option on a 1 to 5 Likert scale and pick the top three by average rating. While conceptually simple, the result will most likely be a large number of uninformative ties. Respondents tend to put most reasonable options at the top of the scale, especially in Western culture, where it is polite to agree. To get something informative, we need to force respondents to make tough choices: we do not ask respondents how much they like each option; instead we ask them to rank options against each other. However, asking respondents to rank 20 options at once is likely to overwhelm them. This is where MaxDiff comes in.

How MaxDiff works

Instead of ranking all the options at once, your participants repeatedly select the best and worst option from random subsets of, e.g., four options. Each selection in itself is simple for the respondent. But in aggregate, we can reconstruct how much they like each option relative to the others. Let’s go through this in an example. Suppose you want to know how people rank eight dessert options: Apple Pie, Chocolate Cake, Ice Cream, Cheesecake, Brownies, Donuts, Cupcakes, and Cookies. Participants will then see a series of six random subsets and select the best and worst option from each subset. For one example participant, this may look as follows:

Step	Options shown	Selection
1	Chocolate Cake, Ice Cream, Apple Pie, Donuts	Best: Chocolate Cake, Worst: Donuts
2	Chocolate Cake, Cheesecake, Brownies, Cupcakes	Best: Chocolate Cake, Worst: Cupcakes
3	Chocolate Cake, Cookies, Apple Pie, Donuts	Best: Chocolate Cake, Worst: Donuts
4	Ice Cream, Cheesecake, Brownies, Cookies	Best: Ice Cream, Worst: Cookies
5	Ice Cream, Cheesecake, Apple Pie, Cupcakes	Best: Ice Cream, Worst: Cupcakes
6	Brownies, Cookies, Apple Pie, Donuts	Best: Brownies, Worst: Donuts

We may find a ranking consistent with the selections we observe, for example:

Chocolate Cake > Ice Cream > Cheesecake > Brownies > Cookies > Apple Pie > Cupcakes > Donuts

Notice, however, that we cannot tell for sure whether this particular respondent prefers cheesecake to brownies or the other way around. The selections indicate the underlying preferences but do not uniquely determine them. Therefore, we infer how respondents rank each option through a statistical model that even uses similarities between responses to better estimate what each respondent thinks about each option.
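The ambiguity can be checked directly. The following is a minimal sketch, not the statistical model described in this article: it encodes the six screens from the table and tests whether a candidate ranking agrees with every best and worst pick. The function name `is_consistent` and the data layout are illustrative assumptions.

```python
# Screens of the example respondent: (options shown, best pick, worst pick).
SCREENS = [
    (["Chocolate Cake", "Ice Cream", "Apple Pie", "Donuts"], "Chocolate Cake", "Donuts"),
    (["Chocolate Cake", "Cheesecake", "Brownies", "Cupcakes"], "Chocolate Cake", "Cupcakes"),
    (["Chocolate Cake", "Cookies", "Apple Pie", "Donuts"], "Chocolate Cake", "Donuts"),
    (["Ice Cream", "Cheesecake", "Brownies", "Cookies"], "Ice Cream", "Cookies"),
    (["Ice Cream", "Cheesecake", "Apple Pie", "Cupcakes"], "Ice Cream", "Cupcakes"),
    (["Brownies", "Cookies", "Apple Pie", "Donuts"], "Brownies", "Donuts"),
]


def is_consistent(ranking, screens):
    """Return True if every best/worst pick agrees with `ranking` (most liked first)."""
    position = {option: i for i, option in enumerate(ranking)}
    for shown, best, worst in screens:
        # The best pick must be the highest-ranked option on the screen,
        # the worst pick the lowest-ranked one.
        if min(shown, key=position.get) != best:
            return False
        if max(shown, key=position.get) != worst:
            return False
    return True


ranking_a = ["Chocolate Cake", "Ice Cream", "Cheesecake", "Brownies",
             "Cookies", "Apple Pie", "Cupcakes", "Donuts"]
# Same ranking with Cheesecake and Brownies swapped.
ranking_b = ["Chocolate Cake", "Ice Cream", "Brownies", "Cheesecake",
             "Cookies", "Apple Pie", "Cupcakes", "Donuts"]

print(is_consistent(ranking_a, SCREENS))                  # True
print(is_consistent(ranking_b, SCREENS))                  # True
print(is_consistent(list(reversed(ranking_a)), SCREENS))  # False
```

Both rankings pass the check, which is why the selections alone cannot settle the order of Cheesecake and Brownies.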

Inferring the underlying rankings

After observing all selections, we compute for each participant and option how likely they are to pick that option as best on a random screen of other options. In the example above, the respondent chose Chocolate Cake as the best option three times. Therefore, we expect Chocolate Cake to perform well against random competitors. How much participants like an option is quantified through the so-called index score. It estimates for each option how likely a random participant is to select that option against a random subset of other options. The score is calibrated so that the average option has an index score of 100. If option A has double the index score of option B, participants are roughly twice as likely to pick A as best on a random subset as they are to pick B. We estimate index scores with a state-of-the-art statistical model of how respondents make selections on a screen: Hierarchical Bayes. This accounts for various factors such as:

Correlations between options: Imagine we cannot tell from a respondent's own selections how much they like brownies, and there is no general consensus among respondents either. Our model may still find that respondents who like chocolate cake also tend to like brownies. If the respondent in question liked chocolate cake, the model will infer that they probably like brownies too.
The hierarchy between options: If I know Johnny likes Donuts more than Cookies, and Brownies more than Donuts, then Johnny probably also likes Brownies more than Cookies, even if Johnny was never asked to pick between the two.
Correlations between respondents: Imagine that for a particular respondent, we lack indicators on whether they like e.g. brownies or cookies better. If, however, the general consensus is that cookies are preferred to brownies, our model will infer that this respondent will probably follow the trend.
Accidental misclicks: Sometimes, respondents make mistakes. Our model is robust to those. If, for example, a respondent consistently likes Chocolate Cake best but in one screen likes it worst, our model will infer that this was probably a misclick.
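For intuition only, a much simpler baseline than Hierarchical Bayes is to count how often each option was picked as best or worst relative to how often it was shown. The sketch below applies that counting rule to the example respondent; it is an assumption chosen for illustration, it handles none of the factors listed above, and it is not how the index score is computed.

```python
from collections import Counter

# Screens of the example respondent: (options shown, best pick, worst pick).
SCREENS = [
    (["Chocolate Cake", "Ice Cream", "Apple Pie", "Donuts"], "Chocolate Cake", "Donuts"),
    (["Chocolate Cake", "Cheesecake", "Brownies", "Cupcakes"], "Chocolate Cake", "Cupcakes"),
    (["Chocolate Cake", "Cookies", "Apple Pie", "Donuts"], "Chocolate Cake", "Donuts"),
    (["Ice Cream", "Cheesecake", "Brownies", "Cookies"], "Ice Cream", "Cookies"),
    (["Ice Cream", "Cheesecake", "Apple Pie", "Cupcakes"], "Ice Cream", "Cupcakes"),
    (["Brownies", "Cookies", "Apple Pie", "Donuts"], "Brownies", "Donuts"),
]


def best_worst_scores(screens):
    """Naive score per option: (times best - times worst) / times shown."""
    shown_n, best_n, worst_n = Counter(), Counter(), Counter()
    for shown, best, worst in screens:
        shown_n.update(shown)
        best_n[best] += 1
        worst_n[worst] += 1
    return {o: (best_n[o] - worst_n[o]) / shown_n[o] for o in shown_n}


scores = best_worst_scores(SCREENS)
print(scores["Chocolate Cake"])  # 1.0  (best on all three screens it appeared on)
print(scores["Donuts"])          # -1.0 (worst on all three screens it appeared on)
```

Cheesecake and Apple Pie both score 0 here because they were never picked, which shows why a model that borrows information across respondents is useful.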

What should I use MaxDiff for?

MaxDiff shines whenever you need to prioritize among a long list of options. In the real world:

Marketers cannot launch 10 campaigns at once
Engineers cannot build the entire feature roadmap in one sprint
Designers cannot highlight everything on prime real estate

Some exemplary use cases include:

Prioritizing a feature roadmap into which new capabilities are most likely to 1) drive net new app downloads, or 2) encourage existing customers to re-up their subscription
Prioritizing which messaging themes should be featured first on a LinkedIn campaign
Prioritizing CMF design, guiding which color of smart speaker to launch first, second, and third, and the impact each additional color has on your customers’ likelihood to purchase your speaker vs. a competitor’s
Prioritizing which features belong in the free vs. pro vs. enterprise subscription tier: placing the right features in free to attract new users, while placing the most valuable, potentially more niche features behind a paywall to maximize monetization
Prioritizing which customer frustrations and pain points generate the most angst and risk of churn, guiding engineering to focus on addressing the biggest risks

MaxDiff should be avoided when

You need absolute scoring instead of relative rankings: Use matrix questions instead; MaxDiff only ranks options relative to each other.
The number of options is small: For five or fewer options, a ranking question is more effective.

How exactly do we compute these numbers?

We use an advanced statistical model, Hierarchical Bayes, to estimate a utility score $u_{i,j}$ for each respondent $i$ and option $j$. Higher utility scores indicate that respondents like an option better. Have a look at this blog post for a detailed explanation. There is one key idea for converting utility scores into interpretable numbers: given that a subset of options $S$ is available for selection on screen, the model sets the probability that respondent $i$ selects option $j \in S$ as the best option to:

$$p(i \text{ selects } j \text{ as best from } S)=\frac{e^{u_{i,j}}}{\sum_{j' \in S} e^{u_{i,j'}}}$$
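As a minimal sketch of this formula, the snippet below computes the selection probabilities for one screen. The utility values and the function name `choice_probabilities` are made up for illustration; real utilities would come from the fitted model.

```python
import math


def choice_probabilities(utilities, screen):
    """Probability that each option on `screen` is selected as best.

    `utilities` maps option name -> utility u_{i,j} of one respondent.
    """
    weights = {j: math.exp(utilities[j]) for j in screen}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


# Hypothetical utilities of a single respondent (illustrative values only).
utilities = {"Chocolate Cake": 1.5, "Ice Cream": 1.0, "Apple Pie": -0.5, "Donuts": -2.0}
screen = ["Chocolate Cake", "Ice Cream", "Apple Pie", "Donuts"]

probs = choice_probabilities(utilities, screen)
for option, p in probs.items():
    print(f"{option}: {p:.3f}")
```

The probabilities on a screen always sum to one, and only differences between utilities matter: adding the same constant to every utility leaves the result unchanged.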

We zero-center the utilities by respondent, i.e. for each respondent, the average item has a utility of zero. A standard metric is the probability of choice (POC). Given a design with $a$ options per screen, POC is the probability that respondent $i$ selects option $j$ as the best option against the remaining $a-1$ options, where the remaining options are assumed to have a representative score of $0$, i.e. the average score. Given $e^0=1$, plugging this into the formula above yields:

$$\text{POC}_{i,j}\equiv p(i \text{ selects } j \text{ as best vs. representative options}) =\frac{e^{u_{i,j}}}{e^{u_{i,j}}+a-1}$$
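The POC formula translates into a one-line function. This is a sketch under the assumptions stated in the text (zero-centered utilities, $a$ options per screen); the function name `poc` is illustrative.

```python
import math


def poc(utility, options_per_screen):
    """Probability of choice against (a - 1) representative options with utility 0."""
    e = math.exp(utility)
    return e / (e + options_per_screen - 1)


# An average option (utility 0) on a four-option screen is picked 1 time in 4.
print(poc(0.0, 4))  # 0.25
```

An option with utility 0 gets a POC of exactly $1/a$, which is a handy sanity check for any implementation.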

The POC of an option $j$ is defined as the average POC for that option over all respondents. The index score rescales the POC by respondent so that the average index score is 100, i.e.:

$$\text{index score}_{i,j} = \frac{\text{POC}_{i,j}}{\text{avg}_{j'}(\text{POC}_{i,j'})} \cdot 100$$

$$\text{index score}_j = \text{avg}_i\left(\text{index score}_{i,j}\right)$$

For head-to-head comparisons between two options A and B, we simply compute the selection probability restricted to the set $S=\{A,B\}$.
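Putting the pieces together, the sketch below computes index scores from a small utility matrix and a head-to-head probability for a two-option set. The utility values, the function names, and the list-of-lists layout are assumptions for illustration only.

```python
import math


def poc(utility, options_per_screen):
    """Probability of choice against (a - 1) representative options with utility 0."""
    e = math.exp(utility)
    return e / (e + options_per_screen - 1)


def index_scores(utilities, options_per_screen):
    """utilities[i][j] is the utility of respondent i for option j.

    Returns one index score per option, averaged over respondents.
    """
    per_respondent = []
    for row in utilities:
        pocs = [poc(u, options_per_screen) for u in row]
        mean_poc = sum(pocs) / len(pocs)
        # Rescale so that this respondent's average option scores 100.
        per_respondent.append([100 * p / mean_poc for p in pocs])
    n_options = len(utilities[0])
    return [sum(r[j] for r in per_respondent) / len(per_respondent)
            for j in range(n_options)]


def head_to_head(u_a, u_b):
    """Probability that A is selected as best from the set S = {A, B}."""
    return math.exp(u_a) / (math.exp(u_a) + math.exp(u_b))


# Hypothetical zero-centered utilities: two respondents, three options.
utilities = [
    [1.0, 0.0, -1.0],
    [0.5, 0.5, -1.0],
]
scores = index_scores(utilities, options_per_screen=3)
print([round(s, 1) for s in scores])
print(round(head_to_head(1.0, 0.0), 3))
```

Because each respondent's scores are rescaled to average 100 before averaging across respondents, the final index scores also average 100.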

Portfolio Analysis (TURF)

Imagine you do not just want to find the best option, but you have the budget to implement a limited subset of options, e.g. you can implement up to three of the ten proposed features. Your first instinct might be to implement the top three features by index score. However, this may not be optimal for maximizing reach in your customer base. For example, imagine you implemented the best-ranked option. Continuing with the second-best-ranked option may no longer be optimal: the respondents who like the second-ranked option may already be satisfied because they like the first-ranked option that you just implemented. It is possible that, e.g., the fourth-ranked option could reach an entirely new set of customers, providing a higher marginal return. This is where Total Unduplicated Reach and Frequency (TURF) analysis comes in. The analysis broadly works in three steps:

Define a threshold for when an option is counted as reaching a customer. We support three options:
1. Top population %: Compute the $p$-th percentile of the utility matrix $u_{:,:}$: $u_{\text{cutoff}}(p) \equiv \text{percentile}(u_{:,:},p)$. A respondent $i$ counts as reached by an option $j$ iff $u_{i,j} \geq u_{\text{cutoff}}(p)$.
2. Choice probability: A respondent $i$ counts as reached by an option $j$ iff $\text{POC}_{i,j}$ is at least the selected probability threshold.
3. Top k: A respondent is reached by an option iff it is in their top $k$ options by utility score.
Select a maximum portfolio size. Pick how many options you can pursue in total.
Let us find the options that maximize reach.
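The three steps can be sketched as follows, using the "Top k" threshold and an exhaustive search over all portfolios up to the maximum size. The search strategy is an assumption made for illustration (the text does not specify how the optimization is carried out), and all names and utility values are hypothetical.

```python
from itertools import combinations


def top_k_reach(utilities, k):
    """For each respondent, the set of option indices in their top k by utility."""
    return [set(sorted(range(len(row)), key=lambda j: -row[j])[:k])
            for row in utilities]


def best_portfolio(reached, n_options, max_size):
    """Exhaustively search portfolios of up to `max_size` options.

    A respondent is reached if at least one portfolio option reaches them.
    Returns (portfolio, share of respondents reached).
    """
    best, best_count = (), -1
    for size in range(1, max_size + 1):
        for portfolio in combinations(range(n_options), size):
            chosen = set(portfolio)
            count = sum(1 for r in reached if r & chosen)
            if count > best_count:  # ties keep the smaller, earlier portfolio
                best, best_count = portfolio, count
    return best, best_count / len(reached)


# Hypothetical utilities: options 0 and 1 appeal to the same three respondents,
# option 2 appeals to the other two.
utilities = [
    [2.0, 1.5, -3.5],
    [2.0, 1.5, -3.5],
    [2.0, 1.5, -3.5],
    [-1.0, -0.5, 1.5],
    [-1.0, -0.5, 1.5],
]
reached = top_k_reach(utilities, k=1)
print(best_portfolio(reached, n_options=3, max_size=2))  # ((0, 2), 1.0)
```

In this toy data, options 0 and 1 have the highest average utilities, yet the portfolio with the largest reach pairs option 0 with option 2. Exhaustive search grows quickly with the number of options, so larger studies would need a different search strategy.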
