So you are doing A/B testing. That’s wonderful! However, you’ve probably already discovered that even though A/B testing looks very straightforward, it’s actually far from that.
One of the main issues with A/B testing is figuring out how many variations you should test in parallel to the control. The truth is that there are as many opinions about this as there are testers. The question touches on both mathematical statistics and psychology. Mathematically speaking, the more tests you run, the higher your chance of error. In statistical terms, this is called the “cumulative alpha error.”
Patrick Runkel explains the risk of performing a hypothesis test many times on the same set of data: “Each hypothesis test has a ‘built-in’ error rate, called alpha, which indicates the probability that the test will find a statistically significant result based on the sample data when, in reality, no such difference actually exists. Statisticians call this a Type I error.” In other words, the more elements you test and the more tests you perform on the same set of data, the more chances for error you have.
The chance for error increases dramatically with each new variation, as this formula shows:
Cumulative alpha = 1 - (1 - alpha)^k

Here, k represents the number of variations.
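To see how quickly this adds up, here is a minimal Python sketch, assuming the conventional per-comparison alpha of 0.05 (an assumption, not a figure from the formula above):

```python
# Cumulative (family-wise) alpha error for k comparisons:
# cumulative_alpha = 1 - (1 - alpha) ** k
alpha = 0.05  # conventional per-comparison significance level (assumed)

for k in (1, 2, 4, 8):
    cumulative_alpha = 1 - (1 - alpha) ** k
    print(f"{k} variation(s): {cumulative_alpha:.1%} chance of at least one false positive")
```

With eight variations, the chance of at least one false positive climbs to roughly a third, which is why a correction is needed.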
This fact might lead you to conclude that the best approach is to test only one variation. But that’s not always the case. Most A/B testing programs actually have corrective algorithms. The multi-version error, also called the multiple comparisons problem, is handled pretty well by the leading software, such as Optimizely and VWO. Even if the software you use doesn’t have a corrective algorithm, you can apply one yourself using statistical techniques such as an F-test or the Bonferroni correction.
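If you need to apply the correction by hand, a Bonferroni adjustment is straightforward: divide the significance threshold by the number of comparisons. A small sketch with hypothetical p-values (not from any real test):

```python
# Bonferroni correction: divide the overall alpha by the number of comparisons
# so the family-wise error rate stays at or below the intended level.
alpha = 0.05
p_values = [0.040, 0.012, 0.300, 0.008]  # hypothetical p-values, one per variation

corrected_alpha = alpha / len(p_values)  # 0.0125 for four comparisons

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < corrected_alpha else "not significant"
    print(f"Variation {i}: p = {p:.3f} -> {verdict} at corrected alpha {corrected_alpha:.4f}")
```

Notice that the 0.040 result, which would pass the usual 0.05 threshold, no longer counts as significant once the threshold is corrected for four comparisons.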
However, correcting the alpha error cuts the risk of seeing a difference that isn’t there, but it raises the risk of missing a difference that actually exists (a Type II error). So the practical advice is that you can allow some variation, but don’t go wild with it. Bayesian hierarchical modeling, according to some researchers, can reduce the chance for errors to a greater degree.
The Wikipedia definition of Bayesian hierarchical modeling is “a statistical model written in multiple levels (hierarchical form) that estimates the parameters of the posterior distribution using the Bayesian method. The sub-models combine to form the hierarchical model, and the Bayes’ theorem is used to integrate them with the observed data, and account for all the uncertainty that is present. The result of this integration is the posterior distribution, also known as the updated probability estimate, as additional evidence on the prior distribution is acquired.”
Without going into a complex technical explanation, I will only say that the Bayesian strategy estimates answers from full probability distributions over the data, rather than from single samples and averages. The mathematics is fairly complex, but the bottom line is that the Bayesian method can reduce the chance of error in multi-version A/B tests.
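As a rough illustration (a simplified Beta-Binomial sketch, not a full hierarchical model), here is how you might estimate the probability that each variation beats the control, using made-up visitor and conversion counts:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical counts: (visitors, conversions) for the control and two variations.
data = {"control": (1000, 50), "variation_A": (1000, 63), "variation_B": (1000, 55)}

# With a uniform Beta(1, 1) prior, the posterior for each conversion rate is
# Beta(1 + conversions, 1 + visitors - conversions); we draw samples from it.
samples = {
    name: rng.beta(1 + conv, 1 + visits - conv, size=100_000)
    for name, (visits, conv) in data.items()
}

for name in ("variation_A", "variation_B"):
    prob_beats_control = (samples[name] > samples["control"]).mean()
    print(f"P({name} beats control) = {prob_beats_control:.1%}")
```

A true hierarchical model goes a step further and lets the variations share information through a common prior, which is what helps keep the error rate down in multi-version tests.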
Another common mathematical issue is segmentation. Unfortunately, taking a test result and drilling into a specific segment that showed a clear preference for one option usually shrinks your sample below the size needed for statistical significance, so the alpha error becomes unacceptably likely. On a positive note, segmenting your data can be useful if you really have your statistics together and have either enough data or extremely clear results. Obviously, Google and Amazon do segmentation testing all the time. However, you probably don’t have the traffic or statistical resources that they have.
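A quick sample-size check makes the problem concrete. The sketch below uses the standard two-proportion formula with illustrative numbers (a 3% baseline conversion rate and a hoped-for 10% relative lift, both assumptions) to estimate how many visitors each variation needs; a segment that is only 20% of your traffic will take roughly five times longer to reach that number:

```python
from scipy.stats import norm

def visitors_per_variation(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate sample size per variation for a two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_beta = norm.ppf(power)            # critical value for the desired power
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p2 - p1) ** 2) + 1

n = visitors_per_variation(0.03, 0.10)  # hypothetical baseline and lift
print(f"Visitors needed per variation: {n}")
```

If the required number already strains your traffic, slicing the results into segments after the fact will almost certainly leave each slice underpowered.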
So now, putting mathematics aside, the truth is that leading testers vary in their opinions. Some prefer a very low number of testing elements and versions, and some prefer more.
Why should we have many testing variations?
Andrew Anderson, Director of Optimization at Recovery Brands, recommends having many testing options. He says: “The fewer options, the less valuable the test. Anything with less than four variants is a no go as far as I am concerned for our program, because of the limited chances of discovery, success, and most importantly scale of the outcome.”
Many testers will agree with him, since the space of possible versions is usually vast, and you can rarely tell by intuition which version will yield significant results. MarketingExperiments.com is famous for publishing case studies that show dramatic, counterintuitive lifts in conversion from seemingly insignificant changes between versions. These case studies support the idea of testing as many versions as possible.
Another benefit of having many versions is that making the effort to come up with many alternatives forces you to think outside the box. This is part of why The Onion asks its writers to provide 20 headlines per article, and why at Upworthy writers come up with at least 25 headlines for every story and test them. When you need to come up with alternatives beyond the predictable 5 or 6 versions, you have to dig deeply into your strategy, base assumptions, and fundamental ideas. The chance that you will land on a brilliant breakthrough on the seventh or eighth version rises considerably.
Edward de Bono, the researcher and one of the founding fathers of modern creative consulting, developed what are known as de Bono’s Lateral Thinking tools:
“Employees are often admonished to ‘think outside the box’ with no instructions for how to do so. Provocation & Movement designate a formal process that enables you to exit the box with ease—and return with a compelling list of innovative ideas to consider.”
One of the main techniques for lateral thinking is forcing yourself to come up with more options than you think are reasonably possible. Edward de Bono’s techniques have been used far and wide across industries and educational institutions for over 30 years. Your testing process can stand in for a pricey lateral-thinking workshop, effectively upgrading A/B testing from a mere measurement exercise into a method for brainstorming and reaching creative breakthroughs.
So why should we test with a small number of variations?
On the other hand, some leading testers argue that testing fewer variations consumes less time and budget. The results are more reliable from a mathematical standpoint, and more trustworthy, since you’re less likely to have sample pollution.
According to Angie Schottmuller, a growth marketing expert: “Tests run best in full-week increments since day-of-week and day-of-month present seasonality factors. I perform discovery to better understand the buying cycle duration and target audience daily context before setting test dates and times.” Basically, if your test takes too long (which is often the case with low-traffic tests that have many versions), some of the conditions that held at the beginning of the test may no longer apply by the end.
Another form of sample pollution comes from returning visitors. Let’s say you have only two versions. If a user returns to your site, he has a 50% chance of landing on the same page he saw before. With four versions, there is only a 25% chance he will see the same version. Returning visitors, in general, will corrupt your tests: for example, returning visitors to an e-commerce site stay an average of three minutes longer than new visitors. The longer your test runs, the higher the share of returning visitors compared with a short experiment.
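The arithmetic behind both concerns is simple. Here is a small sketch with made-up traffic numbers (the daily visitors and per-version sample size are assumptions, not figures from this article): a returning visitor without a stored assignment has a 1/k chance of seeing the same version, and the duration of the test grows roughly linearly with the number of versions:

```python
# Illustrative traffic assumptions (hypothetical, not from the article).
daily_visitors = 2_000         # visitors entering the test per day
needed_per_version = 10_000    # visitors each version must collect

for k in (2, 4, 8):
    # Without a stored assignment (e.g. a cookie), a returning visitor has a
    # 1/k chance of landing on the version they saw before.
    same_version_chance = 1 / k
    # Total sample grows with the number of versions, and so does the duration.
    days_to_finish = k * needed_per_version / daily_visitors
    print(f"{k} versions: {same_version_chance:.0%} chance of the same version, "
          f"~{days_to_finish:.0f} days to run")
```

Longer runs mean more returning visitors and more exposure to the seasonality factors Schottmuller mentions, which is exactly the pollution a short, small test avoids.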
So which is better?
The truth is that there isn’t really a straight answer to this. However, there are a few useful guidelines. If your site has low traffic, a small number of versions will usually serve you better. If your variations contrast sharply with one another, so you expect large differences in performance, then testing more of them is more worthwhile. Obviously, other considerations, like time constraints and the acceptable rate of alpha error, also play a role.
To conclude, choosing many versions or just a few depends on many factors, such as timelines, type of audience, campaign structure, and clarity of data. Choosing a small or large number of variations is a strategic decision with pluses and minuses on both sides. For this reason, many organizations choose to run both small-number and high-number variation tests in parallel.