Guide to Scientific A/B Testing

What is A/B testing?

A/B testing is a special case of experimental design, aimed at determining the better-performing version of a communication. In practical terms it means that all of the targets in an A/B test get exposure to some version of the marketing communication, while in a test vs control design only the test targets get exposed.

The analytical difference between a general test vs control design and an A/B test lies in what we are measuring: test vs control, in its classical form where the control group does not get the marketing communication, measures the absolute effectiveness of the communication itself, while an A/B test measures the relative effectiveness of option A vs option B of the communication. To be valid, A/B tests need to comply with the same design principles as a regular test vs control design.

When is it appropriate to use A/B tests, and not classic test vs control design?

The most common case when A/B tests should be used is when we cannot withhold communications from our target audience. For example, if we would like to test different marketing messages on the landing page of a website, we have to use an A/B test, since not having a landing page is simply not an option. A/B tests are very popular in digital/web analytics for precisely that reason – while it is easy to test different versions, it is not appropriate to withhold communications.

Another common case for A/B testing is when withholding treatment is unethical. For example, it is standard practice in cancer research for new drugs and protocols to be tested against the standard treatment protocol. If the new drug or protocol provides advantages, it will be recommended over the standard course of treatment.

Can I combine the advantages of A/B testing and test vs control methodologies? 

It is certainly possible, and I call it A/B + Control testing. The setup of this experiment is pretty simple. You need to split your group randomly three ways: the first group gets communication A, the second group gets communication B, and the third group gets no communication. This setup allows you to measure absolute effectiveness of marketing communications and compare effectiveness of A vs B all in one test. This is a very powerful approach, and if you have the sample size to run tests this way, I would certainly recommend doing so.

Sample A/B + Control setup:

[Image: ABC test setup]
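The three-way split described above can be sketched as a simple randomized assignment. This is a minimal illustration, not a production randomizer; the function name `assign_groups` is hypothetical, and the fixed seed is only for reproducibility:

```python
import random

def assign_groups(customer_ids, seed=42):
    """Randomly split customers three ways: treatment A, treatment B,
    and a no-communication Control group."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    ids = list(customer_ids)
    rng.shuffle(ids)                   # random order removes selection bias
    cells = ("A", "B", "Control")
    groups = {cell: [] for cell in cells}
    for i, cid in enumerate(ids):
        groups[cells[i % 3]].append(cid)   # round-robin over shuffled ids
    return groups
```

For 30,000 customers this yields three groups of 10,000 each, matching the minimum group size discussed below.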


What are the most common problems with A/B testing?

From my experience, there are a few common ways to invalidate the results of an A/B test.

  • Insufficient sample size.

I have been lucky to work for big consumer companies, so sample size is usually not an issue. My general guide: if you are looking for small response rates (around 2%) typical of marketing, and your “meaningful” cutoff for the difference is in the 0.3%-0.5% range, you need at least 10,000 targets in your smallest group; 20,000 is usually better. Please note that it is your smallest group size that drives the significance, so if you decided to go with 1,000 targets in your alternative treatment group, it really does not make much difference whether your main group is 10,000 or 1,000,000. Conclusion: to improve the statistical power of your results, beef up your smallest group.

You can refer to an independent two-sample test formula to calculate your required sample size, given your expected response rate and the difference/lift you need to detect.
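As a rough sketch, the standard two-proportion (normal-approximation) sample size formula can be coded directly. The helper name `required_sample_size` is illustrative, not a standard library function:

```python
from math import ceil, sqrt
from statistics import NormalDist

def required_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for detecting a difference between two
    response rates p1 and p2 (two-sided test, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)            # e.g. 0.84 for 80% power
    p_bar = (p1 + p2) / 2                           # pooled rate under H0
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)
```

For a 2% baseline response and a 0.4% lift (0.02 vs 0.024) at 5% significance and 80% power, this comes out around 21,000 per group, roughly in line with the "at least 10,000, 20,000 is better" guidance above.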

  • Testing communications for different customer segments without exposing customers of the same segment to all treatments. 

This happens when marketers try to tailor communications to customer segments, and in the process forget that the point of testing is to prove that the tailored communication works better than a generic one for the same segment.

For example, if you are customizing your communication for customers with no kids and for customers with kids, then to prove that the kid-friendly version B of your communication works better than the generic/adult-only version A, you need to expose your segment of customers with kids to both versions A and B, and then compare the outcomes. If you expose them only to their targeted version B, you will not know whether an increased response from this segment is the result of the segment containing better targets in general, or of your customization.

Here is an example of incorrect A/B + Control test design for two customer segments:

[Image: Incorrect A/B test design for segments]

This test design will be able to tell the absolute impact of communication A on Segment 1, but it will not be able to tell whether communication A or B works better for Segment 1, thus losing the comparative insight that A/B testing provides.

Below is an example of a correct setup for A/B + Control test for two segments:

[Image: Correct A/B test design for segments]

The design above will be able to show whether treatment A or B is better for Segment 1, and the same for Segment 2. It will also be able to measure the overall effectiveness of both treatments for each segment, thus providing complete information to the marketer on how to proceed.
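The correct design above amounts to randomizing within each segment rather than across the whole list, so every segment is represented in every cell. A minimal sketch, with a hypothetical helper name and illustrative `(customer_id, segment)` tuples:

```python
import random

def assign_within_segments(customers, seed=7):
    """Split every segment across all three cells (A, B, Control), so
    treatments can be compared within the same segment.
    `customers` is an iterable of (customer_id, segment) pairs."""
    rng = random.Random(seed)
    cells = ("A", "B", "Control")
    by_segment = {}
    for cid, segment in customers:
        by_segment.setdefault(segment, []).append(cid)
    assignment = {}
    for segment, ids in by_segment.items():
        rng.shuffle(ids)                    # randomize inside the segment
        for i, cid in enumerate(ids):
            assignment[cid] = (segment, cells[i % 3])
    return assignment
```

Because each segment is shuffled and split separately, every (segment, cell) combination is populated, which is exactly what the incorrect design fails to do.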

  • Not looking at the bottom line.

The end goal of marketing is to produce sales. While increasing clicks or calls is important, it is not going to help the company improve results if the clicks and calls don’t translate into purchases. I have seen many cases in both the analog and digital worlds when an increase in walk-ins, calls, and clicks never translated into sales, so a good analyst should always keep this in mind. It also helps us take our minds off treatment-specific behavior and look for ways to assess its overall impact: oftentimes intermediary KPIs are not available for the control group, while bottom-line sales are.

  • Not testing on a representative sample of customers.

This mistake is often made by digital marketers, who sometimes stop the test before it has run its course over a full “natural” customer cycle, such as a full week. While some tests are specifically designed to run during a particular time – for instance, testing holiday communications around the holidays – in the general case you want to run your experiment for at least a full week.

  • Sweating the small stuff while forgetting about the big picture.

When choosing your best in-market tactics, you should always test the big stuff first. For example, in direct marketing, mail/email frequency is more important to figure out than the color of a design element or the copy of the offer. Creating a testing plan that tests the most influential variables first will help you optimize your marketing efforts faster and with better results. Don’t be afraid to make bold moves, like stopping certain types of marketing for an extended period of time, or radically simplifying checkout. These are the things that you should expect to yield the most for your testing buck.

Another downside of the opposite strategy, i.e. trying to test everything from target segment to offer to copy to color, is that it leads to small sample sizes. As a result, most tests won’t show statistically significant differences – or worse, you may find some false positives.

  • Not accepting inconclusive results of A/B testing.

Often, the results of the test are inconclusive or the difference is small. This is when statisticians say that the null hypothesis can’t be rejected, and it is still a valid result of the test that marketers need to accept. Not every small change produces a big shift in behavior. In fact, the vast majority of A/B tests that I have analyzed in direct mail failed to deliver any difference.
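Whether a result is conclusive can be checked with a standard two-proportion z-test. A minimal sketch (the function name is illustrative):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference in response rates between
    group A (conv_a responders out of n_a) and group B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)      # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))     # two-sided p-value
```

For example, 210 vs 200 responders out of 10,000 each gives a p-value of roughly 0.6, nowhere near significance; the honest conclusion is that the test was inconclusive.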

  • Using A/B test to create an impression of scientific testing activity.

This is my pet peeve. I always suspect this situation when it is possible to use control groups along with an A/B test, but it is not done. From my experience, the total difference that direct mail and email produce on customer behavior is small, so honest marketers have to admit that their programs are not paying for themselves and redirect their budgets to more effective communications. However, many choose to cover up the overall ineffectiveness by creating a pseudo-scientific program of expensive A/B testing, which tests a lot of small changes to communications; changes that produce little difference. This last mistake incorporates many of the others – from sweating the small stuff and ignoring the big picture to an inability to accept that the small changes tested don’t improve results.

Summary of A/B testing recommendations:

  • A/B test measures relative, not absolute, effectiveness of communications
  • These tests are popular in digital marketing
  • A/B test is a type of experiment, and it should comply with the same principles as test vs control methodology
  • A/B test is justified when true control is impossible or unethical
  • If possible, always use A/B + Control testing
  • Incorporate sales/profits into your outcome measures
  • When testing best versions for a segment, always test all versions on the segment
  • Make sure you have at least 10,000 targets in your smallest group
  • If running “live” digital tests, allocate at least a week
  • Accept that versioning rarely produces big swings, so A/B tests are often inconclusive
  • Don’t sweat the small stuff
  • Don’t abuse A/B tests to cover up ineffective communications
