These are observations I have made during my years of working with controlled experiments. The list is not exhaustive, but it is a good place to start.
The goal of control group selection is to keep the test (treatment) and control (holdout) groups representative of each other, both in terms of sample composition and in terms of measurement. The only difference between the groups should be where your treatment is applied.
Do’s of Controlled Experiments
- Use random group assignment whenever possible. Random assignment ensures that subjects from every sub-segment are equally likely to become part of the control group. Quasi-random (every Nth subject) assignment works well in most cases and should not be discounted.
- If you are running multiple controlled experiments or campaigns, use the same percentage of the total audience as a control group. This has both operational and analytical advantages, and you will avoid multiple pitfalls such as accidentally using a wrong sample size or incorrectly summarizing the results.
- Do not pre-cleanse the control group for any other effects. If you treat your control group differently from your test group prior to the experiment, it may become non-representative.
- Run the experiment as close to business-as-usual conditions as possible.
- Check if being in the control group results in the subject being treated differently in an unrelated area. Do they become eligible or lose eligibility for a different program, discount, or customer service initiative? The idea is either to equalize the unrelated treatment or to expand your definition of treatment to cover this area.
- Check if the response definition is equally applicable to both groups. If your response is calls to a 1-800 number that is sent only to the test group, you won’t have a good measurement.
- Always include a bottom-line measure as one of the measurements of a controlled experiment. For example, if your test group gets a promotion coupon, your measure should account for both increased sales due to the promotion and lost revenue due to the discount. Total gross margin is a good metric in this case.
- If any data transformation that impacts the outcome measures is applied to the test group, the same transformation must be applied to the control group. For example, if you found subjects in the test group who were not eligible for the promotion and excluded them after the groups were selected, you must do the same thing to the control group or measure both groups pre-modification.
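To make a few of these do’s concrete, here is a minimal Python sketch, not a prescribed implementation. It assumes subjects are identified by IDs and that gross margin per subject is available as a dictionary. The function names `assign_groups` and `margin_per_subject`, the 10% control share, and the seed are all hypothetical choices made for illustration.

```python
import random


def assign_groups(subject_ids, control_share=0.1, seed=42):
    """Randomly assign every subject to 'test' or 'control'.

    Each subject has the same chance (control_share) of landing in control,
    regardless of segment. The seed only makes the split reproducible.
    """
    rng = random.Random(seed)
    return {
        sid: ("control" if rng.random() < control_share else "test")
        for sid in subject_ids
    }


def margin_per_subject(assignment, margin, eligible):
    """Average gross margin per subject in each group.

    The same eligibility filter is applied to BOTH groups, so any
    post-selection exclusion cannot make the groups non-representative.
    Subjects with no recorded margin count as 0.0 (an assumption).
    """
    totals = {"test": 0.0, "control": 0.0}
    counts = {"test": 0, "control": 0}
    for sid, group in assignment.items():
        if not eligible(sid):
            continue  # excluded from test and control alike
        totals[group] += margin.get(sid, 0.0)
        counts[group] += 1
    return {
        g: (totals[g] / counts[g] if counts[g] else float("nan"))
        for g in totals
    }
```

A typical use would be to call `assign_groups` once before the campaign, store the result, and later pass it to `margin_per_subject` together with a single eligibility rule. Gross margin is used here because it nets the discount against the extra sales.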
Don’ts of Controlled Experiments
Overview of the most common ways to screw up your controlled experiment design.
- Avoid self-selection. Self-selection breaks representation, sometimes in ways that are not easy to observe or to adjust for, either before or after the analysis. It’s insidious, and the doubt about the effects being the result of self-selection and not the treatment rarely goes away.
- Assigning customers eligible for the treatment into the test group, and those not eligible for the treatment into the control group. Why it is going to screw you up every time: being eligible for the treatment makes your test group different from your control group, and thus not representative. Unless you have no operational ability or legal power to assign like customers to both groups, never use this method.
- Freaking out about the test and sending more marketing specifically to the holdout groups. I have seen marketing people react this way once they hear about a long-term marketing suppression test. “We can’t just leave these people alone!” Not only can we, we should. That’s the whole point! In fact, it is in the marketers’ own interest to leave these people alone, so that their marketing program shows a positive difference over the no-marketing group.
- Biased measurement of the outcome of the test. In other words, measuring the groups in different ways. The most common mistake is to find a flaw in the execution of the treatment and try to “adjust” the outcome of the test group for it. For example, if the treatment has not reached every customer, some try to only count “reachables”. The problem? The control group is still comprised of both “reachables” and “unreachables”! If your groups are representative to start with, measure them in exactly the same way.
- Creating a bunch of experiments where the control group is a different percentage of the audience in each test, and then lumping them all together for the analysis. Here is the post with a detailed analysis of this flaw.
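The last pitfall is easy to show with arithmetic. The sketch below uses two made-up campaigns (all counts are invented for illustration, not real results) in which the treatment has no effect at all, but the control share and the baseline response rate both differ between campaigns. The helper names `pooled_lift` and `per_campaign_lift` are hypothetical.

```python
# Invented numbers: within each campaign, test and control respond at the
# same rate, so the true lift is zero.
campaigns = [
    # Campaign A: 10% control, 10% response rate in both groups
    {"name": "A", "test_n": 9000, "test_resp": 900, "ctrl_n": 1000, "ctrl_resp": 100},
    # Campaign B: 50% control, 2% response rate in both groups
    {"name": "B", "test_n": 5000, "test_resp": 100, "ctrl_n": 5000, "ctrl_resp": 100},
]


def pooled_lift(campaigns):
    """Naive approach: lump all test and all control subjects together."""
    test_rate = sum(c["test_resp"] for c in campaigns) / sum(c["test_n"] for c in campaigns)
    ctrl_rate = sum(c["ctrl_resp"] for c in campaigns) / sum(c["ctrl_n"] for c in campaigns)
    return test_rate - ctrl_rate


def per_campaign_lift(campaigns):
    """Lift computed inside each campaign, then averaged by test-group size."""
    total_test = sum(c["test_n"] for c in campaigns)
    weighted = sum(
        (c["test_resp"] / c["test_n"] - c["ctrl_resp"] / c["ctrl_n"]) * c["test_n"]
        for c in campaigns
    )
    return weighted / total_test


print(f"pooled lift:       {pooled_lift(campaigns):.4f}")
print(f"per-campaign lift: {per_campaign_lift(campaigns):.4f}")
```

The pooled comparison reports a lift of roughly 3.8 percentage points even though neither campaign has any. The pooled control group is dominated by the low-response campaign B, while the pooled test group is dominated by campaign A. Using the same control percentage everywhere removes this imbalance.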
The controlled experiment design is a very reliable analytical method, and if implemented correctly from the very beginning, it is extremely resilient to all sorts of disturbances and interventions. In my practice, I have seen every kind of attack on the design, and when the experiment was repeated, the results always held up.
Bonus: Common Myths About Controlled Experiments
These things are often perceived as threats, but they are not. You should stop trying to “fix” them.
- Not believing in the power of random assignment. Some people don’t believe that random assignment of subjects into groups can make them representative. Even when random unbiased assignment is entirely possible, analysts try to come up with complicated algorithms for assignments, such as stratified samples. For tests of decent sample sizes (1,000+ subjects in each group), it is completely unnecessary. Don’t believe me? Conduct a simple experiment: assign subjects randomly into two groups, and then check if they are representative in both profile and future behavior.
- Thinking that all other marketing communications have to be halted for the duration of the test to obtain a valid measurement. SO NOT TRUE! This is the most common objection to the results of the test, and it is wrong precisely because the controlled experiment is designed to handle just that. The point of the control group is to control for everything else that is going on in the market. Everything. A hurricane, a new media campaign, another mail piece, paid search ramp up. Everything. As long as this “everything” applies to both test and control groups equally, your measurement is valid. Sometimes I see people getting one result from the test in one business environment, and a different result in another. They claim that in one of the cases the measurement is not valid because of the different environment. The correct explanation is that the measurement is valid in both cases, but the program delivers different results depending on the environment.
- Believing that you need special conditions to conduct the controlled test, like “resting” the groups before the test or having a universal control group. In most cases, you don’t. Since you are trying to measure the effectiveness of the treatment as it is implemented in the market under business-as-usual (BAU) conditions, your control should be subject to BAU conditions.
- Thinking that you can’t perform two (or more) controlled tests on roughly the same population at the same time. This question often arises when control groups are implemented as business as usual on all campaigns on an ongoing basis. Often, you have either calendar shifts or other changes that land two campaigns in the same or similar (i.e. overlapping) population at the same time. Now, all measurement is suddenly declared invalid.
The truth is, as long as you have random and representative group assignments, your results are going to hold.
- Thinking that customers who may be accidentally assigned into both groups due to accessing the test through multiple devices will invalidate your test. This one is a little trickier, but the bottom line is that customers in both groups will be impacted by this design flaw, and the setup will produce a valid measurement for the original assignment, and possibly beyond. To make your test a bit more robust, my recommendation is to have a 50/50 split, so the proportion of “dual group” customers is the same for both test and control.
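The “don’t believe me?” experiment suggested for random assignment can be sketched in a few lines of Python. This is only an illustration on a synthetic population. The fields `tenure_months` and `annual_spend`, their distributions, the population size, and the seed are all assumptions. The standardized mean difference is used here as one common balance check; it is not a method this post prescribes.

```python
import random
import statistics


def standardized_mean_diff(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = ((statistics.pvariance(a) + statistics.pvariance(b)) / 2) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_sd


rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Synthetic population: 20,000 subjects with two made-up profile fields.
population = [
    {
        "tenure_months": rng.uniform(0, 120),
        "annual_spend": rng.lognormvariate(6, 0.5),
    }
    for _ in range(20000)
]

# Plain random 50/50 assignment, no stratification.
groups = {"test": [], "control": []}
for subject in population:
    groups["test" if rng.random() < 0.5 else "control"].append(subject)

# Compare the two groups on each profile field.
balance = {
    field: standardized_mean_diff(
        [s[field] for s in groups["test"]],
        [s[field] for s in groups["control"]],
    )
    for field in ("tenure_months", "annual_spend")
}

for field, smd in balance.items():
    print(f"{field}: standardized mean difference = {smd:+.4f}")
```

With about 10,000 subjects per group, the differences should come out close to zero. On real data, the same check can be run on profile fields known at assignment time and repeated later on outcome fields from a period with no treatment.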